Site Reliability Engineer – Center of Excellence
Job Details
- A Site Reliability Engineer (SRE) in the Center of Excellence (CoE) is critical to driving best practices, standardization, and innovation in reliability, scalability, and operational efficiency across Maya and Maya Bank.
- As an SRE-CoE, you ensure that reliability is not just a team-specific goal but an organization-wide priority, driving long-term success and scalability.
Skills:
- Intermediate Kubernetes and Terraform skills.
- Proficient in monitoring/observability tools like Dynatrace, Prometheus, and Grafana.
- Network security competency (NACLs, Security Groups, WAF).
- Strong scripting skills (Shell, Python, Go).
- Working knowledge of GitOps and CI/CD (GitLab) and of service mesh (Istio).
- Solid foundational understanding of cloud platforms such as AWS (preferred), Azure, and GCP.
- Foundational understanding of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Expectations:
- Service Reliability: Oversee systems and implement enhancements to improve reliability across different environments.
- Operational Autonomy: Independently manage CI/CD pipelines and infrastructure deployments.
- System Enhancement: Implement monitoring improvements and develop automation to reduce manual tasks.
- Incident Resolution: Participate in root cause analysis (RCA) and execute corrective actions during incidents.
- Cloud Optimization: Apply cloud best practices for security, scalability, and cost optimization in the public cloud.
Qualifications:
- Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience).
- At least 2-3 years of experience in a DevOps or SRE role.
Preferred Skills:
- AWS Associate Certification (or progress toward it).
- Familiarity with Internal Developer Platforms (IDPs) and self-service automation.