As a Developer focused on Site Reliability Engineering (SRE), you will be responsible for the availability, performance, and scalability of critical systems and services. You will work closely with development and operations teams to improve system reliability through automation, observability, and robust infrastructure practices.
Core Responsibilities:
System Reliability & Uptime
Design and implement strategies for high availability and system performance.
Define and monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
Incident Management & Troubleshooting
Respond to outages and lead incident resolution efforts.
Drive blameless post-mortems and implement preventive measures.
Develop runbooks and automate recovery processes.
Participate in the on-call rotation.
Infrastructure as Code (IaC)
Build and manage infrastructure using Terraform or similar tools.
Ensure infrastructure is reproducible, version-controlled, and auditable.
Monitoring & Observability
Implement and maintain monitoring tools (preferably Splunk).
Set up alerts and dashboards to monitor service health and performance.
Automation & Tooling
Automate deployments, scaling, failovers, and backups.
Develop internal tools to support CI/CD pipelines and team workflows.
Collaboration
Work closely with development and operations teams to design scalable, supportable systems.
Promote CI/CD best practices, testing strategies, and release automation.