Role Description:

As a Developer with a focus on Site Reliability Engineering (SRE), you will play a pivotal role in ensuring the availability, performance, and scalability of critical systems and services. You will work closely with developers and operations teams to improve system reliability through automation, observability, and robust infrastructure practices.

Core Responsibilities:

System Reliability & Uptime

Design and implement strategies for high availability and system performance.
Define and monitor SLOs (Service Level Objectives), SLIs (Service Level Indicators), and error budgets.

Incident Management & Troubleshooting

Respond to outages and lead incident resolution efforts.
Drive blameless post-mortems and implement preventive measures.
Develop runbooks and automate recovery processes.
Participate in the on-call rotation.

Infrastructure as Code (IaC)

Build and manage infrastructure using Terraform or similar tools.
Ensure infrastructure is reproducible, version-controlled, and auditable.

Monitoring & Observability

Implement and maintain monitoring tools (preferably Splunk).
Set up alerts and dashboards to monitor service health and performance.

Automation & Tooling

Automate deployments, scaling, failovers, and backups.
Develop internal tools to support CI/CD pipelines and team workflows.

Collaboration

Work closely with development and operations teams to design scalable, supportable systems.
Promote CI/CD best practices, testing strategies, and release automation.

Essential Skills:

SRE Concepts: Reliability, availability, performance optimization.
Infrastructure as Code: Terraform or similar.
Monitoring/Logging: Splunk or equivalent observability stacks.
Incident Response: On-call support, post-mortems, automation of recovery.

Desirable Skills:

Programming & Scripting

Languages: Python, Bash, or Ruby.
Ability to build tools, automate tasks, and debug production issues.

Cloud Platforms

Proficiency in GCP and/or Azure.
Experience with cloud-native services, networking, and security.

Systems & Platforms

Strong knowledge of Linux/Unix systems; Windows knowledge is preferred.
Expertise in system internals, performance tuning, and debugging.

Containers & Orchestration

Hands-on experience with Docker, Kubernetes, or equivalent platforms.

CI/CD & Automation

Familiarity with Jenkins, GitHub Actions, Argo CD, or similar.
Experience building and managing deployment pipelines.

Security & Compliance

Knowledge of access control, secrets management, and audit logging.

Soft Skills

Excellent communication and collaboration skills.
Enjoys mentoring junior team members.
Stays calm under pressure, especially during incidents.
Strong analytical and problem-solving mindset.
