Overview

We are looking for a seasoned Associate Director of Site Reliability Engineering (SRE) to lead our AWS-focused SRE initiatives. In this role, you will be responsible for overseeing the reliability, scalability, and performance of critical applications and infrastructure hosted on AWS. You will lead a team of experienced SREs, drive strategic operational improvements, and ensure the seamless functioning of our cloud ecosystem to meet business and customer needs.

Responsibilities

Leadership and Team Management:

Lead and mentor a team of SRE professionals, fostering a culture of innovation, collaboration, and accountability.
Develop and implement career development plans, provide coaching, and facilitate knowledge-sharing within the team.

Operational Excellence:

Drive the adoption of SRE principles, including SLAs, SLOs, and error budgets, to enhance system reliability and performance.
Oversee incident management processes, ensuring timely resolution and comprehensive root cause analysis.
Establish and monitor operational KPIs to measure and improve system availability and performance.

Automation and Tooling:

Champion the use of automation to reduce manual processes, improve efficiency, and enhance system reliability.
Implement and optimize Infrastructure as Code (IaC) using tools like Terraform, CloudFormation, or CDK.

AWS Infrastructure Management:

Design, build, and maintain scalable and secure AWS-based infrastructure to support current and future workloads.
Leverage AWS services such as EC2, RDS, Lambda, S3, CloudWatch, and others to enhance operational capabilities.

Collaboration and Stakeholder Engagement:

Partner with engineering, product, and DevOps teams to align SRE initiatives with business objectives.
Act as a key liaison between the SRE team and executive stakeholders, communicating updates on reliability and risks.

Risk and Security Management:

Ensure compliance with security standards and best practices within AWS environments.
Identify risks related to cloud infrastructure and implement strategies for mitigation.

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
15+ years of overall experience, including 10+ years in cloud-based infrastructure and operations and at least 4 years in a leadership role.
Deep expertise in AWS services, architecture, and tools, including hands-on experience with core AWS services (e.g., EC2, ECS, Lambda, S3, VPC, IAM).
Proficiency in automation scripting (e.g., Python, Bash) and Infrastructure as Code (e.g., Terraform, CloudFormation).
Strong knowledge of monitoring and observability tools like CloudWatch, Prometheus, Grafana, or Datadog.
Proven experience managing large-scale production environments, incident response, and operational scaling.
Hands-on experience with CI/CD pipelines and DevOps methodologies.

Preferred Qualifications

AWS certifications, such as AWS Certified Solutions Architect - Professional or AWS Certified DevOps Engineer - Professional.
Experience with Kubernetes (EKS) and containerization technologies like Docker.
Familiarity with FinOps principles for cost optimization in AWS environments.
Strong analytical skills and a data-driven approach to decision-making.
Exceptional communication, leadership, and stakeholder management abilities.
