We are seeking an experienced and dynamic Site Reliability Engineering Lead to drive the reliability, scalability, and performance of our production systems in the iGaming industry. In this role, you will lead a team of SREs, implement best practices, and bridge the gap between development and operations to ensure robust infrastructure and seamless deployment pipelines.

Key Responsibilities:

Team Leadership:

Lead and mentor a team of SREs, fostering a culture of ownership, collaboration, and continuous improvement.
Define clear goals, performance metrics, and development plans for the team.

System Reliability & Performance:

Design and implement strategies to improve system reliability, scalability, and performance.
Conduct root cause analysis of production incidents and develop preventive solutions.

Infrastructure Management:

Oversee the deployment, monitoring, and management of production environments.
Collaborate with development teams to design cloud-native infrastructure and architecture.

Automation & CI/CD:

Drive automation of operational processes, reducing manual intervention and response times.
Optimize CI/CD pipelines to ensure smooth and rapid deployments.

Incident Management:

Establish incident response protocols and lead efforts during major incidents.
Ensure robust monitoring and alerting systems are in place to proactively detect issues.

Collaboration & Communication:

Act as a liaison between engineering, operations, and other teams to align objectives.
Share insights and best practices with internal stakeholders to enhance overall system resilience.

Required Skills:

Technical Expertise:

Strong experience with cloud platforms (AWS, Azure, Google Cloud) and infrastructure-as-code tools (Terraform, Ansible, etc.).
Proficiency in programming/scripting languages (Python, Go, Shell, etc.).
Deep knowledge of Kubernetes, containerization, and distributed systems.

Leadership Skills:

Proven track record of leading SRE or DevOps teams and managing large-scale production environments.
Strong decision-making, prioritization, and problem-solving capabilities.

Monitoring & Metrics:

Expertise in implementing and using monitoring tools (Prometheus, Grafana, Datadog, etc.) and logging systems.
Familiarity with service-level objectives (SLOs), service-level agreements (SLAs), and error budgets.

Soft Skills:

Excellent communication and collaboration skills for working with cross-functional teams.
Ability to mentor and upskill team members, fostering a learning-oriented culture.

Experience:

At least 8 years of experience in SRE, DevOps, or related roles with a focus on reliability engineering.
