We are seeking an experienced and dynamic Site Reliability Engineering Lead to drive the reliability, scalability, and performance of our production systems in the iGaming industry. In this role, you will lead a team of SREs, implement best practices, and bridge the gap between development and operations to ensure robust infrastructure and seamless deployment pipelines.
Key Responsibilities:
Team Leadership:
Lead and mentor a team of SREs, fostering a culture of ownership, collaboration, and continuous improvement.
Define clear goals, performance metrics, and development plans for the team.
System Reliability & Performance:
Design and implement strategies to improve system reliability, scalability, and performance.
Conduct root cause analysis of production incidents and develop preventive solutions.
Infrastructure Management:
Oversee the deployment, monitoring, and management of production environments.
Collaborate with development teams to design cloud-native infrastructure and architecture.
Automation & CI/CD:
Drive automation of operational processes, reducing manual intervention and response times.
Optimize CI/CD pipelines to ensure smooth and rapid deployments.
Incident Management:
Establish incident response protocols and lead efforts during major incidents.
Ensure robust monitoring and alerting systems are in place to proactively detect issues.
Collaboration & Communication:
Act as a liaison between engineering, operations, and other teams to align objectives.
Share insights and best practices with internal stakeholders to enhance overall system resilience.
Skills Requirements:
Technical Expertise:
Strong experience with cloud platforms (AWS, Azure, Google Cloud) and infrastructure-as-code tools (Terraform, Ansible, etc.).
Proficiency in programming/scripting languages (Python, Go, Shell, etc.).
Deep knowledge of Kubernetes, containerization, and distributed systems.
Leadership Skills:
Proven track record of leading SRE or DevOps teams and managing large-scale production environments.
Strong decision-making, prioritization, and problem-solving capabilities.
Monitoring & Metrics:
Expertise in implementing and using monitoring tools (Prometheus, Grafana, Datadog, etc.) and logging systems.
Familiarity with service-level objectives (SLOs), service-level agreements (SLAs), and error budgets.
Soft Skills:
Excellent communication and collaboration skills for working with cross-functional teams.
Ability to mentor and upskill team members, fostering a learning-oriented culture.
Experience:
At least 8 years of experience in SRE, DevOps, or related roles with a focus on reliability engineering.