Our client is looking for a Site Reliability Engineer (SRE) to join their dynamic RGM team. The SRE is responsible for enhancing the reliability, performance, and scalability of products within RGM and the Customer Data Exchange Platform. This role emphasizes proactive incident analysis, automation, observability, and collaboration with cross-functional teams to ensure seamless customer experiences. As an SRE, you will work closely with the product, DevSecOps, and support teams to implement best practices in operational excellence, automation, and incident management within a complex, multi-cloud environment.

RESPONSIBILITIES:
• Proactive Incident Analysis & Operational Improvements:
o Analyze incident patterns and trends to gain insights into recurring issues, collaborating with product teams to drive their resolution.
o Proactively manage alerts, identify potential problems, and work with cross-functional teams to enhance reliability and performance.
o Collaborate with product teams to prioritize operational user stories focused on reliability and performance improvements.
o Document operational workflows and troubleshooting guides to support knowledge sharing and team efficiency.
• Complex Troubleshooting & Problem Management:
o Lead efforts to troubleshoot complex issues in collaboration with L3 and L4 support partners, ensuring swift resolution and minimal downtime.
o Participate in crisis management and response, including on-call rotations, to address critical incidents affecting the products.
• Automation for Efficiency:
o Identify automation opportunities across operational tasks to improve efficiency and reduce manual workload.
o Collaborate with the cybersecurity team to integrate automated security enhancements into the products’ operations and infrastructure.
• Observability and Monitoring:
o Use insights from observability tools to optimize incident resolution times, improve product performance, and drive continuous improvement.
• Cross-functional Collaboration:
o Work closely with architects, DevOps, and engineering teams to improve product stability and reduce incidents through proactive solutions.
o Engage with the Central SRE team, SIAM (Service & Integration Management) manager, and the SRE Community of Practice to share best practices and leverage synergies.

REQUIREMENTS:

• Bachelor’s or Master’s degree in Computer Science, Information Technology, Engineering, or a related field.
• 5+ years of experience in Site Reliability Engineering, DevOps, or IT operations with a focus on application reliability and observability.
• Hands-on experience with complex technology stacks, including web and mobile applications, cloud platforms (e.g., Azure, AWS, Google Cloud), and databases.
• Proficiency in automation and scripting (e.g., Python, PowerShell, or similar) for operational efficiency and incident response.
• Knowledge of CI/CD pipelines and DevOps practices to streamline deployments and automate recovery processes.
• Strong problem-solving skills and the ability to work under pressure during incidents.
• Excellent communication and collaboration skills, with the ability to coordinate across cross-functional teams.
• Fluent in English, with strong written and verbal communication skills.
