EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a wide range of innovative projects that deliver creative, cutting-edge solutions, and have the opportunity to learn and grow continuously. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are seeking a Senior Reliability Engineer to join our remote team. This role is crucial for ensuring our systems' ongoing stability and efficiency, focusing on minimizing downtime and maximizing performance. The ideal candidate will have a proven track record of improving system reliability and a strong technical acumen in managing complex infrastructures. Your expertise will help shape our operational strategies, ensuring our services are robust and resilient against disruptions.
Responsibilities
Lead initiatives to enhance system reliability, availability, and resilience
Design and implement robust monitoring solutions to proactively identify potential issues
Mentor junior engineers in reliability best practices and advanced troubleshooting techniques
Collaborate with cross-functional teams to ensure seamless deployments and operations
Develop automation scripts to streamline operational processes and reduce human error
Conduct detailed root cause analysis for critical incidents and drive continuous improvement
Establish and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure system performance
Advocate for and implement reliability-focused changes in the software development lifecycle
Requirements
Minimum of 3 years of experience in a Reliability Engineer role
Advanced scripting skills in Python and PowerShell
Strong knowledge of cloud platforms, specifically Azure and GCP
Proficiency with Azure DevOps pipelines for efficient CI/CD workflows
Expertise in debugging and troubleshooting complex systems
Experience with monitoring tools such as GCP Cloud Logging, Grafana, and Azure Monitor Logs
In-depth understanding of Site Reliability Engineering (SRE) principles
Fluent English communication skills at a B2 level or higher
Nice to have
Experience with Kubernetes and container orchestration platforms
Proven ability to lead projects focused on system scalability and disaster recovery planning
Familiarity with advanced data analytics and machine learning tools to predict system failures
We offer
International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn