We are looking for highly skilled professionals who can contribute to our team's success and help us maintain and improve the reliability of our systems.
Our customer is a multinational corporation with more than a century of history and offices in over 180 countries. Their most ambitious current goal is to introduce a range of Reduced-Risk Products (RRPs) to a target audience of more than 1 billion consumers around the globe.
Requirements:
- Must-Have Capabilities:
- Intermediate understanding of SRE principles and practices.
- Ability to handle more complex tasks and contribute to the improvement of processes.
- Intermediate troubleshooting and problem-solving skills.
- Intermediate knowledge of the following technologies:
- New Relic: Advanced monitoring and alerting setup, including custom dashboards.
- ELK: Advanced log management, analysis, and visualization.
- Opsgenie: Advanced alert management, including integration with other tools.
- Terraform/Terraform Enterprise: Advanced IaC tasks, including module creation and management.
- Bitbucket/GitHub: Advanced version control, including branching strategies and code reviews.
- Python: Advanced scripting and automation, including API integrations.
- JavaScript: Advanced scripting for automation tasks and tool integrations.
- Jenkins: Advanced CI/CD pipeline setup, including complex workflows and integrations.
- AWS: Intermediate understanding of cloud platforms and services.
- Should-Have Capabilities:
- Ability to mentor junior engineers and share knowledge.
- Strong communication and collaboration skills.
- Nice-to-Have Capabilities:
- Understanding of Node.js.
- Familiarity with container technologies (e.g., Docker, Kubernetes).
- Familiarity with Ansible.
Responsibilities:
- Design, build, and maintain software delivery pipelines and infrastructure that support continuous integration, delivery, and deployment.
- Collaborate with development and operations teams to ensure that software is delivered with high quality, speed, and reliability.
- Automate manual processes, such as testing, deployment, and monitoring, to improve efficiency and reduce errors.
- Develop and maintain monitoring and alerting systems to proactively identify and address issues in production environments.
- Troubleshoot production issues, conduct root cause analysis, and implement remediation plans.
- Manage and scale infrastructure resources, such as servers, databases, and cloud services, to ensure optimal performance and cost-effectiveness.
- Implement security best practices and ensure compliance with industry standards and regulations.
- Continuously learn and keep up to date with new technologies and industry trends to improve system performance, security, and efficiency.