The Site Reliability Engineer (SRE) / DevOps and Infrastructure Engineer maintains and improves the reliability, scalability, and efficiency of software services and infrastructure. The role ensures that development and operational standards are met across both cloud and on-premises environments, including vSphere and Nutanix.
Responsibilities:
- Infrastructure Management: Manage and scale infrastructure on both cloud platforms (such as Azure and AWS) and on-premises environments (like vSphere and Nutanix). This includes provisioning, configuration, and optimizing resources to meet service demands.
- Kubernetes Installation & Management: Install and manage Kubernetes clusters in on-prem environments, ensuring seamless integration with underlying platforms like vSphere. Handle networking, permissions, and infrastructure management at both the vSphere and Kubernetes layers.
- Tool Proficiency: Use DevOps and SRE tools such as Kubernetes (K8s) for container orchestration, a service mesh for microservices networking, and Vault for secrets management. Integrate security best practices into the infrastructure.
- CI/CD Pipeline Management: Design, implement, and manage CI/CD pipelines using tools like Jenkins, ensuring automated, smooth deployment processes that support frequent updates and minimize downtime.
- Production Release Management: Coordinate production software releases, including managing staging environments, release testing, and deployment to live environments.
- Monitoring and Incident Response: Implement monitoring tools to observe the health of applications and infrastructure. Respond quickly to production issues, ensuring minimal impact on availability and performance.
- Collaboration and Communication: Work closely with software developers, QA teams, and IT staff to ensure system reliability and performance. Foster communication across departments to address system-wide issues and improvements.
Key Tools and Technologies:
- Kubernetes (K8s): Managing containerized applications across clusters for operational efficiency and resource optimization.
- Service Mesh: Improving microservices communication by adding observability, security, and reliability without changing the microservices code.
- Vault by HashiCorp: Securing secrets and sensitive data with access control and auditing capabilities.
- SecOps Practices: Ensuring security is integrated into the infrastructure, safeguarding against potential threats.
- vSphere/Nutanix: Managing on-premises infrastructure, especially for Kubernetes deployment, networking, and permissions.
- CI/CD Tools: Tools like Jenkins for automating development, testing, and deployment processes.
- Shell & Node.js: Strong experience in shell scripting and Node.js for automation and development tasks.
- Enterprise-Grade Software: Extensive experience working with microservices-based, enterprise-grade software solutions that are robust, scalable, and deployed in production environments.
- Databases: Knowledge of installing, managing, and tuning databases like MongoDB and PostgreSQL.
Requirements:
- BSc in Computer Science, Engineering, or a relevant field.
- Proven experience as a DevOps or Site Reliability Engineer.
- Expertise in managing both cloud and on-premises infrastructure.
- Strong experience with Kubernetes (both cloud and on-prem), networking, and microservices.
- Solid knowledge of shell scripting and Node.js.
- Experience with enterprise-grade software in production environments.
- Experience with databases such as MongoDB and PostgreSQL, including installation and tuning.
- Problem-solving attitude and strong team collaboration skills.