Description:
We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) to join our team. The ideal candidate will have a strong background in cloud platforms, DevOps practices, and modern software development frameworks. The SRE will play a critical role in designing, building, and maintaining highly scalable, fault-tolerant, and secure cloud infrastructure while ensuring operational excellence, high availability, and reliability.
Key Responsibilities:
- Cloud Infrastructure & Automation:
- Design, implement, and manage cloud-based infrastructure using platforms like AWS, Azure, or GCP
- Utilize Infrastructure-as-Code (IaC) tools such as Terraform, CloudFormation, and Ansible to automate deployments and configurations
- Build robust automation for anomaly detection, toil reduction, recovery, and self-healing, and optimize cloud costs
- DevSecOps & CI/CD:
- Apply DevSecOps principles and build CI/CD pipelines using tools like GitLab, Jenkins, SonarQube, Nexus/Artifactory, and Docker
- Implement security best practices, including IAM roles, RBAC, vulnerability remediation, and SAST/DAST/SCA tools
- Observability & Incident Management:
- Design and implement monitoring, logging, and distributed tracing solutions using tools like AWS CloudWatch, Splunk/SignalFx, Dynatrace, and OpenTelemetry
- Lead root cause analysis, blameless postmortems, and proactive incident management to minimize MTTR and MTTD
- Define and monitor SLOs, SLIs, and error budgets to ensure system reliability
- Microservices & API Management:
- Architect and manage microservices, serverless computing, and RESTful APIs
- Ensure fault tolerance and resilience using design patterns like Circuit Breaker, Retry, Timeout, and Bulkhead
- Chaos Engineering & Resiliency:
- Conduct chaos engineering experiments using tools like AWS FIS and Chaos Toolkit
- Perform resiliency assessments using AWS Resilience Hub and implement self-healing solutions
- Database & Application Support:
- Manage and optimize database technologies such as PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift
- Provide production support, including incident response, problem management, and runbook creation, and participate in on-call rotations
- Collaboration & Communication:
- Collaborate with cross-functional teams to implement shift-left testing practices (BDD, TDD, Unit, Regression)
- Create and maintain architecture diagrams, knowledge articles, and disaster recovery plans
- Communicate effectively with stakeholders and demonstrate strong relationship management skills
Required Skills & Qualifications:
- Expertise in cloud platforms (AWS, Azure, or GCP) and container orchestration
- Proficiency in programming/scripting languages such as Python, Java, Node.js, Bash, and PowerShell
- Strong knowledge of database technologies (e.g., PostgreSQL, MongoDB, DynamoDB, Oracle, Redshift)
- Experience with DevOps tools (Jenkins, Docker, Nexus/Artifactory) and build tools (Maven, Gradle)
- Familiarity with AI/ML integrations, event-driven architectures, and distributed systems
- Expertise in observability, logging, and monitoring tools (AWS CloudWatch, Splunk, Dynatrace, OpenTelemetry)
- Strong understanding of security practices, including IAM, RBAC, and vulnerability management
- Experience with chaos engineering, resiliency assessments, and disaster recovery planning
- Proficiency in performance testing tools (JMeter, LoadRunner) and capacity planning
- Excellent verbal and written communication skills, with the ability to collaborate across teams
- 8+ years of related experience, including leading teams on projects of similar scope and complexity
- Bachelor’s or master’s degree in computer science or equivalent
- Certifications: AWS Solutions Architect, Agile Certified Practitioner (ACP), or relevant cloud certifications
Preferred Qualifications:
- Experience with AI/ML libraries (e.g., NLTK, Transformers, spaCy, SciPy), Amazon SageMaker, and GenAI tools
- Familiarity with project management tools like Jira, Confluence, and ServiceNow
- Knowledge of utilities like AWS CLI, Postman, and curl