We are looking for a Site Reliability Engineer (SRE) to join the Lucidya Cloud Engineering team and help improve the reliability, scalability, and automation of our cloud-based infrastructure. The ideal candidate will have hands-on experience with cloud environments, containerized workloads, automation tools, and monitoring systems, as well as a proactive mindset for enhancing system availability and performance.
Key Responsibilities:
- Infrastructure Reliability
  - Ensure high availability (HA) and scalability of critical infrastructure components (e.g., Redis, RabbitMQ, Kubernetes clusters)
  - Proactively identify and eliminate single points of failure across the cloud environment
- Linux Systems Administration: Handle infrastructure management tasks such as patching, performance tuning, and monitoring of Linux-based systems
- Cloud Operations
  - Manage and optimize cloud-based workloads across AWS, GCP, or Azure
  - Automate provisioning, scaling, and maintenance tasks using Infrastructure as Code (IaC) tools such as Terraform, AWS CloudFormation, or similar
- Kubernetes Clusters
  - Manage the day-to-day operations of Kubernetes clusters, including deployment, scaling, upgrades, and troubleshooting
- Monitoring and Incident Response
  - Implement and standardize monitoring solutions using tools like Datadog, Prometheus, or Grafana to track golden metrics and improve alerting systems
  - Participate in on-call rotations, troubleshoot incidents, and drive post-incident reviews to implement lasting solutions
- Automation and Scripting
  - Develop and maintain automation scripts for routine operational tasks to reduce manual effort and increase efficiency
  - Advocate for AWX/Ansible adoption to automate configurations and deployments
- Collaboration and Best Practices
  - Work closely with DevOps and Engineering teams to identify and resolve performance bottlenecks
  - Contribute to the establishment of best practices for infrastructure and application reliability

Key Requirements:
- Experience and Knowledge
  - 3 years of experience in a similar SRE, DevOps, or Infrastructure Engineer role
  - Strong experience with at least one major cloud provider (AWS, GCP, or Azure)
  - Hands-on experience with Kubernetes and containerization (e.g., Docker)
- Technical Skills
  - Proficient in scripting languages such as Python, Bash, or similar for automation
  - Familiarity with Infrastructure as Code (IaC) tools like Terraform, Pulumi, or AWS CloudFormation
  - Strong understanding of load balancers, networking (IP management, subnetting), and HA architecture
  - Experience with CI/CD tools (e.g., Bitbucket Pipelines, Jenkins, GitHub Actions)
- Monitoring and Observability
  - Experience with modern monitoring and observability tools (e.g., Datadog, ELK, Grafana)
  - Ability to define and track golden metrics and establish meaningful alerting thresholds
- Problem Solving and Troubleshooting
  - Strong analytical skills and ability to resolve complex technical issues
  - Proven track record in root cause analysis and incident management
- Soft Skills
  - Excellent communication and collaboration skills to work across teams
  - Self-motivated and proactive in improving systems and processes