As a Site Reliability Engineer (SRE) at Centific, you will be responsible for implementing and maintaining highly available, scalable, and secure infrastructure and services. This role involves developing automation solutions to enhance reliability, performance, and incident response while ensuring operational efficiency. You will collaborate with software engineering, infrastructure, and DevOps teams to proactively identify potential issues, prevent system failures, and drive continuous improvements across cloud and on-prem environments.
This role is hands-on and requires expertise in system reliability, automation, cloud infrastructure, and incident response.
Implement scalable and highly available systems to improve system resilience.
Automate manual operational tasks using Python/Bash to improve system performance and reliability.
Develop and maintain Infrastructure as Code (IaC) solutions using Terraform/Ansible.
Apply auto-scaling, load balancing, and failover strategies for cloud-based applications.
Work with cloud services such as AWS/Azure/GCP to optimize infrastructure provisioning and scaling.
Develop and deploy self-healing mechanisms for automated remediation of system failures.
Incident and Problem Management:
Follow incident response playbooks to streamline on-call troubleshooting and resolution.
Knowledge of ITIL v3/v4.
Automate orchestration workflows using any ITSM tool.
Participate in production incident resolution, conduct root cause analysis (RCA), and assist in implementing permanent fixes.
Improve system fault tolerance using chaos engineering tools (Chaos Monkey/LitmusChaos) to test failure scenarios.
Support disaster recovery (DR) plans with backup, restore, and failover strategies.
Participate in regular failover drills and game days to validate recovery strategies and incident handling efficiency.
Performance Optimization & Capacity Planning:
Assist in system performance analysis through capacity planning, latency tracking, and traffic analysis.
Support monitoring of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to ensure uptime and performance targets are met.
Work with DevOps and infrastructure teams to ensure systems are scalable and meet business growth demands.
Leverage predictive analytics to proactively detect capacity bottlenecks and optimize resource allocation.
Security, Compliance & Best Practices:
Follow security best practices in cloud and on-prem environments.
Support compliance with regulations and standards such as GDPR, HIPAA, and ISO 27001 in reliability and monitoring solutions.
Adhere to role-based access control (RBAC) policies and encryption standards, and support vulnerability assessments.
Knowledge of automated security scanning and monitoring to detect vulnerabilities and misconfigurations in real time.
Monitoring & Observability:
Deploy and configure monitoring, logging, and alerting tools such as Prometheus/Grafana/Datadog/Splunk/Elastic Stack (ELK)/New Relic.
Establish real-time alerting mechanisms using Prometheus Alertmanager/PagerDuty/Opsgenie to proactively detect failures.
Work with developers and DevOps teams to instrument applications with OpenTelemetry/Jaeger/AWS X-Ray for distributed tracing.
Implement log aggregation pipelines using Fluentd/Graylog to centralize logs for troubleshooting and analytics.
Optimize metrics ingestion pipelines to maintain performance efficiency with minimal overhead.
Establish Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to ensure uptime and performance targets are met.
CI/CD & DevOps Integration:
Contribute to highly efficient CI/CD pipelines using Jenkins/GitHub Actions/GitLab CI/CD.
Work with developers to integrate reliability principles into software development workflows.
Assist in progressive delivery strategies such as blue-green deployments and canary releases to minimize production impact.
Automate deployment rollback mechanisms to improve system stability and reduce downtime.
Must-Have Qualifications
Education: Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Experience: 3+ years of experience in site reliability engineering, cloud infrastructure, and DevOps automation.
Cloud & Infrastructure: Practical expertise in AWS/Azure/GCP, with experience in cloud networking, storage, and computing services.
Automation & Scripting: Proficiency in Python/Go/Bash to build automation scripts and tools.
CI/CD & Infrastructure Automation: Experience in managing CI/CD pipelines with Jenkins/GitHub Actions/GitLab CI/CD.
High Availability & Performance Optimization: Knowledge of auto-scaling, load balancing, and performance tuning.
Incident Response & RCA: Ability to assist in production incident response and apply RCA methodologies.
Good to Have Qualifications
Certifications: AWS Certified Solutions Architect, Google Cloud Professional Engineer, or Certified Kubernetes Administrator (CKA).
Chaos Engineering: Experience with Chaos Monkey or LitmusChaos for testing system resilience.
Kubernetes & Containerization: Familiarity with Kubernetes cluster management and container orchestration.
Security & Compliance: Experience in implementing security policies, access controls, and vulnerability assessments.
Experience with Predictive Analytics: Knowledge of AI/ML techniques for proactive failure detection and automated incident response.
Soft Skills
Strong problem-solving and analytical thinking to diagnose and troubleshoot complex system failures efficiently.
Ability to collaborate effectively with development, DevOps, and infrastructure teams to integrate reliability best practices.
Strong verbal and written communication skills to explain technical issues clearly to both engineering and non-technical teams.
Ability to remain calm under pressure during high-severity incidents and make well-reasoned decisions.
Adaptability to work in dynamic environments with evolving infrastructure, tools, and business requirements.
Resilience and stress management to handle on-call rotations, production outages, and critical system failures.
About Centific
Centific expertly engineers platforms and curates multimodal, multilingual data to empower the ‘Magnificent Seven’ and enterprise clients with safe, scalable Artificial Intelligence (AI) deployment. Our team includes over 150 PhDs and data scientists, along with more than 4,000 AI practitioners and engineers. We leverage an integrated ecosystem comprising industry-leading partnerships and 1.8 million vertical domain experts across 230 locales to create high-quality pre-trained datasets, fine-tuned industry-specific Large Language Models (LLMs), and Retrieval-Augmented Generation (RAG) pipelines supported by vector databases. Our innovations can reduce Generative Artificial Intelligence (Gen AI) costs by up to 80% and bring Gen AI solutions to market 50% faster.
Applying to this job implies your agreement with: In compliance with the applicable data protection regulations (General Data Protection Regulation 679/2016), we at Centific, as data Controllers, inform you that the data collected will be processed to manage your application in the selection process. This processing is legitimized by the consent given upon submitting your application, and the data will be retained for a maximum of two years. You may exercise your rights of access, rectification, deletion, opposition, restriction of processing, and data portability by contacting [email protected]. Furthermore, if deemed necessary, you may file a complaint with the relevant supervisory authority.
Your Authentic Self at Centific
Centific is committed to creating a diverse environment and is proud to be an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, pregnancy, disability, age, veteran status, or other characteristics. Centific is also committed to compliance with all fair employment practices regarding citizenship and immigration status.