We are seeking a highly skilled Site Reliability Engineer (SRE) to join our Data & Algorithm team, where you will play a pivotal role in building and maintaining resilient, scalable, and high-performing systems. You will act as the bridge between development and operations: championing reliability, reducing operational toil, and driving excellence through observability, automation, and deep system-level expertise.
This is a hands-on, high-impact role for someone who thrives in a fast-paced, multitasking environment and has a strong foundation in infrastructure, automation, and modern cloud-native tools.
Key Responsibilities:
Design and implement resilient and scalable system architectures to ensure high availability.
Drive the adoption and monitoring of Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all critical services.
Develop automation tools and scripts (Python & Bash) to reduce manual interventions and operational toil.
Troubleshoot and resolve infrastructure and application issues, especially around Kubernetes, storage modules, and containerization.
Collaborate closely with engineering, data, and DevOps teams to implement best practices for system reliability and incident management.
Conduct root cause analysis and post-incident reviews, implementing improvements to prevent recurrence.
Use tools like Grafana to monitor system health, derive insights, and tune system performance.
Manage and maintain documentation for all systems, processes, and incident responses.
Support and troubleshoot key-value and NoSQL databases, as well as Kafka or BMQ (forked Kafka) for data streaming.
Multitask under pressure, prioritize workloads, and communicate effectively during high-stress scenarios.
Translate and convert data formats (CSV, JSON, etc.) using scripting to support analytics and system configurations.
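As a rough illustration of the data-format scripting mentioned in the last item above, here is a minimal sketch of a CSV-to-JSON conversion in Python. It is only an example of the kind of task involved, not a description of our internal tooling; the function name `csv_to_json` and the sample columns (`service`, `availability_slo`) are hypothetical.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects.

    Values are kept as strings; type conversion is left to the caller.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [dict(row) for row in reader]
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    # Hypothetical sample data, used only for illustration.
    sample = "service,availability_slo\ncheckout,99.9\nsearch,99.5\n"
    print(csv_to_json(sample))
```

Using `csv.DictReader` rather than splitting on commas by hand keeps quoted fields and embedded commas intact, which matters when the output feeds configuration files.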
Requirements
Required Qualifications:
Strong programming/scripting skills in Python and Bash.
Deep understanding of Kubernetes internals, containerization, and troubleshooting at the infrastructure level.
Experience with cloud platforms such as AWS, GCP, or Azure.
Solid background in Linux system administration and networking fundamentals.
Proficiency with tools like Git and VS Code.
Hands-on experience with monitoring tools, especially Grafana.
Familiarity with NoSQL databases and data streaming platforms (Kafka, BMQ).
Strong grasp of SRE principles: SLOs, SLIs, SLA management, toil reduction, incident handling.
Ability to multitask and thrive in high-pressure environments.