We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and play a critical role in delivering exceptional managed services to our clients. As a key member of our engineering team, you will be responsible for designing, implementing, and maintaining robust and scalable infrastructure solutions on Google Cloud Platform (GCP) and Amazon Web Services (AWS).
Key Responsibilities:
Design, build, and maintain highly available, scalable, and performant platform components and shared services.
Define, monitor, and report on key Service Level Indicators (SLIs) and Service Level Objectives (SLOs) relevant to platform health and customer experience.
Identify and eliminate single points of failure across the infrastructure.
Participate in a rotating on-call schedule to provide after-hours support and incident response as needed.
Implement and improve monitoring, logging, and alerting systems to gain deep visibility into platform health, resource utilization, and potential issues, including those triggered by customer activity.
Automate repetitive operational tasks ("toil") related to platform management, provisioning, scaling, and healing.
Develop and maintain Infrastructure as Code (IaC) and CI/CD pipelines to manage the platform infrastructure consistently and reliably.
Participate in incident response, troubleshooting, and resolution efforts for platform issues.
Collaborate with Customer Success teams to diagnose and resolve complex platform issues that may be related to customer-specific configurations or usage.
Contribute to the architectural design and evolution of the platform, focusing on resilience, multi-tenancy best practices, and supportability under varying customer loads.
Perform capacity planning to ensure the platform can handle anticipated customer growth and usage patterns.
Qualifications:
At least 2 years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role supporting production systems.
Experience working with cloud platforms (GCP, AWS, Alibaba Cloud, Azure).
Strong understanding of monitoring, logging, and alerting principles and tools (Prometheus, Grafana, ELK Stack, Datadog).
Proficiency in Infrastructure as Code (IaC) tools (Terraform, CloudFormation, Pulumi).
Solid scripting and automation skills (Python, Go, Bash).
Experience with containerization and orchestration technologies (Docker, Docker Swarm, Kubernetes).
Familiarity with CI/CD pipelines and practices.
Understanding of networking fundamentals, databases, and distributed systems.
Experience participating in on-call rotations.
Excellent problem-solving and troubleshooting skills.
Strong communication and collaboration skills.
A passion for automation and continuous improvement.
A proactive approach to identifying and resolving problems: don’t wait around, grab it and fix it.
An eagerness to learn.
Excellent communication skills in English and Bahasa Indonesia.
Preferred Qualifications:
Certifications in GCP or AWS.
Experience working directly with customer-facing teams.
Experience defining and tracking customer-facing SLOs.
Experience providing self-service tooling or observability insights to customers.
Experience with cloud cost optimization strategies.