About Our Client

Our client is a software company on a mission to build the AI Co-Worker, a powerful platform designed to learn complex jobs and deploy intelligent agents to solve them at scale. Their technology sits at the intersection of AI and human productivity, enabling businesses to automate sophisticated workflows like never before.

As they scale their ML-powered products, they are looking for an SRE with deep experience in ML infrastructure to ensure their models, services, and data pipelines are highly available, observable, and production-ready.

This is a key role at the intersection of DevOps, ML, and platform engineering.

About the Role

As a Site Reliability Engineer focused on AI/ML Ops, you’ll be responsible for the operational excellence of the machine learning infrastructure. You’ll help build scalable, reliable systems for training, serving, and monitoring models across a modern ML stack. You’ll partner with engineers, data scientists, and product teams to ensure fast, safe, and automated deployments of both models and services.

What You’ll Do

Design, deploy, and maintain highly reliable infrastructure for model training, inference, and data processing
Build scalable ML Ops pipelines for versioning, testing, releasing, and monitoring models in production
Implement observability across ML workloads (latency, accuracy, drift, utilization, failure modes)
Automate infrastructure provisioning with tools like Terraform, Helm, and Kubernetes
Improve CI/CD workflows for both code and ML models (e.g., A/B testing, shadow deployment, rollback)
Own and optimize GPU/TPU resource utilization across cloud/on-prem environments
Implement robust incident response, root cause analysis, and performance tuning
Collaborate with ML engineers, backend engineers, and product teams to deploy models into production safely

What They Are Looking For

8+ years of experience in SRE, DevOps, or platform engineering roles
Experience managing production ML systems, model serving, or ML pipelines
Strong programming skills (Python, Bash, Go, or similar)
Deep knowledge of Kubernetes, containerization (Docker), and orchestration patterns
Experience with cloud infrastructure (AWS/GCP/Azure) and IaC tools (Terraform, Pulumi, etc.)
Familiarity with ML tools and frameworks (PyTorch, TensorFlow, Hugging Face, MLflow, Airflow, etc.)
Hands-on experience with logging/monitoring/alerting tools (Prometheus, Grafana, Datadog, etc.)
Excellent problem-solving skills with a passion for reliability and performance

Nice to Have

Experience with LLM Ops or real-time inference of foundation models
Familiarity with feature stores, vector databases, or model registries
Experience operating hybrid cloud/on-prem GPU clusters
Background in security, compliance, and cost optimization in ML environments
Previous experience in a startup or fast-paced, high-growth environment

Why Join Them?

Culture That Empowers: They foster a collaborative, high-trust environment where innovation thrives and every voice matters.
Visionary Leadership: The founders and CTO are highly respected figures in the Silicon Valley tech ecosystem, with a track record of building world-class products.
True Autonomy: You’ll have the freedom to shape the product, make meaningful decisions, and influence the future of AI-powered work.

If you’re excited about building rock-solid infrastructure for AI systems and want to shape the future of reliable ML in production, our client would love to meet you.
