Sr. SRE/AI-ML Ops Engineer

Spektrum Recruiting

Redwood City, CA

About Our Client

Our client is a software company on a mission to build the AI Co-Worker, a powerful platform designed to learn complex jobs and deploy intelligent agents to solve them at scale. Their technology sits at the intersection of AI and human productivity, enabling businesses to automate sophisticated workflows like never before.


As they scale their ML-powered products, they are looking for an SRE with deep experience in ML infrastructure to ensure that their models, services, and data pipelines are highly available, observable, and production-ready.


This is a key role at the intersection of DevOps, ML, and platform engineering.


About the Role

As a Site Reliability Engineer focused on AI/ML Ops, you’ll be responsible for the operational excellence of the company’s machine learning infrastructure. You’ll help build scalable, reliable systems for training, serving, and monitoring models across a modern ML stack. You’ll partner with engineers, data scientists, and product teams to ensure fast, safe, and automated deployments of both models and services.


What You’ll Do

  • Design, deploy, and maintain highly reliable infrastructure for model training, inference, and data processing
  • Build scalable ML Ops pipelines for versioning, testing, releasing, and monitoring models in production
  • Implement observability across ML workloads (latency, accuracy, drift, utilization, failure modes)
  • Automate infrastructure provisioning with tools like Terraform, Helm, and Kubernetes
  • Improve CI/CD workflows for both code and ML models (e.g., A/B testing, shadow deployment, rollback)
  • Own and optimize GPU/TPU resource utilization across cloud/on-prem environments
  • Implement robust incident response, root cause analysis, and performance tuning
  • Collaborate with ML engineers, backend engineers, and product teams to deploy models into production safely


What They Are Looking For

  • 8+ years of experience in SRE, DevOps, or platform engineering roles
  • Experience managing production ML systems, model serving, or ML pipelines
  • Strong programming skills (Python, Bash, Go, or similar)
  • Deep knowledge of Kubernetes, containerization (Docker), and orchestration patterns
  • Experience with cloud infrastructure (AWS/GCP/Azure) and IaC tools (Terraform, Pulumi, etc.)
  • Familiarity with ML tools and frameworks (PyTorch, TensorFlow, Hugging Face, MLflow, Airflow, etc.)
  • Hands-on experience with logging/monitoring/alerting tools (Prometheus, Grafana, Datadog, etc.)
  • Excellent problem-solving skills with a passion for reliability and performance


Nice to Have

  • Experience with LLM Ops or real-time inference of foundation models
  • Familiarity with feature stores, vector databases, or model registries
  • Experience operating hybrid cloud/on-prem GPU clusters
  • Background in security, compliance, and cost optimization in ML environments
  • Previous experience in a startup or fast-paced, high-growth environment


Why Join Them?

  • Culture That Empowers: They foster a collaborative, high-trust environment where innovation thrives and every voice matters.
  • Visionary Leadership: The founders and CTO are highly respected figures in the Silicon Valley tech ecosystem, with a track record of building world-class products.
  • True Autonomy: You’ll have the freedom to shape the product, make meaningful decisions, and influence the future of AI-powered work.


If you’re excited about building rock-solid infrastructure for AI systems and want to shape the future of reliable ML in production, our client would love to meet you.
