We’re a deep-tech innovator at the intersection of artificial intelligence, machine-learning infrastructure, and edge-to-cloud platforms. Our award-winning solutions let Fortune 500 enterprises build, train, and deploy large-scale AI models seamlessly, securely, and at lightning speed. As global demand for generative AI, RAG pipelines, and autonomous agents accelerates, we’re scaling our MLOps team to keep our customers ahead of the curve.
Role & Responsibilities
Own the full MLOps stack: design, build, and harden GPU-accelerated Kubernetes clusters across on-prem data centers and AWS/GCP/Azure for model training, fine-tuning, and low-latency inference (see the GPU-scheduling sketch after this list).
Automate everything: craft IaC modules (Terraform/Pulumi) and CI/CD pipelines that deliver zero-downtime releases and reproducible experiment tracking (IaC sketch below).
Ship production-grade LLM workloads: optimize RAG/agent pipelines, manage model registries, and implement self-healing workflow orchestration with Kubeflow/Flyte/Prefect (retry sketch below).
Eliminate bottlenecks: profile CUDA kernels, resolve driver mismatches, and tune distributed frameworks (Ray, DeepSpeed) for multi-node scale-out (Ray sketch below).
Champion reliability: architect highly available data lakes, databases, ingress/egress, DNS, and end-to-end observability (Prometheus/Grafana) targeting 99.99% uptime (metrics sketch below).
Mentor & influence: instill a platform-first mindset, codify best practices, and report progress and roadblocks directly to senior leadership.
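To give a flavor of the day-to-day, here are a few illustrative sketches; tool choices, names, and images below are assumptions for illustration, not our production code. First, scheduling a GPU-backed inference pod with the official Kubernetes Python client (the container image is a hypothetical placeholder, and the pod lands in the default namespace):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference", labels={"app": "llm"}),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="server",
                image="registry.example.com/llm-server:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    # Request one GPU so the scheduler places this pod on a
                    # GPU node exposed by the NVIDIA device plugin.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
        restart_policy="Never",
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```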
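On the IaC side, a minimal Pulumi module in Python, assuming the pulumi_aws provider: a versioned S3 bucket for experiment artifacts, the kind of building block reproducible experiment tracking sits on. Resource names are illustrative, and this runs under `pulumi up` rather than as a plain script:

```python
import pulumi
import pulumi_aws as aws

# Versioned bucket for experiment artifacts: every checkpoint and metrics
# file keeps its history, which underpins reproducible experiment runs.
artifacts = aws.s3.Bucket(
    "experiment-artifacts",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
    tags={"team": "mlops", "managed-by": "pulumi"},
)

pulumi.export("artifact_bucket", artifacts.bucket)
```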
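For self-healing orchestration, a minimal Prefect sketch: transient task failures are retried automatically instead of failing the whole pipeline. The embedding and upsert functions are hypothetical stand-ins:

```python
import random

from prefect import flow, task

@task(retries=3, retry_delay_seconds=10)
def embed_batch(batch: list[str]) -> list[list[float]]:
    # A flaky dependency (simulated here) is retried by Prefect
    # rather than tearing down the whole RAG ingest run.
    if random.random() < 0.3:
        raise ConnectionError("embedding service unavailable")
    return [[float(len(doc))] for doc in batch]  # placeholder embeddings

@task
def upsert(vectors: list[list[float]]) -> int:
    return len(vectors)  # stand-in for a vector-store write

@flow
def rag_ingest(docs: list[str]) -> int:
    return upsert(embed_batch(docs))

if __name__ == "__main__":
    rag_ingest(["doc one", "doc two"])
```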
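For multi-node scale-out, a Ray sketch: each remote task reserves one GPU and Ray handles placement across nodes. This assumes a running GPU cluster; without GPU capacity the tasks simply wait for resources:

```python
import os

import ray

ray.init()  # attach to an existing cluster, or start a local one

@ray.remote(num_gpus=1)
def gpu_probe(shard_id: int) -> tuple[int, str]:
    # Ray sets CUDA_VISIBLE_DEVICES so each task sees only its own GPU,
    # which is the first thing to check when chasing driver mismatches.
    return shard_id, os.environ.get("CUDA_VISIBLE_DEVICES", "unset")

if __name__ == "__main__":
    futures = [gpu_probe.remote(i) for i in range(4)]
    print(ray.get(futures))
```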
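And on observability, a sketch with the prometheus_client library: a latency histogram exposed on /metrics for Prometheus to scrape and Grafana to chart. The handler and bucket boundaries are illustrative:

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Latency histogram for an inference endpoint; uptime and latency SLOs
# start with series like this feeding Grafana dashboards and alerts.
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds",
    "Model inference latency in seconds",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0),
)

@INFERENCE_LATENCY.time()
def handle_request() -> None:
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for a model forward pass

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on port 8000
    while True:
        handle_request()
```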