We’re a deep-tech innovator at the intersection of artificial intelligence, machine-learning infrastructure, and edge-to-cloud platforms. Our award-winning solutions let Fortune 500 enterprises build, train, and deploy large-scale AI models seamlessly, securely, and at lightning speed. As global demand for generative AI, RAG pipelines, and autonomous agents accelerates, we’re scaling our MLOps team to keep our customers ahead of the curve.
Role & Responsibilities
Own the full MLOps stack: design, build, and harden GPU-accelerated Kubernetes clusters across on-prem data centers and AWS/GCP/Azure for model training, fine-tuning, and low-latency inference (see the GPU-scheduling sketch after this list).
Automate everything: craft IaC modules (Terraform/Pulumi) and CI/CD pipelines that deliver zero-downtime releases and reproducible experiment tracking (IaC sketch below).
Ship production-grade LLM workloads: optimize RAG/agent pipelines, manage model registries, and implement self-healing workflow orchestration with Kubeflow/Flyte/Prefect (retry sketch below).
Eliminate bottlenecks: profile CUDA kernels, resolve driver mismatches, and tune distributed frameworks (Ray, DeepSpeed) for multi-node scale-out (Ray sketch below).
Champion reliability: architect highly available data lakes, databases, ingress/egress, DNS, and end-to-end observability (Prometheus/Grafana) targeting 99.99% uptime (metrics sketch below).
Mentor & influence: instill a platform-first mindset, codify best practices, and report progress and roadblocks directly to senior leadership.
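To give a flavor of the day-to-day, here are a few illustrative sketches; tool choices, names, and images below are assumptions for illustration, not our production code. First, scheduling a GPU-backed inference pod with the official Kubernetes Python client (the container image is a hypothetical placeholder, and the pod lands in the default namespace):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference", labels={"app": "llm"}),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="server",
                image="registry.example.com/llm-server:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    # Request one GPU so the scheduler places this pod on a
                    # GPU node exposed by the NVIDIA device plugin.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
        restart_policy="Never",
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```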
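On the IaC side, a minimal Pulumi module in Python, assuming the pulumi_aws provider: a versioned S3 bucket for experiment artifacts, the kind of building block reproducible experiment tracking sits on. Resource names are illustrative, and this runs under `pulumi up` rather than as a plain script:

```python
import pulumi
import pulumi_aws as aws

# Versioned bucket for experiment artifacts: every checkpoint and metrics
# file keeps its history, which underpins reproducible experiment runs.
artifacts = aws.s3.Bucket(
    "experiment-artifacts",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
    tags={"team": "mlops", "managed-by": "pulumi"},
)

pulumi.export("artifact_bucket", artifacts.bucket)
```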
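For self-healing orchestration, a minimal Prefect sketch: transient task failures are retried automatically instead of failing the whole pipeline. The embedding and upsert functions are hypothetical stand-ins:

```python
import random

from prefect import flow, task

@task(retries=3, retry_delay_seconds=10)
def embed_batch(batch: list[str]) -> list[list[float]]:
    # A flaky dependency (simulated here) is retried by Prefect
    # rather than tearing down the whole RAG ingest run.
    if random.random() < 0.3:
        raise ConnectionError("embedding service unavailable")
    return [[float(len(doc))] for doc in batch]  # placeholder embeddings

@task
def upsert(vectors: list[list[float]]) -> int:
    return len(vectors)  # stand-in for a vector-store write

@flow
def rag_ingest(docs: list[str]) -> int:
    return upsert(embed_batch(docs))

if __name__ == "__main__":
    rag_ingest(["doc one", "doc two"])
```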
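For multi-node scale-out, a Ray sketch: each remote task reserves one GPU and Ray handles placement across nodes. This assumes a running GPU cluster; without GPU capacity the tasks simply wait for resources:

```python
import os

import ray

ray.init()  # attach to an existing cluster, or start a local one

@ray.remote(num_gpus=1)
def gpu_probe(shard_id: int) -> tuple[int, str]:
    # Ray sets CUDA_VISIBLE_DEVICES so each task sees only its own GPU,
    # which is the first thing to check when chasing driver mismatches.
    return shard_id, os.environ.get("CUDA_VISIBLE_DEVICES", "unset")

if __name__ == "__main__":
    futures = [gpu_probe.remote(i) for i in range(4)]
    print(ray.get(futures))
```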
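And on observability, a sketch with the prometheus_client library: a latency histogram exposed on /metrics for Prometheus to scrape and Grafana to chart. The handler and bucket boundaries are illustrative:

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Latency histogram for an inference endpoint; uptime and latency SLOs
# start with series like this feeding Grafana dashboards and alerts.
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds",
    "Model inference latency in seconds",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0),
)

@INFERENCE_LATENCY.time()
def handle_request() -> None:
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for a model forward pass

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on port 8000
    while True:
        handle_request()
```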