We are seeking a skilled and motivated Cloud Engineer to design, implement, and maintain cloud infrastructure optimized for AI and Generative AI workloads. This role involves provisioning cloud resources, automating deployments, integrating cloud-native AI services, and ensuring secure, scalable, and observable AI/ML environments across AWS, Azure, and GCP.
Key Responsibilities
Cloud Platform & AI Services Management
Administer and troubleshoot AI/ML services across AWS, Azure, and GCP.
Apply best practices for managing cloud-native AI services such as Azure OpenAI and Amazon SageMaker.
Support hybrid and multi-cloud environments for AI workloads.
AI/ML Platform Engineering
Deploy and manage secure, scalable AI/ML workloads in the cloud.
Integrate vector databases and similarity search services into AI pipelines.
Infrastructure as Code (IaC)
Provision AI-ready infrastructure using Terraform, Bicep, and CloudFormation.
Maintain reusable IaC modules for consistent and automated deployments.
API Management & Integration
Design and maintain API gateways (e.g., Azure API Management) for AI-powered applications.
Ensure secure and scalable API integrations for ML services.
DevOps & CI/CD for AI Pipelines
Build and maintain CI/CD workflows for ML model training, deployment, and retraining.
Integrate these workflows with tools such as GitHub Actions or Azure DevOps.
Scripting & Automation
Develop automation scripts in Python, Bash, or PowerShell for provisioning, data preparation, and operational tasks.
Container Orchestration
Deploy and manage containerized AI workloads using Kubernetes.
Secure runtime environments and manage resource scaling.
Security & Compliance
Implement encryption, access controls, and compliance policies for cloud-based LLMs and AI services.
Collaborate with InfoSec teams to enforce governance standards.
Monitoring & Observability
Set up metrics, logging, and alerting for AI model performance and infrastructure health using tools such as Prometheus and the ELK Stack.
Required Skills and Qualifications
5+ years of experience managing cloud platforms (AWS, Azure, GCP), including AI/ML services.
Hands-on experience with Azure OpenAI, Amazon SageMaker, or similar platforms.
Proficiency in Infrastructure as Code tools (Terraform, Bicep, CloudFormation).
Experience with API management tools (e.g., Azure API Management).
Strong scripting skills in Python, Bash, or PowerShell.
Experience deploying and managing Kubernetes clusters.
Knowledge of cloud-native vector databases and similarity search services.
Understanding of cloud security principles and compliance for AI/ML.
Familiarity with monitoring tools such as Prometheus, Grafana, and the ELK Stack.
Excellent problem-solving and collaboration skills.
Soft Skills
Strong communication and cross-functional collaboration abilities.
Analytical mindset with a proactive approach to problem-solving.
Eagerness to learn and adopt emerging cloud AI technologies.