Manager | Hybrid cloud | Bengaluru | Engineering | Hybrid Cloud Engineering
- Job requisition ID: 107230
- Location: Bengaluru
- Entity: Deloitte Touche Tohmatsu India LLP
Job Title: AI Infrastructure Architect / Operate Lead (Manager)
Role Summary
The AI Infrastructure Architect / Operate Lead is responsible for operationalizing, managing, and optimizing AI/ML platforms and infrastructure at scale. This role focuses on ensuring high availability, reliability, performance, security, and cost efficiency of AI workloads across multi-cloud and hybrid environments.
The role bridges AI engineering, cloud platform operations, MLOps, DevOps, and SRE practices, enabling organizations to run production-grade AI systems with strong governance and operational excellence.
Key Responsibilities
1. AI Platform Operations & Service Reliability
- Own end-to-end operations of AI platforms and infrastructure, including:
  - Model serving platforms (batch & real-time)
  - AI pipelines and orchestration frameworks
  - Data ingestion and processing layers
- Ensure:
  - 99.9%+ availability and resilience
  - Defined SLOs/SLIs for AI services
- Lead incident, problem, and change management processes
- Conduct root cause analysis (RCA) and implement preventive measures
2. MLOps & Model Lifecycle Management
- Lead operationalization of end-to-end ML lifecycle:
  - Model training, validation, deployment, monitoring, retraining
- Implement and manage:
  - ML pipelines (CI/CD for models)
  - Model registry and versioning
- Ensure:
  - Model reproducibility and traceability
  - Model performance tracking (latency, accuracy)
  - Drift detection (data drift / concept drift)
- Integrate automated retraining and feedback loops
3. Cloud & Platform Engineering
- Oversee deployment and operations across Azure, AWS, GCP, and hybrid environments
- Manage:
  - Kubernetes clusters (On-prem/AKS/EKS/GKE)
  - Serverless and container-based AI workloads
- Drive:
  - Infrastructure-as-Code (IaC) adoption (Terraform, Bicep, CloudFormation)
  - Platform standardization and reusable components
- Ensure scalable infrastructure for training (high compute) and inference (low latency)
4. GPU & High-Performance Compute Optimization
- Manage and optimize GPU/TPU-based workloads
- Ensure efficient:
  - Workload scheduling
  - Resource allocation and bin-packing
- Optimize infrastructure for:
  - Distributed training (e.g., Horovod, DeepSpeed)
  - Cost-performance trade-offs
- Monitor GPU utilization and improve efficiency metrics
5. Observability & Intelligent Monitoring
- Implement end-to-end observability across:
  - Infrastructure (CPU, GPU, memory)
  - Platform services
  - AI models
- Define metrics for:
  - Model drift, bias, latency, throughput
- Deploy monitoring tools:
  - Prometheus, Grafana, ELK, Azure Monitor, Datadog
- Enable predictive alerting and AIOps capabilities
6. Security, Compliance & Responsible AI
- Ensure secure operation of AI systems:
  - Identity & access management (IAM/RBAC)
  - Data encryption (at rest & in transit)
  - Secure model endpoints
- Enforce:
  - Data privacy regulations (GDPR, HIPAA, etc.)
  - Responsible AI policies (bias detection, explainability)
- Maintain:
  - Audit trails for models and data
  - Governance frameworks for model lifecycle
7. FinOps & Cost Optimization
- Drive cost efficiency for AI workloads:
  - GPU and compute optimization
  - Storage and data transfer optimization
- Implement:
  - Autoscaling and workload scheduling strategies
  - Spot/preemptible usage
- Build:
  - Cost dashboards and chargeback models
- Align AI infrastructure spend with business outcomes
8. Service Delivery & Operations Management
- Lead 24x7 operations support (if applicable)
- Manage SLAs, OLAs, and KPIs
- Implement ITIL-based processes:
  - Incident, problem, change, release management
- Drive continuous service improvement initiatives
9. Team Leadership & Talent Development
- Lead and mentor a team of:
  - MLOps engineers
  - Cloud/platform engineers
  - SREs / AI Ops specialists
- Responsibilities include:
  - Workforce planning and hiring
  - Capability development and certifications
  - Performance management
- Foster a culture of:
  - Automation-first mindset
  - Reliability engineering
  - DevOps practices
10. Stakeholder & Program Management
- Partner with:
  - Data science and AI engineering teams
  - Enterprise architects
  - Security and governance teams
- Translate business requirements into:
  - Scalable AI infrastructure solutions
- Provide leadership updates on:
  - Platform health
  - Cost metrics
  - Operational KPIs
11. Continuous Improvement & Innovation
- Introduce:
  - Self-healing infrastructure
  - Autonomous operations using AI (AIOps)
- Evaluate new technologies:
  - LLMOps (vector DBs, prompt pipelines, inference optimization)
  - Edge AI and distributed inference
- Improve platform maturity across:
  - Automation
  - Standardization
  - Reliability
Required Qualifications
Education
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field
Experience
- 12+ years in cloud/platform engineering or infrastructure operations
- At least 3-5 years in AI/ML infrastructure or MLOps
- Proven team management experience (Manager level)
Technical Skills
Cloud & Infrastructure
- Azure, AWS, GCP (multi-cloud preferred)
- Kubernetes, Docker
- Infrastructure as Code (Terraform, ARM/Bicep, CloudFormation)
AI/ML & MLOps
- Platforms: Azure ML, SageMaker, Vertex AI, MLflow, Kubeflow
- Model lifecycle management and pipeline orchestration
Data & Processing
- Apache Spark, Kafka, Airflow
- Data pipelines and feature stores
Observability & Monitoring
- Prometheus, Grafana, ELK stack, Datadog
Programming
- Python, Bash, or other scripting languages
Leadership & Functional Skills
- Strong people leadership and delivery management
- Experience in SRE / DevOps transformations
- Knowledge of ITIL-based service management
- Strong stakeholder communication and executive reporting
Preferred Qualifications
- Certifications:
  - Azure/AWS/GCP Architect
  - Certified Kubernetes Administrator (CKA)
  - AI/ML certifications (Azure ML, AWS ML Specialty)
- Experience with:
  - Generative AI / LLMOps ecosystems
  - Vector databases (FAISS, Pinecone, etc.)
  - Responsible AI frameworks