Manager | Hybrid cloud | Bengaluru | Engineering | Hybrid Cloud Engineering

Deloitte
Bengaluru, Karnataka

Job details

Qualifications

CI/CD
Cloud infrastructure
Azure
Computer Science
Kubernetes
DevOps
Spark
Encryption
AWS Certification
Master's degree
Bash (Unix shell)
RBAC
AWS
Docker
Bachelor's degree
Team management
Terraform
Continuous integration
Scripting
Kafka
AI
Leadership
Communication skills
Python
Identity & access management

Full job description

Job requisition ID: 107230

Date: Jun 22, 2026

Location: Bengaluru

Designation: Manager

Entity: Deloitte Touche Tohmatsu India LLP

Job Title: AI Infrastructure Architect / Operate Lead (Manager)

Role Summary

The AI Infrastructure Architect / Operate Lead is responsible for operationalizing, managing, and optimizing AI/ML platforms and infrastructure at scale. This role focuses on ensuring high availability, reliability, performance, security, and cost efficiency of AI workloads across multi-cloud and hybrid environments.

The role bridges AI engineering, cloud platform operations, MLOps, DevOps, and SRE practices, enabling organizations to run production-grade AI systems with strong governance and operational excellence.

Key Responsibilities

1. AI Platform Operations & Service Reliability

Own end-to-end operations of AI platforms and infrastructure, including:
Model serving platforms (batch & real-time)
AI pipelines and orchestration frameworks
Data ingestion and processing layers
Ensure:
99.9%+ availability and resilience
Defined SLOs/SLIs for AI services
Lead incident, problem, and change management processes
Conduct root cause analysis (RCA) and implement preventive measures

2. MLOps & Model Lifecycle Management

Lead operationalization of the end-to-end ML lifecycle:
Model training, validation, deployment, monitoring, retraining
Implement and manage:
ML pipelines (CI/CD for models)
Model registry and versioning
Ensure:
Model reproducibility and traceability
Model performance tracking (latency, accuracy)
Drift detection (data drift / concept drift)
Integrate automated retraining and feedback loops

3. Cloud & Platform Engineering

Oversee deployment and operations across Azure, AWS, GCP, and hybrid environments
Manage:
Kubernetes clusters (on-prem/AKS/EKS/GKE)
Serverless and container-based AI workloads
Drive:
Infrastructure-as-Code (IaC) adoption (Terraform, Bicep, CloudFormation)
Platform standardization and reusable components
Ensure scalable infrastructure for training (high compute) and inference (low latency)

4. GPU & High-Performance Compute Optimization

Manage and optimize GPU/TPU-based workloads
Ensure efficient:
Workload scheduling
Resource allocation and bin-packing
Optimize infrastructure for:
Distributed training (e.g., Horovod, DeepSpeed)
Cost-performance trade-offs
Monitor GPU utilization and improve efficiency metrics

5. Observability & Intelligent Monitoring

Implement end-to-end observability across:
Infrastructure (CPU, GPU, memory)
Platform services
AI models
Define metrics for:
Model drift, bias, latency, throughput
Deploy monitoring tools:
Prometheus, Grafana, ELK, Azure Monitor, Datadog
Enable predictive alerting and AIOps capabilities

6. Security, Compliance & Responsible AI

Ensure secure operation of AI systems:
Identity & access management (IAM/RBAC)
Data encryption (at rest & in transit)
Secure model endpoints
Enforce:
Data privacy regulations (GDPR, HIPAA, etc.)
Responsible AI policies (bias detection, explainability)
Maintain:
Audit trails for models and data
Governance frameworks for model lifecycle

7. FinOps & Cost Optimization

Drive cost efficiency for AI workloads:
GPU and compute optimization
Storage and data transfer optimization
Implement:
Autoscaling and workload scheduling strategies
Spot/preemptible usage
Build:
Cost dashboards and chargeback models
Align AI infrastructure spend with business outcomes

8. Service Delivery & Operations Management

Lead 24x7 operations support (if applicable)
Manage SLAs, OLAs, and KPIs
Implement ITIL-based processes:
Incident, problem, change, release management
Drive continuous service improvement initiatives

9. Team Leadership & Talent Development

Lead and mentor a team of:
MLOps engineers
Cloud/platform engineers
SREs / AI Ops specialists
Responsibilities include:
Workforce planning and hiring
Capability development and certifications
Performance management
Foster a culture of:
Automation-first mindset
Reliability engineering
DevOps practices

10. Stakeholder & Program Management

Partner with:
Data science and AI engineering teams
Enterprise architects
Security and governance teams
Translate business requirements into:
Scalable AI infrastructure solutions
Provide leadership updates on:
Platform health
Cost metrics
Operational KPIs

11. Continuous Improvement & Innovation

Introduce:
Self-healing infrastructure
Autonomous operations using AI (AIOps)
Evaluate new technologies:
LLMOps (vector DBs, prompt pipelines, inference optimization)
Edge AI and distributed inference
Improve platform maturity across:
Automation
Standardization
Reliability

Required Qualifications

Education

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field

Experience

12+ years in cloud/platform engineering or infrastructure operations
At least 3-5 years in AI/ML infrastructure or MLOps
Proven team management experience (Manager level)

Technical Skills

Cloud & Infrastructure

Azure, AWS, GCP (multi-cloud preferred)
Kubernetes, Docker
Infrastructure as Code (Terraform, ARM/Bicep, CloudFormation)

AI/ML & MLOps

Platforms: Azure ML, SageMaker, Vertex AI, MLflow, Kubeflow
Model lifecycle management and pipeline orchestration

Data & Processing

Apache Spark, Kafka, Airflow
Data pipelines and feature stores

Observability & Monitoring

Prometheus, Grafana, ELK stack, Datadog

Programming

Python, Bash, or other scripting languages

Leadership & Functional Skills

Strong people leadership and delivery management
Experience in SRE / DevOps transformations
Knowledge of ITIL-based service management
Strong stakeholder communication and executive reporting

Preferred Qualifications

Certifications:
Azure/AWS/GCP Architect
Certified Kubernetes Administrator (CKA)
AI/ML certifications (Azure ML, AWS ML Specialty)
Experience with:
Generative AI / LLMOps ecosystems
Vector databases (FAISS, Pinecone, etc.)
Responsible AI frameworks
