Job Title: Associate Director – SRE & Observability Engineer (AI Infrastructure)
Role Overview
We are seeking a seasoned Site Reliability Engineering (SRE) and Observability leader to design, build, and scale reliability frameworks for AI/GenAI platforms and data-intensive workloads.
This role will focus on ensuring high availability, performance, scalability, and cost-efficiency across AI infrastructure (LLMs, model training/inference, vector databases, pipelines) by embedding SRE principles, observability, and automation into the platform lifecycle.
Key Responsibilities
1. SRE Strategy for AI Infrastructure
- Define and lead SRE strategy and operating model for AI platforms across cloud (Azure, AWS, GCP) and hybrid environments
- Establish SLIs, SLOs, and SLAs tailored to:
  - LLM inference latency and throughput
  - Model training performance and job success rates
  - Pipeline reliability (RAG, orchestration frameworks, agents)
- Drive adoption of error budgets and reliability engineering practices across AI and platform teams
2. Observability Architecture for AI Workloads
- Design and implement end-to-end observability frameworks for AI systems, including:
  - Metrics (latency, throughput, GPU utilization, token usage)
  - Logs (model behavior, system failures, prompt traces)
  - Traces (distributed AI workflows, API calls, orchestration flows)
- Build observability for:
  - LLM pipelines and agent-based systems
  - Vector databases and retrieval layers
  - Data ingestion and feature pipelines
- Enable deep visibility into model performance, drift, and degradation
3. Reliability Engineering & Automation
- Implement self-healing systems, auto-remediation, and resiliency patterns
- Design fault tolerance strategies:
  - Multi-region deployment
  - Model fallback and routing strategies
  - Graceful degradation in GenAI systems
- Lead adoption of:
  - Chaos engineering for AI workloads
  - Canary deployments and A/B testing for models
- Drive automation-first SRE practices using IaC and policy-as-code
4. AI System Performance Optimization
- Optimize:
  - Inference latency and throughput
  - GPU/accelerator utilization
  - Distributed training efficiency
- Work with engineering teams to:
  - Fine-tune model serving infrastructure
  - Implement caching, batching, and async processing
- Drive performance benchmarking frameworks for AI workloads
5. Incident Management & Reliability Operations
- Establish incident response frameworks tailored for AI platforms
- Lead root cause analysis (RCA) for:
  - Model failures
  - Pipeline breakdowns
  - Infrastructure bottlenecks
- Define and track MTTR, MTBF, availability, and reliability KPIs
- Build runbooks, playbooks, and operational dashboards
6. Tooling & Platform Enablement
- Implement and manage observability and SRE tooling such as:
  - Monitoring: Prometheus, Grafana, Datadog, Azure Monitor, CloudWatch
  - Logging & tracing: ELK stack, OpenTelemetry, Jaeger
  - AI observability: Langfuse, Weights & Biases, Arize, WhyLabs (preferred)
- Develop custom telemetry pipelines for AI-specific metrics (token usage, prompt traces, response quality signals)
- Integrate observability into CI/CD and MLOps pipelines
7. Governance & Risk Management
- Define reliability guardrails and governance policies
- Ensure compliance, security, and availability requirements for AI systems
- Implement controls for:
  - Model drift and degradation detection
  - Data pipeline integrity
  - Responsible AI monitoring
8. Stakeholder Leadership & Advisory
- Act as a trusted advisor to:
  - Platform engineering
  - Data science teams
  - Enterprise architecture and leadership
- Translate reliability metrics into business impact (customer experience, revenue risk)
- Drive enterprise adoption of SRE practices for AI
9. Thought Leadership & Innovation
- Develop POVs, frameworks, and accelerators for:
  - AI SRE maturity models
  - Observability patterns for GenAI
- Stay ahead of trends in:
  - AI reliability engineering
  - Observability tooling and standards
- Lead internal capability building and external client workshops
Required Qualifications
Experience
- 12–15+ years of experience in:
  - Site Reliability Engineering / DevOps / Platform Engineering
  - Cloud infrastructure and distributed systems
- 4–6+ years working with AI/ML platforms, MLOps, or data-intensive systems
- Proven experience in designing high-scale, highly reliable systems
Core Skills
- Deep expertise in:
  - SRE principles (SLI/SLO, error budgets, incident management)
  - Observability (metrics, logs, tracing)
  - Distributed system design and failure modes
- Strong understanding of:
  - AI/ML workloads (training, inference, pipelines)
  - LLM architectures and GenAI systems
Technical Skills
- Cloud Platforms: Azure, AWS, GCP
- Infrastructure: Kubernetes, containers, serverless architectures
- Observability stack: OpenTelemetry, Prometheus, Grafana, ELK
- Programming / scripting: Python, Go, or similar
- CI/CD & IaC: Terraform, ARM, CloudFormation, GitOps
Leadership & Consulting Skills
- Executive communication and stakeholder management
- Ability to lead cross-functional, global teams
- Strong problem-solving and analytical mindset
- Experience in client-facing advisory and transformation programs
Preferred Qualifications
- Certifications:
  - Kubernetes (CKA/CKAD)
  - Cloud Architect (Azure/AWS/GCP)
- Exposure to:
  - AI observability platforms (Arize, WhyLabs, Langfuse, etc.)
  - FinOps alignment for AI workloads
- Experience with:
  - Multi-cloud and hybrid deployment strategies