Customer
The Customer Team empowers organizations to build deeper relationships with customers through innovative strategies, advanced analytics, GenAI, transformative technologies, and creative design. We enable Deloitte client service teams to enhance customer experience and drive sustained growth and customer value creation and capture, through customer and commercial strategies, digital product and innovation, marketing, commerce, sales, and service. We are a team of strategists, data scientists, operators, creatives, designers, engineers, and architects, balancing business strategy, technology, creativity, and ongoing managed services to help solve the biggest problems that impact customers, partners, constituents, and the workforce. We also offer Business Process as a Service, enabling organizations to streamline operations and achieve greater efficiency through scalable, technology-enabled managed insights that guide ongoing transformation and operational excellence.
Position Summary
Level: Consultant Managed Service or equivalent
The Site Reliability Engineer (SRE) improves availability, latency, performance, efficiency, change safety, and resilience of production services on Microsoft Azure and Google Cloud Platform (GCP). The SRE defines and runs SLIs/SLOs/error budgets, builds reliability automation, leads incident response and blameless postmortems, and strengthens systems through resilience engineering and chaos experiments. The role uses industry observability platforms (e.g., Dynatrace, Splunk, Datadog) in addition to native cloud tooling to measure and improve customer outcomes.
Work you’ll do:
SLO/SLI & Error Budget Management
- Define user-journey SLIs and measurable SLOs for critical services; translate reliability goals into engineering actions.
- Operationalize error budgets to guide release risk decisions and reliability investment.
- Run regular SLO reviews, publish reliability scorecards, and maintain service reliability roadmaps.
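As an illustration of how an error budget can guide a release decision, here is a minimal sketch. The `ErrorBudget` class, the `release_allowed` helper, and the 10% freeze threshold are hypothetical names and values chosen for this example; they are not a prescribed policy or part of any specific tool used in the role.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative only)."""
    slo_target: float      # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the budget still unspent; negative means the SLO is breached."""
        allowed = self.allowed_failures
        if allowed == 0:
            # A 100% target (or an empty window) leaves no budget to spend.
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / allowed


def release_allowed(budget: ErrorBudget, freeze_threshold: float = 0.1) -> bool:
    # Assumed policy: pause risky releases once less than 10% of the budget is left.
    return budget.budget_remaining > freeze_threshold
```

For example, with a 99.9% target over 1,000,000 requests and 400 failures, about 60% of the budget remains and a release would proceed under this assumed policy; at 950 failures it would be paused.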
Observability Engineering (Dynatrace / Splunk / Datadog + Cloud-Native)
- Design end-to-end observability across metrics, logs, traces, synthetics, and RUM (where applicable) mapped to SLIs.
- Implement and govern telemetry standards (e.g., trace/metric conventions) and ensure coverage for critical paths.
- Build actionable alerting (symptom-based), reduce noise, and improve on-call signal quality.
- Create dashboards and investigations that connect platform signals to customer impact and SLO compliance.
Tools (examples):
- Dynatrace: APM, distributed tracing, service flow, anomaly detection, SLO dashboards.
- Splunk: log analytics, SIEM-adjacent investigations (when needed for prod incidents), correlation searches, alert tuning.
- Datadog: APM, infra monitoring, logs, synthetics, SLO management, incident workflows.
- Cloud-native: Azure Monitor / Log Analytics / Application Insights, GCP Cloud Monitoring / Logging / Trace.
Incident Response, On-Call, and Postmortems
- Participate in on-call rotation; lead incident command for high-severity events.
- Drive rapid mitigation (rollback/roll-forward), stakeholder comms, and stable recovery.
- Facilitate blameless postmortems, identify systemic causes, and ensure corrective actions are implemented and verified.
Resilience, Capacity, Performance
- Engineer reliability patterns: timeouts, retries (with jitter), circuit breakers, bulkheads, load shedding, graceful degradation.
- Perform capacity planning, load testing, scaling strategy validation, and performance tuning aligned to SLOs.
- Plan and test DR: define RTO/RPO, conduct failover tests and recovery drills.
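To make one of the reliability patterns above concrete, the sketch below shows retries with capped exponential backoff and full jitter. The function name `call_with_retries`, its default values, and the choice of retryable exceptions are assumptions made for illustration, not a standard used by the team.

```python
import random
import time


def call_with_retries(fn, attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep,
                      retry_on=(TimeoutError, ConnectionError)):
    """Call fn(), retrying transient failures with full-jitter backoff.

    Hypothetical helper: the wait before retry n is drawn uniformly from
    [0, min(cap, base * 2**n)] seconds, so clients that fail together do
    not all retry at the same instant.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the last error to the caller
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The `sleep` parameter is injectable so the behavior can be exercised in tests without real waiting. In practice a retry like this is usually paired with a per-call timeout and a circuit breaker so that retries do not amplify load on a dependency that is already struggling.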
Chaos Engineering
- Design and run chaos experiments to validate resilience assumptions and reduce unknown failure modes.
- Define hypotheses tied to SLOs (e.g., “regional dependency failure should degrade gracefully without breaching availability SLO”).
- Implement controlled fault injection: dependency outages, latency/packet loss, CPU/memory pressure, pod/node termination, zonal failure simulations.
- Establish safety guardrails: blast-radius limits, approvals, monitoring/abort conditions, and learning-focused postmortems.
- Integrate game days into reliability programs and track reliability improvements from findings.
Toil Reduction & Reliability Automation
- Identify and reduce toil via automation (auto-remediation, safe diagnostics, runbooks-as-code).
- Build self-service operational tooling to improve mean time to detect/restore and reduce manual intervention.
- Own/drive production readiness reviews and reliability acceptance criteria for new services.
Cloud Scope (Azure + GCP)
Azure (examples)
AKS, App Service, Functions, VM Scale Sets, Azure SQL/Cosmos DB, Event Hubs/Service Bus; resilience via Availability Zones, regional strategies, traffic management; telemetry via Azure Monitor / App Insights.
GCP (examples)
GKE, Cloud Run, Compute Engine, Cloud SQL/Spanner, Pub/Sub; resilience via multi-zone/region strategies and traffic management; telemetry via Cloud Monitoring/Logging/Trace.
Cross-Cloud
Standardize SLOs, incident practices, and observability conventions across Azure and GCP; manage reliability of shared dependencies (identity, DNS, certificates, third parties).
The team:
Our Digital Foundry Operate & Innovations (DFO&I) team partners with organizations to rapidly design, build, and scale digital products and experiences that drive business growth and elevate customer engagement. As a multidisciplinary group of strategists, designers, engineers, and operations specialists, we deliver end-to-end solutions—from initial concept and agile development to ongoing digital operations—enabling clients to experiment, iterate, and scale digital initiatives with confidence and agility. We support clients across domains such as strategy, commerce, marketing, sales, and service, helping them realize their digital ambitions through flexible, scalable teams. Our expertise spans the full digital lifecycle, including customer research, experience design, platform development, content production, and marketing automation. By bridging the gap between strategy and execution, we empower organizations to achieve measurable outcomes and deliver exceptional customer experiences in an ever-evolving digital landscape.
Qualifications
Must Have Skills/Project Experience/Certifications:
- 3 to 6 years in SRE / Production Reliability for distributed, customer-facing systems.
- Hands-on experience defining and operating SLIs/SLOs/error budgets.
- Experience with Azure and GCP production workloads (especially AKS/GKE and managed services).
- Strong incident response leadership and postmortem discipline.
- Proficiency in at least one engineering language (Go/Python/Java/C#) for automation and tooling.
- Practical experience with at least one enterprise observability platform: Dynatrace, Splunk, and/or Datadog.
Preferred Skills:
- Kubernetes reliability engineering (autoscaling behavior, upgrades, networking, workload resiliency).
- OpenTelemetry-based instrumentation and tracing practices.
- Chaos engineering experience (game days, fault injection, experiment design and safety controls).
- Regulated environment operations (auditability/change controls) while preserving SRE principles.
Education:
- BE/B.Tech/M.C.A./M.Sc (CS) degree or equivalent from an accredited university
Location:
- Bengaluru/Hyderabad/Pune/Chennai