Responsibilities
- Own the reliability, availability, and performance of Sephora's microservices and
production workloads
- Design and improve resilient infrastructure on GCP, with strong emphasis on Cloud Run,
Kubernetes, and containerized services
- Build and maintain observability across logs, metrics, tracing, alerting, and service health
so issues are detected early and resolved quickly
- Improve deployment safety through stronger CI/CD pipelines, release controls, rollback
strategies, and environment consistency
- Lead incident response and production readiness practices, including runbooks,
postmortems, on-call hygiene, capacity planning, and resilience testing
- Reduce operational toil by automating repetitive work and improving tooling for
engineers supporting distributed services
- Partner with development teams to improve the operability, scalability, and fault
tolerance of microservices early in the design lifecycle
- Strengthen cloud security and infrastructure hygiene across IAM, secrets management,
workload hardening, and production safeguards
- Improve service performance, resource efficiency, and cloud cost management without
compromising reliability
- Support architecture and reliability reviews for critical services and high-traffic business
events

Qualifications
- 5+ years of experience in Site Reliability Engineering or closely related DevOps roles with
meaningful production ownership
- Strong experience running production systems on Google Cloud Platform
- Hands-on experience with Cloud Run, Kubernetes, and container-based microservices in
production
- Strong experience with infrastructure as code, particularly Terraform and Terragrunt
- Strong understanding of observability using tools such as OpenTelemetry, Cloud
Monitoring, New Relic, or equivalent systems
- Strong understanding of distributed systems, microservice failure modes, reliability
engineering, and production debugging
- Experience building or improving CI/CD pipelines and release workflows in modern
engineering environments, including GitHub Actions
- Ability to write code and automation in one or more languages such as Python or Java
- Good judgment during incidents and a practical mindset around reliability, recovery, and
risk tradeoffs
- Strong written and verbal communication skills, with the ability to work effectively
across engineering teams
- Experience working with AI tooling and agentic workflows in engineering or operational
environments
- Experience in retail, e-commerce, or other customer-facing environments is a plus

Pay: ₹645,584.61 - ₹2,103,249.11 per year
Experience:
- DevOps: 6 years (Required)
- GCP: 5 years (Required)
Work Location: Remote