Job Title: Senior Site Reliability Engineer (SRE)
Job Description:
We are seeking a Senior Site Reliability Engineer (SRE) to support the Customer's AWS/Azure platform modernization and reliability initiatives. This role focuses on migrating legacy worker processes to Kubernetes, strengthening Infrastructure as Code (IaC) and CI/CD pipelines, and driving strong observability and operational excellence.
The SRE will work closely with Customer engineering teams to embed reliability, automation, and monitoring into the platform while ensuring high availability, scalability, and predictable deployments.
Key Responsibilities:
Kubernetes & Platform Modernization:
- Lead the containerization and migration of existing worker processes to Kubernetes.
- Design Kubernetes-native deployment patterns including health checks, autoscaling, and failure recovery.
- Define resource requests/limits, rollout strategies, and operational standards for workloads.
Reliability Engineering & SRE Practices:
- Define, implement, and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical services.
- Continuously monitor SLO compliance and drive improvements based on error budget usage.
- Participate in architecture reviews focused on high availability, scalability, and fault tolerance.
- Apply resilience patterns such as retries, circuit breakers, rate limiting, and graceful degradation.
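As an illustration of the SLO and error budget work described above, the arithmetic can be sketched as follows. This is a minimal sketch, assuming a request-based availability SLI; the function name `error_budget_report` and the sample figures are hypothetical and not taken from any Customer system.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize error budget usage for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: fraction of requests that succeeded in the window.
    sli = 1 - failed_requests / total_requests
    # Error budget: number of failures the SLO tolerates in the window.
    allowed_failures = (1 - slo_target) * total_requests
    # Fraction of the budget already spent (above 1.0 means overspent).
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "slo_met": sli >= slo_target,
    }


# Example: a 99.9% SLO over 1,000,000 requests with 250 failures.
report = error_budget_report(0.999, 1_000_000, 250)
```

In this example the SLO tolerates roughly 1,000 failed requests, so 250 failures spend about a quarter of the budget. A team might treat a high `budget_consumed` value as a signal to slow releases and prioritize reliability work.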
Incident, Problem & Change Management:
- Act as a Tier-3 escalation point for production and deployment issues.
- Lead incident response, blameless postmortems, and Root Cause Analysis (RCA).
- Maintain and improve runbooks, escalation paths, and on-call readiness.
- Track and improve key metrics such as Mean Time to Recovery (MTTR), deployment success rate, and incident frequency.
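The metrics named above could be computed along these lines. This is a hedged sketch only: it assumes incidents are recorded as (detected, resolved) timestamp pairs and that recovery time is measured from detection to resolution; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average time from detection to resolution.

    incidents: list of (detected_at, resolved_at) datetime pairs.
    Returns a timedelta, or None when there are no incidents.
    """
    if not incidents:
        return None
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


def deployment_success_rate(succeeded, total):
    """Fraction of deployments that completed without rollback or failure."""
    if total == 0:
        return None
    return succeeded / total


# Hypothetical sample data: two incidents lasting 30 and 90 minutes.
sample_incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 15, 30)),
]
mttr = mean_time_to_recovery(sample_incidents)
```

For the sample data the MTTR is one hour. Definitions of MTTR vary between teams (for example, measuring from incident start rather than detection), so the chosen definition should be agreed on before tracking trends.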
Automation & Infrastructure as Code:
- Develop and maintain Infrastructure as Code using Terraform, CloudFormation, and AWS CDK.
- Build and enhance CI/CD pipelines supporting rolling, blue/green, and canary deployments.
- Automate Dev-to-Staging redeployments with validation, rollback, and promotion mechanisms.
- Reduce operational toil through automation and self-healing workflows.
Monitoring, Observability & Logging (SRE Tools Focus):
- Design and operate end-to-end observability covering metrics, logs, and traces.
- Hands-on experience with:
  o New Relic / Datadog for APM, distributed tracing, and SLO tracking
  o Prometheus for metrics collection
  o Grafana for dashboards and SRE scorecards
  o Graylog / ELK for centralized logging and root cause analysis
- Ensure alerts are SLO-driven, actionable, and noise-free.
- Build customer-facing dashboards to demonstrate reliability and deployment health.
Cloud Infrastructure & Platform Reliability:
- Provision and operate cloud infrastructure primarily on AWS.
- Manage compute, networking, load balancers, IAM, backups, patching, and DR readiness.
- Optimize performance and cost through autoscaling, rightsizing, and capacity planning.
- Support reliability of data platforms such as MongoDB, Elasticsearch/OpenSearch, MySQL (RDS), and DocumentDB.