Site Reliability Engineer (C2H)

Talent Vision Services
India

Job details

Contractual / Temporary
₹24,00,000 a year

Qualifications

CI/CD
Elasticsearch
Cloud infrastructure
Azure
Kubernetes
MongoDB
AWS
Terraform
Continuous integration
New Relic
RDS database
MySQL
Identity & access management

Full job description

Cloud Managed Services

Datavail’s business focuses on helping you use your data to drive business results through cost-saving services. The success of your business depends on how well you understand and manage your data. Our managed cloud services give you the power to unleash your organization’s potential. We provide comprehensive and technically advanced support for Cloud Operation to ensure that your infrastructure is safe, secure, and managed with the utmost level of care.

Our delivery performance in data management leads the industry. We offer highly trained Cloud administrators via a 24×7, always on, always available, global delivery model.

The combination of a proven delivery model and top-notch experience ensures that Datavail remains the on-demand Cloud experts you need. Datavail’s flexible, client-focused services always add value to your organization.

Job Description

Job Title: Senior Site Reliability Engineer (SRE)

We are seeking a Senior Site Reliability Engineer (SRE) to support the customer's AWS/Azure platform modernization and reliability initiatives. This role focuses on migrating legacy worker processes to Kubernetes, strengthening Infrastructure as Code (IaC) and CI/CD pipelines, and driving strong observability and operational excellence.
The SRE will work closely with the customer's engineering teams to embed reliability, automation, and monitoring into the platform while ensuring high availability, scalability, and predictable deployments.

Key Responsibilities:

Kubernetes & Platform Modernization:

Lead the containerization and migration of existing worker processes to Kubernetes.
Design Kubernetes-native deployment patterns including health checks, autoscaling, and failure recovery.
Define resource requests/limits, rollout strategies, and operational standards for workloads.

Reliability Engineering & SRE Practices:

Define, implement, and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical services.
Continuously monitor SLO compliance and drive improvements based on error budget usage.
Participate in architecture reviews focused on high availability, scalability, and fault tolerance.
Apply resilience patterns such as retries, circuit breakers, rate limiting, and graceful degradation.

Incident, Problem & Change Management:

Act as a Tier-3 escalation point for production and deployment issues.
Lead incident response, blameless postmortems, and Root Cause Analysis (RCA).
Maintain and improve runbooks, escalation paths, and on-call readiness.
Track and improve key metrics such as MTTR, deployment success rate, and incident frequency.

Automation & Infrastructure as Code:

Develop and maintain Infrastructure as Code using Terraform, CloudFormation, and AWS CDK.
Build and enhance CI/CD pipelines supporting rolling, blue/green, and canary deployments.
Automate Dev-to-Staging redeployments with validation, rollback, and promotion mechanisms.
Reduce operational toil through automation and self-healing workflows.

Monitoring, Observability & Logging (SRE Tools Focus):

Design and operate end-to-end observability covering metrics, logs, and traces.

Hands-on experience with:
- New Relic / Datadog for APM, distributed tracing, and SLO tracking
- Prometheus for metrics collection
- Grafana for dashboards and SRE scorecards
- Graylog / ELK for centralized logging and root cause analysis

Ensure alerts are SLO-driven, actionable, and noise-free.
Build customer-facing dashboards to demonstrate reliability and deployment health.

Cloud Infrastructure & Platform Reliability:

Provision and operate cloud infrastructure primarily on AWS.
Manage compute, networking, load balancers, IAM, backups, patching, and DR readiness.
Optimize performance and cost through autoscaling, rightsizing, and capacity planning.
Support reliability of data platforms such as MongoDB, Elasticsearch/OpenSearch, MySQL (RDS), and DocumentDB.
