Site Reliability Engineer (SRE) – 8+ Years of Experience
Location: Remote
Employment Type: Contract / Full-Time
Immediate Joiners Preferred
About the Role
We are looking for a highly skilled Site Reliability Engineer (SRE) with 8+ years of experience to enhance the reliability, scalability, performance, and operational excellence of our mission-critical platforms. The ideal candidate will have strong expertise in cloud infrastructure, automation, observability, incident management, and platform engineering.
This role offers the opportunity to work on large-scale production environments, drive automation initiatives, and partner closely with engineering teams to build resilient and highly available systems.
Key Responsibilities
Define, measure, and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to improve system reliability.
Lead production incident response, troubleshooting, root cause analysis, and post-incident reviews.
Design and implement automation solutions to reduce operational overhead and improve engineering efficiency.
Build and enhance observability frameworks, including monitoring, logging, alerting, and distributed tracing.
Drive performance tuning, capacity planning, resilience engineering, and reliability initiatives.
Implement Infrastructure as Code (IaC) and deployment automation using modern DevOps practices.
Collaborate with development and platform teams to improve production readiness, scalability, and availability.
Support and optimize cloud-native architectures and containerized environments.
Required Qualifications
8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or related roles.
Strong programming and scripting skills in Python and/or Go.
Advanced Linux administration and troubleshooting expertise.
Hands-on experience with Kubernetes and Docker.
Strong knowledge of Terraform and Infrastructure as Code (IaC) practices.
Experience working with cloud platforms such as AWS, Azure, or GCP.
Expertise in observability tools including Prometheus, Grafana, ELK Stack, and OpenTelemetry.
Experience designing and maintaining CI/CD pipelines and automation frameworks.
Strong background in incident management, root cause analysis, and production support.
Preferred Qualifications
Kubernetes and/or cloud platform certifications.
Experience with Chaos Engineering and resilience testing.
Platform Engineering experience.
Experience supporting large-scale, high-availability production environments.
What We're Looking For
Strong analytical and problem-solving skills.
Experience managing critical production systems and participating in on-call rotations.
Excellent communication and stakeholder management abilities.
Passion for automation, reliability, operational excellence, and continuous improvement.
Experience:
Site Reliability Engineer (SRE): 8 years (Required)
Work Location: Remote
Pay: ₹70,000.00 - ₹95,000.00 per month