Site Reliability Engineer (SRE) – 8+ Years Experience
Location: Bangalore, India
Work Schedule: 7:30 AM – 5:00 PM
Immediate Joiners Preferred
About the Role
We are looking for a highly skilled Site Reliability Engineer (SRE) with 8+ years of experience to enhance the reliability, scalability, performance, and operational excellence of our mission-critical platforms. The ideal candidate will have strong expertise in cloud infrastructure, automation, observability, incident management, and platform engineering.
This role offers the opportunity to work on large-scale production environments, drive automation initiatives, and partner closely with engineering teams to build resilient and highly available systems.
Key Responsibilities
- Define, measure, and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to improve system reliability.
- Lead production incident response, troubleshooting, root cause analysis, and post-incident reviews.
- Design and implement automation solutions to reduce operational overhead and improve engineering efficiency.
- Build and enhance observability frameworks, including monitoring, logging, alerting, and distributed tracing.
- Drive performance tuning, capacity planning, resilience engineering, and reliability initiatives.
- Implement Infrastructure as Code (IaC) and deployment automation using modern DevOps practices.
- Collaborate with development and platform teams to improve production readiness, scalability, and availability.
- Support and optimize cloud-native architectures and containerized environments.
Required Qualifications
- 8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or related roles.
- Strong programming and scripting skills in Python and/or Go.
- Advanced Linux administration and troubleshooting expertise.
- Hands-on experience with Kubernetes and Docker.
- Strong knowledge of Terraform and Infrastructure as Code (IaC) practices.
- Experience working with cloud platforms such as AWS, Azure, or GCP.
- Expertise in observability tools, including Prometheus, Grafana, the ELK Stack, and OpenTelemetry.
- Experience designing and maintaining CI/CD pipelines and automation frameworks.
- Strong background in incident management, root cause analysis, and production support.
Preferred Qualifications
- Kubernetes and/or cloud platform certifications.
- Experience with chaos engineering and resilience testing.
- Platform Engineering experience.
- Experience supporting large-scale, high-availability production environments.
What We're Looking For
- Strong analytical and problem-solving skills.
- Experience managing critical production systems and participating in on-call rotations.
- Excellent communication and stakeholder management abilities.
- Passion for automation, reliability, operational excellence, and continuous improvement.
Pay: ₹70,000.00 – ₹90,000.00 per month
Experience:
- Site Reliability Engineer (SRE): 8 years (Required)
Work Location: Remote