Site Reliability Engineer (SRE)
Location: Bangalore
Experience: 8+ Years
Work Mode: Hybrid
Work Schedule: 7:30 AM – 5:00 PM
Joining Preference: Immediate Joiners Preferred
About the Role:
We are seeking an experienced Site Reliability Engineer (SRE) to drive reliability, scalability, and operational excellence across critical production platforms. The ideal candidate will have strong expertise in cloud infrastructure, Kubernetes, observability, automation, and incident management, with a focus on building highly available and resilient systems.
Key Responsibilities:
Service Reliability
- Define and manage SLIs, SLOs, and Error Budgets.
- Monitor platform health and proactively address reliability risks.
- Improve service availability, scalability, and performance.
Incident Management
- Participate in and lead on-call support rotations.
- Manage production incidents, including troubleshooting and service recovery.
- Conduct root cause analysis and postmortem reviews.
- Drive improvements in MTTD and MTTR metrics.
Automation & Infrastructure
- Automate deployments, scaling, remediation, and operational tasks.
- Implement Infrastructure as Code (IaC) practices.
- Reduce manual operational effort through scripting and automation.
Observability & Monitoring
- Build and maintain monitoring, logging, and distributed tracing solutions.
- Support rapid issue diagnosis and performance analysis.
- Enable proactive capacity planning and system optimization.
Performance & Resilience
- Conduct load testing and capacity planning.
- Implement failover mechanisms, canary deployments, and resilience strategies.
- Support reliability-focused platform improvements and testing initiatives.
Required Skills:
- Strong programming or scripting experience in Python, Go, or similar languages.
- Advanced Linux administration and troubleshooting skills.
- Strong understanding of networking concepts and distributed systems.
- Hands-on experience with:
  - Kubernetes
  - Docker
  - Terraform
  - CI/CD Pipelines
- Experience with cloud platforms:
  - AWS
  - Azure
  - GCP
- Expertise in observability tools such as:
  - Prometheus
  - Grafana
  - ELK Stack
  - OpenTelemetry
Preferred Qualifications:
- Experience implementing SLI/SLO/Error Budget frameworks.
- Cloud certifications (AWS, Azure, or GCP).
- Kubernetes or DevOps certifications.
- Experience with Chaos Engineering and Resilience Testing.
- Background in Platform Engineering, Production Operations, or Systems Engineering.
What We're Looking For:
- Strong problem-solving and troubleshooting skills.
- Experience leading production incident response.
- Excellent communication and stakeholder management abilities.
- Ability to work effectively in high-pressure production environments.
- Passion for automation, reliability, and operational excellence.
Pay: ₹80,000.00 - ₹100,000.00 per month
Work Location: Hybrid remote in Bengaluru, Karnataka (Bengaluru Urban District)