Role Overview
We are seeking an experienced Site Reliability Engineer (SRE) to join our platform support team. In this role, you will troubleshoot platform issues, resolve user queries, and provide deep technical support across our cloud-native infrastructure. You will work closely with engineering, product, and customer-facing teams to ensure platform reliability, performance, and an excellent user experience.
Key Responsibilities
Serve as a technical point of contact for user-reported issues on the platform, triaging, troubleshooting, and resolving queries within defined SLAs.
Investigate and debug issues across Kubernetes clusters, AWS infrastructure, and application services using logs, metrics, traces, and dashboards.
Diagnose root causes of incidents related to deployments, networking, storage, performance degradation, and service availability.
Collaborate with engineering teams to escalate, reproduce, and resolve complex bugs or infrastructure-level problems.
Build and improve runbooks, troubleshooting guides, and internal knowledge base articles to accelerate future resolutions.
Monitor platform health using observability tooling (Prometheus, Grafana, Datadog, ELK, OpenTelemetry, etc.) and proactively identify potential risks.
Participate in incident response, postmortems, and continuous improvement initiatives aligned with SRE best practices.
Provide clear, empathetic, and timely communication to users, translating technical findings into actionable guidance.
What You'll Bring
A pragmatic, user-first mindset; comfort with ambiguity; and a genuine passion for keeping systems reliable and users unblocked. You enjoy turning messy incidents into clear narratives and durable fixes.
Skills
3+ years of hands-on experience in an SRE, DevOps, Platform Engineering, or Production Support role.
Strong working knowledge of Kubernetes: workloads, networking (services, ingress, CNI), RBAC, and Helm, along with kubectl proficiency and experience troubleshooting pod- and node-level issues.
Solid experience with AWS services: EC2, EKS, IAM, VPC, S3, CloudWatch, Route 53, ELB/ALB, and Auto Scaling.
Deep understanding of SRE principles: SLIs/SLOs/SLAs, error budgets, toil reduction, blameless postmortems, and reliability engineering practices.
Proficiency in observability tooling and concepts: metrics, logs, distributed tracing, alerting, and dashboarding (Prometheus, Grafana, Datadog, New Relic, ELK, Splunk, or similar).
Strong Linux fundamentals and scripting ability (Bash, Python, or Go).
Familiarity with CI/CD pipelines, Infrastructure as Code (Terraform, CloudFormation), and Git-based workflows.
An excellent debugging mindset and the ability to reason about failure modes in distributed systems.
Strong written and verbal communication skills, with the ability to interact with both technical and non-technical users.
Preferred Qualifications
Experience supporting customer-facing or developer-facing platforms (PaaS, internal developer platforms, SaaS).
Exposure to service meshes (Istio, Linkerd), API gateways, or event-driven architectures (Kafka, SQS).
Familiarity with chaos engineering, load testing, or performance tuning.
Relevant certifications: CKA, CKAD, AWS Solutions Architect / SysOps Administrator, or equivalent.
Prior experience working in a 24x7 on-call or follow-the-sun support model.
Pay: ₹800,000.00 - ₹1,000,000.00 per year
Application Question(s):
- What is your current CTC?
- What is your notice period?
Work Location: Hybrid remote in Noida, Uttar Pradesh