Job Title: Site Reliability Engineer
Location: Remote / India
Experience Level: Mid-Level (4–6 years)
Role Type: Contract (Fixed Term)
Working Hours: TBD
Position Overview
We are looking for a Mid-Level Site Reliability Engineer to join our global SRE team. This role goes beyond reactive incident response: you will handle incidents independently, contribute to reliability improvements, and help reduce operational toil across a fast-moving fintech infrastructure. You will work closely with global engineering teams to maintain system health and improve the resilience of our platforms.
Key Responsibilities
- On-Call & Incident Response: Independently handle incidents during scheduled shifts, coordinating communication with relevant teams and driving issues to resolution. Escalate complex or novel issues to senior engineers with clear context and initial findings.
- Incident Mitigation: Independently triage and resolve production alerts across AWS infrastructure and application layers. Apply sound judgment to distinguish transient issues from systemic failures and act accordingly.
- Monitoring & Alerting: Build and improve Datadog monitors and dashboards following team standards. Help reduce alert fatigue by identifying noisy alerts and proposing tuning improvements.
- Runbook & SOP Authorship: Create runbooks from scratch for new failure modes, update existing SOPs to reflect operational learnings, and contribute meaningfully to post-mortems with structured root cause analysis and action items.
- Reliability Initiatives: Proactively identify sources of toil and operational inefficiency. Propose and implement automation or process improvements that reduce manual intervention and improve system resilience.
- Root Cause Analysis: Lead RCA investigations for infrastructure and application-level failures in the AWS environment. Produce clear, action-oriented incident reports.
- Deployment Support: Monitor CI/CD pipelines during deployments, flag reliability risks, and initiate rollbacks following established procedures when stability is at risk.
Qualifications
- Experience: 4–6 years of hands-on experience in SRE, DevOps, or Platform Engineering roles.
- AWS Expertise: Deep working knowledge of Amazon ECS, IAM, VPC, ALB/NLB, RDS, S3, MSK, ElastiCache, Lambda, and CloudWatch, plus awareness of cost optimization practices.
- Infrastructure as Code: Proficiency with Terraform for managing and modifying AWS resources; comfortable reading and writing Terraform configurations independently.
- Observability: Proficiency building and maintaining Datadog monitors and dashboards. Familiarity with Grafana is a plus.
- Architecture Knowledge: Solid understanding of common architectural patterns (e.g., Microservices, Pub/Sub, Load Balancing) and their reliability implications.
- Linux & Scripting: Strong Linux command-line skills; comfortable writing Python automation scripts for operational tasks and tooling.
- CI/CD: Working knowledge of GitHub Actions or GitLab CI for ECS-based deployments.
- Communication: Excellent written and verbal English communication skills. Able to produce clear async handover notes, post-mortems, and status updates across global time zones.
- Ownership Mindset: Demonstrated track record of driving open issues to closure: verifying root causes rather than assuming that silence means an issue is resolved, and following up across teams until items are truly done.
- Structured Troubleshooting: Systematic, hypothesis-driven approach to diagnosing issues, forming a clear picture of what is known, what to check next, and why, rather than relying on trial and error.
- Proactive Teamwork: Posts incremental status updates during investigations, raises blockers early instead of going silent, and asks clarifying questions when expectations or feedback are unclear.
Preferred Skills
- Basic understanding of financial markets and market data (e.g., equities, options, market data feeds).
- SLO/SLI definition and error budget management experience.
- Database query skills and familiarity with RDS or Cassandra performance metrics.
- Experience reviewing runbooks or mentoring junior engineers on operational best practices.
- Self-directed ramp-up: comfortable learning unfamiliar systems through documentation and runbooks with minimal hand-holding.
- Strong documentation habit: proactively leaves high-quality handover notes and improves runbooks as part of daily work.
Pay: ₹40,000.00 - ₹60,000.00 per month
Benefits:
- Flexible schedule
- Leave encashment
- Provident Fund
- Work from home
Work Location: Remote