Site Reliability Engineer (SRE)

Zversal Pvt Ltd
Mohali, Punjab

Job details

Full-time
₹40,000 - ₹60,000 a month
4 hours ago

Benefits

Provident Fund
Leave encashment
Work from home
Flexible schedule

Qualifications

CI/CD
Cassandra
Load balancing
DevOps
Git
English
AWS
Terraform
Continuous integration
Mentoring
GitHub
S3
Budgeting
Linux
RDS database
GitLab
Communication skills
Python
Identity & access management

Full job description

Job Title: Site Reliability Engineer

Location: Remote / India

Experience Level: Mid-Level (4–6 years)

Role Type: Contract (Fixed Term)

Working Hours: TBD

Position Overview

We are looking for a Mid-Level Site Reliability Engineer to join our global SRE team. This role goes beyond reactive incident response: you will handle incidents independently, contribute to reliability improvements, and help reduce operational toil across a fast-moving fintech infrastructure. You will work closely with global engineering teams to maintain system health and improve the resilience of our platforms.

Key Responsibilities

On-Call & Incident Response: Independently handle incidents during scheduled shifts, coordinating communication with relevant teams and driving issues to resolution. Escalate complex or novel issues to senior engineers with clear context and initial findings.
Incident Mitigation: Independently triage and resolve production alerts across AWS infrastructure and application layers. Apply sound judgment to distinguish transient issues from systemic failures and act accordingly.
Monitoring & Alerting: Build and improve Datadog monitors and dashboards following team standards. Help reduce alert fatigue by identifying noisy alerts and proposing tuning improvements.
Runbook & SOP Authorship: Create runbooks from scratch for new failure modes, update existing SOPs to reflect operational learnings, and contribute meaningfully to post-mortems with structured root cause analysis and action items.
Reliability Initiatives: Proactively identify sources of toil and operational inefficiency. Propose and implement automation or process improvements that reduce manual intervention and improve system resilience.
Root Cause Analysis: Lead RCA investigations for infrastructure and application-level failures in the AWS environment. Produce clear, action-oriented incident reports.
Deployment Support: Monitor CI/CD pipelines during deployments, flag reliability risks, and initiate rollbacks following established procedures when stability is at risk.

Qualifications

Experience: 4–6 years of hands-on experience in SRE, DevOps, or Platform Engineering roles.
AWS Expertise: Deep working knowledge of Amazon ECS, IAM, VPC, ALB/NLB, RDS, S3, MSK, ElastiCache, Lambda, and CloudWatch, plus an awareness of cost optimization practices.
Infrastructure as Code: Proficiency with Terraform for managing and modifying AWS resources; comfortable reading and writing Terraform configurations independently.
Observability: Proficiency building and maintaining Datadog monitors and dashboards. Familiarity with Grafana is a plus.
Architecture Knowledge: Solid understanding of common architectural patterns (e.g., Microservices, Pub/Sub, Load Balancing) and their reliability implications.
Linux & Scripting: Strong Linux command-line skills; comfortable writing Python automation scripts for operational tasks and tooling.
CI/CD: Working knowledge of GitHub Actions or GitLab CI for ECS-based deployments.
Communication: Excellent written and verbal English communication skills. Able to produce clear async handover notes, post-mortems, and status updates across global time zones.
Ownership Mindset: Demonstrated track record of driving open issues to closure: verifying root causes rather than assuming that silence means an issue is resolved, and following up across teams until items are truly done.
Structured Troubleshooting: Systematic, hypothesis-driven approach to diagnosing issues, forming a clear picture of what is known, what to check next, and why, rather than relying on trial and error.
Proactive Teamwork: Posts incremental status updates during investigations, raises blockers early instead of going silent, and asks clarifying questions when expectations or feedback are unclear.

Preferred Skills

Basic understanding of financial markets and market data (e.g., equities, options, market data feeds).
SLO/SLI definition and error budget management experience.
Database query skills and familiarity with RDS or Cassandra performance metrics.
Experience reviewing runbooks or mentoring junior engineers on operational best practices.
Self-directed ramp-up: comfortable learning unfamiliar systems through documentation and runbooks with minimal hand-holding.
Strong documentation habit: proactively leaves high-quality handover notes and improves runbooks as part of daily work.

Pay: ₹40,000.00 - ₹60,000.00 per month

Work Location: In person
