Site Reliability Engineer (SRE) – Platform Team (AWS + Hybrid)
Purpose of Position
We’re looking for a Site Reliability Engineer to join our Platform team and own the reliability of the systems that keep our company trading — from our AWS-based digital platform to the connectivity and point-of-sale systems in every store. You’ll treat reliability as a product: setting SLOs with engineering and product leaders, managing error budgets, and making the call between shipping faster and holding the line.
This is a hands-on role that spans cloud and ground. One hour you’re tuning a Prometheus federation or an OpenTelemetry pipeline; the next you’re troubleshooting a store’s connectivity at peak service. You’ll lead P0/P1 incident response, run blameless post-mortems, and build the observability and on-call practices that let a distributed, multi-site business run calmly through its busiest hours.
We care about reliability and cost — uptime at any price isn’t the goal. We want someone who genuinely understands how a technology failure translates into guest experience, crew workflows, and revenue, and who builds systems and schedules around how restaurants operate.
Our current stack includes:
- Application Stack: TypeScript, NestJS, TypeORM/Sequelize, React / React Native, MySQL / Aurora.
- Cloud (AWS): ECS/Fargate, Lambda, RDS Aurora, CloudFront, API Gateway, SQS/SNS, IAM, VPC networking, Flow Logs.
- IaC & Automation: Terraform and AWS CDK at scale, CloudFormation, GitHub Actions, Bitrise.
- Observability: Prometheus, Grafana, OpenTelemetry / OTel Collector, Datadog, AWS CloudWatch.
- Networking: SD-WAN, MPLS, BGP, VLANs and firewall policy; AWS Direct Connect, Site-to-Site VPN, Transit Gateway.
- SLOs & error budgets: service-level indicators, objectives and agreements shaped with product and engineering leaders.
- Incident response: P0/P1 command, structured blameless post-mortems, and error-budget-driven release decisions.
- On-call by design: rosters built around operational peaks (breakfast, lunch, dinner, and weekend surge).
- Multi-site footprint: store connectivity and last-mile reliability across a distributed retail / QSR estate.
Reporting to: Head of Engineering (Platform & Cloud Operations)
Responsibilities
Responsibilities include, but are not limited to:
Cloud Infrastructure & Reliability
- Operate AWS at scale: ECS, Lambda, RDS Aurora, CloudFront, IAM, and VPC networking, managed as infrastructure-as-code with Terraform or AWS CDK.
- Engineer for cost-aware reliability: balance availability against cost; right-size and optimise rather than chase uptime at any price.
- Harden release readiness: CI/CD, deployment safety, and rollback so changes reach production reliably.
Hybrid & On-Premises Networking
- Run physical and hybrid networks: SD-WAN, MPLS, BGP routing, VLANs, and firewall policy across a multi-site estate.
- Own hybrid connectivity: AWS Direct Connect, Site-to-Site VPN, and Transit Gateway between cloud and stores.
- Solve last-mile problems: diagnose and resolve store connectivity issues remotely, calmly and quickly, at peak service.
Observability at Scale
- Operate and scale Prometheus: federation, remote write, cardinality management, and alerting rules.
- Build Grafana for ops: dashboards, alerting, and provisioning at scale (Mimir or Thanos a plus).
- Run OpenTelemetry pipelines: OTel Collector processors, exporters, and multi-destination routing.
- Integrate OSS and commercial: Datadog alongside the open-source stack for comprehensive insight.
SLA / SLO Ownership & Incident Response
- Define SLIs, SLOs and SLAs: shaped with product and engineering leaders, not filled into a spreadsheet.
- Manage error budgets: translate burn rate into clear release decisions and communicate trade-offs.
- Lead P0/P1 incidents: command response and run structured, blameless post-mortems.
- Report reliability: present metrics clearly to non-technical stakeholders across ops, finance, and exec.
Restaurant Operations & On-Call
- Connect tech to the guest: understand how failures hit guest experience, crew workflows, and revenue, not just ticket counts.
- Know the QSR stack: POS systems, kitchen display systems (KDS), ordering integrations, and loyalty platforms.
- Design around peaks: build systems and on-call schedules for breakfast, lunch, dinner, and weekend surge.
Collaboration & Enablement
- Partner across teams: support scalable, reliable service design with product and engineering.
- Mentor and document: raise SRE and DevOps practice; document infrastructure, decisions, and runbooks.
- Improve the platform: contribute to internal developer platform and portal tooling that lifts team productivity.
Essential Requirements & Behaviour
Cloud & Infrastructure
- 5+ years with AWS, hands-on with ECS, Lambda, RDS Aurora, CloudFront, IAM, and VPC networking
- Proficiency with Terraform or AWS CDK for infrastructure-as-code at scale
- Familiarity with AWS cost optimisation alongside reliability, not just uptime at any cost
- Bachelor’s degree in Computer Science, Engineering, or a related discipline, or equivalent experience
On-Premises & Hybrid Networking
- Proven experience managing physical network infrastructure: SD-WAN, MPLS, BGP routing, VLANs, and firewall policy
- Comfortable with last-mile connectivity challenges in distributed, multi-site environments (retail / hospitality preferred)
- Experience with hybrid connectivity patterns: AWS Direct Connect, Site-to-Site VPN, and Transit Gateway
- Able to troubleshoot a store connectivity issue remotely at peak service, calmly and quickly
Observability at Scale
- Deep experience operating and scaling Prometheus: federation, remote write, cardinality management, alerting rules
- Grafana expertise: dashboard design for ops audiences, alerting, provisioning, and managing at scale (Grafana Mimir or Thanos a plus)
- Hands-on with OpenTelemetry and OTel Collector pipelines: processors, exporters, and multi-destination routing
- Experience integrating commercial observability (Datadog) alongside OSS stacks
SLA / SLO Ownership & Incident Response
- Experience defining service-level indicators, objectives, and agreements, shaping them with product and engineering leaders rather than just filling in a spreadsheet
- Able to manage and communicate error budgets and translate burn rate into release decisions
- Track record leading P0/P1 incident response and running structured post-mortems
- Comfortable presenting reliability metrics to non-technical stakeholders (ops, finance, exec)
Restaurant Operations Understanding
- Genuine understanding of how technology failures translate to guest experience, crew workflows, and revenue, not just ticket counts
- Familiarity with QSR or fast-casual tech: POS systems, kitchen display systems (KDS), ordering integrations, and loyalty platforms
- Awareness of operational peaks (breakfast, lunch, dinner, weekend surge) and how to build systems and on-call schedules around them
Nice to Have
- Experience across AWS and Azure networking, Transit Gateway, OpenTelemetry, and infrastructure reliability
- Background with Datadog Observability Pipelines Worker (OPW) or similar log routing
- Okta or SSO / IAM integration experience for workforce identity
- Contributions to internal developer platforms or portal tooling
- AWS Certified Solutions Architect or equivalent certification
What Makes You Ideal
You treat reliability as a product, not a chore — you shape SLOs, defend error budgets, and stay calm when a store goes dark at the lunch peak. You’re as comfortable in a BGP session or a firewall policy as you are in an OTel pipeline or a Terraform module, and you know when spending more on uptime stops being worth it.
Most of all, you understand that behind every metric is a guest waiting for an order and a crew member trying to serve them. You build systems, dashboards, and on-call schedules around how restaurants run — and you leave behind runbooks and practices the whole team can rely on.
About Softobiz Technologies
Softobiz Technologies is a technology and product services company headquartered in India, operating Global Capability Centers (GCCs) for leading international clients across healthcare, fintech, and enterprise software. Our GCC model enables world-class talent in India to work directly within the product and engineering teams of our global partners, contributing meaningfully to product strategy, growth, and operations.
Innovation begins with like-minded people aiming to transform the world together. At Softobiz, we invite you to become a part of an organization that has been helping clients transform their business by fusing insights, creativity, and technology. With a team of 400+ technology enthusiasts, we have been trusted by leading enterprises around the globe for over 18 years.
At Softobiz, we foster a culture of equality, learning, collaboration, and creative freedom, empowering our employees to grow and excel in their careers. Our technical craftsmen are pioneers in the latest technologies like AI, machine learning, and product development.
Why Should You Join Softobiz?
- Work with technical craftsmen who are pioneers in the latest technologies.
- Access training sessions and skill-enhancement courses for personal and professional growth.
- Be rewarded for exceptional performance and celebrate success through engaging parties.
- Experience a culture that embraces diversity and creates an inclusive environment for all employees.
Softobiz is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. All qualified applicants will be afforded equal employment opportunities without discrimination based on race, creed, color, national origin, sex, age, disability, or marital status.