GG-1447 - Site Reliability Engineer (SRE) – Platform Team (AWS + Hybrid)

Softobiz Technologies
Hyderabad, Telangana

Job details

Full-time

Qualifications

BCS
CI/CD
Hospitality
SSO
Cloud infrastructure
Azure
Computer Science
React
Enterprise Software
Git
AWS Certified Solutions Architect – Associate
Network infrastructure
Firewall
AWS
Incident response
Presentation skills
Bachelor's degree
Machine learning
Terraform
Continuous integration
BGP
Computer networking
POS
GitHub
APIs
MPLS
RDS database
TypeScript
React Native
MySQL
Identity & access management
VPN

Full job description

Site Reliability Engineer (SRE) – Platform Team (AWS + Hybrid)

Purpose of Position

We’re looking for a Site Reliability Engineer to join our Platform team and own the reliability of the systems that keep our company trading — from our AWS-based digital platform to the connectivity and point-of-sale systems in every store. You’ll treat reliability as a product: setting SLOs with engineering and product leaders, managing error budgets, and making the call between shipping faster and holding the line.

This is a hands-on role that spans cloud and ground. One hour you’re tuning a Prometheus federation or an OpenTelemetry pipeline; the next you’re troubleshooting a store’s connectivity at peak service. You’ll lead P0/P1 incident response, run blameless post-mortems, and build the observability and on-call practices that let a distributed, multi-site business run calmly through its busiest hours.

We care about reliability and cost — uptime at any price isn’t the goal. We want someone who genuinely understands how a technology failure translates into guest experience, crew workflows, and revenue, and who builds systems and schedules around how restaurants operate.

Technology we use

Current Stack

Application Stack: TypeScript, NestJS, TypeORM/Sequelize, React / React Native, MySQL / Aurora.
Cloud (AWS): ECS/Fargate, Lambda, RDS Aurora, CloudFront, API Gateway, SQS/SNS, IAM, VPC networking, Flow Logs.
IaC & Automation: Terraform and AWS CDK at scale, CloudFormation, GitHub Actions, Bitrise.
Observability: Prometheus, Grafana, OpenTelemetry / OTel Collector, Datadog, AWS CloudWatch.
Networking: SD-WAN, MPLS, BGP, VLANs and firewall policy; AWS Direct Connect, Site-to-Site VPN, Transit Gateway.

Reliability & Operations

SLOs & error budgets: service-level indicators, objectives and agreements shaped with product and engineering leaders.
Incident response: P0/P1 command, structured blameless post-mortems, and error-budget-driven release decisions.
On-call by design: rosters built around operational peaks — breakfast, lunch, dinner, and weekend surge.
Multi-site footprint: store connectivity and last-mile reliability across a distributed retail / QSR estate.

Reporting to: Head of Engineering (Platform & Cloud Operations)

Responsibilities

Responsibilities include, but are not limited to:

Cloud Infrastructure & Reliability

Operate AWS at scale — ECS, Lambda, RDS Aurora, CloudFront, IAM, and VPC networking, managed as infrastructure-as-code with Terraform or AWS CDK.
Engineer for cost-aware reliability — balance availability against cost; right-size and optimise rather than chase uptime at any price.
Harden release readiness — CI/CD, deployment safety, and rollback so changes reach production reliably.

Hybrid & On-Premises Networking

Run physical and hybrid networks — SD-WAN, MPLS, BGP routing, VLANs, and firewall policy across a multi-site estate.
Own hybrid connectivity — AWS Direct Connect, Site-to-Site VPN, and Transit Gateway between cloud and stores.
Solve last-mile problems — diagnose and resolve store connectivity issues remotely, calmly and quickly, at peak service.

Observability at Scale

Operate and scale Prometheus — federation, remote write, cardinality management, and alerting rules.
Build Grafana for ops — dashboards, alerting, and provisioning at scale (Mimir or Thanos a plus).
Run OpenTelemetry pipelines — OTel Collector processors, exporters, and multi-destination routing.
Integrate OSS and commercial — Datadog alongside the open-source stack for comprehensive insight.

SLA / SLO Ownership & Incident Response

Define SLIs, SLOs and SLAs — shaped with product and engineering leaders, not filled into a spreadsheet.
Manage error budgets — translate burn rate into clear release decisions and communicate trade-offs.
Lead P0/P1 incidents — command response and run structured, blameless post-mortems.
Report reliability — present metrics clearly to non-technical stakeholders across ops, finance, and exec.

Restaurant Operations & On-Call

Connect tech to the guest — understand how failures hit guest experience, crew workflows, and revenue — not just ticket counts.
Know the QSR stack — POS systems, kitchen display systems (KDS), ordering integrations, and loyalty platforms.
Design around peaks — build systems and on-call schedules for breakfast, lunch, dinner, and weekend surge.

Collaboration & Enablement

Partner across teams — support scalable, reliable service design with product and engineering.
Mentor and document — raise SRE and DevOps practice; document infrastructure, decisions, and runbooks.
Improve the platform — contribute to internal developer platform and portal tooling that lifts team productivity.

Essential Requirements & Behaviour

Cloud & Infrastructure

5+ years with AWS — hands-on with ECS, Lambda, RDS Aurora, CloudFront, IAM, and VPC networking
Proficiency with Terraform or AWS CDK for infrastructure-as-code at scale
Familiarity with AWS cost optimisation alongside reliability — not just uptime at any cost
Bachelor’s degree in Computer Science, Engineering, or a related discipline, or equivalent experience

On-Premises & Hybrid Networking

Proven experience managing physical network infrastructure — SD-WAN, MPLS, BGP routing, VLANs, and firewall policy
Comfortable with last-mile connectivity challenges in distributed, multi-site environments (retail / hospitality preferred)
Experience with hybrid connectivity patterns — AWS Direct Connect, Site-to-Site VPN, and Transit Gateway
Able to troubleshoot a store connectivity issue remotely at peak service — calmly and quickly

Observability at Scale

Deep experience operating and scaling Prometheus — federation, remote write, cardinality management, alerting rules
Grafana expertise: dashboard design for ops audiences, alerting, provisioning, and managing at scale (Grafana Mimir or Thanos a plus)
Hands-on with OpenTelemetry and OTel Collector pipelines — processors, exporters, and multi-destination routing
Experience integrating commercial observability (Datadog) alongside OSS stacks

SLA / SLO Ownership & Incident Response

Experience defining service-level indicators, objectives, and agreements — shaping them with product and engineering leaders, not just filling in a spreadsheet
Able to manage and communicate error budgets and translate burn rate into release decisions
Track record leading P0/P1 incident response and running structured post-mortems
Comfortable presenting reliability metrics to non-technical stakeholders (ops, finance, exec)

Restaurant Operations Understanding

Genuine understanding of how technology failures translate to guest experience, crew workflows, and revenue — not just ticket counts
Familiarity with QSR or fast-casual tech: POS systems, kitchen display systems (KDS), ordering integrations, and loyalty platforms
Awareness of operational peaks (breakfast, lunch, dinner, weekend surge) and how to build systems and on-call schedules around them

Nice to Have

Experience across AWS and Azure networking, Transit Gateway, OpenTelemetry, and infrastructure reliability
Background with Datadog Observability Pipelines Worker (OPW) or similar log routing
Okta or SSO / IAM integration experience for workforce identity
Contributions to internal developer platforms or portal tooling
AWS Certified Solutions Architect or equivalent certification

What Makes You Ideal

You treat reliability as a product, not a chore — you shape SLOs, defend error budgets, and stay calm when a store goes dark at the lunch peak. You’re as comfortable in a BGP session or a firewall policy as you are in an OTel pipeline or a Terraform module, and you know when spending more on uptime stops being worth it.

Most of all, you understand that behind every metric is a guest waiting for an order and a crew member trying to serve them. You build systems, dashboards, and on-call schedules around how restaurants run — and you leave behind runbooks and practices the whole team can rely on.

About Softobiz Technologies

Softobiz Technologies is a technology and product services company headquartered in India, operating Global Capability Centers (GCCs) for leading international clients across healthcare, fintech, and enterprise software. Our GCC model enables world-class talent in India to work directly within the product and engineering teams of our global partners, contributing meaningfully to product strategy, growth, and operations.

Innovation begins with like-minded people aiming to transform the world together. At Softobiz, we invite you to become part of an organization that has been helping clients transform their businesses by fusing insights, creativity, and technology. With a team of 400+ technology enthusiasts, we have been trusted by leading enterprises around the globe for over 18 years.

At Softobiz, we foster a culture of equality, learning, collaboration, and creative freedom, empowering our employees to grow and excel in their careers. Our technical craftsmen are pioneers in the latest technologies like AI, machine learning, and product development.

Why Should You Join Softobiz?

Work with technical craftsmen who are pioneers in the latest technologies.
Access training sessions and skill-enhancement courses for personal and professional growth.
Be rewarded for exceptional performance and celebrate success through engaging parties.
Experience a culture that embraces diversity and creates an inclusive environment for all employees.

Softobiz is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. All qualified applicants will be afforded equal employment opportunities without discrimination based on race, creed, color, national origin, sex, age, disability, or marital status.
