Platform Reliability Engineer (SRE / DevSecOps)

Lifesight - India

Job details

Full-time

Full job description

We are building an AI-native software factory that rapidly launches SaaS products into the market. Our product engineers use AI-assisted development tools to build fast. We are hiring a senior, hands-on Platform Reliability Engineer to make sure every product we launch is production-grade: deployable, observable, secure, scalable, resilient, and cost-efficient.

You will own the shared production layer across our portfolio. Your job is to turn “it works” into “it runs reliably for customers.” You will define the standards, tooling, and operating practices that let a small engineering team launch and maintain many products without operational chaos.

What you will own

This role owns productionization across the service lifecycle: deployment standards, production readiness reviews, observability, SLOs and alerting, automated recovery, scalability, security hardening, and incident response. The goal is to reduce manual operational toil by turning repeatable ops work into software, templates, and automation.

Responsibilities

Build and own the “golden path” for launching and operating products in production:

infrastructure templates
CI/CD pipelines
environment provisioning
secrets management
DNS, SSL, and edge configuration
rollout and rollback workflows
backups and restore testing
monitoring, logging, tracing, dashboards, and alerts

Define and enforce a Production Readiness Review for every launch, covering reliability, security, scalability, rollback, observability, and recovery.

Define service-level indicators and service-level objectives for each product, and build alerting tied to customer impact rather than noisy infra events.

Architect and operate reliable cloud infrastructure for multi-product SaaS workloads:

autoscaling
load balancing
caching
queues and background jobs
database reliability
failover and disaster recovery
capacity planning and performance tuning
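
As an illustration of alerting tied to customer impact rather than noisy infra events, here is a minimal Python sketch of an error-budget burn-rate check. It is not our actual tooling: the `SloWindow` shape, the function names, and the 14.4 paging threshold (a commonly cited value for a fast-burn alert on a 30-day SLO) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one alerting window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means errors arrive at exactly the rate the SLO allows;
    higher values mean the budget would run out before the period ends.
    """
    if window.total_requests == 0:
        return 0.0  # no traffic, so no customer impact to alert on
    error_budget = 1.0 - slo_target  # e.g. roughly 0.001 for a 99.9% SLO
    error_rate = window.failed_requests / window.total_requests
    return error_rate / error_budget


def should_page(short: SloWindow, long: SloWindow, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window burn fast.

    Requiring both windows filters out brief spikes (long window stays low)
    and stale incidents that already recovered (short window stays low).
    The default threshold is an assumed, illustrative value.
    """
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)
```

For example, `should_page(SloWindow(1000, 20), SloWindow(10000, 200), 0.999)` evaluates a 2% error rate against a 0.1% budget in both windows, which is well above the assumed threshold, so it would page.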

Own runtime and cloud security hardening:

IAM and least-privilege access
secret rotation and key management
dependency and container scanning
patching and vulnerability management
network boundaries and service-to-service access
audit logging
WAF/CDN and edge protections
secure release controls

Lead incident response for production issues:

triage
mitigation
root cause analysis
postmortems
follow-through remediation

Reduce operational toil by automating repetitive support, maintenance, and recovery work.

Partner closely with the product engineers from design through launch so every new app is deployable through a standard platform, not a one-off setup.

For AI-native products, design runtime guardrails around:

model/API credentials
provider rate limits
graceful degradation during vendor issues
latency and cost monitoring
fallback behavior for core AI workflows
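
To make the guardrails above more concrete, here is a minimal Python sketch of graceful degradation with fallback across model providers. Everything in it is hypothetical: `ProviderError`, `call_with_fallback`, the retry counts, and the idea that each provider is a plain callable taking a prompt are assumptions for illustration, not a description of any real provider SDK.

```python
import time
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider call on rate limiting or an outage (hypothetical)."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[Callable[[str], str]],
    degraded_response: str,
    retries_per_provider: int = 2,
    backoff_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order and degrade gracefully if all of them fail.

    `providers` is an ordered list of callables (primary first). `sleep` is
    injectable so the backoff can be skipped in tests.
    """
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except ProviderError:
                # Back off exponentially before retrying the same provider;
                # after the last attempt, move straight on to the next one.
                if attempt < retries_per_provider - 1:
                    sleep(backoff_seconds * (2 ** attempt))
    # Every provider failed: return a safe, pre-agreed degraded answer
    # instead of surfacing an error to the customer.
    return degraded_response
```

A production version would likely also record latency and cost per call and emit a metric each time the fallback or degraded path is taken, so that vendor issues show up on dashboards rather than only in customer reports.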

What we’re looking for

5+ years of hands-on experience in SRE, platform engineering, production engineering, DevSecOps, or an infra-heavy backend role with direct production ownership
Strong experience with at least one major cloud platform such as AWS, GCP, or Azure
Strong infrastructure-as-code skills with Terraform, OpenTofu, Pulumi, or equivalent
Strong CI/CD and release engineering experience
Strong observability skills across logs, metrics, traces, dashboards, and alerting
Strong security fundamentals across IAM, secrets, network controls, vulnerability management, and secure delivery
Experience operating containers and/or serverless systems in production
Solid coding and scripting ability in at least one language such as TypeScript, Python, Go, or Bash
Experience with PostgreSQL, Redis, queues, background workers, and modern web app infrastructure
Experience owning on-call, incidents, postmortems, and recovery processes
Comfort working in a fast-moving startup where many products are launched from shared building blocks
Comfort reviewing and hardening AI-generated or AI-assisted code and infrastructure changes

Nice to have

Experience with multi-tenant SaaS products
Experience building internal developer platforms
SOC 2, ISO 27001, or security compliance preparation experience
Experience with LLM/AI application operations
Experience with FinOps or cloud cost optimization
Experience supporting a product portfolio rather than a single application

Success in the first 90 days

Establish a standard production deployment template for all new products
Put centralized monitoring, logging, tracing, and alerting in place
Create and enforce a production readiness checklist for launches
Define initial SLOs for core products
Implement backups and successfully test restore procedures
Roll out a baseline security hardening standard across all production apps
Create incident response runbooks and escalation paths

Success metrics

Time from product-ready codebase to production launch
Change failure rate
Mean time to detect and mean time to recover
Uptime and latency performance against agreed SLOs
Number of critical production incidents
Backup restore success rate
Security findings closed within target time
Infrastructure cost per product and per active customer
