Location: Hyderabad / Indore
Experience: 3–6 Years
Techdome is hiring a Site Reliability Engineer (SRE) to build, operate, and continuously improve highly available, secure, and scalable cloud infrastructure across our healthcare, fintech, AI, and SaaS products.
This role goes beyond traditional DevOps. You'll own production environments, improve system reliability, automate operations, build resilient deployment pipelines, manage incidents, and ensure seamless releases using strategies such as Blue-Green Deployments, Rolling Deployments, and Zero-Downtime Releases.
If you're passionate about automation, cloud infrastructure, Kubernetes, observability, and AI-powered operations, we'd love to hear from you.
- Maintain the availability, reliability, scalability, and performance of production systems.
- Manage and optimize production environments across cloud platforms.
- Design and automate deployment pipelines using CI/CD best practices.
- Implement Blue-Green, Rolling, and Zero-Downtime deployment strategies.
- Build Infrastructure as Code using Terraform and Ansible.
- Implement observability using Prometheus, Grafana, ELK, Datadog, OpenTelemetry, and centralized logging.
- Define and maintain SLIs, SLOs, and Error Budgets.
- Lead production incident management, Root Cause Analysis (RCA), and post-incident reviews.
- Perform cloud cost optimization and capacity planning.
- Automate operational workflows using scripting and AI-powered tooling.
- Participate in on-call rotations and production support.
- 3+ years of experience as a Site Reliability Engineer, DevOps Engineer, Platform Engineer, or Cloud Engineer.
- Hands-on experience with AWS, Azure, or GCP.
- Strong experience with Docker and Kubernetes.
- Expertise in Terraform, Ansible, or other Infrastructure as Code tools.
- Experience building CI/CD pipelines using Jenkins, GitHub Actions, GitLab CI, or similar.
- Strong Linux administration, networking, and distributed systems knowledge.
- Programming or scripting experience in Python, Go, or Bash.
- Experience managing large-scale production environments.
- Understanding of deployment strategies including Blue-Green, Canary, and Rolling Deployments.
- Experience with AI tools such as GitHub Copilot, Claude, Cursor, ChatGPT, or similar developer productivity tools.
- Experience building AI-powered operational workflows for monitoring, alert triage, incident summarization, or automation.
- Experience in FinTech, Payments, Healthcare, or other high-availability environments.
- Knowledge of SRE principles including SLOs, SLIs, Error Budgets, Chaos Engineering, and Reliability Engineering.
- Work on real-world AI, Healthcare, Payments, and SaaS products.
- Own critical production infrastructure from Day 1.
- Build systems that support thousands of users and business-critical workflows.
- Collaborate directly with founders and senior engineering leadership.
- AI-first engineering culture with modern tooling and automation.
- Fast decision-making, genuine ownership, and accelerated career growth.
We value your time and move quickly.
Step 1: AI Interview on our in-house platform, JustInterview.ai (JIA)
Step 2: Technical Discussion & 1:1 with Leadership
Meet the decision-makers and discuss real engineering challenges. If it's the right fit, you'll have a decision quickly.