Site Reliability Engineer

InfraCloud Technologies - Pune, Maharashtra


Job details

1 day ago

Qualifications

CI/CD
TCP
Cloud infrastructure
Incident management
Software troubleshooting
Kubernetes
Load balancing
Git
Google Cloud Platform
RBAC
Cloud security
SSL
Distributed systems
Terraform
Continuous integration
Computer networking
DNS
Linux
TCP/IP
GitLab
Identity & access management

Full job description

Position Summary

We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, automate, and operate scalable, secure, and highly available cloud-native platforms. The ideal candidate will have strong expertise in the Kubernetes ecosystem, Google Cloud Platform (GCP), Infrastructure as Code (Terraform), GitOps, observability, service mesh, and secrets management.

The SRE will work closely with Development, Platform Engineering, Security, and DevOps teams to ensure reliability, performance, scalability, and operational excellence across production environments.

Key Responsibilities

Kubernetes Platform Engineering

Design, deploy, and manage large-scale Kubernetes clusters in production environments.
Administer and optimize Kubernetes networking using:
- Cilium
- Istio Service Mesh
- Kubernetes Ingress Controllers
Build highly available and resilient container platforms.
Implement cluster lifecycle management, upgrades, scaling, and capacity planning.
Troubleshoot complex Kubernetes infrastructure and application issues.

Cloud Infrastructure (GCP)

Design and operate cloud-native infrastructure on Google Cloud Platform.
Manage services such as:
- GKE (Google Kubernetes Engine)
- VPC Networking
- IAM
- Cloud Load Balancers
- Cloud Storage
- Monitoring and Logging services
Ensure security, scalability, and cost optimization of cloud environments.
Implement multi-environment and multi-region deployment strategies.

Infrastructure as Code (Terraform)

Develop and maintain reusable Terraform modules.
Automate provisioning and management of cloud infrastructure.
Implement infrastructure standards and governance.
Maintain version-controlled infrastructure repositories.
Ensure repeatable, auditable, and scalable infrastructure deployments.

Kubernetes Package Management (Helm)

Create and maintain Helm charts for platform and application deployments.
Standardize deployment practices across teams.
Manage Helm repositories and release strategies.
Support blue-green, canary, and rolling deployment methodologies.

GitOps & Continuous Delivery

Build and maintain GitOps workflows using ArgoCD.
Automate application deployment pipelines.
Implement environment promotion strategies.
Maintain deployment compliance and auditability.
Drive CI/CD best practices across engineering teams.

Secrets & Service Discovery Management

Manage secrets, certificates, and application credentials using Vault.
Implement secure secret injection patterns for Kubernetes workloads.
Configure and maintain Consul for service discovery and service networking.
Establish access control and security policies for sensitive workloads.

Monitoring, Observability & Reliability Engineering

Build comprehensive observability solutions using:
- Prometheus
- Prometheus Operator
- Grafana
- Loki
- Tempo
- Alloy
- Mimir
- Pyroscope
Define and implement:
- Service Level Indicators (SLIs)
- Service Level Objectives (SLOs)
- Error Budgets
Create dashboards, alerts, and operational runbooks.
Conduct root cause analysis (RCA) and postmortems.
Improve system reliability, performance, and operational visibility.

Incident Response & Operations

Participate in on-call rotations.
Lead incident management during production outages.
Troubleshoot infrastructure, networking, application, and platform issues.
Develop automation to reduce operational toil.
Create disaster recovery and business continuity procedures.

Automation & Platform Engineering

Develop automation scripts and operational tooling.
Improve platform self-service capabilities.
Drive reliability engineering best practices.
Eliminate manual operational processes through automation.

Required Technical Skills

Container & Kubernetes Ecosystem

Kubernetes (Production-grade administration)
Cilium
Istio Service Mesh
Kubernetes Ingress Controllers
Container Networking
Cluster Security and RBAC

Cloud Platforms

Google Cloud Platform (GCP)
GKE
Cloud Networking
IAM and Security Controls

Infrastructure as Code

Terraform
Infrastructure Automation
Configuration Management Concepts

Deployment & GitOps

ArgoCD
GitOps Methodologies
GitLab
CI/CD Pipelines

Secrets & Service Networking

HashiCorp Vault
Consul

Monitoring & Observability

Prometheus
Prometheus Operator
Grafana
Loki
Tempo
Alloy
Mimir
Pyroscope

Operating Systems & Networking

Linux Administration
TCP/IP
DNS
Load Balancing
SSL/TLS
Network Troubleshooting

Preferred Qualifications

Experience managing large-scale Kubernetes platforms.
Experience supporting mission-critical production systems.
Strong understanding of distributed systems concepts.
Knowledge of cloud security best practices.
Experience implementing SRE principles such as:
- SLI/SLO/Error Budgets
- Capacity Planning
- Incident Management
- Reliability Engineering
Experience with multi-cluster Kubernetes environments.
Relevant certifications such as:
- Certified Kubernetes Administrator (CKA)
- Certified Kubernetes Security Specialist (CKS)
- Google Cloud Professional Certifications
- HashiCorp Terraform Associate

Experience

5–10+ years of overall infrastructure/platform engineering experience.
3–5+ years of hands-on Kubernetes production experience.
Strong experience in cloud-native platforms, observability, automation, and GitOps-driven operations.
