We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, automate, and operate scalable, secure, and highly available cloud-native platforms. The ideal candidate will have strong expertise in the Kubernetes ecosystem, Google Cloud Platform (GCP), Infrastructure as Code (Terraform), GitOps, observability, service mesh, and secrets management.
The SRE will work closely with Development, Platform Engineering, Security, and DevOps teams to ensure reliability, performance, scalability, and operational excellence across production environments.
- Design, deploy, and manage large-scale Kubernetes clusters in production environments.
- Administer and optimize Kubernetes networking using:
  - Cilium
  - Istio Service Mesh
  - Kubernetes Ingress Controllers
- Build highly available and resilient container platforms.
- Implement cluster lifecycle management, upgrades, scaling, and capacity planning.
- Troubleshoot complex Kubernetes infrastructure and application issues.
- Design and operate cloud-native infrastructure on Google Cloud Platform.
- Manage GCP services such as:
  - GKE (Google Kubernetes Engine)
  - VPC Networking
  - IAM
  - Cloud Load Balancers
  - Cloud Storage
  - Monitoring and Logging services
- Ensure security, scalability, and cost optimization of cloud environments.
- Implement multi-environment and multi-region deployment strategies.
- Develop and maintain reusable Terraform modules.
- Automate provisioning and management of cloud infrastructure.
- Implement infrastructure standards and governance.
- Maintain version-controlled infrastructure repositories.
- Ensure repeatable, auditable, and scalable infrastructure deployments.
- Create and maintain Helm charts for platform and application deployments.
- Standardize deployment practices across teams.
- Manage Helm repositories and release strategies.
- Support blue-green, canary, and rolling deployment methodologies.
- Build and maintain GitOps workflows using ArgoCD.
- Automate application deployment pipelines.
- Implement environment promotion strategies.
- Maintain deployment compliance and auditability.
- Drive CI/CD best practices across engineering teams.
- Manage secrets, certificates, and application credentials using HashiCorp Vault.
- Implement secure secret injection patterns for Kubernetes workloads.
- Configure and maintain HashiCorp Consul for service discovery and service networking.
- Establish access control and security policies for sensitive workloads.
- Build comprehensive observability solutions using:
  - Prometheus
  - Prometheus Operator
  - Grafana
  - Loki
  - Tempo
  - Alloy
  - Mimir
  - Pyroscope
- Define and implement:
  - Service Level Indicators (SLIs)
  - Service Level Objectives (SLOs)
  - Error Budgets
- Create dashboards, alerts, and operational runbooks.
- Conduct root cause analysis (RCA) and postmortems.
- Improve system reliability, performance, and operational visibility.
- Participate in on-call rotations.
- Lead incident management during production outages.
- Troubleshoot infrastructure, networking, application, and platform issues.
- Develop automation to reduce operational toil.
- Create disaster recovery and business continuity procedures.
- Develop automation scripts and operational tooling.
- Improve platform self-service capabilities.
- Drive reliability engineering best practices.
- Eliminate manual operational processes through automation.
- Kubernetes (Production-grade administration)
- Cilium
- Istio Service Mesh
- Kubernetes Ingress Controllers
- Container Networking
- Cluster Security and RBAC
- Google Cloud Platform (GCP)
- GKE
- Cloud Networking
- IAM and Security Controls
- Terraform
- Infrastructure Automation
- Configuration Management Concepts
- ArgoCD
- GitOps Methodologies
- GitLab
- CI/CD Pipelines
- Prometheus
- Prometheus Operator
- Grafana
- Loki
- Tempo
- Alloy
- Mimir
- Pyroscope
- Linux Administration
- TCP/IP
- DNS
- Load Balancing
- SSL/TLS
- Network Troubleshooting
- Experience managing large-scale Kubernetes platforms.
- Experience supporting mission-critical production systems.
- Strong understanding of distributed systems concepts.
- Knowledge of cloud security best practices.
- Experience implementing SRE principles such as:
  - SLI/SLO/Error Budgets
  - Capacity Planning
  - Incident Management
  - Reliability Engineering
- Experience with multi-cluster Kubernetes environments.
- Relevant certifications such as:
  - Certified Kubernetes Administrator (CKA)
  - Certified Kubernetes Security Specialist (CKS)
  - Google Cloud Professional Certifications
  - HashiCorp Terraform Associate
Experience
- 5–10+ years of overall infrastructure/platform engineering experience.
- 3–5+ years of hands-on Kubernetes production experience.
- Strong experience in cloud-native platforms, observability, automation, and GitOps-driven operations.