Job Title: DevOps/SRE Engineer
Role Overview
A DevOps/SRE engineer who builds and operates cloud-native platforms, focusing on Kubernetes, CI/CD, observability, and secure-by-default practices to enable reliable, high-velocity releases.
Role Summary
Own environment setup and day-to-day operations for services running on Kubernetes; design and maintain CI/CD and GitOps workflows; implement observability stacks; and enforce security baselines. Lead small platform initiatives and production incidents with guidance from senior and staff engineers.
Key Responsibilities
Platform Engineering (Kubernetes)
- Operate production-grade Kubernetes clusters (node pools, upgrades, backups) with guidance
- Manage Ingress controllers, CNI plugins, and basic service mesh setup (Istio/Linkerd), including mTLS
- Configure autoscaling (HPA/VPA/KEDA), resource requests/limits, affinity/anti-affinity, and resource quotas
- Package and standardize deployments with Helm/Kustomize; maintain golden templates
CI/CD & GitOps
- Build and maintain CI/CD pipelines (GitHub Actions/GitLab CI/Jenkins/Azure DevOps)
- Implement GitOps (Argo CD/Flux) for environment reconciliation and progressive delivery
- Manage artifact repositories, build caching, and software supply chain basics (SBOMs, image signing)
Observability & SRE
- Deploy and tune monitoring/logging/tracing (Prometheus, Grafana, ELK/Loki, OpenTelemetry)
- Define SLIs/SLOs with teams; configure actionable alerts and runbooks
- Lead incident response for owned areas; drive post-incident reviews and remediation
Security & Compliance
- Implement IAM and least-privilege patterns; manage secrets (Vault/Sealed Secrets/SOPS)
- Enforce image scanning, policy checks (OPA Gatekeeper/Kyverno), and Pod Security Standards
- Configure network policies, TLS, and certificate management (cert-manager)
Cloud & IaC
- Provision and manage cloud resources (AWS/Azure/GCP) using IaC (Terraform or Pulumi; Pulumi experience is a plus)
- Manage VPC/networking, load balancers, DNS/CDN, storage classes, and backups/DR procedures
- Track costs and perform basic capacity planning; recommend right-sizing and autoscaling policies
Collaboration & Enablement
- Create platform docs, templates, and paved paths
- Partner with app teams to debug deployments, performance, and reliability issues
Mandatory Skills
- Kubernetes operations (Helm/Kustomize, Ingress, autoscaling, resource policies)
- CI/CD systems and GitOps (GitHub Actions/GitLab CI/Jenkins; Argo CD/Flux)
- Observability stacks (Prometheus, Grafana, ELK/Loki) and OpenTelemetry basics
- IaC with Terraform/Pulumi; cloud experience (AWS/Azure/GCP)
- Security: IAM, secrets management, image scanning, policy enforcement (OPA/Kyverno)
- Docker, Linux, and networking fundamentals (DNS, TLS, routing); Bash plus one scripting language (Python preferred)
- Familiarity with release strategies (canary/blue-green) and safe rollback practices