Role Overview
We are looking for an experienced Infrastructure Lead to drive the design, implementation, and optimization of scalable, secure, and highly available cloud infrastructure. The person in this role will lead DevOps/SRE initiatives, establish best practices, and ensure the reliability and performance of mission-critical systems.
Key Responsibilities
1. Cloud Infrastructure & Architecture
- Design, build, and manage scalable cloud infrastructure on AWS/Azure
- Lead architecture decisions for high availability, fault tolerance, and performance
- Drive infrastructure automation using Infrastructure as Code (Terraform)
2. DevOps & CI/CD Enablement
- Establish and optimize CI/CD pipelines (Jenkins, GitLab CI, CircleCI, ArgoCD)
- Implement GitOps practices for consistent and reliable deployments
- Improve deployment frequency, reduce lead time, and minimize failures
3. Kubernetes & Containerization
- Manage and scale Kubernetes clusters (EKS/AKS/on-prem)
- Implement container orchestration, service mesh, and cluster optimization strategies
- Ensure platform stability and performance tuning
4. Monitoring, Reliability & Incident Management
- Define and enforce SLOs/SLAs and reliability standards
- Implement observability frameworks (Prometheus, Grafana, Datadog, ELK)
- Lead incident response, root cause analysis (RCA), and MTTR reduction
5. Automation & Operational Excellence
- Drive automation across infrastructure provisioning, monitoring, and recovery
- Build reusable infrastructure modules and accelerators
- Reduce manual effort through scripting (Python, Bash) and tooling
6. Security & Compliance
- Implement cloud security best practices (IAM, network security, policies)
- Ensure compliance through Kubernetes policies and governance frameworks
- Drive secure-by-design infrastructure practices
7. Cost Optimization
- Monitor and optimize cloud usage and costs
- Implement right-sizing, auto-scaling, and resource utilization strategies
8. Leadership & Stakeholder Management
- Lead and mentor DevOps/SRE teams
- Collaborate with engineering, product, and architecture teams
- Drive infrastructure best practices across projects and teams
9. Innovation & AI-driven Ops (Nice to Have)
- Explore AI/ML-driven infrastructure and AIOps capabilities
- Implement intelligent monitoring, anomaly detection, and RCA automation
Required Skills & Experience
- 8+ years of experience in Infrastructure / DevOps / SRE roles
- Strong expertise in AWS or Azure (AWS preferred)
- Hands-on experience with Terraform (IaC)
- Deep knowledge of Kubernetes and containerization (Docker)
- Experience with CI/CD tools (Jenkins, GitLab CI, CircleCI, ArgoCD)
- Strong understanding of monitoring & observability tools
- Proficiency in scripting (Python, Bash)
- Experience in managing high-availability, large-scale systems