Job Summary
We are seeking a highly experienced Principal Cloud Infrastructure Engineer to lead the architecture, automation, and scalability of enterprise-grade cloud platforms. This role requires 10+ years of hands-on expertise in designing highly resilient AWS environments, building Infrastructure as Code (IaC) frameworks using Terraform, and managing large-scale Kubernetes ecosystems.
The ideal candidate will play a strategic role in strengthening platform reliability, cloud security, deployment automation, and operational excellence across the organization. You will work closely with engineering, platform, security, and architecture teams to establish scalable cloud-native solutions and drive infrastructure modernization initiatives.
Key Responsibilities
Cloud Infrastructure & Architecture
Architect, deploy, and manage highly available, scalable, and secure cloud infrastructure on AWS. Design enterprise-grade cloud environments leveraging services such as EKS, EC2, VPC, IAM, S3, RDS, Route 53, CloudWatch, and Load Balancers. Drive cloud-native architecture standards and best practices for scalability, resiliency, and disaster recovery.
Infrastructure as Code (IaC)
Lead the implementation and governance of Infrastructure as Code using Terraform. Develop reusable Terraform modules, manage remote state strategies, and implement environment standardization using Terragrunt. Ensure infrastructure provisioning is automated, version-controlled, and compliant with enterprise standards.
Kubernetes & Container Platform Engineering
Design and manage production-grade Kubernetes (EKS/K8s) clusters for large-scale microservices platforms. Implement best practices for cluster scaling, workload orchestration, networking, ingress management, and security policies. Manage container deployment strategies using Helm, service mesh technologies (Istio), and GitOps methodologies.
CI/CD & Platform Automation
Build and optimize automated CI/CD pipelines enabling zero-downtime deployments and faster release cycles. Implement GitOps-based deployment strategies using tools such as Argo CD, Jenkins, and GitHub Actions. Automate operational processes, infrastructure provisioning, and platform maintenance tasks using Python and Bash scripting.
Observability, Reliability & Performance
Define and implement enterprise monitoring, alerting, logging, and observability frameworks. Ensure platform reliability through proactive monitoring using Prometheus, Grafana, ELK Stack, Datadog, or similar tools. Establish and maintain SLA/SLO-driven operational standards and incident response practices.
Security & Governance
Enforce security-first cloud infrastructure practices including IAM governance, least-privilege access, encryption, and network isolation. Conduct infrastructure security assessments, compliance reviews, and vulnerability remediation activities. Collaborate with security teams to implement enterprise compliance and governance standards.
Technical Leadership
Provide technical leadership and mentorship to DevOps, Cloud, and Infrastructure engineering teams. Lead architectural reviews, infrastructure modernization initiatives, and platform strategy discussions. Drive adoption of best practices across automation, reliability engineering, and cloud operations.
Technical Requirements
Primary Skills
- AWS Cloud Architecture
- Kubernetes (EKS/K8s)
- Terraform & Infrastructure as Code (IaC)
Secondary Skills
- Python / Bash Scripting
- CI/CD Tools: GitHub Actions, Jenkins, Argo CD
- Helm & Service Mesh (Istio)
- Monitoring & Observability: Prometheus, Grafana, ELK, Datadog
Required Experience
- 10+ years of experience in Cloud Infrastructure, DevOps Engineering, Platform Engineering, or Site Reliability Engineering (SRE)
- Strong experience managing enterprise-scale cloud-native environments and Kubernetes platforms
- Proven expertise in automation, infrastructure scalability, and cloud security best practices