Cloud Operations Lead – SRE / DevOps / Platform Engineering
Experience: 9–12 Years
Shift: Overlap with US & EU Business Hours
Role Summary
We are seeking an experienced Cloud Operations Lead with a strong background in Site Reliability Engineering (SRE), DevOps, and Platform Engineering. The ideal candidate will be responsible for ensuring the reliability, security, and operational excellence of cloud-based platforms and services while leading a small team of engineers.
This is a hands-on role: approximately 80% of the focus is on Cloud Operations, Production Support, Reliability, and Platform Ownership, with the remainder on leadership responsibilities.
Key Responsibilities
- Lead cloud operations and production support activities across AWS-based platforms.
- Manage and troubleshoot Linux systems, cloud infrastructure, networking, and Kubernetes environments.
- Drive operational excellence through monitoring, observability, automation, and incident management.
- Build and maintain Infrastructure as Code (IaC) using Terraform, Ansible, and Helm.
- Support and optimize CI/CD pipelines using GitHub Actions, Jenkins, and deployment automation tools.
- Design and implement monitoring, alerting, dashboards, runbooks, and operational standards.
- Lead vulnerability remediation, secrets management, access governance, and platform hardening initiatives.
- Automate infrastructure provisioning, OS/AMI upgrades, and day-2 operational activities.
- Support production deployments, release management, and change control processes.
- Collaborate with engineering teams on onboarding, platform readiness, access management, and operational best practices.
- Mentor and guide junior engineers while driving continuous service improvement.
Required Skills (Non-Negotiable)
- Strong Linux Administration and Troubleshooting
- AWS Cloud Operations (IAM, EC2, Networking, EKS)
- Kubernetes Administration and Production Support
- Terraform and Infrastructure as Code
- CI/CD Tools (GitHub Actions, Jenkins)
- Monitoring & Observability (Datadog, Prometheus, Grafana, SignalFx, Nagios, or similar)
- Incident Management, Root Cause Analysis, and Production Support
- Security Operations including vulnerability remediation, access management, and secrets rotation
- Experience working in enterprise environments with formal change management processes
Preferred Skills
- DNS, Proxy, Edge Services, and Networking Platforms
- Teleport, Bastion Hosts, Service Accounts, and Access Management Solutions
- Container Security and Supply Chain Security
- AMI/Image Lifecycle Management
- AI-enabled Operations, Custom Agentic AI, or Hyperscaler AI Services
Leadership Expectations
- Lead a team of cloud/platform engineers.
- Drive operational governance, service reliability, and process standardization.
- Promote automation-first and reliability-first engineering practices.
- Partner with stakeholders across Cloud, Infrastructure, Security, and Application teams.
Nice to Have
- Experience in SRE, Platform Engineering, or Managed Services environments.
- Exposure to AI-powered operations, observability, or automation solutions.
- Experience supporting large-scale distributed systems and cloud-native applications.