Position Overview
We are seeking a highly experienced Senior DevOps Engineer to design, build, and operate scalable cloud infrastructure with a primary focus on AWS. The ideal candidate has extensive experience managing complex distributed systems, deploying machine learning workloads, implementing secure and automated infrastructure, and supporting multi-service architectures in production environments.
This role requires a strong infrastructure engineering mindset, deep cloud expertise, and the ability to collaborate with software, data, and machine learning teams to deliver highly available and scalable platforms.
Key Responsibilities
Cloud Infrastructure & Platform Engineering
- Design, implement, and manage cloud-native infrastructure primarily on AWS.
- Build and maintain highly available, fault-tolerant, and scalable production environments.
- Develop Infrastructure as Code (IaC) using tools such as Terraform or CloudFormation.
- Establish cloud governance, security, networking, and operational best practices.
- Optimize infrastructure costs while maintaining performance and reliability.
Kubernetes & Container Orchestration
- Design and operate production Kubernetes environments.
- Manage containerized applications and platform services across multiple environments.
- Implement autoscaling, service discovery, ingress routing, and workload isolation strategies.
- Optimize cluster performance, reliability, and resource utilization.
Machine Learning Infrastructure
- Deploy, manage, and scale machine learning workloads in production environments.
- Support GPU-based and CPU-based workloads for training and inference.
- Build deployment pipelines for ML models and AI services.
- Collaborate with ML engineers and data scientists to operationalize machine learning systems.
- Manage model-serving infrastructure and inference scaling requirements.
Networking & Multi-Service Architecture
- Design and maintain complex networking architectures across cloud environments.
- Configure and manage:
  - Load balancers
  - API gateways
  - Service meshes
  - Reverse proxies
  - Traffic routing policies
- Support multi-service and microservice-based platforms.
- Implement secure communication between distributed services.
CI/CD & Automation
- Build and maintain robust CI/CD pipelines.
- Automate infrastructure provisioning, deployments, testing, and operational workflows.
- Implement deployment strategies including:
  - Blue/green deployments
  - Canary releases
  - Rolling updates
- Improve engineering productivity through platform automation.
Reliability & Observability
- Implement monitoring, logging, tracing, and alerting solutions.
- Establish SLOs, SLIs, and operational metrics.
- Lead incident response and root-cause analysis activities.
- Continuously improve platform reliability and operational excellence.
Required Qualifications
Experience
- 5+ years of DevOps, Platform Engineering, Site Reliability Engineering (SRE), or Infrastructure Engineering experience.
- Proven experience operating production environments at scale.
- Experience supporting mission-critical systems with high availability requirements.
AWS Expertise
Strong hands-on experience with AWS services including:
- EC2
- VPC
- IAM
- Route 53
- Application Load Balancer (ALB)
- Network Load Balancer (NLB)
- ECS and/or EKS
- S3
- RDS
- ElastiCache
- CloudWatch
- Secrets Manager
- Lambda (preferred)
Kubernetes & Containers
- Extensive Kubernetes production experience.
- Strong understanding of:
  - Networking
  - Ingress controllers
  - Storage management
  - Cluster operations
  - Security policies
- Advanced Docker experience.
Infrastructure as Code
Experience with:
- Terraform (strongly preferred)
- CloudFormation
- Pulumi (nice to have)
CI/CD
Hands-on experience with one or more of the following:
- GitHub Actions
- GitLab CI/CD
- Jenkins
- ArgoCD
- CircleCI
Networking
Strong understanding of:
- DNS
- TLS/SSL
- VPNs
- Routing
- Reverse proxies
- Service-to-service communication
- Network security architecture
Preferred Qualifications
Google Cloud Platform Exposure
Experience with:
- GKE
- Cloud Run
- Compute Engine
- Cloud Storage
- IAM
- VPC Networking
Machine Learning Infrastructure
Experience deploying and operating:
- ML inference services
- GPU workloads
- Model-serving platforms
- MLOps workflows
- Vector databases
- LLM applications and AI infrastructure
Observability & Operations
Experience with:
- Prometheus
- Grafana
- OpenTelemetry
- ELK/OpenSearch
- Datadog
- New Relic
Security
- Cloud security best practices.
- IAM design and access controls.
- Secrets management.
- Vulnerability management and compliance frameworks.
Required Application Submission
Applicants should include:
- Resume/CV
- LinkedIn profile (optional)
- GitHub profile (if applicable)
- Description of the largest production infrastructure they have managed
- Summary of Kubernetes and AWS environments they have operated
- Details of machine learning or AI workloads they have deployed
- Examples of CI/CD pipelines and Infrastructure-as-Code projects they have implemented
Success Criteria
The successful candidate will:
- Architect and maintain highly reliable cloud infrastructure.
- Independently manage AWS-based production environments.
- Deploy and operate complex ML and AI workloads.
- Design secure and scalable multi-service routing architectures.
- Drive automation, observability, and operational excellence.
- Serve as a technical leader for cloud infrastructure and platform engineering initiatives.
- Improve deployment velocity, system reliability, and infrastructure scalability across the organization.
Pay: ₹647,045.59 - ₹804,652.19 per year
Benefits:
- Health insurance
- Paid sick time
- Paid time off
- Provident Fund
Work Location: In person