Posting title: SRE Platform Engineer
Experience: 7-10 Years
Location: Kolkata
Work mode: On-site
Primary skills: Kubernetes, IaC (Terraform), AWS, ArgoCD, Karpenter, KEDA, Atlantis, SLO/SLI, AWS Control Tower, Multi-Cluster, Multi-Environment, OTel Framework
Qualification: B.Tech / B.E. in Computer Science or MCA / M.Tech
Role Overview
We are seeking an experienced Site Reliability Engineer (SRE) to lead the modernization and reliability of our cloud infrastructure and platform operations. The ideal candidate will have deep expertise in AWS, Kubernetes, Infrastructure as Code, observability, automation, and production operations at scale.
Key Responsibilities
Cloud & Platform Engineering
- Design, build, and maintain scalable cloud infrastructure on AWS.
- Manage and optimize AWS services including EKS, RDS Aurora MySQL, ElastiCache, EC2, and networking components.
- Drive Kubernetes adoption and operational excellence across production environments.
- Manage cluster provisioning, scaling, and optimization using Karpenter and EKS.
Infrastructure as Code & Automation
- Lead Terraform governance, standardization, and best practices.
- Resolve infrastructure drift and improve multi-team collaboration workflows.
- Design migration strategies for moving legacy, manually managed environments to Infrastructure as Code.
- Develop reusable Terraform modules and infrastructure automation frameworks.
Reliability & Operations
- Ensure high availability and reliability of mission-critical real-time applications.
- Define and improve incident management processes, operational readiness, and service reliability.
- Support production workloads requiring minimal downtime and strict SLAs.
- Create and maintain operational runbooks, SOPs, and recovery procedures.
Observability & Monitoring
- Enhance monitoring, alerting, and observability practices using Datadog and Prometheus.
- Reduce alert fatigue through alert rationalization, prioritization, and ownership models.
- Implement SLOs, SLIs, and error budget frameworks.
- Improve APM adoption and application performance visibility.
CI/CD & GitOps
- Manage and optimize CI/CD pipelines using GitHub Actions, Atlantis, Bitbucket Pipelines, and Rundeck.
- Drive GitOps practices using ArgoCD.
- Improve deployment reliability, automation, and release governance.
Data Platform & Scalability
- Support large-scale event-driven architectures handling hundreds of thousands of events per second.
- Work closely with engineering teams on database scalability and data platform modernization.
- Support technologies such as Aurora MySQL, ClickHouse, and Pulsar.
Platform Governance
- Establish service ownership models and service catalogs.
- Partner with development teams to shift operational ownership closer to engineering teams.
- Drive platform standards, security, reliability, and operational best practices.
Required Skills
- Strong experience with AWS cloud services.
- Hands-on expertise in Kubernetes (EKS) administration and operations.
- Advanced Terraform experience with large-scale environments.
- Strong knowledge of Infrastructure as Code, GitOps, and automation practices.
- Experience with Datadog, Prometheus, Grafana, or similar observability platforms.
- Experience managing CI/CD pipelines and deployment automation.
- Strong Linux systems administration skills.
- Experience with incident management, root cause analysis, and production support.
- Knowledge of distributed systems and high-throughput architectures.
- Strong scripting skills using Python, Bash, or similar languages.
Preferred Skills
- Experience with Karpenter for Kubernetes autoscaling.
- Experience with ArgoCD and GitOps workflows.
- Exposure to ClickHouse, Apache Pulsar, Kafka, or event-streaming platforms.
- Knowledge of SRE principles including SLIs, SLOs, and error budgets.
- Experience optimizing monitoring and observability costs.
- Experience migrating legacy infrastructure to modern cloud-native platforms.