SRE Platform Engineer

Arting Digital Private Limited
Kolkata, West Bengal

Job details

₹26,00,000 a year
13 hours ago

Qualifications

CI/CD
Cloud infrastructure
System administration
Computer Science
Incident management
Kubernetes
Software deployment
Git
Master's degree
Bash (Unix shell)
MCA
AWS
Bachelor's degree
Distributed systems
Terraform
Continuous integration
Scripting
GitHub
B.E.
Budgeting
Linux
Apache
Kafka
Root cause analysis
RDS database
Python
MySQL

Full job description

Posting title: SRE Platform Engineer

Experience: 7-10 Years

Location: Kolkata

Work mode: On-site

Primary skills: Kubernetes, IaC (Terraform), AWS, ArgoCD, Karpenter, KEDA, Atlantis, SLO/SLI, AWS Control Tower, Multi-Cluster, Multi-Environment, OTel (OpenTelemetry) Framework

Qualification: B.Tech / B.E. in Computer Science or MCA / M.Tech

Role Overview

We are seeking an experienced Site Reliability Engineer (SRE) to lead the modernization and reliability of our cloud infrastructure and platform operations. The ideal candidate will have deep expertise in AWS, Kubernetes, Infrastructure as Code, observability, automation, and production operations at scale.

Key Responsibilities

Cloud & Platform Engineering

Design, build, and maintain scalable cloud infrastructure on AWS.
Manage and optimize AWS services including EKS, RDS Aurora MySQL, ElastiCache, EC2, and networking components.
Drive Kubernetes adoption and operational excellence across production environments.
Manage cluster provisioning, scaling, and optimization using Karpenter and EKS.

Infrastructure as Code & Automation

Lead Terraform governance, standardization, and best practices.
Resolve infrastructure drift and improve multi-team collaboration workflows.
Design strategies for migrating legacy, manually managed environments to Infrastructure as Code.
Develop reusable Terraform modules and infrastructure automation frameworks.

Reliability & Operations

Ensure high availability and reliability of mission-critical real-time applications.
Define and improve incident management processes, operational readiness, and service reliability.
Support production workloads requiring minimal downtime and strict SLAs.
Create and maintain operational runbooks, SOPs, and recovery procedures.

Observability & Monitoring

Enhance monitoring, alerting, and observability practices using Datadog and Prometheus.
Reduce alert fatigue through alert rationalization, prioritization, and ownership models.
Implement SLOs, SLIs, and error budget frameworks.
Improve APM adoption and application performance visibility.

CI/CD & GitOps

Manage and optimize CI/CD pipelines using GitHub Actions, Atlantis, Bitbucket Pipelines, and Rundeck.
Drive GitOps practices using ArgoCD.
Improve deployment reliability, automation, and release governance.

Data Platform & Scalability

Support large-scale event-driven architectures handling hundreds of thousands of events per second.
Work closely with engineering teams on database scalability and data platform modernization.
Support technologies such as Aurora MySQL, ClickHouse, and Pulsar.

Platform Governance

Establish service ownership models and service catalogs.
Partner with development teams to shift operational ownership closer to the teams that build the services.
Drive platform standards, security, reliability, and operational best practices.

Required Skills

Strong experience with AWS cloud services.
Hands-on expertise in Kubernetes (EKS) administration and operations.
Advanced Terraform experience with large-scale environments.
Strong knowledge of Infrastructure as Code, GitOps, and automation practices.
Experience with Datadog, Prometheus, Grafana, or similar observability platforms.
Experience managing CI/CD pipelines and deployment automation.
Strong Linux systems administration skills.
Experience with incident management, root cause analysis, and production support.
Knowledge of distributed systems and high-throughput architectures.
Strong scripting skills using Python, Bash, or similar languages.

Preferred Skills

Experience with Karpenter for Kubernetes autoscaling.
Experience with ArgoCD and GitOps workflows.
Exposure to ClickHouse, Apache Pulsar, Kafka, or event-streaming platforms.
Knowledge of SRE principles, including SLIs, SLOs, and error budgets.
Experience optimizing monitoring and observability costs.
Experience migrating legacy infrastructure to modern cloud-native platforms.
