Role description
Job Description
Job Title: Cloud Site Reliability Engineer (SRE)
Position Overview
We are seeking a Cloud Site Reliability Engineer (SRE) to drive the reliability, scalability, and performance of our cloud-based infrastructure. The ideal candidate combines software engineering expertise with advanced systems operations skills to maintain highly available systems while reducing operational toil. This role involves automation, monitoring, capacity planning, incident response, and cloud platform management across a dynamic, distributed environment.
As a Cloud SRE, you will work closely with Engineering, Architecture, DevOps, and Security teams to ensure seamless service experiences for our customers while contributing to platform design and operational efficiency.
Position Requirements
Our Engineers play a critical role in the success of our clients and are expected to effectively communicate our recommended solutions in a consultative role for each client. Therefore, a successful candidate will possess a high degree of self-management, personal accountability, strong communication skills, and teamwork. The ability to interact, engineer, and communicate collaboratively at the highest technical levels with customers, vendors, partners, and all members of staff is required.
Key Responsibilities
System Reliability & Availability: Design and maintain fault-tolerant, high-availability architectures across AWS, Azure, and GCP. Implement redundancy, load balancing, and automated failover strategies.
Cloud Infrastructure Management: Deploy, manage, and optimize cloud resources using IaC tools such as Terraform and Ansible.
Monitoring & Observability: Implement monitoring and logging frameworks using Splunk, Azure Monitor, Dynatrace, AWS CloudWatch, or similar to detect and resolve issues proactively.
Incident Management: Lead real-time incident response, root-cause analysis, and postmortems to continuously improve uptime and resilience.
Capacity Planning & Scaling: Predict traffic patterns, optimize resource utilization, and enforce autoscaling and performance best practices.
Automation & Tooling: Develop scripts and internal tooling for automating routine tasks to reduce manual intervention. Languages may include Python, PowerShell, or Bash.
Security & Compliance: Collaborate with security teams to implement secure infrastructure practices, including encryption, role-based access, auditing, and vulnerability management.
Collaboration & Mentorship: Work across engineering and DevOps teams, providing guidance on reliability best practices and mentoring junior SREs.
Required Skills & Qualifications
Programming & Scripting: Proficiency in Python, PowerShell, Bash, or equivalent for automation and system management.
Cloud Platforms: Hands-on experience with AWS, Azure, or GCP; strong understanding of VPCs, IAM, serverless architectures, and managed Kubernetes services.
Containers & Orchestration: Experience with Docker and Kubernetes.
Infrastructure as Code (IaC): Proficient in Terraform and Ansible.
Monitoring & Observability: Expertise with Splunk, Azure Monitor, Dynatrace, AWS CloudWatch, or similar tools.
Expert knowledge of, and practical experience with, cloud data migration tools.
Operating Systems: Advanced knowledge of Windows and Linux/Unix environments, with experience in system administration and networking fundamentals.
Incident Response: Strong problem-solving skills under pressure, with experience managing outages and mitigating risk.
Collaboration & Communication: Ability to articulate technical insights, coordinate across teams, and contribute to a blameless culture to resolve issues and drive consistent results.
Preferred Qualifications
Industry certifications such as AWS Certified Solutions Architect, Google Cloud Professional DevOps Engineer, or Azure DevOps Engineer.
Exposure to chaos engineering or resilience testing frameworks.
Prior experience in multi-cloud deployments or hybrid cloud environments.
Familiarity with service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for service reliability.
Gather feedback from the department on areas for improvement and provide solutions utilizing Azure.
Skills
Mandatory Skills : Automation & Scripting
Good to Have Skills : Critical Incident Response, Monitoring & Observability, Service Level & Error Budget Management
About LTM
LTM is an AI-centric global technology services company and the Business Creativity partner to the world’s largest and most disruptive enterprises. We bring human insights and intelligent systems together to help clients create greater value at the intersection of technology and domain expertise. Our capabilities span integrated operations, transformation, and business AI, enabling new ways of working, new productivity paradigms, and new roads to value. Together with over 87,000 employees across 40 countries and our global network of partners, LTM, a Larsen & Toubro company, owns business outcomes for our clients, helping them not just outperform the market, but to Outcreate it.
Please also note that neither LTM nor any of its authorized recruitment agencies/partners charge any candidate registration fee or any other fees from talent (candidates) towards appearing for an interview or securing employment/internship. Candidates shall be solely responsible for verifying the credentials of any agency/consultant that claims to be working with LTM for recruitment. Please note that anyone who relies on the representations made by fraudulent employment agencies does so at their own risk, and LTM disclaims any liability in case of loss or damage suffered as a consequence of the same.
Recruitment Fraud Alert - https://www.ltimindtree.com/recruitment-fraud-alert/