Project Role : Application Support Engineer
Project Role Description : Act as software detectives, providing a dynamic service that identifies and solves issues within multiple components of critical business systems.
Must have skills : Site Reliability Engineering
Good to have skills : NA
Minimum 3 year(s) of experience is required
Educational Qualification : 15 years of full-time education
Summary:
We are seeking a Senior Analyst – Site Reliability Engineering to join our infrastructure and operations team. This role bridges systems engineering, operations, and data-driven decision-making to ensure our platform maintains exceptional uptime, performance, and scalability. You will work on diagnosing complex system issues, optimizing infrastructure efficiency, designing monitoring solutions, and driving continuous improvement initiatives across our production environment.
Roles & Responsibilities:
- Conduct root cause analysis on incidents, system failures, and performance degradation; document findings and implement preventive measures
- Design, implement, and maintain comprehensive monitoring, alerting, and observability solutions across cloud and on-premises infrastructure
- Analyze system metrics, logs, and traces to identify bottlenecks, trends, and opportunities for optimization
- Develop and maintain Service Level Objectives (SLOs) and error budgets; track and report on SLI/SLO compliance
- Collaborate with engineering teams to review infrastructure designs, identify risks, and recommend reliability improvements
- Participate in on-call rotations, incident response, and post-incident reviews; mentor team members on troubleshooting techniques
- Automate operational tasks and routine processes; develop scripts and tools to improve team efficiency
- Create runbooks, documentation, and knowledge base articles to support incident response and operational procedures
Professional & Technical Skills:
5+ years of experience in operations, systems administration, SRE, or DevOps roles
Demonstrated expertise in Linux/Unix systems administration and troubleshooting
Strong proficiency in one or more programming/scripting languages (Python, Go, Bash, etc.)
Hands-on experience with containerization technologies (Docker, Kubernetes) and orchestration platforms
Solid understanding of networking concepts, TCP/IP, DNS, and load balancing
Experience with cloud platforms (AWS, GCP, Azure) or equivalent infrastructure management
Proficiency with monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, New Relic, etc.)
Experience with incident management frameworks and post-incident review processes
Strong analytical, troubleshooting, and problem-solving skills
Experience with Infrastructure-as-Code tools (Terraform, Ansible, CloudFormation)
Knowledge of CI/CD pipelines and deployment automation
Experience with databases (SQL and NoSQL) and data warehousing
Familiarity with distributed systems concepts and microservices architecture
SRE certification or formal SRE training
Track record of implementing automation that reduced operational overhead by 20%+
Technical Skills:
- Operating Systems - Linux (RHEL, Ubuntu, CentOS), Windows Server
- Programming Languages - Python, Go, Bash, Ruby, or Java
- Containerization - Docker, Kubernetes, container orchestration
- Monitoring & Observability - Prometheus, Grafana, ELK, CloudWatch, Datadog
- Cloud Platforms - AWS, GCP, Azure, or equivalent
- IaC & Automation - Terraform, Ansible, CloudFormation, Jenkins
Additional Information:
- The candidate should have a minimum of 3 years of experience in Site Reliability Engineering.
- This position is based at our Hyderabad office.
- 15 years of full-time education is required.
- Detail-oriented with a systematic approach to troubleshooting and problem resolution
- Strong communication skills, with the ability to explain technical concepts to both technical and non-technical stakeholders
- Proactive mindset, with the initiative to identify improvements and propose solutions
- Comfortable working in a fast-paced environment with on-call responsibilities
- Strong collaboration skills and ability to work effectively in cross-functional teams
- Continuous learning mindset and eagerness to stay current with emerging technologies