Reporting
The Support Analyst will typically report to the Head, Agile CoE.

Experience – 4-8 years
A Support Analyst should have a strong background in site reliability engineering principles, with proven expertise in designing and implementing solutions to ensure the reliability, availability, and performance of microservices-based systems. This role requires hands-on experience with cloud platforms, automation tools, and monitoring solutions, and a deep understanding of microservices architecture. The Support Analyst will lead a team of SREs (Site Reliability Engineers) at the partner and collaborate closely with development, operations, and other cross-functional teams to drive improvements in reliability, scalability, and efficiency across our microservices ecosystem.
Key Deliverables:
• Reliability Enhancements: Lead efforts to improve the reliability, availability, and performance of microservices-based systems.
• Incident Management: Develop and implement incident management processes and procedures to minimize service disruptions and downtime.
• Monitoring and Alerting: Design and implement robust monitoring and alerting solutions to proactively detect and mitigate issues impacting system reliability.
• Automation: Drive automation initiatives to streamline operations, deployment, and recovery processes for microservices.
• Capacity Planning: Collaborate with teams to perform capacity planning and scaling exercises to ensure optimal performance and resource utilization.
• Performance Optimization: Identify performance bottlenecks and optimization opportunities, working with teams to implement solutions for improved performance.

Behavioural Competencies
• Leadership Ability
• Problem Solving
• Empathy for customers

Technical Skills
• Microservices Architecture
• Cloud Platforms (AWS)
• Monitoring Tools (AppDynamics, ELK, CloudWatch)

Indicative Activities:
• Lead Reliability Initiatives: Drive efforts to improve reliability through proactive monitoring, automation, and performance optimization.
• Develop Incident Management Processes: Establish incident management processes and procedures to ensure timely resolution of incidents and minimize impact on users.
• Implement Automation Solutions: Identify opportunities for automation and develop scripts, tools, and workflows to automate routine tasks and streamline operations.
• Perform Capacity Planning: Collaborate with teams to assess resource requirements and plan for scaling needs based on usage patterns and growth projections.
• Optimize Performance: Analyze system performance metrics, identify areas for improvement, and implement optimizations to enhance system performance and reliability.
• Mentor Partner Team, Infrastructure Teams (DevOps), and Development Teams: Provide guidance, support, and mentorship to SRE team members, fostering a culture of continuous learning and development.