- Manage capacity and performance to help scale infrastructure on both public and private clouds around the world
- Define and implement standards and best practices for system architecture, deployment, metrics, and operational tasks
- Support services through activities such as monitoring availability and system health, and incident response
- Improve system performance, application delivery, and efficiency through automation, process refinement, postmortem reviews, and in-depth configuration analysis
- Engage in communications across all areas of the organization
- Troubleshoot and monitor production systems to maintain the highest possible uptime
- Support and improve existing high-availability architecture solutions and manage their operational activity
- Integrate Generative AI (GenAI) and AIOps tools to automate incident detection, root cause analysis, and resolution workflows (e.g., self-healing scripts, intelligent runbooks), reducing manual toil and accelerating response times
- Apply prompt engineering techniques to enhance interactions with AI-based observability and automation platforms, improving the accuracy and efficiency of AI responses
- Leverage platform-specific AI capabilities (e.g., AWS Bedrock, Azure OpenAI, GCP Vertex AI) to architect intelligent SRE solutions tailored to cloud environments
- Experience in one or more high-level programming languages such as Python, Ruby, or Go, and familiarity with object-oriented programming
- Design and implement CI/CD/CT pipelines on one or more tool stacks such as Jenkins, Bamboo, Azure DevOps, and AWS CodePipeline with
- Proficiency in one or more Infrastructure as Code tools (e.g., Terraform, CloudFormation, Azure ARM)
- Developing and managing monitoring and log analysis tools to support operations, with exposure to tools including but not limited to AppDynamics, Datadog, Splunk, Kibana, Prometheus, Grafana, and Elasticsearch
- Hands-on experience with AIOps processes and platforms (e.g., Dynatrace, Splunk, ServiceNow) for incident management, observability, and noise reduction
- Familiarity with prompt engineering, GenAI applications, and AI/ML frameworks (e.g., TensorFlow, PyTorch) or platforms (e.g., OpenAI, Vertex AI)
- Awareness of agentic AI solutions applicable to operations and support