Roles and Responsibilities
- Design and implement the full lifecycle of services, from conception through system design, build, and deployment
- Develop software solutions that enable the operability of large-scale distributed systems capable of handling millions of transactions and petabytes of data
- Manage capacity and performance to help scale the infrastructure on both public and private clouds around the world
- Define and implement standards and best practices for system architecture, deployment, metrics, and operational tasks
- Support services through activities such as monitoring availability and system health and responding to incidents
- Improve system performance, application delivery, and efficiency through automation, process refinement, postmortem reviews, and in-depth configuration analysis
- Communicate across all areas of the organization
- Troubleshoot and monitor production systems to maintain the highest possible uptime
- Support and improve existing high-availability architecture solutions, and manage their operational activity
- Integrate Generative AI (GenAI) and AIOps tools to automate incident detection, root cause analysis, and resolution workflows (e.g., self-healing scripts, intelligent runbooks), reducing manual toil and accelerating response times
- Apply prompt engineering techniques to enhance interactions with AI-based observability and automation platforms, improving the accuracy and efficiency of AI responses
- Leverage platform-specific AI capabilities (e.g., AWS Bedrock, Azure OpenAI, GCP Vertex AI) to architect intelligent SRE solutions tailored to cloud environments
- Design, implement, and maintain AI/ML-driven monitoring and alerting systems that proactively detect anomalies and predict potential failures, enabling preemptive remediation
- Develop and train machine learning models using operational telemetry (logs, metrics, events, traces) to support predictive analytics and intelligent automation
- Evaluate and deploy AIOps platforms (e.g., Moogsoft, Dynatrace, Splunk, BigPanda, Datadog, Elastic) to enhance observability, reduce noise, and accelerate incident resolution
- Experience in one or more high-level programming languages such as Python, Ruby, or Go, and familiarity with object-oriented programming
Technology->DevOps->Site Reliability Engineering (SRE), Technology->DevOps->DevOps Architecture Consultancy, Technology->Artificial Intelligence->Artificial Intelligence - ALL