Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.
Primary Responsibilities:
-
Define and own the SRE, AI-enabled operations, and observability strategy for the assigned portfolio, aligned with organizational goals and focused on improving reliability, stability, security, scalability, supportability, resilience, automation, and operational excellence across all digital properties
-
Act as a senior technical leader who bridges Site Reliability Engineering, software engineering, IT operations, AI engineering, observability platform engineering, cloud/platform teams, data engineering, and business technology leadership
-
Provide technical leadership, mentorship, and strategic guidance to senior and mid-level SREs, platform engineers, AI implementation teams, observability engineers, and cross-functional technology teams
-
Foster a culture of engineering excellence, continuous learning, proactive reliability, automation-first operations, production ownership, and operational discipline
-
Collaborate with engineering, security, architecture, product, data platform, cloud, infrastructure, operations, and business leaders to integrate reliability, observability, AI-enabled automation, and operational best practices into products and platforms from design through deployment
-
Report to senior stakeholders and CIO-level leaders on critical paths, operational risks, reliability posture, production readiness, mitigation plans, technology debt, AI adoption opportunities, and strategic SRE initiatives
-
Define, govern, and continuously improve enterprise reliability standards, including SLAs, SLIs, SLOs, error budgets, operational risk scoring, production readiness criteria, resilience scorecards, and service health models
-
Lead the reliability and peak season readiness initiatives by owning the assessment framework, collaborating with application teams, identifying reliability gaps, and driving critical applications toward 99.999% availability from a resiliency, availability, and reliability perspective
-
Architect and govern enterprise-grade monitoring, alerting, and observability standards across lines of business using platforms such as Splunk, Dynatrace, Grafana, DataDog, OpenTelemetry, ServiceNow, cloud-native monitoring tools, and next-generation observability platforms
-
Drive the transition from static dashboards and tool-specific monitoring to unified, intelligent, business-impact-aligned observability that provides visibility into application health, infrastructure health, customer experience, service reliability, operational risk, and business impact
-
Lead the design and implementation of a modern enterprise observability dashboard and intelligence platform using technologies such as React, JavaScript, TypeScript, REST APIs, Snowflake, Kafka Streaming, cloud-native services, and enterprise data platforms
-
Partner with data engineering and platform teams to design scalable data models, telemetry pipelines, event streams, API integrations, and analytical capabilities using Snowflake, relational databases, Kafka, streaming platforms, and observability data sources
-
Integrate real-time and near-real-time telemetry from logs, metrics, traces, events, alerts, incidents, change records, service metadata, cloud platforms, infrastructure platforms, and business systems
-
Ensure the observability dashboard supports service health views, dependency mapping, role-based views, SLO tracking, alert correlation, incident insights, customer impact analysis, capacity trends, executive reporting, and AI-assisted recommendations
-
Lead the strategy, design, and implementation of AI-enabled observability, AIOps, and intelligent automation capabilities to transform incident management from reactive to proactive, predictive, and increasingly autonomous
-
Drive implementation of AI and GenAI capabilities for incident triage, impact assessment, log analysis, anomaly detection, event correlation, root cause analysis, knowledge retrieval, runbook recommendation, production readiness validation, and automated remediation
-
Partner with engineering and platform teams to integrate LLM-based triage, Agentic AI workflows, AI-powered observability, and automated remediation into SRE workflows, on-call processes, incident response, and operational support models
-
Identify practical AI implementation opportunities that reduce alert noise, accelerate root cause analysis, reduce manual toil, improve developer productivity, and deliver measurable improvements in MTTD, MTTA, MTTR, and MTBI
-
Work with security, architecture, data governance, and platform teams to ensure AI-enabled solutions are implemented securely, responsibly, explainably, and in alignment with enterprise standards
-
Analyze and model system dependencies across applications, APIs, infrastructure, databases, cloud services, message streams, third-party integrations, and business-critical workflows
-
Conduct risk and threat modeling for operational scenarios including natural disasters, cloud region failures, cyberattacks, infrastructure failures, software defects, data pipeline failures, dependency failures, and peak-volume business events
-
Design and implement resilience patterns such as automated failover, geo-redundancy, circuit breakers, bulkheads, throttling, graceful degradation, blue-green deployments, canary deployments, automated rollback, and self-healing automation
-
Lead chaos engineering strategy and execution to proactively identify failure modes, validate system resilience, and improve recovery readiness
-
Provide technical leadership across hybrid hosting environments including Unix, Linux, Windows, Azure, AWS, GCP, private cloud, Kubernetes, containers, serverless platforms, and enterprise hosting platforms
-
Partner with infrastructure, cloud, network, security, and application teams to ensure platforms are reliable, scalable, secure, observable, resilient, cost-efficient, and supportable
-
Lead technology transformation efforts including cloud migration strategy, HCP assessment and adoption, platform modernization, containerization, serverless architecture, open source and inner source adoption, and automation-led operations
-
Guide teams on modern technology trends, emerging AI capabilities, evolving observability practices, changing cloud/platform technologies, and new engineering patterns that can improve reliability and operational effectiveness
-
Own and drive automation strategy to eliminate manual toil by designing scalable automation frameworks for runbooks, incident response, change validation, operational support, reporting, remediation, and self-service operations
-
Define and track toil metrics, automation coverage, operational efficiency metrics, incident trends, reliability improvement outcomes, and continuous improvement opportunities
-
Build automation-first operational models using scripting, APIs, workflow automation, CI/CD integration, AI-assisted workflows, and reusable engineering patterns
-
Improve operational tooling and frameworks by evaluating, selecting, standardizing, and governing tools across the SRE, observability, AI operations, and platform engineering portfolio
-
Act as a senior gatekeeper for production changes by establishing change governance processes, operational risk scoring, AI-assisted readiness validation, rollback validation, and release reliability standards
-
Lead incident response for P1 and P2 incidents, including war room facilitation, executive communication, technical triage, impact assessment, recovery coordination, root cause analysis, and post-incident review processes
-
Respond to platform emergencies, alerts, and escalations from customer support, business operations, application teams, and technology partners while ensuring root cause is addressed and corrective actions are implemented
-
Leverage ServiceNow and ITSM processes for incident, problem, change, knowledge, configuration, and service management at enterprise scale
-
Participate in and lead on-call rotation, setting the standard for on-call excellence, operational readiness, knowledge sharing, escalation management, and continuous improvement
-
Create and maintain architectural diagrams, flow diagrams, runbooks, operational playbooks, executive-level reports, service health documentation, dashboard documentation, and AI-enabled operational process documentation
-
Comply with the terms and conditions of the employment contract, company policies and procedures, and any and all directives (such as, but not limited to, transfer and/or re-assignment to different work locations, change in teams and/or work shifts, policies regarding flexibility of work benefits and/or work environment, alternative work arrangements, and other decisions that may arise due to the changing business environment). The Company may adopt, vary or rescind these policies and directives in its absolute discretion and without any limitation (implied or otherwise) on its ability to do so
Required Qualifications:
-
Bachelor's degree in Computer Science, Information Technology, Engineering, Data Science, or a related field preferred
-
15+ years of overall experience in the IT industry across software development, infrastructure, operations, platform engineering, cloud engineering, production support, or enterprise technology delivery
-
9+ years of hands-on experience in Site Reliability Engineering, Platform Engineering, Production Engineering, DevOps, Cloud Operations, or a similar role with demonstrated leadership in driving reliability at enterprise scale
-
9+ years of experience designing, implementing, and governing monitoring, alerting, and observability architectures for cloud, hybrid, and enterprise software solutions using tools such as Splunk, Dynatrace, DataDog, Grafana, OpenTelemetry, ServiceNow, or similar platforms
-
7+ years of coding or scripting experience with two or more of the following: Java, Python, Go, JavaScript, TypeScript, C#, C/C++, Perl, PowerShell, Shell scripting, Mainframe technologies, or similar languages
-
5+ years of experience building, designing, integrating, and programmatically consuming REST APIs at scale
-
2+ years of experience mentoring and providing technical leadership to SRE engineers, software engineers, platform engineers, observability engineers, AI engineers, or cross-functional technology teams
-
Solid hands-on experience implementing SRE practices across large-scale enterprise applications, including SLAs, SLIs, SLOs, error budgets, monitoring, alerting, incident response, capacity planning, performance engineering, resilience engineering, and production readiness
-
Demonstrated experience defining, managing, and operationalizing SLAs, SLIs, SLOs, error budgets, and reliability metrics as operational standards
-
Proven practical experience with AI implementation, AIOps, AI-enabled observability, intelligent incident detection, event correlation, anomaly detection, automated response, or LLM-based triage
-
Experience identifying AI use cases, designing implementation patterns, integrating AI capabilities into operational workflows, and measuring business or operational outcomes
-
Experience building or supporting observability dashboards, operational intelligence platforms, service health portals, or executive reporting solutions
-
Experience with modern front-end or dashboard development technologies such as React, JavaScript, TypeScript, HTML, CSS, REST APIs, UI components, and data visualization frameworks
-
Experience working with data platforms such as Snowflake, SQL Server, PostgreSQL, MySQL, or similar relational, analytical, or operational data stores
-
Experience with streaming or event-driven platforms such as Kafka, Kafka Streams, event hubs, message queues, or similar technologies
-
Experience integrating observability data from logs, metrics, traces, events, alerts, incidents, changes, service metadata, infrastructure platforms, cloud platforms, and business systems
-
Experience with automation and deployment tools such as Terraform, Ansible, Jenkins, GitHub Actions, GitLab CI/CD, Azure DevOps, Argo CD, Helm, Kubernetes operators, or similar tools
-
Experience with programmatic interaction with relational databases and data-driven operational decision-making
-
Experience leading incident response for P1/P2 production incidents, including war room facilitation, executive stakeholder communication, root cause analysis, and post-incident review processes
-
Experience leveraging ServiceNow or similar ITSM platforms for incident, problem, change, knowledge, configuration, and service management processes
-
Experience in health care, insurance, financial services, government programs, regulated environments, or large-scale enterprise technology operations
-
Solid understanding of hybrid hosting and infrastructure platforms including Unix, Linux, Windows, Azure, AWS, GCP, private cloud, containers, Kubernetes, serverless platforms, and enterprise hosting platforms
-
Familiarity with GenAI, Agentic AI, LLM-based assistants, AI copilots, prompt engineering, semantic search, RAG patterns, vector databases, model integration, AI governance, and responsible AI practices
-
Proven track record of planning, supporting, or improving 99.999% availability for critical applications in production environments
-
Solid architectural understanding of engineering fundamentals including unit testing, performance testing, chaos engineering, code reviews, telemetry, Agile, DevOps, CI/CD, security, API design, and production readiness
-
Proven deep expertise in CI/CD pipelines, containerization, serverless architecture, public cloud, private cloud, application observability, messaging, streaming architecture, and platform automation
-
Demonstrated ability to guide technical priorities, conduct design reviews, influence architecture decisions, define engineering standards, and drive adoption of modern technology practices
-
Proven ability to evaluate emerging technologies, understand changing technology dynamics, guide teams on adoption strategy, and translate new technology capabilities into practical enterprise implementation plans
-
Proven ability to communicate effectively with technical and non-technical, globally distributed audiences, including presenting to senior leadership and CIO-level stakeholders on reliability posture, AI initiatives, operational risk, and strategic technology direction
-
Solid technical writing skills, including creating architectural diagrams, flow diagrams, runbooks, end-user documentation, operational playbooks, executive-level reports, and technology strategy documents
-
Flexibility to support 24x7 operations through shift-based, on-call, and rotational support models
Preferred Qualification:
-
AI Dojo certification Level 1, Level 2, and Level 3
At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health, which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes - an enterprise priority reflected in our mission.