Job Description
At Oracle Health, we’re building the next generation of reliable, AI-powered healthcare infrastructure.
Our Clinical AI Assistant platform supports healthcare professionals operating in real-world clinical environments where reliability, scalability, and operational excellence are critical. We’re looking for experienced Site Reliability Engineers who want significant ownership, difficult technical challenges, and the opportunity to influence how large-scale AI systems are operated.
This is a hands-on engineering role focused on solving hard distributed systems problems, improving platform resilience, and building intelligent operational capabilities at scale.
What You’ll Own
- Contribute to the reliability and operational excellence of large-scale cloud-native healthcare platforms.
- Design, implement, and operate highly available distributed systems supporting AI-driven healthcare services.
- Develop automation, operational tooling, and self-healing capabilities to improve system efficiency and reliability.
- Enhance system scalability, observability, deployment safety, and incident management processes.
- Investigate production issues, perform root cause analysis, and implement effective corrective and preventive actions.
- Build and support AIOps capabilities such as anomaly detection, automated alerting, predictive scaling, and remediation workflows.
- Collaborate with software engineering and platform teams to improve system resiliency, performance, and operational readiness.
- Implement and maintain solutions leveraging Kubernetes, CI/CD pipelines, Infrastructure as Code (IaC), and cloud-native technologies.
- Participate in operational reviews, knowledge sharing, and continuous improvement initiatives to strengthen engineering practices.
What We’re Looking For
- 3–5+ years of experience in Site Reliability Engineering, DevOps, Production Engineering, Cloud Infrastructure, or related roles
- Experience operating and supporting production systems with high availability and reliability requirements
- Good understanding of distributed systems, cloud infrastructure, and reliability engineering principles
- Hands-on experience with Kubernetes, container orchestration, and containerized applications
- Strong scripting, automation, and software development skills using modern programming languages
- Experience building operational tooling and improving system reliability through automation
- Strong troubleshooting, debugging, and performance tuning skills in Linux-based environments
- Experience with observability platforms, monitoring, logging, tracing, and alerting technologies
- Ability to collaborate effectively with software engineering and platform teams to improve system reliability and operational efficiency
- Strong problem-solving skills with a focus on operational excellence, continuous improvement, and learning
Helpful Experience