In this role, you will:
- Architect, design, and operate distributed, highly available, and resilient systems for multi-tenant, horizontally scalable, and cost-efficient architectures that deliver consistent latency, throughput, and durability across OCI regions.
- Collaborate cross-functionally with Compute, Storage, Networking, OKE, and Functions to deliver new platform features focusing on compute and control plane services, enforce secure-by-default designs, and improve overall service reliability.
- Mentor and guide engineers in distributed systems design, high-scale data processing, and operational excellence; set and raise engineering standards across multiple teams.
- Drive operational excellence by owning service-level objectives (availability, latency, durability) and reducing toil through automation, observability, and self-healing mechanisms.
- Own the full service lifecycle, from design and implementation to deployment, on-call, and continuous improvement, while maintaining high code and reliability standards.
- Define and drive the technical roadmap for compute and control plane services.
- Partner with product management and field teams to translate customer needs into roadmap priorities.
- Contribute to the broader Compute vision, influencing how Compute services evolve to support mission-critical workloads globally.
Qualifications:
- 12+ years of development experience with large-scale, highly available distributed systems.
- Proficiency in Java programming patterns; programming experience with Scala or Python is preferred.
- Advanced knowledge of data structures, algorithms, and operating systems.
- Experience operating distributed services at scale.
- Expertise in Linux and operating systems.
- Systematic problem-solving approach, strong communication skills, and strong ownership and drive.
- Deep understanding of service metrics and alarms, gained through developing dashboards, service KPIs, and alarming systems.
- Ability to propose, scope, design, and direct automation, optimizations, and enhancements.
- BS or MS degree in Computer Science/Engineering or a related IT field or equivalent experience relevant to functional area.
Soft Skills & Leadership:
- Proven ability to drive technical outcomes, take ownership of deliverables, and work independently in fast-evolving AI solution spaces.
- Strong communication skills, with the ability to articulate technical concepts, document solution approaches, and collaborate across multiple geo-distributed teams.
- Demonstrated problem-solving ability leveraging AI, distributed systems, and cloud-native application behaviors.
- A proactive, experimentation-oriented mindset with a strong willingness to learn and to guide the team on emerging AI technologies, frameworks, and engineering patterns.
We are looking for a hands-on senior engineer with technical depth and breadth, proven experience solving cloud-scale problems, and distributed systems design and implementation experience to build fault-tolerant solutions that will form the foundation of the next generation of Compute offerings. The candidate is expected to have strong written and verbal communication skills, the ability to lead projects across organizational boundaries, and experience representing their work to senior leaders.
As a member of the software engineering division, you will take an active role in the definition and evolution of standard practices and procedures. You will define specifications for significant new projects and specify, design, and develop software according to those specifications. You will perform professional software development tasks associated with developing, designing, and debugging software applications or operating systems.