Senior AI Site Reliability Engineer
Meet the Team
We are the Data Center Network Services team within Cisco IT that supports network services for Cisco Engineering and business functions worldwide. Our mission is simple – build the network of the future that is adaptable and agile on Cisco’s networking solutions. Cisco IT networks are deployed, monitored, and managed with a DevOps approach to support rapid application changes. We invest in transformative technologies that enable us to deliver services in a fast and reliable manner.
The team culture is collaborative and fun, and we encourage creative thinking and tinkering with new ideas.
Your Impact
You will be responsible for designing, developing, testing, and deploying advanced AI-driven software features for data center networks. You have strong interpersonal skills and are comfortable collaborating with fellow engineers, cross-functional engineering teams, and internal clients. You will create and implement innovative, high-quality capabilities to provide our clients with the best possible experience.
Minimum Requirements
- Bachelor of Engineering or Technology with a minimum of 10 years of experience and demonstrated ability in designing and building scalable and reliable networking solutions specifically for AI/ML infrastructure and high-performance computing environments.
- Proven leadership skills to successfully drive strategic automation initiatives, influence and guide the team’s direction in automation, and foster a culture of continuous improvement and innovation by proactively identifying and implementing solutions designed to enhance service reliability and operational efficiency for long-term success.
- Strong work experience with Cisco Data Center Networking technologies
- Strong programming skills and concepts to deliver networking technologies
- Expertise with Continuous Integration and Continuous Delivery (CI/CD), including setting up CI/CD pipelines
- Proficiency in Terraform and Ansible for Infrastructure as Code (IaC)
- Experience with tools including Jira, Git, and Jenkins
- Solid grasp of software engineering concepts including common data structures/standard algorithms, object-oriented design, distributed computing and cloud computing paradigms.
- Expertise in AI Fabric with a deep understanding of high-performance networking for AI/ML workloads.
- Experience managing networking for GPU cluster environments.
- Ability to implement and utilize AI-based observability tools.
- Ability to forecast infrastructure needs for scaling AI workloads and managing the lifecycle of hardware/software releases.
- Experience with technologies such as Routing, Switching, Nexus, vPC, VDC, VLAN, VXLAN, and BGP
- Experience with ACI networks.
- Experience in creating documentation and training materials
- Ability to work closely with Business Units to resolve hardware/software interoperability issues.
Preferred Qualifications
- Good understanding of Build & Release Operations
- Good understanding of DevOps principles
- Comfortable with Agile practices and a belief in “quality driven” development
- Understanding of Unix/Linux
- Domain knowledge about contemporary network technologies, network management and protocols
- Experience with application/platform instrumentation, measurement, log data processing, and monitoring
- CCNA or CCNP
- Experience in managing Cisco Nexus Dashboard and APIC for centralized policy, monitoring, and fabric orchestration.
- Experience with Nexus Dashboard Fabric Controller
- Experience with VXLAN-based networks and troubleshooting
Why Cisco?
At Cisco, we’re revolutionizing how data and infrastructure connect and protect organizations in the AI era – and beyond. We’ve been innovating fearlessly for 40 years to create solutions that power how humans and technology work together across the physical and digital worlds. These solutions provide customers with unparalleled security, visibility, and insights across the entire digital footprint.
Fueled by the depth and breadth of our technology, we experiment and create meaningful solutions. Add to that our worldwide network of doers and experts, and you’ll see that the opportunities to grow and build are limitless. We work as a team, collaborating with empathy to make really big things happen on a global scale. Because our solutions are everywhere, our impact is everywhere.
We are Cisco, and our power starts with you.