Position Description:
We are looking for an experienced Availability Manager to join our team. The ideal candidate will be responsible for ensuring that IT services consistently meet agreed availability targets and align with current and future business needs. You will own the end-to-end Availability Management process, working closely with Service Owners, Incident, Problem, Change and Capacity Managers, infrastructure teams, application support, and vendors to proactively monitor, measure, analyse, and improve the availability of critical services. The role requires a strong understanding of ITIL practices, monitoring and observability tools, resilience engineering, and the ability to translate service performance data into actionable improvement plans.
Job Title: Availability Manager
Position: Senior System Engineer/Lead Analyst
Experience: 7-13 Years
Category: Senior System Engineer/Lead Analyst
Shift: US Shift Time
Main location: India, Karnataka, Bangalore, Electronic City
Key Responsibilities
1. Availability Management Process Ownership
- Own and manage the Availability Management process aligned with ITIL best practices
- Define availability strategies, policies, and standards
- Ensure services meet agreed SLAs, OLAs, and underpinning contracts
2. Service Availability Monitoring & Reporting
- Monitor system and application availability across environments
- Define and track availability KPIs and SLAs
- Produce dashboards and reports using ServiceNow and monitoring tools such as Dynatrace, AppDynamics, and SolarWinds
3. Availability Planning & Design
- Design availability models for new and existing services
- Conduct capacity and resilience planning
- Ensure high availability (HA), redundancy, and failover mechanisms are in place
4. Incident & Problem Management Integration
- Work closely with Incident and Problem Management teams
- Analyze outages and identify root causes impacting availability
- Drive permanent fixes to reduce downtime
5. Risk & Resilience Management
- Identify single points of failure (SPOFs)
- Conduct risk assessments and recommend mitigation strategies
- Ensure disaster recovery (DR) and business continuity plans are aligned with availability requirements
6. Continuous Service Improvement (CSI)
- Identify trends and recurring availability issues
- Recommend improvements to enhance uptime and performance
- Drive automation and predictive monitoring
Key Skills & Competencies
Technical Skills
- Strong knowledge of ITIL (Availability, Incident, Problem, Capacity Management)
- Experience with monitoring tools: Dynatrace, AppDynamics, SolarWinds, Nagios
- Experience with ITSM tools such as ServiceNow
- Understanding of infrastructure (servers, networks, cloud platforms)
Analytical Skills
- Strong data analysis and trend identification
- Ability to interpret availability reports and system metrics
- Root cause analysis and problem-solving mindset
Soft Skills
- Strong stakeholder communication
- Ability to work with cross-functional teams
- Proactive and preventive mindset
KPIs / Success Metrics
- Service availability (%) vs SLA
- Number and duration of outages
- Mean Time Between Failures (MTBF)
- Mean Time to Restore Service (MTRS/MTTR)
- Reduction in recurring availability issues
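For illustration only, the metrics above can be derived from a list of outage records roughly as follows. This is a minimal sketch, not a prescribed reporting method: the `Outage` class, the `availability_kpis` function, and the assumption that the whole reporting period counts as agreed service time (no planned-maintenance exclusions) are all hypothetical. MTBF and MTRS use the conventional definitions of uptime and downtime divided by the number of outages.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Hypothetical outage record: when service was lost and restored."""
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def availability_kpis(period_start, period_end, outages):
    """Return availability %, outage count, downtime, MTBF and MTRS (hours).

    Assumes the full period is agreed service time and outages do not overlap.
    """
    total_s = (period_end - period_start).total_seconds()
    down_s = sum(o.duration.total_seconds() for o in outages)
    up_s = total_s - down_s
    n = len(outages)
    return {
        "availability_pct": 100.0 * up_s / total_s,
        "outage_count": n,
        "downtime_hours": down_s / 3600,
        # MTBF: average uptime per outage; MTRS: average time to restore.
        "mtbf_hours": (up_s / n) / 3600 if n else None,
        "mtrs_hours": (down_s / n) / 3600 if n else None,
    }
```

With a 30-day period and two outages of 2 and 4 hours, this yields roughly 99.17% availability, an MTBF of 357 hours, and an MTRS of 3 hours, which can then be compared against the SLA target.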
Qualifications
- Bachelor's degree in IT, Engineering, or related field
- ITIL Certification (v3 / v4) preferred
- 5–10 years of experience in IT Service Management
Preferred Experience
- Experience in enterprise or global IT environments
- Exposure to cloud platforms (Azure, AWS, GCP)
- Experience with high-availability and disaster recovery design
Your future duties and responsibilities
- Define, implement, and continually improve the Availability Management process in line with ITIL best practices and client SLAs/OLAs/UCs.
- Monitor, measure, analyse, and report on the availability, reliability, maintainability, and serviceability of IT services and supporting components.
- Produce and maintain the Availability Plan, reflecting current and future business needs, and ensure it aligns with the Service Level and Capacity Management processes.
- Proactively identify single points of failure, availability risks, and improvement opportunities; drive remediation through risk assessments, CFIA, FTA, and SOA techniques.
- Investigate and lead root cause analysis for major availability-impacting incidents and chronic issues, and ensure preventive actions are tracked to closure.
- Define and validate availability and recovery requirements for new and changed services, and participate in design reviews, Change Advisory Boards (CAB), and major release readiness assessments.
- Establish, track, and report KPIs such as service availability %, MTBF, MTRS, MTBSI, and unplanned downtime against agreed targets.
- Collaborate with Incident, Problem, Change, Capacity, and Continuity Managers to ensure an integrated approach to service quality and resilience.
- Engage with internal stakeholders, customers, and third-party vendors to review service performance, agree on improvement actions, and present availability dashboards in governance forums.
- Drive continual service improvement (CSI) initiatives, contribute to RCA reports, post-incident reviews, and produce executive-level availability reports.
- Ensure availability requirements are documented, maintained, and accessible in the Service Knowledge Management System (SKMS)/CMDB.
Must-Have Skills:
- Strong working knowledge of ITIL v3/v4 Service Management framework, with proven experience in Availability Management (ITIL Foundation mandatory; Intermediate / Specialist certification preferred).
- Hands-on experience designing and operating Availability Management processes in large, complex enterprise environments.
- Experience with monitoring, observability, and APM tools such as Dynatrace, AppDynamics, Splunk, ServiceNow ITOM, SCOM, Nagios, Prometheus, or Grafana.
- Strong experience with ServiceNow (or equivalent ITSM platforms) for incident, problem, change, and availability reporting workflows.
- Solid understanding of infrastructure components (servers, storage, network, databases, middleware, cloud) and how they impact end-to-end service availability.
- Proven ability to perform availability analysis techniques such as Component Failure Impact Analysis (CFIA), Fault Tree Analysis (FTA), Service Outage Analysis (SOA), and risk assessments.
- Strong data analysis and reporting skills, with the ability to build dashboards and present trends, KPIs, and improvement recommendations to senior stakeholders.
- Excellent problem-solving, analytical, and decision-making skills, especially under pressure during major incidents.
- Strong communication, stakeholder management, and collaboration skills, with the ability to engage technical teams and business leadership.
Good-to-Have Skills:
- Experience with cloud platforms such as AWS, Azure, or GCP and understanding of cloud-native availability and resilience patterns (HA, DR, multi-region, auto-scaling).
- Exposure to Site Reliability Engineering (SRE) practices, SLO/SLI/Error Budget concepts, and chaos engineering.
- Experience with IT Service Continuity Management (ITSCM) and Disaster Recovery planning and testing.
- Familiarity with automation and scripting (PowerShell, Python, or Shell) for availability reporting and monitoring integration.
- ITIL 4 Managing Professional, ITIL Specialist – Drive Stakeholder Value, or equivalent advanced certifications.
- Experience working in a 24x7 global delivery model with US-based clients.
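As a rough illustration of the SLO and error budget concepts mentioned above, the sketch below converts an availability SLO into allowed downtime for a period and reports how much of that budget has been used. The function name `error_budget` and its parameters are hypothetical, and it assumes the SLO is measured purely as time-based availability.

```python
def error_budget(slo_pct, period_minutes, downtime_minutes):
    """Return allowed, remaining and consumed error budget for a period.

    slo_pct is the availability target in percent (for example 99.9).
    """
    allowed = period_minutes * (1 - slo_pct / 100)
    remaining = allowed - downtime_minutes
    consumed_pct = 100 * downtime_minutes / allowed if allowed else None
    return {
        "allowed_minutes": allowed,
        "remaining_minutes": remaining,  # negative means the SLO was missed
        "consumed_pct": consumed_pct,
    }
```

For a 99.9% SLO over a 30-day period (43,200 minutes), the budget is about 43.2 minutes of downtime; 10 minutes of outage would consume roughly 23% of it.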
CGI is an equal opportunity employer. In addition, CGI is committed to providing accommodation for people with disabilities in accordance with provincial legislation. Please let us know if you require reasonable accommodation due to a disability during any aspect of the recruitment process and we will work with you to address your needs.
Together, as owners, let’s turn meaningful insights into action.
Life at CGI is rooted in ownership, teamwork, respect and belonging. Here, you’ll reach your full potential because…
You are invited to be an owner from day 1 as we work together to bring our Dream to life. That’s why we call ourselves CGI Partners rather than employees. We benefit from our collective success and actively shape our company’s strategy and direction.
Your work creates value. You’ll develop innovative solutions and build relationships with teammates and clients while accessing global capabilities to scale your ideas, embrace new opportunities, and benefit from expansive industry and technology expertise.
You’ll shape your career by joining a company built to grow and last. You’ll be supported by leaders who care about your health and well-being and provide you with opportunities to deepen your skills and broaden your horizons.
Come join our team—one of the largest IT and business consulting services firms in the world.