Service stability and incident management
- Ensure maximum service quality and stability through prompt and effective response to technical incidents.
- Act as a catalyst for change by performing incident and problem analysis, identifying root causes, and driving continual service improvement (CSI) initiatives.
- Where relevant, perform a control function to ensure that new technology changes do not introduce instability into the production environment.
Monitoring and observability
- Drive and achieve “north star” monitoring and observability goals.
- Build comprehensive monitoring, alerting, and logging for critical services, enabling proactive detection and rapid remediation of issues.
Automation and operational excellence
- Automate operational tasks such as deployments, monitoring, scaling, and infrastructure management to reduce manual effort and operational risk.
Site Reliability Engineering (SRE) practices
- Participate in incident response, troubleshooting, and post-incident reviews (post-mortems) to minimise downtime and institutionalise learning from failures.
- Optimise infrastructure, systems, and processes for performance, efficiency, and reliability.
- Contribute to the design and implementation of robust deployment pipelines and release strategies that enable smooth, frequent, and reliable releases (e.g. blue/green, canary).
Change, release, and rollout management
- Review and implement production-related changes, releases, and rollouts with zero or minimal impact on application stability and client experience.
- Review and coordinate dependent changes across surrounding systems, infrastructure, networks, and shared services.
- Ensure thorough technical plans are in place for all production changes, including implementation steps, fallback/rollback strategies, data conversion or migration plans, and validation checks.
Reporting and continuous improvement
- Drive closure of remediation actions to prevent recurrence of incidents.
Collaboration, coaching, and knowledge sharing
- Participate in and support cross-training and structured knowledge transfer activities within and across support and engineering teams.
AI and automation for production engineering
- Use AI-driven tools (e.g. for log analysis, anomaly detection, alert correlation, and capacity forecasting) to proactively identify, diagnose, and resolve production issues.
- Collaborate with engineering and platform teams to integrate AI/ML capabilities into monitoring, incident management, and self-healing workflows (e.g. automated remediation, intelligent runbooks).
- Continuously review and refine AI-enabled alerts, models, and automations based on production behaviour, incident learnings, and feedback from support teams.
- Promote the safe and compliant adoption of AI solutions within production engineering, ensuring adherence to the bank’s risk, security, and data governance standards.