ROLE OVERVIEW
The AI/ML Engineer builds the intelligence core of the AI Ops platform. You will design, build, train, and deploy the predictive models that drive autonomous infrastructure management - including failure forecasting, anomaly detection, capacity modelling, and root cause analysis.
Beyond model development, you will own the MLOps pipeline, the agentic decision engine, the continuous learning loop, and the model registry - ensuring the platform's AI capabilities improve with every resolved incident and learned pattern.
KEY RESPONSIBILITIES
Design and train the Failure Forecast model targeting 94%+ accuracy at a 6-hour prediction horizon
Build the Anomaly Detection model targeting 97%+ precision for real-time infrastructure signal classification
Develop the Capacity Model for 24-hour resource demand forecasting at 91%+ accuracy
Build and tune the Root Cause AI model targeting 89%+ average confidence on incident attribution
Design and implement the agentic decision engine - the Detect, Predict, Decide, Heal loop that drives auto-remediation
Build and maintain the MLOps pipeline: data ingestion, feature engineering, model training, validation, and deployment
Implement the Continuous Learning module: automated retraining triggers, knowledge base indexing, and runbook auto-generation
Develop the model registry API layer exposing real-time accuracy scores and model metadata to the platform
Integrate AI outputs with the Incident Control backend for root cause enrichment and resolution recommendation
Design the Digital Twin data model for real-time infrastructure state representation
Monitor model performance in production and implement drift detection and alerting
Collaborate with Backend Engineers on inference API design for sub-100ms prediction latency
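As an illustration of the drift detection responsibility above, the sketch below computes a Population Stability Index (PSI) between a reference sample (for example, training-time feature values) and a live sample. PSI is only one common choice, not a method this role prescribes; the function names, the bin count, and the 0.25 alert threshold (a widely used rule of thumb) are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], live: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample.

    Bin edges are taken from the reference sample (equal-width bins).
    """
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 if the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    live_p = proportions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_p, live_p))


def drift_alert(score: float, threshold: float = 0.25) -> bool:
    """Return True when the PSI score meets the (assumed) alert threshold."""
    return score >= threshold


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]      # hypothetical training values
    shifted = [0.5 + i / 200 for i in range(100)]  # hypothetical live values
    print(drift_alert(psi(reference, reference)))  # identical samples
    print(drift_alert(psi(reference, shifted)))    # shifted distribution
```

In practice a check like this would run per feature on a schedule, with the alert feeding the retraining triggers described in the Continuous Learning responsibility.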
EXPECTATIONS
Achieve and maintain model performance targets: 94% accuracy (Failure Forecast), 97% precision (Anomaly Detection), 91% accuracy (Capacity), 89% average confidence (Root Cause)
Deliver a fully operational agentic loop processing 47,600+ AI decisions per day from go-live
Ensure model inference endpoints respond within 100 ms at p95 for all production predictions
Keep human escalation rate below 2.5% through high-confidence autonomous decision-making
Implement a continuous learning pipeline that indexes and learns from every resolved incident
Produce explainable AI outputs with confidence scores on all predictions surfaced in the UI
Document all model architectures, training datasets, evaluation metrics, and retraining schedules
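To make the throughput, escalation, and latency expectations above concrete, the following sketch works through the arithmetic they imply. The constants come from this document; the helper names and the nearest-rank percentile method are assumptions chosen for illustration, and a production system may well compute p95 differently.

```python
from typing import Sequence

DECISIONS_PER_DAY = 47_600    # from the agentic loop expectation
MAX_ESCALATION_RATE = 0.025   # 2.5% human escalation ceiling
P95_BUDGET_MS = 100.0         # inference latency budget at p95

# Average decision rate if the load were spread evenly over 24 hours.
avg_decisions_per_sec = DECISIONS_PER_DAY / 86_400

# Daily escalation ceiling implied by the 2.5% rate (about 1,190).
max_escalations_per_day = DECISIONS_PER_DAY * MAX_ESCALATION_RATE


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty latency sample."""
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_budget(latencies_ms: Sequence[float]) -> bool:
    """True when the sample's p95 is within the 100 ms budget."""
    return p95(latencies_ms) <= P95_BUDGET_MS


if __name__ == "__main__":
    print(f"average rate: {avg_decisions_per_sec:.2f} decisions/sec")
    print(f"escalation ceiling: {max_escalations_per_day:.0f} per day")
    sample = [float(ms) for ms in range(1, 101)]  # hypothetical latencies
    print(f"p95 = {p95(sample)} ms, within budget: {meets_latency_budget(sample)}")
```

The average rate is modest (well under one decision per second), so the latency target is likely to be driven by peak bursts rather than steady-state volume.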
SKILLS & COMPETENCIES
Technical Skills
Deep expertise in supervised and unsupervised machine learning: anomaly detection, classification, regression, time-series forecasting
Proficiency in Python ML stack: scikit-learn, XGBoost, LightGBM, PyTorch or TensorFlow
Time-series modelling: LSTM, Transformer-based models, Prophet, or ARIMA for infrastructure metric forecasting
MLOps platform experience: MLflow, Kubeflow, Weights & Biases, or equivalent for experiment tracking and model management
Feature engineering from infrastructure telemetry data: metrics, logs, traces, and events
Experience building and deploying inference APIs at low-latency production scale
Knowledge of LLM integration for runbook generation and knowledge base indexing
GPU compute management for model training on cloud platforms (AWS SageMaker, Azure ML, or GCP Vertex AI)
Familiarity with reinforcement learning or agentic AI patterns for autonomous remediation design
Statistical proficiency: hypothesis testing, confidence intervals, distribution analysis
Functional & Soft Skills
Ability to translate complex model behaviour into clear, confidence-scored outputs for operations teams
Strong experimental mindset - rigorous about evaluation metrics and avoiding overfitting
Collaborative engineering partner to Backend and DevOps engineers on pipeline integration
Clear documentation practices for model cards and training pipelines