ML Infrastructure / MLOps Engineer

Aivar Innovations
Bengaluru, Karnataka

Job details

Qualifications

CI/CD
Kubernetes
RBAC
AWS
Continuous integration
S3
Python

Full job description

About Aivar Innovations

Aivar is an AI-first technology partner where cutting-edge technology meets industry expertise to supercharge your projects.

Team: Accelerators

Experience: 3–7 years | MLOps/ML platform + Kubernetes

Technical Focus: Own the JARK-Stack integration on EKS: Ray + KubeRay for distributed compute, Kubeflow Pipelines for workflow orchestration, MLflow for experiment tracking, JupyterHub for development, and advanced job schedulers (Kueue, Volcano, Argo) for batch training. Act as the bridge between data scientists and the platform.

Key Responsibilities:

Deploy and optimise Ray + KubeRay for distributed data processing and model training across GPU clusters.
Build Kubeflow Pipelines for reproducible ML workflows (data prep, training, evaluation, deployment) with lineage tracking.
Configure MLflow for centralised experiment tracking and model registry across teams.
Implement advanced job scheduling — queue management, priority, preemption, gang scheduling via Kueue/Volcano.
Build model CI/CD — automated training, evaluation, validation, and canary/blue-green deployment to inference endpoints.
Create self-service tooling for data scientists — cluster provisioning, GPU allocation, experiment templates.
Monitor ML workload performance — GPU utilisation, training throughput, data pipeline efficiency.

Must-Have Technical Skills:

ML infrastructure / MLOps / ML platform engineering (3+ years).
Kubernetes (EKS preferred) — deployments, PVs, RBAC, resource management.
At least two of: Ray/KubeRay, Kubeflow, MLflow, Airflow, Argo Workflows.
Distributed training — PyTorch DDP, Horovod, DeepSpeed, or Ray Train.
Model serving — KServe, Seldon, or custom FastAPI serving.
GPU scheduling and resource management on Kubernetes.
Strong Python engineering — tools and automation, not just notebooks.

Core Tech Stack:

Ray/KubeRay, Kubeflow Pipelines, MLflow, JupyterHub, Argo Workflows, Kueue/Volcano, PyTorch/DeepSpeed, KServe, Helm, AWS (EKS, S3, EFS, ECR), Prometheus/Grafana.
