Aivar is an AI-first technology partner where cutting-edge technology meets industry expertise to supercharge your projects.
Experience: 3–7 years | MLOps/ML platform + Kubernetes
Technical Focus: Own the JARK-Stack integration on EKS: Ray + KubeRay for distributed compute, Kubeflow Pipelines for workflow orchestration, MLflow for experiment tracking, JupyterHub for development, and advanced job schedulers (Kueue, Volcano, Argo) for batch training. This role is the bridge between data scientists and the platform.
Responsibilities:
- Deploy and optimise Ray + KubeRay for distributed data processing and model training across GPU clusters.
- Build Kubeflow Pipelines for reproducible ML workflows: data prep, training, evaluation, and deployment, with lineage tracking.
- Configure MLflow for centralised experiment tracking and a model registry shared across teams.
- Implement advanced job scheduling (queue management, priority, preemption, gang scheduling) via Kueue/Volcano.
- Build model CI/CD: automated training, evaluation, validation, and canary/blue-green deployment to inference endpoints.
- Create self-service tooling for data scientists: cluster provisioning, GPU allocation, experiment templates.
- Monitor ML workload performance: GPU utilisation, training throughput, data pipeline efficiency.
Requirements:
- 3+ years in ML infrastructure, MLOps, or ML platform engineering.
- Kubernetes (EKS preferred): deployments, PVs, RBAC, resource management.
- At least two of: Ray/KubeRay, Kubeflow, MLflow, Airflow, Argo Workflows.
- Distributed training: PyTorch DDP, Horovod, DeepSpeed, or Ray Train.
- Model serving: KServe, Seldon, or custom FastAPI serving.
- GPU scheduling and resource management on Kubernetes.
- Strong Python engineering: tools and automation, not just notebooks.
Tech Stack: Ray/KubeRay, Kubeflow Pipelines, MLflow, JupyterHub, Argo Workflows, Kueue/Volcano, PyTorch/DeepSpeed, KServe, Helm, AWS (EKS, S3, EFS, ECR), Prometheus/Grafana.