MLOps / DevOps Engineer

iApp Technologies - Mohali, Punjab

Job details

Permanent
7 days ago

Qualifications

CI/CD
Performance tuning
Azure
Law
Kubernetes
Ansible
Software deployment
DevOps
Configuration management
Git
Master's degree
Bash (Unix shell)
Databases
AWS
Docker
Distributed systems
Terraform
Continuous integration
Scripting
GitHub
APIs
Linux
AI
Jenkins
GitLab
Python
Identity & access management

Full job description

About the Role

We are looking for an MLOps/DevOps Engineer to build, deploy, and operate infrastructure for LLM and AI workloads in production. You will work closely with ML and backend engineers to create reliable environments for training/fine-tuning, model serving, and GPU-based compute, ensuring performance, scalability, and high availability.

Key Responsibilities

Design and manage scalable infrastructure for AI/ML workloads (training, fine-tuning, inference).
Deploy, manage, and optimize GPU-enabled environments (drivers, CUDA runtime readiness, GPU monitoring, scheduling).
Build and maintain CI/CD pipelines for backend services (APIs, microservices) and ML/LLM deployments (model versioning, rollout, rollback).
Containerize and orchestrate services using Docker and Kubernetes (EKS/GKE/AKS or self-managed).
Implement best practices for the MLOps lifecycle:
  model packaging and artifact management
  reproducible deployments
  environment management across dev/stage/prod
Set up observability (metrics, logging, alerting, tracing) for infrastructure and model services.
Improve system reliability via SRE practices: incident response, root-cause analysis, SLAs/SLOs, capacity planning.
Collaborate with ML engineers to productionize LLM workflows (LoRA adapters, inference endpoints, batch jobs).
Optimize cost and performance (autoscaling, efficient GPU utilization, job scheduling, caching).

Required Skills & Qualifications (Must Have)

3–5 years of experience in a DevOps, Platform Engineering, or MLOps role.
Strong Linux administration and networking fundamentals.
Hands-on experience with Docker and Kubernetes (deployments, services, ingress, scaling).
Experience building CI/CD pipelines (GitHub Actions / GitLab CI / Jenkins).
Proficiency in scripting/automation using Python (or strong Bash skills and the ability to work in Python).
Cloud experience with AWS / GCP / Azure (compute, networking, IAM, storage).
Familiarity with infrastructure automation and configuration management (Terraform/Ansible is a plus).

Good to Have (Preferred)

Experience with model serving frameworks: vLLM, Triton Inference Server, TorchServe, Ray Serve.
Exposure to ML lifecycle tools: MLflow, Weights & Biases, DVC.
Understanding of LLM fine-tuning concepts (LoRA/QLoRA) and deployment requirements.
Experience working with distributed systems, job schedulers, or workflow orchestration (Argo, Airflow, Prefect).
Knowledge of vector databases / RAG pipelines (FAISS, Pinecone, Weaviate, pgvector).
Familiarity with GPU performance tuning/monitoring (nvidia-smi, DCGM, Prometheus exporters).

Experience:

LLM: 3 years (Required)
AI architecture: 3 years (Required)
DevOps engineer: 3 years (Required)

Work Location: In person