About the Role
We are looking for an MLOps/DevOps Engineer to build, deploy, and operate infrastructure for LLM and AI workloads in production. You will work closely with ML and backend engineers to create reliable environments for training/fine-tuning, model serving, and GPU-based compute, ensuring performance, scalability, and high availability.
Key Responsibilities
- Design and manage scalable infrastructure for AI/ML workloads (training, fine-tuning, inference).
- Deploy, manage, and optimize GPU-enabled environments (drivers, CUDA runtime readiness, GPU monitoring, scheduling).
- Build and maintain CI/CD pipelines for backend services (APIs, microservices) and ML/LLM deployments (model versioning, rollout, rollback).
- Containerize and orchestrate services using Docker and Kubernetes (EKS/GKE/AKS or self-managed).
- Implement best practices for the MLOps lifecycle:
  - model packaging and artifact management
  - reproducible deployments
  - environment management across dev/stage/prod
- Set up observability (metrics, logging, alerting, tracing) for infrastructure and model services.
- Improve system reliability via SRE practices: incident response, root-cause analysis, SLAs/SLOs, capacity planning.
- Collaborate with ML engineers to productionize LLM workflows (LoRA adapters, inference endpoints, batch jobs).
- Optimize cost and performance (autoscaling, efficient GPU utilization, job scheduling, caching).
Required Skills & Qualifications (Must Have)
- 3–5 years of experience in a DevOps, Platform Engineering, or MLOps role.
- Strong Linux administration and networking fundamentals.
- Hands-on experience with Docker and Kubernetes (deployments, services, ingress, scaling).
- Experience building CI/CD pipelines (GitHub Actions / GitLab CI / Jenkins).
- Proficiency in scripting/automation using Python (or strong Bash skills plus the ability to work in Python).
- Cloud experience with AWS / GCP / Azure (compute, networking, IAM, storage).
- Familiarity with infrastructure automation and configuration management (Terraform/Ansible is a plus).
Good to Have (Preferred)
- Experience with model serving frameworks: vLLM, Triton Inference Server, TorchServe, Ray Serve.
- Exposure to ML lifecycle tools: MLflow, Weights & Biases, DVC.
- Understanding of LLM fine-tuning concepts (LoRA/QLoRA) and deployment requirements.
- Experience working with distributed systems, job schedulers, or workflow orchestration (Argo, Airflow, Prefect).
- Knowledge of vector databases / RAG pipelines (FAISS, Pinecone, Weaviate, pgvector).
- Familiarity with GPU performance tuning/monitoring (nvidia-smi, DCGM, Prometheus exporters).
Experience:
- LLM: 3 years (Required)
- AI architecture: 3 years (Required)
- DevOps engineer: 3 years (Required)
Work Location: In person