OVERVIEW
We are hiring an ML Systems Engineer to design and deliver cutting-edge AI solutions for enterprise
clients at the frontier of agentic AI, inference engineering, and ML systems architecture. You will go
beyond applied ML. You will dissect how AI systems are built, optimized, and scaled, and you will design
production-grade architectures spanning retrieval systems, inference pipelines, and agentic workflows.
You will translate state-of-the-art capabilities into robust, performant solutions, working at the
intersection of ML research awareness and engineering discipline.
KEY RESPONSIBILITIES
- Design and deliver production-grade AI systems for enterprise clients spanning agentic workflows,
LLM inference pipelines, and retrieval-augmented architectures.
- Lead ML systems architecture decisions - model serving topology, inference backend selection, KV
cache management, batching strategies, and memory optimization - and drive ML performance
engineering: profiling bottlenecks, benchmarking throughput and latency, and evaluating quantization
methods and formats (GPTQ, AWQ, GGUF).
- Architect RAG pipelines and agentic AI systems - from chunking, embedding, hybrid retrieval, and
re-ranking through to multi-agent orchestration, tool use, and memory architectures.
- Evaluate frontier model capabilities - reasoning models, multimodal systems, fine-tuned variants - and
make principled architectural trade-offs for each client's context.
- Build reusable accelerators, reference implementations, and evaluation/observability frameworks
encoding best practices across engagements.
- Contribute to technical solutioning - architecture designs, proof-of-concepts, and feasibility
assessments - in client-facing contexts.
TECHNICAL QUALIFICATIONS
Core Requirements
- Python & ML ecosystem: Strong programming skills with production AI system experience;
hands-on with the PyTorch ecosystem, including Hugging Face Transformers, PEFT, Accelerate, and
Datasets.
- LLM inference & serving: Deep knowledge of KV cache mechanics, quantization, and batching;
hands-on with at least one inference runtime (vLLM, TGI, TensorRT-LLM, SGLang, or similar).
- Inference at scale: Hands-on experience supporting AI/ML and LLM inference platforms at scale,
including high-performance LLM serving and optimization with vLLM.
- RAG & agentic systems: Experience designing retrieval architectures and building agentic systems
using LangGraph, LlamaIndex Workflows, AutoGen, or CrewAI, including tool use, memory, and
multi-agent coordination.
- LLM APIs & prompt engineering: Strong grasp of structured output generation, function calling, and
provider SDK usage across OpenAI, Anthropic, Mistral, Hugging Face, and similar.
- Deployment fundamentals: Proficiency with Docker-based containerization and Linux environments
for packaging, deploying, and debugging AI systems.
- AI-assisted development: Comfortable using AI-assisted tools for collaborative development, code
generation, refactoring, and productivity.
Preferred
- Fine-tuning: Experience with LoRA/QLoRA, dataset curation, and instruction tuning; understanding
of when fine-tuning is the right lever vs. prompting or RAG.
- Low-level AI systems: Familiarity with CUDA, Triton, or similar GPU programming models; working
knowledge of C++ or Rust.
- Infrastructure & observability: Kubernetes for containerized AI workloads; experience with
LangSmith, Arize, W&B, Phoenix, or Prometheus/Grafana for ML observability.
WAYS TO STAND OUT FROM THE CROWD
- You have built and deployed a production agentic system and can speak to the failure modes and
design decisions that only emerge at runtime.
- You have done inference optimization at a systems level - tuning serving infrastructure, implementing
custom batching logic, or optimizing a quantization pipeline to hit real SLAs.
- You have open-source contributions to prominent ML systems repositories - vLLM, SGLang,
llama.cpp, TGI, LangChain, LlamaIndex, or similar - demonstrating work that holds up to community
scrutiny.
- You have designed custom LLM evaluation frameworks with structured regression harnesses,
domain-specific evals, or human-in-the-loop feedback that go beyond off-the-shelf metrics.
- You bring a client-facing engineering mindset and can defend opinions on reasoning models,
long-context retrieval, or inference hardware trade-offs, backed by hands-on experimentation.
Pay: ₹500,000.00 - ₹1,800,000.00 per year
Work Location: In person