ML Systems Engineer (Agentic AI & Inference Engineering)

CRUTZ LEELA ENTERPRISES
Pune, Maharashtra

Job details

Permanent
₹5,00,000 - ₹18,00,000 a year
1 day ago

Benefits

Provident Fund

Qualifications

Rust (programming language)
Law
Kubernetes
PyTorch
Software deployment
Master's degree
C++
Docker
C
SDKs
APIs
Linux
AI
Python

Full job description

OVERVIEW

We are hiring an ML Systems Engineer to design and deliver cutting-edge AI solutions for enterprise clients at the frontier of agentic AI, inference engineering, and ML systems architecture. You will go beyond applied ML, dissecting how AI systems are built, optimized, and scaled, and designing production-grade architectures spanning retrieval systems, inference pipelines, and agentic workflows. You will translate state-of-the-art capabilities into robust, performant solutions, operating at the intersection of ML research awareness and engineering discipline.

KEY RESPONSIBILITIES

Design and deliver production-grade AI systems for enterprise clients spanning agentic workflows, LLM inference pipelines, and retrieval-augmented architectures.

Lead ML systems architecture decisions - model serving topology, inference backend selection, KV cache management, batching strategies, and memory optimization - alongside ML performance engineering to profile bottlenecks, benchmark throughput/latency, and evaluate quantization strategies (GPTQ, AWQ, GGUF).

Architect RAG pipelines and agentic AI systems - from chunking, embedding, hybrid retrieval, and re-ranking through to multi-agent orchestration, tool use, and memory architectures.

Evaluate frontier model capabilities - reasoning models, multimodal systems, fine-tuned variants - and make principled architectural trade-off decisions for client contexts.

Build reusable accelerators, reference implementations, and evaluation/observability frameworks encoding best practices across engagements.

Contribute to technical solutioning - architecture designs, proof-of-concepts, and feasibility assessments - in client-facing contexts.

TECHNICAL QUALIFICATIONS

Core Requirements

Python & ML ecosystem: Strong programming skills with production AI system experience; hands-on with the PyTorch ecosystem including Hugging Face Transformers, PEFT, Accelerate, and Datasets.

LLM inference & serving: Deep knowledge of KV cache mechanics, quantization, and batching; hands-on with at least one inference runtime (vLLM, TGI, TensorRT-LLM, SGLang, or similar).

Hands-on experience supporting AI/ML and LLM inference platforms at scale, including working with vLLM for high-performance LLM serving, optimization, and large-scale inference.

RAG & Agentic Systems: Experience designing retrieval architectures and building agentic systems using LangGraph, LlamaIndex Workflows, AutoGen, or CrewAI - including tool use, memory, and multi-agent coordination.

LLM APIs & prompt engineering: Strong grasp of structured output generation, function calling, and provider SDK usage across OpenAI, Anthropic, Mistral, Hugging Face, and similar.

Deployment fundamentals: Proficiency with Docker, containerization, and Linux environments for packaging, deploying, and debugging AI systems.

Comfortable leveraging AI-assisted tools for collaborative development, code generation, refactoring, and productivity enhancement.

Preferred

Fine-tuning: Experience with LoRA/QLoRA, dataset curation, and instruction tuning; understanding of when fine-tuning is the right lever vs. prompting or RAG.

Low-level AI systems: Familiarity with CUDA, Triton, or similar GPU programming models; working knowledge of C++ or Rust.

Infrastructure & observability: Kubernetes for containerized AI workloads; experience with LangSmith, Arize, W&B, Phoenix, or Prometheus/Grafana for ML observability.

WAYS TO STAND OUT FROM THE CROWD

You have built and deployed a production agentic system and can speak to the failure modes and design decisions that only emerge at runtime.

You have done inference optimization at a systems level - tuning serving infrastructure, implementing custom batching logic, or optimizing a quantization pipeline to hit real SLAs.

You have open-source contributions to prominent ML systems repositories - vLLM, SGLang, llama.cpp, TGI, LangChain, LlamaIndex, or similar - demonstrating work that holds up to community scrutiny.

You have designed custom LLM evaluation frameworks with structured regression harnesses, domain-specific evals, or human-in-the-loop feedback loops - beyond off-the-shelf metrics.

You bring a client-facing engineering mindset and can defend opinions on reasoning models, long-context retrieval, or inference hardware tradeoffs based on hands-on experimentation.

Pay: ₹500,000.00 - ₹1,800,000.00 per year

Benefits:

Provident Fund

Work Location: In person
