ROLE: AI / ML + Backend Engineer · ₹25–50K / month
3–6 months · Remote
WHAT YOU OWN
The intelligence layer: LLM pipelines, RAG, multi-agent systems, vector search, fine-tuning data preparation, and the backend that makes personalised learning work at scale. You build production systems, not research notebooks. Every AI decision you ship gets used to guide real students.
WHAT THE WORK ACTUALLY LOOKS LIKE
In any given week you could be:
▪ Building an agent that analyses quiz patterns and flags at-risk students before the exam happens, not after
▪ Moving a RAG pipeline from fixed-k retrieval to dynamic k with confidence-score gating, then running the benchmarks to prove it works
▪ Writing the context assembly layer that ranks, deduplicates, and compresses retrieved chunks before the LLM sees them
▪ Running a QLoRA fine-tune on a domain-specific dataset and evaluating whether it actually improves pedagogical correctness
▪ Logging every generation run with its prompt, retrieved context, and output so the system is fully auditable
In education, a hallucination is not a minor issue. You build with that in mind.
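As a rough illustration of the context assembly task listed above, the sketch below deduplicates chunks, drops any that do not fit a budget, and places the strongest chunks at the edges of the prompt, where models tend to attend best. The function names, the exact-match dedupe, and the character budget are assumptions made for illustration; a real pipeline would more likely use near-duplicate detection, token counts, and proper compression.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, float]  # (text, relevance score)


def _normalise(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())


def dedupe(chunks: List[Chunk]) -> List[Chunk]:
    """Drop duplicates after normalisation, keeping the best-scored copy."""
    best: Dict[str, Chunk] = {}
    for text, score in chunks:
        key = _normalise(text)
        if key not in best or score > best[key][1]:
            best[key] = (text, score)
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)


def edge_order(ranked: List[Chunk]) -> List[Chunk]:
    """Put the strongest chunks first and last, the weakest in the middle."""
    front: List[Chunk] = []
    back: List[Chunk] = []
    for index, item in enumerate(ranked):
        (front if index % 2 == 0 else back).append(item)
    return front + back[::-1]


def assemble(chunks: List[Chunk], max_chars: int = 2000) -> List[str]:
    """Dedupe, keep the best chunks that fit the budget, then reorder."""
    kept: List[Chunk] = []
    used = 0
    for text, score in dedupe(chunks):
        if used + len(text) > max_chars:
            continue  # skip chunks that would exceed the budget
        kept.append((text, score))
        used += len(text)
    return [text for text, _ in edge_order(kept)]
```

With four distinct chunks ranked A > B > C > D, `assemble` returns them in the order A, C, D, B, so the two best chunks sit at the start and end of the context.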
MUST HAVE
- Python: primary language. Clean, modular, production-ready.
- LLM APIs: Claude, Gemini. Prompt engineering, structured outputs, cost optimisation.
- RAG foundations: vector embeddings, pgvector, semantic search, chunking strategies, retrieval pipelines from scratch.
- Agent frameworks: multi-agent orchestration, tool use, LangChain or custom.
- GCP: Vertex AI, Cloud Run, BigQuery. Not just local experiments.
- Cost engineering: token tracking, caching, model routing. LLM cost per student matters.
- Backend: FastAPI, async Python, database design for AI workloads.
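To make the cost engineering item concrete, here is a minimal sketch of token-based cost tracking with a simple routing rule. The model names, prices, and the 4000-token threshold are placeholders invented for illustration, not real provider pricing or a recommended policy.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical prices in USD per million tokens; real prices differ by provider.
PRICES: Dict[str, Dict[str, float]] = {
    "small-model": {"input": 0.25, "output": 1.25},
    "large-model": {"input": 3.00, "output": 15.00},
}


def route_model(prompt_tokens: int, needs_reasoning: bool) -> str:
    """Send short, simple requests to the cheaper model (illustrative rule)."""
    if needs_reasoning or prompt_tokens > 4000:
        return "large-model"
    return "small-model"


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in USD, given token counts reported by the API."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


@dataclass
class CostLedger:
    """Accumulates spend per student so cost per student can be monitored."""

    per_student: Dict[str, float] = field(default_factory=dict)

    def record(self, student_id: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        cost = call_cost(model, input_tokens, output_tokens)
        self.per_student[student_id] = self.per_student.get(student_id, 0.0) + cost
        return cost
```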
RAG ENGINEERING & PERSONALISATION
Personalisation is mostly a retrieval problem. You will own the full RAG stack. That means not just wiring up an API, but engineering the retrieval layer that makes each student's experience actually adaptive:
- RAG fine-tuning: fine-tune embedding models and re-rankers on domain-specific content (CBSE curriculum, question bank, concept graph) so retrieval understands pedagogy, not just keyword overlap
- Chunk optimisation: figure out the right chunk size and overlap for each content type (lesson text, worked examples, MCQs, formula sheets); back every decision with precision/recall numbers
- k-selection: dynamic k based on query complexity, confidence scores, and context window budget; know the precision-recall-latency tradeoff and make deliberate choices
- Retrieval speed: fast enough for real-time student interactions; HNSW tuning, ANN vs exact search, query batching, embedding caching, async pipelines
- Personalisation via retrieval: enrich queries with each student's learning history (weak concepts, error patterns, last-seen content) so retrieval surfaces what is contextually relevant, not just semantically close
- Context assembly: rank, deduplicate, and compress retrieved chunks before the LLM sees them; handle lost-in-the-middle degradation
- Retrieval evals: hit rate, MRR, NDCG harness; no index change ships without a benchmark
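The retrieval metrics named in the last item can be sketched in a few lines. This is a minimal version that assumes each query comes with a ranked list of document IDs and a set of relevant IDs (or graded gains for NDCG), and it uses linear gain in the DCG sum; the function names are illustrative.

```python
import math
from typing import Dict, List, Set, Tuple


def hit_rate_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """NDCG@k with linear gain: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(i + 2)
        for i, doc in enumerate(ranked[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Running these over a fixed query set before and after an index change gives the benchmark comparison the item asks for.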
Most teams treat RAG as a solved problem. We do not. The choice between k=3 and k=7, sentence and paragraph chunks, or cosine-only and hybrid BM25 retrieval directly decides whether a student gets the right explanation or a wrong one.
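One way to avoid hard-coding k=3 or k=7 is to let similarity scores decide. The sketch below keeps chunks while the score stays above a gate and does not fall off sharply. All thresholds and names here are assumptions for illustration and would need tuning against retrieval benchmarks.

```python
from typing import List, Tuple


def select_dynamic_k(
    scored_chunks: List[Tuple[str, float]],
    min_score: float = 0.75,
    max_gap: float = 0.10,
    k_min: int = 1,
    k_max: int = 7,
) -> List[Tuple[str, float]]:
    """Choose how many chunks to keep based on similarity scores.

    scored_chunks holds (chunk_id, similarity) pairs in any order.
    At least k_min and at most k_max chunks are returned.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    selected: List[Tuple[str, float]] = []
    for chunk_id, score in ranked[:k_max]:
        if len(selected) >= k_min:
            previous_score = selected[-1][1]
            # Stop once confidence drops below the gate or falls off sharply.
            if score < min_score or (previous_score - score) > max_gap:
                break
        selected.append((chunk_id, score))
    return selected


hits = [("c1", 0.91), ("c2", 0.88), ("c3", 0.86), ("c4", 0.62), ("c5", 0.60)]
print([chunk_id for chunk_id, _ in select_dynamic_k(hits)])
# prints ['c1', 'c2', 'c3']
```

The `k_min` floor means the caller always gets something back, so a separate check on the top score is still needed to decide whether to answer at all.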
TRAINING DATA & FINE-TUNING
A core part of this role is building the training pipeline for our own models. You will:
- Audit pipeline: audit educational simulations for pedagogical, UX, and logic failures; convert findings into structured JSON feedback payloads
- SFT data prep: generate (prompt, completion) training pairs in JSONL format for supervised fine-tuning
- DPO data prep: build (prompt, chosen, rejected) preference triplets in JSONL format; write high-quality rejected outputs that represent plausible-but-wrong model behaviour; know the difference between a hallucinated fix and a precise one
- RAG-aware fine-tuning: fine-tune models to use retrieved context properly, that is, to cite, synthesise, and flag gaps rather than hallucinate over them
- Fine-tuning: LoRA or QLoRA on open-source base models (Qwen, Mistral, LLaMA)
- Eval harness: automated evals scoring model output on pedagogical correctness, code validity, fix precision
- Data governance: log all generation runs with prompts, retrieved context, and outputs; full reproducibility
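For the data governance item, a minimal sketch of an append-only run log is shown below. The record fields and the function name are assumptions; the point is only that each run stores its prompt, retrieved context, and output together under a unique ID.

```python
import hashlib
import json
import time
import uuid
from pathlib import Path
from typing import Any, Dict, List


def log_generation_run(
    log_path: str,
    prompt: str,
    retrieved_context: List[str],
    output: str,
    model: str,
) -> Dict[str, Any]:
    """Append one generation run to a JSONL file and return the record."""
    record: Dict[str, Any] = {
        "run_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "retrieved_context": retrieved_context,
        "output": output,
        # A hash makes it cheap to group runs that used an identical prompt.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Full reproducibility would also require logging model version, sampling parameters, and the index snapshot used for retrieval, which this sketch leaves out.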
The training data you produce directly determines how good our autonomous simulation engineer becomes. Annotation quality matters as much as quantity.
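A small sketch of how (prompt, chosen, rejected) triplets might be validated and serialised as JSONL. The three field names match the ones used above and are a common convention in preference-tuning tooling; the validation rules and the example row are invented for illustration.

```python
import json
from typing import Dict, Iterable, List

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_triplet(row: Dict[str, str]) -> None:
    """Raise ValueError if a triplet is incomplete or degenerate."""
    for key in REQUIRED_FIELDS:
        value = row.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")
    if row["chosen"].strip() == row["rejected"].strip():
        raise ValueError("chosen and rejected must differ")


def to_jsonl(rows: Iterable[Dict[str, str]]) -> str:
    """Serialise validated triplets, one JSON object per line."""
    lines: List[str] = []
    for row in rows:
        validate_triplet(row)
        clean = {key: row[key] for key in REQUIRED_FIELDS}
        lines.append(json.dumps(clean, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


# Invented example: the rejected answer is plausible but does not fix the bug.
example = {
    "prompt": "The pendulum simulation shows the same period for every "
              "string length. Identify the fix.",
    "chosen": "Compute the period as 2*pi*sqrt(L/g) using the current "
              "length L instead of a hard-coded constant.",
    "rejected": "Increase the animation frame rate so the swing looks smoother.",
}
print(to_jsonl([example]), end="")
```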
NICE TO HAVE
▪ Fine-tuning experience: LoRA, QLoRA, or full fine-tune on open-source models
▪ Embedding model fine-tuning: BGE, E5, GTE for domain-specific retrieval
▪ Re-ranker experience: cross-encoders, Cohere Rerank, or custom
▪ Annotation pipelines or RLHF/DPO dataset experience
▪ Hybrid search: BM25 + dense retrieval, Reciprocal Rank Fusion
▪ Manim or Matplotlib for programmatic diagram generation
▪ GCP Vertex AI hands-on experience
▪ LLM feature shipped to real users in production
▪ HuggingFace Trainer, TRL, or Axolotl
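For readers unfamiliar with the hybrid search item above, Reciprocal Rank Fusion merges rankings from BM25 and dense retrieval using only rank positions. This is a minimal sketch; k=60 is the constant commonly used in the literature, and the alphabetical tie-break is an assumption added to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically by ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d2", "d3", "d4"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# prints ['d2', 'd3', 'd1', 'd4']
```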
You are a final-year or pre-final-year student at an IIT, NIT, BITS, or any other good institute. You have built something with LLMs beyond a ChatGPT wrapper: RAG, agents, fine-tuning, or embeddings that ran in production. You know the difference between a demo that works and a system that scales cheaply. You understand that in education, the model being wrong has real consequences. You care about engineering, not just accuracy numbers.
After 3 months: an option to convert to a full-time role is available, depending solely on your performance.
Pay: ₹25,000.00 - ₹50,000.00 per month
Work Location: Remote