Job Description
ABOUT THE ROLE
We are looking for a Senior Engineer to join our AI team at the intersection of evaluation science, post-training, and foundation model development. You will own our end-to-end eval and benchmarking infrastructure — the critical feedback loop that drives every major model improvement — while contributing hands-on to post-training pipelines for industry-specific vertical foundation models.
This role is ideal for someone who has worked directly inside an LLM lab and understands what rigorous evaluation looks like at scale: designing the taxonomy of skills being measured, identifying failure modes, engineering synthetic data to close capability gaps, and translating eval signals into actionable training decisions.
WHAT YOU'LL DO
Evaluation & Benchmarking
- Design and own task-level evaluation frameworks for LLM agents and base models, covering multi-step reasoning, tool/API use, instruction following, and domain knowledge, grounded in real user failure modes rather than off-the-shelf benchmark suites.
- Build comparative benchmarking pipelines to assess leading frontier models (GPT-4o, Gemini, Claude, Llama, Mistral, etc.) against each other and against internal models, with structured analysis of where each model family fails, regresses, or excels across subjects, topics, and task types.
- Produce capability gap reports that quantify performance deltas across dimensions such as subject-matter accuracy, reasoning depth, factual consistency, and refusal behaviour.
- Track model version regressions across provider releases to maintain a living competitive intelligence benchmark.
- Develop domain-specific benchmarks tailored to vertical use-cases (e.g., STEM tutoring, legal, finance, healthcare), including problem taxonomy design, rubric definition, and inter-annotator agreement pipelines.
- Define and drive synthetic data generation strategies to systematically address model shortcomings in specific subjects, topics, and skill areas:
  - Identify low-performance clusters from eval results and translate them into targeted data generation prompts and pipelines.
  - Design LLM-assisted pipelines for generating high-quality, diverse, and verifiable synthetic training and evaluation data at scale.
  - Validate synthetic data quality through auto-eval, human review, and downstream model performance lift experiments.
- Build automated regression suites integrated into CI/CD workflows to detect capability degradation across fine-tuning runs and model updates.
- Partner with product, curriculum, and research teams to translate eval insights into prioritized post-training and data flywheel decisions.
Post-Training & Fine-Tuning
- Lead or directly contribute to SFT, RLHF, RLAIF, and DPO training runs on industry-specific vertical foundation models, from dataset design through training execution and eval-gated release.
- Curate and engineer high-quality instruction-tuning and preference datasets for domain adaptation, with hands-on experience distinguishing signal from noise in annotation pipelines.
- Define data quality criteria, rejection sampling strategies, and deduplication pipelines for SFT corpora.
- Design preference pair construction methodologies and reward model training setups grounded in domain-specific quality rubrics.
- Implement and experiment with alignment techniques including reward modelling, process reward models (PRMs), and constitutional/RLAIF approaches.
- Run ablation studies and controlled experiments to attribute model behaviour changes to specific data or training interventions, not just report final numbers.
- Contribute to continual pre-training and domain-adaptive fine-tuning pipelines for vertical models, including domain data sourcing, mixing strategies, and curriculum design.
Infrastructure & Tooling
- Build scalable eval pipelines that run automatically on every training checkpoint and integrate into CI/CD for continuous model quality tracking.
- Maintain model cards, eval leaderboards, and internal dashboards providing visibility across experiments for both technical and non-technical stakeholders.
- Ensure reproducibility through rigorous experiment tracking (W&B, MLflow, or equivalent), versioned datasets, and documented training configs.
WHO YOU ARE
Required
- 5+ years of ML/AI engineering experience, with at least 2–3 years focused on large language models.
- Lab pedigree: Direct, hands-on experience at an LLM lab, AI research organization, or equivalent frontier AI team. You have shipped models, not just called APIs.
- Familiarity with the full model lifecycle: pre-training data, post-training alignment, eval, and production deployment.
- Deep practical expertise in post-training methods:
  - SFT, RLHF, RLAIF, DPO, PPO, from dataset construction through training and eval-gated release.
  - Experience with reward modeling, preference data curation, and quality control for alignment pipelines.
- Demonstrated experience designing LLM evaluation frameworks beyond standard benchmarks, including task-level evals for agentic or multi-step workflows.
- Hands-on experience building synthetic data generation pipelines to address specific model capability gaps:
  - Designing targeted generation prompts based on eval failure analysis.
  - Validating synthetic data quality through downstream model performance experiments.
- Proven track record of comparative benchmarking across leading foundation models, with structured analysis of capability shortcomings by subject, skill, or task type.
- Experience training or fine-tuning vertical/industry-specific foundation models: domain data curation, continual pre-training, or domain-adaptive SFT.
- Strong software engineering fundamentals: Python, PyTorch or JAX, distributed training.
Preferred
- Publications or applied research contributions in LLM evaluation, alignment, or post-training.
- Experience with multi-modal models or agents with external tool/API use.
- Exposure to red-teaming, adversarial evaluation, or safety benchmarking.
- Model distillation, speculative decoding, or inference optimization experience.
- Prior experience in an education, STEM, legal, biomedical, or enterprise software vertical.
WHAT SUCCESS LOOKS LIKE
30 Days
Fully onboarded into training infra and eval repos. Running existing benchmarks end-to-end and producing a written gap analysis identifying missing coverage.
60 Days
Shipped at least one new domain-specific benchmark and one synthetic data generation pipeline addressing a known model gap. CI-integrated eval running on every checkpoint.
3 Months
Standardized the model evaluation framework for foundation models. Owning golden dataset strategies for fine-tuning, with measurable subject-accuracy gains.
6 Months
Recognized internally as the authority on model quality and competitive benchmarking. Eval insights are directly driving roadmap prioritization.
Why do we exist?
Students are working harder than ever before to stabilize their future. Our recent research study, State of the Student, shows that nearly 3 out of 4 students are working to support themselves through college and 1 in 3 students feel pressure to spend more than they can afford. We founded our business on providing affordable textbook rental options to address these issues. Since then, we’ve expanded our offerings to supplement many facets of higher education learning through Chegg Study, Chegg Math, Chegg Writing, Chegg Internships, Chegg Skills, and more to support students beyond their college experience. These offerings lower financial concerns for students by modernizing their learning experience. We exist so students everywhere have a smarter, faster, more affordable way to student.