We are building a large-scale benchmark for evaluating the cybersecurity capabilities of frontier large language models (LLMs). To grow this benchmark, we need hands-on security engineers who can craft real-world vulnerability tasks that are genuinely difficult for state-of-the-art LLMs and agentic systems.
Your core output will be carefully designed benchmark instances: real software vulnerabilities paired with well-formed task specifications and validated evaluation oracles that expose the limits of current AI systems and drive progress in AI safety research.
Key Responsibilities
- Create Benchmark Tasks: Design cybersecurity benchmark tasks engineered to challenge frontier LLMs and expose where they fail.
- Environment Maintenance: Build and maintain containerized benchmark environments using Docker, libFuzzer, and sanitizers (ASan/MSan).
- Develop Difficulty Tiers: Produce multi-level difficulty variants, ranging from Level 0 (no description provided) through Level 3 (patch diff supplied).
- Collaborate with Research: Partner with researchers to analyze and document the specific failure patterns of AI agents.
- Technical Documentation: Write clear, reproducible vulnerability descriptions (≤200 words) to be used directly as task prompts.
- Agent Stress-Testing: Stress-test developed tasks against frontier LLM agents (e.g., OpenHands, Codex CLI) and document their failure modes.
- Quality Assurance: Ensure strict benchmark quality, including zero duplicated tasks, sufficient localization information in each description, and a 96%+ precision target.
- Responsible Disclosure: Follow standard responsible disclosure protocols for any zero-day vulnerabilities discovered during benchmark development.
Required Skills & Qualifications
1. Vulnerability Research Expertise
- 3+ years of hands-on experience identifying and analyzing memory safety vulnerabilities in C/C++ codebases (e.g., heap/stack overflows, use-after-free, null-pointer dereferences, uninitialized-memory reads).
- Demonstrated ability to reproduce known CVEs and write proof-of-concept (PoC) inputs that reliably trigger sanitizer crashes (ASan, MSan, UBSan).
- Comfort navigating large, unfamiliar codebases (ranging from 100k to 7M+ lines of code) to locate vulnerable code paths.
2. Fuzzing & Toolchain
- Working knowledge of coverage-guided fuzzers such as libFuzzer, AFL++, or OSS-Fuzz workflows.
- Experience compiling projects with sanitizers enabled: AddressSanitizer with GCC or Clang, and MemorySanitizer with Clang (GCC does not support MSan).
- Familiarity with Docker for building and distributing reproducible execution environments.
3. Patch & Exploit Analysis
- Ability to read unified diffs and extract semantic meaning about the specific vulnerability being patched.
- Solid understanding of 1-day / N-day attack workflows, including going from a patch diff to a working PoC.
- Experience with binary search over commit history (e.g., git bisect or equivalent) to pinpoint exact patch commits.
4. Communication & Automation Rigor
- Ability to write concise, technically precise vulnerability descriptions (target: ≤200 words) containing sufficient localization information for reproduction without leaking the fix.
- Comfortable scripting in Python or Bash to automate build, evaluation, and filtering pipelines.
Nice to Have (Preferred Skills)
- Prior experience contributing to or evaluating AI coding agents (e.g., OpenHands, Codex CLI, SWE-agent).
- Familiarity with LLM APIs and prompt engineering for automated quality-judgment pipelines.
- A research background, including publications or detailed write-ups on vulnerability discovery, fuzzing, or program analysis.
- Direct experience with CVE reporting and coordinated vulnerability disclosure processes.
- Knowledge of broader vulnerability classes beyond memory safety (e.g., logic flaws, cryptographic weaknesses, web/mobile vulnerabilities).
- Hands-on Capture the Flag (CTF) competition experience, particularly in pwn or reverse-engineering categories.
- Familiarity with symbolic execution or static analysis tools (e.g., angr, CodeQL, Infer).
Core Values We Look For
- Strong curiosity and a research-oriented mindset.
- The ability to translate theory into practical, working systems.
- High ownership, a bias toward execution, and comfort with ambiguity in evolving problem spaces.
- Clear, highly structured technical communication.
- Ability to work with high autonomy in a fast-paced environment.
Why Join This Project
- Work on cutting-edge problems at the intersection of AI evaluation, safety, and reliability.
- Help bridge the gap between security research and real-world AI systems.
- Enjoy high ownership and autonomy in a fast-moving team environment.
- Shape how AI agents are evaluated at scale while gaining exposure to both research-driven innovation and production systems.
Pay: ₹552,138.32 - ₹830,854.57 per year
Work Location: In person