Overview: Design and build challenging, real-world terminal-based tasks for evaluating
frontier AI agents. Tasks must be genuinely difficult, clearly specified, and programmatically
verifiable.
Responsibilities:
● Design high-quality task ideas rooted in real-world workflows (debugging, infra setup,
data pipelines, security, ML training, etc.)
● Write clear, unambiguous task instructions with defined end states
● Build Docker environments and write oracle solutions that pass all tests
● Write deterministic pytest-based verification scripts
● Identify edge cases and ensure tasks can't be shortcut or gamed by AI agents
● Iterate with reviewers based on QC and platform gate feedback
Must-Haves:
● 3–5+ years of hands-on engineering experience in at least one domain (SWE,
DevOps, ML, security, data engineering, scientific computing)
● Proficiency in Python and shell scripting (bash)
● Comfortable writing Dockerfiles, building images, and debugging containers
● Experience writing automated tests (pytest, unittest)
● Familiarity with Git workflows (PRs, diffs, branching)
● Strong technical writing: the ability to produce precise, unambiguous specifications
Nice-to-Haves:
● Experience with AI evaluation benchmarks (SWE-bench, Terminal-Bench, GPQA)
● Open-source contributions or GitHub PR history in relevant repos
● Experience with the Harbor evaluation framework
● Background in competitive programming or Kaggle
● Domain depth in niche areas (kernel dev, cryptography, HPC, media processing)
● Master's or PhD in CS preferred
Engagement:
● Fully remote
● Fixed rate per accepted task: $40–$60, plus a performance-based bonus
Pay: ₹3,800.00–₹5,500.00 per hour
Work Location: Remote