Terminal-Bench Task Contributor (Freelancer)

Job details

Overview: Design and build challenging, real-world terminal-based tasks for evaluating frontier AI agents. Tasks must be genuinely difficult, clearly specified, and programmatically verifiable.

Responsibilities:

● Design high-quality task ideas rooted in real-world workflows (debugging, infra setup, data pipelines, security, ML training, etc.)

● Write clear, unambiguous task instructions with defined end states

● Build Docker environments and write oracle solutions that pass all tests

● Write deterministic pytest-based verification scripts

● Identify edge cases and ensure tasks can't be shortcut or gamed by AI agents

● Iterate with reviewers based on QC and platform gate feedback

Must-Haves:

● 3–5+ years of hands-on engineering experience in at least one domain (SWE, DevOps, ML, security, data engineering, scientific computing)

● Proficiency in Python and shell scripting (bash)

● Comfortable writing Dockerfiles, building images, and debugging containers

● Experience writing automated tests (pytest, unittest)

● Familiarity with Git workflows (PRs, diffs, branching)

● Strong technical writing: ability to produce precise, unambiguous specifications

Nice-to-Haves:

● Experience with AI coding benchmarks (SWE-bench, Terminal-Bench, GPQA)

● Open-source contributions or GitHub PR history in relevant repos

● Experience with the Harbor evaluation framework

● Background in competitive programming or Kaggle

● Domain depth in niche areas (kernel dev, cryptography, HPC, media processing)

● Master's or PhD in CS preferred

Engagement:

● Fully remote

● Fixed rate per accepted task: $40–$60, plus a performance-based bonus

Pay: ₹3,800.00 - ₹5,500.00 per hour

Work Location: Remote