rexbench/rexbench

Coding

RExBench: 2 tasks (cogs, othello) that evaluate AI agents’ ability to extend existing AI research by implementing research experiments. Tasks require an A100 40GB GPU and may take multiple hours (oracle runtime: ~2.5 h for cogs, ~45 min for othello). Parity (openhands@1.4.0 with anthropic/claude-sonnet-4-5, 3 runs): cogs 100% Harbor vs 66.67% original; othello 100% Harbor vs 100% original. Adapter by Nicholas Edwards (nicholas.edwards@univie.ac.at), one of the original RExBench authors. Acknowledgements: We thank 2077 AI for generously funding API credits to support running parity experiments.
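The parity percentages above are consistent with a per-run success rate averaged over 3 runs. The sketch below reproduces them under that reading. The per-run success counts (for example, 2 of 3 for the original cogs harness) are an assumption inferred from the percentages, not figures stated by the source, and `success_rate` is a hypothetical helper.

```python
RUNS = 3  # number of parity runs stated in the description


def success_rate(successes: int, runs: int = RUNS) -> float:
    """Return the success rate as a percentage rounded to two decimals."""
    return round(100 * successes / runs, 2)


# Assumed success counts per harness, inferred from the reported percentages.
parity_counts = {
    "cogs": {"harbor": 3, "original": 2},
    "othello": {"harbor": 3, "original": 3},
}

rates = {
    task: {harness: success_rate(n) for harness, n in counts.items()}
    for task, counts in parity_counts.items()
}

for task, r in rates.items():
    print(f"{task}: {r['harbor']}% Harbor vs {r['original']}% original")
```

With only 3 runs per setting, a single failed run moves the rate by about 33 percentage points, so the cogs gap may reflect run-to-run variance rather than a systematic difference.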

Run this task

CLI:

inspect eval inspect_harbor/rexbench --model openai/gpt-5

Python:

from inspect_ai import eval
from inspect_harbor import rexbench

eval(rexbench(), model="openai/gpt-5")

Dataset information

Harbor registry: rexbench/rexbench
Inspect task: rexbench
Latest digest: sha256:bc23e94793e8c74aceb8f6fdb3a268dc834b4699f71b1329db9222e83bb5ac4f
Samples: 2
Paper: arXiv
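If the digest is copied into a script or config to pin the dataset version, it may be worth checking that the string is well-formed first. The following is a minimal sketch, assuming the `algorithm:hex` layout shown above with a 64-character lowercase SHA-256 value. `parse_digest` is a hypothetical helper, not part of Inspect or Harbor.

```python
import re

# Digest string as listed in the dataset information.
DIGEST = "sha256:bc23e94793e8c74aceb8f6fdb3a268dc834b4699f71b1329db9222e83bb5ac4f"


def parse_digest(digest: str) -> tuple[str, str]:
    """Split 'sha256:<hex>' into (algorithm, hex) and validate the format."""
    algorithm, sep, hex_value = digest.partition(":")
    # A SHA-256 digest is 32 bytes, i.e. 64 hexadecimal characters.
    if not sep or algorithm != "sha256" or not re.fullmatch(r"[0-9a-f]{64}", hex_value):
        raise ValueError(f"malformed digest: {digest!r}")
    return algorithm, hex_value


algorithm, hex_value = parse_digest(DIGEST)
print(algorithm, hex_value[:12])
```

This only checks the format of the string; it does not verify any downloaded content against the digest.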

See Task Parameters for the parameter set shared across all Harbor tasks.