rexbench/rexbench
Coding
RExBench: two tasks (cogs, othello) that evaluate AI agents' ability to extend existing AI research by implementing research experiments. Each task requires an A100 40GB GPU and may take multiple hours (oracle runtime: cogs ~2.5 h; othello ~45 min).
Parity (openhands@1.4.0 with anthropic/claude-sonnet-4-5, 3 runs): cogs 100% on Harbor vs. 66.67% on the original benchmark; othello 100% on Harbor vs. 100% on the original.
Adapter by Nicholas Edwards (nicholas.edwards@univie.ac.at), one of the original RExBench authors.
Acknowledgements: We thank 2077 AI for generously funding API credits to support running parity experiments.
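The parity percentages above follow from simple arithmetic over 3 runs. The sketch below assumes each run is scored pass/fail, which the page does not state explicitly; under that assumption, 66.67% corresponds to 2 of 3 passing runs. The per-run outcome lists and the `success_rate` helper are hypothetical and only illustrate the calculation; which individual run failed is not reported.

```python
def success_rate(outcomes):
    """Return the percentage of passing runs, rounded to two decimals."""
    if not outcomes:
        raise ValueError("need at least one run")
    return round(100 * sum(outcomes) / len(outcomes), 2)


# Hypothetical per-run outcomes chosen to be consistent with the reported
# figures (3 runs per setting). The order of passes and failures is assumed.
runs = {
    ("cogs", "harbor"): [True, True, True],
    ("cogs", "original"): [True, True, False],
    ("othello", "harbor"): [True, True, True],
    ("othello", "original"): [True, True, True],
}

# Map each (task, setting) pair to its success percentage.
rates = {key: success_rate(outcomes) for key, outcomes in runs.items()}

for (task, setting), rate in rates.items():
    print(f"{task:8s} {setting:9s} {rate:6.2f}%")
```

With only 3 runs per setting, a single differing run moves the rate by about 33 percentage points, so the cogs gap between Harbor and the original amounts to one run.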
Run this task
CLI:
inspect eval inspect_harbor/rexbench --model openai/gpt-5
Python:
from inspect_ai import eval
from inspect_harbor import rexbench
eval(rexbench(), model="openai/gpt-5")
Dataset information
| Harbor registry | rexbench/rexbench |
| Inspect task | rexbench |
| Latest digest | sha256:bc23e94793e8c74aceb8f6fdb3a268dc834b4699f71b1329db9222e83bb5ac4f |
| Samples | 2 |
| Paper | arxiv |
See Task Parameters for the parameter set shared across all Harbor tasks.