terminal-bench/terminal-bench-2
Coding
Assistants
Terminal-Bench v2 is a benchmark for testing AI agents in real terminal environments, with tasks ranging from compiling code to training models and setting up servers.
Run this task
CLI:
inspect eval inspect_harbor/terminal_bench_2 --model openai/gpt-5

Python:
from inspect_ai import eval
from inspect_harbor import terminal_bench_2
eval(terminal_bench_2(), model="openai/gpt-5")

Dataset information
| Field | Value |
| --- | --- |
| Harbor registry | terminal-bench/terminal-bench-2 |
| Inspect task | terminal_bench_2 |
| Latest digest | sha256:c6fc2e2382c1dbae99b2d5ecd2f4f4a60c3c01e0d84642d69b4afd92e99d078b |
| Samples | 89 |
| Paper | arxiv |
| Source | https://github.com/laude-institute/terminal-bench |
See Task Parameters for the parameter set shared across all Harbor tasks.
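
The CLI invocation under "Run this task" can also be scripted, for example to smoke-test the task on a handful of the 89 samples before a full run. The following is a minimal sketch, not part of the Harbor or Inspect tooling: `build_command` and `run` are hypothetical helper names, and the sketch assumes the `inspect` executable is on your `PATH` and that you want to cap the sample count with the `--limit` option of `inspect eval`.

```python
import shlex
import subprocess
from typing import List, Optional

# Task path copied from the CLI example in "Run this task".
TASK = "inspect_harbor/terminal_bench_2"


def build_command(model: str, limit: Optional[int] = None) -> List[str]:
    """Return the argv for `inspect eval` on this task (hypothetical helper)."""
    cmd = ["inspect", "eval", TASK, "--model", model]
    if limit is not None:
        # --limit caps how many samples are evaluated; useful for a quick check.
        cmd += ["--limit", str(limit)]
    return cmd


def run(model: str, limit: Optional[int] = None) -> int:
    """Run the evaluation as a subprocess and return its exit code."""
    return subprocess.run(build_command(model, limit)).returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_command("openai/gpt-5", limit=5)))
```

Passing the arguments as a list avoids shell quoting issues; call `run("openai/gpt-5", limit=5)` when you actually want to start the evaluation.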