terminal-bench-pro@1.0
Coding
Terminal-Bench Pro (Public Set) is an extended benchmark dataset for testing AI agents in real terminal environments. From compiling code to training models and setting up servers, Terminal-Bench Pro evaluates how well agents can handle real-world, end-to-end tasks autonomously.
Run this task
CLI:
inspect eval inspect_harbor/terminal_bench_pro_1_0 --model openai/gpt-5
Python:
from inspect_ai import eval
from inspect_harbor import terminal_bench_pro_1_0
eval(terminal_bench_pro_1_0(), model="openai/gpt-5")
Dataset information
| Harbor registry | terminal-bench-pro@1.0 |
| Inspect task | terminal_bench_pro_1_0 |
| Version | 1.0 |
| Samples | 200 |
| Paper | arxiv |
See Task Parameters for the parameter set shared across all Harbor tasks.
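The table above pairs the Harbor registry ID `terminal-bench-pro@1.0` with the Inspect task name `terminal_bench_pro_1_0`. The sketch below shows one way to derive the task name and the matching CLI command from a registry ID. The rule it uses (each run of non-identifier characters becomes a single underscore) is an assumption inferred from this one pair, not a documented Harbor rule, and the helpers `inspect_task_name` and `inspect_eval_command` are hypothetical names that are not part of `inspect_harbor`.

```python
import re
import shlex


def inspect_task_name(registry_id: str) -> str:
    """Turn a Harbor registry ID into an identifier-style task name.

    Assumed convention (inferred from the single pair on this page):
    each run of characters other than letters, digits, or underscores
    is replaced by one underscore.
    """
    return re.sub(r"[^0-9A-Za-z_]+", "_", registry_id).strip("_")


def inspect_eval_command(registry_id: str, model: str) -> str:
    """Build the `inspect eval` command line in the form shown under CLI."""
    task = f"inspect_harbor/{inspect_task_name(registry_id)}"
    # shlex.join quotes arguments only when the shell would need it.
    return shlex.join(["inspect", "eval", task, "--model", model])


command = inspect_eval_command("terminal-bench-pro@1.0", "openai/gpt-5")
print(command)
```

For the registry ID on this page, the printed command matches the CLI invocation listed under "Run this task". Other Harbor tasks may not follow the same naming pattern, so check the registry entry before relying on it.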