terminal-bench/terminal-bench-2

Coding
Assistants

Terminal-Bench v2 is a benchmark for testing AI agents in real terminal environments, with tasks ranging from compiling code to training models and setting up servers.

Run this task

CLI:

inspect eval inspect_harbor/terminal_bench_2 --model openai/gpt-5

Python:

from inspect_ai import eval
from inspect_harbor import terminal_bench_2

eval(terminal_bench_2(), model="openai/gpt-5")
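
Scripted runs (for example, comparing several models) may find it convenient to assemble the CLI invocation programmatically. The following is a minimal sketch, not part of inspect_harbor: build_eval_command is a hypothetical helper name, and the optional --limit flag (which restricts how many samples Inspect evaluates) is included on the assumption that a quick smoke test over a few samples is wanted before a full run.

```python
import shlex

# Task path taken from the CLI example above.
TASK = "inspect_harbor/terminal_bench_2"


def build_eval_command(model, limit=None):
    """Return the argv list for an `inspect eval` run of this task.

    `build_eval_command` is a hypothetical helper for illustration only.
    """
    argv = ["inspect", "eval", TASK, "--model", model]
    if limit is not None:
        # Assumption: a partial run is wanted; --limit caps the sample count.
        argv += ["--limit", str(limit)]
    return argv


# Print a shell-ready command line for a 5-sample smoke test.
print(shlex.join(build_eval_command("openai/gpt-5", limit=5)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues; shlex.join is used here only to display the command.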

Dataset information

Harbor registry: terminal-bench/terminal-bench-2
Inspect task: terminal_bench_2
Latest digest: sha256:c6fc2e2382c1dbae99b2d5ecd2f4f4a60c3c01e0d84642d69b4afd92e99d078b
Samples: 89
Paper: arxiv
Source: https://github.com/laude-institute/terminal-bench
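
When recording the digest alongside results, a quick format check can catch copy-paste truncation. The sketch below assumes the digest follows the common algorithm:hex form with a 64-character lowercase SHA-256 hex string; parse_digest is a hypothetical helper, and it checks the format only, not the dataset contents.

```python
import re

# Digest value copied from the dataset information above.
DIGEST = "sha256:c6fc2e2382c1dbae99b2d5ecd2f4f4a60c3c01e0d84642d69b4afd92e99d078b"


def parse_digest(digest):
    """Split an `algorithm:hex` digest and validate its shape.

    Hypothetical helper for illustration; it does not verify any data.
    """
    algorithm, sep, hex_part = digest.partition(":")
    if sep != ":" or algorithm != "sha256":
        raise ValueError(f"expected a 'sha256:' prefix, got {digest!r}")
    # A SHA-256 hash is 32 bytes, i.e. 64 hexadecimal characters.
    if not re.fullmatch(r"[0-9a-f]{64}", hex_part):
        raise ValueError("expected 64 lowercase hex characters after 'sha256:'")
    return algorithm, hex_part


algorithm, hex_part = parse_digest(DIGEST)
print(algorithm, len(hex_part))
```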

See Task Parameters for the parameter set shared across all Harbor tasks.