swt-bench/swt-bench-verified

Coding

SWT-Bench Verified is a human-validated subset of SWT-Bench that evaluates LLMs on generating unit tests that reproduce real GitHub issues: a generated test must fail on the buggy code and pass after the fix is applied.
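The fail-then-pass criterion can be illustrated with a toy sketch. Everything here is hypothetical and for illustration only: `buggy_mean`, `fixed_mean`, `candidate_test`, and `is_fail_to_pass` are made-up names, not part of SWT-Bench or Inspect Harbor, and the benchmark's own harness works on real repositories rather than in-process functions.

```python
from typing import Callable

MeanFn = Callable[[list[float]], float]


def buggy_mean(values: list[float]) -> float:
    # Hypothetical pre-fix code: off-by-one error in the divisor.
    return sum(values) / (len(values) + 1)


def fixed_mean(values: list[float]) -> float:
    # Hypothetical post-fix code.
    return sum(values) / len(values)


def candidate_test(mean: MeanFn) -> None:
    # Stand-in for a model-generated test that reproduces the issue.
    assert mean([2.0, 4.0]) == 3.0


def passes(test: Callable[[MeanFn], None], impl: MeanFn) -> bool:
    # A test "passes" if it runs to completion without raising.
    try:
        test(impl)
    except Exception:
        return False
    return True


def is_fail_to_pass(
    test: Callable[[MeanFn], None], buggy: MeanFn, fixed: MeanFn
) -> bool:
    # The test must fail on the buggy code AND pass on the fixed code.
    return (not passes(test, buggy)) and passes(test, fixed)


if __name__ == "__main__":
    print(is_fail_to_pass(candidate_test, buggy_mean, fixed_mean))
```

Under this check, a test that always passes (or always fails) does not count as reproducing the issue, because it does not distinguish the buggy code from the fixed code.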

Run this task

CLI:

inspect eval inspect_harbor/swt_bench_verified --model openai/gpt-5

Python:

from inspect_ai import eval
from inspect_harbor import swt_bench_verified

eval(swt_bench_verified(), model="openai/gpt-5")

Dataset information

Harbor registry: swt-bench/swt-bench-verified
Inspect task: swt_bench_verified
Latest digest: sha256:0263229761bc27de1c646dd5d41f46073653d6cadb4efdeda6c05c879551c94a
Samples: 433
Paper: arxiv
Source: https://github.com/logic-star-ai/swt-bench

See Task Parameters for the parameter set shared across all Harbor tasks.