swt-bench/swt-bench-verified
Coding
SWT-Bench Verified: human-validated subset of SWT-Bench evaluating LLMs on generating reproducing unit tests for real GitHub issues — tests must fail on buggy code and pass after the fix.
Run this task
CLI:
inspect eval inspect_harbor/swt_bench_verified --model openai/gpt-5
Python:
from inspect_ai import eval
from inspect_harbor import swt_bench_verified
eval(swt_bench_verified(), model="openai/gpt-5")
Dataset information
| Field | Value |
| --- | --- |
| Harbor registry | swt-bench/swt-bench-verified |
| Inspect task | swt_bench_verified |
| Latest digest | sha256:0263229761bc27de1c646dd5d41f46073653d6cadb4efdeda6c05c879551c94a |
| Samples | 433 |
| Paper | arxiv |
| Source | https://github.com/logic-star-ai/swt-bench |
See Task Parameters for the parameter set shared across all Harbor tasks.