swebenchpro@1.0
Coding
SWE-bench Pro: A multi-language software engineering benchmark with 731 instances covering Python, JavaScript/TypeScript, and Go. Evaluates AI systems’ ability to resolve real-world bugs and implement features across diverse production codebases.
Run this task
CLI:

```bash
inspect eval inspect_harbor/swebenchpro_1_0 --model openai/gpt-5
```

Python:

```python
from inspect_ai import eval
from inspect_harbor import swebenchpro_1_0

eval(swebenchpro_1_0(), model="openai/gpt-5")
```

Dataset information
|  |  |
| --- | --- |
| Harbor registry | `swebenchpro@1.0` |
| Inspect task | `swebenchpro_1_0` |
| Version | 1.0 |
| Samples | 731 |
| Paper | arXiv |
See Task Parameters for the set of parameters shared across all Harbor tasks.