Registry

This page lists all Harbor datasets available as Inspect tasks.

Usage

CLI:

inspect eval inspect_harbor/aider_polyglot --model openai/gpt-5

Python:

from inspect_ai import eval
from inspect_harbor import aider_polyglot

eval(aider_polyglot(), model="openai/gpt-5")
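
To run several registry entries in one go, the CLI form above can be generated from a list of task names. This is a minimal sketch: the helper `inspect_eval_command` is hypothetical (not part of `inspect_harbor`), and it only builds the command lines shown in the CLI example rather than executing them.

```python
import shlex


def inspect_eval_command(task: str, model: str) -> list[str]:
    """Build the argv for `inspect eval inspect_harbor/<task> --model <model>`.

    Hypothetical helper for illustration; it mirrors the CLI usage line only.
    """
    return ["inspect", "eval", f"inspect_harbor/{task}", "--model", model]


# Task names copied from the "Inspect Task" column of the registry.
tasks = ["aider_polyglot", "aime", "gpqa_diamond"]
commands = [inspect_eval_command(task, "openai/gpt-5") for task in tasks]

for argv in commands:
    # shlex.join renders the argv list as a copy-pasteable shell command.
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the argv list can be handed to `subprocess.run` if the evaluations should be launched from Python.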

Available Datasets

| Harbor Dataset | Inspect Task | Description | Samples |
| --- | --- | --- | --- |
| LiteCoder/LiteCoder-rl | litecoder_rl | LiteCoder: terminal-based RL training environments spanning developer workflows, scientific/numeric… | 602 |
| MichaelY310/devopsgym | michaely310_devopsgym | DevOps-Gym benchmark adapted to Harbor format - 729 tasks across 5 categories: Build, Monitoring, I… | 728 |
| aarr/aarri-bench | aarr_aarri_bench | aarri(act as a real research intern)-bench, for evaluating LLM agents in academic research tasks. | 82 |
| abundant/swe-gen-cpp | abundant_swe_gen_cpp | Dataset of C++ SWE tasks. Generated by abundant-ai/SWE-gen tool. | 999 |
| abundant/swe-gen-go | abundant_swe_gen_go | Dataset of Go SWE tasks. Generated by abundant-ai/SWE-gen tool. | 1000 |
| abundant/swe-gen-java | abundant_swe_gen_java | Dataset of Java SWE tasks. Generated by abundant-ai/SWE-gen tool. | 1000 |
| abundant/swe-gen-js | abundant_swe_gen_js | Dataset of JS/TS SWE tasks. Generated by abundant-ai/SWE-gen tool. | 1000 |
| abundant/swe-gen-rust | abundant_swe_gen_rust | Dataset of Rust SWE tasks. Generated by abundant-ai/SWE-gen tool. | 1000 |
| actava-ai/chi-bench | actava_ai_chi_bench | χ-Bench: long-horizon, policy-rich U.S. healthcare workflow agent benchmark spanning provider prior… | 78 |
| adyen/dabstep | adyen_dabstep | DABstep: real-world data analysis tasks from Adyen’s workloads requiring multi-step reasoning by LL… | 450 |
| agentic-labs/erp-bench | agentic_labs_erp_bench | ERP-Bench is the Odoo 19 benchmark used in the Anchor paper, “Preventing Artifact Drift in Agent Be… | 300 |
| ai-forever/harness-bench-fast | ai_forever_harness_bench_fast | Self-contained file-operation agent benchmark. | 231 |
| aider/aider-polyglot | aider_polyglot | Aider’s polyglot coding benchmark: Exercism exercises across C++, Go, Java, JavaScript, Python, and… | 225 |
| aime/aime | aime | Problems from the American Invitational Mathematics Examination (AIME), a 3-hour high-school compet… | 60 |
| algotune/algotune | algotune | AlgoTune: NeurIPS 2025 benchmark of math/physics/CS problems where the model writes code that match… | 154 |
| apple/mmau | apple_mmau | MMAU (Massive Multitask Agent Understanding): Apple’s holistic agent benchmark covering tool-use, D… | 1000 |
| arcprize/arc-agi-2 | arcprize_arc_agi_2 | ARC-AGI-2: visual reasoning tasks testing general fluid intelligence — humans solve them easily but… | 167 |
| bigcode/bigcodebench-hard-complete | bigcode_bigcodebench_hard_complete | BigCodeBench-Hard (Complete split): hard subset evaluating LLMs on code generation with diverse fun… | 145 |
| bigcode/humanevalfix | bigcode_humanevalfix | HumanEvalFix (OctoPack): buggy functions across Python, JavaScript, Java, Go, C++, and Rust that mo… | 164 |
| binary-audit/binary-audit | binary_audit | BinaryAudit: AI-agent benchmark for finding backdoors hidden in compiled binaries via reverse engin… | 46 |
| cais/swebenchpro | cais_swebenchpro | SWE-bench Pro with anti-exploitation (git history isolation + GitHub network blocking). 731 tasks,… | 731 |
| camel-ai/seta-env | camel_ai_seta_env | SETA (Scaling Environments for Terminal Agents): CAMEL-AI’s verifiable terminal-agent tasks spannin… | 1000 |
| cmu/refav | cmu_refav | Autonomous-vehicle scenario mining via VLM. | 1000 |
| codepde/codepde | codepde | CodePDE: framing partial-differential-equation solving as a code-generation task to benchmark LLMs… | 5 |
| crustbench/crustbench | crustbench | CRUST-Bench: real-world C repositories paired with hand-written safe-Rust interfaces and tests, ben… | 100 |
| datacurve/deep-swe | datacurve_deep_swe | DeepSWE: Measuring frontier coding agents on original, long-horizon engineering tasks. | 113 |
| dbt-labs/ade-bench | dbt_labs_ade_bench | Analytics Data Engineer Bench: dbt and SQL data-engineering tasks across DuckDB and Snowflake backe… | 48 |
| deveval/deveval | deveval | DevEval: manually-annotated code-generation samples from real-world Python repositories, aligned to… | 63 |
| evoeval/evoeval | evoeval | EvoEval: evolving suite that mutates HumanEval problems along several axes (difficulty, creative, s… | 100 |
| factory-ai/legacy-bench | factory_ai_legacy_bench | Legacy-Bench public sample tasks for evaluating AI coding agents on legacy software engineering tas… | 10 |
| featurebench/featurebench | featurebench | FeatureBench: agentic coding on end-to-end feature-development tasks derived from open-source repos… | 200 |
| featurebench/featurebench-lite | featurebench_lite | Lightweight subset of FeatureBench for cheaper evaluation while preserving model rankings. | 30 |
| featurebench/featurebench-lite-modal | featurebench_lite_modal | FeatureBench-Lite executed on Modal’s cloud sandbox runner. | 30 |
| featurebench/featurebench-modal | featurebench_modal | FeatureBench’s full task suite executed on Modal’s cloud sandbox runner. | 200 |
| futurehouse/bixbench | futurehouse_bixbench | BixBench: real-world bioinformatics analysis capsules with open-answer questions evaluating LLM age… | 205 |
| futurehouse/bixbench-cli | futurehouse_bixbench_cli | CLI variant of BixBench: agents solve the same bioinformatics analysis tasks via a command-line / s… | 205 |
| futurehouse/labbench | futurehouse_labbench | LAB-Bench (Language Agent Biology Benchmark): questions across 8 categories (literature QA, databas… | 181 |
| gabeorlanski/slopcodebench | gabeorlanski_slopcodebench | SlopCodeBench multi-checkpoint coding benchmark tasks converted for Harbor. | 36 |
| gaia/gaia | gaia | GAIA: real-world questions across three difficulty levels evaluating general AI assistants on reaso… | 165 |
| gnucleus-ai/cad-bench | gnucleus_ai_cad_bench | gNucleus AI CAD-generation benchmark — 100 parametric FreeCAD tasks. | 100 |
| gorilla/bfcl | gorilla_bfcl | Berkeley Function-Calling Leaderboard: LLM tool-use across function-calling categories spanning Pyt… | 1000 |
| gorilla/bfcl_parity | gorilla_bfcl_parity | Stratified parity subset of BFCL validating that Harbor’s adapter matches the upstream implementati… | 123 |
| gpqa-diamond/gpqa-diamond | gpqa_diamond | GPQA Diamond: expert-validated graduate-level multiple-choice questions in biology, physics, and ch… | 198 |
| grafana/o11y-bench | grafana_o11y_bench | o11y-bench: an open agentic observability benchmark. Measures how well AI agents perform 63 real-wo… | 63 |
| harbor/rewardhackbench | harbor_rewardhackbench | RewardHackBench: judge benchmark for detecting reward hacking in agent trajectories — each trace is… | 846 |
| harveyai/lab | harveyai_lab | Harvey LAB - open-source benchmark for evaluating agents on real legal work. | 1000 |
| ineqmath/ineqmath | ineqmath | IneqMath: Olympiad-level inequality benchmark with expert-reviewed test problems, formulated as bou… | 100 |
| ivanleo/agent-search | ivanleo_agent_search | Agent answers Gemini API questions by querying an indexed documentation database. | 20 |
| kgmon/deepsearchqa | kgmon_deepsearchqa | DeepSearchQA is a 900-prompt factuality benchmark from Google DeepMind for evaluating deep research… | 900 |
| kumo/kumo-1 | kumo_1 | KUMO (kumo-1 split): procedurally-generated multi-turn reasoning games combining LLMs with symbolic… | 1000 |
| kumo/kumo-easy | kumo_easy | KUMO (easy split): easier-difficulty procedurally-generated reasoning tasks from KUMO’s benchmark a… | 1000 |
| kumo/kumo-hard | kumo_hard | KUMO (hard split): hard-difficulty procedurally-generated reasoning tasks from KUMO’s benchmark acr… | 250 |
| kumo/kumo-parity | kumo_parity | KUMO (parity split): subset of the KUMO procedural-reasoning benchmark used for parity / regression… | 212 |
| lawbench/lawbench | lawbench | LawBench: tasks evaluating LLMs on Chinese-law knowledge — legal entity recognition, reading compre… | 1000 |
| lcb/longswebench-32k | lcb_longswebench_32k | LongCodeBench (LCB) LongSWE-Bench tasks — 32k context window bucket. | 3 |
| lica-world/gdb | lica_world_gdb | GraphicDesignBench (GDB): evaluating AI on graphic design tasks across layout, typography, infograp… | 1000 |
| livecodebench/livecodebench | livecodebench | LiveCodeBench: contamination-free coding benchmark continuously collected from LeetCode, AtCoder, a… | 100 |
| maxbittker/runebench | maxbittker_runebench | Benchmark suite for evaluating AI agents on RuneScape gameplay tasks. | 32 |
| meta/mlgym-bench | meta_mlgym_bench | MLGym-Bench: Meta’s framework and benchmark for AI research agents covering CV, NLP, RL, and game-t… | 12 |
| minnesotanlp/aar | minnesotanlp_aar | The Amazing Agent Race (AAR): 1400 multi-step scavenger-hunt puzzles for evaluating LLM agents on t… | 1000 |
| mmtb/multimedia-terminalbench | mmtb_multimedia_terminalbench | MultiMedia-TerminalBench (MMTB): a benchmark of 105 realistic multimedia-file tasks in persistent t… | 105 |
| nvats/codeskills-bench | nvats_codeskills_bench | A small set of real-life programming tasks: bug fixes, merge-conflict resolution, dependency cleanu… | 23 |
| openai/mmmlu | openai_mmmlu | MMMLU (Multilingual MMLU): OpenAI’s professional-human-translation of the MMLU test set into 14 lan… | 150 |
| openai/simpleqa | openai_simpleqa | SimpleQA: short, fact-seeking questions adversarially collected against GPT-4 to measure short-form… | 1000 |
| openai/swe-lancer-diamond-all | openai_swe_lancer_diamond_all | SWE-Lancer Diamond (full): public split of OpenAI’s SWE-Lancer benchmark — real Upwork freelance so… | 463 |
| openai/swe-lancer-diamond-ic | openai_swe_lancer_diamond_ic | A benchmark of freelance software engineering tasks from Upwork, valued at $1 million USD total in… | 198 |
| openai/swe-lancer-diamond-manager | openai_swe_lancer_diamond_manager | A benchmark of freelance software engineering tasks from Upwork, valued at $1 million USD total in… | 265 |
| pgcodellm/rebench-v2-test | pgcodellm_rebench_v2_test | SWE-rebench V2: language-agnostic dataset of executable SWE tasks across 20 languages, with pre-bui… | 20 |
| qcircuitbench/qcircuitbench | qcircuitbench | QCircuitBench: large-scale benchmark for LLM-driven quantum-algorithm design, spanning oracle const… | 28 |
| quesma/compilebench | quesma_compilebench | CompileBench: real-world build/compile tasks (curl, GNU coreutils, jq, etc.) ranging from easy buil… | 15 |
| quesma/otel-bench | quesma_otel_bench | AI-agent benchmark for OpenTelemetry instrumentation tasks across multiple programming languages. | 26 |
| quixbugs/quixbugs | quixbugs | QuixBugs: small classic-algorithm programs (Python and Java) each containing a one-line bug, used t… | 80 |
| reasoning-gym/reasoning-gym-easy | reasoning_gym_easy | Reasoning Gym (easy split): procedurally-generated, algorithmically-verifiable reasoning tasks (alg… | 288 |
| reasoning-gym/reasoning-gym-hard | reasoning_gym_hard | Reasoning Gym (hard split): procedurally-generated, algorithmically-verifiable reasoning tasks at h… | 288 |
| replicationbench/replicationbench | replicationbench | ReplicationBench: end-to-end replication of astrophysics research papers — agents reproduce impleme… | 90 |
| rexbench/rexbench | rexbench | RExBench - 2 tasks (cogs, othello) evaluating AI agents’ ability to extend existing AI research thr… | 2 |
| satbench/satbench | satbench | SATBench: logical-reasoning puzzles automatically generated from SAT formulas with adjustable diffi… | 1000 |
| scale-ai/hil-bench | scale_ai_hil_bench | HiL-Bench (Human-in-the-Loop): tests if agents know when to ask for help rather than proceed with u… | 600 |
| scale-ai/swe-atlas-qna | scale_ai_swe_atlas_qna | SWE-Atlas - Codebase QnA is a benchmark of deep codebase comprehension and QnA problems for coding… | 124 |
| scale-ai/swe-atlas-rf | scale_ai_swe_atlas_rf | SWE-Atlas - Refactoring – A benchmark of refactoring tasks for coding agents. | 70 |
| scale-ai/swe-atlas-tw | scale_ai_swe_atlas_tw | SWE-Atlas - Test Writing – A benchmark of comprehensive test writing problems for coding agents. C… | 90 |
| scale-ai/swe-bench-pro | scale_ai_swe_bench_pro | SWE-Bench-Pro: long-horizon enterprise software engineering tasks. | 731 |
| scienceagentbench/scienceagentbench | scienceagentbench | ScienceAgentBench: data-driven scientific discovery via Python programs across 4 disciplines. | 102 |
| sierra-research/tau3-bench | sierra_research_tau3_bench | Third generation of τ-bench, extending the original with knowledge and voice. A simulation framewor… | 375 |
| sldbench/sldbench | sldbench | SLDBench: first benchmark for scaling-law discovery — tasks curated from LLM training experiments w… | 8 |
| stanford/medagentbench | stanford_medagentbench | MedAgentBench: clinically-relevant tasks across 10 categories in a FHIR-compliant virtual EHR, benc… | 300 |
| strongreject/strongreject | strongreject | StrongREJECT: forbidden prompts plus an automated evaluator for measuring how effective jailbreaks… | 150 |
| swe-bench/swe-bench-verified | swe_bench_verified | SWE-bench Verified: human-filtered subset of SWE-bench (collaboration with OpenAI) where human SWEs… | 500 |
| swe-bench/swe-smith | swe_bench_swe_smith | SWE-smith: NeurIPS 2025 toolkit for synthesizing unlimited SWE-bench-style task instances from any… | 100 |
| swt-bench/swt-bench-verified | swt_bench_verified | SWT-Bench Verified: human-validated subset of SWT-Bench evaluating LLMs on generating reproducing u… | 433 |
| tencent/autocodebench | tencent_autocodebench | Multilingual automated code generation benchmark evaluating LLMs across diverse programming tasks a… | 200 |
| termigen/termigen-environments | termigen_environments | TermiGen-Environments: verified Docker environments with executable terminal-agent tasks across 11… | 1000 |
| terminal-bench-pro/terminal-bench-pro | terminal_bench_pro | Terminal-Bench Pro: tasks across 8 domains — data processing, games, debugging, sysadmin, scientifi… | 200 |
| terminal-bench/terminal-bench-2 | terminal_bench_2 | Terminal-Bench v2: benchmark for testing AI agents in real terminal environments — from compiling c… | 89 |
| terminal-bench/terminal-bench-2-1 | terminal_bench_2_1 | Terminal-Bench v2.1 (point release of v2): benchmark for testing AI agents in real terminal environ… | 89 |
| theagentcompany/theagentcompany | theagentcompany | An agent benchmark with tasks in a simulated software company across GitLab, Plane, OwnCloud, and R… | 174 |
| thetalab/vector-edit-gym | thetalab_vector_edit_gym | 106 hand-authored SVG editing tasks across four difficulty tiers (easy / medium / hard / very_hard)… | 106 |
| usaco/usaco | usaco | USACO: USA Computing Olympiad problems across bronze/silver/gold/platinum tiers with high-quality u… | 304 |
| vals/financeagent | vals_financeagent | Vals AI Finance Agent Benchmark: expert-validated finance questions across nine task categories (re… | 50 |
| vmax/vmax-tasks | vmax_tasks | Code-transformation tasks across JavaScript projects (Docusaurus, Vue, Redux). | 1000 |
| webgen-bench/webgen-bench | webgen_bench | WebGen-Bench: evaluating LLMs on generating interactive and functional websites from scratch. | 101 |
| xlang/ds-1000 | xlang_ds_1000 | DS-1000: data-science code-generation problems from StackOverflow across NumPy, Pandas, TensorFlow,… | 1000 |
| yanagiorigami/frontier-cs | yanagiorigami_frontier_cs | Frontier-CS competitive programming benchmark: 172 open-ended algorithmic problems with partial sco… | 172 |
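
The task names in the listing appear to follow a regular pattern relative to the dataset names: the name is lowercased, hyphens become underscores, and the organisation prefix is dropped when the dataset name already starts with it (for example `aider/aider-polyglot` becomes `aider_polyglot`, while `swe-bench/swe-smith` becomes `swe_bench_swe_smith`). The sketch below encodes that observed pattern. It is inferred from the rows above, not taken from the `inspect_harbor` source, and `guess_task_name` is a hypothetical helper, so the listing remains the authoritative mapping.

```python
def guess_task_name(dataset: str) -> str:
    """Guess the Inspect task name for a Harbor dataset such as 'org/name'.

    Assumption: this mirrors the pattern visible in the registry listing;
    it is not the package's documented rule.
    """
    org, name = dataset.lower().split("/", 1)
    # Drop the organisation when the dataset name repeats it
    # (either exactly, or as a hyphen-separated prefix).
    if name == org or name.startswith(org + "-"):
        slug = name
    else:
        slug = f"{org}_{name}"
    return slug.replace("-", "_")


# Spot checks against rows of the registry.
examples = {
    "aider/aider-polyglot": "aider_polyglot",
    "swe-bench/swe-smith": "swe_bench_swe_smith",
    "LiteCoder/LiteCoder-rl": "litecoder_rl",
    "aarr/aarri-bench": "aarr_aarri_bench",
    "gorilla/bfcl_parity": "gorilla_bfcl_parity",
}
for dataset, expected in examples.items():
    assert guess_task_name(dataset) == expected
```

Note that the prefix is dropped only when it is followed by a hyphen, which is why `aarr/aarri-bench` keeps its organisation prefix.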