Registry
All Harbor datasets available as Inspect tasks. Use the search box to filter by name or description, the category chips to filter by topic, and the column headers to sort. Click a dataset’s name to open its details page.
Usage
CLI:
inspect eval inspect_harbor/aider_polyglot --model openai/gpt-5
Python:
from inspect_ai import eval
from inspect_harbor import aider_polyglot
eval(aider_polyglot(), model="openai/gpt-5")
Available Datasets
| Harbor Dataset | Inspect Task | Description | Samples |
|---|---|---|---|
| LiteCoder/LiteCoder-rl | litecoder_rl | LiteCoder: terminal-based RL training environments spanning developer workflows, scientific/numeric… | 602 |
| MichaelY310/devopsgym | michaely310_devopsgym | DevOps-Gym benchmark adapted to Harbor format - 729 tasks across 5 categories: Build, Monitoring, I… | 728 |
| aarr/aarri-bench | aarr_aarri_bench | aarri-bench (act as a real research intern): evaluates LLM agents on academic research tasks. | 82 |
| abundant/swe-gen-cpp | abundant_swe_gen_cpp | Dataset of C++ SWE tasks. Generated by abundant-ai/SWE-gen tool. | 999 |
| abundant/swe-gen-go | abundant_swe_gen_go | Dataset of Go SWE tasks. Generated by abundant-ai/SWE-gen tool. | 1000 |
| abundant/swe-gen-java | abundant_swe_gen_java | Dataset of Java SWE tasks. Generated by abundant-ai/SWE-gen tool. | 1000 |
| abundant/swe-gen-js | abundant_swe_gen_js | Dataset of JS/TS SWE tasks. Generated by abundant-ai/SWE-gen tool. | 1000 |
| abundant/swe-gen-rust | abundant_swe_gen_rust | Dataset of Rust SWE tasks. Generated by abundant-ai/SWE-gen tool. | 1000 |
| actava-ai/chi-bench | actava_ai_chi_bench | χ-Bench: long-horizon, policy-rich U.S. healthcare workflow agent benchmark spanning provider prior… | 78 |
| adyen/dabstep | adyen_dabstep | DABstep: real-world data analysis tasks from Adyen’s workloads requiring multi-step reasoning by LL… | 450 |
| agentic-labs/erp-bench | agentic_labs_erp_bench | ERP-Bench is the Odoo 19 benchmark used in the Anchor paper, “Preventing Artifact Drift in Agent Be… | 300 |
| ai-forever/harness-bench-fast | ai_forever_harness_bench_fast | Self-contained file-operation agent benchmark. | 231 |
| aider/aider-polyglot | aider_polyglot | Aider’s polyglot coding benchmark: Exercism exercises across C++, Go, Java, JavaScript, Python, and… | 225 |
| aime/aime | aime | Problems from the American Invitational Mathematics Examination (AIME), a 3-hour high-school compet… | 60 |
| algotune/algotune | algotune | AlgoTune: NeurIPS 2025 benchmark of math/physics/CS problems where the model writes code that match… | 154 |
| apple/mmau | apple_mmau | MMAU (Massive Multitask Agent Understanding): Apple’s holistic agent benchmark covering tool-use, D… | 1000 |
| arcprize/arc-agi-2 | arcprize_arc_agi_2 | ARC-AGI-2: visual reasoning tasks testing general fluid intelligence — humans solve them easily but… | 167 |
| bigcode/bigcodebench-hard-complete | bigcode_bigcodebench_hard_complete | BigCodeBench-Hard (Complete split): hard subset evaluating LLMs on code generation with diverse fun… | 145 |
| bigcode/humanevalfix | bigcode_humanevalfix | HumanEvalFix (OctoPack): buggy functions across Python, JavaScript, Java, Go, C++, and Rust that mo… | 164 |
| binary-audit/binary-audit | binary_audit | BinaryAudit: AI-agent benchmark for finding backdoors hidden in compiled binaries via reverse engin… | 46 |
| cais/swebenchpro | cais_swebenchpro | SWE-bench Pro with anti-exploitation (git history isolation + GitHub network blocking). 731 tasks,… | 731 |
| camel-ai/seta-env | camel_ai_seta_env | SETA (Scaling Environments for Terminal Agents): CAMEL-AI’s verifiable terminal-agent tasks spannin… | 1000 |
| cmu/refav | cmu_refav | Autonomous-vehicle scenario mining via VLM. | 1000 |
| codepde/codepde | codepde | CodePDE: framing partial-differential-equation solving as a code-generation task to benchmark LLMs… | 5 |
| crustbench/crustbench | crustbench | CRUST-Bench: real-world C repositories paired with hand-written safe-Rust interfaces and tests, ben… | 100 |
| datacurve/deep-swe | datacurve_deep_swe | DeepSWE: Measuring frontier coding agents on original, long-horizon engineering tasks. | 113 |
| dbt-labs/ade-bench | dbt_labs_ade_bench | Analytics Data Engineer Bench: dbt and SQL data-engineering tasks across DuckDB and Snowflake backe… | 48 |
| deveval/deveval | deveval | DevEval: manually-annotated code-generation samples from real-world Python repositories, aligned to… | 63 |
| evoeval/evoeval | evoeval | EvoEval: evolving suite that mutates HumanEval problems along several axes (difficulty, creative, s… | 100 |
| factory-ai/legacy-bench | factory_ai_legacy_bench | Legacy-Bench public sample tasks for evaluating AI coding agents on legacy software engineering tas… | 10 |
| featurebench/featurebench | featurebench | FeatureBench: agentic coding on end-to-end feature-development tasks derived from open-source repos… | 200 |
| featurebench/featurebench-lite | featurebench_lite | Lightweight subset of FeatureBench for cheaper evaluation while preserving model rankings. | 30 |
| featurebench/featurebench-lite-modal | featurebench_lite_modal | FeatureBench-Lite executed on Modal’s cloud sandbox runner. | 30 |
| featurebench/featurebench-modal | featurebench_modal | FeatureBench’s full task suite executed on Modal’s cloud sandbox runner. | 200 |
| futurehouse/bixbench | futurehouse_bixbench | BixBench: real-world bioinformatics analysis capsules with open-answer questions evaluating LLM age… | 205 |
| futurehouse/bixbench-cli | futurehouse_bixbench_cli | CLI variant of BixBench: agents solve the same bioinformatics analysis tasks via a command-line / s… | 205 |
| futurehouse/labbench | futurehouse_labbench | LAB-Bench (Language Agent Biology Benchmark): questions across 8 categories (literature QA, databas… | 181 |
| gabeorlanski/slopcodebench | gabeorlanski_slopcodebench | SlopCodeBench multi-checkpoint coding benchmark tasks converted for Harbor. | 36 |
| gaia/gaia | gaia | GAIA: real-world questions across three difficulty levels evaluating general AI assistants on reaso… | 165 |
| gnucleus-ai/cad-bench | gnucleus_ai_cad_bench | gNucleus AI CAD-generation benchmark — 100 parametric FreeCAD tasks. | 100 |
| gorilla/bfcl | gorilla_bfcl | Berkeley Function-Calling Leaderboard: LLM tool-use across function-calling categories spanning Pyt… | 1000 |
| gorilla/bfcl_parity | gorilla_bfcl_parity | Stratified parity subset of BFCL validating that Harbor’s adapter matches the upstream implementati… | 123 |
| gpqa-diamond/gpqa-diamond | gpqa_diamond | GPQA Diamond: expert-validated graduate-level multiple-choice questions in biology, physics, and ch… | 198 |
| grafana/o11y-bench | grafana_o11y_bench | o11y-bench: an open agentic observability benchmark. Measures how well AI agents perform 63 real-wo… | 63 |
| harbor/rewardhackbench | harbor_rewardhackbench | RewardHackBench: judge benchmark for detecting reward hacking in agent trajectories — each trace is… | 846 |
| harveyai/lab | harveyai_lab | Harvey LAB - open-source benchmark for evaluating agents on real legal work. | 1000 |
| ineqmath/ineqmath | ineqmath | IneqMath: Olympiad-level inequality benchmark with expert-reviewed test problems, formulated as bou… | 100 |
| ivanleo/agent-search | ivanleo_agent_search | Agent answers Gemini API questions by querying an indexed documentation database. | 20 |
| kgmon/deepsearchqa | kgmon_deepsearchqa | DeepSearchQA is a 900-prompt factuality benchmark from Google DeepMind for evaluating deep research… | 900 |
| kumo/kumo-1 | kumo_1 | KUMO (kumo-1 split): procedurally-generated multi-turn reasoning games combining LLMs with symbolic… | 1000 |
| kumo/kumo-easy | kumo_easy | KUMO (easy split): easier-difficulty procedurally-generated reasoning tasks from KUMO’s benchmark a… | 1000 |
| kumo/kumo-hard | kumo_hard | KUMO (hard split): hard-difficulty procedurally-generated reasoning tasks from KUMO’s benchmark acr… | 250 |
| kumo/kumo-parity | kumo_parity | KUMO (parity split): subset of the KUMO procedural-reasoning benchmark used for parity / regression… | 212 |
| lawbench/lawbench | lawbench | LawBench: tasks evaluating LLMs on Chinese-law knowledge — legal entity recognition, reading compre… | 1000 |
| lcb/longswebench-32k | lcb_longswebench_32k | LongCodeBench (LCB) LongSWE-Bench tasks — 32k context window bucket. | 3 |
| lica-world/gdb | lica_world_gdb | GraphicDesignBench (GDB): evaluating AI on graphic design tasks across layout, typography, infograp… | 1000 |
| livecodebench/livecodebench | livecodebench | LiveCodeBench: contamination-free coding benchmark continuously collected from LeetCode, AtCoder, a… | 100 |
| maxbittker/runebench | maxbittker_runebench | Benchmark suite for evaluating AI agents on RuneScape gameplay tasks. | 32 |
| meta/mlgym-bench | meta_mlgym_bench | MLGym-Bench: Meta’s framework and benchmark for AI research agents covering CV, NLP, RL, and game-t… | 12 |
| minnesotanlp/aar | minnesotanlp_aar | The Amazing Agent Race (AAR): 1400 multi-step scavenger-hunt puzzles for evaluating LLM agents on t… | 1000 |
| mmtb/multimedia-terminalbench | mmtb_multimedia_terminalbench | MultiMedia-TerminalBench (MMTB): a benchmark of 105 realistic multimedia-file tasks in persistent t… | 105 |
| nvats/codeskills-bench | nvats_codeskills_bench | A small set of real-life programming tasks: bug fixes, merge-conflict resolution, dependency cleanu… | 23 |
| openai/mmmlu | openai_mmmlu | MMMLU (Multilingual MMLU): OpenAI’s professional-human-translation of the MMLU test set into 14 lan… | 150 |
| openai/simpleqa | openai_simpleqa | SimpleQA: short, fact-seeking questions adversarially collected against GPT-4 to measure short-form… | 1000 |
| openai/swe-lancer-diamond-all | openai_swe_lancer_diamond_all | SWE-Lancer Diamond (full): public split of OpenAI’s SWE-Lancer benchmark — real Upwork freelance so… | 463 |
| openai/swe-lancer-diamond-ic | openai_swe_lancer_diamond_ic | A benchmark of freelance software engineering tasks from Upwork, valued at $1 million USD total in… | 198 |
| openai/swe-lancer-diamond-manager | openai_swe_lancer_diamond_manager | A benchmark of freelance software engineering tasks from Upwork, valued at $1 million USD total in… | 265 |
| pgcodellm/rebench-v2-test | pgcodellm_rebench_v2_test | SWE-rebench V2: language-agnostic dataset of executable SWE tasks across 20 languages, with pre-bui… | 20 |
| qcircuitbench/qcircuitbench | qcircuitbench | QCircuitBench: large-scale benchmark for LLM-driven quantum-algorithm design, spanning oracle const… | 28 |
| quesma/compilebench | quesma_compilebench | CompileBench: real-world build/compile tasks (curl, GNU coreutils, jq, etc.) ranging from easy buil… | 15 |
| quesma/otel-bench | quesma_otel_bench | AI-agent benchmark for OpenTelemetry instrumentation tasks across multiple programming languages. | 26 |
| quixbugs/quixbugs | quixbugs | QuixBugs: small classic-algorithm programs (Python and Java) each containing a one-line bug, used t… | 80 |
| reasoning-gym/reasoning-gym-easy | reasoning_gym_easy | Reasoning Gym (easy split): procedurally-generated, algorithmically-verifiable reasoning tasks (alg… | 288 |
| reasoning-gym/reasoning-gym-hard | reasoning_gym_hard | Reasoning Gym (hard split): procedurally-generated, algorithmically-verifiable reasoning tasks at h… | 288 |
| replicationbench/replicationbench | replicationbench | ReplicationBench: end-to-end replication of astrophysics research papers — agents reproduce impleme… | 90 |
| rexbench/rexbench | rexbench | RExBench - 2 tasks (cogs, othello) evaluating AI agents’ ability to extend existing AI research thr… | 2 |
| satbench/satbench | satbench | SATBench: logical-reasoning puzzles automatically generated from SAT formulas with adjustable diffi… | 1000 |
| scale-ai/hil-bench | scale_ai_hil_bench | HiL-Bench (Human-in-the-Loop): tests if agents know when to ask for help rather than proceed with u… | 600 |
| scale-ai/swe-atlas-qna | scale_ai_swe_atlas_qna | SWE-Atlas - Codebase QnA is a benchmark of deep codebase comprehension and QnA problems for coding… | 124 |
| scale-ai/swe-atlas-rf | scale_ai_swe_atlas_rf | SWE-Atlas - Refactoring – A benchmark of refactoring tasks for coding agents. | 70 |
| scale-ai/swe-atlas-tw | scale_ai_swe_atlas_tw | SWE-Atlas - Test Writing – A benchmark of comprehensive test writing problems for coding agents. C… | 90 |
| scale-ai/swe-bench-pro | scale_ai_swe_bench_pro | SWE-Bench-Pro: long-horizon enterprise software engineering tasks. | 731 |
| scienceagentbench/scienceagentbench | scienceagentbench | ScienceAgentBench: data-driven scientific discovery via Python programs across 4 disciplines. | 102 |
| sierra-research/tau3-bench | sierra_research_tau3_bench | Third generation of τ-bench, extending the original with knowledge and voice. A simulation framewor… | 375 |
| sldbench/sldbench | sldbench | SLDBench: first benchmark for scaling-law discovery — tasks curated from LLM training experiments w… | 8 |
| stanford/medagentbench | stanford_medagentbench | MedAgentBench: clinically-relevant tasks across 10 categories in a FHIR-compliant virtual EHR, benc… | 300 |
| strongreject/strongreject | strongreject | StrongREJECT: forbidden prompts plus an automated evaluator for measuring how effective jailbreaks… | 150 |
| swe-bench/swe-bench-verified | swe_bench_verified | SWE-bench Verified: human-filtered subset of SWE-bench (collaboration with OpenAI) where human SWEs… | 500 |
| swe-bench/swe-smith | swe_bench_swe_smith | SWE-smith: NeurIPS 2025 toolkit for synthesizing unlimited SWE-bench-style task instances from any… | 100 |
| swt-bench/swt-bench-verified | swt_bench_verified | SWT-Bench Verified: human-validated subset of SWT-Bench evaluating LLMs on generating reproducing u… | 433 |
| tencent/autocodebench | tencent_autocodebench | Multilingual automated code generation benchmark evaluating LLMs across diverse programming tasks a… | 200 |
| termigen/termigen-environments | termigen_environments | TermiGen-Environments: verified Docker environments with executable terminal-agent tasks across 11… | 1000 |
| terminal-bench-pro/terminal-bench-pro | terminal_bench_pro | Terminal-Bench Pro: tasks across 8 domains — data processing, games, debugging, sysadmin, scientifi… | 200 |
| terminal-bench/terminal-bench-2 | terminal_bench_2 | Terminal-Bench v2: benchmark for testing AI agents in real terminal environments — from compiling c… | 89 |
| terminal-bench/terminal-bench-2-1 | terminal_bench_2_1 | Terminal-Bench v2.1 (point release of v2): benchmark for testing AI agents in real terminal environ… | 89 |
| theagentcompany/theagentcompany | theagentcompany | An agent benchmark with tasks in a simulated software company across GitLab, Plane, OwnCloud, and R… | 174 |
| thetalab/vector-edit-gym | thetalab_vector_edit_gym | 106 hand-authored SVG editing tasks across four difficulty tiers (easy / medium / hard / very_hard)… | 106 |
| usaco/usaco | usaco | USACO: USA Computing Olympiad problems across bronze/silver/gold/platinum tiers with high-quality u… | 304 |
| vals/financeagent | vals_financeagent | Vals AI Finance Agent Benchmark: expert-validated finance questions across nine task categories (re… | 50 |
| vmax/vmax-tasks | vmax_tasks | Code-transformation tasks across JavaScript projects (Docusaurus, Vue, Redux). | 1000 |
| webgen-bench/webgen-bench | webgen_bench | WebGen-Bench: evaluating LLMs on generating interactive and functional websites from scratch. | 101 |
| xlang/ds-1000 | xlang_ds_1000 | DS-1000: data-science code-generation problems from StackOverflow across NumPy, Pandas, TensorFlow,… | 1000 |
| yanagiorigami/frontier-cs | yanagiorigami_frontier_cs | Frontier-CS competitive programming benchmark: 172 open-ended algorithmic problems with partial sco… | 172 |
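
The Inspect Task column appears to follow a regular pattern relative to the Harbor Dataset column: the ID is lowercased, hyphens become underscores, and the organization prefix is dropped when the dataset name already starts with it (for example, `aider/aider-polyglot` becomes `aider_polyglot`, while `swe-bench/swe-smith` becomes `swe_bench_swe_smith`). The sketch below encodes that pattern as inferred from the rows above. The helper `inspect_task_name` is hypothetical and not part of `inspect_harbor`; the table remains the authoritative source for task names.

```python
def inspect_task_name(harbor_dataset: str) -> str:
    """Guess the Inspect task name for a Harbor dataset ID of the form 'org/name'.

    Hypothetical helper: the rule is inferred from the registry table,
    not taken from the inspect_harbor source code.
    """
    org, name = harbor_dataset.lower().split("/", 1)
    # If the dataset name equals the org or starts with "<org>-",
    # the table lists the task under the dataset name alone.
    if name == org or name.startswith(org + "-"):
        combined = name
    else:
        combined = f"{org}_{name}"
    # Python identifiers cannot contain hyphens, so they become underscores.
    return combined.replace("-", "_")


if __name__ == "__main__":
    for dataset in (
        "aider/aider-polyglot",
        "swe-bench/swe-smith",
        "LiteCoder/LiteCoder-rl",
        "gorilla/bfcl_parity",
    ):
        print(f"{dataset} -> {inspect_task_name(dataset)}")
```

The resulting name can then be used in place of `aider_polyglot` in the Usage commands, for instance as `inspect_harbor/<task_name>` on the CLI, provided it matches an entry in the table.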