aarr/aarri-bench

Science
Assistants

aarri-bench (Act as a Real Research Intern) is a benchmark for evaluating LLM agents on academic research tasks.

Run this task

CLI:

inspect eval inspect_harbor/aarr_aarri_bench --model openai/gpt-5

Python:

from inspect_ai import eval
from inspect_harbor import aarr_aarri_bench

eval(aarr_aarri_bench(), model="openai/gpt-5")

Dataset information

Harbor registry: aarr/aarri-bench
Inspect task: aarr_aarri_bench
Latest digest: sha256:4cdb4aea9a63c16202e4b94b6f800ff2b3e0ba9969d70ca63c4289e185f20438
Samples: 82
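
If you record the digest above in your own scripts or notes to track which dataset version you evaluated, it may help to check that the string was copied intact. The following is a minimal sketch using only the Python standard library; `parse_digest` is a hypothetical helper written for illustration and is not part of `inspect_ai` or `inspect_harbor`. It assumes the digest follows the `algorithm:hex` form shown above.

```python
import re

# Digest copied from the "Latest digest" entry above.
LATEST_DIGEST = "sha256:4cdb4aea9a63c16202e4b94b6f800ff2b3e0ba9969d70ca63c4289e185f20438"


def parse_digest(digest: str) -> tuple[str, str]:
    """Split an 'algorithm:hex' digest and validate it (hypothetical helper)."""
    algorithm, sep, hex_value = digest.partition(":")
    if not sep:
        raise ValueError(f"missing ':' separator in digest: {digest!r}")
    if algorithm != "sha256":
        raise ValueError(f"unexpected digest algorithm: {algorithm!r}")
    # A SHA-256 digest is 32 bytes, i.e. 64 lowercase hexadecimal characters.
    if not re.fullmatch(r"[0-9a-f]{64}", hex_value):
        raise ValueError("digest is not 64 lowercase hex characters")
    return algorithm, hex_value


algorithm, hex_value = parse_digest(LATEST_DIGEST)
print(algorithm, len(hex_value))
```

A truncated or mistyped digest raises `ValueError` instead of being silently recorded.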

See Task Parameters for the parameter set shared across all Harbor tasks.