aarr/aarri-bench
Science
Assistants
aarri-bench (act as a real research intern) is a benchmark for evaluating LLM agents on academic research tasks.
Run this task
CLI:
inspect eval inspect_harbor/aarr_aarri_bench --model openai/gpt-5

Python:
from inspect_ai import eval
from inspect_harbor import aarr_aarri_bench

eval(aarr_aarri_bench(), model="openai/gpt-5")

Dataset information
| Harbor registry | aarr/aarri-bench |
| Inspect task | aarr_aarri_bench |
| Latest digest | sha256:4cdb4aea9a63c16202e4b94b6f800ff2b3e0ba9969d70ca63c4289e185f20438 |
| Samples | 82 |
See Task Parameters for the parameter set shared across all Harbor tasks.
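
With 82 samples, a full run can take a while, so a quick smoke test on a handful of samples may be useful first. The sketch below assembles the CLI invocation shown above and appends Inspect's general `--limit` option. It assumes that option behaves for this task as it does for other Inspect tasks; `build_inspect_command` is a hypothetical helper written for illustration, not part of `inspect_harbor`.

```python
import shlex
from typing import List, Optional

# Task name and sample count are taken from this page.
TASK = "inspect_harbor/aarr_aarri_bench"
TOTAL_SAMPLES = 82


def build_inspect_command(model: str, limit: Optional[int] = None) -> List[str]:
    """Hypothetical helper: build an `inspect eval` argument list.

    `limit` maps to Inspect's `--limit` option (assumed to apply here),
    which restricts the run to the first N samples.
    """
    cmd = ["inspect", "eval", TASK, "--model", model]
    if limit is not None:
        if not 1 <= limit <= TOTAL_SAMPLES:
            raise ValueError(f"limit must be between 1 and {TOTAL_SAMPLES}")
        cmd += ["--limit", str(limit)]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable command for a 5-sample smoke test.
    print(shlex.join(build_inspect_command("openai/gpt-5", limit=5)))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run` once `inspect_harbor` is installed and model credentials are configured.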