stanford/medagentbench
Professional
Medicine
Assistants
MedAgentBench: clinically relevant tasks across 10 categories in a FHIR-compliant virtual EHR, benchmarking LLM agents on medical decision-making, planning, and execution.
Run this task
CLI:
inspect eval inspect_harbor/stanford_medagentbench --model openai/gpt-5

Python:
from inspect_ai import eval
from inspect_harbor import stanford_medagentbench
eval(stanford_medagentbench(), model="openai/gpt-5")

Dataset information
| Harbor registry | stanford/medagentbench |
| Inspect task | stanford_medagentbench |
| Latest digest | sha256:c52d82b3462fb26417707682095e43f224a31f8f785eb7da615c1ab6adc20bf0 |
| Samples | 300 |
| Paper | arxiv |
| Source | https://github.com/stanfordmlgroup/MedAgentBench |
See Task Parameters for the parameter set shared across all Harbor tasks.
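
Since the dataset holds 300 samples, it can be convenient to try a short run before a full one. The sketch below only assembles the `inspect eval` command line shown under "Run this task" and optionally appends Inspect's `--limit` option to cap the number of samples. The helper name `build_inspect_command` and the limit of 10 are illustrative assumptions, not part of Harbor or Inspect, and the script does not launch the evaluation itself.

```python
import shlex
from typing import List, Optional

# Task path copied from the CLI example on this page.
TASK = "inspect_harbor/stanford_medagentbench"


def build_inspect_command(model: str, limit: Optional[int] = None) -> List[str]:
    """Hypothetical helper: build argv for `inspect eval`.

    `limit`, when given, is passed through Inspect's `--limit` option
    to restrict how many samples are evaluated.
    """
    argv = ["inspect", "eval", TASK, "--model", model]
    if limit is not None:
        if limit <= 0:
            raise ValueError("limit must be a positive integer")
        argv += ["--limit", str(limit)]
    return argv


if __name__ == "__main__":
    # Print a shell-safe command string; run it manually if it looks right.
    print(shlex.join(build_inspect_command("openai/gpt-5", limit=10)))
```

Returning a list rather than a string keeps the arguments ready for `subprocess.run` without shell quoting concerns, should you choose to execute the command from Python.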