sierra-research/tau3-bench
Assistants
Professional
Behavior
Third generation of τ-bench, extending the original with knowledge and voice. A simulation framework for evaluating customer service agents across airline, retail, telecom, and banking knowledge domains.
Run this task
CLI:
inspect eval inspect_harbor/sierra_research_tau3_bench --model openai/gpt-5
Python:
from inspect_ai import eval
from inspect_harbor import sierra_research_tau3_bench
eval(sierra_research_tau3_bench(), model="openai/gpt-5")

Dataset information
| Harbor registry | sierra-research/tau3-bench |
| Inspect task | sierra_research_tau3_bench |
| Latest digest | sha256:a57304f682894ac061090769af771a3617664f3ff6e5417d4eadf8e30433e4d9 |
| Samples | 375 |
| Paper | arxiv |
| Source | https://github.com/sierra-research/tau2-bench |
See Task Parameters for the parameter set shared across all Harbor tasks.
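
With 375 samples, a full run can be long, so it may help to try a small subset first. The sketch below is not part of the Harbor documentation: it assembles the CLI invocation shown under "Run this task" as an argument list and optionally appends Inspect's `--limit` option to cap the number of samples. The helper name `build_eval_command` and the `TASK` constant are hypothetical, introduced only for illustration.

```python
import shlex
from typing import List, Optional

# Task path as it appears in the CLI example under "Run this task".
TASK = "inspect_harbor/sierra_research_tau3_bench"


def build_eval_command(model: str, limit: Optional[int] = None) -> List[str]:
    """Return the argv list for an `inspect eval` run of this task.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    argv = ["inspect", "eval", TASK, "--model", model]
    if limit is not None:
        if limit < 1:
            raise ValueError("limit must be a positive integer")
        # `--limit N` asks Inspect to evaluate only the first N samples.
        argv += ["--limit", str(limit)]
    return argv


if __name__ == "__main__":
    # Print a shell-ready command for a 10-sample smoke test.
    print(shlex.join(build_eval_command("openai/gpt-5", limit=10)))
```

The resulting list can be passed to `subprocess.run` as-is. Building the command as a list rather than a string avoids shell-quoting problems if a model name ever contains unusual characters.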