harbor/rewardhackbench
Safeguards
RewardHackBench: a judge benchmark for detecting reward hacking in agent trajectories. Each trace is labelled for harness-level cheating, task-level reward hacking, and refusals.
Run this task
CLI:
inspect eval inspect_harbor/harbor_rewardhackbench --model openai/gpt-5
Python:
from inspect_ai import eval
from inspect_harbor import harbor_rewardhackbench
eval(harbor_rewardhackbench(), model="openai/gpt-5")
Dataset information
| Harbor registry | harbor/rewardhackbench |
| Inspect task | harbor_rewardhackbench |
| Latest digest | sha256:0dd16e1029495cba180809b7ecfbae375089881b11ff11e369bfbbf3c72a2fd8 |
| Samples | 846 |
See Task Parameters for the parameter set shared across all Harbor tasks.
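If the digest listed under Dataset information is copied into scripts or configuration, it can help to check that it survived the copy intact. The following is a minimal sketch that uses only the Python standard library. It assumes the digest follows the common `algorithm:hex` convention, with 64 lowercase hex characters for `sha256`. The helper name `parse_digest` is hypothetical and not part of Inspect or Harbor.

```python
import re

# Digest as listed under "Dataset information".
DIGEST = "sha256:0dd16e1029495cba180809b7ecfbae375089881b11ff11e369bfbbf3c72a2fd8"


def parse_digest(digest: str) -> tuple[str, str]:
    """Split an 'algorithm:hex' digest and check its shape.

    Hypothetical helper for illustration only. It validates the string
    format and does not verify any dataset contents.
    """
    algorithm, sep, hex_value = digest.partition(":")
    if not sep or not algorithm or not hex_value:
        raise ValueError(f"expected 'algorithm:hex', got {digest!r}")
    # Assumption: sha256 digests are written as 64 lowercase hex characters.
    if algorithm == "sha256" and not re.fullmatch(r"[0-9a-f]{64}", hex_value):
        raise ValueError("sha256 digest must be 64 lowercase hex characters")
    return algorithm, hex_value


algorithm, hex_value = parse_digest(DIGEST)
print(algorithm, len(hex_value))
```

A truncated or mistyped digest raises `ValueError`, which catches copy-paste errors before the value is used anywhere else.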