harbor/rewardhackbench

Safeguards

RewardHackBench: judge benchmark for detecting reward hacking in agent trajectories — each trace is labelled for harness-level cheating, task-level reward hacking, and refusals.

← Back to Registry

Run this task

CLI:

inspect eval inspect_harbor/harbor_rewardhackbench --model openai/gpt-5

Python:

from inspect_ai import eval
from inspect_harbor import harbor_rewardhackbench

eval(harbor_rewardhackbench(), model="openai/gpt-5")

Dataset information

Harbor registry harbor/rewardhackbench
Inspect task harbor_rewardhackbench
Latest digest sha256:0dd16e1029495cba180809b7ecfbae375089881b11ff11e369bfbbf3c72a2fd8
Samples 846

See Task Parameters for the parameter set shared across all Harbor tasks.