Task Format
What is a Harbor Task?
Harbor is a framework for building, evaluating, and optimizing agents and models in containerized environments. A Harbor task is a self-contained evaluation unit that includes an instruction, execution environment, scoring criteria, and optionally a reference solution.
For comprehensive details about Harbor tasks, see the Harbor documentation.
Harbor Task File Structure
A typical Harbor task directory contains the following components:
my_task/
├── instruction.md # Task instructions/prompt shown to the agent
├── task.toml # Metadata, timeouts, resource specs (CPU/memory/GPU), env vars
├── environment/ # Environment setup - Dockerfile or docker-compose.yaml
│ └── Dockerfile # Docker environment spec (varies by sandbox provider)
├── solution/ # (Optional) Reference solution for sanity checking
│ ├── solve.sh # Executable solution script used by Oracle solver
│ └── ... # Supporting solution files and dependencies
└── tests/ # Verification and scoring
├── test.sh # Test script executed by verifier
└── ... # Outputs reward.txt or reward.json to /logs/verifier/
Harbor to Inspect Mapping
Inspect Harbor bridges Harbor tasks to the Inspect AI evaluation framework using the following mappings:
| Harbor Concept | Inspect Concept | Description |
|---|---|---|
| Harbor Task | Sample |
A single evaluation instance with instructions and environment |
| Harbor Dataset | Task |
A collection of related evaluation instances |
| instruction.md | Sample.input |
The prompt/instructions given to the agent |
| environment/ | SandboxEnvironmentSpec |
Docker/environment configuration for isolated execution |
| tests/test.sh | Scorer (harbor_scorer) |
Test script executed by the scorer to produce reward/metrics |
| solution/solve.sh | Solver (oracle) |
Reference solution script executed by the Oracle solver for sanity checking |
| task.toml[metadata] | Sample.metadata |
Task metadata: author, difficulty, category, tags |
| task.toml[verifier] | Scorer timeout/env vars | Timeout and environment configuration for scorer execution |
| task.toml[agent] | Agent solver env vars | Environment variables for agent execution. Agent timeout_sec is ignored. |
| task.toml[solution] | Oracle solver env vars | Environment variables to set when running the solution script |
| task.toml[environment] | SandboxEnvironmentSpec.config |
Resource specifications (CPU, memory, storage, GPU, internet). Overwrites resource limits in environment/docker-compose.yaml |
LLM Judges in Verification
Some Harbor tasks use LLM judges for verification (e.g. evaluating open-ended responses or code quality). These tasks specify the model in their task.toml:
[verifier.env]
MODEL_NAME = "claude-haiku-4-5"
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"The verifier script (tests/test.sh) uses these environment variables to call the LLM. Make sure to set the appropriate API key (e.g. ANTHROPIC_API_KEY) when running tasks with LLM judges.