Task Format

What is a Harbor Task?

Harbor is a framework for building, evaluating, and optimizing agents and models in containerized environments. A Harbor task is a self-contained evaluation unit that includes an instruction, execution environment, scoring criteria, and optionally a reference solution.

For comprehensive details about Harbor tasks, see the Harbor documentation.

Harbor Task File Structure

A typical Harbor task directory contains the following components:

my_task/
├── instruction.md             # Task instructions/prompt shown to the agent
├── task.toml                  # Metadata, timeouts, resource specs, env vars
├── environment/               # Environment setup - Dockerfile or docker-compose.yaml
│   └── Dockerfile             # Docker environment spec (varies by sandbox provider)
├── solution/                  # (Optional) Reference solution for sanity checking
│   ├── solve.sh / solve.bat   # Executable solution script used by Oracle solver
│   └── ...                    # Supporting solution files and dependencies
└── tests/                     # Verification and scoring
    ├── test.sh / test.bat     # Test script executed by verifier
    └── ...                    # Outputs reward.txt or reward.json to /logs/verifier/

.bat script variants are used when the task targets Windows (controlled by [environment].os); Linux tasks use .sh.
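
As a quick sanity check of the layout above, the sketch below lists which expected entries are missing from a task directory. The helper name `check_task_layout` is hypothetical and not part of Harbor or Inspect Harbor, and treating every entry not marked optional in the tree as required is an assumption.

```python
from pathlib import Path


def check_task_layout(task_dir: Path, target_os: str = "linux") -> list[str]:
    """Return the expected entries that are missing from a task directory.

    Hypothetical helper: assumes every entry not marked "(Optional)" in the
    tree above is required, and that the script extension follows the OS.
    """
    ext = ".bat" if target_os == "windows" else ".sh"
    required = [
        "instruction.md",
        "task.toml",
        "environment",
        f"tests/test{ext}",
    ]
    # Keep the order of `required` so the report is stable.
    return [rel for rel in required if not (task_dir / rel).exists()]
```

Calling `check_task_layout(Path("my_task"))` returns an empty list when all of these entries are present. The optional `solution/` directory is deliberately not checked.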

Harbor to Inspect Mapping

Inspect Harbor bridges Harbor tasks to the Inspect AI evaluation framework using the following mappings:

Harbor Concept	Inspect Concept	Description
Harbor Task	`Sample`	A single evaluation instance with instructions and environment
Harbor Dataset	`Task`	A collection of related evaluation instances
instruction.md	`Sample.input`	The prompt/instructions given to the agent
environment/	`SandboxEnvironmentSpec`	Docker/environment configuration for isolated execution
tests/test.sh / test.bat	`Scorer` (`harbor_scorer`)	Test script executed by the scorer to produce reward/metrics
solution/solve.sh / solve.bat	`Solver` (`oracle`)	Reference solution script executed by the Oracle solver for sanity checking
task.toml[task]	`Sample.metadata`	Task name, description, authors, keywords
task.toml[metadata]	`Sample.metadata`	Arbitrary custom fields (difficulty, category, tags)
task.toml[verifier]	Scorer timeout/env vars	Timeout, env vars, and user for verifier execution
task.toml[agent]	Agent solver env vars/user	Environment variables and user for agent execution. Agent `timeout_sec` is ignored.
task.toml[solution]	Oracle solver env vars	Environment variables to set when running the solution script
task.toml[environment]	`SandboxEnvironmentSpec.config`	Docker image, target OS (`linux`/`windows`), resource specs (CPU, memory, storage, GPU, internet), MCP servers, healthchecks. These settings override the resource limits in `environment/docker-compose.yaml`.

Unsupported `task.toml` Features

Loading a task that declares any of the fields below either raises an error or emits a warning:

Field	Behavior
`[[steps]]` + `multi_step_reward_strategy` (multi-step tasks)	raise
`[environment].os = "windows"`	raise
`[environment].healthcheck`	warn (agent may run before services ready)
`[environment].mcp_servers`	warn (not exposed to agent)
`[environment].skills_dir`	warn (not copied to agent skills dir)

[environment].build_timeout_sec, storage_mb, workdir, and config.source / config.artifacts are ignored.

LLM Judges in Verification

Some Harbor tasks use LLM judges for verification (e.g. evaluating open-ended responses or code quality). These tasks specify the model in their task.toml:

[verifier.env]
MODEL_NAME = "claude-haiku-4-5"
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"

The verifier script (tests/test.sh or test.bat) uses these environment variables to call the LLM. Make sure to set the appropriate API key (e.g. ANTHROPIC_API_KEY) when running tasks with LLM judges.
