Task Format

What is a Harbor Task?

Harbor is a framework for building, evaluating, and optimizing agents and models in containerized environments. A Harbor task is a self-contained evaluation unit that includes an instruction, execution environment, scoring criteria, and optionally a reference solution.

For comprehensive details about Harbor tasks, see the Harbor documentation.

Harbor Task File Structure

A typical Harbor task directory contains the following components:

my_task/
├── instruction.md             # Task instructions/prompt shown to the agent
├── task.toml                  # Metadata, timeouts, resource specs, env vars
├── environment/               # Environment setup - Dockerfile or docker-compose.yaml
│   └── Dockerfile             # Docker environment spec (varies by sandbox provider)
├── solution/                  # (Optional) Reference solution for sanity checking
│   ├── solve.sh / solve.bat   # Executable solution script used by Oracle solver
│   └── ...                    # Supporting solution files and dependencies
└── tests/                     # Verification and scoring
    ├── test.sh / test.bat     # Test script executed by verifier; writes reward.txt or reward.json to /logs/verifier/
    └── ...                    # Supporting test files

The .bat script variants are used when the task targets Windows (controlled by [environment].os in task.toml); Linux tasks use the .sh variants.
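
The layout above can be checked before a task is loaded. The following is a minimal sketch, not part of Harbor or Inspect Harbor: `missing_task_files` is a hypothetical helper, and treating every entry that the tree does not mark "(Optional)" as required is an assumption.

```python
import tempfile
from pathlib import Path


def missing_task_files(task_dir: Path, target_os: str = "linux") -> list[str]:
    """Return expected paths (relative to task_dir) that do not exist.

    Hypothetical helper for illustration only.
    """
    # Windows tasks use .bat scripts, Linux tasks use .sh (per the note above).
    ext = "bat" if target_os == "windows" else "sh"
    # Assumption: entries not marked "(Optional)" in the tree are required.
    required = ["instruction.md", "task.toml", "environment", f"tests/test.{ext}"]
    return [rel for rel in required if not (task_dir / rel).exists()]


# Build a throwaway task directory that lacks the tests/ folder.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "my_task"
    (task / "environment").mkdir(parents=True)
    (task / "instruction.md").write_text("Fix the failing test.\n")
    (task / "task.toml").write_text("")
    demo_missing = missing_task_files(task)

print(demo_missing)  # ['tests/test.sh']
```

The optional solution/ directory is deliberately left out of the check.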

Harbor to Inspect Mapping

Inspect Harbor bridges Harbor tasks to the Inspect AI evaluation framework using the following mappings:

| Harbor Concept | Inspect Concept | Description |
|---|---|---|
| Harbor Task | Sample | A single evaluation instance with instructions and environment |
| Harbor Dataset | Task | A collection of related evaluation instances |
| instruction.md | Sample.input | The prompt/instructions given to the agent |
| environment/ | SandboxEnvironmentSpec | Docker/environment configuration for isolated execution |
| tests/test.sh / test.bat | Scorer (harbor_scorer) | Test script executed by the scorer to produce reward/metrics |
| solution/solve.sh / solve.bat | Solver (oracle) | Reference solution script executed by the Oracle solver for sanity checking |
| task.toml[task] | Sample.metadata | Task name, description, authors, keywords |
| task.toml[metadata] | Sample.metadata | Arbitrary custom fields (difficulty, category, tags) |
| task.toml[verifier] | Scorer timeout/env vars | Timeout, env vars, and user for verifier execution |
| task.toml[agent] | Agent solver env vars/user | Environment variables and user for agent execution. Agent timeout_sec is ignored. |
| task.toml[solution] | Oracle solver env vars | Environment variables to set when running the solution script |
| task.toml[environment] | SandboxEnvironmentSpec.config | Docker image, target OS (linux/windows), resource specs (CPU, memory, storage, GPU, internet), MCP servers, healthchecks. Overwrites resource limits in environment/docker-compose.yaml. |

Unsupported task.toml Features

Loading a task that declares any of the following fields either raises an error or emits a warning:

| Field | Behavior |
|---|---|
| [[steps]] + multi_step_reward_strategy (multi-step tasks) | raise |
| [environment].os = "windows" | raise |
| [environment].healthcheck | warn (agent may run before services ready) |
| [environment].mcp_servers | warn (not exposed to agent) |
| [environment].skills_dir | warn (not copied to agent skills dir) |

[environment].build_timeout_sec, storage_mb, workdir, and config.source / config.artifacts are ignored.

LLM Judges in Verification

Some Harbor tasks use LLM judges for verification (e.g. evaluating open-ended responses or code quality). These tasks specify the model in their task.toml:

[verifier.env]
MODEL_NAME = "claude-haiku-4-5"
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"

The verifier script (tests/test.sh or test.bat) uses these environment variables to call the LLM. Make sure to set the appropriate API key (e.g. ANTHROPIC_API_KEY) when running tasks with LLM judges.