Daytona

Cloud development environment sandbox for Inspect AI using Daytona.

Setup

Create a Daytona account and set your API key:

export DAYTONA_API_KEY=your_api_key

Usage

Default snapshot

NoteWhat’s a snapshot?

A Daytona snapshot is a sandbox template built from a Docker/OCI image, bundled with resource allocation, entrypoint, and lifecycle metadata. Functionally similar to a container image, but Daytona-managed so sandboxes can be started from it quickly. See Daytona’s snapshots docs.

A minimal eval that uses Daytona’s default snapshot (daytonaio/sandbox), which comes with Python, Node.js, their language servers, and common packages (pandas, torch, anthropic, langchain, typescript, bun, etc.).

from inspect_ai import Task, eval
from inspect_ai.solver import generate, system_message

task = Task(
    dataset=[{"input": "What is 2+2?", "target": "4"}],
    solver=[
        system_message("You are a helpful assistant."),
        generate(),
    ],
    sandbox="daytona",  # Uses Daytona's default snapshot
)

eval(task)
Note

The default snapshot is only used when no config is specified AND no Dockerfile / compose.yaml / compose.yml / docker-compose.yaml / docker-compose.yml is present in the task’s source directory — if one is found there, it is auto-detected and used instead.

Dockerfile

Build the sandbox image from a local Dockerfile.

task = Task(
    dataset=[...],
    solver=[...],
    sandbox=("daytona", "path/to/Dockerfile"),
)

Docker Compose

Configure the sandbox via a Docker Compose file. A file with a single service gets a single Daytona sandbox; a file with two or more services automatically switches to Docker-in-Docker (DinD). Daytona-specific settings go under the top-level x-daytona key — see Configuration.

Single-service

# compose.yaml
services:
  default:
    image: python:3.12
    environment:
      - MY_VAR=hello
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 4g
        reservations:
          devices:
            - capabilities: [gpu]
              count: 1
task = Task(
    dataset=[...],
    solver=[...],
    sandbox=("daytona", "path/to/compose.yaml"),
)

Multi-service (DinD)

When a compose file defines more than one service, the provider automatically uses Docker-in-Docker: a single Daytona sandbox runs a Docker daemon, and services are brought up via docker compose inside it. Each service is exposed as a separate SandboxEnvironment.

# compose.yaml
services:
  default:
    image: python:3.12
    x-default: true
  helper:
    image: redis:7
task = Task(
    dataset=[...],
    solver=[...],
    sandbox=("daytona", "path/to/compose.yaml"),
)

# In a solver, access services by name:
default_env = sandbox()          # the x-default service
helper_env = sandbox("helper")

Default service selection (priority): x-default: true > service named "default" or "main" > first service in the file.

Resources: Per-service resources are summed across all services plus 1 CPU and 1 GiB overhead for the Docker daemon. Ensure the total fits within your Daytona per-sandbox limits.

DinD image: The DinD sandbox uses docker:28.3.3-dind as the base image.

DinD snapshots: The provider auto-creates a Daytona snapshot for the DinD base image. You can also provide a pre-created snapshot via x-daytona.snapshot.

Image caching: Each DinD sandbox gets a fresh Docker daemon with no image cache. docker compose build rebuilds from scratch every sample. For faster startup, pre-build your images and push them to a registry, then use image: instead of build: in your compose file. Daytona does not support snapshotting a running sandbox or shared volumes across sandboxes, so registry-based caching is the recommended approach.

Configuration

The top-level x-daytona key on a compose file passes settings to the Daytona sandbox.

Setting Type Default Description
auto_stop_interval int (minutes) 0 (disabled) Minutes of inactivity before the sandbox auto-stops. Daytona’s own default is 15 minutes; inspect_sandboxes disables auto-stop.
auto_archive_interval int (minutes) 10080 (7 days) Minutes before a stopped sandbox auto-archives.
auto_delete_interval int (minutes) Never Minutes before a stopped sandbox auto-deletes. Unset = sandboxes are never auto-deleted.
network_block_all bool False Block all network access. Overrides compose network_mode.
network_allow_list str None Comma-separated CIDR allowlist.
language str None Hint for language-aware features (e.g. "python", "typescript").
os_user str daytona OS user for commands. Overrides the service-level user field.
public bool False Whether the sandbox is publicly accessible.
ephemeral bool False If True, the sandbox is auto-deleted when stopped.
timeout float (seconds) 60 Seconds to wait for the sandbox creation API call to complete. For DinD, this covers only the VM provisioning — dockerd boot, image pulls, and docker compose build/up have their own internal timeouts and are not affected.
snapshot str None Pre-created Daytona snapshot name. Skips image build for single-service; used as the DinD VM snapshot for multi-service.
resources dict per-service aggregation Sandbox-level resource overrides (cpu, memory, gpu). For DinD, overrides the per-service sum.
env_vars dict {} Extra env vars. Single-service: merged with service environment (x-daytona wins). DinD: set on the VM, not on compose services.
labels dict {} Custom labels, merged with inspect_sandboxes’ own labels (which take precedence).

Example:

x-daytona:
  auto_stop_interval: 10
  resources:
    cpu: 4
    memory: 8
    gpu: 1
  env_vars:
    EXTRA_VAR: "value"

Unsupported Daytona parameters: volumes.

Finding sandboxes

Every sandbox created by inspect_sandboxes is named and labeled so you can locate it in the Daytona dashboard for debugging, audit, or manual cleanup.

Sandbox names follow inspect-{task_id}-{sample_id}-{hex} (e.g. inspect-my_eval-42-a1b2c3d4). The 8-character hex suffix guarantees uniqueness across re-runs. If task_id or sample_id is unavailable, that segment is dropped; if both are unavailable, the name is just inspect-{hex}. The sample_id segment requires inspect-ai >= 0.3.211 (PR #3619); on older versions it’s silently omitted.

Labels applied to every sandbox:

  • created_by: inspect-ai — identifies sandboxes created by this package.
  • inspect_run_id: <hex> — a per-task-run identifier; all sandboxes for the same eval run share this value.

User labels from x-daytona.labels are merged in; the two above always take precedence.

DinD snapshot names are deterministic, derived from aggregated per-service resources: inspect-dind-<cpu>cpu-<mem>gb-<gpu>gpu (e.g. inspect-dind-2cpu-4gb-0gpu). Samples with matching resource profiles reuse the same snapshot.

Bulk cleanup via the Inspect CLI — finds and deletes every sandbox tagged created_by: inspect-ai:

inspect sandbox cleanup daytona                 # delete all
inspect sandbox cleanup daytona <sandbox-id>    # delete one

Notes

  • Default user: The default sandbox user is daytona (not root), with passwordless sudo.
  • user parameter: Supported via sudo -u in single-service mode and docker compose exec --user in DinD mode. Numeric UIDs are supported. Requires sudo with passwordless access (configured by default).
  • stdin (input): Supported via input redirection from a temp file. POSIX-compatible.
  • Network: Outbound internet depends on your Daytona subscription tier. Tiers 1-2 restrict to essential services (Docker Hub, npm, PyPI, GitHub, AI providers); tiers 3-4 have full internet. Compose network_mode is translated to Daytona’s network_block_all; x-daytona.network_block_all takes precedence. For DinD, the VM always has network enabled (needed for docker pull); service-level isolation is via compose network_mode.
  • DinD startup latency: Docker daemon boot + image pulls + compose up can take 30s+. inspect_sandboxes auto-creates a Daytona snapshot of the docker:28.3.3-dind base so subsequent samples skip the VM bring-up cost (see DinD snapshots above). Building the DinD image on-demand per sample (the fallback path when a snapshot isn’t available) is prohibitively slow for eval workloads — not recommended.

Limitations

  • Architecture: Daytona runners use linux/amd64. arm64-only images are not supported.
  • stderr: The Daytona API returns combined stdout+stderr; stderr is always empty.