Daytona
Cloud development environment sandbox for Inspect AI using Daytona.
Setup
Create a Daytona account and set your API key:
export DAYTONA_API_KEY=your_api_key
Usage
Default snapshot
A Daytona snapshot is a sandbox template built from a Docker/OCI image, bundled with resource allocation, entrypoint, and lifecycle metadata. It is functionally similar to a container image, but Daytona manages it so that sandboxes can be started from it quickly. See Daytona’s snapshots docs.
The following minimal eval uses Daytona’s default snapshot (daytonaio/sandbox), which comes with Python, Node.js, their language servers, and common packages (pandas, torch, anthropic, langchain, typescript, bun, etc.).
from inspect_ai import Task, eval
from inspect_ai.solver import generate, system_message
task = Task(
    dataset=[{"input": "What is 2+2?", "target": "4"}],
    solver=[
        system_message("You are a helpful assistant."),
        generate(),
    ],
    sandbox="daytona",  # Uses Daytona's default snapshot
)
eval(task)
The default snapshot is used only when no config is specified and no Dockerfile / compose.yaml / compose.yml / docker-compose.yaml / docker-compose.yml is present in the task’s source directory. If one is found there, it is auto-detected and used instead.
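The fallback rule above can be pictured as a small lookup. This is an illustrative sketch only: find_sandbox_config is a hypothetical helper, not part of inspect_sandboxes, and the order in which the candidate file names are checked is an assumption (the text lists the names but not their priority).

```python
from pathlib import Path
from typing import Optional

# File names the text says are auto-detected.
# Assumption: the order below is illustrative; the real priority is not stated.
CANDIDATES = (
    "compose.yaml",
    "compose.yml",
    "docker-compose.yaml",
    "docker-compose.yml",
    "Dockerfile",
)


def find_sandbox_config(task_dir: str, explicit_config: Optional[str] = None) -> Optional[str]:
    """Return the config path to use, or None to fall back to the default snapshot."""
    if explicit_config is not None:
        # An explicitly specified config always wins.
        return explicit_config
    for name in CANDIDATES:
        candidate = Path(task_dir) / name
        if candidate.is_file():
            return str(candidate)
    # Nothing specified and nothing found: the default snapshot applies.
    return None
```

A return value of None corresponds to the case where sandbox="daytona" starts from the default snapshot.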
Dockerfile
Build the sandbox image from a local Dockerfile.
task = Task(
    dataset=[...],
    solver=[...],
    sandbox=("daytona", "path/to/Dockerfile"),
)
Docker Compose
Configure the sandbox via a Docker Compose file. A file with a single service gets a single Daytona sandbox; a file with two or more services automatically switches to Docker-in-Docker (DinD). Daytona-specific settings go under the top-level x-daytona key — see Configuration.
Single-service
# compose.yaml
services:
  default:
    image: python:3.12
    environment:
      - MY_VAR=hello
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 4g
        reservations:
          devices:
            - capabilities: [gpu]
              count: 1
task = Task(
    dataset=[...],
    solver=[...],
    sandbox=("daytona", "path/to/compose.yaml"),
)
Multi-service (DinD)
When a compose file defines more than one service, the provider automatically uses Docker-in-Docker: a single Daytona sandbox runs a Docker daemon, and services are brought up via docker compose inside it. Each service is exposed as a separate SandboxEnvironment.
# compose.yaml
services:
  default:
    image: python:3.12
    x-default: true
  helper:
    image: redis:7
task = Task(
    dataset=[...],
    solver=[...],
    sandbox=("daytona", "path/to/compose.yaml"),
)
# In a solver, access services by name:
from inspect_ai.util import sandbox
default_env = sandbox()  # the x-default service
helper_env = sandbox("helper")
Default service selection (priority): x-default: true > service named "default" or "main" > first service in the file.
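The selection priority can be sketched as a plain function over the parsed services mapping. This is a hypothetical illustration, not the provider's code: pick_default_service is an invented name, and checking "default" before "main" when both exist is an assumption the text does not settle.

```python
from typing import Any, Dict


def pick_default_service(services: Dict[str, Dict[str, Any]]) -> str:
    """Pick the default service name following the stated priority."""
    if not services:
        raise ValueError("compose file defines no services")
    # 1. A service explicitly marked with x-default: true.
    for name, spec in services.items():
        if spec.get("x-default") is True:
            return name
    # 2. A service named "default" or "main" (order between them is assumed).
    for name in ("default", "main"):
        if name in services:
            return name
    # 3. The first service in file order (dicts preserve insertion order).
    return next(iter(services))
```

For the compose file above, the function returns "default" because that service carries x-default: true.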
Resources: Per-service resources are summed across all services plus 1 CPU and 1 GiB overhead for the Docker daemon. Ensure the total fits within your Daytona per-sandbox limits.
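As a worked example of the arithmetic, the sketch below sums hypothetical per-service limits and adds the stated daemon overhead. dind_total is an illustrative helper; how the provider treats services without explicit limits, or rounds fractional values, is not covered by the text and is left out.

```python
from typing import Iterable, Tuple

DOCKER_DAEMON_CPU = 1.0      # CPU overhead stated in the text
DOCKER_DAEMON_MEM_GIB = 1.0  # memory overhead (GiB) stated in the text


def dind_total(per_service: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Sum (cpus, memory_gib) pairs and add the Docker daemon overhead."""
    cpu = DOCKER_DAEMON_CPU
    mem = DOCKER_DAEMON_MEM_GIB
    for service_cpu, service_mem in per_service:
        cpu += service_cpu
        mem += service_mem
    return cpu, mem


# Two services: 2 CPUs / 4 GiB and 0.5 CPU / 1 GiB.
total_cpu, total_mem = dind_total([(2.0, 4.0), (0.5, 1.0)])
print(total_cpu, total_mem)  # 3.5 6.0
```

Comparing the result against your per-sandbox limits before launching an eval avoids a failed sandbox creation.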
DinD image: The DinD sandbox uses docker:28.3.3-dind as the base image.
DinD snapshots: The provider auto-creates a Daytona snapshot for the DinD base image. You can also provide a pre-created snapshot via x-daytona.snapshot.
Image caching: Each DinD sandbox gets a fresh Docker daemon with no image cache. docker compose build rebuilds from scratch every sample. For faster startup, pre-build your images and push them to a registry, then use image: instead of build: in your compose file. Daytona does not support snapshotting a running sandbox or shared volumes across sandboxes, so registry-based caching is the recommended approach.
Configuration
The top-level x-daytona key on a compose file passes settings to the Daytona sandbox.
| Setting | Type | Default | Description |
|---|---|---|---|
| auto_stop_interval | int (minutes) | 0 (disabled) | Minutes of inactivity before the sandbox auto-stops. Daytona’s own default is 15 minutes; inspect_sandboxes disables auto-stop. |
| auto_archive_interval | int (minutes) | 10080 (7 days) | Minutes before a stopped sandbox auto-archives. |
| auto_delete_interval | int (minutes) | Never | Minutes before a stopped sandbox auto-deletes. Unset = sandboxes are never auto-deleted. |
| network_block_all | bool | False | Block all network access. Overrides compose network_mode. |
| network_allow_list | str | None | Comma-separated CIDR allowlist. |
| language | str | None | Hint for language-aware features (e.g. "python", "typescript"). |
| os_user | str | daytona | OS user for commands. Overrides the service-level user field. |
| public | bool | False | Whether the sandbox is publicly accessible. |
| ephemeral | bool | False | If True, the sandbox is auto-deleted when stopped. |
| timeout | float (seconds) | 60 | Seconds to wait for the sandbox creation API call to complete. For DinD, this covers only the VM provisioning: dockerd boot, image pulls, and docker compose build/up have their own internal timeouts and are not affected. |
| snapshot | str | None | Pre-created Daytona snapshot name. Skips image build for single-service; used as the DinD VM snapshot for multi-service. |
| resources | dict | per-service aggregation | Sandbox-level resource overrides (cpu, memory, gpu). For DinD, overrides the per-service sum. |
| env_vars | dict | {} | Extra env vars. Single-service: merged with service environment (x-daytona wins). DinD: set on the VM, not on compose services. |
| labels | dict | {} | Custom labels, merged with inspect_sandboxes’ own labels (which take precedence). |
Example:
x-daytona:
  auto_stop_interval: 10
  resources:
    cpu: 4
    memory: 8
    gpu: 1
  env_vars:
    EXTRA_VAR: "value"
Unsupported Daytona parameters: volumes.
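The single-service merge rule for env_vars (service environment first, x-daytona values winning on conflict) can be sketched as follows. Both function names are hypothetical, and the sketch assumes the compose environment is either a KEY=VALUE list or a mapping with string values.

```python
from typing import Dict, List, Optional, Union

ComposeEnv = Union[Dict[str, str], List[str], None]


def compose_env_to_dict(environment: ComposeEnv) -> Dict[str, str]:
    """Normalize a compose `environment` value (mapping or KEY=VALUE list)."""
    if environment is None:
        return {}
    if isinstance(environment, dict):
        return {str(key): str(value) for key, value in environment.items()}
    result: Dict[str, str] = {}
    for item in environment:
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = item.partition("=")
        result[key] = value
    return result


def merge_env(service_environment: ComposeEnv,
              x_daytona_env_vars: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Merge the two sources; x-daytona entries override service entries."""
    merged = compose_env_to_dict(service_environment)
    merged.update(x_daytona_env_vars or {})
    return merged
```

In DinD mode this merge does not apply, since the text says x-daytona env_vars are set on the VM rather than on compose services.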
Finding sandboxes
Every sandbox created by inspect_sandboxes is named and labeled so you can locate it in the Daytona dashboard for debugging, audit, or manual cleanup.
Sandbox names follow inspect-{task_id}-{sample_id}-{hex} (e.g. inspect-my_eval-42-a1b2c3d4). The 8-character hex suffix guarantees uniqueness across re-runs. If task_id or sample_id is unavailable, that segment is dropped; if both are unavailable, the name is just inspect-{hex}. The sample_id segment requires inspect-ai >= 0.3.211 (PR #3619); on older versions it’s silently omitted.
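The naming scheme can be sketched as below. sandbox_name is a hypothetical helper, and generating the suffix with secrets.token_hex is an assumption; the text only says the suffix is 8 hex characters.

```python
import secrets
from typing import Optional


def sandbox_name(task_id: Optional[str] = None, sample_id: Optional[object] = None) -> str:
    """Build inspect-{task_id}-{sample_id}-{hex}, dropping unavailable segments."""
    parts = ["inspect"]
    if task_id:
        parts.append(str(task_id))
    if sample_id is not None:  # a sample_id of 0 is still a valid id
        parts.append(str(sample_id))
    # Assumption: a random suffix; 4 random bytes render as 8 hex characters.
    parts.append(secrets.token_hex(4))
    return "-".join(parts)
```

Calling sandbox_name("my_eval", 42) yields a name of the form inspect-my_eval-42-xxxxxxxx, and sandbox_name() yields inspect-xxxxxxxx.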
Labels applied to every sandbox:
- created_by: inspect-ai identifies sandboxes created by this package.
- inspect_run_id: <hex> is a per-task-run identifier; all sandboxes for the same eval run share this value.
User labels from x-daytona.labels are merged in; the two above always take precedence.
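The precedence rule amounts to a dictionary merge in which the built-in labels are applied last. A minimal sketch, with build_labels as a hypothetical name:

```python
from typing import Dict


def build_labels(user_labels: Dict[str, str], run_id: str) -> Dict[str, str]:
    """Merge user labels with the built-in ones; built-in labels win."""
    builtin = {"created_by": "inspect-ai", "inspect_run_id": run_id}
    # Later keys win in a dict merge, so built-in labels override user ones.
    return {**user_labels, **builtin}


labels = build_labels({"team": "evals", "created_by": "me"}, "abc123")
print(labels["created_by"])  # inspect-ai
```

A user-supplied created_by is therefore overwritten, which keeps bulk cleanup by label reliable.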
DinD snapshot names are deterministic, derived from aggregated per-service resources: inspect-dind-<cpu>cpu-<mem>gb-<gpu>gpu (e.g. inspect-dind-2cpu-4gb-0gpu). Samples with matching resource profiles reuse the same snapshot.
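The name pattern can be reproduced with a one-line formatter. dind_snapshot_name is a hypothetical helper, and the use of whole-number values is an assumption; the text does not say how fractional CPU or memory totals are rounded.

```python
def dind_snapshot_name(cpu: int, memory_gb: int, gpu: int = 0) -> str:
    """Format inspect-dind-<cpu>cpu-<mem>gb-<gpu>gpu from aggregated resources."""
    return f"inspect-dind-{cpu}cpu-{memory_gb}gb-{gpu}gpu"


print(dind_snapshot_name(2, 4))  # inspect-dind-2cpu-4gb-0gpu
```

Because the name depends only on the resource profile, two samples with the same totals map to the same snapshot name.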
Bulk cleanup via the Inspect CLI — finds and deletes every sandbox tagged created_by: inspect-ai:
inspect sandbox cleanup daytona # delete all
inspect sandbox cleanup daytona <sandbox-id> # delete one
Notes
- Default user: The default sandbox user is daytona (not root), with passwordless sudo.
- user parameter: Supported via sudo -u in single-service mode and docker compose exec --user in DinD mode. Numeric UIDs are supported. Requires sudo with passwordless access (configured by default).
- stdin (input): Supported via input redirection from a temp file. POSIX-compatible.
- Network: Outbound internet depends on your Daytona subscription tier. Tiers 1-2 restrict to essential services (Docker Hub, npm, PyPI, GitHub, AI providers); tiers 3-4 have full internet. Compose network_mode is translated to Daytona’s network_block_all; x-daytona.network_block_all takes precedence. For DinD, the VM always has network enabled (needed for docker pull); service-level isolation is via compose network_mode.
- DinD startup latency: Docker daemon boot + image pulls + compose up can take 30s+. inspect_sandboxes auto-creates a Daytona snapshot of the docker:28.3.3-dind base so subsequent samples skip the VM bring-up cost (see DinD snapshots above). Building the DinD image on-demand per sample (the fallback path when a snapshot isn’t available) is prohibitively slow for eval workloads and is not recommended.
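The network precedence described in the notes above can be sketched as a small resolver. resolve_network_block_all is a hypothetical function, and treating network_mode "none" as the value that maps to blocking is an assumption; the text says only that network_mode is translated.

```python
from typing import Optional


def resolve_network_block_all(
    x_daytona_block_all: Optional[bool],
    compose_network_mode: Optional[str],
    dind: bool = False,
) -> bool:
    """Decide whether the Daytona sandbox should block all network access."""
    if dind:
        # The DinD VM keeps network access so it can pull images.
        return False
    if x_daytona_block_all is not None:
        # An explicit x-daytona setting takes precedence over compose.
        return x_daytona_block_all
    # Assumption: network_mode "none" is what translates to blocking.
    return compose_network_mode == "none"
```

For example, a single-service file with network_mode: none and x-daytona.network_block_all: false resolves to an unblocked sandbox, because the x-daytona value wins.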
Limitations
- Architecture: Daytona runners use linux/amd64. arm64-only images are not supported.
- stderr: The Daytona API returns combined stdout+stderr; stderr is always empty.