Using Flow

Core Concepts

Flow Type System

Inspect Flow mirrors Inspect AI’s object model with corresponding Flow types:

  • FlowJob — The top-level job definition. Under the hood it translates to an Inspect AI Eval Set. Contains a list of tasks and job settings.
  • FlowTask — Configuration for a single evaluation task. Maps to Inspect AI Task parameters.
  • FlowModel — Model configuration including API settings and generation settings. Maps to Inspect AI Model.
  • FlowGenerateConfig — Model generation parameters. Maps to Inspect AI GenerateConfig.
  • FlowSolver and FlowAgent — Solver and agent chain configuration. Map to Inspect AI Solver and Agent.
  • FlowOptions — Runtime execution options. Maps to Inspect AI eval_set() parameters.
  • FlowDefaults — System for setting default values across tasks, models, solvers, and agents.
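
The sketch below shows how these types nest in practice (a minimal illustration; the field values are placeholders, and the mappings noted in comments follow the list above):

types_sketch.py
from inspect_flow import (
    FlowGenerateConfig,
    FlowJob,
    FlowModel,
    FlowOptions,
    FlowTask,
)

FlowJob(  # translates to an Inspect AI Eval Set
    log_dir="logs",
    options=FlowOptions(limit=10),  # maps to eval_set() parameters
    tasks=[
        FlowTask(  # maps to Inspect AI Task parameters
            name="inspect_evals/gpqa_diamond",
            model=FlowModel(  # maps to Inspect AI Model
                name="openai/gpt-5",
                config=FlowGenerateConfig(temperature=0.7),  # maps to GenerateConfig
            ),
        )
    ],
)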

Flow Config Structure

FlowJob is the primary interface for defining Flow workflows. All Flow operations—including parameter sweeps and matrix expansions—ultimately produce a list of tasks that FlowJob executes.

Required fields:

  • log_dir — Output path for logging results. Must be set before running the flow job. Supports S3 paths (e.g., s3://bucket/path)

Optional fields:

  • tasks — List of FlowTask objects defining the evaluations to run. Default: None
  • includes — List of other flow configs to include (paths as strings or FlowInclude objects). Default: None
  • log_dir_create_unique — If True, append a numeric suffix to log_dir if it exists. If False, reuse the existing directory (it must be empty, or log_dir_allow_dirty=True must be set). Default: False
  • python_version — Python version for the isolated virtual environment (e.g., "3.11"). Default: same as the current environment
  • dependencies — PyPI packages, Git URLs, or local paths to install. Default: None
  • env — Environment variables to set when running tasks. Default: None
  • options — Runtime options (see the FlowOptions reference). Default: None
  • defaults — Default values applied across tasks, models, and solvers. Default: None
  • flow_metadata — Metadata stored in the flow config (not passed to Inspect AI). Default: None
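
A job combining several of these optional fields might look like this (a sketch; the values are illustrative):

optional_fields.py
from inspect_flow import FlowJob, FlowOptions, FlowTask

FlowJob(
    log_dir="logs/example",
    log_dir_create_unique=True,  # append a numeric suffix if logs/example exists
    python_version="3.11",  # interpreter for the isolated environment
    dependencies=["inspect-evals"],  # installed into that environment
    env={"MY_SETTING": "value"},  # hypothetical variable set when running tasks
    options=FlowOptions(limit=10),  # runtime options passed through to eval_set()
    tasks=[FlowTask(name="inspect_evals/gpqa_diamond", model="openai/gpt-5")],
)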

Creating Flow Configs

Your First Config

In Basic Examples, we showed a simple FlowJob. Let’s break down what’s happening:

first_config.py
from inspect_flow import FlowJob, FlowTask

FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    tasks=[
        FlowTask(
            name="inspect_evals/gpqa_diamond",
            model="openai/gpt-5",
        ),
    ],
)
  1. Specify the directory for storing logs
  2. Install the inspect-evals Python package
  3. List the evaluation tasks to run
  4. Specify a task from the registry by name
  5. Specify the model to evaluate by name

To run the job, enter the following command in your shell:

flow run first_config.py

[Screenshot: progress bar in the terminal]

What happens when you run this?

  1. Flow creates an isolated virtual environment
  2. Installs inspect-evals and openai (inferred from model)
  3. Loads the gpqa_diamond task from the inspect_evals registry
  4. Runs the evaluation with GPT-5
  5. Stores results in logs/
Note: Config File Evaluation

Python config files are evaluated as normal Python code. The last expression in the file is used as the FlowJob. This means you can:

  • Define variables and reuse them
  • Use loops or comprehensions to generate task lists
  • Import helper functions
  • Add comments and documentation

Flow configs are just Python!
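
Because configs are plain Python, you can generate the task list programmatically. Here is a short sketch using a list comprehension over models (names taken from the examples in this guide):

loop_config.py
from inspect_flow import FlowJob, FlowTask

MODELS = ["openai/gpt-5", "openai/gpt-5-mini"]

FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    # One task per model, built with a comprehension
    tasks=[FlowTask(name="inspect_evals/gpqa_diamond", model=m) for m in MODELS],
)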

Task Specification

In the example above, we used a registry name ("inspect_evals/gpqa_diamond"). Flow supports multiple ways to reference tasks:

# Tasks from installed packages.
FlowTask(
    name="inspect_evals/gpqa_diamond",
    model="openai/gpt-5"
)
# Auto-discovers `@task` decorated functions in the specified file and creates a task for each of them.
FlowTask(
    name="./my_task.py",
    model="openai/gpt-5"
)
# Explicitly selects a specific function from the file.
FlowTask(
    name="./my_task.py@custom_eval",
    model="openai/gpt-5"
)

Task Configuration

FlowTask accepts parameters that map to Inspect AI Task fields. The examples below show commonly used fields; see the FlowTask reference documentation for the complete list of available parameters.

FlowTask(
    name="inspect_evals/mmlu_0_shot",
    model="openai/gpt-5",
    epochs=3,
    config=FlowGenerateConfig(
        temperature=0.7,
        max_tokens=1000,
    ),
    solver="chain_of_thought",
    args={"subject": "physics"},
    sandbox="docker",
    sample_id=[0, 1, 2],
)
  1. Task name — Maps to Inspect AI Task.name. Can be a registry name ("inspect_evals/mmlu"), file path ("./task.py"), or file with function ("./task.py@eval_fn").
  2. Model — Maps to Inspect AI Task.model. Optional model for this task. If not specified, uses the model from the INSPECT_EVAL_MODEL environment variable. Can be a string ("openai/gpt-5") or a FlowModel object for advanced configuration.
  3. Epochs — Maps to Inspect AI Task.epochs. Number of times to repeat evaluation over the dataset samples. Can be an integer (epochs=3) or a FlowEpochs object to specify custom reducer functions (FlowEpochs(epochs=3, reducer="median")). By default, scores are combined using the "mean" reducer across epochs.
  4. Generation config — Maps to Inspect AI Task.config (GenerateConfig). Model generation parameters like temperature, max_tokens, top_p, reasoning_effort, etc. These settings override config set at the FlowJob level but are overridden by settings on FlowModel.config.
  5. Solver chain — Maps to Inspect AI Task.solver. The algorithm(s) for solving the task. Can be a string ("chain_of_thought"), a FlowSolver object, a FlowAgent object, or a list of solvers for chaining. Defaults to generate() if not specified.
  6. Task arguments — Maps to task function parameters. Dictionary of arguments passed to the task constructor or @task decorated function. Enables parameterization of tasks (e.g., selecting dataset subsets, configuring difficulty levels).
  7. Sandbox environment — Maps to Inspect AI Task.sandbox. Can be a string ("docker", "local"), a tuple with additional config, or a SandboxEnvironmentType object.
  8. Sample selection — Evaluate specific samples from the dataset. Accepts a single ID (sample_id=0), a list of IDs (sample_id=[0, 1, 2]), or a list of string IDs.
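
As an example of item 3 above, the FlowEpochs form lets you choose the reducer explicitly (a sketch, assuming FlowEpochs is importable from inspect_flow alongside the other Flow types):

epochs_example.py
from inspect_flow import FlowEpochs, FlowTask

FlowTask(
    name="inspect_evals/mmlu_0_shot",
    model="openai/gpt-5",
    # Run each sample 3 times; combine scores with the median reducer
    epochs=FlowEpochs(epochs=3, reducer="median"),
)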

Model Specification

Models can be specified as simple strings or as FlowModel objects for more control:

FlowTask(name="task", model="openai/gpt-5")
FlowTask(
    name="task",
    model=FlowModel(
        name="openai/gpt-5",
        config=FlowGenerateConfig(
            reasoning_effort="medium",
            max_connections=10,
        ),
        base_url="https://custom-endpoint.com",
        api_key="${CUSTOM_API_KEY}",
    )
)
# For agent evaluations with multiple roles
FlowTask(
    name="multi_agent_task",
    model_roles={
        "assistant": "openai/gpt-5",
        "critic": "anthropic/claude-3-5-sonnet",
    }
)
Tip: When to use FlowModel

Use FlowModel when you need to:

  • Set model-specific generation configs
  • Use custom API endpoints
  • Configure API keys per model
  • Organize complex multi-model setups

Dependency Management

Inspect Flow automatically creates isolated virtual environments for each workflow run, ensuring repeatability and avoiding dependency conflicts with your system Python environment.

How Virtual Environments Work

When you run flow run config.py, Flow:

  1. Creates a temporary virtual environment with uv
  2. Installs your specified dependencies plus auto-detected model provider packages
  3. Executes your evaluations in this isolated environment
  4. Cleans up the temporary environment after completion (logs persist in log_dir)

Specifying Dependencies

The dependencies field in FlowJob accepts multiple types of package specifiers:

# Standard PyPI package names with optional version specifiers.
FlowJob(
    dependencies=[
        "inspect-evals",
        "pandas==2.0.0",
    ],
    tasks=[...]
)
# Install directly from Git repositories. Use `@commit_hash` to pin to specific versions for repeatability.
FlowJob(
    dependencies=[
        "git+https://github.com/UKGovernmentBEIS/inspect_evals@ef181cd",
    ],
    tasks=[...]
)
# Install local packages using relative or absolute paths.
FlowJob(
    dependencies=[
        "./my_custom_eval",
        "../shared/utils",
    ],
    tasks=[...]
)

Python Version Control

Specify the Python version for your job’s virtual environment:

FlowJob(
    python_version="3.11",
    dependencies=["inspect-evals"],
    tasks=[...]
)
Tip: Checking Python Version

To verify which Python version will be used, run:

flow config config.py --resolve

This shows the resolved configuration including the Python version that will be used.

Tip: Repeatability Best Practices

For repeatable workflows:

  • Pin PyPI package versions: "inspect-evals==0.3.15"
  • Pin Git commits: "git+https://github.com/user/repo@commit_hash"
  • Specify python_version explicitly
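
Put together, a fully pinned job might look like this (a sketch; the version and Git URL are illustrative placeholders):

pinned_job.py
from inspect_flow import FlowJob

FlowJob(
    log_dir="logs",
    python_version="3.11",  # pin the interpreter
    dependencies=[
        "inspect-evals==0.3.15",  # pin the PyPI version
        "git+https://github.com/user/repo@commit_hash",  # pin the Git commit
    ],
    tasks=[...],
)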

Defaults and Overrides

Inspect Flow provides a powerful defaults system to avoid repetition when configuring evaluations. The FlowDefaults field lets you set default values that cascade across tasks, models, solvers, and agents—with more specific settings overriding less specific ones.

The FlowDefaults System

FlowDefaults supports multiple levels of default configuration:

from inspect_flow import (
    FlowJob,
    FlowDefaults,
    FlowGenerateConfig,
    FlowModel,
    FlowSolver,
    FlowAgent,
    FlowTask,
)

FlowJob(
    defaults=FlowDefaults(
        config=FlowGenerateConfig(
            max_connections=10,
        ),
        model=FlowModel(
            model_args={"arg": "foo"},
        ),
        model_prefix={
            "openai/": FlowModel(
                config=FlowGenerateConfig(
                    max_connections=20
                ),
            ),
        },
        solver=FlowSolver(...),
        solver_prefix={"chain_of_thought": ...},
        agent=FlowAgent(...),
        agent_prefix={"inspect/": ...},
        task=FlowTask(...),
        task_prefix={"inspect_evals/": ...},
    ),
    tasks=[...]
)
  1. config — Default model generation options. Will be overridden by settings on FlowTask and FlowModel.
  2. model — Field defaults for models.
  3. model_prefix — Model defaults for model name prefixes. Overrides FlowDefaults.config and FlowDefaults.model. If multiple prefixes match, the longest prefix wins.
  4. solver — Field defaults for solvers.
  5. solver_prefix — Solver defaults for solver name prefixes. Overrides FlowDefaults.solver. If multiple prefixes match, the longest prefix wins.
  6. agent — Field defaults for agents.
  7. agent_prefix — Agent defaults for agent name prefixes. Overrides FlowDefaults.agent. If multiple prefixes match, the longest prefix wins.
  8. task — Field defaults for tasks.
  9. task_prefix — Task defaults for task name prefixes. Overrides FlowDefaults.config and FlowDefaults.task. If multiple prefixes match, the longest prefix wins.
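
To illustrate the longest-prefix rule (a sketch; the connection limits are illustrative):

prefix_match.py
from inspect_flow import FlowDefaults, FlowGenerateConfig, FlowModel

FlowDefaults(
    model_prefix={
        "openai/": FlowModel(config=FlowGenerateConfig(max_connections=20)),
        "openai/gpt-5": FlowModel(config=FlowGenerateConfig(max_connections=5)),
    },
)
# For the model "openai/gpt-5-mini" both prefixes match, so the longer
# "openai/gpt-5" wins: max_connections=5.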

Merge Priority and Behavior

Defaults follow a hierarchy where more specific settings override less specific ones:

For Models:

  1. Global config defaults (defaults.config)
  2. Global model defaults (defaults.model)
  3. Model prefix defaults (defaults.model_prefix)
  4. Task-specific config (task.config)
  5. Model-specific config (model.config) — highest priority

Example hierarchy in action:

hierarchy.py
from inspect_flow import FlowDefaults, FlowGenerateConfig, FlowJob, FlowModel, FlowTask

FlowJob(
    defaults=FlowDefaults(
        config=FlowGenerateConfig(
            temperature=0.0,
            max_tokens=100,
        ),
        model_prefix={
            "openai/": FlowModel(
                config=FlowGenerateConfig(temperature=0.5)
            )
        },
    ),
    tasks=[
        FlowTask(
            name="task",
            config=FlowGenerateConfig(temperature=0.7),
            model=FlowModel(
                name="openai/gpt-4o",
                config=FlowGenerateConfig(temperature=1.0),
            ),
        )
    ],
)
  1. Global defaults: temperature=0.0, max_tokens=100
  2. Prefix defaults override: temperature=0.5 (for OpenAI models)
  3. Task config overrides: temperature=0.7
  4. Model config wins: temperature=1.0, max_tokens=100

Final result: temperature=1.0 (most specific), max_tokens=100 (from global defaults)

Note: None Values Don't Override

Setting a field to None means “not specified” — it won’t override existing values from defaults. This allows partial configs to merge cleanly:

FlowTask(
    config=FlowGenerateConfig(temperature=0.5, max_tokens=None)
)

The max_tokens=None doesn’t override a default max_tokens value; it’s simply not set at this level.

CLI Overrides

Override config values at runtime using the --set flag:

flow run config.py --set log_dir=./logs
flow run config.py --set options.limit=10
flow run config.py --set defaults.solver.args.tool_calls=none
flow run config.py --set 'options.metadata={"experiment": "baseline", "version": "v1"}'
flow run config.py \
  --set log_dir=./logs/experiment1 \
  --set options.limit=100 \
  --set defaults.config.temperature=0.5
Note: Override Behavior
  • Strings: Replace existing values
  • Dicts: Replace existing values
  • Lists:
    • String values append to existing list
    • Lists replace existing list

Examples:

# Appends to list
--set dependencies=new_package

# Replaces list
--set 'dependencies=["pkg1", "pkg2"]'

Environment Variables

Set config values via environment variables:

export INSPECT_FLOW_LOG_DIR=./logs/custom
export INSPECT_FLOW_LIMIT=50
export INSPECT_FLOW_SET="options.metadata={\"key\": \"value\"}"
export INSPECT_FLOW_VAR="task_min_priority=2"
flow run config.py

Supported environment variables:

  • INSPECT_FLOW_LOG_DIR (--log-dir) — Override the log directory
  • INSPECT_FLOW_LOG_DIR_CREATE_UNIQUE (--log-dir-create-unique) — Create a new log directory with a numeric suffix if one exists
  • INSPECT_FLOW_LIMIT (--limit) — Limit the number of samples
  • INSPECT_FLOW_SET (--set) — Set config overrides (can be specified multiple times)
  • INSPECT_FLOW_VAR (--var) — Set variables for __flow_vars__ (can be specified multiple times)
Tip: Override Priority

Defaults set on the command line (e.g., --set defaults.config.temperature=0.5) replace the defaults in the config file, but they can still be overridden by anything set explicitly on a task, model, or solver.

Dedicated environment variables (see Supported environment variables above) and their corresponding CLI flags override values supplied via --set.

Dedicated CLI flags have the highest priority.

Priority order (lowest to highest) of log-dir, log-dir-create-unique, and limit:

  1. FlowJob defaults
  2. Explicit setting on the FlowJob
  3. INSPECT_FLOW_SET_ environment variables
  4. CLI --set flags
  5. INSPECT_FLOW_LOG_DIR, INSPECT_FLOW_LOG_DIR_CREATE_UNIQUE and INSPECT_FLOW_LIMIT environment variables
  6. Explicit --log-dir, --log-dir-create-unique and --limit CLI flags
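
For example, if both mechanisms set the log directory, the dedicated flag wins per the order above:

flow run config.py --set log_dir=./logs/a --log-dir ./logs/b
# Logs go to ./logs/b: the dedicated --log-dir flag (6) outranks --set (4)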

All other settings follow this priority order (lowest to highest):

  1. FlowJob defaults
  2. Explicit setting on the FlowJob
  3. INSPECT_FLOW_SET_ environment variables
  4. CLI --set flags

Debugging Defaults Resolution

To see the fully resolved configuration with all defaults applied:

flow config config.py --resolve

This shows exactly what settings each task will use after applying all defaults and overrides.

Configuration Inheritance

Inspect Flow supports configuration inheritance through two mechanisms:

Explicit Includes

Use the includes field to explicitly merge other config files into your job:

FlowJob(
    includes=["defaults_flow.py"],
    log_dir="./logs/flow_test",
    tasks=["my_task"]
)

Included configs are merged recursively, with the current config’s values taking precedence over included values. For dictionaries, fields are merged deeply. For lists, items are concatenated with duplicates removed.

Automatic Discovery

Inspect Flow automatically discovers and includes files named _flow.py in parent directories. Starting from your config file’s location, it searches upward through the directory tree for _flow.py files and automatically merges them as base configurations.

This allows you to define shared defaults (model settings, dependencies, etc.) at a repository root that apply to all configs in subdirectories without explicit includes.
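
A typical layout might look like this (a sketch; the shared _flow.py returns a job carrying only the common settings, as in the Configuration Scripts examples later in this guide):

repo/
├── _flow.py              # shared base config, discovered automatically
└── experiments/
    └── config.py         # inherits _flow.py without an explicit include

_flow.py
from inspect_flow import FlowJob

# Defaults shared by every config in this repository
FlowJob(dependencies=["inspect-evals"])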

Parameter Sweeping

Parameter sweeping lets you systematically explore evaluation configurations by generating Cartesian products of parameters. Instead of manually writing every combination, Flow provides matrix and “with” functions to declaratively generate evaluation grids.

Matrix Functions (Cartesian Products)

Matrix functions generate all combinations of their parameters using Cartesian products.

tasks_matrix()

Generate task configurations by combining tasks with models, configs, solvers, and arguments:

tasks_matrix.py
from inspect_flow import FlowJob, tasks_matrix

FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    tasks=tasks_matrix(
        task=["inspect_evals/gpqa_diamond", "inspect_evals/mmlu_0_shot"],
        model=["openai/gpt-4o", "anthropic/claude-3-5-sonnet"],
    ),
)

This creates 4 tasks (2 tasks × 2 models).
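
Up to ordering, this matrix is equivalent to writing out the Cartesian product by hand:

# Equivalent explicit task list (one FlowTask per task/model pair)
[
    FlowTask(name="inspect_evals/gpqa_diamond", model="openai/gpt-4o"),
    FlowTask(name="inspect_evals/gpqa_diamond", model="anthropic/claude-3-5-sonnet"),
    FlowTask(name="inspect_evals/mmlu_0_shot", model="openai/gpt-4o"),
    FlowTask(name="inspect_evals/mmlu_0_shot", model="anthropic/claude-3-5-sonnet"),
]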

models_matrix()

Generate model configurations with different generation settings:

models_matrix.py
from inspect_flow import FlowGenerateConfig, FlowJob, models_matrix, tasks_matrix

FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    tasks=tasks_matrix(
        task=[
            "inspect_evals/gpqa_diamond",
            "inspect_evals/mmlu_0_shot",
        ],
        model=models_matrix(
            model=[
                "openai/gpt-5",
                "openai/gpt-5-mini",
            ],
            config=[
                FlowGenerateConfig(reasoning_effort="minimal"),
                FlowGenerateConfig(reasoning_effort="low"),
                FlowGenerateConfig(reasoning_effort="medium"),
                FlowGenerateConfig(reasoning_effort="high"),
            ],
        ),
    ),
)

This creates 16 tasks (2 tasks × 2 models × 4 reasoning_effort settings).

configs_matrix()

Generate generation config combinations by specifying individual parameters:

configs_matrix.py
from inspect_flow import FlowJob, configs_matrix, models_matrix, tasks_matrix

FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    tasks=tasks_matrix(
        task=[
            "inspect_evals/gpqa_diamond",
            "inspect_evals/mmlu_0_shot",
        ],
        model=models_matrix(
            model=[
                "openai/gpt-5",
                "openai/gpt-5-mini",
            ],
            config=configs_matrix(
                reasoning_effort=["minimal", "low", "medium", "high"],
            ),
        ),
    ),
)

This creates 16 tasks (2 tasks × 2 models × 4 reasoning_effort settings).

solvers_matrix() and agents_matrix()

Generate solver or agent configurations with different arguments:

solvers_matrix.py
from inspect_flow import FlowJob, solvers_matrix, tasks_matrix

FlowJob(
    log_dir="logs",
    tasks=tasks_matrix(
        task="my_task",
        solver=solvers_matrix(
            solver="chain_of_thought",
            args=[
                {"max_iterations": 3},
                {"max_iterations": 5},
                {"max_iterations": 10},
            ],
        ),
    ),
)

This creates 3 tasks (1 task × 3 solver configurations).

With Functions (Apply to All)

“With” functions apply the same value to all items in a list. Use these when you want to sweep over some parameters while keeping others constant.

tasks_with()

Apply common settings to multiple tasks:

tasks_with.py
from inspect_flow import FlowGenerateConfig, FlowJob, tasks_with

FlowJob(
    tasks=tasks_with(
        task=["inspect_evals/gpqa_diamond", "inspect_evals/mmlu_0_shot"],
        model="openai/gpt-4o",  # Same model for both tasks
        config=FlowGenerateConfig(temperature=0.7),  # Same config for both
    )
)

This creates 2 tasks (2 tasks, each with the same model and config).

Combining Matrix and With

Mix parameter sweeps with common settings:

matrix_and_with.py
from inspect_flow import (
    FlowJob,
    configs_matrix,
    tasks_matrix,
    tasks_with,
)

FlowJob(
    log_dir="logs",
    tasks=tasks_with(
        task=tasks_matrix(
            task=["task1", "task2"], config=configs_matrix(temperature=[0.0, 0.5, 1.0])
        ),  # Creates 6 tasks (2 × 3)
        model="openai/gpt-4o",  # Applied to all 6 tasks
        sandbox="docker",  # Applied to all 6 tasks
    ),
)

Nested Sweeps

Matrix functions can be nested to create complex parameter grids. Use the unpacking operator * to expand inner matrix results:

Example: Tasks with nested model sweep

nested_model_sweep.py
from inspect_flow import FlowGenerateConfig, FlowJob, models_matrix, tasks_matrix

FlowJob(
    log_dir="logs",
    tasks=tasks_matrix(
        task=["inspect_evals/mmlu_0_shot", "inspect_evals/gpqa_diamond"],
        model=[
            "anthropic/claude-3-5-sonnet",  # Single model
            *models_matrix(  # Unpacks list of 4 model configs
                model=["openai/gpt-4o", "openai/gpt-4o-mini"],
                config=[
                    FlowGenerateConfig(reasoning_effort="low"),
                    FlowGenerateConfig(reasoning_effort="high"),
                ],
            ),
        ],  # Total: 1 + 4 = 5 models
    ),
)

This creates 10 tasks (2 tasks × 5 model configurations).

Example: Tasks with nested task sweep

nested_task_sweep.py
from inspect_flow import FlowJob, FlowTask, tasks_matrix

FlowJob(
    log_dir="logs",
    tasks=tasks_matrix(
        task=[
            FlowTask(name="task1", args={"subset": "test"}),  # Single task
            *tasks_matrix(  # Unpacks list of 3 tasks
                task="task2",
                args=[
                    {"language": "en"},
                    {"language": "de"},
                    {"language": "fr"},
                ],
            ),
        ],  # Total: 1 + 3 = 4 tasks
        model=["model1", "model2"],
    ),
)

This creates 8 tasks (4 task variants × 2 models).

Warning: Watch Out for Combinatorial Explosion

Parameter sweeps grow multiplicatively. A sweep with:

  • 3 tasks
  • 4 models
  • 5 temperature values
  • 3 solver configurations

results in 3 × 4 × 5 × 3 = 180 evaluations.

Always use --dry-run to check the number of evaluations before running expensive grids.

Config Merging Behavior

When base objects already have values, matrix parameters are merged:

# Config fields merge recursively
tasks_matrix(
    task=FlowTask(
        name="task",
        config=FlowGenerateConfig(temperature=0.5)  # Base value
    ),
    config=[
        FlowGenerateConfig(max_tokens=1000),  # Adds max_tokens, keeps temperature=0.5
        FlowGenerateConfig(max_tokens=2000),  # Adds max_tokens, keeps temperature=0.5
    ]
)

Running Evaluations

Once you’ve defined your Flow configuration, you can execute evaluations using the flow run command. Flow also provides tools for previewing configurations and controlling runtime behavior.

The flow run Command

Execute your evaluation workflow:

flow run config.py

What happens when you run this:

  1. Flow loads your configuration file
  2. Creates an isolated virtual environment
  3. Installs dependencies
  4. Resolves all defaults and matrix expansions
  5. Executes evaluations via Inspect AI’s eval_set()
  6. Stores logs in log_dir
  7. Cleans up the temporary environment

Common CLI Flags

Preview without running:

flow run config.py --dry-run

Shows how many tasks would be executed without actually running them. Useful for validating large parameter sweeps.

eval_set would be called with 24 tasks

Override log directory:

flow run config.py --log-dir ./experiments/baseline

Changes where logs and results are stored.

Runtime overrides:

flow run config.py \
  --set options.limit=100 \
  --set defaults.config.temperature=0.5

Override any configuration value at runtime. See CLI Overrides for more details.

The flow config Command

Preview your configuration before running:

Basic usage:

flow config config.py

Displays the parsed configuration as YAML with CLI overrides applied. Does not create a virtual environment or instantiate Python objects.

Full resolution:

flow config config.py --resolve

Shows the completely resolved configuration:

  • Creates virtual environment
  • Applies all defaults
  • Expands all matrix functions
  • Instantiates all Python objects

This is invaluable for debugging what settings will actually be used in your evaluations.

Tip: When to Use Each Command
  • flow config - Quick syntax check, verify overrides
  • flow config --resolve - Debug defaults resolution, inspect final settings
  • flow run --dry-run - Count tasks in parameter sweeps, validate before expensive runs
  • flow run - Execute evaluations

Results and Logs

Flow Directory Structure

Evaluation results are stored in the log_dir:

logs/
├── 2025-11-21T17-38-20+01-00_gpqa-diamond_KvJBGowidXSCLRhkKQbHYA.eval
├── 2025-11-21T17-38-20+01-00_mmlu-0-shot_Vnu2A3M2wPet5yobLiCQmZ.eval
├── .eval-set-id
├── eval-set.json
├── flow.yaml
└── ...

Directory structure:

  • Flow passes the log_dir directly to Inspect AI eval_set() for evaluation log storage
  • Inspect AI handles the actual evaluation log file naming and storage
  • Log file naming conventions follow Inspect AI’s standards (see Inspect AI logging docs)
  • Flow automatically saves the resolved configuration as flow.yaml in the log directory
  • The .eval-set-id file contains the eval set identifier
  • The eval-set.json file contains eval set metadata

Log formats:

  • .eval - Binary Inspect AI log format (default, high-performance)
  • .json - JSON format (if log_format="json" in FlowOptions)

Viewing Results

Using Inspect View:

inspect view

Opens the Inspect AI viewer to explore evaluation logs interactively.

S3 Support

Store logs directly to S3:

FlowJob(
    log_dir="s3://my-bucket/experiments/baseline",
    tasks=[...]
)

Advanced Features

Metadata Management

Flow supports two types of metadata with distinct purposes: metadata and flow_metadata.

metadata (Inspect AI Metadata)

The metadata field in FlowOptions and FlowTask is passed directly to Inspect AI and stored in evaluation logs. Use this for tracking experiment information that should be accessible in Inspect AI’s log viewer and analysis tools.

Example:

metadata.py
from inspect_flow import FlowJob, FlowOptions, FlowTask

FlowJob(
    log_dir="logs",
    options=FlowOptions(
        metadata={
            "experiment": "baseline_v1",
            "hypothesis": "Higher temperature improves creative tasks",
            "hardware": "A100-80GB",
        }
    ),
    tasks=[
        FlowTask(
            name="inspect_evals/gpqa_diamond",
            model="openai/gpt-4o",
            metadata={
                "task_variant": "chemistry_subset",
                "note": "Testing with reduced context",
            },
        )
    ],
)

The metadata from FlowOptions is applied globally to all tasks in the evaluation run, while task-level metadata is specific to each task.

flow_metadata (Flow-Only Metadata)

The flow_metadata field is available on FlowJob, FlowTask, FlowModel, FlowSolver, and FlowAgent. This metadata is not passed to Inspect AI—it exists only in the Flow configuration and is useful for configuration-time logic and organization.

Use cases:

  • Filtering or selecting configurations based on properties
  • Organizing complex configuration generation logic
  • Documenting configuration decisions
  • Annotating configs without polluting Inspect AI logs

Example: Configuration-time filtering

flow_metadata.py
from inspect_flow import FlowJob, FlowModel, tasks_matrix

# Define models with metadata about capabilities
models = [
    FlowModel(name="openai/gpt-4o", flow_metadata={"context_window": 128000}),
    FlowModel(name="openai/gpt-4o-mini", flow_metadata={"context_window": 128000}),
    FlowModel(
        name="anthropic/claude-3-5-sonnet", flow_metadata={"context_window": 200000}
    ),
]

# Filter to only long-context models
long_context_models = [
    m
    for m in models
    if m.flow_metadata and m.flow_metadata.get("context_window", 0) >= 128000
]

FlowJob(
    log_dir="logs",
    tasks=tasks_matrix(
        task="long_context_task",
        model=long_context_models,
    ),
)

Variable Substitution

Flow Variables (--var)

Pass custom variables to Python config files using --var or the INSPECT_FLOW_VAR environment variable. Use this for dynamic configuration that isn’t available via --set. Variables are accessible via the __flow_vars__ dictionary:

flow run config.py --var task_min_priority=2
flow_vars.py
from inspect_flow import FlowJob, FlowTask

task_min_priority = globals().get("__flow_vars__", {}).get("task_min_priority", 1)

all_tasks = [
    FlowTask(name="task_easy", flow_metadata={"priority": 1}),
    FlowTask(name="task_medium", flow_metadata={"priority": 2}),
    FlowTask(name="task_hard", flow_metadata={"priority": 3}),
]

FlowJob(
    log_dir="logs",
    tasks=[
        t
        for t in all_tasks
        if (t.flow_metadata or {}).get("priority", 0) >= task_min_priority
    ],
)

Template Substitution

Use {field_name} syntax to reference other FlowJob configuration values. Substitutions are applied after the config is loaded:

FlowJob(
    log_dir="logs/my_eval",
    options=FlowOptions(bundle_dir="{log_dir}/bundle"),
    # Result: bundle_dir="logs/my_eval/bundle"
)

For nested fields, use bracket notation: {options[eval_set_id]} or {flow_metadata[key]}. Substitutions are resolved recursively until no more remain.
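
For instance, a nested reference resolves the same way (a sketch assuming eval_set_id is set on FlowOptions):

FlowJob(
    log_dir="logs/my_eval",
    options=FlowOptions(
        eval_set_id="exp1",
        bundle_dir="{log_dir}/bundle/{options[eval_set_id]}",
        # Result: bundle_dir="logs/my_eval/bundle/exp1"
    ),
)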

Bundle URL Mapping

Convert local bundle paths to public URLs for sharing evaluation results. The bundle_url_map in FlowOptions applies string replacements to bundle_dir to generate a shareable URL that’s printed to stdout after the evaluation completes.

bundle_url_map.py
from inspect_flow import FlowJob, FlowOptions, FlowTask

FlowJob(
    log_dir="logs/my_eval",
    options=FlowOptions(
        bundle_dir="/local/storage/bundles/my_eval",
        bundle_url_map={"/local/storage": "https://example.com/shared"},
    ),
    tasks=[FlowTask(name="task", model="openai/gpt-4o")],
)

After the run completes, this prints: Bundle URL: https://example.com/shared/bundles/my_eval

Use this when storing bundles on servers with public HTTP access or cloud storage with URL mapping. Multiple mappings are applied in order.

Configuration Scripts

Included configuration files can execute validation code to enforce constraints. Files named _flow.py are automatically included from parent directories, and receive __flow_including_jobs__ containing all configs in the include chain.

Terminology:

  • Including config: The file being loaded (e.g., my_config.py)
  • Included config: The _flow.py file(s) automatically inherited from parent directories

The included config can inspect the including configs to validate or enforce constraints.

Prevent Runs with Uncommitted Changes

Place a _flow.py file at your repository root to validate that all configs are in clean git repositories. This validation runs automatically for all configs in subdirectories.

_flow.py
import subprocess
from pathlib import Path

from inspect_flow import FlowJob

# Get all configs that are including this one
including_jobs: dict[str, FlowJob] = globals().get("__flow_including_jobs__", {})


def check_repo(path: str) -> None:
    abs_path = Path(path).resolve()
    check_dir = abs_path if abs_path.is_dir() else abs_path.parent

    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=check_dir,
        capture_output=True,
        text=True,
        check=True,
    )

    if result.stdout.strip():
        raise RuntimeError(f"The repository at {check_dir} has uncommitted changes.")


# Check this config and all configs including it
check_repo(__file__)
for path in including_jobs.keys():
    check_repo(path)

FlowJob()  # Return empty job for inheritance
config.py
# Automatically inherits _flow.py
from inspect_flow import FlowJob, FlowTask

FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    tasks=[FlowTask(name="inspect_evals/gpqa_diamond", model="openai/gpt-4o")],
)
# Will fail if uncommitted changes exist in the repository

Lock Configuration Fields

A _flow.py file can prevent configs from overriding critical settings:

_flow.py
from inspect_flow import FlowJob, FlowOptions

MAX_SAMPLES = 16

# Get all configs that are including this one
including_jobs: dict[str, FlowJob] = globals().get("__flow_including_jobs__", {})

# Validate that including configs don't override MAX_SAMPLES
for file, job in including_jobs.items():
    if (
        job.options
        and job.options.max_samples is not None
        and job.options.max_samples != MAX_SAMPLES
    ):
        raise ValueError(
            f"Do not override max_samples! Error in {file} (or its includes)"
        )

FlowJob(
    options=FlowOptions(max_samples=MAX_SAMPLES),
)
config.py
# Automatically inherits _flow.py
from inspect_flow import FlowJob, FlowOptions, FlowTask

FlowJob(
    log_dir="logs",
    options=FlowOptions(max_samples=32),  # Will raise ValueError!
    tasks=[FlowTask(name="inspect_evals/gpqa_diamond", model="openai/gpt-4o")],
)

This pattern is useful for enforcing organizational standards (resource limits, safety constraints, etc.) across all evaluation configs in a repository.