Inspect Flow
Introduction
Inspect Flow is a workflow orchestration tool for Inspect AI that enables you to run evaluations at scale with repeatability and maintainability.
Why Inspect Flow?
As evaluation workflows grow in complexity (running multiple tasks across different models with varying parameters), managing these experiments becomes challenging. Inspect Flow addresses this by providing:
- Declarative Configuration: Define complex evaluations with tasks, models, and parameters in type-safe schemas
- Repeatable & Shareable: Encapsulated definitions of tasks, models, configurations, and Python dependencies ensure experiments can be reliably repeated and shared
- Incremental Execution: Add new models, tasks, or configurations to existing results without re-running completed work
- Parameter Sweeping: Matrix patterns for systematic exploration across tasks, models, and hyperparameters
Inspect Flow is designed for researchers and engineers running systematic AI evaluations who need to scale beyond ad-hoc scripts.
Getting Started
Before using Inspect Flow, you should:
- Have familiarity with Inspect AI
- Have an existing Inspect evaluation or use one from inspect-evals
Installation
Install the inspect_flow package from PyPI as follows:
pip install inspect-flow
Set up API keys
You’ll need API keys for the model providers you want to use. Set the relevant provider API key in your .env file or export it in your shell:
export OPENAI_API_KEY=your-openai-api-key
export ANTHROPIC_API_KEY=your-anthropic-api-key
export GOOGLE_API_KEY=your-google-api-key
export GROK_API_KEY=your-grok-api-key
export MISTRAL_API_KEY=your-mistral-api-key
export HF_TOKEN=your-hf-token
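Alternatively, an equivalent .env file in your project directory contains the same keys without the export keyword. Include only the providers you actually use, for example:
.env
OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
GOOGLE_API_KEY=your-google-api-key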
Optional: VS Code extension
Optionally install the Inspect AI VS Code Extension which includes features for viewing evaluation log files.
Basic Examples
Let’s walk through creating your first Flow configuration. We’ll use FlowJob (the entrypoint class) and FlowTask to define evaluations.
- types.FlowJob: Pydantic class that encapsulates the declarative description of a Flow job.
- types.FlowTask: Pydantic class abstraction on top of Inspect AI Task.
- types.FlowModel: Pydantic class abstraction on top of Inspect AI Model (see the sketch after the first example below).
- types.FlowGenerateConfig: Pydantic class abstraction on top of Inspect AI GenerateConfig.
- tasks_matrix: Helper function for parameter sweeping that generates a list of tasks with all parameter combinations.
- models_matrix: Helper function for parameter sweeping that generates a list of models with all parameter combinations.
- configs_matrix: Helper function for parameter sweeping that generates a list of GenerateConfig with all parameter combinations.
FlowJob is the main entrypoint for defining evaluation runs. At its core, it takes a list of tasks to run. Here’s a simple example that runs two evaluations:
config.py
from inspect_flow import FlowJob, FlowTask
FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    tasks=[
        FlowTask(
            name="inspect_evals/gpqa_diamond",
            model="openai/gpt-4o",
        ),
        FlowTask(
            name="inspect_evals/mmlu_0_shot",
            model="openai/gpt-4o",
        ),
    ],
)
To run the evaluations, run the following command in your shell. This will create a virtual environment for this job run and install the dependencies. Note that model dependencies (like the openai Python package) are inferred and installed automatically.
flow run config.py
This will run both tasks and display progress in your terminal.
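In the config above, model is given as a plain string. The FlowModel and FlowGenerateConfig types listed earlier can describe a model more explicitly. The sketch below is illustrative rather than taken from the reference docs: it assumes FlowModel exposes name and config fields that mirror the model and config parameters accepted by models_matrix later in this guide.
from inspect_flow import FlowGenerateConfig, FlowJob, FlowModel, FlowTask

# Illustrative sketch: assumes FlowModel accepts name and config fields
# analogous to the model/config parameters of models_matrix (shown below).
FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    tasks=[
        FlowTask(
            name="inspect_evals/gpqa_diamond",
            model=FlowModel(
                name="openai/gpt-5",
                config=FlowGenerateConfig(reasoning_effort="low"),
            ),
        ),
    ],
)
The matrix helpers in the next section generate these model objects for you, so in practice you rarely need to write them out by hand.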

Matrix Functions
Often you’ll want to evaluate multiple tasks across multiple models. Rather than manually defining every combination, use tasks_matrix to generate all task-model pairs:
matrix.py
from inspect_flow import FlowJob, tasks_matrix
FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    tasks=tasks_matrix(
        task=[
            "inspect_evals/gpqa_diamond",
            "inspect_evals/mmlu_0_shot",
        ],
        model=[
            "openai/gpt-5",
            "openai/gpt-5-mini",
        ],
    ),
)
To preview the expanded config before running it, use the following command and confirm that the generated configuration is what you intend to run.
flow config matrix.py
This command outputs the expanded configuration, showing all 4 task-model combinations (2 tasks × 2 models).
matrix.yml
log_dir: logs
dependencies:
  - inspect-evals
tasks:
  - name: inspect_evals/gpqa_diamond
    model:
      name: openai/gpt-5
  - name: inspect_evals/gpqa_diamond
    model:
      name: openai/gpt-5-mini
  - name: inspect_evals/mmlu_0_shot
    model:
      name: openai/gpt-5
  - name: inspect_evals/mmlu_0_shot
    model:
      name: openai/gpt-5-mini
tasks_matrix and models_matrix can be nested to multiple levels, which enables sophisticated parameter sweeping. Suppose you want to explore different reasoning efforts across models: you can achieve this with the models_matrix function.
models_matrix.py
from inspect_flow import FlowGenerateConfig, FlowJob, models_matrix, tasks_matrix
FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    tasks=tasks_matrix(
        task=[
            "inspect_evals/gpqa_diamond",
            "inspect_evals/mmlu_0_shot",
        ],
        model=models_matrix(
            model=[
                "openai/gpt-5",
                "openai/gpt-5-mini",
            ],
            config=[
                FlowGenerateConfig(reasoning_effort="minimal"),
                FlowGenerateConfig(reasoning_effort="low"),
                FlowGenerateConfig(reasoning_effort="medium"),
                FlowGenerateConfig(reasoning_effort="high"),
            ],
        ),
    ),
)
For even more concise parameter sweeping, use configs_matrix to generate configuration variants. This produces the same 16 evaluations (2 tasks × 2 models × 4 reasoning levels) as above, but with less boilerplate:
configs_matrix.py
from inspect_flow import FlowJob, configs_matrix, models_matrix, tasks_matrix
FlowJob(
    log_dir="logs",
    dependencies=["inspect-evals"],
    tasks=tasks_matrix(
        task=[
            "inspect_evals/gpqa_diamond",
            "inspect_evals/mmlu_0_shot",
        ],
        model=models_matrix(
            model=[
                "openai/gpt-5",
                "openai/gpt-5-mini",
            ],
            config=configs_matrix(
                reasoning_effort=["minimal", "low", "medium", "high"],
            ),
        ),
    ),
)
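As with matrix.py earlier, you can preview the fully expanded configuration before launching the run:
flow config configs_matrix.py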
To run the config:
flow run configs_matrix.py
This will run all 16 evaluations (2 tasks × 2 models × 4 reasoning levels). When complete, you’ll find a link to the logs at the bottom of the task results summary.

To view logs interactively, run:
inspect view
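If you want to point the viewer at the log directory configured in the examples above, you can pass the standard Inspect AI --log-dir option (this assumes the job wrote its logs under the logs directory):
inspect view --log-dir logs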
Learning More
See the following articles to learn more about using Flow: