inspect_flow

Types

FlowAgent

Configuration for an Agent.

class FlowAgent(BaseModel, extra="forbid")

Attributes

name str | None

Name of the agent. Required to be set by the time the agent is created.

args CreateArgs | None

Additional args to pass to agent constructor.

flow_metadata dict[str, Any] | None

Optional. Metadata stored in the flow config. Not passed to the agent.

type Literal['agent']

Type needed to differentiate solvers and agents in solver lists.

FlowJob

Configuration for a flow job.

class FlowJob(BaseModel, extra="forbid")

Attributes

includes Sequence[str | FlowInclude] | None

List of other flow configs to include.

log_dir str | None

Output path for logging results (required to ensure that a unique storage scope is assigned). Must be set before running the flow job. If a relative path, it will be resolved relative to the most recent config file loaded with ‘load_job’ or the current working directory if ‘load_job’ was not used.

log_dir_create_unique bool | None

If True, create a new log directory by appending an _ and numeric suffix if the specified log_dir already exists. If the directory exists and has a _numeric suffix, that suffix will be incremented. If False, use the existing log_dir (which must be empty or have log_dir_allow_dirty=True). Defaults to False.

python_version str | None

Python version to use in the flow virtual environment (e.g. ‘3.11’)

options FlowOptions | None

Arguments for calls to eval_set.

dependencies list[str] | None

Dependencies to pip install. E.g. PyPI package specifiers or Git repository URLs.

env dict[str, str] | None

Environment variables to set when running tasks.

defaults FlowDefaults | None

Default values for Inspect objects.

flow_metadata dict[str, Any] | None

Optional. Metadata stored in the flow config. Not passed to the model.

tasks Sequence[str | FlowTask] | None

Tasks to run
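The suffix rule described for log_dir_create_unique can be sketched with the standard library alone. This is a minimal illustration, not inspect_flow's implementation: the helper name next_unique_log_dir is hypothetical, a set of existing paths stands in for the filesystem, and starting the numbering at _1 is an assumption.

```python
import re


def next_unique_log_dir(log_dir: str, existing: set[str]) -> str:
    """Return log_dir, or a numerically suffixed variant if it is already taken.

    Hypothetical helper: `existing` stands in for directories on disk.
    """
    if log_dir not in existing:
        return log_dir
    # Split off an existing "_<number>" suffix so it can be incremented.
    match = re.fullmatch(r"(.*)_(\d+)", log_dir)
    if match:
        base, n = match.group(1), int(match.group(2))
    else:
        base, n = log_dir, 0  # assumption: numbering starts at _1
    candidate = f"{base}_{n + 1}"
    while candidate in existing:
        n += 1
        candidate = f"{base}_{n + 1}"
    return candidate


print(next_unique_log_dir("logs/run", {"logs/run"}))
print(next_unique_log_dir("logs/run_3", {"logs/run_3"}))
```

With log_dir_create_unique left at False, no renaming of this kind applies and the existing directory is used as-is.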

FlowDefaults

Default field values for Inspect objects. Will be overridden by more specific settings.

class FlowDefaults(BaseModel, extra="forbid")

Attributes

config FlowGenerateConfig | None

Default model generation options. Will be overridden by settings on the FlowModel and FlowTask.

agent FlowAgent | None

Field defaults for agents.

agent_prefix dict[str, FlowAgent] | None

Agent defaults for agent name prefixes. E.g. {‘inspect/’: FAgent(…)}

model FlowModel | None

Field defaults for models.

model_prefix dict[str, FlowModel] | None

Model defaults for model name prefixes. E.g. {‘openai/’: FModel(…)}

solver FlowSolver | None

Field defaults for solvers.

solver_prefix dict[str, FlowSolver] | None

Solver defaults for solver name prefixes. E.g. {‘inspect/’: FSolver(…)}

task FlowTask | None

Field defaults for tasks.

task_prefix dict[str, FlowTask] | None

Task defaults for task name prefixes. E.g. {‘inspect_evals/’: FTask(…)}
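One way to picture how general defaults, prefix defaults, and explicit settings could layer is the plain-dict sketch below. It is an illustration under stated assumptions, not the library's resolution code: the function resolve_fields is hypothetical, prefix matching is assumed to be a simple str.startswith test, longer prefixes are assumed to win over shorter ones, and None is assumed to mean "not set".

```python
from typing import Any


def resolve_fields(
    name: str,
    explicit: dict[str, Any],
    default: dict[str, Any],
    prefix_defaults: dict[str, dict[str, Any]],
) -> dict[str, Any]:
    """Layer defaults from least to most specific (hypothetical helper)."""
    resolved = dict(default)
    # Assumption: apply shorter prefixes first so longer ones override them.
    for prefix in sorted(prefix_defaults, key=len):
        if name.startswith(prefix):
            resolved.update(prefix_defaults[prefix])
    # Explicit settings on the object itself are the most specific layer.
    resolved.update({k: v for k, v in explicit.items() if v is not None})
    return resolved


model = resolve_fields(
    name="openai/gpt-4o",
    explicit={"base_url": None, "memoize": False},
    default={"memoize": True},
    # Placeholder URL, used only for illustration.
    prefix_defaults={"openai/": {"base_url": "https://example.invalid/v1"}},
)
print(model)
```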

FlowEpochs

Configuration for task epochs.

Number of epochs to repeat samples over and, optionally, one or more reducers used to combine scores from samples across epochs. If no reducer is specified, the “mean” score reducer is used.

class FlowEpochs(BaseModel)

Attributes

epochs int

Number of epochs.

reducer str | list[str] | None

One or more reducers used to combine scores from samples across epochs (defaults to “mean”)
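To make the role of a reducer concrete, the sketch below combines per-epoch scores for each sample into a single value. It uses only the standard library; the function reduce_epochs and the sample IDs are hypothetical, and scores are assumed to be plain floats.

```python
from statistics import mean
from typing import Callable, Iterable


def reduce_epochs(
    scores_by_sample: dict[str, list[float]],
    reducer: Callable[[Iterable[float]], float] = mean,
) -> dict[str, float]:
    """Collapse each sample's per-epoch scores into one score."""
    return {sid: reducer(scores) for sid, scores in scores_by_sample.items()}


# Four epochs per sample (illustrative values).
scores = {
    "sample-1": [1.0, 0.0, 1.0, 1.0],
    "sample-2": [0.0, 0.0, 1.0, 0.0],
}
reduced = reduce_epochs(scores)          # default "mean"-style reduction
best_of = reduce_epochs(scores, max)     # a "max"-style reducer for comparison
print(reduced, best_of)
```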

FlowGenerateConfig

Model generation options.

class FlowGenerateConfig(GenerateConfig, extra="forbid")

FlowModel

Configuration for a Model.

class FlowModel(BaseModel, extra="forbid")

Attributes

name str | None

Name of the model to use. Required to be set by the time the model is created.

role str | None

Optional named role for model (e.g. for roles specified at the task or eval level). Provide a default as a fallback in the case where the role hasn’t been externally specified.

default str | None

Optional. Fallback model in case the specified model or role is not found. Should be a fully qualified model name (e.g. openai/gpt-4o).

config FlowGenerateConfig | None

Configuration for model. Config values will override settings on the FlowTask and FlowJob.

base_url str | None

Optional. Alternate base URL for model.

api_key str | None

Optional. API key for model.

memoize bool | None

Use/store a cached version of the model based on the parameters to get_model(). Defaults to True.

model_args CreateArgs | None

Additional args to pass to model constructor.

flow_metadata dict[str, Any] | None

Optional. Metadata stored in the flow config. Not passed to the model.
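The precedence stated for config (job-level defaults are overridden by the FlowTask, which is overridden by the FlowModel) can be pictured as layered dictionaries. This is a sketch with plain dicts rather than FlowGenerateConfig objects; the function effective_config is hypothetical, and treating None as "not set" is an assumption.

```python
from typing import Any


def effective_config(
    job_defaults: dict[str, Any],
    task_config: dict[str, Any],
    model_config: dict[str, Any],
) -> dict[str, Any]:
    """Apply layers from lowest to highest precedence."""
    merged: dict[str, Any] = {}
    for layer in (job_defaults, task_config, model_config):
        # Assumption: None means the field was left unset in that layer.
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


config = effective_config(
    job_defaults={"temperature": 0.0, "max_tokens": 512},
    task_config={"temperature": 0.5},
    model_config={"max_tokens": 1024},
)
print(config)
```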

FlowOptions

Evaluation options.

class FlowOptions(BaseModel, extra="forbid")

Attributes

retry_attempts int | None

Maximum number of retry attempts before giving up (defaults to 10).

retry_wait float | None

Time in seconds to wait between attempts, increased exponentially (defaults to 30, resulting in waits of 30, 60, 120, 240, etc.). The wait before any single retry never exceeds 1 hour.

retry_connections float | None

Reduce max_connections at this rate with each retry (defaults to 1.0, which results in no reduction).

retry_cleanup bool | None

Cleanup failed log files after retries (defaults to True).

sandbox SandboxEnvironmentType | None

Sandbox environment type (or optionally a str or tuple with a shorthand spec).

sandbox_cleanup bool | None

Cleanup sandbox environments after task completes (defaults to True).

tags list[str] | None

Tags to associate with this evaluation run.

metadata dict[str, Any] | None

Metadata to associate with this evaluation run.

trace bool | None

Trace message interactions with evaluated model to terminal.

display DisplayType | None

Task display type (defaults to ‘full’).

approval str | ApprovalPolicyConfig | None

Tool use approval policies. Either a path to an approval policy config file or a list of approval policies. Defaults to no approval policy.

score bool | None

Score output (defaults to True).

log_level str | None

Level for logging to the console: “debug”, “http”, “sandbox”, “info”, “warning”, “error”, “critical”, or “notset” (defaults to “warning”).

log_level_transcript str | None

Level for logging to the log file (defaults to “info”).

log_format Literal['eval', 'json'] | None

Format for writing log files (defaults to “eval”, the native high-performance format).

limit int | None

Limit evaluated samples (defaults to all samples).

sample_shuffle bool | int | None

Shuffle order of samples (pass a seed to make the order deterministic).

fail_on_error bool | float | None

True to fail on first sample error (default); False to never fail on sample errors; Value between 0 and 1 to fail if a proportion of total samples fails. Value greater than 1 to fail eval if a count of samples fails.

continue_on_fail bool | None

True to continue running and only fail at the end if the fail_on_error condition is met. False to fail eval immediately when the fail_on_error condition is met (default).

retry_on_error int | None

Number of times to retry samples if they encounter errors (defaults to 3).

debug_errors bool | None

Raise task errors (rather than logging them) so they can be debugged (defaults to False).

max_samples int | None

Maximum number of samples to run in parallel (default is max_connections).

max_tasks int | None

Maximum number of tasks to run in parallel (default is 10).

max_subprocesses int | None

Maximum number of subprocesses to run in parallel (default is os.cpu_count()).

max_sandboxes int | None

Maximum number of sandboxes (per-provider) to run in parallel.

log_samples bool | None

Log detailed samples and scores (defaults to True).

log_realtime bool | None

Log events in realtime (enables live viewing of samples in inspect view) (defaults to True).

log_images bool | None

Log base64 encoded version of images, even if specified as a filename or URL (defaults to False).

log_buffer int | None

Number of samples to buffer before writing log file. If not specified, an appropriate default for the format and filesystem is chosen (10 in most cases, 100 for JSON logs on remote filesystems).

log_shared bool | int | None

Sync sample events to log directory so that users on other systems can see log updates in realtime (defaults to no syncing). Specify True to sync every 10 seconds, otherwise an integer to sync every n seconds.

bundle_dir str | None

If specified, the log viewer and logs generated by this eval set will be bundled into this directory.

bundle_overwrite bool | None

Whether to overwrite files in the bundle_dir (defaults to False).

log_dir_allow_dirty bool | None

If True, allow the log directory to contain unrelated logs. If False, ensure that the log directory only contains logs for tasks in this eval set (defaults to False).

eval_set_id str | None

ID for the eval set. If not specified, a unique ID will be generated.

bundle_url_map dict[str, str] | None

Replacements applied to bundle_dir to generate a URL. If provided and bundle_dir is set, the mapped URL will be written to stdout.
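The retry schedule described for retry_wait and retry_attempts (doubling waits capped at one hour) can be reproduced with a few lines of arithmetic. The sketch below is an illustration only: the function retry_waits is hypothetical, and the assumption is that the wait doubles on every attempt and that values are in seconds.

```python
def retry_waits(
    retry_wait: float = 30.0,
    retry_attempts: int = 10,
    max_wait: float = 3600.0,
) -> list[float]:
    """Return the wait before each retry, doubling each time up to max_wait."""
    return [min(retry_wait * 2**i, max_wait) for i in range(retry_attempts)]


waits = retry_waits()
print(waits)
```

With the documented defaults, the cap is reached on the eighth wait, since 30 * 2**7 is 3840 seconds.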

FlowSolver

Configuration for a Solver.

class FlowSolver(BaseModel, extra="forbid")

Attributes

name str | None

Name of the solver. Required to be set by the time the solver is created.

args CreateArgs | None

Additional args to pass to solver constructor.

flow_metadata dict[str, Any] | None

Optional. Metadata stored in the flow config. Not passed to the solver.

FlowTask

Configuration for an evaluation task.

Tasks are the basis for defining and running evaluations.

class FlowTask(BaseModel, extra="forbid")

Attributes

name str | None

Task name. Any of registry name (“inspect_evals/mbpp”), file name (“./my_task.py”), or a file name and attr (“./_name”). Required to be set by the time the task is created.

args CreateArgs | None

Additional args to pass to task constructor

solver str | FlowSolver | list[str | FlowSolver] | FlowAgent | None

Solver or list of solvers. Defaults to generate(), a normal call to the model.

model str | FlowModel | None

Default model for task (Optional, defaults to eval model).

config FlowGenerateConfig | None

Model generation config for default model (does not apply to model roles). Will override config settings on the FlowJob. Will be overridden by settings on the FlowModel.

model_roles ModelRolesConfig | None

Named roles for use in get_model().

sandbox SandboxEnvironmentType | None

Sandbox environment type (or optionally a str or tuple with a shorthand spec)

approval str | ApprovalPolicyConfig | None

Tool use approval policies. Either a path to an approval policy config file or an approval policy config. Defaults to no approval policy.

epochs int | FlowEpochs | None

Epochs to repeat samples for and optional score reducer function(s) used to combine sample scores (defaults to “mean”)

fail_on_error bool | float | None

True to fail on first sample error (default); False to never fail on sample errors; Value between 0 and 1 to fail if a proportion of total samples fails. Value greater than 1 to fail eval if a count of samples fails.

continue_on_fail bool | None

True to continue running and only fail at the end if the fail_on_error condition is met. False to fail eval immediately when the fail_on_error condition is met (default).

message_limit int | None

Limit on total messages used for each sample.

token_limit int | None

Limit on total tokens used for each sample.

time_limit int | None

Limit on clock time (in seconds) for samples.

working_limit int | None

Limit on working time (in seconds) for sample. Working time includes model generation, tool calls, etc. but does not include time spent waiting on retries or shared resources.

version int | str | None

Version of task (to distinguish evolutions of the task spec or breaking changes to it)

metadata dict[str, Any] | None

Additional metadata to associate with the task.

sample_id str | int | list[str | int] | None

Evaluate specific sample(s) from the dataset.

flow_metadata dict[str, Any] | None

Optional. Metadata stored in the flow config. Not passed to the task.

model_name str | None

Get the model name from the model field.

Returns: The model name if set, otherwise None.

Functions

agents_matrix

Create a list of agents from the product of lists of field values.

def agents_matrix(
    *,
    agent: str | FlowAgent | Sequence[str | FlowAgent],
    **kwargs: Unpack[FlowAgentMatrixDict],
) -> list[FlowAgent]

agent str | FlowAgent | Sequence[str | FlowAgent]

The agent or list of agents to matrix.

**kwargs Unpack[FlowAgentMatrixDict]

The lists of field values to matrix.
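The "product of lists of field values" behind the *_matrix functions can be illustrated with itertools.product over plain dictionaries. This is a conceptual sketch, not a call into inspect_flow: the function matrix, the agent names, and the args values are all hypothetical.

```python
from itertools import product
from typing import Any


def matrix(base: list[dict[str, Any]], **field_values: list[Any]) -> list[dict[str, Any]]:
    """Combine every base item with every combination of the given field values."""
    names = list(field_values)
    combos = list(product(*field_values.values()))
    return [
        {**item, **dict(zip(names, combo))}
        for item in base
        for combo in combos
    ]


agents = matrix(
    [{"name": "react"}, {"name": "basic"}],       # hypothetical agent names
    args=[{"attempts": 1}, {"attempts": 3}],      # hypothetical constructor args
)
print(len(agents))
```

Two base items crossed with two values for one field give four results; each additional field multiplies the count by the length of its list.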

agents_with

Set fields on a list of agents.

def agents_with(
    *,
    agent: str | FlowAgent | Sequence[str | FlowAgent],
    **kwargs: Unpack[FlowAgentDict],
) -> list[FlowAgent]

agent str | FlowAgent | Sequence[str | FlowAgent]

The agent or list of agents to set fields on.

**kwargs Unpack[FlowAgentDict]

The fields to set on each agent.

configs_matrix

Create a list of generate configs from the product of lists of field values.

def configs_matrix(
    *,
    config: FlowGenerateConfig | Sequence[FlowGenerateConfig] | None = None,
    **kwargs: Unpack[FlowGenerateConfigMatrixDict],
) -> list[FlowGenerateConfig]

config FlowGenerateConfig | Sequence[FlowGenerateConfig] | None

The config or list of configs to matrix.

**kwargs Unpack[FlowGenerateConfigMatrixDict]

The lists of field values to matrix.

configs_with

Set fields on a list of generate configs.

def configs_with(
    *,
    config: FlowGenerateConfig | Sequence[FlowGenerateConfig],
    **kwargs: Unpack[FlowGenerateConfigDict],
) -> list[FlowGenerateConfig]

config FlowGenerateConfig | Sequence[FlowGenerateConfig]

The config or list of configs to set fields on.

**kwargs Unpack[FlowGenerateConfigDict]

The fields to set on each config.

merge

Merge two flow objects.

def merge(base: _T, add: _T) -> _T

base _T

The base object.

add _T

The object to merge into the base. Values in this object will override those in the base.
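The override rule for merge (values in add win over values in base) can be pictured with nested dictionaries. The sketch below is not the library's merge: the function merge_dicts is hypothetical, and both the recursive handling of nested dicts and the treatment of None as "not set" are assumptions.

```python
from typing import Any


def merge_dicts(base: dict[str, Any], add: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where values from `add` override those in `base`."""
    merged = dict(base)
    for key, value in add.items():
        if value is None:
            continue  # assumption: None means "not set", so base is kept
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dicts(merged[key], value)  # merge nested fields
        else:
            merged[key] = value
    return merged


task = merge_dicts(
    {"name": "my_task", "config": {"temperature": 0.0, "max_tokens": 256}},
    {"name": None, "config": {"temperature": 0.7}},
)
print(task)
```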

models_matrix

Create a list of models from the product of lists of field values.

def models_matrix(
    *,
    model: str | FlowModel | Sequence[str | FlowModel],
    **kwargs: Unpack[FlowModelMatrixDict],
) -> list[FlowModel]

model str | FlowModel | Sequence[str | FlowModel]

The model or list of models to matrix.

**kwargs Unpack[FlowModelMatrixDict]

The lists of field values to matrix.

models_with

Set fields on a list of models.

def models_with(
    *,
    model: str | FlowModel | Sequence[str | FlowModel],
    **kwargs: Unpack[FlowModelDict],
) -> list[FlowModel]

model str | FlowModel | Sequence[str | FlowModel]

The model or list of models to set fields on.

**kwargs Unpack[FlowModelDict]

The fields to set on each model.

solvers_matrix

Create a list of solvers from the product of lists of field values.

def solvers_matrix(
    *,
    solver: str | FlowSolver | Sequence[str | FlowSolver],
    **kwargs: Unpack[FlowSolverMatrixDict],
) -> list[FlowSolver]

solver str | FlowSolver | Sequence[str | FlowSolver]

The solver or list of solvers to matrix.

**kwargs Unpack[FlowSolverMatrixDict]

The lists of field values to matrix.

solvers_with

Set fields on a list of solvers.

def solvers_with(
    *,
    solver: str | FlowSolver | Sequence[str | FlowSolver],
    **kwargs: Unpack[FlowSolverDict],
) -> list[FlowSolver]

solver str | FlowSolver | Sequence[str | FlowSolver]

The solver or list of solvers to set fields on.

**kwargs Unpack[FlowSolverDict]

The fields to set on each solver.

tasks_matrix

Create a list of tasks from the product of lists of field values.

def tasks_matrix(
    *,
    task: str | FlowTask | Sequence[str | FlowTask],
    **kwargs: Unpack[FlowTaskMatrixDict],
) -> list[FlowTask]

task str | FlowTask | Sequence[str | FlowTask]

The task or list of tasks to matrix.

**kwargs Unpack[FlowTaskMatrixDict]

The lists of field values to matrix.

tasks_with

Set fields on a list of tasks.

def tasks_with(
    *,
    task: str | FlowTask | Sequence[str | FlowTask],
    **kwargs: Unpack[FlowTaskDict],
) -> list[FlowTask]

task str | FlowTask | Sequence[str | FlowTask]

The task or list of tasks to set fields on.

**kwargs Unpack[FlowTaskDict]

The fields to set on each task.