inspect_flow

Types

FlowAgent

Configuration for an Agent.

class FlowAgent(BaseModel, extra="forbid")

Attributes

name str | None: Name of the agent. Required to be set by the time the agent is created.
args CreateArgs | None: Additional args to pass to agent constructor.
flow_metadata dict[str, Any] | None: Optional. Metadata stored in the flow config. Not passed to the agent.
type Literal['agent']: Type needed to differentiate solvers and agents in solver lists.

FlowJob

Configuration for a flow job.

Source

class FlowJob(BaseModel, extra="forbid")

Attributes

includes Sequence[str | FlowInclude] | None: List of other flow configs to include.
log_dir str | None: Output path for logging results (required to ensure that a unique storage scope is assigned). Must be set before running the flow job. If a relative path, it will be resolved relative to the most recent config file loaded with ‘load_job’ or the current working directory if ‘load_job’ was not used.
log_dir_create_unique bool | None: If True, create a new log directory by appending an _ and numeric suffix if the specified log_dir already exists. If the directory exists and has a _numeric suffix, that suffix will be incremented. If False, use the existing log_dir (which must be empty or have log_dir_allow_dirty=True). Defaults to False.
python_version str | None: Python version to use in the flow virtual environment (e.g. ‘3.11’)
options FlowOptions | None: Arguments for calls to eval_set.
dependencies list[str] | None: Dependencies to pip install. E.g. PyPI package specifiers or Git repository URLs.
env dict[str, str] | None: Environment variables to set when running tasks.
defaults FlowDefaults | None: Default values for Inspect objects.
flow_metadata dict[str, Any] | None: Optional. Metadata stored in the flow config. Not passed to the model.
tasks Sequence[str | FlowTask] | None: Tasks to run
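The suffix rule described for log_dir_create_unique can be sketched with a small standalone function. This is only an illustration, not the library's implementation: the helper name next_unique_log_dir is hypothetical, existing directories are modelled as a set of strings rather than read from disk, and starting the numbering at _1 is an assumption.

```python
import re


def next_unique_log_dir(log_dir: str, existing: set[str]) -> str:
    """Return log_dir, or a numerically suffixed variant not in `existing`.

    Hypothetical helper; the first suffix (_1) is an assumption.
    """
    if log_dir not in existing:
        return log_dir
    # If the name already ends in _<number>, increment that number.
    match = re.fullmatch(r"(.*)_(\d+)", log_dir)
    if match:
        stem, number = match.group(1), int(match.group(2))
    else:
        stem, number = log_dir, 0
    candidate = f"{stem}_{number + 1}"
    # Keep incrementing until the candidate name is free.
    while candidate in existing:
        number += 1
        candidate = f"{stem}_{number + 1}"
    return candidate


fresh = next_unique_log_dir("logs/run", set())
first_suffix = next_unique_log_dir("logs/run", {"logs/run"})
incremented = next_unique_log_dir("logs/run_3", {"logs/run_3"})
```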

FlowDefaults

Default field values for Inspect objects. Will be overridden by more specific settings.

Source

class FlowDefaults(BaseModel, extra="forbid")

Attributes

config FlowGenerateConfig | None: Default model generation options. Will be overridden by settings on the FlowModel and FlowTask.
agent FlowAgent | None: Field defaults for agents.
agent_prefix dict[str, FlowAgent] | None: Agent defaults for agent name prefixes. E.g. {‘inspect/’: FAgent(…)}
model FlowModel | None: Field defaults for models.
model_prefix dict[str, FlowModel] | None: Model defaults for model name prefixes. E.g. {‘openai/’: FModel(…)}
solver FlowSolver | None: Field defaults for solvers.
solver_prefix dict[str, FlowSolver] | None: Solver defaults for solver name prefixes. E.g. {‘inspect/’: FSolver(…)}
task FlowTask | None: Field defaults for tasks.
task_prefix dict[str, FlowTask] | None: Task defaults for task name prefixes. E.g. {‘inspect_evals/’: FTask(…)}
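One way to picture how general defaults, prefix defaults, and explicit settings might layer is the sketch below. It uses plain dictionaries in place of FlowModel objects, and several points are assumptions rather than documented behaviour: the helper resolve_fields is hypothetical, None is treated as "unset", longer prefixes are applied after shorter ones, and the example URL is a placeholder.

```python
from typing import Any

Fields = dict[str, Any]


def resolve_fields(
    name: str, explicit: Fields, general: Fields, by_prefix: dict[str, Fields]
) -> Fields:
    """Layer defaults: general < matching prefixes < explicit (assumed order)."""

    def set_values(layer: Fields) -> Fields:
        # Treat None as "not set" so it never overrides a lower layer.
        return {k: v for k, v in layer.items() if v is not None}

    resolved: Fields = set_values(general)
    # Apply shorter prefixes first so that longer (more specific) ones win.
    for prefix in sorted(by_prefix, key=len):
        if name.startswith(prefix):
            resolved.update(set_values(by_prefix[prefix]))
    resolved.update(set_values(explicit))
    resolved["name"] = name
    return resolved


general = {"memoize": True}
by_prefix = {"openai/": {"base_url": "https://example.invalid/v1"}}
matched = resolve_fields("openai/gpt-4o", {"memoize": False}, general, by_prefix)
unmatched = resolve_fields("other/model", {}, general, by_prefix)
```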

FlowEpochs

Configuration for task epochs.

Number of epochs to repeat samples over and optionally one or more reducers used to combine scores from samples across epochs. If not specified the “mean” score reducer is used.

Source

class FlowEpochs(BaseModel)

Attributes

epochs int: Number of epochs.
reducer str | list[str] | None: One or more reducers used to combine scores from samples across epochs (defaults to “mean”)
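As a rough illustration of what the default "mean" reducer does conceptually, the sketch below averages each sample's scores across epochs. The sample ids and scores are made-up illustrative values, and the dictionary layout is an assumption chosen for the example, not the library's data model.

```python
from statistics import mean

# Scores for each sample over four epochs (illustrative values only).
scores_by_epoch = {
    "sample-1": [1.0, 0.0, 1.0, 1.0],
    "sample-2": [0.0, 0.0, 1.0, 0.0],
}

# A "mean" reducer collapses the per-epoch scores into one score per sample.
reduced = {sample_id: mean(scores) for sample_id, scores in scores_by_epoch.items()}
```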

FlowGenerateConfig

Model generation options.

Source

class FlowGenerateConfig(GenerateConfig, extra="forbid")

FlowModel

Configuration for a Model.

Source

class FlowModel(BaseModel, extra="forbid")

Attributes

name str | None: Name of the model to use. Required to be set by the time the model is created.
role str | None: Optional named role for model (e.g. for roles specified at the task or eval level). Provide a default as a fallback in the case where the role hasn’t been externally specified.
default str | None: Optional. Fallback model in case the specified model or role is not found. Should be a fully qualified model name (e.g. openai/gpt-4o).
config FlowGenerateConfig | None: Configuration for model. Config values will override settings on the FlowTask and FlowJob.
base_url str | None: Optional. Alternate base URL for model.
api_key str | None: Optional. API key for model.
memoize bool | None: Use/store a cached version of the model based on the parameters to get_model(). Defaults to True.
model_args CreateArgs | None: Additional args to pass to model constructor.
flow_metadata dict[str, Any] | None: Optional. Metadata stored in the flow config. Not passed to the model.
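The config descriptions above imply a precedence order: job-level defaults are overridden by the task config, which is in turn overridden by the model config. A minimal sketch of that layering, using plain dictionaries as stand-ins for FlowGenerateConfig and assuming that None means "not set" (the helper effective_config is hypothetical):

```python
from typing import Any


def effective_config(
    defaults: dict[str, Any], task: dict[str, Any], model: dict[str, Any]
) -> dict[str, Any]:
    """Merge config layers; later layers override earlier ones."""
    merged: dict[str, Any] = {}
    for layer in (defaults, task, model):
        # Skip None values so an unset field does not erase a lower layer.
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


config = effective_config(
    defaults={"temperature": 0.0, "max_tokens": 1024},
    task={"temperature": 0.5},
    model={"max_tokens": 2048, "temperature": None},
)
```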

FlowOptions

Evaluation options.

Source

class FlowOptions(BaseModel, extra="forbid")

Attributes

retry_attempts int | None: Maximum number of retry attempts before giving up (defaults to 10).
retry_wait float | None: Time to wait between attempts, increased exponentially (defaults to 30, resulting in waits of 30, 60, 120, 240, etc.). Wait time per-retry will in no case be longer than 1 hour.
retry_connections float | None: Reduce max_connections at this rate with each retry (defaults to 1.0, which results in no reduction).
retry_cleanup bool | None: Cleanup failed log files after retries (defaults to True).
sandbox SandboxEnvironmentType | None: Sandbox environment type (or optionally a str or tuple with a shorthand spec).
sandbox_cleanup bool | None: Cleanup sandbox environments after task completes (defaults to True).
tags list[str] | None: Tags to associate with this evaluation run.
metadata dict[str, Any] | None: Metadata to associate with this evaluation run.
trace bool | None: Trace message interactions with evaluated model to terminal.
display DisplayType | None: Task display type (defaults to ‘full’).
approval str | ApprovalPolicyConfig | None: Tool use approval policies. Either a path to an approval policy config file or a list of approval policies. Defaults to no approval policy.
score bool | None: Score output (defaults to True).
log_level str | None: Level for logging to the console: “debug”, “http”, “sandbox”, “info”, “warning”, “error”, “critical”, or “notset” (defaults to “warning”).
log_level_transcript str | None: Level for logging to the log file (defaults to “info”).
log_format Literal['eval', 'json'] | None: Format for writing log files (defaults to “eval”, the native high-performance format).
limit int | None: Limit evaluated samples (defaults to all samples).
sample_shuffle bool | int | None: Shuffle order of samples (pass a seed to make the order deterministic).
fail_on_error bool | float | None: True to fail on first sample error (default); False to never fail on sample errors; Value between 0 and 1 to fail if a proportion of total samples fails. Value greater than 1 to fail eval if a count of samples fails.
continue_on_fail bool | None: True to continue running and only fail at the end if the fail_on_error condition is met. False to fail eval immediately when the fail_on_error condition is met (default).
retry_on_error int | None: Number of times to retry samples if they encounter errors (defaults to 3).
debug_errors bool | None: Raise task errors (rather than logging them) so they can be debugged (defaults to False).
max_samples int | None: Maximum number of samples to run in parallel (default is max_connections).
max_tasks int | None: Maximum number of tasks to run in parallel (default is 10).
max_subprocesses int | None: Maximum number of subprocesses to run in parallel (default is os.cpu_count()).
max_sandboxes int | None: Maximum number of sandboxes (per-provider) to run in parallel.
log_samples bool | None: Log detailed samples and scores (defaults to True).
log_realtime bool | None: Log events in realtime (enables live viewing of samples in inspect view) (defaults to True).
log_images bool | None: Log base64 encoded version of images, even if specified as a filename or URL (defaults to False).
log_buffer int | None: Number of samples to buffer before writing log file. If not specified, an appropriate default for the format and filesystem is chosen (10 in most cases, 100 for JSON logs on remote filesystems).
log_shared bool | int | None: Sync sample events to log directory so that users on other systems can see log updates in realtime (defaults to no syncing). Specify True to sync every 10 seconds, otherwise an integer to sync every n seconds.
bundle_dir str | None: If specified, the log viewer and logs generated by this eval set will be bundled into this directory.
bundle_overwrite bool | None: Whether to overwrite files in the bundle_dir (defaults to False).
log_dir_allow_dirty bool | None: If True, allow the log directory to contain unrelated logs. If False, ensure that the log directory only contains logs for tasks in this eval set (defaults to False).
eval_set_id str | None: ID for the eval set. If not specified, a unique ID will be generated.
bundle_url_map dict[str, str] | None: Replacements applied to bundle_dir to generate a URL. If provided and bundle_dir is set, the mapped URL will be written to stdout.
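The retry schedule described for retry_wait (30, 60, 120, 240, and so on, capped at 1 hour) can be reproduced with a short calculation. This is a sketch of the arithmetic only; the function retry_waits is hypothetical, and doubling on every attempt is inferred from the documented sequence.

```python
def retry_waits(
    retry_attempts: int = 10, retry_wait: float = 30.0, cap: float = 3600.0
) -> list[float]:
    """Wait (in seconds) before each retry: doubles each time, capped at `cap`."""
    return [min(retry_wait * 2**attempt, cap) for attempt in range(retry_attempts)]


waits = retry_waits()
```

With the documented defaults, the eighth and later waits hit the 1 hour cap.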

FlowSolver

Configuration for a Solver.

Source

class FlowSolver(BaseModel, extra="forbid")

Attributes

name str | None: Name of the solver. Required to be set by the time the solver is created.
args CreateArgs | None: Additional args to pass to solver constructor.
flow_metadata dict[str, Any] | None: Optional. Metadata stored in the flow config. Not passed to the solver.

FlowTask

Configuration for an evaluation task.

Tasks are the basis for defining and running evaluations.

Source

class FlowTask(BaseModel, extra="forbid")

Attributes

name str | None

Task name. Any of registry name (“inspect_evals/mbpp”), file name (“./my_task.py”), or a file name and attr (“./my_task.py@task_name”). Required to be set by the time the task is created.

args CreateArgs | None

Additional args to pass to task constructor

solver

Solver or list of solvers. Defaults to generate(), a normal call to the model.

model str | FlowModel | None

Default model for task (Optional, defaults to eval model).

config FlowGenerateConfig | None

Model generation config for default model (does not apply to model roles). Will override config settings on the FlowJob. Will be overridden by settings on the FlowModel.

model_roles ModelRolesConfig | None

Named roles for use in get_model().

sandbox SandboxEnvironmentType | None

Sandbox environment type (or optionally a str or tuple with a shorthand spec)

approval str | ApprovalPolicyConfig | None

Tool use approval policies. Either a path to an approval policy config file or an approval policy config. Defaults to no approval policy.

epochs int | FlowEpochs | None

Epochs to repeat samples for and optional score reducer function(s) used to combine sample scores (defaults to “mean”)

fail_on_error bool | float | None

True to fail on first sample error (default); False to never fail on sample errors; Value between 0 and 1 to fail if a proportion of total samples fails. Value greater than 1 to fail eval if a count of samples fails.

continue_on_fail bool | None

True to continue running and only fail at the end if the fail_on_error condition is met. False to fail eval immediately when the fail_on_error condition is met (default).

message_limit int | None

Limit on total messages used for each sample.

token_limit int | None

Limit on total tokens used for each sample.

time_limit int | None

Limit on clock time (in seconds) for samples.

working_limit int | None

Limit on working time (in seconds) for each sample. Working time includes model generation, tool calls, etc. but does not include time spent waiting on retries or shared resources.

version int | str | None

Version of task (to distinguish evolutions of the task spec or breaking changes to it)

metadata dict[str, Any] | None

Additional metadata to associate with the task.

sample_id str | int | list[str | int] | None

Evaluate specific sample(s) from the dataset.

flow_metadata dict[str, Any] | None

Optional. Metadata stored in the flow config. Not passed to the task.

model_name str | None

Get the model name from the model field.

Returns: The model name if set, otherwise None.

Functions

agents_matrix

Create a list of agents from the product of lists of field values.

Source

def agents_matrix(
    *,
    agent: str | FlowAgent | Sequence[str | FlowAgent],
    **kwargs: Unpack[FlowAgentMatrixDict],
) -> list[FlowAgent]

agent str | FlowAgent | Sequence[str | FlowAgent]: The agent or list of agents to matrix.
**kwargs Unpack[FlowAgentMatrixDict]: The lists of field values to matrix.
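The "product of lists of field values" idea can be sketched with itertools.product. Plain dictionaries stand in for FlowAgent objects here, the helper matrix is hypothetical, and the agent names and the "attempts" argument are made up for illustration; how the real function treats a field that is already set on the base object is not covered.

```python
from itertools import product
from typing import Any


def matrix(bases: list[dict[str, Any]], **field_values: list[Any]) -> list[dict[str, Any]]:
    """Return one copy of each base per combination of the given field values."""
    names = list(field_values)
    combos = list(product(*(field_values[name] for name in names)))
    return [{**base, **dict(zip(names, combo))} for base in bases for combo in combos]


# Two base agents x two values for `args` gives four agent configurations.
agents = matrix(
    [{"name": "agent_a"}, {"name": "agent_b"}],
    args=[{"attempts": 1}, {"attempts": 3}],
)
```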

agents_with

Set fields on a list of agents.

Source

def agents_with(
    *,
    agent: str | FlowAgent | Sequence[str | FlowAgent],
    **kwargs: Unpack[FlowAgentDict],
) -> list[FlowAgent]

agent str | FlowAgent | Sequence[str | FlowAgent]: The agent or list of agents to set fields on.
**kwargs Unpack[FlowAgentDict]: The fields to set on each agent.

configs_matrix

Create a list of generate configs from the product of lists of field values.

Source

def configs_matrix(
    *,
    config: FlowGenerateConfig | Sequence[FlowGenerateConfig] | None = None,
    **kwargs: Unpack[FlowGenerateConfigMatrixDict],
) -> list[FlowGenerateConfig]

config FlowGenerateConfig | Sequence[FlowGenerateConfig] | None: The config or list of configs to matrix.
**kwargs Unpack[FlowGenerateConfigMatrixDict]: The lists of field values to matrix.

configs_with

Set fields on a list of generate configs.

Source

def configs_with(
    *,
    config: FlowGenerateConfig | Sequence[FlowGenerateConfig],
    **kwargs: Unpack[FlowGenerateConfigDict],
) -> list[FlowGenerateConfig]

config FlowGenerateConfig | Sequence[FlowGenerateConfig]: The config or list of configs to set fields on.
**kwargs Unpack[FlowGenerateConfigDict]: The fields to set on each config.

merge

Merge two flow objects.

Source

def merge(base: _T, add: _T) -> _T

base _T: The base object.
add _T: The object to merge into the base. Values in this object will override those in the base.
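The override behaviour of merge can be illustrated on plain dictionaries. This sketch makes two assumptions that the description above does not confirm: a None value in add is treated as "not set" and leaves the base value in place, and nested dictionaries are merged recursively. The helper merge_dicts and the example values are hypothetical.

```python
from typing import Any


def merge_dicts(base: dict[str, Any], add: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where values from `add` override those in `base`."""
    merged = dict(base)
    for key, value in add.items():
        if value is None:
            continue  # assumed: None means "not set", so keep the base value
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dicts(merged[key], value)  # assumed: recursive merge
        else:
            merged[key] = value
    return merged


base = {"name": "openai/gpt-4o", "config": {"temperature": 0.0, "max_tokens": 1024}}
add = {"name": None, "config": {"temperature": 0.7}}
merged = merge_dicts(base, add)
```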

models_matrix

Create a list of models from the product of lists of field values.

Source

def models_matrix(
    *,
    model: str | FlowModel | Sequence[str | FlowModel],
    **kwargs: Unpack[FlowModelMatrixDict],
) -> list[FlowModel]

model str | FlowModel | Sequence[str | FlowModel]: The model or list of models to matrix.
**kwargs Unpack[FlowModelMatrixDict]: The lists of field values to matrix.

models_with

Set fields on a list of models.

Source

def models_with(
    *,
    model: str | FlowModel | Sequence[str | FlowModel],
    **kwargs: Unpack[FlowModelDict],
) -> list[FlowModel]

model str | FlowModel | Sequence[str | FlowModel]: The model or list of models to set fields on.
**kwargs Unpack[FlowModelDict]: The fields to set on each model.

solvers_matrix

Create a list of solvers from the product of lists of field values.

Source

def solvers_matrix(
    *,
    solver: str | FlowSolver | Sequence[str | FlowSolver],
    **kwargs: Unpack[FlowSolverMatrixDict],
) -> list[FlowSolver]

solver str | FlowSolver | Sequence[str | FlowSolver]: The solver or list of solvers to matrix.
**kwargs Unpack[FlowSolverMatrixDict]: The lists of field values to matrix.

solvers_with

Set fields on a list of solvers.

Source

def solvers_with(
    *,
    solver: str | FlowSolver | Sequence[str | FlowSolver],
    **kwargs: Unpack[FlowSolverDict],
) -> list[FlowSolver]

solver str | FlowSolver | Sequence[str | FlowSolver]: The solver or list of solvers to set fields on.
**kwargs Unpack[FlowSolverDict]: The fields to set on each solver.

tasks_matrix

Create a list of tasks from the product of lists of field values.

Source

def tasks_matrix(
    *,
    task: str | FlowTask | Sequence[str | FlowTask],
    **kwargs: Unpack[FlowTaskMatrixDict],
) -> list[FlowTask]

task str | FlowTask | Sequence[str | FlowTask]: The task or list of tasks to matrix.
**kwargs Unpack[FlowTaskMatrixDict]: The lists of field values to matrix.

tasks_with

Set fields on a list of tasks.

Source

def tasks_with(
    *,
    task: str | FlowTask | Sequence[str | FlowTask],
    **kwargs: Unpack[FlowTaskDict],
) -> list[FlowTask]

task str | FlowTask | Sequence[str | FlowTask]: The task or list of tasks to set fields on.
**kwargs Unpack[FlowTaskDict]: The fields to set on each task.