inspect_flow
Types
FlowAgent
Configuration for an Agent.
class FlowAgent(FlowBase)

Attributes
name: str | None | NotGiven
  Name of the agent. Used to create the agent if the factory is not provided.
factory: FlowFactory[Agent] | Callable[..., Agent] | str | None | NotGiven
  Factory function to create the agent instance.
args: CreateArgs | None | NotGiven
  Additional args to pass to the agent constructor.
flow_metadata: dict[str, Any] | None | NotGiven
  Optional. Metadata stored in the flow config. Not passed to the agent.
type: Literal['agent'] | None
  Type needed to differentiate solvers and agents in solver lists.
FlowDefaults
Default field values for Inspect objects. Will be overridden by more specific settings.
class FlowDefaults(FlowBase)

Attributes
config: GenerateConfig | None | NotGiven
  Default model generation options. Will be overridden by settings on the FlowModel and FlowTask.
agent: FlowAgent | None | NotGiven
  Field defaults for agents.
agent_prefix: dict[str, FlowAgent] | None | NotGiven
  Agent defaults for agent name prefixes. E.g. {'inspect/': FAgent(...)}.
model: FlowModel | None | NotGiven
  Field defaults for models.
model_prefix: dict[str, FlowModel] | None | NotGiven
  Model defaults for model name prefixes. E.g. {'openai/': FModel(...)}.
solver: FlowSolver | None | NotGiven
  Field defaults for solvers.
solver_prefix: dict[str, FlowSolver] | None | NotGiven
  Solver defaults for solver name prefixes. E.g. {'inspect/': FSolver(...)}.
task: FlowTask | None | NotGiven
  Field defaults for tasks.
task_prefix: dict[str, FlowTask] | None | NotGiven
  Task defaults for task name prefixes. E.g. {'inspect_evals/': FTask(...)}.
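The prefix lookup described above can be sketched with plain dicts. This is an illustration only, not the library's implementation: the resolution order shown (longest matching prefix wins) is an assumption, and plain dicts stand in for the typed Flow* default objects.

```python
# Sketch: resolve defaults for a named object by longest matching key prefix.
# ASSUMPTION: longest-prefix-wins resolution; consult inspect_flow docs for
# the actual precedence rules.

def resolve_prefix_defaults(name, prefix_defaults):
    """Return the defaults whose key is the longest prefix of `name`."""
    best = None
    for prefix, defaults in prefix_defaults.items():
        if name.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, defaults)
    return best[1] if best else None

model_prefix = {
    "openai/": {"config": {"temperature": 0.0}},
    "openai/gpt-4o": {"config": {"temperature": 0.7}},
}

# The more specific "openai/gpt-4o" prefix wins over "openai/".
print(resolve_prefix_defaults("openai/gpt-4o-mini", model_prefix))
```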
FlowDependencies
Configuration for flow dependencies to install in the venv.
class FlowDependencies(FlowBase)

Attributes
dependency_file: Literal['auto', 'no_file'] | str | None | NotGiven
  Path to a dependency file (either requirements.txt or pyproject.toml) to use to create the virtual environment. If 'auto', will search the path starting from the same directory as the config file (when using the CLI) or the base_dir arg (when using the API), looking for pyproject.toml or requirements.txt files. If 'no_file', no dependency file will be used. Defaults to 'auto'.
additional_dependencies: str | Sequence[str] | None | NotGiven
  Dependencies to pip install. E.g. PyPI package specifiers or Git repository URLs.
auto_detect_dependencies: bool | None | NotGiven
  If True, automatically detect and install dependencies from names of objects in the config (defaults to True). For example, if a model name starts with 'openai/', the 'openai' package will be installed. If a task name is 'inspect_evals/mmlu', then the 'inspect-evals' package will be installed.
uv_sync_args: str | Sequence[str] | None | NotGiven
  Additional arguments to pass to uv sync when creating the virtual environment using a pyproject.toml file. May be a string ('--dev --extra test') or a list of strings (['--dev', '--extra', 'test']).
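The auto-detection rule for auto_detect_dependencies can be sketched as a mapping from object-name prefixes to pip packages. The two entries shown are the documented examples; the real detection table in inspect_flow is presumably larger.

```python
# Sketch of dependency auto-detection: object-name prefixes imply packages.
# Only the two documented examples are included here.

PREFIX_PACKAGES = {
    "openai/": "openai",                # model name prefix
    "inspect_evals/": "inspect-evals",  # task name prefix
}

def detect_packages(object_names):
    """Collect pip packages implied by config object names."""
    found = set()
    for name in object_names:
        for prefix, package in PREFIX_PACKAGES.items():
            if name.startswith(prefix):
                found.add(package)
    return sorted(found)

print(detect_packages(["openai/gpt-4o", "inspect_evals/mmlu"]))
# ['inspect-evals', 'openai']
```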
FlowEpochs
Configuration for task epochs.
Number of epochs to repeat samples over and optionally one or more reducers used to combine scores from samples across epochs. If not specified the “mean” score reducer is used.
class FlowEpochs(FlowBase)

Attributes
epochs: int
  Number of epochs.
reducer: str | Sequence[str] | None | NotGiven
  One or more reducers used to combine scores from samples across epochs (defaults to "mean").
FlowFactory
Type-checked factory wrapper for creating Inspect AI objects.
Wraps a factory callable with its arguments, binding them at construction time so that type errors are caught immediately rather than at evaluation time. Works with FlowTask, FlowAgent, FlowSolver, FlowScorer, and FlowModel.
class FlowFactory(BaseModel, Generic[R], arbitrary_types_allowed=True)

FlowModel
Configuration for a Model.
class FlowModel(FlowBase)

Attributes
name: str | None | NotGiven
  Name of the model to use. If factory is not provided, this is used to create the model.
factory: FlowFactory[Model] | Callable[..., Model] | str | None | NotGiven
  Factory function to create the model instance.
role: str | None | NotGiven
  Optional named role for model (e.g. for roles specified at the task or eval level). Provide a default as a fallback in the case where the role hasn't been externally specified.
default: str | None | NotGiven
  Optional. Fallback model in case the specified model or role is not found. Should be a fully qualified model name (e.g. openai/gpt-4o).
config: GenerateConfig | None | NotGiven
  Configuration for model. Config values will override settings on the FlowTask and FlowSpec.
base_url: str | None | NotGiven
  Optional. Alternate base URL for model.
api_key: str | None | NotGiven
  Optional. API key for model.
memoize: bool | None | NotGiven
  Use/store a cached version of the model based on the parameters to get_model(). Defaults to True.
model_args: CreateArgs | None | NotGiven
  Additional args to pass to the model constructor.
flow_metadata: dict[str, Any] | None | NotGiven
  Optional. Metadata stored in the flow config. Not passed to the model.
FlowOptions
Evaluation options.
class FlowOptions(FlowBase)

Attributes
retry_attempts: int | None | NotGiven
  Maximum number of retry attempts before giving up (defaults to 10).
retry_wait: float | None | NotGiven
  Time to wait between attempts, increased exponentially (defaults to 30, resulting in waits of 30, 60, 120, 240, etc.). Wait time per retry will in no case be longer than 1 hour.
retry_connections: float | None | NotGiven
  Reduce max_connections at this rate with each retry (defaults to 1.0, which results in no reduction).
retry_cleanup: bool | None | NotGiven
  Cleanup failed log files after retries (defaults to True).
sandbox: SandboxEnvironmentType | None | NotGiven
  Sandbox environment type (or optionally a str or tuple with a shorthand spec).
sandbox_cleanup: bool | None | NotGiven
  Cleanup sandbox environments after task completes (defaults to True).
tags: Sequence[str] | None | NotGiven
  Tags to associate with this evaluation run.
metadata: dict[str, Any] | None | NotGiven
  Metadata to associate with this evaluation run.
trace: bool | None | NotGiven
  Trace message interactions with evaluated model to terminal.
display: DisplayType | None | NotGiven
  Task display type (defaults to 'rich').
approval: str | ApprovalPolicyConfig | None | NotGiven
  Tool use approval policies. Either a path to an approval policy config file or a list of approval policies. Defaults to no approval policy.
score: bool | None | NotGiven
  Score output (defaults to True).
log_level: str | None | NotGiven
  Level for logging to the console: "debug", "http", "sandbox", "info", "warning", "error", "critical", or "notset" (defaults to "warning").
log_level_transcript: str | None | NotGiven
  Level for logging to the log file (defaults to "info").
log_format: Literal['eval', 'json'] | None | NotGiven
  Format for writing log files (defaults to "eval", the native high-performance format).
limit: int | None | NotGiven
  Limit evaluated samples (defaults to all samples).
sample_shuffle: bool | int | None | NotGiven
  Shuffle order of samples (pass a seed to make the order deterministic).
fail_on_error: bool | float | None | NotGiven
  True to fail on first sample error (default); False to never fail on sample errors; a value between 0 and 1 to fail if a proportion of total samples fails; a value greater than 1 to fail the eval if a count of samples fails.
continue_on_fail: bool | None | NotGiven
  True to continue running and only fail at the end if the fail_on_error condition is met. False to fail the eval immediately when the fail_on_error condition is met (default).
retry_on_error: int | None | NotGiven
  Number of times to retry samples if they encounter errors (defaults to 3).
debug_errors: bool | None | NotGiven
  Raise task errors (rather than logging them) so they can be debugged (defaults to False).
model_cost_config: str | dict[str, ModelCost] | None | NotGiven
  YAML or JSON file with model prices for cost tracking.
max_samples: int | None | NotGiven
  Maximum number of samples to run in parallel (default is max_connections).
max_tasks: int | None | NotGiven
  Maximum number of tasks to run in parallel (default is 10).
max_subprocesses: int | None | NotGiven
  Maximum number of subprocesses to run in parallel (default is os.cpu_count()).
max_sandboxes: int | None | NotGiven
  Maximum number of sandboxes (per provider) to run in parallel.
log_samples: bool | None | NotGiven
  Log detailed samples and scores (defaults to True).
log_realtime: bool | None | NotGiven
  Log events in realtime (enables live viewing of samples in inspect view) (defaults to True).
log_images: bool | None | NotGiven
  Log base64 encoded version of images, even if specified as a filename or URL (defaults to False).
log_model_api: bool | None | NotGiven
  Log raw model API requests and responses. Note that error requests/responses are always logged.
log_refusals: bool | None | NotGiven
  Log warnings for model refusals.
log_buffer: int | None | NotGiven
  Number of samples to buffer before writing log file. If not specified, an appropriate default for the format and filesystem is chosen (10 for most cases, 100 for JSON logs on remote filesystems).
log_shared: bool | int | None | NotGiven
  Sync sample events to log directory so that users on other systems can see log updates in realtime (defaults to no syncing). Specify True to sync every 10 seconds, otherwise an integer to sync every n seconds.
bundle_dir: str | None | NotGiven
  If specified, the log viewer and logs generated by this eval set will be bundled into this directory. Relative paths will be resolved relative to the config file (when using the CLI) or the base_dir arg (when using the API).
bundle_overwrite: bool | None | NotGiven
  Whether to overwrite files in the bundle_dir (defaults to False).
log_dir_allow_dirty: bool | None | NotGiven
  If True, allow the log directory to contain unrelated logs. If False, ensure that the log directory only contains logs for tasks in this eval set (defaults to False).
eval_set_id: str | None | NotGiven
  ID for the eval set. If not specified, a unique ID will be generated.
embed_viewer: bool | None | NotGiven
  If True, embed a log viewer into the log directory.
bundle_url_mappings: dict[str, str] | None | NotGiven
  Replacements applied to bundle_dir to generate a URL. If provided and bundle_dir is set, the mapped URL will be written to stdout.
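The retry schedule implied by retry_wait can be sketched directly from the description: the wait doubles on each attempt, starting from retry_wait and capped at one hour per retry. A simplified illustration, not the library's scheduler.

```python
# Sketch of the documented retry_wait behavior: exponential doubling from
# retry_wait, capped at 3600 seconds per retry.

def retry_waits(retry_wait=30.0, retry_attempts=10):
    """Wait time (seconds) before each retry attempt."""
    return [min(retry_wait * 2**i, 3600.0) for i in range(retry_attempts)]

print(retry_waits(30.0, 8))
# [30.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 3600.0]
```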
FlowScorer
Configuration for a Scorer.
class FlowScorer(FlowBase)

Attributes
name: str | None | NotGiven
  Name of the scorer. Used to create the scorer if the factory is not provided.
factory: FlowFactory[Scorer] | Callable[..., Scorer] | str | None | NotGiven
  Factory function to create the scorer instance.
args: CreateArgs | None | NotGiven
  Additional args to pass to the scorer constructor.
flow_metadata: dict[str, Any] | None | NotGiven
  Optional. Metadata stored in the flow config. Not passed to the scorer.
FlowSolver
Configuration for a Solver.
class FlowSolver(FlowBase)

Attributes
name: str | None | NotGiven
  Name of the solver. Used to create the solver if the factory is not provided.
factory: FlowFactory[Solver] | Callable[..., Solver] | str | None | NotGiven
  Factory function to create the solver instance.
args: CreateArgs | None | NotGiven
  Additional args to pass to the solver constructor.
flow_metadata: dict[str, Any] | None | NotGiven
  Optional. Metadata stored in the flow config. Not passed to the solver.
FlowSpec
Top-level flow specification.
class FlowSpec(FlowBase, arbitrary_types_allowed=True)

Attributes
includes: Sequence[str | FlowSpec] | None | NotGiven
  List of other flow specs to include. Relative paths will be resolved relative to the config file (when using the CLI) or the base_dir arg (when using the API). In addition to this list of explicit files to include, any _flow.py files in the same directory or any parent directory of the config file (when using the CLI) or base_dir arg (when using the API) will also be included automatically.
store: Literal['auto'] | str | FlowStoreConfig | None | NotGiven
  Path to directory to use for flow storage, or a FlowStoreConfig with path and filter options. 'auto' will use a default application location. None will disable storage. Relative paths will be resolved relative to the config file (when using the CLI) or base_dir arg (when using the API). If not given, 'auto' will be used.
log_dir: str | None | NotGiven
  Output path for logging results (required to ensure that a unique storage scope is assigned). Must be set before running the flow spec. Relative paths will be resolved relative to the config file (when using the CLI) or the base_dir arg (when using the API).
log_dir_create_unique: bool | None | NotGiven
  If True, create a unique log directory by appending a datetime subdirectory (e.g. 2025-12-09T17-36-43) under the specified log_dir. If False, use the existing log_dir (which must be empty or have log_dir_allow_dirty=True). Defaults to False.
execution_type: Literal['inproc', 'venv'] | None | NotGiven
  Execution environment for running tasks (defaults to 'inproc').
python_version: str | None | NotGiven
  Python version to use in the flow virtual environment (e.g. '3.11').
dependencies: FlowDependencies | None | NotGiven
  Dependencies to install in the venv. Defaults to auto-detecting dependencies from pyproject.toml, requirements.txt, and object names in the config.
options: FlowOptions | None | NotGiven
  Arguments for calls to eval_set().
env: dict[str, str] | None | NotGiven
  Environment variables to set when running tasks.
defaults: FlowDefaults | None | NotGiven
  Default values for Inspect objects.
flow_metadata: dict[str, Any] | None | NotGiven
  Optional. Metadata stored in the flow config. Not passed to the model.
tasks: Sequence[str | FlowTask | Task] | None | NotGiven
  Tasks to run.
FlowStoreConfig
Store configuration with optional log filter.
class FlowStoreConfig(FlowBase)

Attributes
path: Literal['auto'] | str | None
  Path to directory to use for flow storage. 'auto' will use a default application location. None will disable storage.
filter: SkipValidation[LogFilter] | str | Sequence[SkipValidation[LogFilter] | str] | None
  Log filter to apply when searching for existing logs. Can be a callable, a registered filter name, a sequence of filters (all must pass), or None.
read: bool
  Whether to match existing logs from the store. Default is False.
write: bool
  Whether to index completed logs in the store. Default is True.
FlowTask
Configuration for an evaluation task.
Tasks are the basis for defining and running evaluations.
class FlowTask(FlowBase, arbitrary_types_allowed=True)

Attributes
name: str | None | NotGiven
  Task name. Any of a registry name ("inspect_evals/mbpp"), a file name ("./my_task.py"), or a file name and attr ("./my_task.py@task_name"). Used to create the task if the factory is not provided.
factory: FlowFactory[Task] | Callable[..., Task] | str | None | NotGiven
  Factory function to create the task instance.
args: CreateArgs | None | NotGiven
  Additional args to pass to the task constructor.
extra_args: FlowExtraArgs | None | NotGiven
  Extra args to provide to creation of inspect objects for this task. Will override args provided in the args field on the FlowModel, FlowSolver, FlowScorer, and FlowAgent.
solver: str | FlowSolver | FlowAgent | Solver | Agent | Sequence[str | FlowSolver | Solver] | None | NotGiven
  Solver or list of solvers. Defaults to generate(), a normal call to the model.
scorer: str | FlowScorer | Scorer | Sequence[str | FlowScorer | Scorer] | None | NotGiven
  Scorer or list of scorers used to evaluate model output.
model: str | FlowModel | Model | None | NotGiven
  Default model for task (optional, defaults to eval model).
config: GenerateConfig | NotGiven
  Model generation config for default model (does not apply to model roles). Will override config settings on the FlowSpec. Will be overridden by settings on the FlowModel.
model_roles: ModelRolesConfig | None | NotGiven
  Named roles for use in get_model().
sandbox: SandboxEnvironmentType | None | NotGiven
  Sandbox environment type (or optionally a str or tuple with a shorthand spec).
approval: str | ApprovalPolicyConfig | None | NotGiven
  Tool use approval policies. Either a path to an approval policy config file or an approval policy config. Defaults to no approval policy.
epochs: int | FlowEpochs | None | NotGiven
  Epochs to repeat samples for and optional score reducer function(s) used to combine sample scores (defaults to "mean").
fail_on_error: bool | float | None | NotGiven
  True to fail on first sample error (default); False to never fail on sample errors; a value between 0 and 1 to fail if a proportion of total samples fails; a value greater than 1 to fail the eval if a count of samples fails.
continue_on_fail: bool | None | NotGiven
  True to continue running and only fail at the end if the fail_on_error condition is met. False to fail the eval immediately when the fail_on_error condition is met (default).
message_limit: int | None | NotGiven
  Limit on total messages used for each sample.
token_limit: int | None | NotGiven
  Limit on total tokens used for each sample.
time_limit: int | None | NotGiven
  Limit on clock time (in seconds) for samples.
working_limit: int | None | NotGiven
  Limit on working time (in seconds) for each sample. Working time includes model generation, tool calls, etc. but does not include time spent waiting on retries or shared resources.
cost_limit: float | None | NotGiven
  Limit on total cost (in dollars) for each sample. Requires model cost data via model_cost_config.
early_stopping: SkipValidation[EarlyStopping] | None | NotGiven
  Early stopping callbacks.
version: int | str | NotGiven
  Version of task (to distinguish evolutions of the task spec or breaking changes to it).
tags: Sequence[str] | None | NotGiven
  Tags to associate with the task.
metadata: dict[str, Any] | None | NotGiven
  Additional metadata to associate with the task.
sample_id: str | int | Sequence[str | int] | None | NotGiven
  Evaluate specific sample(s) from the dataset.
flow_metadata: dict[str, Any] | None | NotGiven
  Optional. Metadata stored in the flow config. Not passed to the task.
model_name: str | None | NotGiven
  Get the model name from the model field.
  Returns: The model name if set, otherwise None.
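The fail_on_error semantics shared by FlowTask and FlowOptions can be sketched as a small decision function. The exact threshold comparisons (>= vs >) are an assumption; only the four documented cases are modeled.

```python
# Sketch of fail_on_error: True fails on the first error, False never fails,
# a fraction in (0, 1) fails once that proportion of samples has errored, and
# a value > 1 fails at that absolute count of errors.
# NOTE: True == 1 in Python, so the identity checks must come first.

def should_fail(fail_on_error, errors, total_samples):
    if fail_on_error is True:
        return errors > 0
    if fail_on_error is False:
        return False
    if 0 < fail_on_error < 1:           # proportion of total samples
        return errors / total_samples >= fail_on_error
    return errors >= fail_on_error      # absolute count

print(should_fail(0.1, errors=12, total_samples=100))  # True (12% >= 10%)
print(should_fail(5, errors=3, total_samples=100))     # False (3 < 5)
```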
Type Aliases
LogFilter
A function that receives an EvalLog (header-only) and returns True to include the log or False to exclude it.
LogFilter: TypeAlias = Callable[[EvalLog], bool]

Decorators
after_load
Decorator to mark a function to be called after a FlowSpec is loaded.
The decorated function should have the signature (args are all optional and may be omitted):
def after_flow_spec_loaded(
    spec: FlowSpec,
    files: list[str],
) -> None:
    ...

def after_load(func: F) -> F

func: F
  The function to decorate.
log_filter
Decorator to register a log filter function.
def log_filter(func: Callable[[EvalLog], bool]) -> Callable[[EvalLog], bool]

func: Callable[[EvalLog], bool]
  A function that takes an EvalLog and returns True to include.
Functions
agents_matrix
Create a list of agents from the product of lists of field values.
def agents_matrix(
*,
agent: str | FlowAgent | Sequence[str | FlowAgent],
args: Sequence[Mapping[str, Any] | NotGiven | None] | None = ...,
) -> list[FlowAgent]

agents_with
Set fields on a list of agents.
def agents_with(
*,
agent: str | FlowAgent | Sequence[str | FlowAgent],
name: str | NotGiven | None = ...,
factory: str | NotGiven | None = ...,
args: Mapping[str, Any] | NotGiven | None = ...,
flow_metadata: Mapping[str, Any] | NotGiven | None = ...,
type: Literal['agent'] | None = ...,
) -> list[FlowAgent]

agent: str | FlowAgent | Sequence[str | FlowAgent]
  The agent or list of agents to set fields on.
name: str | NotGiven | None
  Name of the agent. Used to create the agent if the factory is not provided.
factory: str | NotGiven | None
  Factory function to create the agent instance.
args: Mapping[str, Any] | NotGiven | None
  Additional args to pass to the agent constructor.
flow_metadata: Mapping[str, Any] | NotGiven | None
  Optional. Metadata stored in the flow config. Not passed to the agent.
type: Literal['agent'] | None
  Type needed to differentiate solvers and agents in solver lists.
configs_matrix
Create a list of generate configs from the product of lists of field values.
def configs_matrix(
*,
config: GenerateConfig | Sequence[GenerateConfig] | None = ...,
system_message: Sequence[str | None] | None = ...,
max_tokens: Sequence[int | None] | None = ...,
top_p: Sequence[float | None] | None = ...,
temperature: Sequence[float | None] | None = ...,
stop_seqs: Sequence[Sequence[str] | None] | None = ...,
best_of: Sequence[int | None] | None = ...,
frequency_penalty: Sequence[float | None] | None = ...,
presence_penalty: Sequence[float | None] | None = ...,
logit_bias: Sequence[Mapping[str, float] | None] | None = ...,
seed: Sequence[int | None] | None = ...,
top_k: Sequence[int | None] | None = ...,
num_choices: Sequence[int | None] | None = ...,
logprobs: Sequence[bool | None] | None = ...,
top_logprobs: Sequence[int | None] | None = ...,
parallel_tool_calls: Sequence[bool | None] | None = ...,
internal_tools: Sequence[bool | None] | None = ...,
max_tool_output: Sequence[int | None] | None = ...,
cache_prompt: Sequence[Literal['auto'] | bool | None] | None = ...,
reasoning_effort: Sequence[Literal['none', 'minimal', 'low', 'medium', 'high', 'xhigh'] | None] | None = ...,
reasoning_tokens: Sequence[int | None] | None = ...,
reasoning_summary: Sequence[Literal['none', 'concise', 'detailed', 'auto'] | None] | None = ...,
reasoning_history: Sequence[Literal['none', 'all', 'last', 'auto'] | None] | None = ...,
response_schema: Sequence[ResponseSchema | None] | None = ...,
extra_body: Sequence[Mapping[str, Any] | None] | None = ...,
) -> list[GenerateConfig]

config: GenerateConfig | Sequence[GenerateConfig] | None
  The config or list of configs to matrix.
system_message: Sequence[str | None] | None
  Override the default system message.
max_tokens: Sequence[int | None] | None
  The maximum number of tokens that can be generated in the completion (default is model specific).
top_p: Sequence[float | None] | None
  An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
temperature: Sequence[float | None] | None
  What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
stop_seqs: Sequence[Sequence[str] | None] | None
  Sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
best_of: Sequence[int | None] | None
  Generates best_of completions server-side and returns the 'best' (the one with the highest log probability per token). vLLM only.
frequency_penalty: Sequence[float | None] | None
  Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. OpenAI, Google, Grok, Groq, vLLM, and SGLang only.
presence_penalty: Sequence[float | None] | None
  Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. OpenAI, Google, Grok, Groq, vLLM, and SGLang only.
logit_bias: Sequence[Mapping[str, float] | None] | None
  Map token ids to an associated bias value from -100 to 100 (e.g. "42=10,43=-10"). OpenAI, Grok, Groq, and vLLM only.
seed: Sequence[int | None] | None
  Random seed. OpenAI, Google, Mistral, Groq, HuggingFace, and vLLM only.
top_k: Sequence[int | None] | None
  Randomly sample the next word from the top_k most likely next words. Anthropic, Google, HuggingFace, vLLM, and SGLang only.
num_choices: Sequence[int | None] | None
  How many chat completion choices to generate for each input message. OpenAI, Grok, Google, TogetherAI, vLLM, and SGLang only.
logprobs: Sequence[bool | None] | None
  Return log probabilities of the output tokens. OpenAI, Grok, TogetherAI, HuggingFace, llama-cpp-python, vLLM, and SGLang only.
top_logprobs: Sequence[int | None] | None
  Number of most likely tokens (0-20) to return at each token position, each with an associated log probability. OpenAI, Grok, HuggingFace, vLLM, and SGLang only.
parallel_tool_calls: Sequence[bool | None] | None
  Whether to enable parallel function calling during tool use (defaults to True). OpenAI and Groq only.
internal_tools: Sequence[bool | None] | None
  Whether to automatically map tools to model internal implementations (e.g. 'computer' for Anthropic).
max_tool_output: Sequence[int | None] | None
  Maximum tool output (in bytes). Defaults to 16 * 1024.
cache_prompt: Sequence[Literal['auto'] | bool | None] | None
  Whether to cache the prompt prefix. Defaults to "auto", which will enable caching for requests with tools. Anthropic only.
reasoning_effort: Sequence[Literal['none', 'minimal', 'low', 'medium', 'high', 'xhigh'] | None] | None
  Constrains effort on reasoning. Defaults vary by provider and model, and not all models support all values (please consult provider documentation for details).
reasoning_tokens: Sequence[int | None] | None
  Maximum number of tokens to use for reasoning. Anthropic Claude models only.
reasoning_summary: Sequence[Literal['none', 'concise', 'detailed', 'auto'] | None] | None
  Provide summary of reasoning steps (OpenAI reasoning models only). Use 'auto' to access the most detailed summarizer available for the current model (defaults to 'auto' if your organization is verified by OpenAI).
reasoning_history: Sequence[Literal['none', 'all', 'last', 'auto'] | None] | None
  Include reasoning in chat message history sent to generate.
response_schema: Sequence[ResponseSchema | None] | None
  Request a response format as JSON Schema (output should still be validated). OpenAI, Google, Mistral, vLLM, and SGLang only.
extra_body: Sequence[Mapping[str, Any] | None] | None
  Extra body to be sent with requests to OpenAI-compatible servers. OpenAI, vLLM, and SGLang only.
configs_with
Set fields on a list of generate configs.
def configs_with(
*,
config: GenerateConfig | Sequence[GenerateConfig],
max_retries: int | None = ...,
timeout: int | None = ...,
attempt_timeout: int | None = ...,
max_connections: int | None = ...,
system_message: str | None = ...,
max_tokens: int | None = ...,
top_p: float | None = ...,
temperature: float | None = ...,
stop_seqs: Sequence[str] | None = ...,
best_of: int | None = ...,
frequency_penalty: float | None = ...,
presence_penalty: float | None = ...,
logit_bias: Mapping[str, float] | None = ...,
seed: int | None = ...,
top_k: int | None = ...,
num_choices: int | None = ...,
logprobs: bool | None = ...,
top_logprobs: int | None = ...,
parallel_tool_calls: bool | None = ...,
internal_tools: bool | None = ...,
max_tool_output: int | None = ...,
cache_prompt: Literal['auto'] | bool | None = ...,
verbosity: Literal['low', 'medium', 'high'] | None = ...,
effort: Literal['low', 'medium', 'high', 'max'] | None = ...,
reasoning_effort: Literal['none', 'minimal', 'low', 'medium', 'high', 'xhigh'] | None = ...,
reasoning_tokens: int | None = ...,
reasoning_summary: Literal['none', 'concise', 'detailed', 'auto'] | None = ...,
reasoning_history: Literal['none', 'all', 'last', 'auto'] | None = ...,
response_schema: ResponseSchema | None = ...,
extra_headers: Mapping[str, str] | None = ...,
extra_body: Mapping[str, Any] | None = ...,
modalities: Sequence[Literal['image'] | ImageOutput] | None = ...,
cache: bool | CachePolicy | None = ...,
batch: bool | int | BatchConfig | None = ...,
) -> list[GenerateConfig]

config: GenerateConfig | Sequence[GenerateConfig]
  The config or list of configs to set fields on.
max_retries: int | None
  Maximum number of times to retry request (defaults to unlimited).
timeout: int | None
  Timeout (in seconds) for an entire request (including retries).
attempt_timeout: int | None
  Timeout (in seconds) for any given attempt (if exceeded, will abandon attempt and retry according to max_retries).
max_connections: int | None
  Maximum number of concurrent connections to the Model API (default is model specific).
system_message: str | None
  Override the default system message.
max_tokens: int | None
  The maximum number of tokens that can be generated in the completion (default is model specific).
top_p: float | None
  An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
temperature: float | None
  What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
stop_seqs: Sequence[str] | None
  Sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
best_of: int | None
  Generates best_of completions server-side and returns the 'best' (the one with the highest log probability per token). vLLM only.
frequency_penalty: float | None
  Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. OpenAI, Google, Grok, Groq, vLLM, and SGLang only.
presence_penalty: float | None
  Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. OpenAI, Google, Grok, Groq, vLLM, and SGLang only.
logit_bias: Mapping[str, float] | None
  Map token ids to an associated bias value from -100 to 100 (e.g. "42=10,43=-10"). OpenAI, Grok, Groq, and vLLM only.
seed: int | None
  Random seed. OpenAI, Google, Mistral, Groq, HuggingFace, and vLLM only.
top_k: int | None
  Randomly sample the next word from the top_k most likely next words. Anthropic, Google, HuggingFace, vLLM, and SGLang only.
num_choices: int | None
  How many chat completion choices to generate for each input message. OpenAI, Grok, Google, TogetherAI, vLLM, and SGLang only.
logprobs: bool | None
  Return log probabilities of the output tokens. OpenAI, Grok, TogetherAI, HuggingFace, llama-cpp-python, vLLM, and SGLang only.
top_logprobs: int | None
  Number of most likely tokens (0-20) to return at each token position, each with an associated log probability. OpenAI, Grok, HuggingFace, vLLM, and SGLang only.
parallel_tool_calls: bool | None
  Whether to enable parallel function calling during tool use (defaults to True). OpenAI and Groq only.
internal_tools: bool | None
  Whether to automatically map tools to model internal implementations (e.g. 'computer' for Anthropic).
max_tool_output: int | None
  Maximum tool output (in bytes). Defaults to 16 * 1024.
cache_prompt: Literal['auto'] | bool | None
  Whether to cache the prompt prefix. Defaults to "auto", which will enable caching for requests with tools. Anthropic only.
verbosity: Literal['low', 'medium', 'high'] | None
  Constrains the verbosity of the model's response. Lower values will result in more concise responses, while higher values will result in more verbose responses. GPT 5.x models only (defaults to "medium" for OpenAI models).
effort: Literal['low', 'medium', 'high', 'max'] | None
  Control how many tokens are used for a response, trading off between response thoroughness and token efficiency. Anthropic Claude Opus 4.5 and 4.6 only (max only supported on 4.6).
reasoning_effort: Literal['none', 'minimal', 'low', 'medium', 'high', 'xhigh'] | None
  Constrains effort on reasoning. Defaults vary by provider and model, and not all models support all values (please consult provider documentation for details).
reasoning_tokens: int | None
  Maximum number of tokens to use for reasoning. Anthropic Claude models only.
reasoning_summary: Literal['none', 'concise', 'detailed', 'auto'] | None
  Provide summary of reasoning steps (OpenAI reasoning models only). Use 'auto' to access the most detailed summarizer available for the current model (defaults to 'auto' if your organization is verified by OpenAI).
reasoning_history: Literal['none', 'all', 'last', 'auto'] | None
  Include reasoning in chat message history sent to generate.
response_schema: ResponseSchema | None
  Request a response format as JSON Schema (output should still be validated). OpenAI, Google, Mistral, vLLM, and SGLang only.
extra_headers: Mapping[str, str] | None
  Extra headers to be sent with requests. Not supported for AzureAI, Bedrock, and Grok.
extra_body: Mapping[str, Any] | None
  Extra body to be sent with requests to OpenAI-compatible servers. OpenAI, vLLM, and SGLang only.
modalities: Sequence[Literal['image'] | ImageOutput] | None
  Additional output modalities to enable beyond text (e.g. ["image"]). OpenAI and Google only.
cache: bool | CachePolicy | None
  Policy for caching of model generate output.
batch: bool | int | BatchConfig | None
  Use batching API when available. True to enable batching with default configuration, False to disable batching, a number to enable batching with the specified batch size, or a BatchConfig object specifying the batching configuration.
merge
Merge two flow objects, with add values overriding base values.
Only explicitly set fields in add override base — unset fields (defaulting to NotGiven) are ignored. Nested fields like config and flow_metadata are merged recursively rather than replaced.
def merge(base: _T, add: _T) -> _T
base_T-
The base object providing default values.
add_T-
The object to merge into the base. Only explicitly set fields override those in base.
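The merge semantics above can be sketched in plain Python. This is an illustrative model of the described behavior (a NOT_GIVEN sentinel marking unset fields, recursive merging of nested mappings), not the library's implementation, which operates on typed flow objects:

```python
from typing import Any

NOT_GIVEN = object()  # sentinel meaning "this field was never set"

def merge_fields(base: dict[str, Any], add: dict[str, Any]) -> dict[str, Any]:
    """Fields explicitly set in `add` override `base`; NOT_GIVEN fields are ignored.

    Nested dicts (standing in for fields like config and flow_metadata) are
    merged recursively rather than replaced wholesale.
    """
    merged = dict(base)
    for key, value in add.items():
        if value is NOT_GIVEN:
            continue  # unset in `add`: keep the base value
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_fields(merged[key], value)  # recursive merge
        else:
            merged[key] = value  # explicit value in `add` wins
    return merged
```

The sentinel is what distinguishes "explicitly set to None" (which overrides) from "never set" (which is ignored), mirroring the NotGiven type used throughout these signatures.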
models_matrix
Create a list of models from the product of lists of field values.
def models_matrix(
*,
model: str | FlowModel | Sequence[str | FlowModel],
config: Sequence[GenerateConfig | NotGiven | None] | None = ...,
) -> list[FlowModel]
models_with
Set fields on a list of models.
def models_with(
*,
model: str | FlowModel | Sequence[str | FlowModel],
name: str | NotGiven | None = ...,
factory: str | NotGiven | None = ...,
role: str | NotGiven | None = ...,
default: str | NotGiven | None = ...,
config: GenerateConfig | NotGiven | None = ...,
base_url: str | NotGiven | None = ...,
api_key: str | NotGiven | None = ...,
memoize: bool | NotGiven | None = ...,
model_args: Mapping[str, Any] | NotGiven | None = ...,
flow_metadata: Mapping[str, Any] | NotGiven | None = ...,
) -> list[FlowModel]
modelstr | FlowModel | Sequence[str | FlowModel]-
The model or list of models to set fields on.
namestr | NotGiven | None-
Name of the model to use. If factory is not provided, this is used to create the model.
factorystr | NotGiven | None-
Factory function to create the model instance.
rolestr | NotGiven | None-
Optional named role for model (e.g. for roles specified at the task or eval level). Provide a default as a fallback in the case where the role hasn’t been externally specified.
defaultstr | NotGiven | None-
Optional. Fallback model in case the specified model or role is not found. Should be a fully qualified model name (e.g. openai/gpt-4o).
configGenerateConfig | NotGiven | None-
Configuration for model. Config values will override settings on the FlowTask and FlowSpec.
base_urlstr | NotGiven | None-
Optional. Alternate base URL for model.
api_keystr | NotGiven | None-
Optional. API key for model.
memoizebool | NotGiven | None-
Use/store a cached version of the model based on the parameters to get_model(). Defaults to True.
model_argsMapping[str, Any] | NotGiven | None-
Additional args to pass to model constructor.
flow_metadataMapping[str, Any] | NotGiven | None-
Optional. Metadata stored in the flow config. Not passed to the model.
solvers_matrix
Create a list of solvers from the product of lists of field values.
def solvers_matrix(
*,
solver: str | FlowSolver | Sequence[str | FlowSolver],
args: Sequence[Mapping[str, Any] | NotGiven | None] | None = ...,
) -> list[FlowSolver]
solverstr | FlowSolver | Sequence[str | FlowSolver]-
The solver or list of solvers to matrix.
argsSequence[Mapping[str, Any] | NotGiven | None] | None-
Additional args to pass to solver constructor.
solvers_with
Set fields on a list of solvers.
def solvers_with(
*,
solver: str | FlowSolver | Sequence[str | FlowSolver],
name: str | NotGiven | None = ...,
factory: str | NotGiven | None = ...,
args: Mapping[str, Any] | NotGiven | None = ...,
flow_metadata: Mapping[str, Any] | NotGiven | None = ...,
) -> list[FlowSolver]
solverstr | FlowSolver | Sequence[str | FlowSolver]-
The solver or list of solvers to set fields on.
namestr | NotGiven | None-
Name of the solver. Used to create the solver if the factory is not provided.
factorystr | NotGiven | None-
Factory function to create the solver instance.
argsMapping[str, Any] | NotGiven | None-
Additional args to pass to solver constructor.
flow_metadataMapping[str, Any] | NotGiven | None-
Optional. Metadata stored in the flow config. Not passed to the solver.
tasks_matrix
Create a list of tasks from the product of lists of field values.
def tasks_matrix(
*,
task: str | FlowTask | Sequence[str | FlowTask],
args: Sequence[Mapping[str, Any] | NotGiven | None] | None = ...,
solver: Sequence[str | FlowSolver | FlowAgent | Solver | Agent | Sequence[str | FlowSolver | Solver] | NotGiven | None] | None = ...,
model: Sequence[str | FlowModel | Model | NotGiven | None] | None = ...,
config: Sequence[GenerateConfig | NotGiven] | None = ...,
model_roles: Sequence[Mapping[str, FlowModel | str | Model] | NotGiven | None] | None = ...,
message_limit: Sequence[int | NotGiven | None] | None = ...,
token_limit: Sequence[int | NotGiven | None] | None = ...,
time_limit: Sequence[int | NotGiven | None] | None = ...,
working_limit: Sequence[int | NotGiven | None] | None = ...,
cost_limit: Sequence[float | NotGiven | None] | None = ...,
) -> list[FlowTask]
taskstr | FlowTask | Sequence[str | FlowTask]-
The task or list of tasks to matrix.
argsSequence[Mapping[str, Any] | NotGiven | None] | None-
Additional args to pass to task constructor.
solverSequence[str | FlowSolver | FlowAgent | Solver | Agent | Sequence[str | FlowSolver | Solver] | NotGiven | None] | None-
Solver or list of solvers. Defaults to generate(), a normal call to the model.
modelSequence[str | FlowModel | Model | NotGiven | None] | None-
Default model for task (Optional, defaults to eval model).
configSequence[GenerateConfig | NotGiven] | None-
Model generation config for default model (does not apply to model roles). Will override config settings on the FlowSpec. Will be overridden by settings on the FlowModel.
model_rolesSequence[Mapping[str, FlowModel | str | Model] | NotGiven | None] | None-
Named roles for use in get_model().
message_limitSequence[int | NotGiven | None] | None-
Limit on total messages used for each sample.
token_limitSequence[int | NotGiven | None] | None-
Limit on total tokens used for each sample.
time_limitSequence[int | NotGiven | None] | None-
Limit on clock time (in seconds) for samples.
working_limitSequence[int | NotGiven | None] | None-
Limit on working time (in seconds) for sample. Working time includes model generation, tool calls, etc. but does not include time spent waiting on retries or shared resources.
cost_limitSequence[float | NotGiven | None] | None-
Limit on total cost (in dollars) for each sample. Requires model cost data via model_cost_config.
tasks_with
Set fields on a list of tasks.
def tasks_with(
*,
task: str | FlowTask | Sequence[str | FlowTask],
name: str | NotGiven | None = ...,
factory: str | NotGiven | None = ...,
args: Mapping[str, Any] | NotGiven | None = ...,
extra_args: FlowExtraArgs | NotGiven | None = ...,
solver: str | FlowSolver | FlowAgent | Solver | Agent | Sequence[str | FlowSolver | Solver] | NotGiven | None = ...,
scorer: str | FlowScorer | Scorer | Sequence[str | FlowScorer | Scorer] | NotGiven | None = ...,
model: str | FlowModel | Model | NotGiven | None = ...,
config: GenerateConfig | NotGiven = ...,
model_roles: Mapping[str, FlowModel | str | Model] | NotGiven | None = ...,
sandbox: str | tuple[str, str] | SandboxEnvironmentSpec | NotGiven | None = ...,
approval: str | ApprovalPolicyConfig | NotGiven | None = ...,
epochs: int | FlowEpochs | NotGiven | None = ...,
fail_on_error: bool | float | NotGiven | None = ...,
continue_on_fail: bool | NotGiven | None = ...,
message_limit: int | NotGiven | None = ...,
token_limit: int | NotGiven | None = ...,
time_limit: int | NotGiven | None = ...,
working_limit: int | NotGiven | None = ...,
cost_limit: float | NotGiven | None = ...,
early_stopping: NotGiven | None = ...,
version: int | str | NotGiven = ...,
tags: Sequence[str] | NotGiven | None = ...,
metadata: Mapping[str, Any] | NotGiven | None = ...,
sample_id: str | int | Sequence[str | int] | NotGiven | None = ...,
flow_metadata: Mapping[str, Any] | NotGiven | None = ...,
) -> list[FlowTask]
taskstr | FlowTask | Sequence[str | FlowTask]-
The task or list of tasks to set fields on.
namestr | NotGiven | None-
Task name. Any of registry name ("inspect_evals/mbpp"), file name ("./my_task.py"), or a file name and attr ("./my_task.py@task_name"). Used to create the task if the factory is not provided.
factorystr | NotGiven | None-
Factory function to create the task instance.
argsMapping[str, Any] | NotGiven | None-
Additional args to pass to task constructor.
extra_argsFlowExtraArgs | NotGiven | None-
Extra args to provide when creating inspect objects for this task. Will override args provided in the args field on the FlowModel, FlowSolver, FlowScorer, and FlowAgent.
solverstr | FlowSolver | FlowAgent | Solver | Agent | Sequence[str | FlowSolver | Solver] | NotGiven | None-
Solver or list of solvers. Defaults to generate(), a normal call to the model.
scorerstr | FlowScorer | Scorer | Sequence[str | FlowScorer | Scorer] | NotGiven | None-
Scorer or list of scorers used to evaluate model output.
modelstr | FlowModel | Model | NotGiven | None-
Default model for task (Optional, defaults to eval model).
configGenerateConfig | NotGiven-
Model generation config for default model (does not apply to model roles). Will override config settings on the FlowSpec. Will be overridden by settings on the FlowModel.
model_rolesMapping[str, FlowModel | str | Model] | NotGiven | None-
Named roles for use in get_model().
sandboxstr | tuple[str, str] | SandboxEnvironmentSpec | NotGiven | None-
Sandbox environment type (or optionally a str or tuple with a shorthand spec).
approvalstr | ApprovalPolicyConfig | NotGiven | None-
Tool use approval policies. Either a path to an approval policy config file or an approval policy config. Defaults to no approval policy.
epochsint | FlowEpochs | NotGiven | None-
Epochs to repeat samples for and optional score reducer function(s) used to combine sample scores (defaults to "mean").
fail_on_errorbool | float | NotGiven | None-
True to fail on first sample error (default); False to never fail on sample errors; a value between 0 and 1 to fail if that proportion of total samples fails; a value greater than 1 to fail the eval if that count of samples fails.
continue_on_failbool | NotGiven | None-
True to continue running and only fail at the end if the fail_on_error condition is met; False to fail the eval immediately when the fail_on_error condition is met (default).
message_limitint | NotGiven | None-
Limit on total messages used for each sample.
token_limitint | NotGiven | None-
Limit on total tokens used for each sample.
time_limitint | NotGiven | None-
Limit on clock time (in seconds) for samples.
working_limitint | NotGiven | None-
Limit on working time (in seconds) for sample. Working time includes model generation, tool calls, etc. but does not include time spent waiting on retries or shared resources.
cost_limitfloat | NotGiven | None-
Limit on total cost (in dollars) for each sample. Requires model cost data via model_cost_config.
early_stoppingNotGiven | None-
Early stopping callbacks.
versionint | str | NotGiven-
Version of task (to distinguish evolutions of the task spec or breaking changes to it).
tagsSequence[str] | NotGiven | None-
Tags to associate with the task.
metadataMapping[str, Any] | NotGiven | None-
Additional metadata to associate with the task.
sample_idstr | int | Sequence[str | int] | NotGiven | None-
Evaluate specific sample(s) from the dataset.
flow_metadataMapping[str, Any] | NotGiven | None-
Optional. Metadata stored in the flow config. Not passed to the task.