# Inspect SWE

> Software engineering agents for Inspect AI.

## Overview

The `inspect_swe` package makes software engineering agents like [Claude Code](https://docs.anthropic.com/en/docs/claude-code/overview), [Codex CLI](https://github.com/openai/codex), [Gemini CLI](https://github.com/google-gemini/gemini-cli), [OpenCode](https://github.com/anomalyco/opencode), and [Mini SWE Agent](https://github.com/SWE-agent/mini-swe-agent) available as standard Inspect agents.

For example, here we use the [claude_code()](./reference/index.html.md#claude_code) agent as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import claude_code

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=claude_code(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

Inspect SWE agents are implemented using the Inspect [`sandbox_agent_bridge()`](https://inspect.aisi.org.uk/agent-bridge.html#sandbox-bridge). Agents run inside the sample sandbox and their model API calls are proxied back to Inspect. This means that you can use any model with Inspect SWE agents, and that features like token or time limits and log transcripts work as normal with the agents.
## Getting Started

Install Inspect SWE from PyPI with:

``` bash
pip install inspect-swe
```

Then, try out one or more of the available agents:

| Agent | Description |
|----|----|
| [claude_code()](./claude_code.html.md) | Anthropic’s agentic coding tool [Claude Code](https://docs.anthropic.com/en/docs/claude-code/overview) |
| [codex_cli()](./codex_cli.html.md) | OpenAI’s terminal-based coding agent [Codex CLI](https://github.com/openai/codex) |
| [gemini_cli()](./gemini_cli.html.md) | Google’s open-source AI agent [Gemini CLI](https://github.com/google-gemini/gemini-cli) |
| [opencode()](./opencode.html.md) | Provider-independent terminal-based coding agent. |
| [mini_swe_agent()](./mini_swe_agent.html.md) | SWE-agent’s minimal 100-line agent. |

# Claude Code – Inspect SWE

## Overview

The `claude_code()` agent uses the unattended mode of Anthropic’s [Claude Code](https://code.claude.com/docs/en/overview) to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

> **NOTE: Claude Code Installation**
>
> By default, the agent will download the current stable version of Claude Code and copy it to the sandbox. You can also exercise more explicit control over which version of Claude Code is used; see the [Installation](#installation) section below for details.

## Basic Usage

Use the `claude_code()` agent as you would any Inspect agent.
For example, here we use it as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import claude_code

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=claude_code(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also pass the agent as a `--solver` on the command line:

``` bash
inspect eval ctf.py --solver inspect_swe/claude_code
```

If you want to try this out locally, see the [system_explorer](https://github.com/meridianlabs-ai/inspect_swe/tree/main/examples/system_explorer/task.py) example.

## Options

The following options are supported for customizing the behavior of the agent:

| Option | Description |
|----|----|
| `system_prompt` | Additional system prompt to append to default system prompt. |
| `skills` | Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. |
| `mcp_servers` | MCP servers (see [MCP Servers](#mcp-servers) below for details). |
| `bridged_tools` | Host-side Inspect tools to expose via MCP (see [Bridged Tools](#bridged-tools) below for details). |
| `disallowed_tools` | Optionally disallow tools (e.g. `"WebSearch"`) |
| `centaur` | Run in [Centaur Mode](#centaur-mode), which makes Claude Code available to an Inspect `human_cli()` agent rather than running it unattended. |
| `attempts` | Allow the agent to have multiple scored attempts at solving the task. |
| `model` | Model name to use for agent (defaults to main model for task). |
| `model_config` | Model id used for the identity the agent presents to itself (its “You are powered by the model …” system prompt). Defaults to the real served model. |
| `opus_model` | The model to use for `opus`, or for `opusplan` when Plan Mode is active. Defaults to `model`. |
| `sonnet_model` | The model to use for `sonnet`, or for `opusplan` when Plan Mode is not active. Defaults to `model`. |
| `haiku_model` | The model to use for `haiku`, or [background functionality](https://code.claude.com/docs/en/costs#background-token-usage). Defaults to `model`. |
| `subagent_model` | The model to use for [subagents](https://code.claude.com/docs/en/sub-agents). Defaults to `model`. |
| `filter` | Filter for intercepting bridged model requests. |
| `retry_refusals` | Should refusals be retried? Defaults to 3. |
| `retry_uncaught_errors` | Should uncaught errors (unexpected crashes of Claude Code) be retried? Defaults to 3. |
| `cwd` | Working directory for Claude Code session. |
| `env` | Environment variables to set for Claude Code. |
| `version` | Version of Claude Code to use (see [Installation](#installation) below for details) |

For example, here we specify a custom system prompt and disallow the `WebSearch` tool:

``` python
claude_code(
    system_prompt="You are an ace system researcher.",
    disallowed_tools=["WebSearch"]
)
```

## MCP Servers

You can specify one or more [Model Context Protocol](https://modelcontextprotocol.io/docs/getting-started/intro) (MCP) servers to provide additional tools to Claude Code. Servers are specified via the [`MCPServerConfig`](https://inspect.aisi.org.uk/reference/inspect_ai.tool.html#mcpserverconfig) class and its Stdio and HTTP variants.
For example, here is a Dockerfile that makes the `server-memory` MCP server available in the sandbox container:

``` dockerfile
FROM python:3.12-bookworm

# nodejs (required by mcp server)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y --no-install-recommends nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# memory mcp server
RUN npx --yes @modelcontextprotocol/server-memory --version

# run forever
CMD ["tail", "-f", "/dev/null"]
```

Note that we run the `npx` server during the build of the Dockerfile so that it is cached for use offline (below we’ll run it with the `--offline` option).

We can then use this MCP server in a task as follows:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.tool import MCPServerConfigStdio
from inspect_swe import claude_code

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(
                input="What transport protocols are supported in "
                + "the 2025-03-26 version of the MCP spec?"
            )
        ],
        solver=claude_code(
            system_prompt="Please use the web search tool to "
            + "research this question and the memory tools "
            + "to keep track of your research.",
            mcp_servers=[
                MCPServerConfigStdio(
                    name="memory",
                    command="npx",
                    args=[
                        "--offline",
                        "@modelcontextprotocol/server-memory"
                    ],
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

Note that we run the MCP server using the `--offline` option so that it doesn’t require an internet connection (which it would normally use to check for updates to the package).

## Bridged Tools

You can expose host-side Inspect tools to the sandboxed agent via the MCP protocol using the `bridged_tools` parameter. This allows you to run tools on the host (e.g. tools that access host resources, databases, or APIs) but make them available to the agent running inside the sandbox.
Tools are specified via [`BridgedToolsSpec`](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#bridgedtoolsspec) which wraps a list of Inspect tools:

``` python
from inspect_ai import Task, task
from inspect_ai.agent import BridgedToolsSpec
from inspect_ai.dataset import Sample
from inspect_ai.tool import tool
from inspect_swe import claude_code

@tool
def search_database():
    async def execute(query: str) -> str:
        """Search the internal database.

        Args:
            query: The search query.
        """
        # This runs on the host, not in the sandbox
        return f"Results for: {query}"

    return execute

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(input="Search for information about MCP protocols.")
        ],
        solver=claude_code(
            system_prompt="Use the search tool to research.",
            bridged_tools=[
                BridgedToolsSpec(
                    name="host_tools",
                    tools=[search_database()]
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

The `name` field identifies the MCP server and will be visible to the agent as a tool prefix. You can specify multiple `BridgedToolsSpec` instances to create separate MCP servers for different tool groups.

See the [Bridged Tools](https://inspect.aisi.org.uk/agent-bridge.html#bridged-tools) documentation for more details on the architecture and how tool execution flows between host and sandbox.

## Installation

By default, the agent will download the current stable version of Claude Code and copy it to the sandbox. You can override this behaviour using the `version` option:

| Option | Description |
|----|----|
| `"auto"` | Use any available version of Claude Code in the sandbox, otherwise download the current stable version. |
| `"sandbox"` | Use the version of Claude Code in the sandbox (raises `RuntimeError` if not available in the sandbox) |
| `"stable"` | Download and use the current stable version. |
| `"latest"` | Download and use the very latest version. |
| `"x.x.x"` | Download and use a specific version number. |

If you don’t ever want to rely on automatic downloads of Claude Code (e.g. if you run your evaluations offline), you can use one of two approaches:

1.  Pre-install the version of Claude Code you want to use in the sandbox, then use `version="sandbox"`:

    ``` python
    claude_code(version="sandbox")
    ```

2.  Download the version of Claude Code you want to use into the cache, then specify that version explicitly:

    ``` python
    # download the agent binary during installation/configuration
    download_agent_binary("claude_code", "0.29.0", "linux-x64")

    # reference that version in your task (no download will occur)
    claude_code(version="0.29.0")
    ```

    Note that the 5 most recently downloaded versions are retained in the cache. Use the [cached_agent_binaries()](./reference/index.html.md#cached_agent_binaries) function to list the contents of the cache.

## Centaur Mode

The `claude_code()` agent can also be run in “centaur” mode which uses the Inspect AI [Human Agent](https://inspect.aisi.org.uk/human-agent.html) as the solver and makes [Claude Code](https://code.claude.com/docs/en/overview) available to the human user for help with the task. So rather than strictly measuring human vs. model performance, you are able to measure performance of humans working collaboratively with a model.

Enable centaur mode by passing `centaur=True` to the `claude_code()` agent:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import claude_code

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=claude_code(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also enable centaur mode from the CLI using a solver arg (`-S`):

``` bash
inspect eval ctf.py --solver inspect_swe/claude_code -S centaur=true
```

You can also pass `CentaurOptions` to further customize the behavior of the human agent.
For example:

``` python
from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=claude_code(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)
```

See the [human_cli()](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#human_cli) documentation for details on available options.

## Troubleshooting

If Claude Code doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the Claude Code debug log after an evaluation task is complete. You can do this with:

``` bash
inspect trace dump --filter "Claude Code"
```

# Codex CLI – Inspect SWE

## Overview

The `codex_cli()` agent uses the unattended mode of OpenAI’s [Codex CLI](https://github.com/openai/codex) to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

> **NOTE: Codex CLI Installation**
>
> By default, the agent will download the current stable version of Codex CLI and copy it to the sandbox. You can also exercise more explicit control over which version of Codex CLI is used; see the [Installation](#installation) section below for details.

## Basic Usage

Use the `codex_cli()` agent as you would any Inspect agent. For example, here we use it as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import codex_cli

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=codex_cli(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also pass the agent as a `--solver` on the command line:

``` bash
inspect eval ctf.py --solver inspect_swe/codex_cli
```

If you want to try this out locally, see the [system_explorer](https://github.com/meridianlabs-ai/inspect_swe/tree/main/examples/system_explorer/task.py) example.
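When evaluations are launched from scripts, it can help to assemble the `inspect eval` command programmatically. The following is a minimal sketch, not part of `inspect_swe`: the helper name `build_eval_command` and the model name are hypothetical placeholders. It uses the `--solver` and `-S` flags shown in this documentation, and assumes the standard `inspect eval` options `--model`, `--token-limit`, and `--time-limit`.

``` python
import shlex


def build_eval_command(task_file, solver, model=None, solver_args=None,
                       token_limit=None, time_limit=None):
    """Assemble an `inspect eval` argument list (hypothetical helper)."""
    argv = ["inspect", "eval", task_file, "--solver", solver]
    if model is not None:
        argv += ["--model", model]
    # Each solver argument becomes its own `-S key=value` pair.
    for key, value in (solver_args or {}).items():
        if isinstance(value, bool):
            # Render Python booleans in the lowercase form used on the CLI.
            value = "true" if value else "false"
        argv += ["-S", f"{key}={value}"]
    if token_limit is not None:
        argv += ["--token-limit", str(token_limit)]
    if time_limit is not None:
        argv += ["--time-limit", str(time_limit)]
    return argv


command = build_eval_command(
    "ctf.py",
    solver="inspect_swe/codex_cli",
    model="openai/gpt-5",  # placeholder model name (assumption)
    solver_args={"web_search": "disabled", "goals": False},
    token_limit=500_000,
)
print(shlex.join(command))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run()` without shell quoting concerns; `shlex.join()` is used only for display.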
## Options

The following options are supported for customizing the behavior of the agent:

| Option | Description |
|----|----|
| `system_prompt` | Additional system prompt to append to default system prompt. |
| `model_config` | Codex model slug used to select the system prompt and tool set. Defaults to `None`, which derives the slug from the model used by the agent so Codex’s prompt/tooling aligns with what’s actually running. |
| `skills` | Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. |
| `mcp_servers` | MCP servers (see [MCP Servers](#mcp-servers) below for details). |
| `bridged_tools` | Host-side Inspect tools to expose via MCP (see [Bridged Tools](#bridged-tools) below for details). |
| `web_search` | Web search mode. Use `"live"` for live web search, `"cached"` for cached web search, or `"disabled"` to disable web search. Defaults to `"live"`. |
| `goals` | Enable Codex goal tools. Defaults to `True`. |
| `centaur` | Run in [Centaur Mode](#centaur-mode), which makes Codex CLI available to an Inspect `human_cli()` agent rather than running it unattended. |
| `attempts` | Allow the agent to have multiple scored attempts at solving the task. |
| `model` | Model name to use for agent (defaults to main model for task). |
| `filter` | Filter for intercepting bridged model requests. |
| `retry_refusals` | Should refusals be retried? (pass number of times to retry) |
| `home_dir` | Home directory to use for Codex CLI. When set, `AGENTS.md` and the MCP configuration will be written here rather than to `.codex`. |
| `cwd` | Working directory for Codex CLI session. |
| `env` | Environment variables to set for Codex CLI. |
| `version` | Version of Codex CLI to use (see [Installation](#installation) below for details) |
| `config_overrides` | Additional Codex CLI configuration overrides. |

For example, here we specify a custom system prompt and disable the web search and goals tools:

``` python
codex_cli(
    system_prompt="You are an ace system researcher.",
    web_search="disabled",
    goals=False,
)
```

## MCP Servers

You can specify one or more [Model Context Protocol](https://modelcontextprotocol.io/docs/getting-started/intro) (MCP) servers to provide additional tools to Codex CLI. Servers are specified via the [`MCPServerConfig`](https://inspect.aisi.org.uk/reference/inspect_ai.tool.html#mcpserverconfig) class and its Stdio and HTTP variants.

For example, here is a Dockerfile that makes the `server-memory` MCP server available in the sandbox container:

``` dockerfile
FROM python:3.12-bookworm

# nodejs (required by mcp server)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y --no-install-recommends nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# memory mcp server
RUN npx --yes @modelcontextprotocol/server-memory --version

# run forever
CMD ["tail", "-f", "/dev/null"]
```

Note that we run the `npx` server during the build of the Dockerfile so that it is cached for use offline (below we’ll run it with the `--offline` option).

We can then use this MCP server in a task as follows:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.tool import MCPServerConfigStdio
from inspect_swe import codex_cli

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(
                input="What transport protocols are supported in "
                + "the 2025-03-26 version of the MCP spec?"
            )
        ],
        solver=codex_cli(
            system_prompt="Please use the web search tool to "
            + "research this question and the memory tools "
            + "to keep track of your research.",
            mcp_servers=[
                MCPServerConfigStdio(
                    name="memory",
                    command="npx",
                    args=[
                        "--offline",
                        "@modelcontextprotocol/server-memory"
                    ],
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

Note that we run the MCP server using the `--offline` option so that it doesn’t require an internet connection (which it would normally use to check for updates to the package).

## Bridged Tools

You can expose host-side Inspect tools to the sandboxed agent via the MCP protocol using the `bridged_tools` parameter. This allows you to run tools on the host (e.g. tools that access host resources, databases, or APIs) but make them available to the agent running inside the sandbox.

Tools are specified via [`BridgedToolsSpec`](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#bridgedtoolsspec) which wraps a list of Inspect tools:

``` python
from inspect_ai import Task, task
from inspect_ai.agent import BridgedToolsSpec
from inspect_ai.dataset import Sample
from inspect_ai.tool import tool
from inspect_swe import codex_cli

@tool
def search_database():
    async def execute(query: str) -> str:
        """Search the internal database.

        Args:
            query: The search query.
        """
        # This runs on the host, not in the sandbox
        return f"Results for: {query}"

    return execute

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(input="Search for information about MCP protocols.")
        ],
        solver=codex_cli(
            system_prompt="Use the search tool to research.",
            bridged_tools=[
                BridgedToolsSpec(
                    name="host_tools",
                    tools=[search_database()]
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

The `name` field identifies the MCP server and will be visible to the agent as a tool prefix. You can specify multiple `BridgedToolsSpec` instances to create separate MCP servers for different tool groups.
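To illustrate why separate server names are useful, here is a small standalone sketch of how grouping keeps identically named tools distinguishable. It does not use the Inspect API: `ToolGroup` is a hypothetical stand-in for the `name`/`tools` pair of a `BridgedToolsSpec`, and the `server__tool` naming with a `__` separator is an assumption made for illustration (the actual prefix format is determined by each agent).

``` python
from dataclasses import dataclass, field


@dataclass
class ToolGroup:
    """Hypothetical stand-in for a BridgedToolsSpec: a server name plus tool names."""
    name: str
    tools: list = field(default_factory=list)


def qualified_tool_names(groups, separator="__"):
    """Map each prefixed tool name to its (server, tool) pair.

    Raises ValueError if two groups share a server name, since the
    prefix would then no longer identify a single group.
    """
    seen = set()
    result = {}
    for group in groups:
        if group.name in seen:
            raise ValueError(f"duplicate server name: {group.name}")
        seen.add(group.name)
        for tool_name in group.tools:
            result[f"{group.name}{separator}{tool_name}"] = (group.name, tool_name)
    return result


groups = [
    ToolGroup("db_tools", ["search_database"]),
    ToolGroup("ticket_tools", ["search_database", "lookup_ticket"]),
]
names = qualified_tool_names(groups)
for qualified in sorted(names):
    print(qualified)
```

Both groups expose a tool called `search_database`, but the server prefix keeps the two apart from the agent's point of view.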

See the [Bridged Tools](https://inspect.aisi.org.uk/agent-bridge.html#bridged-tools) documentation for more details on the architecture and how tool execution flows between host and sandbox.

## Installation

By default, the agent will download the current stable version of Codex CLI and copy it to the sandbox. You can override this behaviour using the `version` option:

| Option | Description |
|----|----|
| `"auto"` | Use any available version of Codex CLI in the sandbox, otherwise download the latest version. |
| `"sandbox"` | Use the version of Codex CLI in the sandbox (raises `RuntimeError` if not available in the sandbox) |
| `"latest"` | Download and use the very latest version. |
| `"x.x.x"` | Download and use a specific version number. |

If you don’t ever want to rely on automatic downloads of Codex CLI (e.g. if you run your evaluations offline), you can use one of two approaches:

1.  Pre-install the version of Codex CLI you want to use in the sandbox, then use `version="sandbox"`:

    ``` python
    codex_cli(version="sandbox")
    ```

2.  Download the version of Codex CLI you want to use into the cache, then specify that version explicitly:

    ``` python
    # download the agent binary during installation/configuration
    download_agent_binary("codex_cli", "0.29.0", "linux-x64")

    # reference that version in your task (no download will occur)
    codex_cli(version="0.29.0")
    ```

    Note that the 5 most recently downloaded versions are retained in the cache. Use the [cached_agent_binaries()](./reference/index.html.md#cached_agent_binaries) function to list the contents of the cache.

## Centaur Mode

The `codex_cli()` agent can also be run in “centaur” mode which uses the Inspect AI [Human Agent](https://inspect.aisi.org.uk/human-agent.html) as the solver and makes [Codex CLI](https://github.com/openai/codex) available to the human user for help with the task. So rather than strictly measuring human vs.
model performance, you are able to measure performance of humans working collaboratively with a model.

Enable centaur mode by passing `centaur=True` to the `codex_cli()` agent:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import codex_cli

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=codex_cli(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also enable centaur mode from the CLI using a solver arg (`-S`):

``` bash
inspect eval ctf.py --solver inspect_swe/codex_cli -S centaur=true
```

You can also pass `CentaurOptions` to further customize the behavior of the human agent. For example:

``` python
from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=codex_cli(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)
```

See the [human_cli()](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#human_cli) documentation for details on available options.

## Troubleshooting

If Codex CLI doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the Codex CLI debug log after an evaluation task is complete. You can do this with:

``` bash
inspect trace dump --filter "Codex CLI"
```

# Gemini CLI – Inspect SWE

## Overview

The `gemini_cli()` agent uses the unattended mode of Google’s [Gemini CLI](https://github.com/google-gemini/gemini-cli) to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

> **NOTE: Gemini CLI Installation**
>
> By default, the agent will download the current stable version of Gemini CLI and copy it to the sandbox.
> You can also exercise more explicit control over which version of Gemini CLI is used; see the [Installation](#installation) section below for details.

## Basic Usage

Use the `gemini_cli()` agent as you would any Inspect agent. For example, here we use it as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import gemini_cli

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=gemini_cli(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also pass the agent as a `--solver` on the command line:

``` bash
inspect eval ctf.py --solver inspect_swe/gemini_cli
```

If you want to try this out locally, see the [system_explorer](https://github.com/meridianlabs-ai/inspect_swe/tree/main/examples/system_explorer/task.py) example.

## Options

The following options are supported for customizing the behavior of the agent:

| Option | Description |
|----|----|
| `system_prompt` | Additional system prompt to append to default system prompt. |
| `skills` | Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. |
| `mcp_servers` | MCP servers (see [MCP Servers](#mcp-servers) below for details). |
| `bridged_tools` | Host-side Inspect tools to expose via MCP (see [Bridged Tools](#bridged-tools) below for details). |
| `centaur` | Run in [Centaur Mode](#centaur-mode), which makes Gemini CLI available to an Inspect `human_cli()` agent rather than running it unattended. |
| `attempts` | Allow the agent to have multiple scored attempts at solving the task. |
| `model` | Model name to use for agent (defaults to main model for task). |
| `gemini_model` | Gemini model name to pass to CLI. This bypasses the auto-router. Use `"gemini-2.5-pro"` (default) or `"gemini-2.5-flash"`. |
| `filter` | Filter for intercepting bridged model requests. |
| `retry_refusals` | Should refusals be retried? (pass number of times to retry) |
| `cwd` | Working directory for Gemini CLI session. |
| `env` | Environment variables to set for Gemini CLI. |
| `version` | Version of Gemini CLI to use (see [Installation](#installation) below for details) |

For example, here we specify a custom system prompt:

``` python
gemini_cli(
    system_prompt="You are an ace system researcher."
)
```

## MCP Servers

You can specify one or more [Model Context Protocol](https://modelcontextprotocol.io/docs/getting-started/intro) (MCP) servers to provide additional tools to Gemini CLI. Servers are specified via the [`MCPServerConfig`](https://inspect.aisi.org.uk/reference/inspect_ai.tool.html#mcpserverconfig) class and its Stdio and HTTP variants.

For example, here is a Dockerfile that makes the `server-memory` MCP server available in the sandbox container:

``` dockerfile
FROM python:3.12-bookworm

# nodejs (required by mcp server)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y --no-install-recommends nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# memory mcp server
RUN npx --yes @modelcontextprotocol/server-memory --version

# run forever
CMD ["tail", "-f", "/dev/null"]
```

Note that we run the `npx` server during the build of the Dockerfile so that it is cached for use offline (below we’ll run it with the `--offline` option).

We can then use this MCP server in a task as follows:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.tool import MCPServerConfigStdio
from inspect_swe import gemini_cli

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(
                input="What transport protocols are supported in "
                + "the 2025-03-26 version of the MCP spec?"
            )
        ],
        solver=gemini_cli(
            system_prompt="Please use the web search tool to "
            + "research this question and the memory tools "
            + "to keep track of your research.",
            mcp_servers=[
                MCPServerConfigStdio(
                    name="memory",
                    command="npx",
                    args=[
                        "--offline",
                        "@modelcontextprotocol/server-memory"
                    ],
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

Note that we run the MCP server using the `--offline` option so that it doesn’t require an internet connection (which it would normally use to check for updates to the package).

## Bridged Tools

You can expose host-side Inspect tools to the sandboxed agent via the MCP protocol using the `bridged_tools` parameter. This allows you to run tools on the host (e.g. tools that access host resources, databases, or APIs) but make them available to the agent running inside the sandbox.

Tools are specified via [`BridgedToolsSpec`](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#bridgedtoolsspec) which wraps a list of Inspect tools:

``` python
from inspect_ai import Task, task
from inspect_ai.agent import BridgedToolsSpec
from inspect_ai.dataset import Sample
from inspect_ai.tool import tool
from inspect_swe import gemini_cli

@tool
def search_database():
    async def execute(query: str) -> str:
        """Search the internal database.

        Args:
            query: The search query.
        """
        # This runs on the host, not in the sandbox
        return f"Results for: {query}"

    return execute

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(input="Search for information about MCP protocols.")
        ],
        solver=gemini_cli(
            system_prompt="Use the search tool to research.",
            bridged_tools=[
                BridgedToolsSpec(
                    name="host_tools",
                    tools=[search_database()]
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

The `name` field identifies the MCP server and will be visible to the agent as a tool prefix. You can specify multiple `BridgedToolsSpec` instances to create separate MCP servers for different tool groups.

See the [Bridged Tools](https://inspect.aisi.org.uk/agent-bridge.html#bridged-tools) documentation for more details on the architecture and how tool execution flows between host and sandbox.

## Installation

By default, the agent will download the current stable version of Gemini CLI and copy it to the sandbox. You can override this behaviour using the `version` option:

| Option | Description |
|----|----|
| `"auto"` | Use any available version of Gemini CLI in the sandbox, otherwise download the latest version. |
| `"sandbox"` | Use the version of Gemini CLI in the sandbox (raises `RuntimeError` if not available in the sandbox) |
| `"latest"` | Download and use the very latest version. |
| `"x.x.x-preview.y"` | Download and use a specific version number. |

If you don’t ever want to rely on automatic downloads of Gemini CLI (e.g. if you run your evaluations offline), you can use one of two approaches:

1.  Pre-install the version of Gemini CLI you want to use in the sandbox, then use `version="sandbox"`:

    ``` python
    gemini_cli(version="sandbox")
    ```

2.  Download the version of Gemini CLI you want to use into the cache, then specify that version explicitly:

    ``` python
    # download the agent binary during installation/configuration
    download_agent_binary("gemini_cli", "0.29.0", "linux-x64")

    # reference that version in your task (no download will occur)
    gemini_cli(version="0.29.0")
    ```

    Note that the 5 most recently downloaded versions are retained in the cache. Use the [cached_agent_binaries()](./reference/index.html.md#cached_agent_binaries) function to list the contents of the cache.

## Centaur Mode

The `gemini_cli()` agent can also be run in “centaur” mode which uses the Inspect AI [Human Agent](https://inspect.aisi.org.uk/human-agent.html) as the solver and makes [Gemini CLI](https://github.com/google-gemini/gemini-cli) available to the human user for help with the task. So rather than strictly measuring human vs.
model performance, you are able to measure performance of humans working collaboratively with a model.

Enable centaur mode by passing `centaur=True` to the `gemini_cli()` agent:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import gemini_cli

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=gemini_cli(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also enable centaur mode from the CLI using a solver arg (`-S`):

``` bash
inspect eval ctf.py --solver inspect_swe/gemini_cli -S centaur=true
```

You can also pass `CentaurOptions` to further customize the behavior of the human agent. For example:

``` python
from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=gemini_cli(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)
```

See the [human_cli()](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#human_cli) documentation for details on available options.

## Troubleshooting

If Gemini CLI doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the Gemini CLI debug log after an evaluation task is complete. You can do this with:

``` bash
inspect trace dump --filter "Gemini CLI"
```

# OpenCode – Inspect SWE

## Overview

The `opencode()` agent uses the unattended mode of Anomaly’s [OpenCode](https://github.com/anomalyco/opencode) to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

> **NOTE: OpenCode Installation**
>
> By default, the agent will download the current stable version of OpenCode and copy it to the sandbox. You can also exercise more explicit control over which version of OpenCode is used; see the [Installation](#installation) section below for details.
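As a rough mental model of how a `version` value might be interpreted, the following standalone sketch follows the option table given for `claude_code()` earlier in this document. It is not the package's implementation: the function name `plan_install`, the returned action labels, and the version-number pattern are all hypothetical, and the values accepted by `opencode()` may differ.

``` python
import re

# Assumed pattern for explicit versions such as "0.29.0" or "0.30.0-preview.1".
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+(?:-[0-9A-Za-z.]+)?")


def plan_install(version, sandbox_version=None):
    """Return an (action, version) pair for a `version` option value.

    `sandbox_version` is the version already present in the sandbox,
    or None if the agent binary is not installed there.
    """
    if version == "sandbox":
        # Never download: the binary must already be in the sandbox.
        if sandbox_version is None:
            raise RuntimeError("agent binary not available in the sandbox")
        return ("use_sandbox", sandbox_version)
    if version == "auto":
        # Prefer whatever is in the sandbox, otherwise fall back to a download.
        if sandbox_version is not None:
            return ("use_sandbox", sandbox_version)
        return ("download", "stable")
    if version in ("stable", "latest"):
        return ("download", version)
    if VERSION_PATTERN.fullmatch(version):
        return ("download", version)
    raise ValueError(f"unrecognized version option: {version!r}")


print(plan_install("auto", sandbox_version="1.2.3"))
print(plan_install("auto"))
print(plan_install("0.29.0"))
```

The sketch makes the offline-friendly paths explicit: `"sandbox"` never triggers a download, and `"auto"` only does so when nothing is installed in the sandbox.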
## Basic Usage

Use the `opencode()` agent as you would any Inspect agent. For example, here we use it as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import opencode

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=opencode(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also pass the agent as a `--solver` on the command line:

``` bash
inspect eval ctf.py --solver inspect_swe/opencode
```

If you want to try this out locally, see the [system_explorer](https://github.com/meridianlabs-ai/inspect_swe/tree/main/examples/system_explorer/task.py) example.

## Options

The following options are supported for customizing the behavior of the agent:

| Option | Description |
|----|----|
| `system_prompt` | Additional system prompt to append to default system prompt. |
| `skills` | Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. |
| `mcp_servers` | MCP servers (see [MCP Servers](#mcp-servers) below for details). |
| `bridged_tools` | Host-side Inspect tools to expose via MCP (see [Bridged Tools](#bridged-tools) below for details). |
| `centaur` | Run in [Centaur Mode](#centaur-mode), which makes OpenCode available to an Inspect `human_cli()` agent rather than running it unattended. |
| `attempts` | Allow the agent to have multiple scored attempts at solving the task. |
| `model` | Model name to use for agent (defaults to main model for task). |
| `opencode_model` | OpenCode model identifier (`provider/model`) passed to the CLI. Default: `"anthropic/claude-sonnet-4-5"`. The actual model calls still go through the Inspect bridge; this just selects which provider client OpenCode uses to format the request. |
| `filter` | Filter for intercepting bridged model requests. |
| `retry_refusals` | Should refusals be retried?
(pass number of times to retry) |
| `cwd` | Working directory for OpenCode session. |
| `env` | Environment variables to set for OpenCode. |
| `version` | Version of OpenCode to use (see [Installation](#installation) below for details) |

For example, here we specify a custom system prompt:

``` python
opencode(
    system_prompt="You are an ace system researcher."
)
```

## MCP Servers

You can specify one or more [Model Context Protocol](https://modelcontextprotocol.io/docs/getting-started/intro) (MCP) servers to provide additional tools to OpenCode. Servers are specified via the [`MCPServerConfig`](https://inspect.aisi.org.uk/reference/inspect_ai.tool.html#mcpserverconfig) class and its Stdio and HTTP variants.

For example, here is a Dockerfile that makes the `server-memory` MCP server available in the sandbox container:

``` dockerfile
FROM python:3.12-bookworm

# nodejs (required by mcp server)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y --no-install-recommends nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# memory mcp server
RUN npx --yes @modelcontextprotocol/server-memory --version

# run forever
CMD ["tail", "-f", "/dev/null"]
```

Note that we run the `npx` server during the build of the Dockerfile so that it is cached for use offline (below we’ll run it with the `--offline` option).

We can then use this MCP server in a task as follows:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.tool import MCPServerConfigStdio
from inspect_swe import opencode

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(
                input="What transport protocols are supported in "
                + "the 2025-03-26 version of the MCP spec?"
            )
        ],
        solver=opencode(
            system_prompt="Please use the web search tool to "
            + "research this question and the memory tools "
            + "to keep track of your research.",
            mcp_servers=[
                MCPServerConfigStdio(
                    name="memory",
                    command="npx",
                    args=[
                        "--offline",
                        "@modelcontextprotocol/server-memory"
                    ],
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

Note that we run the MCP server using the `--offline` option so that it doesn’t require an internet connection (which it would normally use to check for updates to the package).

## Bridged Tools

You can expose host-side Inspect tools to the sandboxed agent via the MCP protocol using the `bridged_tools` parameter. This allows you to run tools on the host (e.g. tools that access host resources, databases, or APIs) but make them available to the agent running inside the sandbox.

Tools are specified via [`BridgedToolsSpec`](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#bridgedtoolsspec) which wraps a list of Inspect tools:

``` python
from inspect_ai import Task, task
from inspect_ai.agent import BridgedToolsSpec
from inspect_ai.dataset import Sample
from inspect_ai.tool import tool
from inspect_swe import opencode

@tool
def search_database():
    async def execute(query: str) -> str:
        """Search the internal database.

        Args:
            query: The search query.
        """
        # This runs on the host, not in the sandbox
        return f"Results for: {query}"
    return execute

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(input="Search for information about MCP protocols.")
        ],
        solver=opencode(
            system_prompt="Use the search tool to research.",
            bridged_tools=[
                BridgedToolsSpec(
                    name="host_tools",
                    tools=[search_database()]
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

The `name` field identifies the MCP server and will be visible to the agent as a tool prefix. You can specify multiple `BridgedToolsSpec` instances to create separate MCP servers for different tool groups.
See the [Bridged Tools](https://inspect.aisi.org.uk/agent-bridge.html#bridged-tools) documentation for more details on the architecture and how tool execution flows between host and sandbox.

## Installation

By default, the agent will download the latest version of OpenCode and copy it to the sandbox. You can override this behaviour using the `version` option:

| Option | Description |
|----|----|
| `"auto"` | Use any available version of OpenCode in the sandbox, otherwise download the latest version. |
| `"sandbox"` | Use the version of OpenCode in the sandbox (raises `RuntimeError` if not available in the sandbox) |
| `"latest"` | Download and use the very latest version. |
| `"x.x.x"` | Download and use a specific version number. |

If you don’t ever want to rely on automatic downloads of OpenCode (e.g. if you run your evaluations offline), you can use one of two approaches:

1.  Pre-install the version of OpenCode you want to use in the sandbox, then use `version="sandbox"`:

    ``` python
    opencode(version="sandbox")
    ```

2.  Download the version of OpenCode you want to use into the cache, then specify that version explicitly:

    ``` python
    # download the agent binary during installation/configuration
    download_agent_binary("opencode", "0.29.0", "linux-x64")

    # reference that version in your task (no download will occur)
    opencode(version="0.29.0")
    ```

Note that the 5 most recently downloaded versions are retained in the cache. Use the [cached_agent_binaries()](./reference/index.html.md#cached_agent_binaries) function to list the contents of the cache.

## Centaur Mode

The `opencode()` agent can also be run in “centaur” mode which uses the Inspect AI [Human Agent](https://inspect.aisi.org.uk/human-agent.html) as the solver and makes [OpenCode](https://github.com/anomalyco/opencode) available to the human user for help with the task. So rather than strictly measuring human vs.
model performance, you are able to measure performance of humans working collaboratively with a model.

Enable centaur mode by passing `centaur=True` to the `opencode()` agent:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import opencode

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=opencode(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also enable centaur mode from the CLI using a solver arg (`-S`):

``` bash
inspect eval ctf.py --solver inspect_swe/opencode -S centaur=true
```

You can also pass `CentaurOptions` to further customize the behavior of the human agent. For example:

``` python
from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=opencode(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)
```

See the [human_cli()](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#human_cli) documentation for details on available options.

## Troubleshooting

If OpenCode doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the OpenCode debug log after an evaluation task is complete. You can do this with:

``` bash
inspect trace dump --filter "OpenCode"
```

# Mini SWE Agent – Inspect SWE

## Overview

The `mini_swe_agent()` agent uses the unattended mode of SWE-agent [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent) to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

> **NOTE: mini-swe-agent Installation**
>
> By default, the agent will download the current stable version of mini-swe-agent and copy it to the sandbox.
You can also exercise more explicit control over which version of mini-swe-agent is used; see the [Installation](#installation) section below for details.

## Basic Usage

Use the `mini_swe_agent()` agent as you would any Inspect agent. For example, here we use it as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import mini_swe_agent

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=mini_swe_agent(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also pass the agent as a `--solver` on the command line:

``` bash
inspect eval ctf.py --solver inspect_swe/mini_swe_agent
```

If you want to try this out locally, see the [system_explorer](https://github.com/meridianlabs-ai/inspect_swe/tree/main/examples/system_explorer/task.py) example.

## Options

The following options are supported for customizing the behavior of the agent:

| Option | Description |
|----|----|
| `system_prompt` | Additional system prompt to append to default system prompt. |
| `centaur` | Run in [Centaur Mode](#centaur-mode), which makes mini-swe-agent available to an Inspect `human_cli()` agent rather than running it unattended. |
| `attempts` | Allow the agent to have multiple scored attempts at solving the task. |
| `model` | Model name to use for agent (defaults to main model for task). |
| `filter` | Filter for intercepting bridged model requests. |
| `retry_refusals` | Should refusals be retried? (pass number of times to retry) |
| `compaction` | Compaction strategy for managing context window overflow. |
| `cwd` | Working directory for mini-swe-agent session. |
| `env` | Environment variables to set for mini-swe-agent. |
| `user` | User to execute mini-swe-agent as in the sandbox. |
| `sandbox` | Sandbox environment name.
| | `version` | Version of mini-swe-agent to use (see [Installation](#installation) below for details) |

For example, here we specify a custom system prompt:

``` python
mini_swe_agent(
    system_prompt="You are an ace system researcher.",
)
```

## Installation

By default, the agent will install the current stable version of mini-swe-agent in the sandbox via Python wheels. You can override this behaviour using the `version` option:

| Option | Description |
|----|----|
| `"stable"` | Install and use the default pinned stable version. |
| `"sandbox"` | Use the version of mini-swe-agent in the sandbox (raises `RuntimeError` if not available in the sandbox) |
| `"latest"` | Install and use the latest version from PyPI. |
| `"x.x.x"` | Install and use a specific version number. |

Unlike the other agents which use standalone binaries, mini-swe-agent is installed via Python wheels using `uv`.

If you don’t ever want to rely on automatic installation of mini-swe-agent (e.g. if you run your evaluations offline), pre-install it in the sandbox and tell the agent to use that copy:

1.  Pre-install the version of mini-swe-agent you want to use in your sandbox Dockerfile:

    ``` dockerfile
    RUN pip install mini-swe-agent==2.2.3
    ```

2.  Then reference it with `version="sandbox"` in your task:

    ``` python
    mini_swe_agent(version="sandbox")
    ```

## Centaur Mode

The `mini_swe_agent()` agent can also be run in “centaur” mode which uses the Inspect AI [Human Agent](https://inspect.aisi.org.uk/human-agent.html) as the solver and makes [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent) available to the human user for help with the task. So rather than strictly measuring human vs. model performance, you are able to measure performance of humans working collaboratively with a model.
Enable centaur mode by passing `centaur=True` to the `mini_swe_agent()` agent:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import mini_swe_agent

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=mini_swe_agent(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also enable centaur mode from the CLI using a solver arg (`-S`):

``` bash
inspect eval ctf.py --solver inspect_swe/mini_swe_agent -S centaur=true
```

You can also pass `CentaurOptions` to further customize the behavior of the human agent. For example:

``` python
from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=mini_swe_agent(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)
```

See the [human_cli()](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#human_cli) documentation for details on available options.

## Troubleshooting

If mini-swe-agent doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the mini-swe-agent debug log after an evaluation task is complete. You can do this with:

``` bash
inspect trace dump --filter "mini-swe-agent"
```

# Reference – Inspect SWE

## Agents

### claude_code

Claude Code agent.

Agent that uses [Claude Code](https://docs.anthropic.com/en/docs/claude-code/overview) running in a sandbox.

The agent can either use a version of Claude Code installed in the sandbox, or can download a version and install it in the sandbox (see docs on `version` option below for details).

Use `disallowed_tools` to control access to tools. See [Tools available to Claude](https://docs.anthropic.com/en/docs/claude-code/settings#tools-available-to-claude) for the list of built-in tools which can be disallowed.
Use the `attempts` option to enable additional submissions if the initial submission(s) are incorrect (by default, no additional attempts are permitted). [Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/_claude_code/claude_code.py#L51) ``` python @agent def claude_code( name: str = "Claude Code", description: str = dedent(""" Autonomous coding agent capable of writing, testing, debugging, and iterating on code across multiple languages. """), system_prompt: str | None = None, skills: Sequence[str | Path | Skill] | None = None, mcp_servers: Sequence[MCPServerConfig] | None = None, bridged_tools: Sequence[BridgedToolsSpec] | None = None, disallowed_tools: list[str] | None = None, centaur: bool | CentaurOptions = False, attempts: int | AgentAttempts = 1, model: str | None = None, model_config: str | None = None, model_aliases: dict[str, str | Model] | None = None, opus_model: str | None = None, sonnet_model: str | None = None, haiku_model: str | None = None, subagent_model: str | None = None, filter: GenerateFilter | None = None, auto_mode: bool = False, retry_refusals: int | None = 3, retry_uncaught_errors: int | None = 3, cwd: str | None = None, env: dict[str, str] | None = None, user: str | None = None, sandbox: str | None = None, version: Literal["auto", "sandbox", "stable", "latest"] | str = "auto", debug: bool | None = None, ) -> Agent ``` `name` str Agent name (used in multi-agent systems with `as_tool()` and `handoff()`) `description` str Agent description (used in multi-agent systems with `as_tool()` and `handoff()`) `system_prompt` str \| None Additional system prompt to append to default system prompt. `skills` Sequence\[str \| Path \| Skill\] \| None Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. `mcp_servers` Sequence\[MCPServerConfig\] \| None MCP servers to make available to the agent. 
`bridged_tools` Sequence\[BridgedToolsSpec\] \| None Host-side Inspect tools to expose to the agent via MCP. Each BridgedToolsSpec creates an MCP server that makes the specified tools available to the agent running in the sandbox. `disallowed_tools` list\[str\] \| None List of tool names to disallow entirely. `centaur` bool \| CentaurOptions Run in ‘centaur’ mode, which makes Claude Code available to an Inspect `human_cli()` agent rather than running it unattended. `attempts` int \| AgentAttempts Configure agent to make multiple attempts. When this is specified, the task will be scored when the agent stops calling tools. If the scoring is successful, execution will stop. Otherwise, the agent will be prompted to pick up where it left off for another attempt. `model` str \| None Model name to use for Opus and Sonnet calls (defaults to main model for task). `model_config` str \| None Model id used to select the identity Claude Code presents to itself (its “You are powered by the model …” system prompt) and any model-gated client behavior. Defaults to `None`, which derives it from the real served model so the presented identity matches what’s actually running. Purely the displayed identity — calls are still bridged to the served Inspect model regardless. (Claude Code renders the genuine name/cutoff for recognized Anthropic ids and shows other ids verbatim.) `model_aliases` dict\[str, str \| Model\] \| None Optional mapping of model names to Model instances or model name strings. Allows using custom Model implementations (e.g., wrapped Agents) instead of standard models. When a model name in the mapping is referenced, the corresponding Model/string is used. `opus_model` str \| None The model to use for `opus`, or for `opusplan` when Plan Mode is active. Defaults to `model`. `sonnet_model` str \| None The model to use for `sonnet`, or for `opusplan` when Plan Mode is not active. Defaults to `model`. 
`haiku_model` str \| None The model to use for haiku, or [background functionality](https://code.claude.com/docs/en/costs#background-token-usage). Defaults to `model`. `subagent_model` str \| None The model to use for [subagents](https://code.claude.com/docs/en/sub-agents). Defaults to `model`. `filter` GenerateFilter \| None Filter for intercepting bridged model requests. `auto_mode` bool Use `auto` permission mode rather than `--dangerously-skip-permissions`. Note that this can result in rejected tool calls so only enable if your evaluation can tolerate this. `retry_refusals` int \| None Should refusals be retried? Defaults to retrying up to 3 times. `retry_uncaught_errors` int \| None Should uncaught errors (unexpected crashes of Claude Code) be retried. Defaults to retrying up to 3 times. `cwd` str \| None Working directory to run claude code within. `env` dict\[str, str\] \| None Environment variables to set for claude code. `user` str \| None User to execute claude code with. `sandbox` str \| None Optional sandbox environment name. `version` Literal\['auto', 'sandbox', 'stable', 'latest'\] \| str Version of claude code to use. One of: - “auto”: Use any available version of claude code in the sandbox, otherwise download the current stable version. - “sandbox”: Use the version of claude code in the sandbox (raises `RuntimeError` if claude is not available in the sandbox) - “stable”: Download and use the current stable version of claude code. - “latest”: Download and use the very latest version of claude code. - “x.x.x”: Download and use a specific version of claude code. `debug` bool \| None Add `--debug` cli flag and trace all debug output. ### codex_cli Codex CLI. Agent that uses OpenAI [Codex CLI](https://github.com/openai/codex) running in a sandbox. Use the `attempts` option to enable additional submissions if the initial submission(s) are incorrect (by default, no additional attempts are permitted). 
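The control flow behind `attempts` can be pictured with a small standalone sketch. This is an illustrative model of the documented behavior (score when the agent stops, finish on success, otherwise continue), not the library's implementation; `run_with_attempts`, `work`, and `is_correct` are hypothetical names.

``` python
from typing import Callable


def run_with_attempts(
    work: Callable[[int], str],
    is_correct: Callable[[str], bool],
    attempts: int = 1,
) -> tuple[str, int]:
    """Return the final submission and the number of attempts used."""
    submission = ""
    for attempt in range(1, attempts + 1):
        # the agent works until it stops calling tools, then submits
        submission = work(attempt)
        # the submission is scored; a correct answer ends the run
        if is_correct(submission):
            return submission, attempt
        # otherwise the loop continues and the agent picks up where it left off
    return submission, attempts


# hypothetical agent that only gets the answer right on its second try
answers = {1: "41", 2: "42"}
result = run_with_attempts(
    lambda n: answers.get(n, "?"),
    lambda s: s == "42",
    attempts=3,
)
print(result)
```

With the default of one attempt, the same hypothetical agent would finish with its first, incorrect submission.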
[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/_codex_cli/codex_cli.py#L64) ``` python def codex_cli( name: str = ..., description: str = ..., system_prompt: str | None = ..., model_config: str | None = ..., skills: Sequence[str | Path | Skill] | None = ..., mcp_servers: Sequence[MCPServerConfig] | None = ..., bridged_tools: Sequence[BridgedToolsSpec] | None = ..., web_search: CodexWebSearch = ..., goals: bool = ..., centaur: bool | CentaurOptions = ..., attempts: int | AgentAttempts = ..., model: str | None = ..., model_aliases: dict[str, str | Model] | None = ..., filter: GenerateFilter | None = ..., retry_refusals: int | None = ..., home_dir: str | None = ..., cwd: str | None = ..., env: dict[str, str] | None = ..., user: str | None = ..., sandbox: str | None = ..., version: Literal['auto', 'sandbox', 'latest'] | str = ..., config_overrides: dict[str, str] | None = ..., debug: bool | None = ..., *, disallowed_tools: list[Literal['web_search']] | None = ..., ) -> Agent ``` `name` str Agent name (used in multi-agent systems with `as_tool()` and `handoff()`) `description` str Agent description (used in multi-agent systems with `as_tool()` and `handoff()`) `system_prompt` str \| None Additional system prompt to append to default system prompt. `model_config` str \| None Codex model slug used to select the system prompt and tool set. Defaults to `None`, which derives the slug from the real model so Codex’s prompt/tooling aligns with what’s actually running. Pass an explicit slug to override. `skills` Sequence\[str \| Path \| Skill\] \| None Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. `mcp_servers` Sequence\[MCPServerConfig\] \| None MCP servers to make available to the agent. `bridged_tools` Sequence\[BridgedToolsSpec\] \| None Host-side Inspect tools to expose to the agent via MCP. 
Each BridgedToolsSpec creates an MCP server that makes the specified tools available to the agent running in the sandbox. `web_search` CodexWebSearch Web search mode. Use “live” for live web search, “cached” for cached web search, or “disabled” to disable web search. Defaults to “live”. `goals` bool Enable Codex goal tools (defaults to `True`). `centaur` bool \| CentaurOptions Run in ‘centaur’ mode, which makes Codex CLI available to an Inspect `human_cli()` agent rather than running it unattended. `attempts` int \| AgentAttempts Configure agent to make multiple attempts. When this is specified, the task will be scored when the agent stops calling tools. If the scoring is successful, execution will stop. Otherwise, the agent will be prompted to pick up where it left off for another attempt. `model` str \| None Model name to use (defaults to main model for task). `model_aliases` dict\[str, str \| Model\] \| None Optional mapping of model names to Model instances or model name strings. Allows using custom Model implementations (e.g., wrapped Agents) instead of standard models. When a model name in the mapping is referenced, the corresponding Model/string is used. `filter` GenerateFilter \| None Filter for intercepting bridged model requests. `retry_refusals` int \| None Should refusals be retried? (pass number of times to retry) `home_dir` str \| None Home directory to use for codex cli. If set, AGENTS.md, skills, and the MCP configuration will be written here. `cwd` str \| None Working directory to run codex cli within. `env` dict\[str, str\] \| None Environment variables to set for codex cli `user` str \| None User to execute codex cli with. `sandbox` str \| None Optional sandbox environment name. `version` Literal\['auto', 'sandbox', 'latest'\] \| str Version of codex cli to use. One of: - “auto”: Use any available version of codex cli in the sandbox, otherwise download the latest version. 
- “sandbox”: Use the version of codex cli in the sandbox (raises `RuntimeError` if codex is not available in the sandbox) - “latest”: Download and use the very latest version of codex cli. - “x.x.x”: Download and use a specific version of codex cli. `config_overrides` dict\[str, str\] \| None Additional Codex CLI configuration overrides. Each key-value pair is passed as `-c key=value` to the CLI. `debug` bool \| None Trace all debug output. `disallowed_tools` list\[Literal\['web_search'\]\] \| None ### gemini_cli Gemini CLI agent. Agent that uses Google [Gemini CLI](https://github.com/google-gemini/gemini-cli) running in a sandbox with Inspect model bridging. Use the `attempts` option to enable additional submissions if the initial submission(s) are incorrect (by default, no additional attempts are permitted). [Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/_gemini_cli/gemini_cli.py#L33) ``` python @agent def gemini_cli( name: str = "Gemini CLI", description: str = dedent(""" Autonomous coding agent capable of writing, testing, debugging, and iterating on code across multiple languages. 
"""), system_prompt: str | None = None, skills: Sequence[str | Path | Skill] | None = None, mcp_servers: Sequence[MCPServerConfig] | None = None, bridged_tools: Sequence[BridgedToolsSpec] | None = None, centaur: bool | CentaurOptions = False, attempts: int | AgentAttempts = 1, model: str | None = None, model_aliases: dict[str, str | Model] | None = None, gemini_model: str = "gemini-2.5-pro", filter: GenerateFilter | None = None, retry_refusals: int | None = None, cwd: str | None = None, env: dict[str, str] | None = None, user: str | None = None, sandbox: str | None = None, version: Literal["auto", "sandbox", "stable", "latest"] | str = "auto", debug: bool | None = None, ) -> Agent ``` `name` str Agent name (used in multi-agent systems with `as_tool()` and `handoff()`) `description` str Agent description `system_prompt` str \| None Additional system prompt to append `skills` Sequence\[str \| Path \| Skill\] \| None Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. `mcp_servers` Sequence\[MCPServerConfig\] \| None MCP servers to make available to the agent `bridged_tools` Sequence\[BridgedToolsSpec\] \| None Host-side Inspect tools to expose to the agent via MCP `centaur` bool \| CentaurOptions Run in ‘centaur’ mode, which makes Gemini CLI available to an Inspect `human_cli()` agent rather than running it unattended. `attempts` int \| AgentAttempts Configure agent to make multiple attempts `model` str \| None Model name to use for inspect bridge (defaults to main model for task) `model_aliases` dict\[str, str \| Model\] \| None Optional mapping of model names to Model instances or model name strings. Allows using custom Model implementations (e.g., wrapped Agents) instead of standard models. When a model name in the mapping is referenced, the corresponding Model/string is used. `gemini_model` str Gemini model name to pass to CLI. This bypasses the auto-router. 
Use “gemini-2.5-pro” (default) or “gemini-2.5-flash”. The actual model calls still go through the inspect bridge, but this disables the router. `filter` GenerateFilter \| None Filter for intercepting bridged model requests `retry_refusals` int \| None Should refusals be retried? (pass number of times to retry) `cwd` str \| None Working directory to run gemini cli within `env` dict\[str, str\] \| None Environment variables to set for gemini cli `user` str \| None User to execute gemini cli with `sandbox` str \| None Optional sandbox environment name `version` Literal\['auto', 'sandbox', 'stable', 'latest'\] \| str Version of gemini cli to use. One of: - “auto”: Use any available version in sandbox, otherwise download latest - “sandbox”: Use sandbox version (raises RuntimeError if not available) - “stable”/“latest”: Download and use the latest version - “x.x.x”: Download and use a specific version `debug` bool \| None Trace all debug output. ### opencode OpenCode agent. Agent that uses [OpenCode](https://github.com/anomalyco/opencode) running in a sandbox with Inspect model bridging. Use the `attempts` option to enable additional submissions if the initial submission(s) are incorrect (by default, no additional attempts are permitted). [Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/_opencode/opencode.py#L32) ``` python @agent def opencode( name: str = "OpenCode", description: str = dedent(""" Open-source autonomous coding agent for the terminal, capable of writing, testing, debugging, and iterating on code across multiple languages. 
"""), system_prompt: str | None = None, skills: Sequence[str | Path | Skill] | None = None, mcp_servers: Sequence[MCPServerConfig] | None = None, bridged_tools: Sequence[BridgedToolsSpec] | None = None, centaur: bool | CentaurOptions = False, attempts: int | AgentAttempts = 1, model: str | None = None, model_aliases: dict[str, str | Model] | None = None, opencode_model: str = "anthropic/claude-sonnet-4-5", filter: GenerateFilter | None = None, retry_refusals: int | None = None, cwd: str | None = None, env: dict[str, str] | None = None, user: str | None = None, sandbox: str | None = None, version: Literal["auto", "sandbox", "stable", "latest"] | str = "auto", debug: bool | None = None, ) -> Agent ``` `name` str Agent name (used in multi-agent systems with `as_tool()` and `handoff()`) `description` str Agent description `system_prompt` str \| None Additional system prompt to append `skills` Sequence\[str \| Path \| Skill\] \| None Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. `mcp_servers` Sequence\[MCPServerConfig\] \| None MCP servers to make available to the agent `bridged_tools` Sequence\[BridgedToolsSpec\] \| None Host-side Inspect tools to expose to the agent via MCP `centaur` bool \| CentaurOptions Run in ‘centaur’ mode, which makes OpenCode available to an Inspect `human_cli()` agent rather than running it unattended. `attempts` int \| AgentAttempts Configure agent to make multiple attempts `model` str \| None Model name to use for inspect bridge (defaults to main model for task) `model_aliases` dict\[str, str \| Model\] \| None Optional mapping of model names to Model instances or model name strings. Allows using custom Model implementations (e.g., wrapped Agents) instead of standard models. When a model name in the mapping is referenced, the corresponding Model/string is used. 
`opencode_model` str OpenCode model identifier to pass to the CLI in the form `provider/model` (default: `"anthropic/claude-sonnet-4-5"`). The actual model calls still go through the Inspect bridge; this just selects which provider client OpenCode uses to format the request. `filter` GenerateFilter \| None Filter for intercepting bridged model requests `retry_refusals` int \| None Should refusals be retried? (pass number of times to retry) `cwd` str \| None Working directory to run opencode within `env` dict\[str, str\] \| None Environment variables to set for opencode `user` str \| None User to execute opencode with `sandbox` str \| None Optional sandbox environment name `version` Literal\['auto', 'sandbox', 'stable', 'latest'\] \| str Version of opencode to use. One of: - “auto”: Use any available version in sandbox, otherwise download latest - “sandbox”: Use sandbox version (raises RuntimeError if not available) - “stable”/“latest”: Download and use the latest version - “x.x.x”: Download and use a specific version `debug` bool \| None Trace all debug output. ### mini_swe_agent mini-swe-agent agent. Agent that uses [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent) running in a sandbox. Mini-swe-agent is a minimal 100-line agent that solves GitHub issues using only bash commands. The agent can either use a version of mini-swe-agent installed in the sandbox, or can download and install it via pip (see docs on `version` option below). Use `attempts` to enable additional submissions if initial submission(s) are incorrect (by default, no additional attempts are permitted). This agent does not handle compaction natively. Use `compaction` to specify a compaction strategy. 
[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/_mini_swe_agent/mini_swe_agent.py#L48)

``` python
@agent
def mini_swe_agent(
    name: str = "mini-swe-agent",
    description: str = dedent("""
       Minimal AI agent that solves software engineering tasks using bash commands. 100 lines of Python, radically simple, scores >74% on SWE-bench verified.
    """),
    system_prompt: str | None = None,
    centaur: bool | CentaurOptions = False,
    attempts: int | AgentAttempts = 1,
    model: str | None = None,
    model_aliases: dict[str, str | Model] | None = None,
    filter: GenerateFilter | None = None,
    retry_refusals: int | None = None,
    compaction: CompactionStrategy | None = None,
    cwd: str | None = None,
    env: dict[str, str] | None = None,
    user: str | None = None,
    sandbox: str | None = None,
    version: Literal["stable", "sandbox", "latest"] | str = "stable",
    debug: bool | None = None,
) -> Agent
```

`name` str
Agent name (used in multi-agent systems with `as_tool()` and `handoff()`)

`description` str
Agent description (used in multi-agent systems)

`system_prompt` str \| None
Additional system prompt to include (appended to any system messages from the task).

`centaur` bool \| CentaurOptions
Run in ‘centaur’ mode, which makes mini-swe-agent available to an Inspect `human_cli()` agent rather than running it unattended.

`attempts` int \| AgentAttempts
Configure agent to make multiple attempts.

`model` str \| None
Model name to use (defaults to main model for task).

`model_aliases` dict\[str, str \| Model\] \| None
Optional mapping of model names to Model instances or model name strings. Allows using custom Model implementations (e.g., wrapped Agents) instead of standard models. When a model name in the mapping is referenced, the corresponding Model/string is used.

`filter` GenerateFilter \| None
Filter for intercepting bridged model requests.

`retry_refusals` int \| None
Should refusals be retried?
(pass number of times to retry)

`compaction` CompactionStrategy \| None
Compaction strategy for managing context window overflow.

`cwd` str \| None
Working directory to run mini-swe-agent within.

`env` dict\[str, str\] \| None
Environment variables to set for mini-swe-agent.

`user` str \| None
User to execute mini-swe-agent with.

`sandbox` str \| None
Optional sandbox environment name.

`version` Literal\['stable', 'sandbox', 'latest'\] \| str
Version of mini-swe-agent to use. One of:

- “stable”: Download and install the default pinned version.
- “sandbox”: Use version in sandbox (raises RuntimeError if not available)
- “latest”: Download and install latest version from PyPI.
- “x.x.x”: Install and use a specific version.

`debug` bool \| None
Trace all debug output.

## Binaries

### download_agent_binary

Download agent binary.

Download an agent binary. This version will be added to the cache of downloaded versions (which retains the 5 most recently downloaded versions).

Use this if you need to ensure that a specific version of an agent binary is downloaded in advance (e.g. if you are going to run your evaluations offline). After downloading, explicit requests for the downloaded version (e.g. `claude_code(version="1.0.98")`) will not require network access.

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/_tools/download.py#L53)

``` python
def download_agent_binary(
    binary: Literal["claude_code", "codex_cli"],
    version: Literal["stable", "latest"] | str,
    platform: SandboxPlatform,
) -> None
```

`binary` Literal\['claude_code', 'codex_cli'\]
Type of binary to download

`version` Literal\['stable', 'latest'\] \| str
Version to download (“stable”, “latest”, or an explicit version number).

`platform` [SandboxPlatform](../reference/index.html.md#sandboxplatform)
Target platform (“linux-x64”, “linux-arm64”, “linux-x64-musl”, or “linux-arm64-musl”)

### cached_agent_binaries

List the agent binaries which have been cached on this system.

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/_tools/download.py#L80)

``` python
def cached_agent_binaries(
    binary: Literal["claude_code", "codex_cli"] | None = None, quiet: bool = False
) -> AgentBinaries
```

`binary` Literal\['claude_code', 'codex_cli'\] \| None
Type of binary to list (lists all if not specified).

`quiet` bool
Do not print the binaries as a side effect

### download_wheels_tarball

Download all wheels for a package and its dependencies.

Downloads wheels from PyPI for the specified platform and Python version, then bundles them into a tarball for offline installation in sandbox. Downloaded wheels are cached locally (retaining 5 most recent versions).

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/_util/agentwheel.py#L304)

``` python
def download_wheels_tarball(
    package_name: str,
    version: str | None,
    platform: SandboxPlatform,
    python_version: str,
) -> tuple[bytes, str]
```

`package_name` str
PyPI package name (e.g., “mini-swe-agent”)

`version` str \| None
Package version or None for latest

`platform` [SandboxPlatform](../reference/index.html.md#sandboxplatform)
Target sandbox platform

`python_version` str
Python version without dots (e.g., “312”)

### AgentBinary

Agent binary.

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/_tools/download.py#L15)

``` python
class AgentBinary(NamedTuple)
```

#### Attributes

`agent` Literal\['claude_code', 'codex_cli'\]
Agent type.

`version` str
Agent version.

`path` Path
Agent path.

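The download helpers above take a `platform` string and, for wheels, a dotless `python_version`. The following is a minimal sketch of deriving those two arguments from values such as `uname -m` output. `PlatformName`, `sandbox_platform()`, and `python_tag()` are hypothetical names that are not part of `inspect_swe`, and the architecture alias table is an assumption about common machine names rather than anything the package defines.

``` python
from typing import Literal, cast

# Mirrors the four platform strings listed in this reference.
PlatformName = Literal[
    "linux-x64", "linux-arm64", "linux-x64-musl", "linux-arm64-musl"
]

# Assumed aliases for machine names as reported by `uname -m`.
_ARCH_ALIASES = {
    "x86_64": "x64",
    "amd64": "x64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def sandbox_platform(machine: str, musl: bool = False) -> PlatformName:
    """Map a machine name (plus libc flavour) to a platform string."""
    arch = _ARCH_ALIASES.get(machine.strip().lower())
    if arch is None:
        raise ValueError(f"unsupported architecture: {machine!r}")
    suffix = "-musl" if musl else ""
    return cast(PlatformName, f"linux-{arch}{suffix}")


def python_tag(version: str) -> str:
    """Convert a dotted Python version such as "3.12.4" to "312"."""
    major, minor = version.split(".")[:2]
    return f"{major}{minor}"
```

The results could then be passed as the `platform` argument of `download_agent_binary()`, or as the `platform` and `python_version` arguments of `download_wheels_tarball()`.
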
### SandboxPlatform

Target platform identifier for sandbox binary and wheel downloads.

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/_util/sandbox.py#L5)

``` python
SandboxPlatform: TypeAlias = Literal[
    "linux-x64", "linux-arm64", "linux-x64-musl", "linux-arm64-musl"
]
```

## ACP

### interactive_claude_code

Claude Code agent via ACP.

Uses the `claude-agent-acp` adapter in a sandbox. Supports multi-turn sessions and mid-turn interrupts.

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/acp/_agents/claude_code/claude_code.py#L167)

``` python
def interactive_claude_code(
    *,
    disallowed_tools: list[str] | None = ...,
    skills: list[str | Path | Skill] | None = ...,
    opus_model: str | Model | None = ...,
    sonnet_model: str | Model | None = ...,
    haiku_model: str | Model | None = ...,
    subagent_model: str | Model | None = ...,
    model: str | Model | None = ...,
    filter: GenerateFilter | None = ...,
    bridged_tools: list[BridgedToolsSpec] | None = ...,
    mcp_servers: list[MCPServerConfig] | None = ...,
    system_prompt: str | None = ...,
    retry_refusals: int | None = ...,
    model_map: dict[str, str | Model] | None = ...,
    cwd: str | None = ...,
    env: dict[str, str] | None = ...,
    user: str | None = ...,
    sandbox: str | None = ...,
) -> ACPAgent
```

`disallowed_tools` list\[str\] \| None
Tool names to disallow.

`skills` list\[str \| Path \| Skill\] \| None
Additional skills to make available.

`opus_model` str \| Model \| None
Model for opus calls.

`sonnet_model` str \| Model \| None
Model for sonnet calls.

`haiku_model` str \| Model \| None
Model for haiku / background calls.

`subagent_model` str \| Model \| None
Model for subagents.

`model` str \| Model \| None

`filter` GenerateFilter \| None

`bridged_tools` list\[BridgedToolsSpec\] \| None

`mcp_servers` list\[MCPServerConfig\] \| None

`system_prompt` str \| None

`retry_refusals` int \| None

`model_map` dict\[str, str \| Model\] \| None

`cwd` str \| None

`env` dict\[str, str\] \| None

`user` str \| None

`sandbox` str \| None

### interactive_codex_cli

Codex CLI agent via ACP.

Uses the `codex-acp` adapter in a sandbox. Supports multi-turn sessions and mid-turn interrupts.

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/acp/_agents/codex_cli/codex_cli.py#L197)

``` python
def interactive_codex_cli(
    *,
    web_search: CodexWebSearch = ...,
    goals: bool = ...,
    skills: list[str | Path | Skill] | None = ...,
    home_dir: str | None = ...,
    config_overrides: dict[str, str] | None = ...,
    disallowed_tools: list[Literal['web_search']] | None = ...,
) -> ACPAgent
```

`web_search` CodexWebSearch
Web search mode. Use `"live"` for live web search, `"cached"` for cached web search, or `"disabled"` to disable web search.

`goals` bool
Enable Codex goal tools.

`skills` list\[str \| Path \| Skill\] \| None
Additional skills to make available.

`home_dir` str \| None
Override for `CODEX_HOME` directory in the sandbox.

`config_overrides` dict\[str, str\] \| None
Extra Codex config.toml key-value pairs.

`disallowed_tools` list\[Literal\['web_search'\]\] \| None

### interactive_gemini_cli

Gemini CLI agent via ACP.

Uses gemini’s native `--experimental-acp` flag in a sandbox. Supports multi-turn sessions and mid-turn interrupts.

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/acp/_agents/gemini_cli/gemini_cli.py#L152)

``` python
def interactive_gemini_cli(
    *,
    skills: list[str | Path | Skill] | None = ...,
    version: Literal['auto', 'sandbox', 'stable', 'latest'] | str = ...,
    debug: bool = ...,
    model: str | Model | None = ...,
    filter: GenerateFilter | None = ...,
    bridged_tools: list[BridgedToolsSpec] | None = ...,
    mcp_servers: list[MCPServerConfig] | None = ...,
    system_prompt: str | None = ...,
    retry_refusals: int | None = ...,
    model_map: dict[str, str | Model] | None = ...,
    cwd: str | None = ...,
    env: dict[str, str] | None = ...,
    user: str | None = ...,
    sandbox: str | None = ...,
) -> ACPAgent
```

`skills` list\[str \| Path \| Skill\] \| None
Additional skills to make available.

`version` Literal\['auto', 'sandbox', 'stable', 'latest'\] \| str
Version of gemini CLI to use. One of: `"auto"`, `"sandbox"`, `"stable"`, `"latest"`, or a specific semver version string.

`debug` bool
Run gemini-cli with `--debug` and `GEMINI_DEBUG_LOG_FILE` set to `$HOME/gemini-debug.log` in the sandbox (in ACP mode console output is patched away from stderr, so the log file is the only way to surface internals).

`model` str \| Model \| None

`filter` GenerateFilter \| None

`bridged_tools` list\[BridgedToolsSpec\] \| None

`mcp_servers` list\[MCPServerConfig\] \| None

`system_prompt` str \| None

`retry_refusals` int \| None

`model_map` dict\[str, str \| Model\] \| None

`cwd` str \| None

`env` dict\[str, str\] \| None

`user` str \| None

`sandbox` str \| None

### bridge_mcp_to_acp

Convert bridge `MCPServerConfigHTTP` objects to ACP `HttpMcpServer`.

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/acp/agent.py#L29)

``` python
def bridge_mcp_to_acp(configs: list[MCPServerConfigHTTP]) -> list[HttpMcpServer]
```

`configs` list\[MCPServerConfigHTTP\]

### ACPAgent

Base class for ACP-based agents running in sandboxes.

Manages the ACP lifecycle (connection, session, MCP announcement, cleanup). Subclasses implement `_start_agent` for agent-specific setup.

Sets up the ACP lifecycle, exposes `.conn` and `.session_id`, signals `.ready`, then blocks until the task is cancelled. The caller drives all prompts via `conn.prompt()` / `conn.cancel()`.

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/acp/agent.py#L76)

``` python
class ACPAgent(Agent)
```

### ACPAgentParams

Keyword arguments accepted by [ACPAgent](../reference/index.html.md#acpagent).

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/acp/agent.py#L45)

``` python
class ACPAgentParams(TypedDict, total=False)
```

### acp_connection

Bridge an `ExecRemoteProcess` to ACP. Yield `(conn, feeder, error_info)`.

Bridges `ExecRemoteProcess` to the SDK’s `connect_to_agent()` via a transport wrapper, then cleans up on exit.

*feeder* is a background task that reads process stdout and feeds it into the ACP reader. It completes when the process exits, so callers can `await feeder` to detect unexpected process termination.

*proc_info* collects stderr output and the exit code as the process runs. Inspect after `await feeder` for full diagnostics.

Usage:

    async with acp_connection(proc) as (conn, feeder, proc_info):
        await conn.initialize(...)
        session = await conn.new_session(...)
        await conn.prompt(...)

[Source](https://github.com/meridianlabs-ai/inspect_swe/blob/49a7c3004a872b87f9c22fc53036b397b660716f/src/inspect_swe/acp/client.py#L255)

``` python
@contextlib.asynccontextmanager
async def acp_connection(
    proc: ExecRemoteProcess,
) -> AsyncIterator[tuple[ClientSideConnection, asyncio.Task[None], ErrorInfo]]
```

`proc` ExecRemoteProcess
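
The feeder pattern described above (a background task that drains stdout, finishes when the process exits, and leaves stderr and the exit code behind for diagnostics) can be illustrated with plain `asyncio`. This is only a sketch under stated assumptions and not the `inspect_swe` implementation: it runs a local subprocess instead of an `ExecRemoteProcess`, hands lines to a queue instead of an ACP connection, and `ProcInfo`, `line_connection()`, and `demo()` are hypothetical names.

``` python
import asyncio
import contextlib
import sys
from collections.abc import AsyncIterator
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcInfo:
    """Hypothetical diagnostics holder: stderr lines and the exit code."""

    stderr: list = field(default_factory=list)
    exit_code: Optional[int] = None


@contextlib.asynccontextmanager
async def line_connection(*cmd: str) -> AsyncIterator[tuple]:
    """Yield (lines, feeder, info) for a local subprocess."""
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
    )
    lines: asyncio.Queue = asyncio.Queue()
    info = ProcInfo()

    async def drain_stderr() -> None:
        # Collect stderr as the process runs.
        async for raw in proc.stderr:
            info.stderr.append(raw.decode().rstrip())

    async def feed() -> None:
        # Forward stdout lines until EOF, then record the exit code.
        stderr_task = asyncio.create_task(drain_stderr())
        async for raw in proc.stdout:
            await lines.put(raw.decode().rstrip())
        await stderr_task
        info.exit_code = await proc.wait()
        await lines.put(None)  # sentinel: no more output

    feeder = asyncio.create_task(feed())
    try:
        yield lines, feeder, info
    finally:
        # Clean up on exit: stop a still-running process, then let the
        # feeder finish reading the closed pipes.
        if proc.returncode is None:
            proc.kill()
        await feeder


async def demo() -> tuple:
    script = "import sys; print('hello'); print('oops', file=sys.stderr); sys.exit(3)"
    received = []
    async with line_connection(sys.executable, "-c", script) as (lines, feeder, info):
        await feeder  # completes once the process has exited
        while (item := await lines.get()) is not None:
            received.append(item)
    return received, info


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

As in the usage shown for `acp_connection`, the caller awaits the feeder to learn that the process has ended and only then reads the collected diagnostics.
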