# Inspect SWE

> Software engineering agents for Inspect AI.

## Overview

The `inspect_swe` package makes software engineering agents like [Claude Code](https://docs.anthropic.com/en/docs/claude-code/overview), [Codex CLI](https://github.com/openai/codex), [Gemini CLI](https://github.com/google-gemini/gemini-cli), [OpenCode](https://github.com/anomalyco/opencode), and [Mini SWE Agent](https://github.com/SWE-agent/mini-swe-agent) available as standard Inspect agents.

For example, here we use the [claude_code()](./reference/index.html.md#claude_code) agent as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import claude_code

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=claude_code(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

Inspect SWE agents are implemented using the Inspect [`sandbox_agent_bridge()`](https://inspect.aisi.org.uk/agent-bridge.html#sandbox-bridge). Agents run inside the sample sandbox and their model API calls are proxied back to Inspect. This means that you can use any model with Inspect SWE agents, and that features like token or time limits and log transcripts work as normal with the agents.
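
Because model calls are proxied back to Inspect, the model and per-sample limits can be chosen when the evaluation is launched rather than inside the agent. The sketch below is one way this might look from Python. It assumes that `inspect_ai.eval()` accepts `model`, `token_limit`, and `time_limit` keyword arguments; the task file name `task.py`, the model name, the limit values, and the helper names `eval_options()` and `run_system_explorer()` are placeholders invented for illustration.

``` python
from typing import Any


def eval_options(model: str, token_limit: int, time_limit: int) -> dict[str, Any]:
    """Collect keyword arguments to forward to inspect_ai.eval()."""
    return {
        "model": model,              # any provider/model that Inspect can serve
        "token_limit": token_limit,  # per-sample token budget
        "time_limit": time_limit,    # per-sample clock time, in seconds
    }


def run_system_explorer(task_file: str = "task.py") -> Any:
    """Evaluate a task file with a placeholder model and limits."""
    # Imported lazily so eval_options() can be used without Inspect installed.
    from inspect_ai import eval

    # Placeholder values: substitute your own model and limits.
    options = eval_options("openai/gpt-4o", token_limit=500_000, time_limit=600)
    return eval(task_file, **options)
```

Calling `run_system_explorer()` requires `inspect-ai`, `inspect-swe`, and a working Docker sandbox. The limits apply to the agent's proxied model calls in the same way they would for any other Inspect solver.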
## Getting Started

Install Inspect SWE from PyPI with:

``` bash
pip install inspect-swe
```

Then, try out one or more of the available agents:

| Agent | Description |
|----|----|
| [claude_code()](./claude_code.html.md) | Anthropic’s agentic coding tool [Claude Code](https://docs.anthropic.com/en/docs/claude-code/overview) |
| [codex_cli()](./codex_cli.html.md) | OpenAI’s terminal-based coding agent [Codex CLI](https://github.com/openai/codex) |
| [gemini_cli()](./gemini_cli.html.md) | Google’s open-source AI agent [Gemini CLI](https://github.com/google-gemini/gemini-cli) |
| [opencode()](./opencode.html.md) | Provider-independent terminal-based coding agent. |
| [mini_swe_agent()](./mini_swe_agent.html.md) | SWE-agent’s minimal 100-line agent. |

# Claude Code – Inspect SWE

## Overview

The `claude_code()` agent uses the unattended mode of Anthropic [Claude Code](https://code.claude.com/docs/en/overview) to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

> **NOTE: Claude Code Installation**
>
> By default, the agent will download the current stable version of Claude Code and copy it to the sandbox. You can also exercise more explicit control over which version of Claude Code is used; see the [Installation](#installation) section below for details.

## Basic Usage

Use the `claude_code()` agent as you would any Inspect agent.
For example, here we use it as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import claude_code

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=claude_code(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also pass the agent as a `--solver` on the command line:

``` bash
inspect eval ctf.py --solver inspect_swe/claude_code
```

If you want to try this out locally, see the [system_explorer](https://github.com/meridianlabs-ai/inspect_swe/tree/main/examples/system_explorer/task.py) example.

## Options

The following options are supported for customizing the behavior of the agent:

| Option | Description |
|----|----|
| `system_prompt` | Additional system prompt to append to default system prompt. |
| `skills` | Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. |
| `mcp_servers` | MCP servers (see [MCP Servers](#mcp-servers) below for details). |
| `bridged_tools` | Host-side Inspect tools to expose via MCP (see [Bridged Tools](#bridged-tools) below for details). |
| `disallowed_tools` | Optionally disallow tools (e.g. `"WebSearch"`) |
| `centaur` | Run in [Centaur Mode](#centaur-mode), which makes Claude Code available to an Inspect `human_cli()` agent rather than running it unattended. |
| `attempts` | Allow the agent to have multiple scored attempts at solving the task. |
| `model` | Model name to use for agent (defaults to main model for task). |
| `model_config` | Model id used for the identity the agent presents to itself (its “You are powered by the model …” system prompt). Defaults to the real served model. |
| `opus_model` | The model to use for `opus`, or for `opusplan` when Plan Mode is active. Defaults to `model`. |
| `sonnet_model` | The model to use for `sonnet`, or for `opusplan` when Plan Mode is not active. Defaults to `model`. |
| `haiku_model` | The model to use for haiku, or [background functionality](https://code.claude.com/docs/en/costs#background-token-usage). Defaults to `model`. |
| `subagent_model` | The model to use for [subagents](https://code.claude.com/docs/en/sub-agents). Defaults to `model`. |
| `filter` | Filter for intercepting bridged model requests. |
| `retry_refusals` | Should refusals be retried? Defaults to 3. |
| `retry_uncaught_errors` | Should uncaught errors (unexpected crashes of Claude Code) be retried? Defaults to 3. |
| `cwd` | Working directory for Claude Code session. |
| `env` | Environment variables to set for Claude Code. |
| `version` | Version of Claude Code to use (see [Installation](#installation) below for details) |

For example, here we specify a custom system prompt and disallow the `WebSearch` tool:

``` python
claude_code(
    system_prompt="You are an ace system researcher.",
    disallowed_tools=["WebSearch"]
)
```

## MCP Servers

You can specify one or more [Model Context Protocol](https://modelcontextprotocol.io/docs/getting-started/intro) (MCP) servers to provide additional tools to Claude Code. Servers are specified via the [`MCPServerConfig`](https://inspect.aisi.org.uk/reference/inspect_ai.tool.html#mcpserverconfig) class and its Stdio and HTTP variants.
For example, here is a Dockerfile that makes the `server-memory` MCP server available in the sandbox container:

``` dockerfile
FROM python:3.12-bookworm

# nodejs (required by mcp server)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y --no-install-recommends nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# memory mcp server
RUN npx --yes @modelcontextprotocol/server-memory --version

# run forever
CMD ["tail", "-f", "/dev/null"]
```

Note that we run the `npx` server during the build of the Dockerfile so that it is cached for use offline (below we’ll run it with the `--offline` option).

We can then use this MCP server in a task as follows:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.tool import MCPServerConfigStdio
from inspect_swe import claude_code

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(
                input="What transport protocols are supported in "
                + " the 2025-03-26 version of the MCP spec?"
            )
        ],
        solver=claude_code(
            system_prompt="Please use the web search tool to "
            + "research this question and the memory tools "
            + "to keep track of your research.",
            mcp_servers=[
                MCPServerConfigStdio(
                    name="memory",
                    command="npx",
                    args=[
                        "--offline",
                        "@modelcontextprotocol/server-memory"
                    ],
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

Note that we run the MCP server using the `--offline` option so that it doesn’t require an internet connection (which it would normally use to check for updates to the package).

## Bridged Tools

You can expose host-side Inspect tools to the sandboxed agent via the MCP protocol using the `bridged_tools` parameter. This allows you to run tools on the host (e.g. tools that access host resources, databases, or APIs) but make them available to the agent running inside the sandbox.
Tools are specified via [`BridgedToolsSpec`](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#bridgedtoolsspec) which wraps a list of Inspect tools:

``` python
from inspect_ai import Task, task
from inspect_ai.agent import BridgedToolsSpec
from inspect_ai.dataset import Sample
from inspect_ai.tool import tool
from inspect_swe import claude_code

@tool
def search_database():
    async def execute(query: str) -> str:
        """Search the internal database.

        Args:
            query: The search query.
        """
        # This runs on the host, not in the sandbox
        return f"Results for: {query}"
    return execute

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(input="Search for information about MCP protocols.")
        ],
        solver=claude_code(
            system_prompt="Use the search tool to research.",
            bridged_tools=[
                BridgedToolsSpec(
                    name="host_tools",
                    tools=[search_database()]
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

The `name` field identifies the MCP server and will be visible to the agent as a tool prefix. You can specify multiple `BridgedToolsSpec` instances to create separate MCP servers for different tool groups.

See the [Bridged Tools](https://inspect.aisi.org.uk/agent-bridge.html#bridged-tools) documentation for more details on the architecture and how tool execution flows between host and sandbox.

## Installation

By default, the agent will download the current stable version of Claude Code and copy it to the sandbox. You can override this behaviour using the `version` option:

| Option | Description |
|----|----|
| `"auto"` | Use any available version of Claude Code in the sandbox, otherwise download the current stable version. |
| `"sandbox"` | Use the version of Claude Code in the sandbox (raises `RuntimeError` if not available in the sandbox) |
| `"stable"` | Download and use the current stable version. |
| `"latest"` | Download and use the very latest version. |
| `"x.x.x"` | Download and use a specific version number. |

If you don’t ever want to rely on automatic downloads of Claude Code (e.g. if you run your evaluations offline), you can use one of two approaches:

1.  Pre-install the version of Claude Code you want to use in the sandbox, then use `version="sandbox"`:

    ``` python
    claude_code(version="sandbox")
    ```

2.  Download the version of Claude Code you want to use into the cache, then specify that version explicitly:

    ``` python
    # download the agent binary during installation/configuration
    download_agent_binary("claude_code", "0.29.0", "linux-x64")

    # reference that version in your task (no download will occur)
    claude_code(version="0.29.0")
    ```

Note that the 5 most recently downloaded versions are retained in the cache. Use the [cached_agent_binaries()](./reference/index.html.md#cached_agent_binaries) function to list the contents of the cache.

## Centaur Mode

The `claude_code()` agent can also be run in “centaur” mode which uses the Inspect AI [Human Agent](https://inspect.aisi.org.uk/human-agent.html) as the solver and makes [Claude Code](https://code.claude.com/docs/en/overview) available to the human user for help with the task. So rather than strictly measuring human vs. model performance, you are able to measure performance of humans working collaboratively with a model.

Enable centaur mode by passing `centaur=True` to the `claude_code()` agent:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import claude_code

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=claude_code(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also enable centaur mode from the CLI using a solver arg (`-S`):

``` bash
inspect eval ctf.py --solver inspect_swe/claude_code -S centaur=true
```

You can also pass `CentaurOptions` to further customize the behavior of the human agent.
For example:

``` python
from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=claude_code(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)
```

See the [human_cli()](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#human_cli) documentation for details on available options.

## Troubleshooting

If Claude Code doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the Claude Code debug log after an evaluation task is complete. You can do this with:

``` bash
inspect trace dump --filter "Claude Code"
```

# Codex CLI – Inspect SWE

## Overview

The `codex_cli()` agent uses the unattended mode of OpenAI [Codex CLI](https://github.com/openai/codex) to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

> **NOTE: Codex CLI Installation**
>
> By default, the agent will download the current stable version of Codex CLI and copy it to the sandbox. You can also exercise more explicit control over which version of Codex CLI is used; see the [Installation](#installation) section below for details.

## Basic Usage

Use the `codex_cli()` agent as you would any Inspect agent. For example, here we use it as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import codex_cli

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=codex_cli(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also pass the agent as a `--solver` on the command line:

``` bash
inspect eval ctf.py --solver inspect_swe/codex_cli
```

If you want to try this out locally, see the [system_explorer](https://github.com/meridianlabs-ai/inspect_swe/tree/main/examples/system_explorer/task.py) example.
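
If you launch evaluations from a script, it can help to assemble the command line programmatically. The following is a minimal sketch that builds the same `inspect eval ... --solver inspect_swe/<agent>` invocation shown above, optionally adding `-S key=value` solver arguments (the flag used under Centaur Mode later on this page). The helper names `inspect_eval_argv()` and `as_shell()` are hypothetical and are not part of Inspect or Inspect SWE.

``` python
import shlex
from typing import Mapping, Optional, Union

SolverArg = Union[str, int, float, bool]


def inspect_eval_argv(
    task_file: str,
    agent: str,
    solver_args: Optional[Mapping[str, SolverArg]] = None,
) -> list[str]:
    """Build an argv list for `inspect eval` using an Inspect SWE agent as solver."""
    argv = ["inspect", "eval", task_file, "--solver", f"inspect_swe/{agent}"]
    for key, value in (solver_args or {}).items():
        # Render booleans in lowercase, matching `-S centaur=true`.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        argv += ["-S", f"{key}={text}"]
    return argv


def as_shell(argv: list[str]) -> str:
    """Render an argv list as a single, safely quoted shell command line."""
    return shlex.join(argv)
```

The argv list can be passed directly to `subprocess.run(argv, check=True)`; keeping it as a list avoids shell quoting problems when a solver argument contains spaces.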
## Options

The following options are supported for customizing the behavior of the agent:

| Option | Description |
|----|----|
| `system_prompt` | Additional system prompt to append to default system prompt. |
| `model_config` | Codex model slug used to select the system prompt and tool set. Defaults to `None`, which derives the slug from the model used by the agent so Codex’s prompt/tooling aligns with what’s actually running. |
| `skills` | Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. |
| `mcp_servers` | MCP servers (see [MCP Servers](#mcp-servers) below for details). |
| `bridged_tools` | Host-side Inspect tools to expose via MCP (see [Bridged Tools](#bridged-tools) below for details). |
| `web_search` | Web search mode. Use `"live"` for live web search, `"cached"` for cached web search, or `"disabled"` to disable web search. Defaults to `"live"`. |
| `goals` | Enable Codex goal tools. Defaults to `True`. |
| `centaur` | Run in [Centaur Mode](#centaur-mode), which makes Codex CLI available to an Inspect `human_cli()` agent rather than running it unattended. |
| `attempts` | Allow the agent to have multiple scored attempts at solving the task. |
| `model` | Model name to use for agent (defaults to main model for task). |
| `filter` | Filter for intercepting bridged model requests. |
| `retry_refusals` | Should refusals be retried? (pass number of times to retry) |
| `home_dir` | Home directory to use for codex cli. When set, AGENTS.md and the MCP configuration will be written here rather than to .codex |
| `cwd` | Working directory for Codex CLI session. |
| `env` | Environment variables to set for Codex CLI. |
| `version` | Version of Codex CLI to use (see [Installation](#installation) below for details) |
| `config_overrides` | Additional Codex CLI configuration overrides. |

For example, here we specify a custom system prompt and disable the web search and goals tools:

``` python
codex_cli(
    system_prompt="You are an ace system researcher.",
    web_search="disabled",
    goals=False,
)
```

## MCP Servers

You can specify one or more [Model Context Protocol](https://modelcontextprotocol.io/docs/getting-started/intro) (MCP) servers to provide additional tools to Codex CLI. Servers are specified via the [`MCPServerConfig`](https://inspect.aisi.org.uk/reference/inspect_ai.tool.html#mcpserverconfig) class and its Stdio and HTTP variants.

For example, here is a Dockerfile that makes the `server-memory` MCP server available in the sandbox container:

``` dockerfile
FROM python:3.12-bookworm

# nodejs (required by mcp server)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y --no-install-recommends nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# memory mcp server
RUN npx --yes @modelcontextprotocol/server-memory --version

# run forever
CMD ["tail", "-f", "/dev/null"]
```

Note that we run the `npx` server during the build of the Dockerfile so that it is cached for use offline (below we’ll run it with the `--offline` option).

We can then use this MCP server in a task as follows:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.tool import MCPServerConfigStdio
from inspect_swe import codex_cli

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(
                input="What transport protocols are supported in "
                + " the 2025-03-26 version of the MCP spec?"
            )
        ],
        solver=codex_cli(
            system_prompt="Please use the web search tool to "
            + "research this question and the memory tools "
            + "to keep track of your research.",
            mcp_servers=[
                MCPServerConfigStdio(
                    name="memory",
                    command="npx",
                    args=[
                        "--offline",
                        "@modelcontextprotocol/server-memory"
                    ],
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

Note that we run the MCP server using the `--offline` option so that it doesn’t require an internet connection (which it would normally use to check for updates to the package).

## Bridged Tools

You can expose host-side Inspect tools to the sandboxed agent via the MCP protocol using the `bridged_tools` parameter. This allows you to run tools on the host (e.g. tools that access host resources, databases, or APIs) but make them available to the agent running inside the sandbox.

Tools are specified via [`BridgedToolsSpec`](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#bridgedtoolsspec) which wraps a list of Inspect tools:

``` python
from inspect_ai import Task, task
from inspect_ai.agent import BridgedToolsSpec
from inspect_ai.dataset import Sample
from inspect_ai.tool import tool
from inspect_swe import codex_cli

@tool
def search_database():
    async def execute(query: str) -> str:
        """Search the internal database.

        Args:
            query: The search query.
        """
        # This runs on the host, not in the sandbox
        return f"Results for: {query}"
    return execute

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(input="Search for information about MCP protocols.")
        ],
        solver=codex_cli(
            system_prompt="Use the search tool to research.",
            bridged_tools=[
                BridgedToolsSpec(
                    name="host_tools",
                    tools=[search_database()]
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

The `name` field identifies the MCP server and will be visible to the agent as a tool prefix. You can specify multiple `BridgedToolsSpec` instances to create separate MCP servers for different tool groups.
See the [Bridged Tools](https://inspect.aisi.org.uk/agent-bridge.html#bridged-tools) documentation for more details on the architecture and how tool execution flows between host and sandbox.

## Installation

By default, the agent will download the current stable version of Codex CLI and copy it to the sandbox. You can override this behaviour using the `version` option:

| Option | Description |
|----|----|
| `"auto"` | Use any available version of Codex CLI in the sandbox, otherwise download the latest version. |
| `"sandbox"` | Use the version of Codex CLI in the sandbox (raises `RuntimeError` if not available in the sandbox) |
| `"latest"` | Download and use the very latest version. |
| `"x.x.x"` | Download and use a specific version number. |

If you don’t ever want to rely on automatic downloads of Codex CLI (e.g. if you run your evaluations offline), you can use one of two approaches:

1.  Pre-install the version of Codex CLI you want to use in the sandbox, then use `version="sandbox"`:

    ``` python
    codex_cli(version="sandbox")
    ```

2.  Download the version of Codex CLI you want to use into the cache, then specify that version explicitly:

    ``` python
    # download the agent binary during installation/configuration
    download_agent_binary("codex_cli", "0.29.0", "linux-x64")

    # reference that version in your task (no download will occur)
    codex_cli(version="0.29.0")
    ```

Note that the 5 most recently downloaded versions are retained in the cache. Use the [cached_agent_binaries()](./reference/index.html.md#cached_agent_binaries) function to list the contents of the cache.

## Centaur Mode

The `codex_cli()` agent can also be run in “centaur” mode which uses the Inspect AI [Human Agent](https://inspect.aisi.org.uk/human-agent.html) as the solver and makes [Codex CLI](https://github.com/openai/codex) available to the human user for help with the task. So rather than strictly measuring human vs.
model performance, you are able to measure performance of humans working collaboratively with a model.

Enable centaur mode by passing `centaur=True` to the `codex_cli()` agent:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import codex_cli

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=codex_cli(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also enable centaur mode from the CLI using a solver arg (`-S`):

``` bash
inspect eval ctf.py --solver inspect_swe/codex_cli -S centaur=true
```

You can also pass `CentaurOptions` to further customize the behavior of the human agent. For example:

``` python
from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=codex_cli(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)
```

See the [human_cli()](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#human_cli) documentation for details on available options.

## Troubleshooting

If Codex CLI doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the Codex CLI debug log after an evaluation task is complete. You can do this with:

``` bash
inspect trace dump --filter "Codex CLI"
```

# Gemini CLI – Inspect SWE

## Overview

The `gemini_cli()` agent uses the unattended mode of Google [Gemini CLI](https://github.com/google-gemini/gemini-cli) to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

> **NOTE: Gemini CLI Installation**
>
> By default, the agent will download the current stable version of Gemini CLI and copy it to the sandbox. You can also exercise more explicit control over which version of Gemini CLI is used; see the [Installation](#installation) section below for details.

## Basic Usage

Use the `gemini_cli()` agent as you would any Inspect agent. For example, here we use it as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import gemini_cli

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=gemini_cli(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also pass the agent as a `--solver` on the command line:

``` bash
inspect eval ctf.py --solver inspect_swe/gemini_cli
```

If you want to try this out locally, see the [system_explorer](https://github.com/meridianlabs-ai/inspect_swe/tree/main/examples/system_explorer/task.py) example.

## Options

The following options are supported for customizing the behavior of the agent:

| Option | Description |
|----|----|
| `system_prompt` | Additional system prompt to append to default system prompt. |
| `skills` | Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. |
| `mcp_servers` | MCP servers (see [MCP Servers](#mcp-servers) below for details). |
| `bridged_tools` | Host-side Inspect tools to expose via MCP (see [Bridged Tools](#bridged-tools) below for details). |
| `centaur` | Run in [Centaur Mode](#centaur-mode), which makes Gemini CLI available to an Inspect `human_cli()` agent rather than running it unattended. |
| `attempts` | Allow the agent to have multiple scored attempts at solving the task. |
| `model` | Model name to use for agent (defaults to main model for task). |
| `gemini_model` | Gemini model name to pass to CLI. This bypasses the auto-router. Use `"gemini-2.5-pro"` (default) or `"gemini-2.5-flash"`. |
| `filter` | Filter for intercepting bridged model requests. |
| `retry_refusals` | Should refusals be retried? (pass number of times to retry) |
| `cwd` | Working directory for Gemini CLI session. |
| `env` | Environment variables to set for Gemini CLI. |
| `version` | Version of Gemini CLI to use (see [Installation](#installation) below for details) |

For example, here we specify a custom system prompt:

``` python
gemini_cli(
    system_prompt="You are an ace system researcher."
)
```

## MCP Servers

You can specify one or more [Model Context Protocol](https://modelcontextprotocol.io/docs/getting-started/intro) (MCP) servers to provide additional tools to Gemini CLI. Servers are specified via the [`MCPServerConfig`](https://inspect.aisi.org.uk/reference/inspect_ai.tool.html#mcpserverconfig) class and its Stdio and HTTP variants.

For example, here is a Dockerfile that makes the `server-memory` MCP server available in the sandbox container:

``` dockerfile
FROM python:3.12-bookworm

# nodejs (required by mcp server)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y --no-install-recommends nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# memory mcp server
RUN npx --yes @modelcontextprotocol/server-memory --version

# run forever
CMD ["tail", "-f", "/dev/null"]
```

Note that we run the `npx` server during the build of the Dockerfile so that it is cached for use offline (below we’ll run it with the `--offline` option).

We can then use this MCP server in a task as follows:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.tool import MCPServerConfigStdio
from inspect_swe import gemini_cli

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(
                input="What transport protocols are supported in "
                + " the 2025-03-26 version of the MCP spec?"
            )
        ],
        solver=gemini_cli(
            system_prompt="Please use the web search tool to "
            + "research this question and the memory tools "
            + "to keep track of your research.",
            mcp_servers=[
                MCPServerConfigStdio(
                    name="memory",
                    command="npx",
                    args=[
                        "--offline",
                        "@modelcontextprotocol/server-memory"
                    ],
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

Note that we run the MCP server using the `--offline` option so that it doesn’t require an internet connection (which it would normally use to check for updates to the package).

## Bridged Tools

You can expose host-side Inspect tools to the sandboxed agent via the MCP protocol using the `bridged_tools` parameter. This allows you to run tools on the host (e.g. tools that access host resources, databases, or APIs) but make them available to the agent running inside the sandbox.

Tools are specified via [`BridgedToolsSpec`](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#bridgedtoolsspec) which wraps a list of Inspect tools:

``` python
from inspect_ai import Task, task
from inspect_ai.agent import BridgedToolsSpec
from inspect_ai.dataset import Sample
from inspect_ai.tool import tool
from inspect_swe import gemini_cli

@tool
def search_database():
    async def execute(query: str) -> str:
        """Search the internal database.

        Args:
            query: The search query.
        """
        # This runs on the host, not in the sandbox
        return f"Results for: {query}"
    return execute

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(input="Search for information about MCP protocols.")
        ],
        solver=gemini_cli(
            system_prompt="Use the search tool to research.",
            bridged_tools=[
                BridgedToolsSpec(
                    name="host_tools",
                    tools=[search_database()]
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

The `name` field identifies the MCP server and will be visible to the agent as a tool prefix. You can specify multiple `BridgedToolsSpec` instances to create separate MCP servers for different tool groups.
See the [Bridged Tools](https://inspect.aisi.org.uk/agent-bridge.html#bridged-tools) documentation for more details on the architecture and how tool execution flows between host and sandbox.

## Installation

By default, the agent will download the current stable version of Gemini CLI and copy it to the sandbox. You can override this behaviour using the `version` option:

| Option | Description |
|----|----|
| `"auto"` | Use any available version of Gemini CLI in the sandbox, otherwise download the latest version. |
| `"sandbox"` | Use the version of Gemini CLI in the sandbox (raises `RuntimeError` if not available in the sandbox) |
| `"latest"` | Download and use the very latest version. |
| `"x.x.x-preview.y"` | Download and use a specific version number. |

If you don’t ever want to rely on automatic downloads of Gemini CLI (e.g. if you run your evaluations offline), you can use one of two approaches:

1.  Pre-install the version of Gemini CLI you want to use in the sandbox, then use `version="sandbox"`:

    ``` python
    gemini_cli(version="sandbox")
    ```

2.  Download the version of Gemini CLI you want to use into the cache, then specify that version explicitly:

    ``` python
    # download the agent binary during installation/configuration
    download_agent_binary("gemini_cli", "0.29.0", "linux-x64")

    # reference that version in your task (no download will occur)
    gemini_cli(version="0.29.0")
    ```

Note that the 5 most recently downloaded versions are retained in the cache. Use the [cached_agent_binaries()](./reference/index.html.md#cached_agent_binaries) function to list the contents of the cache.

## Centaur Mode

The `gemini_cli()` agent can also be run in “centaur” mode which uses the Inspect AI [Human Agent](https://inspect.aisi.org.uk/human-agent.html) as the solver and makes [Gemini CLI](https://github.com/google-gemini/gemini-cli) available to the human user for help with the task. So rather than strictly measuring human vs.
model performance, you are able to measure performance of humans working collaboratively with a model.

Enable centaur mode by passing `centaur=True` to the `gemini_cli()` agent:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import gemini_cli

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=gemini_cli(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also enable centaur mode from the CLI using a solver arg (`-S`):

``` bash
inspect eval ctf.py --solver inspect_swe/gemini_cli -S centaur=true
```

You can also pass `CentaurOptions` to further customize the behavior of the human agent. For example:

``` python
from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=gemini_cli(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)
```

See the [human_cli()](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#human_cli) documentation for details on available options.

## Troubleshooting

If Gemini CLI doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the Gemini CLI debug log after an evaluation task is complete. You can do this with:

``` bash
inspect trace dump --filter "Gemini CLI"
```

# OpenCode – Inspect SWE

## Overview

The `opencode()` agent uses the unattended mode of Anomaly [OpenCode](https://github.com/anomalyco/opencode) to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

> **NOTE: OpenCode Installation**
>
> By default, the agent will download the current stable version of OpenCode and copy it to the sandbox. You can also exercise more explicit control over which version of OpenCode is used; see the [Installation](#installation) section below for details.
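
As a rough mental model of what the `version` option controls, the sketch below maps a `version` value and the state of the sandbox to an install action. It is modeled on the `version` values listed for Claude Code earlier in this document (`"auto"`, `"sandbox"`, `"stable"`, `"latest"`, or an explicit version number). Whether OpenCode accepts exactly these values is an assumption here, and the function `plan_agent_install()` is hypothetical rather than part of Inspect SWE.

``` python
from typing import Optional


def plan_agent_install(version: str, sandbox_version: Optional[str]) -> str:
    """Describe the action implied by a `version` setting.

    `sandbox_version` is the agent version already present in the sandbox,
    or None if the agent is not installed there.
    """
    if version == "sandbox":
        # Only a pre-installed agent is acceptable; never download.
        if sandbox_version is None:
            raise RuntimeError("agent is not available in the sandbox")
        return f"use sandbox {sandbox_version}"
    if version == "auto":
        # Prefer whatever is already in the sandbox, else fall back to a download.
        if sandbox_version is not None:
            return f"use sandbox {sandbox_version}"
        return "download stable"  # assumed fallback, per the note above
    # "stable", "latest", or an explicit version number such as "0.29.0"
    return f"download {version}"
```

For example, `plan_agent_install("auto", None)` yields `"download stable"`, while `plan_agent_install("sandbox", None)` raises `RuntimeError`, which is the behaviour you would want for offline evaluations.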
## Basic Usage

Use the `opencode()` agent as you would any Inspect agent. For example, here we use it as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import opencode

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=opencode(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also pass the agent as a `--solver` on the command line:

``` bash
inspect eval ctf.py --solver inspect_swe/opencode
```

If you want to try this out locally, see the [system_explorer](https://github.com/meridianlabs-ai/inspect_swe/tree/main/examples/system_explorer/task.py) example.

## Options

The following options are supported for customizing the behavior of the agent:

| Option | Description |
|----|----|
| `system_prompt` | Additional system prompt to append to default system prompt. |
| `skills` | Additional [skills](https://inspect.aisi.org.uk/tools-standard.html#sec-skill) to make available to the agent. |
| `mcp_servers` | MCP servers (see [MCP Servers](#mcp-servers) below for details). |
| `bridged_tools` | Host-side Inspect tools to expose via MCP (see [Bridged Tools](#bridged-tools) below for details). |
| `centaur` | Run in [Centaur Mode](#centaur-mode), which makes OpenCode available to an Inspect `human_cli()` agent rather than running it unattended. |
| `attempts` | Allow the agent to have multiple scored attempts at solving the task. |
| `model` | Model name to use for agent (defaults to main model for task). |
| `opencode_model` | OpenCode model identifier (`provider/model`) passed to the CLI. Default: `"anthropic/claude-sonnet-4-5"`. The actual model calls still go through the Inspect bridge; this just selects which provider client OpenCode uses to format the request. |
| `filter` | Filter for intercepting bridged model requests. |
| `retry_refusals` | Should refusals be retried? (pass number of times to retry) |
| `cwd` | Working directory for OpenCode session. |
| `env` | Environment variables to set for OpenCode. |
| `version` | Version of OpenCode to use (see [Installation](#installation) below for details) |

For example, here we specify a custom system prompt:

``` python
opencode(
    system_prompt="You are an ace system researcher."
)
```

## MCP Servers

You can specify one or more [Model Context Protocol](https://modelcontextprotocol.io/docs/getting-started/intro) (MCP) servers to provide additional tools to OpenCode. Servers are specified via the [`MCPServerConfig`](https://inspect.aisi.org.uk/reference/inspect_ai.tool.html#mcpserverconfig) class and its Stdio and HTTP variants.

For example, here is a Dockerfile that makes the `server-memory` MCP server available in the sandbox container:

``` dockerfile
FROM python:3.12-bookworm

# nodejs (required by mcp server)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y --no-install-recommends nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# memory mcp server
RUN npx --yes @modelcontextprotocol/server-memory --version

# run forever
CMD ["tail", "-f", "/dev/null"]
```

Note that we run the `npx` server during the build of the Dockerfile so that it is cached for use offline (below we’ll run it with the `--offline` option).

We can then use this MCP server in a task as follows:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.tool import MCPServerConfigStdio
from inspect_swe import opencode

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(
                input="What transport protocols are supported in "
                + " the 2025-03-26 version of the MCP spec?"
            )
        ],
        solver=opencode(
            system_prompt="Please use the web search tool to "
            + "research this question and the memory tools "
            + "to keep track of your research.",
            mcp_servers=[
                MCPServerConfigStdio(
                    name="memory",
                    command="npx",
                    args=[
                        "--offline",
                        "@modelcontextprotocol/server-memory"
                    ],
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

Note that we run the MCP server using the `--offline` option so that it doesn’t require an internet connection (which it would normally use to check for updates to the package).

## Bridged Tools

You can expose host-side Inspect tools to the sandboxed agent via the MCP protocol using the `bridged_tools` parameter. This allows you to run tools on the host (e.g. tools that access host resources, databases, or APIs) but make them available to the agent running inside the sandbox.

Tools are specified via [`BridgedToolsSpec`](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#bridgedtoolsspec) which wraps a list of Inspect tools:

``` python
from inspect_ai import Task, task
from inspect_ai.agent import BridgedToolsSpec
from inspect_ai.dataset import Sample
from inspect_ai.tool import tool
from inspect_swe import opencode

@tool
def search_database():
    async def execute(query: str) -> str:
        """Search the internal database.

        Args:
            query: The search query.
        """
        # This runs on the host, not in the sandbox
        return f"Results for: {query}"
    return execute

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(input="Search for information about MCP protocols.")
        ],
        solver=opencode(
            system_prompt="Use the search tool to research.",
            bridged_tools=[
                BridgedToolsSpec(
                    name="host_tools",
                    tools=[search_database()]
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )
```

The `name` field identifies the MCP server and will be visible to the agent as a tool prefix. You can specify multiple `BridgedToolsSpec` instances to create separate MCP servers for different tool groups.

See the [Bridged Tools](https://inspect.aisi.org.uk/agent-bridge.html#bridged-tools) documentation for more details on the architecture and how tool execution flows between host and sandbox.

## Installation

By default, the agent will download the latest version of OpenCode and copy it to the sandbox. You can override this behaviour using the `version` option:

| Option | Description |
|----|----|
| `"auto"` | Use any available version of OpenCode in the sandbox, otherwise download the latest version. |
| `"sandbox"` | Use the version of OpenCode in the sandbox (raises `RuntimeError` if not available in the sandbox) |
| `"latest"` | Download and use the very latest version. |
| `"x.x.x"` | Download and use a specific version number. |

If you don’t ever want to rely on automatic downloads of OpenCode (e.g. if you run your evaluations offline), you can use one of two approaches:

1.  Pre-install the version of OpenCode you want to use in the sandbox, then use `version="sandbox"`:

    ``` python
    opencode(version="sandbox")
    ```

2.  Download the version of OpenCode you want to use into the cache, then specify that version explicitly:

    ``` python
    from inspect_swe import download_agent_binary, opencode

    # download the agent binary during installation/configuration
    download_agent_binary("opencode", "0.29.0", "linux-x64")

    # reference that version in your task (no download will occur)
    opencode(version="0.29.0")
    ```

Note that the 5 most recently downloaded versions are retained in the cache. Use the [cached_agent_binaries()](./reference/index.html.md#cached_agent_binaries) function to list the contents of the cache.

## Centaur Mode

The `opencode()` agent can also be run in “centaur” mode, which uses the Inspect AI [Human Agent](https://inspect.aisi.org.uk/human-agent.html) as the solver and makes [OpenCode](https://github.com/anomalyco/opencode) available to the human user for help with the task. So rather than strictly measuring human vs. model performance, you are able to measure the performance of humans working collaboratively with a model.

Enable centaur mode by passing `centaur=True` to the `opencode()` agent:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import opencode

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=opencode(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also enable centaur mode from the CLI using a solver arg (`-S`):

``` bash
inspect eval ctf.py --solver inspect_swe/opencode -S centaur=true
```

To further customize the behavior of the human agent, pass `CentaurOptions` instead of `True`. For example:

``` python
from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=opencode(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)
```

See the [human_cli()](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#human_cli) documentation for details on available options.

## Troubleshooting

If OpenCode doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the OpenCode debug log after an evaluation task is complete. You can do this with:

``` bash
inspect trace dump --filter "OpenCode"
```

# Mini SWE Agent – Inspect SWE

## Overview

The `mini_swe_agent()` agent uses the unattended mode of SWE-agent [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent) to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

> **NOTE: mini-swe-agent Installation**
>
> By default, the agent will install the current stable version of mini-swe-agent in the sandbox. You can also exercise more explicit control over which version of mini-swe-agent is used; see the [Installation](#installation) section below for details.

## Basic Usage

Use the `mini_swe_agent()` agent as you would any Inspect agent. For example, here we use it as the solver in an Inspect task:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import mini_swe_agent

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=mini_swe_agent(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also pass the agent as a `--solver` on the command line:

``` bash
inspect eval ctf.py --solver inspect_swe/mini_swe_agent
```

If you want to try this out locally, see the [system_explorer](https://github.com/meridianlabs-ai/inspect_swe/tree/main/examples/system_explorer/task.py) example.

## Options

The following options are supported for customizing the behavior of the agent:

| Option | Description |
|----|----|
| `system_prompt` | Additional system prompt to append to default system prompt. |
| `centaur` | Run in [Centaur Mode](#centaur-mode), which makes mini-swe-agent available to an Inspect `human_cli()` agent rather than running it unattended. |
| `attempts` | Allow the agent to have multiple scored attempts at solving the task. |
| `model` | Model name to use for agent (defaults to main model for task). |
| `filter` | Filter for intercepting bridged model requests. |
| `retry_refusals` | Should refusals be retried? (pass number of times to retry) |
| `compaction` | Compaction strategy for managing context window overflow. |
| `cwd` | Working directory for mini-swe-agent session. |
| `env` | Environment variables to set for mini-swe-agent. |
| `user` | User to execute mini-swe-agent as in the sandbox. |
| `sandbox` | Sandbox environment name. |
| `version` | Version of mini-swe-agent to use (see [Installation](#installation) below for details) |

For example, here we specify a custom system prompt:

``` python
mini_swe_agent(
    system_prompt="You are an ace system researcher.",
)
```

## Installation

By default, the agent will install the current stable version of mini-swe-agent in the sandbox via Python wheels. You can override this behaviour using the `version` option:

| Option | Description |
|----|----|
| `"stable"` | Install and use the default pinned stable version. |
| `"sandbox"` | Use the version of mini-swe-agent in the sandbox (raises `RuntimeError` if not available in the sandbox) |
| `"latest"` | Install and use the latest version from PyPI. |
| `"x.x.x"` | Install and use a specific version number. |

Unlike the other agents, which use standalone binaries, mini-swe-agent is installed via Python wheels using `uv`.

If you don’t ever want to rely on automatic installation of mini-swe-agent (e.g. if you run your evaluations offline), pre-install it in the sandbox and then use `version="sandbox"`:

1.  Pre-install the version of mini-swe-agent you want to use in your sandbox Dockerfile:

    ``` dockerfile
    RUN pip install mini-swe-agent==2.2.3
    ```

2.  Reference it with `version="sandbox"` in your task:

    ``` python
    mini_swe_agent(version="sandbox")
    ```

## Centaur Mode

The `mini_swe_agent()` agent can also be run in “centaur” mode, which uses the Inspect AI [Human Agent](https://inspect.aisi.org.uk/human-agent.html) as the solver and makes [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent) available to the human user for help with the task. So rather than strictly measuring human vs. model performance, you are able to measure the performance of humans working collaboratively with a model.

Enable centaur mode by passing `centaur=True` to the `mini_swe_agent()` agent:

``` python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_swe import mini_swe_agent

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=mini_swe_agent(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )
```

You can also enable centaur mode from the CLI using a solver arg (`-S`):

``` bash
inspect eval ctf.py --solver inspect_swe/mini_swe_agent -S centaur=true
```

To further customize the behavior of the human agent, pass `CentaurOptions` instead of `True`. For example:

``` python
from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=mini_swe_agent(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)
```

See the [human_cli()](https://inspect.aisi.org.uk/reference/inspect_ai.agent.html#human_cli) documentation for details on available options.

## Troubleshooting

If mini-swe-agent doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the mini-swe-agent debug log after an evaluation task is complete. You can do this with:

``` bash
inspect trace dump --filter "mini-swe-agent"
```
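
If you want to capture the debug log from a script rather than by hand, a minimal sketch along the following lines wraps the command above with Python's `subprocess` module. The helper names `trace_dump_command()` and `save_trace_dump()` and the output path are hypothetical; only the `inspect trace dump --filter` invocation comes from this page, and the sketch assumes the `inspect` CLI is available on your `PATH`:

``` python
import subprocess
from pathlib import Path


def trace_dump_command(filter_text: str) -> list[str]:
    """Build the argument list for `inspect trace dump --filter <text>`."""
    return ["inspect", "trace", "dump", "--filter", filter_text]


def save_trace_dump(filter_text: str, output: Path) -> Path:
    """Run the trace dump and write its stdout to `output` (hypothetical helper)."""
    result = subprocess.run(
        trace_dump_command(filter_text),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits non-zero
    )
    output.write_text(result.stdout)
    return output
```

For example, `save_trace_dump("mini-swe-agent", Path("mini-swe-agent-trace.log"))` would save the filtered trace next to your task file. Passing the arguments as a list avoids shell quoting issues with filter strings that contain spaces, such as `"Gemini CLI"`.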