Claude Code

Overview

The claude_code() agent uses the unattended mode of Anthropic Claude Code to execute agentic tasks within the Inspect sandbox. Model API calls that occur in the sandbox are proxied back to Inspect for handling by the model provider for the current task.

Claude Code Installation

By default, the agent will download the current stable version of Claude Code and copy it to the sandbox. You can also exercise more explicit control over which version of Claude Code is used—see the Installation section below for details.

Basic Usage

Use the claude_code() agent as you would any Inspect agent. For example, here we use it as the solver in an Inspect task:

from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa

from inspect_swe import claude_code

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=claude_code(),
        scorer=model_graded_qa(),
        sandbox="docker",
    )

You can also pass the agent as a --solver on the command line:

inspect eval ctf.py --solver inspect_swe/claude_code

If you want to try this out locally, see the system_explorer example.
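
The task above reads its samples from dataset.json. As a rough sketch of what such a file could contain, the snippet below writes two samples using the `input` and `target` field names that json_dataset() reads by default. The questions, targets, and the write_dataset() helper are invented for illustration and are not the contents of the system_explorer example.

```python
import json
from pathlib import Path

# Hypothetical samples: "input" is the prompt given to the agent and
# "target" is the grading guidance made available to the scorer.
samples = [
    {
        "input": "Which Linux distribution is this sandbox running?",
        "target": "Names the distribution reported in /etc/os-release.",
    },
    {
        "input": "How many CPU cores are visible inside the sandbox?",
        "target": "Reports the core count shown by nproc or /proc/cpuinfo.",
    },
]


def write_dataset(path: str = "dataset.json") -> Path:
    """Write the samples as a JSON array and return the file path."""
    out = Path(path)
    out.write_text(json.dumps(samples, indent=2), encoding="utf-8")
    return out


if __name__ == "__main__":
    print(write_dataset())
```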

Options

The following options are supported for customizing the behavior of the agent:

Option	Description
`system_prompt`	Additional system prompt to append to default system prompt.
`skills`	Additional skills to make available to the agent.
`mcp_servers`	MCP servers (see MCP Servers below for details).
`bridged_tools`	Host-side Inspect tools to expose via MCP (see Bridged Tools below for details).
`disallowed_tools`	Tools to disallow (e.g. `"WebSearch"`).
`centaur`	Run in Centaur Mode, which makes Claude Code available to an Inspect `human_cli()` agent rather than running it unattended.
`attempts`	Allow the agent to have multiple scored attempts at solving the task.
`model`	Model name to use for agent (defaults to main model for task).
`opus_model`	The model to use for `opus`, or for `opusplan` when Plan Mode is active. Defaults to `model`.
`sonnet_model`	The model to use for `sonnet`, or for `opusplan` when Plan Mode is not active. Defaults to `model`.
`haiku_model`	The model to use for `haiku`, or for background functionality. Defaults to `model`.
`subagent_model`	The model to use for subagents. Defaults to `model`.
`filter`	Filter for intercepting bridged model requests.
`retry_refusals`	Should refusals be retried? (pass number of times to retry)
`retry_timeouts`	Should timeouts be retried? (pass number of times to retry)
`cwd`	Working directory for Claude Code session.
`env`	Environment variables to set for Claude Code.
`version`	Version of Claude Code to use (see Installation below for details).

For example, here we specify a custom system prompt and disallow the WebSearch tool:

claude_code(
    system_prompt="You are an ace system researcher.",
    disallowed_tools=["WebSearch"]
)
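
Several options can be combined in one call. The sketch below keeps them in a dictionary so the same settings can be reused across tasks. The option names come from the table above, but every value (the model name, working directory, and environment variable in particular) is a placeholder assumption, and build_agent() is a hypothetical helper rather than part of inspect_swe.

```python
from typing import Any

# Option names are taken from the options table; all values are
# illustrative assumptions and should be adapted to your setup.
AGENT_OPTIONS: dict[str, Any] = {
    "system_prompt": "You are an ace system researcher.",
    "disallowed_tools": ["WebSearch"],
    "haiku_model": "anthropic/claude-3-5-haiku-latest",  # placeholder name
    "retry_refusals": 3,  # retry refused requests up to 3 times
    "retry_timeouts": 2,  # retry timed-out requests up to 2 times
    "cwd": "/workspace",  # placeholder working directory
    "env": {"CI": "true"},  # placeholder environment variable
}


def build_agent():
    """Create the agent from AGENT_OPTIONS.

    inspect_swe is imported lazily so the options can be inspected
    even where the package is not installed.
    """
    from inspect_swe import claude_code

    return claude_code(**AGENT_OPTIONS)
```

Pass the result as the task's solver, for example `solver=build_agent()`.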

MCP Servers

You can specify one or more Model Context Protocol (MCP) servers to provide additional tools to Claude Code. Servers are specified via the MCPServerConfig class and its Stdio and HTTP variants.

For example, here is a Dockerfile that makes the server-memory MCP server available in the sandbox container:

FROM python:3.12-bookworm

# nodejs (required by mcp server)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y --no-install-recommends nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# memory mcp server
RUN npx --yes @modelcontextprotocol/server-memory --version

# run forever
CMD ["tail", "-f", "/dev/null"]

Note that we run the server once with npx during the image build so that the package is cached for offline use (below we’ll run it with the --offline option).

We can then use this MCP server in a task as follows:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.tool import MCPServerConfigStdio
from inspect_swe import claude_code

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(
                input="What transport protocols are supported in "
                + "the 2025-03-26 version of the MCP spec?"
            )
        ],
        solver=claude_code(
            system_prompt="Please use the web search tool to "
            + "research this question and the memory tools "
            + "to keep track of your research.",
            mcp_servers=[
                MCPServerConfigStdio(
                    name="memory",
                    command="npx",
                    args=[
                        "--offline",
                        "@modelcontextprotocol/server-memory"
                    ],
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )

Note that we run the MCP server using the --offline option so that it doesn’t require an internet connection (which it would normally use to check for updates to the package).
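
When checking by hand that the cached package starts without network access, it can help to see the exact command line that the stdio configuration implies. This is a minimal sketch, assuming the server is launched by running `command` followed by `args`. The shell_command() helper is hypothetical.

```python
import shlex

# Same command and args as in the MCPServerConfigStdio example above.
command = "npx"
args = ["--offline", "@modelcontextprotocol/server-memory"]


def shell_command(command: str, args: list[str]) -> str:
    """Join the command and its arguments into a shell-safe string."""
    return shlex.join([command, *args])


if __name__ == "__main__":
    print(shell_command(command, args))
```

Running the printed command inside the sandbox container is one way to confirm that the server starts from the npx cache.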

Bridged Tools

You can expose host-side Inspect tools to the sandboxed agent via MCP using the bridged_tools parameter. This allows you to run tools on the host (e.g. tools that access host resources, databases, or APIs) while making them available to the agent running inside the sandbox.

Tools are specified via BridgedToolsSpec, which wraps a list of Inspect tools:

from inspect_ai import Task, task
from inspect_ai.agent import BridgedToolsSpec
from inspect_ai.dataset import Sample
from inspect_ai.tool import tool
from inspect_swe import claude_code

@tool
def search_database():
    async def execute(query: str) -> str:
        """Search the internal database.

        Args:
            query: The search query.
        """
        # This runs on the host, not in the sandbox
        return f"Results for: {query}"
    return execute

@task
def investigator() -> Task:
    return Task(
        dataset=[
            Sample(input="Search for information about MCP protocols.")
        ],
        solver=claude_code(
            system_prompt="Use the search tool to research.",
            bridged_tools=[
                BridgedToolsSpec(
                    name="host_tools",
                    tools=[search_database()]
                )
            ]
        ),
        sandbox=("docker", "Dockerfile"),
    )

The name field identifies the MCP server and will be visible to the agent as a tool prefix. You can specify multiple BridgedToolsSpec instances to create separate MCP servers for different tool groups.
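
To make the prefix concrete, here is a small sketch of how bridged tool names might appear to the agent. It assumes Claude Code's usual `mcp__<server>__<tool>` naming for MCP tools. The mcp_tool_name() helper and the second tool group are hypothetical, and the exact names should be confirmed in an evaluation transcript.

```python
def mcp_tool_name(server: str, tool: str) -> str:
    """Name under which an MCP tool is assumed to appear to Claude Code."""
    return f"mcp__{server}__{tool}"


# Each key stands for the name of one BridgedToolsSpec (one MCP server).
groups = {
    "host_tools": ["search_database"],
    "ticketing": ["create_ticket"],  # hypothetical second tool group
}

if __name__ == "__main__":
    for server, tools in groups.items():
        for tool in tools:
            print(mcp_tool_name(server, tool))
```

Splitting tools into separately named groups in this way keeps related tools under a shared prefix.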

See the Bridged Tools documentation for more details on the architecture and how tool execution flows between host and sandbox.

Installation

By default, the agent will download the current stable version of Claude Code and copy it to the sandbox. You can override this behaviour using the version option:

Option	Description
`"auto"`	Use any available version of Claude Code in the sandbox, otherwise download the current stable version.
`"sandbox"`	Use the version of Claude Code in the sandbox (raises `RuntimeError` if not available in the sandbox).
`"stable"`	Download and use the current stable version.
`"latest"`	Download and use the very latest version.
`"x.x.x"`	Download and use a specific version number.

If you don’t ever want to rely on automatic downloads of Claude Code (e.g. if you run your evaluations offline), you can use one of two approaches:

1. Pre-install the version of Claude Code you want to use in the sandbox, then use version="sandbox":
   ```
   claude_code(version="sandbox")
   ```
2. Download the version of Claude Code you want to use into the cache, then specify that version explicitly:
   ```
   from inspect_swe import claude_code, download_agent_binary

   # download the agent binary during installation/configuration
   download_agent_binary("claude_code", "0.29.0", "linux-x64")

   # reference that version in your task (no download will occur)
   claude_code(version="0.29.0")
   ```
Note that the 5 most recently downloaded versions are retained in the cache. Use the cached_agent_binaries() function to list the contents of the cache.

Centaur Mode

The claude_code() agent can also be run in “centaur” mode, which uses the Inspect AI Human Agent as the solver and makes Claude Code available to the human user for help with the task. This lets you measure the performance of humans working collaboratively with a model, rather than strictly comparing human and model performance.

Enable centaur mode by passing centaur=True to the claude_code() agent:

from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_qa

from inspect_swe import claude_code

@task
def system_explorer() -> Task:
    return Task(
        dataset=json_dataset("dataset.json"),
        solver=claude_code(centaur=True),
        scorer=model_graded_qa(),
        sandbox="docker",
    )

You can also enable centaur mode from the CLI using a solver arg (-S):

inspect eval ctf.py --solver inspect_swe/claude_code -S centaur=true

To further customize the behavior of the human agent, pass CentaurOptions instead of True. For example:

from inspect_swe import CentaurOptions

Task(
    dataset=json_dataset("dataset.json"),
    solver=claude_code(centaur=CentaurOptions(answer=False)),
    scorer=model_graded_qa(),
    sandbox="docker",
)

See the human_cli() documentation for details on available options.

Troubleshooting

If Claude Code doesn’t appear to be working, or isn’t working as expected, you can troubleshoot by dumping the Claude Code debug log after an evaluation task is complete. You can do this with:

inspect trace dump --filter "Claude Code"