Scanner Tools

Overview

The LLM Scanner provides a high-level, batteries-included interface for building LLM-based scanners. Under the hood, it is composed of several lower-level tools: message numbering, prompt construction, answer generation, message extraction, and result reduction. You can use these tools directly when you need more control.

For straightforward scanning tasks, prefer llm_scanner(), as it handles the entire pipeline for you. Use the lower-level scanner tools when you need to override or extend specific steps.

Here, roughly, is the implementation of llm_scanner(), written with the lower-level tools that we’ll cover in more depth below (the numbered notes after the code explain each step):

from inspect_scout import (
    Result, ResultReducer, Scanner, Transcript, 
    generate_answer, message_numbering, scanner, 
    scanner_prompt, transcript_messages
)

@scanner(messages="all")
def environment_scanner() -> Scanner[Transcript]:

    async def scan(transcript: Transcript) -> Result:

        # set up numbering and references
        messages_as_str, extract_refs = message_numbering()

        # define question to ask llm
        question = (
            "Did the agent fail to complete the task due to "
            "a broken or misconfigured environment?"
        )

        # scan segments (breaks up transcript to fit in context window)
        results: list[Result] = []
        async for segment in transcript_messages(
            transcript,
            messages_as_str=messages_as_str
        ):
            # generate prompt using segment, question, and answer type
            prompt = scanner_prompt(
                messages=segment.messages_str,
                question=question,
                answer="boolean"
            )

            # generate, extract answer/explanation and append results
            results.append(
                await generate_answer(
                    prompt,
                    "boolean",
                    extract_refs=extract_refs,
                )
            )

        # reduce per-segment results into a single result
        return await ResultReducer.any(results)

    return scan
1. Set up message numbering (returns a function for rendering messages as a numbered string and a function for extracting references to messages from model explanations).
2. Iterate over transcript segments (if the transcript exceeds the scanning model’s context window, it is broken into segments).
3. Build a prompt for the scanner from the numbered message string for this segment, the question, and the answer type.
4. Call the model to generate an answer, extracting references from its explanation using the extract_refs function.
5. When context-window segmentation produces multiple segments, reduce their results into a single result. Here we use the .any() reducer, which returns True if any segment result was True.

Message Presentation

The message_numbering() function creates a pair of functions that share a closure-captured counter and message ID map. This enables consistent message numbering across multiple calls (e.g., across segments) and lets you resolve [M1]-style references in model output back to the original messages.

from inspect_scout import message_numbering

messages_as_str, extract_refs = message_numbering()

The returned functions work together:

  • messages_as_str(messages) — Renders a list of ChatMessage objects into a numbered string (e.g., [M1] USER: ...). The counter auto-increments across calls, so a second call continues from where the first left off.
  • extract_refs(text) — Resolves [M1], [M2], etc. citations in model output back to Reference objects pointing to the original messages.
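
For example, a minimal sketch of the pair in use (message classes are from inspect_ai.model):

from inspect_ai.model import ChatMessageAssistant, ChatMessageUser
from inspect_scout import message_numbering

messages_as_str, extract_refs = message_numbering()

# renders "[M1] USER: ..." / "[M2] ASSISTANT: ..."; a later call to
# messages_as_str() would continue numbering at [M3]
rendered = messages_as_str([
    ChatMessageUser(content="What is 2 + 2?"),
    ChatMessageAssistant(content="4"),
])

# resolve the [M2] citation back to a Reference to the original message
refs = extract_refs("The assistant answered correctly [M2].")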

Preprocessing

You can pass a MessagesPreprocessor to control which message content is included in the rendered output:

from inspect_scout import message_numbering, MessagesPreprocessor

messages_as_str, extract_refs = message_numbering(
    preprocessor=MessagesPreprocessor(
        exclude_system=True,       # exclude system messages (the default)
        exclude_reasoning=True,    # exclude reasoning content (default False)
        exclude_tool_usage=True,   # exclude tool calls and output (default False)
    )
)

  • exclude_system — Exclude system messages (defaults to True).
  • exclude_reasoning — Exclude reasoning content (defaults to False).
  • exclude_tool_usage — Exclude tool calls and output (defaults to False).
  • transform — Optional async function that takes the message list and returns a filtered list.

When no preprocessor is provided, the default behaviour is to exclude system messages only.
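
The transform option gives full control over which messages are rendered. A minimal sketch (drop_empty_messages is a hypothetical filter):

from inspect_ai.model import ChatMessage
from inspect_scout import message_numbering, MessagesPreprocessor

async def drop_empty_messages(messages: list[ChatMessage]) -> list[ChatMessage]:
    # filter out messages with no text content
    return [m for m in messages if m.text.strip()]

messages_as_str, extract_refs = message_numbering(
    preprocessor=MessagesPreprocessor(transform=drop_empty_messages)
)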

Prompt Construction

The scanner_prompt() function renders the default scanner template—the same template used internally by llm_scanner(). It combines the transcript messages, question, and answer type into a complete prompt:

from inspect_scout import scanner_prompt

prompt = scanner_prompt(
    messages=segment.messages_str,
    question="Did the assistant refuse the user's request?",
    answer="boolean",
)

Custom Prompts

For fully custom prompt templates, use answer_type() to resolve an AnswerSpec into an Answer object that exposes .prompt and .format strings:

from inspect_scout import answer_type

answer = answer_type("boolean")
# answer.prompt  -> "Answer the following yes or no question..."
# answer.format  -> "The last line of your response should be..."

prompt = f"""Here is a conversation:

{segment.messages_str}

{answer.prompt}

{question}

{answer.format}
"""

This gives you complete control over prompt structure while reusing the answer-type-specific formatting instructions.

Answer Generation

The generate_answer() function calls the LLM and parses the response into a Result. It supports all answer types, handles refusal retries, and uses tool-based generation for structured answers:

from inspect_scout import generate_answer

result = await generate_answer(
    prompt,
    answer="boolean",
    extract_refs=extract_refs,
)

  • prompt — The scanning prompt (string or message list).
  • answer — Answer specification ("boolean", "numeric", "string", list[str], AnswerMultiLabel, or AnswerStructured).
  • model — Model to use for generation. Defaults to the current default model.
  • retry_refusals — Number of times to retry on content filter refusals (default 3).
  • parse — When True (default), parse the output into a Result. When False, return the raw ModelOutput.
  • extract_refs — Function to extract [M1]-style references from the explanation. Only used when parse=True.
  • value_to_float — Optional function to convert the parsed value to a float. Only used when parse=True.
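
For example, value_to_float can map a 1-10 numeric answer onto a 0-1 scale (a sketch; the conversion is illustrative):

result = await generate_answer(
    prompt,
    answer="numeric",
    # map a 1-10 rating onto 0-1
    value_to_float=lambda v: float(v) / 10,
)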

Answer Parsing

The parse_answer() function provides pure parsing without making an LLM call. Use this when you generate model output through your own code (e.g., via get_model().generate()) but want to use the standard answer extraction logic:

from inspect_ai.model import get_model
from inspect_scout import parse_answer

# generate with your own code
model = get_model("openai/gpt-4o")
output = await model.generate(prompt)

# parse using standard answer extraction
result = parse_answer(
    output,
    answer="boolean",
    extract_refs=extract_refs,
)

  • output — The ModelOutput to parse.
  • answer — Answer specification (same types as generate_answer()).
  • extract_refs — Function to extract [M1]-style references from the explanation text.
  • value_to_float — Optional function to convert the parsed value to a float.

Answer Types

The answer parameter accepted by scanner_prompt(), generate_answer(), and parse_answer() is an AnswerSpec—a union type that determines how the LLM is prompted, how answers are extracted, and the Python type of the result value:

Type                 LLM Output          Result Type
boolean              ANSWER: yes         bool
numeric              ANSWER: 10          float
string               ANSWER: brown fox   str
label                ANSWER: C           str
labels (multiple)    ANSWER: C, D        list[str]
structured           JSON object         dict[str, JsonValue]

Multi-Label Classification

Pass list[str] to prompt the model to select a single label. To allow multiple label selections, wrap the labels in AnswerMultiLabel:

from inspect_scout import AnswerMultiLabel

answer = AnswerMultiLabel(labels=[
    "Factual error",
    "Refusal",
    "Off-topic response",
    "Hallucination",
])

Label values (A, B, C, …) are assigned automatically based on position in the list.
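
The spec is then passed wherever an AnswerSpec is accepted, for example to generate_answer():

result = await generate_answer(prompt, answer=answer)
# result.value is a list[str] (see the answer types table above)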

Structured Answers

AnswerStructured uses tool-based generation to produce JSON output conforming to a Pydantic model:

from pydantic import BaseModel, Field
from inspect_scout import AnswerStructured

class CyberLint(BaseModel):
    misconfiguration: bool = Field(
        description="Was the environment misconfigured?"
    )
    tool_errors: int = Field(
        description="How many tool errors were encountered?"
    )

answer = AnswerStructured(type=CyberLint)

  • type — Pydantic BaseModel subclass, or list[Model] for multiple results.
  • answer_tool — Name of the tool provided to the model (default "answer").
  • answer_prompt — Template for the prompt that precedes the question.
  • answer_format — Template for instructions on how to respond.
  • max_attempts — Maximum retries for correct schema generation (default 3).
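
As with the other answer types, the spec is passed via the answer parameter:

result = await generate_answer(prompt, answer=answer)
# result.value is a dict[str, JsonValue] matching the CyberLint schema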

See Structured Answers in the LLM Scanner documentation for details on Pydantic field aliases (value, label, explanation), multiple results via list[Model], and value_to_float usage.

Message Extraction

The transcript_messages() function is the primary API for extracting pre-rendered message segments from a transcript. It automatically selects the best extraction strategy based on what data is available:

  • If timelines are present, it walks the span tree and yields per-span segments.
  • If only events are present, it extracts messages from events with compaction handling.
  • If only messages are present, it segments them by context window.

from inspect_scout import message_numbering, transcript_messages

messages_as_str, extract_refs = message_numbering()

async for segment in transcript_messages(
    transcript,
    messages_as_str=messages_as_str,
):
    # segment.messages     -> list[ChatMessage]
    # segment.messages_str -> pre-rendered numbered string
    # segment.segment      -> 0-based segment index
    ...

  • transcript — The Transcript to extract messages from.
  • messages_as_str — Rendering function from message_numbering().
  • model — Model used for token counting and context window lookup.
  • context_window — Override the model’s detected context window size (in tokens).
  • compaction — How to handle compaction boundaries: "all" (default) merges across boundaries; "last" uses only the final segment.
  • depth — Maximum depth of the span tree to process (timelines only). None recurses without limit.
  • include_scorers — Whether to include scorer events in extraction (default False).
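
For example, a sketch that scans only the final compaction region and also includes scorer events:

async for segment in transcript_messages(
    transcript,
    messages_as_str=messages_as_str,
    compaction="last",      # use only the final compaction segment
    include_scorers=True,   # include scorer events in extraction
):
    ...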

segment_messages()

When you have a source other than a full Transcript—a list of messages, a list of events, or a TimelineSpan—use segment_messages() to render and split them into context-window-sized segments:

from inspect_scout import message_numbering, segment_messages

messages_as_str, extract_refs = message_numbering()

async for segment in segment_messages(
    source,                            # list[ChatMessage], list[Event], or TimelineSpan
    messages_as_str=messages_as_str,
):
    # segment.messages     -> list[ChatMessage]
    # segment.messages_str -> pre-rendered numbered string
    # segment.segment      -> 0-based segment index
    ...

When given events or a TimelineSpan, it delegates to span_messages() internally to extract and merge messages (handling compaction boundaries), then segments the result by token count.

  • source — A list[ChatMessage], list[Event], or TimelineSpan.
  • messages_as_str — Rendering function from message_numbering().
  • model — Model used for token counting and context window lookup.
  • context_window — Override the model’s detected context window size (in tokens).
  • compaction — How to handle compaction boundaries (passed to span_messages()).

span_messages()

The lowest-level extraction function. It takes a Timeline, TimelineSpan, or raw list[Event], extracts ChatMessage objects from ModelEvents, and handles compaction boundaries—all synchronously, without rendering or segmentation:

from inspect_scout import span_messages

# From a TimelineSpan
messages = span_messages(span)

# From a list of events
messages = span_messages(events, compaction="last")

# Split into per-compaction-region lists
regions = span_messages(span, split_compactions=True)

  • source — A Timeline, TimelineSpan, or list[Event].
  • compaction — How to handle compaction boundaries: "all" (default) merges across boundaries; "last" uses only the final ModelEvent’s messages; an int keeps the last N compaction regions.
  • split_compactions — When True, returns list[list[ChatMessage]] with one inner list per compaction region instead of a flat merged list.

Use span_messages() when you need raw messages without rendering or token-based segmentation—for example, to inspect message content directly, apply your own formatting, or count messages before deciding how to process them.
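
For example, a sketch that counts messages before deciding how to process a span:

from inspect_scout import segment_messages, span_messages

messages = span_messages(span)
if len(messages) > 100:  # arbitrary threshold for illustration
    # long span: fall back to token-based segmentation
    async for segment in segment_messages(
        messages, messages_as_str=messages_as_str
    ):
        ...
else:
    # short span: render in one shot
    rendered = messages_as_str(messages)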

Context Chunking

When a transcript’s messages exceed the scanning model’s context window, transcript_messages() and segment_messages() automatically split them into segments sized to fit within 80% of the model’s available context. Each segment is scanned independently with the same prompt. Use the context_window parameter on transcript_messages() or segment_messages() to override the model’s detected context window size.
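
For example, to force smaller segments with an explicit token budget (the 32,000 figure is arbitrary):

async for segment in transcript_messages(
    transcript,
    messages_as_str=messages_as_str,
    context_window=32_000,  # override the detected context window (tokens)
):
    ...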

Result Reduction

Because context-window segmentation can produce multiple results from a single source, those per-segment results need to be combined using a reducer. The ResultReducer class provides static async reducers for this purpose.

Reducer                   Behaviour
ResultReducer.any         Boolean OR—True if any result is True.
ResultReducer.mean        Arithmetic mean of numeric values.
ResultReducer.median      Median of numeric values.
ResultReducer.mode        Mode (most common) of numeric values.
ResultReducer.max         Maximum of numeric values.
ResultReducer.min         Minimum of numeric values.
ResultReducer.union       Union of list values, deduplicated, preserving order.
ResultReducer.majority    Most common value, with last-result tiebreaker.
ResultReducer.last        Returns the last result with merged auxiliary fields.

All reducers are async functions with signature (list[Result]) -> Result.
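
For example, averaging per-segment numeric results in an llm_scanner()-based scanner:

from inspect_scout import ResultReducer, llm_scanner, scanner

@scanner()
def average_rating():
    return llm_scanner(
        question="Rate how well the agent followed instructions (1-10).",
        answer="numeric",
        reducer=ResultReducer.mean,
    )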

LLM-Based Reduction

For cases where statistical reduction isn’t sufficient, ResultReducer.llm() returns a reducer that uses an LLM to synthesize segment results. It automatically formats each segment’s answer, value, and explanation into a prompt, asks the model to synthesize them, and returns a single combined result. It is the default reducer for "string" answer types.

from inspect_scout import ResultReducer, llm_scanner, scanner

@scanner()
def synthesized_analysis():
    return llm_scanner(
        question="Summarize the agent's key decisions.",
        answer="string",
        reducer=ResultReducer.llm(model="openai/gpt-5"),
    )

You can pass a custom prompt to guide how the model combines segment results. The per-segment answers, values, and explanations are automatically appended after your prompt:

@scanner()
def security_summary():
    return llm_scanner(
        question="Identify any security concerns in the agent's actions.",
        answer="string",
        reducer=ResultReducer.llm(
            prompt=(
                "You are reviewing security findings from different "
                "segments of a long agent conversation. Prioritize the "
                "most severe issues, deduplicate overlapping findings, "
                "and produce a concise summary ordered by severity."
            ),
        ),
    )

Custom Reducers

You can also write a custom reducer — any async function with the signature (list[Result]) -> Result. Each Result in the list has .value, .answer, .explanation, and .references from its segment:

from inspect_scout import Result, llm_scanner, scanner

async def worst_score(results: list[Result]) -> Result:
    """Keep the result with the lowest numeric value."""
    return min(results, key=lambda r: r.value)

@scanner()
def strictest_scan():
    return llm_scanner(
        question="Rate the quality of the agent's output (1-10).",
        answer="numeric",
        reducer=worst_score,
    )

Note that result reduction applies only to context-window segments within a single source. When scanning timelines with multiple spans, each span produces its own result independently — these are collected into a ResultSet without reduction.