Scanner Tools
Overview
The LLM Scanner provides a high-level, batteries-included interface for building LLM-based scanners. Under the hood, it is composed of several lower-level tools: message numbering, prompt construction, answer generation, message extraction, and result reduction. You can use these tools directly when you need more control.
For straightforward scanning tasks, prefer llm_scanner() as it handles the entire pipeline for you. Use the lower-level scanner tools when you need to override or extend specific steps.
Here is roughly the implementation of llm_scanner(), written using the lower-level tools that we’ll cover in more depth below (the numbered annotations are explained after the listing):
```python
from inspect_scout import (
    Result, ResultReducer, Scanner, Transcript,
    generate_answer, message_numbering, scanner,
    scanner_prompt, transcript_messages
)

@scanner(messages="all")
def environment_scanner() -> Scanner[Transcript]:
    async def scan(transcript: Transcript) -> Result:
        # (1) setup numbering and references
        messages_as_str, extract_refs = message_numbering()

        # define question to ask llm
        question = (
            "Did the agent fail to complete the task due to "
            "a broken or misconfigured environment?"
        )

        # (2) scan segments (breaks up transcript to fit in context window)
        results: list[Result] = []
        async for segment in transcript_messages(
            transcript,
            messages_as_str=messages_as_str
        ):
            # (3) generate prompt using segment, question, and answer type
            prompt = scanner_prompt(
                messages=segment.messages_str,
                question=question,
                answer="boolean"
            )

            # (4) generate, extract answer/explanation and append results
            results.append(
                await generate_answer(
                    prompt,
                    "boolean",
                    extract_refs=extract_refs,
                )
            )

        # (5) reduce per-segment results into a single result
        return await ResultReducer.any(results)

    return scan
```

1. Set up message numbering (returns a function used to render numbered messages and a function used to extract message references from model explanations).
2. Iterate over transcript segments (if the transcript exceeds the scanning model’s context window it is broken into segments).
3. Build a prompt for the scanner using the numbered-messages string for this segment, the question, and the answer type.
4. Call the model to generate an answer, extracting references from its explanation using the `extract_refs` function.
5. When context-window segmentation produces multiple segments, reduce them into a single result. Here we use the `.any()` reducer, which returns `True` if any segment result was `True`.
Message Presentation
The message_numbering() function creates a pair of functions that share a closure-captured counter and message ID map. This enables consistent message numbering across multiple calls (e.g., across segments) and lets you extract [M1]-style references from model output back to the original messages.
```python
from inspect_scout import message_numbering

messages_as_str, extract_refs = message_numbering()
```

The returned functions work together:

- `messages_as_str(messages)`: renders a list of `ChatMessage` objects into a numbered string (e.g., `[M1] USER: ...`). The counter auto-increments across calls, so a second call continues from where the first left off.
- `extract_refs(text)`: resolves `[M1]`, `[M2]`, etc. citations in model output back to `Reference` objects pointing to the original messages.
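The pairing can be pictured as a closure over a shared counter and ID map. Below is a minimal, self-contained sketch of that pattern using stand-in dict messages; it is not inspect_scout's actual implementation:

```python
import re

# Sketch only: closure-captured counter and ID map shared by two functions
# (stand-in dict messages; NOT inspect_scout's implementation).
def make_numbering():
    ids: dict[str, str] = {}  # "M1" -> original message id
    counter = 0

    def messages_as_str(messages: list[dict]) -> str:
        nonlocal counter
        lines = []
        for message in messages:
            counter += 1
            ids[f"M{counter}"] = message["id"]
            lines.append(f"[M{counter}] {message['role'].upper()}: {message['content']}")
        return "\n".join(lines)

    def extract_refs(text: str) -> list[str]:
        # resolve [M1]-style citations back to original message ids
        return [ids[ref] for ref in re.findall(r"\[(M\d+)\]", text) if ref in ids]

    return messages_as_str, extract_refs

as_str, refs = make_numbering()
first = as_str([{"id": "a", "role": "user", "content": "hi"}])
second = as_str([{"id": "b", "role": "assistant", "content": "ok"}])  # continues at [M2]
```

Because both functions close over the same state, numbering continues across calls and citations in model output resolve back to the originally rendered messages.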
Preprocessing
You can pass a MessagesPreprocessor to control which message content is included in the rendered output:
```python
from inspect_scout import message_numbering, MessagesPreprocessor

messages_as_str, extract_refs = message_numbering(
    preprocessor=MessagesPreprocessor(
        exclude_system=True,      # exclude system messages (default)
        exclude_reasoning=True,   # exclude reasoning content
        exclude_tool_usage=True,  # exclude tool calls and output
    )
)
```

| Parameter | Description |
|---|---|
| `exclude_system` | Exclude system messages (defaults to `True`). |
| `exclude_reasoning` | Exclude reasoning content (defaults to `False`). |
| `exclude_tool_usage` | Exclude tool calls and output (defaults to `False`). |
| `transform` | Optional async function that takes the message list and returns a filtered list. |
When no preprocessor is provided, the default behaviour is to exclude system messages only.
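For filtering beyond the built-in flags, `transform` is simply an async function over the message list. Here is a minimal sketch of one, using stand-in dicts rather than real `ChatMessage` objects:

```python
import asyncio

# Hypothetical transform: drop messages with empty content
# (stand-in dicts; a real transform receives ChatMessage objects).
async def drop_empty(messages: list[dict]) -> list[dict]:
    return [m for m in messages if m.get("content")]

msgs = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": ""},
]
filtered = asyncio.run(drop_empty(msgs))
```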
Prompt Construction
The scanner_prompt() function renders the default scanner template—the same template used internally by llm_scanner(). It combines the transcript messages, question, and answer type into a complete prompt:
```python
from inspect_scout import scanner_prompt

prompt = scanner_prompt(
    messages=segment.messages_str,
    question="Did the assistant refuse the user's request?",
    answer="boolean",
)
```

Custom Prompts
For fully custom prompt templates, use answer_type() to resolve an AnswerSpec into an Answer object that exposes .prompt and .format strings:
```python
from inspect_scout import answer_type

answer = answer_type("boolean")
# answer.prompt -> "Answer the following yes or no question..."
# answer.format -> "The last line of your response should be..."

prompt = f"""Here is a conversation:
{segment.messages_str}
{answer.prompt}
{question}
{answer.format}
"""
```

This gives you complete control over prompt structure while reusing the answer-type-specific formatting instructions.
Answer Generation
The generate_answer() function calls the LLM and parses the response into a Result. It supports all answer types, handles refusal retries, and uses tool-based generation for structured answers:
```python
from inspect_scout import generate_answer

result = await generate_answer(
    prompt,
    answer="boolean",
    extract_refs=extract_refs,
)
```

| Parameter | Description |
|---|---|
| `prompt` | The scanning prompt (string or message list). |
| `answer` | Answer specification (`"boolean"`, `"numeric"`, `"string"`, `list[str]`, `AnswerMultiLabel`, or `AnswerStructured`). |
| `model` | Model to use for generation. Defaults to the current default model. |
| `retry_refusals` | Number of times to retry on content filter refusals (default 3). |
| `parse` | When `True` (default), parse the output into a `Result`. When `False`, return the raw `ModelOutput`. |
| `extract_refs` | Function to extract `[M1]`-style references from the explanation. Only used when `parse=True`. |
| `value_to_float` | Optional function to convert the parsed value to a float. Only used when `parse=True`. |
Answer Parsing
The parse_answer() function provides pure parsing without making an LLM call. Use this when you generate model output through your own code (e.g., via get_model().generate()) but want to use the standard answer extraction logic:
```python
from inspect_ai.model import get_model

from inspect_scout import parse_answer

# generate with your own code
model = get_model("openai/gpt-4o")
output = await model.generate(prompt)

# parse using standard answer extraction
result = parse_answer(
    output,
    answer="boolean",
    extract_refs=extract_refs,
)
```

| Parameter | Description |
|---|---|
| `output` | The `ModelOutput` to parse. |
| `answer` | Answer specification (same types as `generate_answer()`). |
| `extract_refs` | Function to extract `[M1]`-style references from the explanation text. |
| `value_to_float` | Optional function to convert the parsed value to a float. |
Answer Types
The answer parameter accepted by scanner_prompt(), generate_answer(), and parse_answer() is an AnswerSpec—a union type that determines how the LLM is prompted, how answers are extracted, and the Python type of the result value:
| Type | LLM Output | Result Type |
|---|---|---|
| boolean | ANSWER: yes | bool |
| numeric | ANSWER: 10 | float |
| string | ANSWER: brown fox | str |
| label | ANSWER: C | str |
| labels (multiple) | ANSWER: C, D | list[str] |
| structured | JSON object | dict[str,JsonValue] |
Multi-Label Classification
Pass list[str] to prompt the model to select a single label. To allow multiple label selections, wrap the labels in AnswerMultiLabel:
```python
from inspect_scout import AnswerMultiLabel

answer = AnswerMultiLabel(labels=[
    "Factual error",
    "Refusal",
    "Off-topic response",
    "Hallucination",
])
```

Label values (A, B, C, …) are assigned automatically based on position in the list.
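The positional assignment, and the resolution of a multi-label answer back to label text, can be sketched in a few lines (illustrative only, not the library's internals):

```python
from string import ascii_uppercase

labels = ["Factual error", "Refusal", "Off-topic response", "Hallucination"]

# assign letter values by position: A, B, C, D
letter_to_label = dict(zip(ascii_uppercase, labels))

# resolve a multi-label answer such as "C, D" back to label text
selected = [letter_to_label[x.strip()] for x in "C, D".split(",")]
```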
Structured Answers
AnswerStructured uses tool-based generation to produce JSON output conforming to a Pydantic model:
```python
from pydantic import BaseModel, Field

from inspect_scout import AnswerStructured

class CyberLint(BaseModel):
    misconfiguration: bool = Field(
        description="Was the environment misconfigured?"
    )
    tool_errors: int = Field(
        description="How many tool errors were encountered?"
    )

answer = AnswerStructured(type=CyberLint)
```

| Parameter | Description |
|---|---|
| `type` | Pydantic `BaseModel` subclass, or `list[Model]` for multiple results. |
| `answer_tool` | Name of the tool provided to the model (default `"answer"`). |
| `answer_prompt` | Template for the prompt that precedes the question. |
| `answer_format` | Template for instructions on how to respond. |
| `max_attempts` | Maximum retries for correct schema generation (default 3). |
See Structured Answers in the LLM Scanner documentation for details on Pydantic field aliases (value, label, explanation), multiple results via list[Model], and value_to_float usage.
Message Extraction
The transcript_messages() function is the primary API for extracting pre-rendered message segments from a transcript. It automatically selects the best extraction strategy based on what data is available:
- If timelines are present, it walks the span tree and yields per-span segments.
- If only events are present, it extracts messages from events with compaction handling.
- If only messages are present, it segments them by context window.
```python
from inspect_scout import message_numbering, transcript_messages

messages_as_str, extract_refs = message_numbering()

async for segment in transcript_messages(
    transcript,
    messages_as_str=messages_as_str,
):
    # segment.messages -> list[ChatMessage]
    # segment.messages_str -> pre-rendered numbered string
    # segment.segment -> 0-based segment index
    ...
```

| Parameter | Description |
|---|---|
| `transcript` | The `Transcript` to extract messages from. |
| `messages_as_str` | Rendering function from `message_numbering()`. |
| `model` | Model used for token counting and context window lookup. |
| `context_window` | Override the model’s detected context window size (in tokens). |
| `compaction` | How to handle compaction boundaries: `"all"` (default) merges across boundaries; `"last"` uses only the final segment. |
| `depth` | Maximum depth of the span tree to process (timelines only). `None` recurses without limit. |
| `include_scorers` | Whether to include scorer events in extraction (default `False`). |
segment_messages()
When you have a source other than a full Transcript—a list of messages, a list of events, or a TimelineSpan—use segment_messages() to render and split them into context-window-sized segments:
```python
from inspect_scout import message_numbering, segment_messages

messages_as_str, extract_refs = message_numbering()

async for segment in segment_messages(
    source,  # list[ChatMessage], list[Event], or TimelineSpan
    messages_as_str=messages_as_str,
):
    # segment.messages -> list[ChatMessage]
    # segment.messages_str -> pre-rendered numbered string
    # segment.segment -> 0-based segment index
    ...
```

When given events or a TimelineSpan, it delegates to span_messages() internally to extract and merge messages (handling compaction boundaries), then segments the result by token count.

| Parameter | Description |
|---|---|
| `source` | A `list[ChatMessage]`, `list[Event]`, or `TimelineSpan`. |
| `messages_as_str` | Rendering function from `message_numbering()`. |
| `model` | Model used for token counting and context window lookup. |
| `context_window` | Override the model’s detected context window size (in tokens). |
| `compaction` | How to handle compaction boundaries (passed to `span_messages()`). |
span_messages()
The lowest-level extraction function. It takes a Timeline, TimelineSpan, or raw list[Event], extracts ChatMessage objects from ModelEvents, and handles compaction boundaries—all synchronously, without rendering or segmentation:
```python
from inspect_scout import span_messages

# From a TimelineSpan
messages = span_messages(span)

# From a list of events
messages = span_messages(events, compaction="last")

# Split into per-compaction-region lists
regions = span_messages(span, split_compactions=True)
```

| Parameter | Description |
|---|---|
| `source` | A `Timeline`, `TimelineSpan`, or `list[Event]`. |
| `compaction` | How to handle compaction boundaries: `"all"` (default) merges across boundaries; `"last"` uses only the final `ModelEvent`’s messages; an int keeps the last N compaction regions. |
| `split_compactions` | When `True`, returns `list[list[ChatMessage]]` with one inner list per compaction region instead of a flat merged list. |
Use span_messages() when you need raw messages without rendering or token-based segmentation—for example, to inspect message content directly, apply your own formatting, or count messages before deciding how to process them.
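The compaction options can be pictured with a small sketch: given per-region message lists, "all" merges everything, "last" keeps only the final region, and an int keeps the last N regions. This is illustrative only, not the library's implementation:

```python
# Sketch of compaction-region selection; each inner list stands in for
# the messages of one compaction region.
def select_regions(regions: list[list[str]], compaction="all") -> list[str]:
    if compaction == "all":
        keep = regions
    elif compaction == "last":
        keep = regions[-1:]
    else:  # an int: keep the last N regions
        keep = regions[-compaction:]
    return [m for region in keep for m in region]
```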
Context Chunking
When a transcript’s messages exceed the scanning model’s context window, transcript_messages() and segment_messages() automatically split them into segments sized to fit within 80% of the model’s available context. Each segment is scanned independently with the same prompt. Use the context_window parameter on transcript_messages() or segment_messages() to override the model’s detected context window size.
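In outline, this segmentation is a greedy packing by token budget. The sketch below approximates tokens as characters divided by four purely for illustration; the real implementation counts tokens properly against the model's context window:

```python
# Illustrative sketch of context-window segmentation (not the library's code):
# greedily pack messages into segments whose estimated token count stays
# within 80% of the context window.
def segment_by_budget(messages: list[str], context_window: int) -> list[list[str]]:
    budget = int(context_window * 0.8)
    segments: list[list[str]] = [[]]
    used = 0
    for msg in messages:
        tokens = max(1, len(msg) // 4)  # crude token estimate
        if used + tokens > budget and segments[-1]:
            segments.append([])
            used = 0
        segments[-1].append(msg)
        used += tokens
    return segments
```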
Result Reduction
Because context-window segmentation can produce multiple results from a single source, those per-segment results need to be combined using a reducer. The ResultReducer class provides static async reducers for this purpose.
| Reducer | Behaviour |
|---|---|
| `ResultReducer.any` | Boolean OR: `True` if any result is `True`. |
| `ResultReducer.mean` | Arithmetic mean of numeric values. |
| `ResultReducer.median` | Median of numeric values. |
| `ResultReducer.mode` | Mode (most common) of numeric values. |
| `ResultReducer.max` | Maximum of numeric values. |
| `ResultReducer.min` | Minimum of numeric values. |
| `ResultReducer.union` | Union of list values, deduplicated, preserving order. |
| `ResultReducer.majority` | Most common value, with last-result tiebreaker. |
| `ResultReducer.last` | Returns the last result with merged auxiliary fields. |
All reducers are async functions with signature (list[Result]) -> Result.
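For example, the order-preserving dedup behaviour described for ResultReducer.union can be sketched over plain lists standing in for each segment's Result.value (illustrative only, not the library's code):

```python
# Sketch of union-with-dedup, preserving first-seen order across segments.
def union(values: list[list[str]]) -> list[str]:
    seen: set[str] = set()
    merged: list[str] = []
    for segment_values in values:
        for value in segment_values:
            if value not in seen:
                seen.add(value)
                merged.append(value)
    return merged
```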
LLM-Based Reduction
For cases where statistical reduction isn’t sufficient, ResultReducer.llm() returns a reducer that uses an LLM to synthesize segment results. It automatically formats each segment’s answer, value, and explanation into a prompt, asks the model to synthesize them, and returns a single combined result. It is the default reducer for "string" answer types.
```python
from inspect_scout import ResultReducer, llm_scanner, scanner

@scanner()
def synthesized_analysis():
    return llm_scanner(
        question="Summarize the agent's key decisions.",
        answer="string",
        reducer=ResultReducer.llm(model="openai/gpt-5"),
    )
```

You can pass a custom prompt to guide how the model combines segment results. The per-segment answers, values, and explanations are automatically appended after your prompt:
```python
@scanner()
def security_summary():
    return llm_scanner(
        question="Identify any security concerns in the agent's actions.",
        answer="string",
        reducer=ResultReducer.llm(
            prompt=(
                "You are reviewing security findings from different "
                "segments of a long agent conversation. Prioritize the "
                "most severe issues, deduplicate overlapping findings, "
                "and produce a concise summary ordered by severity."
            ),
        ),
    )
```

Custom Reducers
You can also write a custom reducer — any async function with the signature (list[Result]) -> Result. Each Result in the list has .value, .answer, .explanation, and .references from its segment:
```python
from inspect_scout import Result, llm_scanner, scanner

async def worst_score(results: list[Result]) -> Result:
    """Keep the result with the lowest numeric value."""
    return min(results, key=lambda r: r.value)

@scanner()
def strictest_scan():
    return llm_scanner(
        question="Rate the quality of the agent's output (1-10).",
        answer="numeric",
        reducer=worst_score,
    )
```

Note that result reduction applies only to context-window segments within a single source. When scanning timelines with multiple spans, each span produces its own result independently; these are collected into a ResultSet without reduction.