Capturing Transcripts

Overview

The observe() function, used as a decorator or as an async context manager, captures LLM calls and writes transcripts directly to a database. This is useful for recording agent interactions during development and for building transcript datasets from custom inference pipelines.

observe() can be used either with the Inspect AI Model API or with the native SDKs from OpenAI, Anthropic, and Google.

Use observe() as a decorator on async functions:

from inspect_ai.model import get_model, ChatMessageUser
from inspect_scout import observe, observe_update

@observe(db="./transcripts", task_set="my_eval", task_id="case_1")
async def run_case():
    model = get_model("openai/gpt-4o")
    response = await model.generate([
        ChatMessageUser(content="What is the capital of France?")
    ])
    observe_update(score=1.0, success=True)
    return response.completion

Or use it as an async context manager:

async with observe(db="./transcripts", task_set="my_eval"):
    model = get_model("openai/gpt-4o")
    response = await model.generate([
        ChatMessageUser(content="What is the capital of France?")
    ])

The db parameter accepts either a path string or a TranscriptsDB instance. If omitted, it defaults to the project’s configured transcripts directory (or ./transcripts if none is configured).

Providers

Above we demonstrated capturing transcripts when using the Inspect Model API. You can also capture LLM generations made with a native SDK by specifying an alternate provider (“openai”, “anthropic”, or “google”).

For example, here we observe() transcripts generated with the Anthropic SDK:

import anthropic
from inspect_scout import observe, observe_update

@observe(db="./transcripts", provider="anthropic")
async def run_case():
    client = anthropic.AsyncAnthropic()
    response = await client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{
            "role": "user", 
            "content": "What is the capital of France?"
        }],
    )
    observe_update(score=1.0, success=True)
    return response.content[0].text

The provider parameter tells observe() to patch the specified SDK to capture LLM calls. Available providers include:

Provider    SDK           Description
----------  ------------  ------------------------------------------
inspect     Inspect AI    Inspect Model API (default)
openai      OpenAI        OpenAI Chat Completions and Responses APIs
anthropic   Anthropic     Anthropic Messages API
google      Google GenAI  Google Gemini API
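
For example, here is a minimal sketch of capturing transcripts generated with the OpenAI SDK (the model name and scoring values are illustrative):

from openai import AsyncOpenAI
from inspect_scout import observe, observe_update

@observe(db="./transcripts", provider="openai")
async def run_case():
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "What is the capital of France?"
        }],
    )
    observe_update(score=1.0, success=True)
    return response.choices[0].message.content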

You can also implement a custom provider by implementing the ObserveProvider protocol.

Decorator Usage

The observe() decorator supports several usage patterns:

# without parameters (uses project default db)
@observe
async def my_function():
    ...

# with parameters
@observe(task_set="eval", task_id="case_1")
async def my_function():
    ...

# with explicit database path
@observe(db="./my_transcripts", task_set="eval")
async def my_function():
    ...

Parameters

The observe() function accepts the following parameters.

Parameter      Type                 Description
-------------  -------------------  -----------------------------------------------------------------
db             str | TranscriptsDB  Database path or instance. Only valid on outermost observe.
source_type    str                  Type of source for transcript. Defaults to “observe”.
source_id      str                  Globally unique ID for transcript source.
source_uri     str                  URI for source data.
task_set       str                  Set from which transcript task was drawn (e.g., benchmark name).
task_id        str                  Identifier for task (e.g., dataset sample id).
task_repeat    int                  Repeat number for a given task id (e.g., epoch).
model          str                  Main model used by agent.
model_options  dict                 Generation options for main model.
agent          str                  Agent used to execute task.
agent_args     dict                 Arguments passed to create agent.
metadata       dict                 Additional metadata (merged with parent context).

The task_set and task_id parameters are not required but are a good way of providing context on transcripts (e.g., in evaluations these often correspond to the dataset name and sample id).
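
For example, here is a sketch that records several of these fields alongside the generation (the task names, model, and metadata values are purely illustrative):

@observe(
    db="./transcripts",
    task_set="math_benchmark",
    task_id="sample_42",
    task_repeat=1,
    model="openai/gpt-4o",
    metadata={"run": "2025-01-15"},
)
async def run_case():
    ...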

The following Transcript fields are automatically populated when a transcript is written:

Field           Description
--------------  ----------------------------------------------------------------------
transcript_id   Unique UUID generated for each transcript
date            UTC timestamp when the context exited
total_time      Duration of the observe context in seconds
message_count   Number of messages in the transcript
total_tokens    Sum of tokens from all model calls
model           Model name from the final generation (if not explicitly set)
model_options   Generation config from the final generation (if not explicitly set)
error           Error message if an exception occurred
events          Events which occurred during execution (e.g. ModelEvent, ToolEvent)
messages        Input and output message(s) of the final generation in the transcript

Updating Fields

Use observe_update() to set transcript fields after execution, which is useful for recording scores or outcomes:

@observe(db="./transcripts", task_set="eval")
async def run_and_score():
    model = get_model("openai/gpt-4o")
    response = await model.generate([
        ChatMessageUser(content="Solve: 2 + 2 = ?")
    ])

    # Evaluate and update transcript
    correct = "4" in response.completion
    observe_update(
        score=1.0 if correct else 0.0,
        success=correct,
        metadata={"answer": response.completion}
    )

    return response.completion

You can call observe_update() multiple times—fields are merged (metadata is combined, other fields are overwritten).
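
For instance, in this sketch the second call overwrites score while the two metadata dicts are combined (the metadata keys are illustrative):

observe_update(score=0.5, metadata={"attempt": 1})
observe_update(score=1.0, metadata={"grader": "exact_match"})
# resulting transcript: score=1.0, metadata={"attempt": 1, "grader": "exact_match"}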

Nested Contexts

Nested observe contexts support batch processing where an outer context sets shared parameters and inner contexts represent individual transcripts:

@observe(db="./transcripts", task_set="cybench")
async def run_evaluation():
    model = get_model("openai/gpt-4o")

    for case in test_cases:
        async with observe(task_id=case.id):
            response = await model.generate([
                ChatMessageUser(content=case.prompt)
            ])
            score = evaluate(response, case.expected)
            observe_update(score=score, success=score > 0.8)

Key behaviors for nested contexts:

  1. Inheritance: Inner contexts inherit parameters from outer contexts.
  2. Leaf detection: Only the innermost context (the “leaf”) writes a transcript.
  3. Database scope: The db parameter can only be set on the outermost context.
  4. Metadata merging: The metadata dict is merged across all levels (see the sketch below).
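
For example, in this sketch (metadata keys are illustrative) the inner context is the leaf, so it writes the transcript with parameters inherited from both levels:

async with observe(db="./transcripts", task_set="demo", metadata={"run": "v1"}):
    async with observe(task_id="case_1", metadata={"variant": "a"}):
        ...
        # transcript written with task_set="demo", task_id="case_1",
        # and metadata={"run": "v1", "variant": "a"}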

Error Handling

Exceptions within an observe context are caught, logged, and saved to the transcript’s error field. The exception is suppressed to allow batch processing to continue:

@observe(db="./transcripts", task_set="eval")
async def run_with_error():
    model = get_model("openai/gpt-4o")
    await model.generate([ChatMessageUser(content="Hello")])
    raise ValueError("Something went wrong")
    # Transcript is still written with error="Something went wrong"

# This completes without raising
await run_with_error()

Example: Evaluation

Here’s a complete example of running a batch evaluation with parallel processing:

import asyncio
from inspect_ai.model import get_model, ChatMessageUser
from inspect_scout import observe, observe_update

async def main():
    test_cases = [
        {"id": "math_1", "prompt": "What is 2+2?", "expected": "4"},
        {"id": "math_2", "prompt": "What is 3*3?", "expected": "9"},
        {"id": "math_3", "prompt": "What is 10/2?", "expected": "5"},
    ]

    async def run_case(case: dict) -> None:
        async with observe(task_id=case["id"]):
            model = get_model("openai/gpt-4o")
            response = await model.generate([
                ChatMessageUser(content=case["prompt"])
            ])
            correct = case["expected"] in response.completion
            observe_update(
                score=1.0 if correct else 0.0, 
                success=correct
            )

    async with observe(db="./transcripts", task_set="math_eval"):
        await asyncio.gather(*[run_case(case) for case in test_cases])

asyncio.run(main())

After running, you can view the transcripts:

scout view -T ./transcripts

Or scan them with your scanners:

from inspect_scout import scan, transcripts_from
from my_scanners import math_checker

scan(
    scanners=[math_checker()],
    transcripts=transcripts_from("./transcripts")
)