Scores Radar By Task

Dataset: radar_by_task.parquet

This example illustrates the code behind the scores_radar_by_task() pre‑built view function. If you want to include this plot in your notebooks or sites, start with that function rather than the lower‑level code below.

scores_radar_by_task() is useful for comparing headline metrics from different tasks across multiple models. The data preparation function scales values for visualization by normalizing them (for example, using percentile ranks); the raw values are displayed in the tooltips.
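
If you just want the chart, the pre‑built view is typically a one-liner. Here is a minimal sketch, assuming scores_radar_by_task() is exported from inspect_viz.view.beta (alongside scores_radar_by_task_df()) and takes the prepared data as its first argument; check the reference documentation for the exact signature.

from inspect_viz._core.data import Data
from inspect_viz.view.beta import scores_radar_by_task

# read the prepared radar data (see "Data Preparation" below)
data = Data.from_file("radar_by_task.parquet")

# render the radar chart with the pre-built view
scores_radar_by_task(data)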

Code
from inspect_viz._core.data import Data
from inspect_viz._core.selection import Selection
from inspect_viz.mark import circle, line, text
from inspect_viz.plot import plot
from inspect_viz.plot._legend import legend
from inspect_viz.view.beta import LabelStyles
from inspect_viz.view.beta._scores_radar import (
    axes_coordinates,
    grid_circles_coordinates,
    labels_coordinates,
)


data = Data.from_file("radar_by_task.parquet")

channels = {
    "Model": "model_display_name",
    "Metric": "metric",
    "Score": "value",
    "Log viewer": "log_viewer",
    "Task": "task_display_name",
}

tasks = data.column_unique("task_display_name")
axes = axes_coordinates(num_axes=len(tasks))
grid_circles = grid_circles_coordinates()
labels = labels_coordinates(labels=tasks)

# enable interactive highlighting of a chosen model
model_selection = Selection.single()

elements = [
    *[
        line(
            x=data["x"],
            y=data["y"],
            stroke="#e0e0e0",
        )
        for data in grid_circles
    ],
    line(
        x=axes["x"],
        y=axes["y"],
        stroke="#ddd",
    ),
    line(
        data,
        x="x",
        y="y",
        stroke="model_display_name",
        filter_by=model_selection,
        tip=True,
        channels=channels,
    ),
    line(
        data,
        x="x",
        y="y",
        stroke="model_display_name",
        stroke_opacity=0.4,
        tip=False,
    ),
    circle(
        data,
        x="x",
        y="y",
        r=4,
        fill="model_display_name",
        stroke="white",
        filter_by=model_selection,
        tip=False,
    ),
    # axis labels
    *[
        text(
            x=label["x"],
            y=label["y"],
            text=label["label"],
            frame_anchor=label["frame_anchor"],
            styles=LabelStyles(line_width=8),
        )
        for label in labels
    ],
]

plot(
    elements,
    margin=60,
    x_axis=False,
    y_axis=False,
    width=400,
    height=400,
    legend=legend("color", target=model_selection),
)
1. Load data from a Parquet file into an inspect_viz.Data table.
2. Channels provide readable names for tooltips and the log viewer.
3. Coordinates: compute coordinates for axes, grid circles, and labels (see the illustrative sketch after this list).
4. Selection enables interactive hovering/clicking to emphasize a single model.
5. Grid lines: a line() mark draws the grid circles.
6. Axis spokes: a line() mark draws the axes.
7. Polygon outlines: a line() mark draws the polygon outlines.
8. Polygon vertex markers: a circle() mark draws the polygon vertex markers.
9. Axis labels: a text() mark draws the axis labels.
10. Layout: draw the plot with no axes, since the axes are arbitrary scales in a radar chart.
11. Legend: draw a legend for the model selection.
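
The coordinate helpers used above (axes_coordinates(), grid_circles_coordinates(), labels_coordinates()) come from an internal module, so their exact output may differ, but the geometry they encode is the standard radar layout: one spoke per task, evenly spaced around a circle. A rough illustrative sketch of that idea (not the library's implementation):

import math

def radar_axes(num_axes: int, radius: float = 1.0) -> list[dict[str, float]]:
    # one spoke per axis, starting at 12 o'clock and proceeding clockwise
    coords = []
    for i in range(num_axes):
        angle = math.pi / 2 - 2 * math.pi * i / num_axes
        coords.append({"x": radius * math.cos(angle), "y": radius * math.sin(angle)})
    return coords

# five tasks -> five evenly spaced spokes on the unit circle
print(radar_axes(num_axes=5))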

Data Preparation

The dataset for this example was created using the scores_radar_by_task_df() function, which reads evals metadata, scales scores using percentile ranks or min-max normalization, and computes coordinates for the radar chart.

Above we read the data for the plot from a parquet file. This file was in turn created by:

  1. Reading evals-level data into a dataframe with evals_df().

  2. Converting the evals dataframe into a dataframe specific to scores_radar_by_task() using the scores_radar_by_task_df() function. The output of scores_radar_by_task_df() can be passed directly to scores_radar_by_task(). scores_radar_by_task_df() accepts an optional list of metric names to invert (for metrics where lower scores are better), an optional list of model names, an optional list of task names, an optional normalization method for scaling scores, and an optional min-max domain to use for normalization on the radar chart.

  3. Using the prepare() function to add model_info(), task_info(), and log_viewer() columns to the dataframe.

Here is the data preparation code end-to-end:

from inspect_ai.analysis import (
    evals_df,
    log_viewer,
    model_info,
    prepare,
    task_info,
)
from inspect_viz.view.beta import scores_radar_by_task_df


df = evals_df([
    "logs/aime",
    "logs/cybench",
    "logs/gpqa",
    "logs/mmlu-pro",
    "logs/swe-bench",
])

df = scores_radar_by_task_df(
    df,
    models=[
        "openai/o3",
        "anthropic/claude-3-7-sonnet-latest",
    ],
    normalization="min_max",
    domain=(0, 1),
)

df = prepare(df, [
    model_info(),
    log_viewer("eval", { "logs": "https://samples.meridianlabs.ai/" })
    task_info(task_name_mapping={
        "aime2024": "AIME 2024",
        "cybench": "CyBench",
        "gpqa_diamond": "GPQA Diamond",
        "mmlu_pro": "MMLU Pro",
        "swe_bench": "SWE Bench",
    }),
])

df.to_parquet("radar_by_task.parquet")
1. Read the evals data into a dataframe.
2. Convert the dataframe into a dataframe specific to scores_radar_by_task().
3. Filter to specific models to plot on the radar chart. Each task in the data should have the same set of models.
4. Choose an optional normalization method to scale the raw scores. Available options: "percentile" (computes percentile rank, useful for identifying consistently strong performers), "min_max" (scales scores between min and max values, sensitive to outliers), or "absolute" (the default: no normalization, which may result in incomprehensible charts if metrics have different scales). See the illustrative sketch after this list.
5. Specify an optional domain when using min-max normalization. If unspecified, min and max values are inferred from the data.
6. Add pretty model names and log links to the dataframe using prepare().
7. Provide an optional task name mapping for pretty task names in prepare().
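
To make the normalization options concrete, here is a small sketch of how percentile-rank and min-max scaling behave on a hypothetical dataframe. It mirrors the descriptions above rather than the exact implementation of scores_radar_by_task_df().

import pandas as pd

# hypothetical input: one row per (task, model) with a raw score
scores = pd.DataFrame({
    "task": ["aime", "aime", "gpqa", "gpqa"],
    "model": ["model-a", "model-b", "model-a", "model-b"],
    "value": [0.83, 0.61, 0.71, 0.78],
})

# "percentile": rank each score within its task, scaled into (0, 1]
scores["percentile"] = scores.groupby("task")["value"].rank(pct=True)

# "min_max": scale each task's scores into [0, 1]; sensitive to outliers
grouped = scores.groupby("task")["value"]
scores["min_max"] = (scores["value"] - grouped.transform("min")) / (
    grouped.transform("max") - grouped.transform("min")
)

print(scores)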