Scores Timeline

Overview

The scores_timeline() function plots eval scores by model, organization, and release date¹:

from inspect_viz import Data
from inspect_viz.view import scores_timeline

evals = Data.from_file("benchmarks.parquet")
scores_timeline(evals)

Data Preparation

Above we read the data for the plot from a parquet file. This file was in turn created by:

1. Reading logs into a data frame with evals_df().
2. Using the prepare() function to add model_info(), frontier() and log_viewer() columns to the data frame.

from inspect_ai.analysis import (
    evals_df, frontier, log_viewer, model_info, prepare
)

df = evals_df("logs")
df = prepare(df, [
    model_info(),
    frontier(),
    log_viewer("eval", {"logs": "https://samples.meridianlabs.ai/"})
])
df.to_parquet("benchmarks.parquet")
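
To give a feel for what a frontier column might contain, here is a minimal standard-library sketch. It assumes that frontier() flags a model when its score beats every model released before it on the same task; that reading, the "frontier" column name, and the sample rows are assumptions made for illustration and are not taken from the inspect_ai source.

```python
from datetime import date


def mark_frontier(rows):
    """Flag each row whose score beats every earlier-released model.

    Assumes all rows belong to a single task.
    """
    best = float("-inf")
    flagged = []
    # Walk the models in release order, tracking the best score so far.
    for row in sorted(rows, key=lambda r: r["model_release_date"]):
        is_frontier = row["score_headline_value"] > best
        if is_frontier:
            best = row["score_headline_value"]
        flagged.append({**row, "frontier": is_frontier})
    return flagged


# Hypothetical rows for one task (model names and scores are made up).
rows = [
    {"model_display_name": "model-a", "model_release_date": date(2023, 3, 1), "score_headline_value": 0.41},
    {"model_display_name": "model-b", "model_release_date": date(2023, 9, 1), "score_headline_value": 0.38},
    {"model_display_name": "model-c", "model_release_date": date(2024, 2, 1), "score_headline_value": 0.57},
]
result = mark_frontier(rows)
for row in result:
    print(row["model_display_name"], row["frontier"])
```

In this sketch model-b is not flagged because model-a, released earlier, already scored higher.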

Filtering

A select() input for tasks is automatically provided if more than one task exists in the data. A checkbox_group() is automatically provided for organizations if more than one organization exists. The filters parameter controls which of these inputs are shown: for example, filters=["task"] omits the organization input and filters=False omits both.

When multiple organizations exist, clicking on the legend for an organization will filter the plot by that organization.
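
The filtering options can also be set from code. The sketch below uses only the organizations and filters parameters listed in the function reference; the helper name timeline_for is hypothetical, and it assumes that filters=["task"] keeps the task input while dropping the organization input.

```python
def timeline_for(path, organizations):
    """Build a scores timeline restricted to the given organizations.

    Hypothetical helper; only the scores_timeline() parameters come
    from the function reference.
    """
    # Imported lazily so the helper can be defined without inspect_viz installed.
    from inspect_viz import Data
    from inspect_viz.view import scores_timeline

    evals = Data.from_file(path)
    return scores_timeline(
        evals,
        # Organizations to include, in the desired presentation order.
        organizations=organizations,
        # Assumption: show only the task select input.
        filters=["task"],
    )
```

For example, timeline_for("benchmarks.parquet", ["Anthropic", "OpenAI"]) would plot only those two organizations, assuming both names appear in the model_organization_name column.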

Function Reference

Eval scores by model, organization, and release date.

def scores_timeline(
    data: Data,
    task_name: str = "task_display_name",
    model_name: str = "model_display_name",
    model_organization: str = "model_organization_name",
    model_release_date: str = "model_release_date",
    score_name: str = "score_headline_name",
    score_value: str = "score_headline_value",
    score_stderr: str = "score_headline_stderr",
    organizations: list[str] | None = None,
    filters: bool | list[Literal["task", "organization"]] = True,
    ci: float | bool | NotGiven = NOT_GIVEN,
    time_label: str = "Release Date",
    score_label: str = "Score",
    eval_label: str = "Eval",
    title: str | Title | None = None,
    marks: Marks | None = None,
    width: float | Param | None = None,
    height: float | Param | None = None,
    regression: bool = False,
    legend: Legend | NotGiven | None = NOT_GIVEN,
    **attributes: Unpack[PlotAttributes],
) -> Component

data Data: Data read using evals_df() and amended with model metadata using the model_info() prepare operation (see Data Preparation for details).
task_name str: Column for task name (defaults to “task_display_name”).
model_name str: Column for model name (defaults to “model_display_name”).
model_organization str: Column for model organization (defaults to “model_organization_name”).
model_release_date str: Column for model release date (defaults to “model_release_date”).
score_name str: Column for scorer name (defaults to “score_headline_name”).
score_value str: Column for score value (defaults to “score_headline_value”).
score_stderr str: Column for score stderr (defaults to “score_headline_stderr”).
organizations list[str] | None: List of organizations to include (in order of desired presentation).
filters bool | list[Literal['task', 'organization']]: Provide UI to filter plot by task and organization(s).
ci float | bool | NotGiven: Confidence interval (defaults to 0.95, pass False for no confidence intervals).
time_label str: Label for time (x-axis).
score_label str: Label for score (y-axis).
eval_label str: Label for eval select input.
title str | Title | None: Title for plot (str or mark created with the title() function).
marks Marks | None: Additional marks to include in the plot.
width float | Param | None: The outer width of the plot in pixels, including margins. Defaults to 700.
height float | Param | None: The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
regression bool: If True, adds a regression line to the plot (uses the confidence interval passed using ci). Defaults to False.
legend Legend | NotGiven | None: Legend to use for the plot (defaults to None, which uses the default legend).
**attributes Unpack[PlotAttributes]: Additional PlotAttributes. By default, the x_domain is set to “fixed”, the y_domain is set to [0,1.0], color_label is set to “Organizations”, and color_domain is set to organizations.
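
To make the relationship between score_value, score_stderr and ci concrete, here is a minimal sketch of how an interval could be derived from a score and its standard error. It assumes a two-sided normal approximation (score ± z × stderr); whether scores_timeline() computes its intervals exactly this way is an assumption, and the function name confidence_interval is hypothetical.

```python
from statistics import NormalDist


def confidence_interval(score, stderr, ci=0.95):
    """Return (lower, upper) bounds of a two-sided normal interval."""
    # z is the two-sided critical value, roughly 1.96 for ci=0.95.
    z = NormalDist().inv_cdf(0.5 + ci / 2)
    return score - z * stderr, score + z * stderr


# Hypothetical headline score of 0.72 with a standard error of 0.02.
lower, upper = confidence_interval(0.72, 0.02)
print(f"{lower:.3f} to {upper:.3f}")
```

Under this approximation, a larger ci value or a larger stderr widens the interval drawn around each point.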

Implementation

The Scores Timeline example demonstrates how this view was implemented using lower-level plotting components.

Footnotes

¹ This plot was inspired by and includes data from the Epoch AI Benchmarking Hub.