Scores Timeline
Overview
The scores_timeline() function plots eval scores by model, organization, and release date:

from inspect_viz import Data
from inspect_viz.view.beta import scores_timeline

evals = Data.from_file("benchmarks.parquet")
scores_timeline(evals)

Data Preparation
Above we read the data for the plot from a parquet file. This file was in turn created by:
1. Reading logs into a data frame with evals_df().
2. Using the prepare() function to add model_info(), frontier() and log_viewer() columns to the data frame.

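from inspect_ai.analysis import (
    evals_df, frontier, log_viewer, model_info, prepare
)

# Read the eval logs at "logs" into a data frame
df = evals_df("logs")

# Add model_info(), frontier() and log_viewer() columns
df = prepare(df, [
    model_info(),
    frontier(),
    log_viewer("eval", {"logs": "https://samples.meridianlabs.ai/"})
])

# Write the parquet file that the plot reads
df.to_parquet("benchmarks.parquet")

Filtering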
A select() input for tasks is automatically provided if more than one task exists in the data. A checkbox_group() is automatically provided for organizations if more than one organization exists (you can control which inputs appear with the filters parameter, for example filters=["task"] to keep only the task input, or filters=False to disable both).
When multiple organizations exist, clicking on the legend for an organization will filter the plot by that organization.
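
The rule described above (an input only appears when its column holds more than one distinct value, and only if it is requested through the filters parameter) can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: visible_filters() is a hypothetical helper, and representing the data as a list of dicts is an assumption made to keep the example self-contained. The default column names are taken from the function signature in the reference below.

```python
from typing import Dict, List, Sequence, Union


def visible_filters(
    rows: Sequence[Dict[str, str]],
    filters: Union[bool, List[str]] = True,
    task_name: str = "task_display_name",
    model_organization: str = "model_organization_name",
) -> List[str]:
    """Hypothetical helper: which filter inputs the documented rule would show."""
    # Normalise `filters` to an explicit list of requested inputs.
    if filters is True:
        requested = ["task", "organization"]
    elif filters is False:
        requested = []
    else:
        requested = list(filters)

    # Map each filter name to the column it would filter on.
    columns = {"task": task_name, "organization": model_organization}

    shown = []
    for name in requested:
        # An input is only shown when its column has more than one distinct value.
        distinct = {row[columns[name]] for row in rows}
        if len(distinct) > 1:
            shown.append(name)
    return shown


# Made-up rows for illustration: one task, two organizations.
rows = [
    {"task_display_name": "Task A", "model_organization_name": "Org 1"},
    {"task_display_name": "Task A", "model_organization_name": "Org 2"},
]
shown_by_default = visible_filters(rows)
shown_task_only = visible_filters(rows, filters=["task"])
```

With a single task in the data, only the organization input qualifies by default, and requesting only the task input leaves nothing to show.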
Function Reference
Eval scores by model, organization, and release date.
def scores_timeline(
    data: Data,
    task_name: str = "task_display_name",
    model_name: str = "model_display_name",
    model_organization: str = "model_organization_name",
    model_release_date: str = "model_release_date",
    score_name: str = "score_headline_name",
    score_value: str = "score_headline_value",
    score_stderr: str = "score_headline_stderr",
    organizations: list[str] | None = None,
    filters: bool | list[Literal["task", "organization"]] = True,
    ci: float | bool | NotGiven = NOT_GIVEN,
    time_label: str = "Release Date",
    score_label: str = "Score",
    eval_label: str = "Eval",
    title: str | Title | None = None,
    marks: Marks | None = None,
    width: float | Param | None = None,
    height: float | Param | None = None,
    regression: bool = False,
    legend: Legend | NotGiven | None = NOT_GIVEN,
    **attributes: Unpack[PlotAttributes],
) -> Component

data: Data
Data read using evals_df() and amended with model metadata using the model_info() prepare operation (see Data Preparation for details).
task_name: str
Column for task name (defaults to “task_display_name”).
model_name: str
Column for model name (defaults to “model_display_name”).
model_organization: str
Column for model organization (defaults to “model_organization_name”).
model_release_date: str
Column for model release date (defaults to “model_release_date”).
score_name: str
Column for scorer name (defaults to “score_headline_name”).
score_value: str
Column for score value (defaults to “score_headline_value”).
score_stderr: str
Column for score stderr (defaults to “score_headline_stderr”).
organizations: list[str] | None
List of organizations to include (in order of desired presentation).
filters: bool | list[Literal['task', 'organization']]
Provide UI to filter plot by task and organization(s).
ci: float | bool | NotGiven
Confidence interval (defaults to 0.95, pass False for no confidence intervals).
time_label: str
Label for time (x-axis).
score_label: str
Label for score (y-axis).
eval_label: str
Label for eval select input.
title: str | Title | None
Title for plot (str or mark created with the title() function).
marks: Marks | None
Additional marks to include in the plot.
width: float | Param | None
The outer width of the plot in pixels, including margins. Defaults to 700.
height: float | Param | None
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
regression: bool
If True, adds a regression line to the plot (uses the confidence interval passed using ci). Defaults to False.
legend: Legend | NotGiven | None
Legend to use for the plot (defaults to None, which uses the default legend).
**attributes: Unpack[PlotAttributes]
Additional PlotAttributes. By default, the x_domain is set to “fixed”, the y_domain is set to [0,1.0], color_label is set to “Organizations”, and color_domain is set to organizations.
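
To make the relationship between the score_stderr column and the ci parameter concrete, here is a minimal sketch of how a confidence level and a standard error are commonly turned into an interval. It assumes a normal approximation, which this page does not confirm is what the view uses internally. confidence_interval() is a hypothetical helper and the score and stderr values are made up for illustration.

```python
from statistics import NormalDist
from typing import Tuple


def confidence_interval(
    value: float, stderr: float, ci: float = 0.95
) -> Tuple[float, float]:
    """Hypothetical helper: normal-approximation interval around a score."""
    if not 0.0 < ci < 1.0:
        raise ValueError("ci must be strictly between 0 and 1")
    # Two-sided critical value, e.g. about 1.96 for ci=0.95.
    z = NormalDist().inv_cdf(0.5 + ci / 2)
    return value - z * stderr, value + z * stderr


# Made-up example: a score of 0.80 with a standard error of 0.02.
lower, upper = confidence_interval(0.80, 0.02)
```

Under this assumption a wider confidence level (for example ci=0.99) gives a larger critical value and therefore a wider interval around the same score.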
Implementation
The Scores Timeline example demonstrates how this view was implemented using lower-level plotting components.