from inspect_viz import Data
from inspect_viz.view.beta import scores_timeline
evals = Data.from_file("benchmarks.parquet")
scores_timeline(evals)
Scores Timeline
Overview
The scores_timeline() function plots eval scores by model, organization, and release date.
Data Preparation
Above we read the data for the plot from a parquet file. This file was in turn created by:
1. Reading logs into a data frame with evals_df().
2. Using the prepare() function to add model_info(), frontier(), and log_viewer() columns to the data frame.
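from inspect_ai.analysis import (
    evals_df, frontier, log_viewer, model_info, prepare
)

# Read eval logs from the "logs" directory into a data frame
df = evals_df("logs")

# Add model metadata, frontier, and log viewer columns
df = prepare(df,
    model_info(),
    frontier(),
    log_viewer("eval", {"logs": "https://samples.meridianlabs.ai/"}),
)

# Write the prepared data for use with Data.from_file()
df.to_parquet("benchmarks.parquet")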
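Before plotting, it can help to confirm that the prepared data frame has the columns that scores_timeline() looks for by default. The following is a minimal sketch rather than part of the library: the column names are the defaults listed in the Function Reference, while the `missing_columns` helper and the sample column list are hypothetical.

```python
# Default column names, taken from the scores_timeline() signature.
EXPECTED_COLUMNS = [
    "task_display_name",
    "model_display_name",
    "model_organization_name",
    "model_release_date",
    "score_headline_name",
    "score_headline_value",
    "score_headline_stderr",
]


def missing_columns(columns):
    """Return the expected columns absent from `columns`, in signature order.

    Hypothetical helper for illustration; not an inspect_viz function.
    """
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Hypothetical column list; with pandas this would be list(df.columns).
columns = ["task_display_name", "model_display_name", "score_headline_value"]
print(missing_columns(columns))
```

If the list is non-empty, either re-run the prepare() step or pass the matching column names explicitly (for example via task_name or score_value).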
Filtering
A select() input for tasks is automatically provided if more than one task exists in the data. A checkbox_group() is automatically provided for organizations if more than one organization exists (you can disable this with organizations_filter=False).
When multiple organizations exist, clicking on the legend for an organization will filter the plot by that organization.
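The rule for when each input appears can be modeled in a few lines of plain Python. This is only an illustrative sketch of the behavior described above, not the library's implementation: `inputs_shown` is a hypothetical helper, the task and organization names are made up, and it assumes that a list passed as `filters` (see the Function Reference) restricts which inputs may be offered.

```python
def inputs_shown(rows, filters=True):
    """Return which filter inputs the rule described above would display.

    Hypothetical model of the documented behavior; not inspect_viz code.
    """
    # Assumption: True allows both inputs, False allows none,
    # and a list allows only the inputs it names.
    if filters is True:
        allowed = {"task", "organization"}
    elif filters is False:
        allowed = set()
    else:
        allowed = set(filters)

    tasks = {row["task_display_name"] for row in rows}
    organizations = {row["model_organization_name"] for row in rows}

    shown = []
    if "task" in allowed and len(tasks) > 1:
        shown.append("task")  # select() input for tasks
    if "organization" in allowed and len(organizations) > 1:
        shown.append("organization")  # checkbox_group() for organizations
    return shown


# Made-up rows: two tasks, one organization.
rows = [
    {"task_display_name": "TaskA", "model_organization_name": "OrgA"},
    {"task_display_name": "TaskB", "model_organization_name": "OrgA"},
]
print(inputs_shown(rows))  # ['task']
```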
Function Reference
Eval scores by model, organization, and release date.
def scores_timeline(
    data: Data,
    task_name: str = "task_display_name",
    model_name: str = "model_display_name",
    model_organization: str = "model_organization_name",
    model_release_date: str = "model_release_date",
    score_name: str = "score_headline_name",
    score_value: str = "score_headline_value",
    score_stderr: str = "score_headline_stderr",
    organizations: list[str] | None = None,
    filters: bool | list[Literal["task", "organization"]] = True,
    ci: float | bool = 0.95,
    time_label: str = "Release Date",
    score_label: str = "Score",
    eval_label: str = "Eval",
    title: str | Title | None = None,
    marks: Marks | None = None,
    width: float | Param | None = None,
    height: float | Param | None = None,
    regression: bool = False,
    legend: Legend | NotGiven | None = NOT_GIVEN,
    **attributes: Unpack[PlotAttributes],
) -> Component
data
Data
Data read using evals_df() and amended with model metadata using the model_info() prepare operation (see Data Preparation for details).
task_name
str
Column for task name (defaults to “task_display_name”).
model_name
str
Column for model name (defaults to “model_display_name”).
model_organization
str
Column for model organization (defaults to “model_organization_name”).
model_release_date
str
Column for model release date (defaults to “model_release_date”).
score_name
str
Column for scorer name (defaults to “score_headline_name”).
score_value
str
Column for score value (defaults to “score_headline_value”).
score_stderr
str
Column for score stderr (defaults to “score_headline_stderr”).
organizations
list[str] | None
List of organizations to include (in order of desired presentation).
filters
bool | list[Literal['task', 'organization']]
Provide UI to filter plot by task and organization(s).
ci
float | bool
Confidence interval (defaults to 0.95, pass False for no confidence intervals).
time_label
str
Label for time (x-axis).
score_label
str
Label for score (y-axis).
eval_label
str
Label for eval select input.
title
str | Title | None
Title for plot (str or mark created with the title() function).
marks
Marks | None
Additional marks to include in the plot.
width
float | Param | None
The outer width of the plot in pixels, including margins. Defaults to 700.
height
float | Param | None
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
regression
bool
If True, adds a regression line to the plot (uses the confidence interval passed using ci). Defaults to False.
legend
Legend | NotGiven | None
Legend to use for the plot (defaults to None, which uses the default legend).
**attributes
Unpack[PlotAttributes]
Additional PlotAttributes. By default, the x_domain is set to “fixed”, the y_domain is set to [0,1.0], color_label is set to “Organizations”, and color_domain is set to organizations.
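To give a sense of how the ci level relates to the score_value and score_stderr columns, here is a minimal sketch that turns a score and its standard error into an interval. It assumes a normal approximation, which is a common choice but is not confirmed by this page as the method the library uses; the `confidence_interval` helper and the sample numbers are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(score, stderr, ci=0.95):
    """Return (low, high) bounds for `score` at confidence level `ci`.

    Assumes a normal approximation; hypothetical helper, not inspect_viz code.
    """
    # Two-sided critical value, e.g. about 1.96 for ci=0.95.
    z = NormalDist().inv_cdf(0.5 + ci / 2)
    return (score - z * stderr, score + z * stderr)


# Made-up score and standard error for illustration.
low, high = confidence_interval(0.72, 0.03)
print(f"{low:.3f} to {high:.3f}")
```

Under this assumption, a lower ci value (for example 0.90) narrows the interval, and ci=False in scores_timeline() omits the interval entirely.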
Implementation
The Scores Timeline example demonstrates how this view was implemented using lower level plotting components.