Scores by Task

from inspect_viz import Data
from inspect_viz.view.beta import scores_by_task

evals = Data.from_file("evals.parquet")
scores_by_task(evals)
Overview
The scores_by_task() function renders a bar plot for comparing eval scores.
Data Preparation
Above we read the data for the plot from a parquet file. This file was in turn created by:

1. Reading logs into a data frame with evals_df().

2. Using the prepare() function to add model_info() and log_viewer() columns to the data frame.
from inspect_ai.analysis import evals_df, log_viewer, model_info, prepare

df = evals_df("logs")
df = prepare(df, [
    model_info(),
    log_viewer("eval", {"logs": "https://samples.meridianlabs.ai/"})
])
df.to_parquet("evals.parquet")

You can additionally use the task_info() operation to map lower-level task names to task display names (e.g. “gpqa_diamond” -> “GPQA Diamond”), as sketched below.
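For example, the preparation step could include task_info() as well. This is a minimal sketch: it assumes task_info() is imported from inspect_ai.analysis alongside the other operations and used with its default display-name mappings.

from inspect_ai.analysis import (
    evals_df, log_viewer, model_info, prepare, task_info
)

df = evals_df("logs")
df = prepare(df, [
    model_info(),
    task_info(),  # map task names to display names (e.g. "GPQA Diamond")
    log_viewer("eval", {"logs": "https://samples.meridianlabs.ai/"})
])
df.to_parquet("evals.parquet")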
Note that both the log viewer links and model names are optional (the plot will render without links and use raw model strings if the data isn’t prepared with log_viewer() and model_info()).
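If your data frame uses different column names than the prepared defaults, the field mappings can also be overridden explicitly. The column names in this sketch are illustrative assumptions; substitute whatever your frame actually contains.

scores_by_task(
    evals,
    model_name="model",      # raw model string column (assumed name)
    task_name="task_name",   # raw task name column (assumed name)
)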
Function Reference
Bar plot for comparing eval scores.
Summarize eval scores using a bar plot. By default, scores (y) are plotted by “task_display_name” (fx) and “model_display_name” (x), and confidence intervals are also plotted (disable this with ci=False).
def scores_by_task(
    data: Data,
    model_name: str = "model_display_name",
    task_name: str = "task_display_name",
    score_value: str = "score_headline_value",
    score_stderr: str = "score_headline_stderr",
    score_label: str | None | NotGiven = NOT_GIVEN,
    ci: bool | float = 0.95,
    title: str | Title | None = None,
    marks: Marks | None = None,
    width: float | Param | None = None,
    height: float | Param | None = None,
    legend: Legend | NotGiven | None = NOT_GIVEN,
    **attributes: Unpack[PlotAttributes],
) -> Component

data (Data)
    Evals data table. This is typically created using a data frame read with the inspect evals_df() function.

model_name (str)
    Name of field for the model name (defaults to "model_display_name").

task_name (str)
    Name of field for the task name (defaults to "task_display_name").

score_value (str)
    Name of field for the score value (defaults to "score_headline_value").

score_stderr (str)
    Name of field for the score standard error (defaults to "score_headline_stderr").

score_label (str | None | NotGiven)
    Score axis label (pass None for no label).

ci (bool | float)
    Confidence interval (e.g. 0.80, 0.90, 0.95, etc.). Defaults to 0.95.

title (str | Title | None)
    Title for the plot (a str or a mark created with the title() function).

marks (Marks | None)
    Additional marks to include in the plot.

width (float | Param | None)
    The outer width of the plot in pixels, including margins. Defaults to 700.

height (float | Param | None)
    The outer height of the plot in pixels, including margins. Defaults to width / 1.618 (the golden ratio).

legend (Legend | NotGiven | None)
    Options for the legend. Pass None to disable the legend.

**attributes (Unpack[PlotAttributes])
    Additional PlotAttributes. By default, margin_bottom is set to 10 pixels and x_ticks is set to [].
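As a usage sketch combining several of the documented options (the specific values here are illustrative choices, not defaults):

from inspect_viz import Data
from inspect_viz.view.beta import scores_by_task

evals = Data.from_file("evals.parquet")

scores_by_task(
    evals,
    ci=0.90,                      # 90% confidence intervals
    score_label="Accuracy",       # custom score axis label
    title="Eval Scores by Task",  # plain string title
    legend=None,                  # disable the legend
    width=600,                    # narrower than the 700 pixel default
)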
Implementation
The Scores by Task example demonstrates how this view was implemented using lower-level plotting components.