inspect_viz.view.beta
View functions are currently in beta and are exported from the inspect_viz.view.beta module. The beta module will be preserved after the final release, so code written against it now will continue to work after the beta.
Scores
scores_by_task
Bar plot for comparing eval scores.
Summarize eval scores using a bar plot. By default, scores (y) are plotted by “task_display_name” (fx) and “model_display_name” (x). Confidence intervals are also plotted by default (disable this with ci=False).
def scores_by_task(
data: Data,
model_name: str = "model_display_name",
task_name: str = "task_display_name",
score_value: str = "score_headline_value",
score_stderr: str = "score_headline_stderr",
score_label: str | None | NotGiven = NOT_GIVEN,
ci: bool | float = 0.95,
title: str | Title | None = None,
marks: Marks | None = None,
width: float | Param | None = None,
height: float | Param | None = None,
legend: Legend | NotGiven | None = NOT_GIVEN,
**attributes: Unpack[PlotAttributes],
) -> Component
data (Data)
Evals data table. This is typically created using a data frame read with the inspect evals_df() function.
model_name (str)
Name of field for the model name (defaults to “model_display_name”).
task_name (str)
Name of field for the task name (defaults to “task_display_name”).
score_value (str)
Name of field for the score value (defaults to “score_headline_value”).
score_stderr (str)
Name of field for stderr (defaults to “score_headline_stderr”).
score_label (str | None | NotGiven)
Score axis label (pass None for no label).
ci (bool | float)
Confidence interval (e.g. 0.80, 0.90, 0.95, etc.). Defaults to 0.95.
title (str | Title | None)
Title for plot (str or mark created with the title() function).
marks (Marks | None)
Additional marks to include in the plot.
width (float | Param | None)
The outer width of the plot in pixels, including margins. Defaults to 700.
height (float | Param | None)
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
legend (Legend | NotGiven | None)
Options for the legend. Pass None to disable the legend.
**attributes (Unpack[PlotAttributes])
Additional PlotAttributes. By default, margin_bottom is set to 10 pixels and x_ticks is set to [].
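The reference does not say how ci and the score_stderr column are combined. A common convention, assumed here and not confirmed against the library, is a normal-approximation interval: the score plus or minus a z-multiplier times the standard error. The helper name confidence_bounds is hypothetical.

```python
from statistics import NormalDist


def confidence_bounds(score: float, stderr: float, ci: float = 0.95) -> tuple:
    """Return (lower, upper) for a two-sided normal-approximation interval.

    Assumption: the plotted interval is score +/- z * stderr. This is a
    common convention, not something stated in the reference above.
    """
    if not 0.0 < ci < 1.0:
        raise ValueError("ci must be strictly between 0 and 1")
    # Two-sided interval: (1 - ci) / 2 of the probability mass sits in each tail.
    z = NormalDist().inv_cdf(0.5 + ci / 2.0)
    return score - z * stderr, score + z * stderr


# Illustrative values only.
lower, upper = confidence_bounds(score=0.72, stderr=0.03, ci=0.95)
print(f"{lower:.3f} .. {upper:.3f}")
```

Under this assumption, a smaller ci such as 0.80 gives a narrower band around the same score.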
scores_by_factor
Summarize eval scores with a factor of variation (e.g. ‘No hint’ vs. ‘Hint’).
def scores_by_factor(
data: Data,
factor: str,
factor_labels: tuple[str, str],
score_value: str = "score_headline_value",
score_stderr: str = "score_headline_stderr",
score_label: str = "Score",
model: str = "model",
model_label: str = "Model",
ci: bool | float = 0.95,
color: str | tuple[str, str] = "#3266ae",
title: str | Mark | None = None,
marks: Marks | None = None,
width: float | Param | None = None,
height: float | Param | None = None,
legend: Legend | NotGiven | None = NOT_GIVEN,
**attributes: Unpack[PlotAttributes],
) -> Component
data (Data)
Evals data table. This is typically created using a data frame read with the inspect evals_df() function.
factor (str)
Field with factor of variation (should be of type boolean).
factor_labels (tuple[str, str])
Tuple of labels for factor of variation. The False value should be first, e.g. ("No hint", "Hint").
score_value (str)
Name of field for x (scoring) axis (defaults to “score_headline_value”).
score_stderr (str)
Name of field for scoring stderr (defaults to “score_headline_stderr”).
score_label (str)
Label for x-axis (defaults to “Score”).
model (str)
Name of field for y axis (defaults to “model”).
model_label (str)
Label for y axis (defaults to “Model”).
ci (bool | float)
Confidence interval (e.g. 0.80, 0.90, 0.95, etc.). Defaults to 0.95.
color (str | tuple[str, str])
Hex color value (or tuple of two values). If one value is provided, the second is computed by lightening the main color.
title (str | Mark | None)
Title for plot (str or mark created with the title() function).
marks (Marks | None)
Additional marks to include in the plot.
width (float | Param | None)
The outer width of the plot in pixels, including margins. Defaults to 700.
height (float | Param | None)
The outer height of the plot in pixels, including margins. Defaults to 65 pixels for each item on the “y” axis.
legend (Legend | NotGiven | None)
Options for the legend. Pass None to disable the legend.
**attributes (Unpack[PlotAttributes])
Additional PlotAttributes.
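The color parameter accepts a single hex value and derives a second, lighter one. The exact lightening formula is not given in this reference. The sketch below assumes a simple linear blend toward white; the function name lighten and the blend amount are hypothetical.

```python
def lighten(hex_color: str, amount: float = 0.4) -> str:
    """Blend a #rrggbb color toward white (0 = unchanged, 1 = white).

    Assumption: a linear blend per channel. The library may use a
    different formula.
    """
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must be between 0 and 1")
    digits = hex_color.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected a color of the form #rrggbb")
    # Split into red, green, and blue channel values (0-255).
    channels = [int(digits[i:i + 2], 16) for i in (0, 2, 4)]
    blended = [round(c + (255 - c) * amount) for c in channels]
    return "#" + "".join(f"{c:02x}" for c in blended)


primary = "#3266ae"  # default `color` from the signature above
pair = (primary, lighten(primary))  # illustrative two-color tuple
print(pair)
```

Passing an explicit tuple of two hex values to color avoids relying on the derived shade.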
scores_timeline
Eval scores by model, organization, and release date.
def scores_timeline(
data: Data,
task_name: str = "task_display_name",
model_name: str = "model_display_name",
model_organization: str = "model_organization_name",
model_release_date: str = "model_release_date",
score_name: str = "score_headline_name",
score_value: str = "score_headline_value",
score_stderr: str = "score_headline_stderr",
organizations: list[str] | None = None,
filters: bool | list[Literal["task", "organization"]] = True,
ci: float | bool | NotGiven = NOT_GIVEN,
time_label: str = "Release Date",
score_label: str = "Score",
eval_label: str = "Eval",
title: str | Title | None = None,
marks: Marks | None = None,
width: float | Param | None = None,
height: float | Param | None = None,
regression: bool = False,
legend: Legend | NotGiven | None = NOT_GIVEN,
**attributes: Unpack[PlotAttributes],
) -> Component
data (Data)
Data read using evals_df() and amended with model metadata using the model_info() prepare operation (see Data Preparation for details).
task_name (str)
Column for task name (defaults to “task_display_name”).
model_name (str)
Column for model name (defaults to “model_display_name”).
model_organization (str)
Column for model organization (defaults to “model_organization_name”).
model_release_date (str)
Column for model release date (defaults to “model_release_date”).
score_name (str)
Column for scorer name (defaults to “score_headline_name”).
score_value (str)
Column for score value (defaults to “score_headline_value”).
score_stderr (str)
Column for score stderr (defaults to “score_headline_stderr”).
organizations (list[str] | None)
List of organizations to include (in order of desired presentation).
filters (bool | list[Literal['task', 'organization']])
Provide UI to filter plot by task and organization(s).
ci (float | bool | NotGiven)
Confidence interval (defaults to 0.95, pass False for no confidence intervals).
time_label (str)
Label for time (x-axis).
score_label (str)
Label for score (y-axis).
eval_label (str)
Label for eval select input.
title (str | Title | None)
Title for plot (str or mark created with the title() function).
marks (Marks | None)
Additional marks to include in the plot.
width (float | Param | None)
The outer width of the plot in pixels, including margins. Defaults to 700.
height (float | Param | None)
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
regression (bool)
If True, adds a regression line to the plot (uses the confidence interval passed using ci). Defaults to False.
legend (Legend | NotGiven | None)
Legend to use for the plot (defaults to None, which uses the default legend).
**attributes (Unpack[PlotAttributes])
Additional PlotAttributes. By default, x_domain is set to “fixed”, y_domain is set to [0, 1.0], color_label is set to “Organizations”, and color_domain is set to organizations.
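The regression option adds a trend line of score against release date. The reference does not state which kind of fit is used. The sketch below assumes an ordinary least-squares straight line, purely to illustrate what such a line summarizes. The function name fit_line and the data points are hypothetical.

```python
from datetime import date


def fit_line(points):
    """Least-squares fit of score against release date.

    `points` is a list of (date, score) pairs. Returns
    (slope per day, intercept at the earliest date).
    Assumption: a linear fit; the library's method is not documented here.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    origin = min(d for d, _ in points)
    xs = [(d - origin).days for d, _ in points]  # days since earliest release
    ys = [score for _, score in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all release dates are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (release date, score) pairs, for illustration only.
releases = [
    (date(2024, 1, 1), 0.40),
    (date(2024, 4, 10), 0.50),
    (date(2024, 7, 19), 0.60),
]
slope, intercept = fit_line(releases)
print(f"score gain per 100 days: {slope * 100:.3f}")
```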
scores_by_model
Bar plot for comparing the scores of different models on a single evaluation.
Summarize eval scores using a bar plot. By default, scores are plotted by “model_display_name”. Confidence intervals are also plotted by default (the confidence level is set with ci).
def scores_by_model(
data: Data,
*,
model_name: str = "model_display_name",
score_value: str = "score_headline_value",
score_stderr: str = "score_headline_stderr",
ci: float = 0.95,
sort: Literal["asc", "desc"] | None = None,
score_label: str | None | NotGiven = None,
model_label: str | None | NotGiven = None,
color: str | None = None,
title: str | Title | None = None,
marks: Marks | None = None,
width: float | None = None,
height: float | None = None,
legend: Legend | NotGiven | None = NOT_GIVEN,
**attributes: Unpack[PlotAttributes],
) -> Component
data (Data)
Evals data table. This is typically created using a data frame read with the inspect evals_df() function.
model_name (str)
Column containing the model name (defaults to “model_display_name”).
score_value (str)
Column containing the score value (defaults to “score_headline_value”).
score_stderr (str)
Column containing the score standard error (defaults to “score_headline_stderr”).
ci (float)
Confidence interval (e.g. 0.80, 0.90, 0.95, etc.). Defaults to 0.95.
sort (Literal['asc', 'desc'] | None)
Sort order for the bars (sorts using the ‘x’ value). Can be “asc” or “desc”. Defaults to “asc”.
score_label (str | None | NotGiven)
Label for the score axis (defaults to None).
model_label (str | None | NotGiven)
Label for the model axis (defaults to None).
color (str | None)
The color for the bars. Defaults to “#416AD0”. Pass any valid hex color value.
title (str | Title | None)
Title for plot (str or mark created with the title() function).
marks (Marks | None)
Additional marks to include in the plot.
width (float | None)
The outer width of the plot in pixels, including margins. Defaults to 700.
height (float | None)
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
legend (Legend | NotGiven | None)
Options for the legend. Pass None to disable the legend.
**attributes (Unpack[PlotAttributes])
Additional PlotAttributes. By default, y_inset_top and margin_bottom are set to 10 pixels and x_ticks is set to [].
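Several functions on this page document the same sizing rule: width defaults to 700 pixels and height defaults to width / 1.618. The arithmetic can be sketched as follows. The helper name resolve_size is hypothetical and is not part of the library.

```python
from typing import Optional, Tuple

GOLDEN_RATIO = 1.618   # divisor quoted in the parameter descriptions
DEFAULT_WIDTH = 700.0  # default outer width in pixels


def resolve_size(width: Optional[float] = None,
                 height: Optional[float] = None) -> Tuple[float, float]:
    """Apply the documented defaults to an optional width and height."""
    resolved_width = DEFAULT_WIDTH if width is None else float(width)
    # An explicit height wins; otherwise derive it from the resolved width.
    resolved_height = (
        resolved_width / GOLDEN_RATIO if height is None else float(height)
    )
    return resolved_width, resolved_height


w, h = resolve_size()
print(w, round(h))
```

With no arguments this resolves to a plot roughly 700 by 433 pixels.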
scores_heatmap
Creates a heatmap plot of success rate of eval data.
def scores_heatmap(
data: Data,
task_name: str = "task_display_name",
task_label: str | None | NotGiven = None,
model_name: str = "model_display_name",
model_label: str | None | NotGiven = None,
score_value: str = "score_headline_value",
cell: CellOptions | None = None,
tip: bool = True,
title: str | Title | None = None,
marks: Marks | None = None,
height: float | None = None,
width: float | None = None,
legend: Legend | NotGiven | None = NOT_GIVEN,
sort: Literal["ascending", "descending"] | SortOrder | None = "ascending",
orientation: Literal["horizontal", "vertical"] = "horizontal",
**attributes: Unpack[PlotAttributes],
) -> Component
data (Data)
Evals data table.
task_name (str)
Name of column to use for columns.
task_label (str | None | NotGiven)
x-axis label (defaults to None).
model_name (str)
Name of column to use for rows.
model_label (str | None | NotGiven)
y-axis label (defaults to None).
score_value (str)
Name of the column to use as values to determine cell color.
cell (CellOptions | None)
Options for the cell marks.
tip (bool)
Whether to show a tooltip with the value when hovering over a cell (defaults to True).
title (str | Title | None)
Title for plot (str or mark created with the title() function).
marks (Marks | None)
Additional marks to include in the plot.
height (float | None)
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
width (float | None)
The outer width of the plot in pixels, including margins. Defaults to 700.
legend (Legend | NotGiven | None)
Options for the legend. Pass None to disable the legend.
sort (Literal['ascending', 'descending'] | SortOrder | None)
Sort order for the x and y axes. If ascending, the highest values will be sorted to the top right. If descending, the highest values will appear in the bottom left. If None, no sorting is applied. If a SortOrder is provided, it will be used to sort the x and y axes.
orientation (Literal['horizontal', 'vertical'])
The orientation of the heatmap. If “horizontal”, the tasks will be on the x-axis and models on the y-axis. If “vertical”, the tasks will be on the y-axis and models on the x-axis.
**attributes (Unpack[PlotAttributes])
Additional PlotAttributes.
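The heatmap reads long-form rows (one row per task and model) and lays them out as a grid, with orientation deciding which field runs along the x-axis. The sketch below shows that layout on plain dictionaries. It is an illustration of the data shape, not the library's implementation, and the helper name pivot_scores and the sample rows are hypothetical.

```python
def pivot_scores(rows,
                 task_name="task_display_name",
                 model_name="model_display_name",
                 score_value="score_headline_value",
                 orientation="horizontal"):
    """Return (x_labels, y_labels, grid), where grid[i][j] is the score
    for y_labels[i] and x_labels[j], or None when no row exists."""
    if orientation not in ("horizontal", "vertical"):
        raise ValueError("orientation must be 'horizontal' or 'vertical'")
    tasks, models, cells = [], [], {}
    for row in rows:
        task, model = row[task_name], row[model_name]
        # Keep labels in first-seen order (no sorting applied here).
        if task not in tasks:
            tasks.append(task)
        if model not in models:
            models.append(model)
        cells[(task, model)] = row[score_value]
    if orientation == "horizontal":
        # Tasks on the x-axis, models on the y-axis.
        grid = [[cells.get((t, m)) for t in tasks] for m in models]
        return tasks, models, grid
    # Vertical: models on the x-axis, tasks on the y-axis.
    grid = [[cells.get((t, m)) for m in models] for t in tasks]
    return models, tasks, grid


# Hypothetical rows using the default column names from the signature.
rows = [
    {"task_display_name": "gpqa", "model_display_name": "A", "score_headline_value": 0.5},
    {"task_display_name": "gpqa", "model_display_name": "B", "score_headline_value": 0.7},
    {"task_display_name": "math", "model_display_name": "A", "score_headline_value": 0.3},
]
print(pivot_scores(rows))
```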
scores_by_limit
Visualizes success rate as a function of a resource limit (time, tokens).
Model success rate is plotted as a function of the time, tokens, or other resource limit.
def scores_by_limit(
data: Data,
model: str = "model_display_name",
success: str = "success_rate",
stderr: str | None = "standard_error",
facet: str | None = None,
other_termination_rate: str | bool = False,
limit: str | None = None,
limit_label: str | NotGiven = NOT_GIVEN,
scale: Literal["log", "linear", "auto"] = "auto",
title: str | Title | None = None,
marks: Marks | None = None,
height: float | None = None,
width: float | None = None,
legend: Legend | NotGiven | None = NOT_GIVEN,
ci: float = 0.95,
**attributes: Unpack[PlotAttributes],
) -> Component
data (Data)
A dataframe prepared using the scores_by_limit_df function.
model (str)
Name of field holding the model (defaults to “model_display_name”).
success (str)
Name of field containing the success rate (defaults to “success_rate”).
stderr (str | None)
Name of field containing the standard_error (defaults to “standard_error”).
facet (str | None)
Name of field to use for faceting (defaults to None).
other_termination_rate (str | bool)
Name of field containing the other termination rate (defaults to “other_termination_rate”).
limit (str | None)
Name of field for x axis (by default, will detect limit type using the columns present in the data frame).
limit_label (str | NotGiven)
The limit label (by default, will select limit label using the columns present in the data frame). Pass None for no label.
scale (Literal['log', 'linear', 'auto'])
The scale type for the limit axis. If ‘auto’, will use log scale if the range is 2 or more orders of magnitude (defaults to ‘auto’).
title (str | Title | None)
Title for plot (str or mark created with the title() function).
marks (Marks | None)
Additional marks to include in the plot.
height (float | None)
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
width (float | None)
The outer width of the plot in pixels, including margins. Defaults to 700.
legend (Legend | NotGiven | None)
Options for the legend. Pass None to disable the legend.
ci (float)
Confidence interval (e.g. 0.80, 0.90, 0.95, etc.). Defaults to 0.95.
**attributes (Unpack[PlotAttributes])
Additional PlotAttributes.
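The ‘auto’ scale rule (log when the range covers two or more orders of magnitude) can be sketched as below. How the library measures the range is not stated. This sketch assumes the ratio of the largest to the smallest value, and falls back to linear when any value is zero or negative because a log axis cannot show them. The helper name choose_scale is hypothetical.

```python
import math


def choose_scale(values, scale="auto"):
    """Resolve 'auto' to 'log' or 'linear' for a sequence of limit values."""
    if scale not in ("log", "linear", "auto"):
        raise ValueError("scale must be 'log', 'linear', or 'auto'")
    if scale != "auto":
        return scale  # explicit choice is passed through unchanged
    values = list(values)
    if not values or min(values) <= 0:
        return "linear"  # assumption: no log axis for empty or non-positive data
    # Number of orders of magnitude between the smallest and largest value.
    orders = math.log10(max(values) / min(values))
    return "log" if orders >= 2 else "linear"


print(choose_scale([10, 250, 5000]))
print(choose_scale([10, 250, 500]))
```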
scores_by_limit_df
Prepares a dataframe for plotting success rate as a function of a resource limit (time, tokens).
def scores_by_limit_df(
df: pd.DataFrame,
score: str,
limit: Literal["total_tokens", "total_time", "working_time"] = "total_tokens",
scale: Literal["log", "linear", "auto"] = "auto",
steps: int = 100,
) -> pd.DataFrame
df (pd.DataFrame)
A dataframe containing sample summaries and eval information.
score (str)
Name of field containing the score (0 = fail, 1 = success).
limit (Literal['total_tokens', 'total_time', 'working_time'])
The resource limit to use (one of ‘total_tokens’, ‘total_time’, ‘working_time’). Defaults to ‘total_tokens’.
scale (Literal['log', 'linear', 'auto'])
The scale type for the limit axis. If ‘auto’, will use log scale if the range is 2 or more orders of magnitude (defaults to ‘auto’).
steps (int)
The number of points to use when sampling the limit range (defaults to 100).
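To make the idea of success rate as a function of a limit concrete, here is a minimal sketch on plain Python data. It rests on two assumptions that this reference does not confirm: a sample counts as a success at a given limit only if it succeeded and used no more than that amount of the resource, and the standard error is the binomial sqrt(p(1 - p) / n). The function name success_by_limit is hypothetical. The output keys reuse the default column names that scores_by_limit expects.

```python
import math


def success_by_limit(samples, steps=100):
    """Sample the limit range linearly and compute a success rate per step.

    `samples` is a list of (usage, score) pairs with score 0 or 1.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if not samples:
        raise ValueError("samples must not be empty")
    usages = [usage for usage, _ in samples]
    low, high = min(usages), max(usages)
    n = len(samples)
    rows = []
    for i in range(steps):
        limit = low + (high - low) * i / (steps - 1)
        # Successes that would still fit under this hypothetical limit.
        solved = sum(1 for usage, score in samples if score == 1 and usage <= limit)
        rate = solved / n
        rows.append({
            "limit": limit,
            "success_rate": rate,
            "standard_error": math.sqrt(rate * (1 - rate) / n),
        })
    return rows


# Hypothetical (total_tokens, score) pairs, for illustration only.
samples = [(100, 1), (200, 0), (300, 1), (400, 1)]
for row in success_by_limit(samples, steps=4):
    print(row["limit"], row["success_rate"])
```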
Samples
sample_tool_calls
Heat map visualising tool calls over evaluation turns.
def sample_tool_calls(
data: Data,
x: str = "order",
y: str = "id",
tool: str = "tool_call_function",
limit: str = "limit",
tools: list[str] | None = None,
x_label: str | None = "Message",
y_label: str | None = "Sample",
title: str | Title | None = None,
marks: Marks | None = None,
width: float | None = None,
height: float | None = None,
legend: Legend | NotGiven | None = NOT_GIVEN,
**attributes: Unpack[PlotAttributes],
) -> Component
data (Data)
Messages data table. This is typically created using a data frame read with the inspect messages_df() function.
x (str)
Name of field for x axis (defaults to “order”).
y (str)
Name of field for y axis (defaults to “id”).
tool (str)
Name of field with tool name (defaults to “tool_call_function”).
limit (str)
Name of field with sample limit (defaults to “limit”).
tools (list[str] | None)
Tools to include in plot (and order to include them). Defaults to all tools found in data.
x_label (str | None)
x-axis label (defaults to “Message”).
y_label (str | None)
y-axis label (defaults to “Sample”).
title (str | Title | None)
Title for plot (str or mark created with the title() function).
marks (Marks | None)
Additional marks to include in the plot.
width (float | None)
The outer width of the plot in pixels, including margins. Defaults to 700.
height (float | None)
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
legend (Legend | NotGiven | None)
Options for the legend. Pass None to disable the legend.
**attributes (Unpack[PlotAttributes])
Additional PlotAttributes. By default, margin_top is set to 0, margin_left to 20, margin_right to 100, color_label is “Tool”, y_ticks is empty, and x_ticks and color_domain are calculated from data.
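The tools parameter controls both which tools appear and their order, falling back to every tool found in the data. A minimal sketch of that fallback on plain dictionaries follows. The helper name tool_domain and the choice of first-seen order for the fallback are assumptions, as are the sample rows.

```python
def tool_domain(rows, tools=None, tool="tool_call_function"):
    """Return the list of tool names to plot, in display order."""
    if tools is not None:
        return list(tools)  # explicit selection and order
    found = []
    for row in rows:
        name = row.get(tool)
        # Skip messages without a tool call; keep first-seen order.
        if name and name not in found:
            found.append(name)
    return found


# Hypothetical message rows using the default field name.
messages = [
    {"order": 1, "id": "s1", "tool_call_function": "bash"},
    {"order": 2, "id": "s1", "tool_call_function": None},
    {"order": 3, "id": "s1", "tool_call_function": "python"},
    {"order": 4, "id": "s1", "tool_call_function": "bash"},
]
print(tool_domain(messages))
print(tool_domain(messages, tools=["python"]))
```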
sample_heatmap
Creates a heatmap plot of success rate of eval data.
def sample_heatmap(
data: Data,
id: str = "id",
id_label: str | None | NotGiven = None,
model_name: str = "model_display_name",
model_label: str | None | NotGiven = None,
score_value: str | None = None,
cell: CellOptions | None = None,
tip: bool = True,
title: str | Title | None = None,
marks: Marks | None = None,
height: float | None = None,
width: float | None = None,
legend: Legend | NotGiven | None = NOT_GIVEN,
sort: Literal["ascending", "descending"] | SortOrder | None = "ascending",
orientation: Literal["horizontal", "vertical"] = "horizontal",
**attributes: Unpack[PlotAttributes],
) -> Component
data (Data)
Evals data table.
id (str)
Name of column to use for displaying the sample id.
id_label (str | None | NotGiven)
x-axis label (defaults to None).
model_name (str)
Name of column to use for rows.
model_label (str | None | NotGiven)
y-axis label (defaults to None).
score_value (str | None)
Name of the column to use as values to determine cell color.
cell (CellOptions | None)
Options for the cell marks.
tip (bool)
Whether to show a tooltip with the value when hovering over a cell (defaults to True).
title (str | Title | None)
Title for plot (str or mark created with the title() function).
marks (Marks | None)
Additional marks to include in the plot.
height (float | None)
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
width (float | None)
The outer width of the plot in pixels, including margins. Defaults to 700.
legend (Legend | NotGiven | None)
Options for the legend. Pass None to disable the legend.
sort (Literal['ascending', 'descending'] | SortOrder | None)
Sort order for the x and y axes. If ascending, the highest values will be sorted to the top right. If descending, the highest values will appear in the bottom left. If None, no sorting is applied. If a SortOrder is provided, it will be used to sort the x and y axes.
orientation (Literal['horizontal', 'vertical'])
The orientation of the heatmap. If “horizontal”, the samples will be on the x-axis and models on the y-axis. If “vertical”, the samples will be on the y-axis and models on the x-axis.
**attributes (Unpack[PlotAttributes])
Additional PlotAttributes.
General
heatmap
Creates a heatmap plot of arbitrary data.
def heatmap(
data: Data,
x_value: str = "id",
x_label: str | None | NotGiven = None,
y_value: str = "model_display_name",
y_label: str | None | NotGiven = None,
color_value: str | None = None,
channels: dict[str, Any] | None = None,
cell: CellOptions | None = None,
tip: bool = True,
title: str | Title | None = None,
marks: Marks | None = None,
height: float | None = None,
width: float | None = None,
legend: Legend | NotGiven | None = NOT_GIVEN,
sort: Literal["ascending", "descending"] | SortOrder | None = "ascending",
orientation: Literal["horizontal", "vertical"] = "horizontal",
**attributes: Unpack[PlotAttributes],
) -> Component
data (Data)
Evals data table.
x_value (str)
x-axis value.
x_label (str | None | NotGiven)
x-axis label (defaults to None).
y_value (str)
y-axis value.
y_label (str | None | NotGiven)
y-axis label (defaults to None).
color_value (str | None)
Name of the column to use as values to determine cell color.
channels (dict[str, Any] | None)
Channels to use for the plot. If None, the default channels are used.
cell (CellOptions | None)
Options for the cell marks.
tip (bool)
Whether to show a tooltip with the value when hovering over a cell (defaults to True).
title (str | Title | None)
Title for plot (str or mark created with the title() function).
marks (Marks | None)
Additional marks to include in the plot.
height (float | None)
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
width (float | None)
The outer width of the plot in pixels, including margins. Defaults to 700.
legend (Legend | NotGiven | None)
Options for the legend. Pass None to disable the legend.
sort (Literal['ascending', 'descending'] | SortOrder | None)
Sort order for the x and y axes. If ascending, the highest values will be sorted to the top right. If descending, the highest values will appear in the bottom left. If None, no sorting is applied. If a SortOrder is provided, it will be used to sort the x and y axes.
orientation (Literal['horizontal', 'vertical'])
The orientation of the heatmap. If “horizontal”, x_value will be on the x-axis and y_value on the y-axis. If “vertical”, x_value will be on the y-axis and y_value on the x-axis.
**attributes (Unpack[PlotAttributes])
Additional PlotAttributes.
Types
CellOptions
Cell options for the heatmap.
class CellOptions(TypedDict, total=False)
Attributes
inset (float | None)
Inset for the cell marks. Defaults to 1 pixel.
text (str | None)
Text color for the cell marks. Defaults to “white”. Set to None to disable text.
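Because CellOptions is declared with total=False, every key is optional and callers can supply only the settings they want to change. The sketch below uses a local stand-in class that mirrors the two documented attributes rather than importing the real type. The helper with_defaults is hypothetical and only illustrates how the documented defaults would fill in missing keys.

```python
from typing import Optional, TypedDict


class CellOptions(TypedDict, total=False):
    """Local stand-in mirroring the documented attributes (not the library class)."""
    inset: Optional[float]
    text: Optional[str]


# Defaults quoted in the attribute descriptions above.
DEFAULTS: CellOptions = {"inset": 1, "text": "white"}


def with_defaults(cell: Optional[CellOptions] = None) -> CellOptions:
    """Overlay caller-supplied keys on the documented defaults."""
    merged: CellOptions = {**DEFAULTS, **(cell or {})}
    return merged


# Only `text` is given; `inset` keeps its default. None disables cell text.
print(with_defaults({"text": None}))
```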