inspect_viz.view.beta
View functions are currently in beta and are exported from the inspect_viz.view.beta module. The beta module will be preserved after final release so that code written against it now will continue to work after the beta.
Scores
scores_by_task
Bar plot for comparing eval scores.
Summarize eval scores using a bar plot. By default, scores (y) are plotted by “task_display_name” (fx) and “model_display_name” (x). Confidence intervals are also plotted by default (disable this with ci=False).
def scores_by_task(
    data: Data,
    model_name: str = "model_display_name",
    task_name: str = "task_display_name",
    score_value: str = "score_headline_value",
    score_stderr: str = "score_headline_stderr",
    score_label: str | None | NotGiven = NOT_GIVEN,
    ci: bool | float = 0.95,
    title: str | Title | None = None,
    marks: Marks | None = None,
    width: float | Param | None = None,
    height: float | Param | None = None,
    **attributes: Unpack[PlotAttributes],
) -> Component
data
Data-
Evals data table. This is typically created using a data frame read with the inspect evals_df() function.
model_name
str-
Name of field for the model name (defaults to “model_display_name”)
task_name
str-
Name of field for the task name (defaults to “task_display_name”)
score_value
str-
Name of field for the score value (defaults to “score_headline_value”).
score_stderr
str-
Name of field for stderr (defaults to “score_headline_stderr”).
score_label
str | None | NotGiven-
Score axis label (pass None for no label).
ci
bool | float-
Confidence interval (e.g. 0.80, 0.90, 0.95, etc.). Defaults to 0.95.
title
str | Title | None-
Title for plot (str or mark created with the title() function).
marks
Marks | None-
Additional marks to include in the plot.
width
float | Param | None-
The outer width of the plot in pixels, including margins. Defaults to 700.
height
float | Param | None-
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio)
**attributes
Unpack[PlotAttributes]-
Additional PlotAttributes. By default, margin_bottom is set to 10 pixels and x_ticks is set to [].
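A minimal usage sketch follows. The evals_df() import path and the Data.from_dataframe() constructor are assumptions (they are not part of this reference); check the inspect_ai and inspect_viz documentation for the exact names.

```python
from inspect_ai.analysis.beta import evals_df  # assumed import path
from inspect_viz import Data                   # Data.from_dataframe() assumed below
from inspect_viz.view.beta import scores_by_task

# Read eval logs into a data frame, then wrap it as a Data table.
evals = Data.from_dataframe(evals_df("./logs"))

# Bar plot of headline scores grouped by task and model, with 90% CIs.
plot = scores_by_task(
    evals,
    ci=0.90,
    title="Scores by Task",
)
```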
scores_by_factor
Summarize eval scores with a factor of variation (e.g. ‘No hint’ vs. ‘Hint’).
def scores_by_factor(
    data: Data,
    factor: str,
    factor_labels: tuple[str, str],
    score_value: str = "score_headline_value",
    score_stderr: str = "score_headline_stderr",
    score_label: str = "Score",
    model: str = "model",
    model_label: str = "Model",
    ci: bool | float = 0.95,
    color: str | tuple[str, str] = "#3266ae",
    title: str | Mark | None = None,
    marks: Marks | None = None,
    width: float | Param | None = None,
    height: float | Param | None = None,
    **attributes: Unpack[PlotAttributes],
) -> Component
data
Data-
Evals data table. This is typically created using a data frame read with the inspect evals_df() function.
factor
str-
Field with factor of variation (should be of type boolean).
factor_labels
tuple[str, str]-
Tuple of labels for the factor of variation. The False value should be first, e.g. ("No hint", "Hint").
score_value
str-
Name of field for x (scoring) axis (defaults to “score_headline_value”).
score_stderr
str-
Name of field for scoring stderr (defaults to “score_headline_stderr”).
score_label
str-
Label for x-axis (defaults to “Score”).
model
str-
Name of field for y axis (defaults to “model”).
model_label
str-
Label for y axis (defaults to “Model”).
ci
bool | float-
Confidence interval (e.g. 0.80, 0.90, 0.95, etc.). Defaults to 0.95.
color
str | tuple[str, str]-
Hex color value (or tuple of two values). If one value is provided the second is computed by lightening the main color.
title
str | Mark | None-
Title for plot (str or mark created with the title() function).
marks
Marks | None-
Additional marks to include in the plot.
width
float | Param | None-
The outer width of the plot in pixels, including margins. Defaults to 700.
height
float | Param | None-
The outer height of the plot in pixels, including margins. Defaults to 65 pixels for each item on the “y” axis.
**attributes
Unpack[PlotAttributes]-
Additional PlotAttributes.
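A sketch of a hint/no-hint comparison. The boolean “hint” column and its derivation are hypothetical (you would add such a column to the evals data frame yourself), and the evals_df()/Data.from_dataframe() names are assumptions as in the earlier sketch.

```python
from inspect_ai.analysis.beta import evals_df  # assumed import path
from inspect_viz import Data
from inspect_viz.view.beta import scores_by_factor

df = evals_df("./logs")
df["hint"] = df["task_arg_hint"].astype(bool)  # hypothetical source column
data = Data.from_dataframe(df)                 # assumed constructor

plot = scores_by_factor(
    data,
    factor="hint",
    factor_labels=("No hint", "Hint"),  # False label first, per the docs
    color="#3266ae",                    # the documented default color
)
```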
scores_timeline
Eval scores by model, organization, and release date.
def scores_timeline(
    data: Data,
    task_name: str = "task_display_name",
    model_name: str = "model_display_name",
    model_organization: str = "model_organization_name",
    model_release_date: str = "model_release_date",
    score_name: str = "score_headline_name",
    score_value: str = "score_headline_value",
    score_stderr: str = "score_headline_stderr",
    organizations: list[str] | None = None,
    filters: bool | list[Literal["task", "organization"]] = True,
    ci: float | bool = 0.95,
    time_label: str = "Release Date",
    score_label: str = "Score",
    eval_label: str = "Eval",
    title: str | Title | None = None,
    marks: Marks | None = None,
    width: float | Param | None = None,
    height: float | Param | None = None,
    regression: bool = False,
    legend: Legend | NotGiven | None = NOT_GIVEN,
    **attributes: Unpack[PlotAttributes],
) -> Component
data
Data-
Data read using evals_df() and amended with model metadata using the model_info() prepare operation (see Data Preparation for details).
task_name
str-
Column for task name (defaults to “task_display_name”).
model_name
str-
Column for model name (defaults to “model_display_name”).
model_organization
str-
Column for model organization (defaults to “model_organization_name”).
model_release_date
str-
Column for model release date (defaults to “model_release_date”).
score_name
str-
Column for scorer name (defaults to “score_headline_name”).
score_value
str-
Column for score value (defaults to “score_headline_value”).
score_stderr
str-
Column for score stderr (defaults to “score_headline_stderr”)
organizations
list[str] | None-
List of organizations to include (in order of desired presentation).
filters
bool | list[Literal['task', 'organization']]-
Provide UI to filter plot by task and organization(s).
ci
float | bool-
Confidence interval (defaults to 0.95; pass False for no confidence intervals).
time_label
str-
Label for time (x-axis).
score_label
str-
Label for score (y-axis).
eval_label
str-
Label for eval select input.
title
str | Title | None-
Title for plot (str or mark created with the title() function).
marks
Marks | None-
Additional marks to include in the plot.
width
float | Param | None-
The outer width of the plot in pixels, including margins. Defaults to 700.
height
float | Param | None-
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio)
regression
bool-
If True, adds a regression line to the plot (uses the confidence interval passed using ci). Defaults to False.
legend
Legend | NotGiven | None-
Legend to use for the plot (defaults to None, which uses the default legend).
**attributes
Unpack[PlotAttributes]-
Additional PlotAttributes. By default, x_domain is set to “fixed”, y_domain is set to [0, 1.0], color_label is set to “Organizations”, and color_domain is set to organizations.
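A sketch of a timeline plot. Per the data parameter docs, the data frame must first be amended with model metadata via the model_info() prepare operation; that step is elided below. The import paths and Data.from_dataframe() are assumptions, and the organization names are illustrative.

```python
from inspect_ai.analysis.beta import evals_df  # assumed import path
from inspect_viz import Data
from inspect_viz.view.beta import scores_timeline

df = evals_df("./logs")
# ... amend df with model metadata using the model_info() prepare operation ...
data = Data.from_dataframe(df)  # assumed constructor

plot = scores_timeline(
    data,
    organizations=["OpenAI", "Anthropic", "Google"],  # presentation order (illustrative)
    regression=True,   # add a regression line using the default 0.95 CI
    title="Headline Scores by Release Date",
)
```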
scores_by_model
Bar plot for comparing the scores of different models on a single evaluation.
Summarize eval scores using a bar plot. By default, scores are plotted by “model_display_name”, and confidence intervals are also plotted (controlled with the ci option).
def scores_by_model(
    data: Data,
    *,
    model_name: str = "model_display_name",
    score_value: str = "score_headline_value",
    score_stderr: str = "score_headline_stderr",
    ci: float = 0.95,
    sort: Literal["asc", "desc"] | None = None,
    score_label: str | None | NotGiven = None,
    model_label: str | None | NotGiven = None,
    color: str | None = None,
    title: str | Title | None = None,
    marks: Marks | None = None,
    width: float | None = None,
    height: float | None = None,
    **attributes: Unpack[PlotAttributes],
) -> Component
data
Data-
Evals data table. This is typically created using a data frame read with the inspect evals_df() function.
model_name
str-
Column containing the model name (defaults to “model_display_name”)
score_value
str-
Column containing the score value (defaults to “score_headline_value”).
score_stderr
str-
Column containing the score standard error (defaults to “score_headline_stderr”).
ci
float-
Confidence interval (e.g. 0.80, 0.90, 0.95, etc.). Defaults to 0.95.
sort
Literal['asc', 'desc'] | None-
Sort order for the bars (sorts using the ‘x’ value). Can be “asc” or “desc”. Defaults to “asc”.
score_label
str | None | NotGiven-
Score axis label (defaults to None).
model_label
str | None | NotGiven-
Model axis label (defaults to None).
color
str | None-
The color for the bars. Defaults to “#416AD0”. Pass any valid hex color value.
title
str | Title | None-
Title for plot (str or mark created with the title() function).
marks
Marks | None-
Additional marks to include in the plot.
width
float | None-
The outer width of the plot in pixels, including margins. Defaults to 700.
height
float | None-
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio)
**attributes
Unpack[PlotAttributes]-
Additional PlotAttributes. By default, y_inset_top and margin_bottom are set to 10 pixels and x_ticks is set to [].
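A sketch of a single-eval model comparison (same assumptions about evals_df() and Data.from_dataframe() as above); it presumes the logs read cover a single evaluation, and the log path and title are illustrative.

```python
from inspect_ai.analysis.beta import evals_df  # assumed import path
from inspect_viz import Data
from inspect_viz.view.beta import scores_by_model

data = Data.from_dataframe(evals_df("./logs/gpqa"))  # logs for a single eval (illustrative path)

plot = scores_by_model(
    data,
    sort="desc",       # highest-scoring model first
    color="#416AD0",   # the documented default bar color
    title="GPQA",
)
```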
scores_heatmap
Creates a heatmap plot of success rate of eval data.
def scores_heatmap(
    data: Data,
    task_name: str = "task_display_name",
    task_label: str | None | NotGiven = None,
    model_name: str = "model_display_name",
    model_label: str | None | NotGiven = None,
    score_value: str = "score_headline_value",
    cell: CellOptions | None = None,
    tip: bool = True,
    title: str | Title | None = None,
    marks: Marks | None = None,
    height: float | None = None,
    width: float | None = None,
    legend: Legend | bool | None = None,
    sort: Literal["ascending", "descending"] | SortOrder | None = "ascending",
    orientation: Literal["horizontal", "vertical"] = "horizontal",
    **attributes: Unpack[PlotAttributes],
) -> Component
data
Data-
Evals data table.
task_name
str-
Name of column to use for the heatmap columns (defaults to “task_display_name”).
task_label
str | None | NotGiven-
x-axis label (defaults to None).
model_name
str-
Name of column to use for the heatmap rows (defaults to “model_display_name”).
model_label
str | None | NotGiven-
y-axis label (defaults to None).
score_value
str-
Name of the column to use as values to determine cell color (defaults to “score_headline_value”).
cell
CellOptions | None-
Options for the cell marks.
tip
bool-
Whether to show a tooltip with the value when hovering over a cell (defaults to True).
title
str | Title | None-
Title for plot (str or mark created with the title() function).
marks
Marks | None-
Additional marks to include in the plot.
height
float | None-
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio).
width
float | None-
The outer width of the plot in pixels, including margins. Defaults to 700.
legend
Legend | bool | None-
Options for the legend. Pass None to disable the legend.
sort
Literal['ascending', 'descending'] | SortOrder | None-
Sort order for the x and y axes. If ascending, the highest values will be sorted to the top right. If descending, the highest values will appear in the bottom left. If None, no sorting is applied. If a SortOrder is provided, it will be used to sort the x and y axes.
orientation
Literal['horizontal', 'vertical']-
The orientation of the heatmap. If “horizontal”, the tasks will be on the x-axis and models on the y-axis. If “vertical”, the tasks will be on the y-axis and models on the x-axis.
**attributes
Unpack[PlotAttributes]-
Additional PlotAttributes.
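A sketch of a task-by-model heatmap (same assumptions about evals_df() and Data.from_dataframe() as above). The cell options dict corresponds to the CellOptions type described under Types below.

```python
from inspect_ai.analysis.beta import evals_df  # assumed import path
from inspect_viz import Data
from inspect_viz.view.beta import scores_heatmap

data = Data.from_dataframe(evals_df("./logs"))  # assumed constructor

plot = scores_heatmap(
    data,
    orientation="vertical",           # tasks on the y-axis, models on the x-axis
    sort="descending",
    cell={"inset": 2, "text": None},  # CellOptions: wider gaps, no in-cell text
)
```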
scores_by_limit
Visualizes success rate as a function of a resource limit (time, tokens).
Model success rate is plotted as a function of the time, tokens, or other resource limit.
def scores_by_limit(
    data: Data,
    model: str = "model_display_name",
    success: str = "success_rate",
    stderr: str | None = "standard_error",
    facet: str | None = None,
    other_termination_rate: str | bool = False,
    limit: str | None = None,
    limit_label: str | NotGiven = NOT_GIVEN,
    scale: Literal["log", "linear", "auto"] = "auto",
    height: float | None = None,
    width: float | None = None,
    ci: float = 0.95,
    **attributes: Unpack[PlotAttributes],
) -> Component
data
Data-
A dataframe prepared using the prepare_limit_dataframe function.
model
str-
Name of field holding the model (defaults to “model_display_name”).
success
str-
Name of field containing the success rate (defaults to “success_rate”).
stderr
str | None-
Name of field containing the standard_error (defaults to “standard_error”).
facet
str | None-
Name of field to use for faceting (defaults to None).
other_termination_rate
str | bool-
Name of field containing the other termination rate (defaults to “other_termination_rate”).
limit
str | None-
Name of field for x axis (by default, will detect limit type using the columns present in the data frame).
limit_label
str | NotGiven-
The limit label (by default, will select limit label using the columns present in the data frame). Pass None for no label.
scale
Literal['log', 'linear', 'auto']-
The scale type for the limit axis. If ‘auto’, will use log scale if the range is 2 or more orders of magnitude (defaults to ‘auto’).
height
float | None-
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio)
width
float | None-
The outer width of the plot in pixels, including margins. Defaults to 700.
ci
float-
Confidence interval (e.g. 0.80, 0.90, 0.95, etc.). Defaults to 0.95.
**attributes
Unpack[PlotAttributes]-
Additional PlotAttributes.
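A sketch of the limit plot. Here limit_df is assumed to be a pandas DataFrame that has already been prepared for this plot (for instance with scores_by_limit_df() below), and Data.from_dataframe() is an assumed constructor.

```python
from inspect_viz import Data
from inspect_viz.view.beta import scores_by_limit

# limit_df: a pandas DataFrame already prepared for this plot
# (e.g. with scores_by_limit_df() below).
data = Data.from_dataframe(limit_df)  # assumed constructor

plot = scores_by_limit(
    data,
    scale="log",                  # force a log scale on the limit axis
    limit_label="Total tokens",
    ci=0.90,
)
```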
scores_by_limit_df
Prepares a dataframe for plotting success rate as a function of a resource limit (time, tokens).
def scores_by_limit_df(
    df: pd.DataFrame,
    score: str,
    limit: Literal["total_tokens", "total_time", "working_time"] = "total_tokens",
    scale: Literal["log", "linear", "auto"] = "auto",
    steps: int = 100,
) -> pd.DataFrame
df
pd.DataFrame-
A dataframe containing sample summaries and eval information.
score
str-
Name of field containing the score (0 = fail, 1 = success).
limit
Literal['total_tokens', 'total_time', 'working_time']-
The resource limit to use (one of ‘total_tokens’, ‘total_time’, ‘working_time’). Defaults to ‘total_tokens’.
scale
Literal['log', 'linear', 'auto']-
The scale type for the limit axis. If ‘auto’, will use log scale if the range is 2 or more orders of magnitude (defaults to ‘auto’).
steps
int-
The number of points to use when sampling the limit range (defaults to 100).
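A sketch of the preparation step feeding the plot above. How the samples data frame is read is elided (the docs describe it only as sample summaries plus eval information), and “score_includes” is a hypothetical 0/1 score column.

```python
from inspect_viz.view.beta import scores_by_limit_df

# samples: a pandas DataFrame of sample summaries joined with eval information
# (reading/joining elided here).
limit_df = scores_by_limit_df(
    samples,
    score="score_includes",   # hypothetical 0/1 score column
    limit="total_tokens",
    steps=200,                # sample the limit range at 200 points (default 100)
)
```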
Tools
tool_calls
Heat map visualising tool calls over evaluation turns.
def tool_calls(
    data: Data,
    x: str = "order",
    y: str = "id",
    tool: str = "tool_call_function",
    limit: str = "limit",
    tools: list[str] | None = None,
    x_label: str | None = "Message",
    y_label: str | None = "Sample",
    title: str | Title | None = None,
    marks: Marks | None = None,
    width: float | None = None,
    height: float | None = None,
    **attributes: Unpack[PlotAttributes],
) -> Component
data
Data-
Messages data table. This is typically created using a data frame read with the inspect messages_df() function.
x
str-
Name of field for x axis (defaults to “order”)
y
str-
Name of field for y axis (defaults to “id”).
tool
str-
Name of field with tool name (defaults to “tool_call_function”)
limit
str-
Name of field with sample limit (defaults to “limit”).
tools
list[str] | None-
Tools to include in the plot (and the order in which to include them). Defaults to all tools found in data.
x_label
str | None-
x-axis label (defaults to “Message”).
y_label
str | None-
y-axis label (defaults to “Sample”).
title
str | Title | None-
Title for plot (str or mark created with the title() function).
marks
Marks | None-
Additional marks to include in the plot.
width
float | None-
The outer width of the plot in pixels, including margins. Defaults to 700.
height
float | None-
The outer height of the plot in pixels, including margins. The default is width / 1.618 (the golden ratio)
**attributes
Unpack[PlotAttributes]-
Additional PlotAttributes. By default, margin_top is set to 0, margin_left to 20, margin_right to 100, color_label is “Tool”, y_ticks is empty, and x_ticks and color_domain are calculated from data.
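A usage sketch. The messages_df() import path and Data.from_dataframe() are assumptions, and “bash”/“python” are illustrative tool names.

```python
from inspect_ai.analysis.beta import messages_df  # assumed import path
from inspect_viz import Data
from inspect_viz.view.beta import tool_calls

data = Data.from_dataframe(messages_df("./logs"))  # assumed constructor

# Heat map of tool calls per sample, restricted to two tools of interest.
plot = tool_calls(
    data,
    tools=["bash", "python"],  # illustrative tool names (defaults to all tools in data)
    title="Tool Calls by Turn",
)
```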
Types
CellOptions
Cell options for the heatmap.
class CellOptions(TypedDict, total=False)
Attributes
inset
float | None-
Inset for the cell marks. Defaults to 1 pixel.
text
str | None-
Text color for the cell marks. Defaults to “white”. Set to None to disable text.
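Since CellOptions is a TypedDict with all keys optional, a plain dict can be passed as the cell argument to scores_heatmap(). A minimal sketch (the import location is assumed from the Types listing above):

```python
from inspect_viz.view.beta import CellOptions  # assumed export location

# Two-pixel inset between cells and no in-cell text labels.
cell_options: CellOptions = {"inset": 2, "text": None}
```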