Scores Radar By Metric

from inspect_viz import Data
from inspect_viz.view.beta import scores_radar_by_metric

evals = Data.from_file("writing_bench_radar.parquet")
scores_radar_by_metric(evals)

Overview
The scores_radar_by_metric() function renders a radar chart for comparing model scores across multiple metrics from a single task. This is useful for tasks with composite metrics, where each metric is a separate axis on the radar chart.
The scores plotted on this radar chart have been normalized using percentile ranking, which means each score represents the model’s relative performance compared to all other models in the dataset. Specifically, a score of 0.5 indicates that the model performed better than 50% of the other models. Absolute scores are displayed in the tooltips.
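As an illustration of percentile ranking (a sketch of the concept only, not the library's internal code), pandas' `rank(pct=True)` gives each model's score as the fraction of models scoring at or below it. The model names and scores below are made up:

```python
import pandas as pd

# Toy scores for four models on a single metric (illustrative data only).
scores = pd.Series(
    {"model_a": 0.62, "model_b": 0.71, "model_c": 0.58, "model_d": 0.80}
)

# Percentile rank: the fraction of models scoring at or below each model.
percentile = scores.rank(pct=True)
print(percentile)
# model_a    0.50
# model_b    0.75
# model_c    0.25
# model_d    1.00
```

Under this scheme the best model always lands at 1.0 and the worst at 1/n, regardless of how far apart the raw scores are.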
You can use scores_radar_by_metric_df() to produce data with min-max normalization, to omit normalization entirely, or to invert the scores for metrics where lower is better.
Data Preparation
Above we read the data for the plot from a parquet file. This file was in turn created by:

1. Reading evals-level data into a data frame with evals_df().

2. Converting the evals dataframe into a dataframe specifically for scores_radar_by_metric() using the scores_radar_by_metric_df() function. The output of scores_radar_by_metric_df() can be passed directly to scores_radar_by_metric(). The function expects a scorer name, an optional list of metric names to visualize, an optional list of metric names to invert (where lower scores correspond to better performance), an optional normalization method to scale scores, and an optional min-max domain to use for normalization on the radar chart.

3. Using the prepare() function to add model_info() and log_viewer() columns to the data frame.
Here is the data preparation code end-to-end:
from inspect_ai.analysis import (
evals_df,
log_viewer,
model_info,
prepare,
)
from inspect_viz.view.beta import scores_radar_by_metric_df
df = evals_df("logs/writing_bench/")
df = scores_radar_by_metric_df(
df,
scorer="multi_scorer_wrapper",
metrics=[
"Abstract",
"Introduction",
"Experiments",
"Literature Review",
"Paper Outline",
],
normalization="percentile",
)
df = prepare(df, [
model_info(),
log_viewer("eval", { "logs": "https://samples.meridianlabs.ai/" })
])
df.to_parquet("writing_bench_radar.parquet")

1. Read the evals data into a dataframe.

2. Convert the dataframe into a scores_radar_by_metric() specific dataframe.

3. A task might have multiple scorers; specify the scorer you want to plot. The function only supports plotting one scorer at a time. The scorer name should correspond to columns in df named score_{scorer}_{metric}.

4. Specify a list of metrics to plot on the radar chart. If unspecified, all metrics from the scorer will be plotted. Metric names in the list should correspond to columns in df named score_{scorer}_{metric}.

5. Choose an optional normalization method to scale the raw scores. Available options: "percentile" (computes percentile rank; useful for identifying consistently strong performers), "min_max" (scales scores between min and max values; sensitive to outliers), or "absolute" (the default; no normalization, which may produce unreadable charts when metrics have different scales).

6. Add pretty model names and log links to the dataframe using prepare().
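The score_{scorer}_{metric} column convention mentioned above can be seen in a small toy frame (the scorer and metric names below are made up for illustration): a scorer called "grader" with metrics "clarity" and "depth" would appear as two prefixed columns, one row per model:

```python
import pandas as pd

# Hypothetical evals dataframe: one row per model, one
# score_{scorer}_{metric} column per metric of the "grader" scorer.
df = pd.DataFrame({
    "model": ["model_a", "model_b"],
    "score_grader_clarity": [0.7, 0.9],
    "score_grader_depth": [0.4, 0.6],
})

# Metric names for a given scorer can be recovered from the column prefix:
prefix = "score_grader_"
metrics = [c.removeprefix(prefix) for c in df.columns if c.startswith(prefix)]
print(metrics)  # ['clarity', 'depth']
```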
Function Reference
scores_radar_by_metric
Creates a radar chart showing scores for multiple models across multiple metrics in a single task.
This is useful for tasks with multiple metrics, where each metric is a separate axis on the radar chart.
def scores_radar_by_metric(
data: Data,
label: str = "metric",
**kwargs: Any,
) -> Component

scores_radar_by_metric_df
Creates a dataframe for a radar chart showing multiple models across multiple metrics in a single task.
This is useful for tasks with multiple metrics, where each metric is a separate axis on the radar chart.
def scores_radar_by_metric_df(
data: pd.DataFrame,
scorer: str,
metrics: list[str] | None = None,
invert: list[str] | None = None,
normalization: Literal["percentile", "min_max", "absolute"] = "absolute",
domain: tuple[float, float] | None = None,
) -> pd.DataFrame

data (pd.DataFrame)
Evals data table containing model scores. One row per model is assumed.

scorer (str)
The name of the scorer to use for identifying metric columns.

metrics (list[str] | None)
Optional list of specific metrics to plot. If None, all metrics whose columns start with score_{scorer}_ will be used.

invert (list[str] | None)
Optional list of metrics to invert (where lower scores are better).

normalization (Literal["percentile", "min_max", "absolute"])
The normalization method to use for the metric values. Can be "percentile", "min_max", or "absolute". Defaults to "absolute" (no normalization).

domain (tuple[float, float] | None)
Optional min-max domain to use for normalization. Only used when normalization is "min_max"; otherwise the domain is inferred from the data. Defaults to None.
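As a sketch of how min-max normalization with an optional domain and metric inversion could work (illustrative only; the library's exact implementation may differ), the helper below is a hypothetical stand-in:

```python
import pandas as pd

def min_max_normalize(s: pd.Series, domain=None, invert=False) -> pd.Series:
    """Scale scores into [0, 1]. Optionally use a fixed (lo, hi) domain
    instead of the data's own min/max, and flip metrics where lower
    raw scores are better."""
    lo, hi = domain if domain is not None else (s.min(), s.max())
    scaled = (s - lo) / (hi - lo)
    return 1.0 - scaled if invert else scaled

scores = pd.Series([2.0, 4.0, 6.0])
print(min_max_normalize(scores).tolist())                     # [0.0, 0.5, 1.0]
print(min_max_normalize(scores, invert=True).tolist())        # [1.0, 0.5, 0.0]
print(min_max_normalize(scores, domain=(0.0, 10.0)).tolist()) # [0.2, 0.4, 0.6]
```

A fixed domain is useful when several charts should share a scale; without it, each metric's extremes always map to 0 and 1, which exaggerates small differences.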
Implementation
The Scores Radar By Metric example demonstrates how this view was implemented using lower level plotting components.