from inspect_viz import Data
from inspect_viz.view.beta import scores_by_limit

evals = Data.from_file("swebench_token_limit.parquet")
scores_by_limit(evals)
Scores by Limit
Overview
The scores_by_limit() function renders a line plot for evaluating how a model’s success rate changes as the compute budget increases (e.g., token limit or time). It helps answer “Will performance keep improving if I spend more?”. The shaded band displays the confidence interval derived from the standard error.
This visualization requires that you run your evaluation with a very high time or token limit, allowing the model a large amount of the resource to complete each sample in the evaluation. Then, use the scores_by_limit_df() function to prepare the dataframe for visualization, computing the amount of time or tokens required to solve each sample.
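To make the underlying computation concrete, here is an illustrative pure-Python sketch of the idea behind scores_by_limit_df() (an assumption about its approach, not the library's actual implementation): for each candidate limit, a sample counts as a success only if it succeeded and did so within that budget.

```python
# Illustrative sketch (not the library implementation): success rate as a
# function of a token budget, computed from per-sample (tokens_used, score)
# pairs where score is 0 = fail, 1 = success.

def success_rate_at_limit(samples: list[tuple[int, int]], limit: int) -> float:
    """Fraction of samples solved within `limit` tokens."""
    solved = sum(1 for tokens, score in samples if score == 1 and tokens <= limit)
    return solved / len(samples)

samples = [(5_000, 1), (20_000, 1), (80_000, 0), (150_000, 1)]
# With a 10k budget only the first success fits; a 200k budget admits all three.
print(success_rate_at_limit(samples, 10_000))   # 0.25
print(success_rate_at_limit(samples, 200_000))  # 0.75
```

Sweeping this over a range of limits yields the curve that scores_by_limit() plots for each model.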
Data Preparation
Above we read the data for the plot from a parquet file. This file was in turn created by:

1. Reading sample level data into a dataframe with samples_df(). In addition to the base sample information, we also read eval-specific columns using EvalInfo and EvalModel.
2. Converting the samples dataframe into a dataframe specifically used by scores_by_limit() using the scores_by_limit_df() function.
3. Using the prepare() function to add model_info() and log_viewer() columns to the dataframe.
Here is the data preparation code end-to-end:
from inspect_ai.analysis import (
    EvalInfo, EvalModel, SampleSummary,
    log_viewer, model_info, prepare, samples_df
)
from inspect_viz.view.beta import scores_by_limit_df

df = samples_df(
    ["logs/swe-bench/"],
    columns=SampleSummary + EvalInfo + EvalModel,
)

df = scores_by_limit_df(
    df,
    score="score_swe_bench_scorer",
)

df = prepare(
    df,
    model_info(),
    log_viewer("eval", {"logs": "https://samples.meridianlabs.ai/"}),
)

df.to_parquet("swebench_token_limit.parquet")
1. Read the samples data into a dataframe.
2. Be sure to specify the SampleSummary, EvalInfo, and EvalModel columns.
3. Convert the base dataframe into a scores_by_limit() specific dataframe.
4. Add pretty model names and log links to the dataframe.
Note that both the log viewer links and model names are optional (the plot will render without links and use raw model strings if the data isn't prepared with log_viewer() and model_info()).
Function Reference
scores_by_limit
Visualizes model success rate as a function of a resource limit (time, tokens, or other resource).
def scores_by_limit(
    data: Data,
    model: str = "model_display_name",
    success: str = "success_rate",
    stderr: str | None = "standard_error",
    facet: str | None = None,
    other_termination_rate: str | bool = False,
    limit: str | None = None,
    limit_label: str | NotGiven = NOT_GIVEN,
    scale: Literal["log", "linear", "auto"] = "auto",
    height: float | None = None,
    width: float | None = None,
    ci: float = 0.95,
    **attributes: Unpack[PlotAttributes],
) -> Component
data (Data)
A dataframe prepared using the scores_by_limit_df() function.

model (str)
Name of field holding the model (defaults to "model_display_name").

success (str)
Name of field containing the success rate (defaults to "success_rate").

stderr (str | None)
Name of field containing the standard error (defaults to "standard_error").

facet (str | None)
Name of field to use for faceting (defaults to None).

other_termination_rate (str | bool)
Name of field containing the other termination rate (defaults to "other_termination_rate").

limit (str | None)
Name of field for the x axis (by default, will detect the limit type using the columns present in the dataframe).

limit_label (str | NotGiven)
The limit label (by default, will select the limit label using the columns present in the dataframe). Pass None for no label.

scale (Literal['log', 'linear', 'auto'])
The scale type for the limit axis. If 'auto', will use a log scale if the range is 2 or more orders of magnitude (defaults to 'auto').

height (float | None)
The outer height of the plot in pixels, including margins. Defaults to width / 1.618 (the golden ratio).

width (float | None)
The outer width of the plot in pixels, including margins. Defaults to 700.

ci (float)
Confidence interval (e.g. 0.80, 0.90, 0.95, etc.). Defaults to 0.95.

**attributes (Unpack[PlotAttributes])
Additional PlotAttributes.
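The shaded band is derived from the standard error and the ci level. A plausible sketch of that arithmetic, assuming a normal approximation (this is an assumption, not necessarily the library's exact method), uses the normal quantile for the requested two-sided interval:

```python
# Sketch of deriving a confidence band from a success rate and its
# standard error (assumed normal approximation; not necessarily the
# library's exact method).
from statistics import NormalDist

def ci_band(rate: float, stderr: float, ci: float = 0.95) -> tuple[float, float]:
    """Return (lower, upper) bounds of the two-sided confidence interval."""
    z = NormalDist().inv_cdf((1 + ci) / 2)  # ~1.96 for ci=0.95
    return rate - z * stderr, rate + z * stderr

low, high = ci_band(0.6, 0.05, ci=0.95)  # band of roughly 0.6 +/- 0.098
```

Lowering ci (e.g. to 0.80) narrows the band without changing the plotted success-rate line.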
scores_by_limit_df
Prepares a dataframe for plotting success rate as a function of a resource limit (time, tokens).
def scores_by_limit_df(
    df: pd.DataFrame,
    score: str,
    limit: Literal["total_tokens", "total_time", "working_time"] = "total_tokens",
    scale: Literal["log", "linear", "auto"] = "auto",
    steps: int = 100,
) -> pd.DataFrame
df (pd.DataFrame)
A dataframe containing sample summaries and eval information.

score (str)
Name of field containing the score (0 = fail, 1 = success).

limit (Literal['total_tokens', 'total_time', 'working_time'])
The resource limit to use (one of 'total_tokens', 'total_time', 'working_time'). Defaults to 'total_tokens'.

scale (Literal['log', 'linear', 'auto'])
The scale type for the limit axis. If 'auto', will use a log scale if the range is 2 or more orders of magnitude (defaults to 'auto').

steps (int)
The number of points to use when sampling the limit range (defaults to 100).
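The 'auto' scale rule described above (a log axis when the limit values span two or more orders of magnitude) can be sketched as follows; the helper name is illustrative, not part of the library API:

```python
# Sketch of the documented 'auto' scale rule: choose a log axis when the
# limit values span 2+ orders of magnitude (helper name is illustrative).
import math

def choose_scale(values: list[float]) -> str:
    lo, hi = min(values), max(values)
    if lo > 0 and math.log10(hi / lo) >= 2:
        return "log"
    return "linear"

print(choose_scale([1_000, 5_000, 250_000]))  # "log" (250x range)
print(choose_scale([10, 500]))                # "linear" (50x range)
```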
Implementation
The Scores by Limit example demonstrates how this view was implemented using lower-level plotting components.