Compare LLM Ratings Against Human Norms

The Analysis page is where you compare ratings from completed runs to the OASIS human norms from Kurdi et al. (2017). You select a scope (a set of runs and models), and the page computes eight statistical views live from your local DuckDB store. Every tab shares the same filtered data; only the lens changes.

Access modes

The Analysis page operates in two modes.

Mode	How you scope the data
Ad-hoc	Use the sidebar filters to pick one or more image sets and models directly. Best for quick, exploratory comparisons.
Curated	Select a saved Analysis bundle. The bundle pins a specific list of `run_id` values so your scope is reproducible and shareable.

Both modes feed into the same analytics body below the filter bar. Switch between them using the mode selector at the top of the sidebar.

An Analysis bundle requires that all pinned runs were executed against the same `dataset_id`. If you try to add a run from a different image set, OASIS-LLM will reject it with a validation error.
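The same-dataset rule can be pictured with a small sketch. This is not the OASIS-LLM implementation: the `BundleValidationError` class, the `add_run_to_bundle` function, and the dictionary layout are all hypothetical names chosen for illustration.

```python
class BundleValidationError(ValueError):
    """Hypothetical error type; the real exception name is not documented here."""


def add_run_to_bundle(bundle, run):
    """Pin a run to a bundle only if its dataset_id matches the bundle's.

    `bundle` and `run` are plain dicts in this sketch (an assumption).
    """
    if bundle["dataset_id"] is None:
        # The first pinned run fixes the dataset for the whole bundle.
        bundle["dataset_id"] = run["dataset_id"]
    elif run["dataset_id"] != bundle["dataset_id"]:
        raise BundleValidationError(
            f"run {run['run_id']} uses dataset {run['dataset_id']!r}, "
            f"but the bundle is pinned to {bundle['dataset_id']!r}"
        )
    bundle["run_ids"].append(run["run_id"])
    return bundle


bundle = {"dataset_id": None, "run_ids": []}
add_run_to_bundle(bundle, {"run_id": "run_a", "dataset_id": "oasis_full"})

try:
    add_run_to_bundle(bundle, {"run_id": "run_b", "dataset_id": "oasis_subset"})
    rejected = False
except BundleValidationError:
    rejected = True
```

After this runs, only `run_a` is pinned and the second call has been rejected.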

Pipeline overview

The pipeline runs from raw images to the comparison statistics you see on screen. The aggregation step collapses all trials for a given `(run_id, image_id, dimension)` into a single mean before any comparison is made. Human norms are the published `Valence_mean` and `Arousal_mean` columns from `OASIS.csv`.
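As a rough illustration of the aggregation step, the sketch below groups trial ratings by `(run_id, image_id, dimension)`, averages them, and pairs each mean with the matching human norm. The record layout, the image names, and the rating values are made up for the example; the dashboard itself reads from DuckDB, which this sketch does not use.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical trial records: (run_id, image_id, dimension, rating).
trials = [
    ("run_a", "Dog 1", "valence", 6.0),
    ("run_a", "Dog 1", "valence", 5.0),
    ("run_a", "Dog 1", "arousal", 4.0),
    ("run_a", "Fire 2", "valence", 2.0),
    ("run_a", "Fire 2", "valence", 3.0),
]

# Hypothetical human norms keyed by image_id, standing in for the
# Valence_mean and Arousal_mean columns.
human_norms = {
    "Dog 1": {"valence": 5.8, "arousal": 3.9},
    "Fire 2": {"valence": 2.4, "arousal": 5.1},
}


def aggregate_trials(trials):
    """Collapse all trials per (run_id, image_id, dimension) into one mean."""
    groups = defaultdict(list)
    for run_id, image_id, dimension, rating in trials:
        groups[(run_id, image_id, dimension)].append(rating)
    return {key: fmean(values) for key, values in groups.items()}


def pair_with_norms(llm_means, human_norms):
    """Attach the human norm to each aggregated LLM mean."""
    pairs = []
    for (run_id, image_id, dimension), llm_mean in sorted(llm_means.items()):
        human_mean = human_norms[image_id][dimension]
        pairs.append((run_id, image_id, dimension, llm_mean, human_mean))
    return pairs


llm_means = aggregate_trials(trials)
pairs = pair_with_norms(llm_means, human_norms)
```

Every downstream statistic then works on one LLM mean and one human mean per image, never on individual trials, unless a statistic explicitly says otherwise.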

Filter controls

Use the filter bar to narrow the data before any tab renders.

Control	What it does
Model	Multi-select. Choose one or more models. “All” pools every model in scope into a single LLM mean.
Category	Filter to Animal, Scene, Person, or Object. Categories are colour-coded throughout the UI.
Dimension	Toggle between valence and arousal, or view both.
Image	Substring search on `image_id`. Useful for drilling into a specific stimulus.
Aggregation scope	Choose one of: Pooled all-LLMs · By model · By category · Model × Category. Controls how the comparison statistics are grouped.

The eight analysis tabs

Shows N, mean, SD, median, and range for humans, each model individually, and the pooled-LLM mean. Use this tab first to check whether your LLM means sit inside the human range before interpreting any inferential statistics.
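The summary row for one group, and the range check suggested above, could be computed along these lines. This is a minimal sketch with made-up per-image means; the use of the sample SD (n − 1 denominator) is an assumption, since the page does not say which SD the tab reports.

```python
from statistics import fmean, median, stdev


def describe(values):
    """N, mean, SD, median, and range for one group of per-image means."""
    return {
        "n": len(values),
        "mean": fmean(values),
        "sd": stdev(values),  # sample SD (n - 1); an assumption
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def within_range(values, reference):
    """True if every value lies inside the min-max range of the reference."""
    return min(reference) <= min(values) and max(values) <= max(reference)


# Hypothetical per-image means for four images.
human = [2.0, 3.5, 5.0, 6.5]
llm = [2.5, 3.0, 5.5, 6.0]

summary = describe(llm)
inside = within_range(llm, human)
```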

Fits an ordinary least squares (OLS) regression, LLM = a + b · Human, for each scope. A perfectly calibrated model produces b = 1 and a = 0. A slope greater than 1 indicates scale stretch; a positive intercept indicates an upward shift. Reports slope, intercept, R², and residual SD.
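To make the calibration fit concrete, here is a small closed-form sketch of a simple linear regression of LLM means on human means. The function name `fit_calibration` and the data are illustrative, and the residual SD uses n − 2 degrees of freedom, which is an assumption about how the tab defines it.

```python
from math import sqrt
from statistics import fmean


def fit_calibration(human, llm):
    """Fit llm = intercept + slope * human by ordinary least squares."""
    n = len(human)
    mean_h, mean_l = fmean(human), fmean(llm)
    sxx = sum((h - mean_h) ** 2 for h in human)
    sxy = sum((h - mean_h) * (l - mean_l) for h, l in zip(human, llm))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_h
    residuals = [l - (intercept + slope * h) for h, l in zip(human, llm)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((l - mean_l) ** 2 for l in llm)
    return {
        "slope": slope,
        "intercept": intercept,
        "r2": 1 - ss_res / ss_tot,
        "residual_sd": sqrt(ss_res / (n - 2)),  # n - 2 df; an assumption
    }


# Hypothetical model that stretches the scale (b = 1.2) and shifts it up (a = 0.5).
human = [1.0, 2.0, 3.0, 4.0, 5.0]
llm = [0.5 + 1.2 * h for h in human]
fit = fit_calibration(human, llm)
```

Because the example data lie exactly on a line, the fit recovers the slope and intercept that generated them and leaves essentially no residual variance.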

Two-way ANOVA on the per-image absolute error |LLM − human| with Category and Model as factors, including their interaction term. Uses Type II sums of squares (statsmodels). Reports F, p, and η² for each factor. Use this tab to check whether model bias is uniform across image categories or concentrated in specific ones.

Statistics reference

All metrics are computed live from the DuckDB store. No pre-aggregated caches are used.

Statistic	Formula / definition
Paired t	Per-image paired t-test across N images for each (model × dimension). The pooled test averages across models first.
Cohen’s d (paired)	`mean(diff) / SD(diff)` where `diff = LLM_image_mean − human_image_mean`.
Lin’s CCC	`2 · cov(x, y) / (var(x) + var(y) + (mean_x − mean_y)²)`. Combines precision (correlation) and accuracy (mean agreement).
KS / Wasserstein	Two-sample KS test and Wasserstein distance on raw LLM trial ratings vs human image means.
ICC(2,1)	Two-way random effects, single rater, absolute agreement (Shrout & Fleiss, 1979). Use when treating the models in scope as a sample from a larger population of possible models.
ICC(3,1)	Two-way mixed effects, single rater, consistency (Shrout & Fleiss, 1979). Use when the specific models in scope are the only ones of interest.
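The two ICC variants differ only in whether rater (column) variance counts against agreement. The sketch below computes both from the two-way ANOVA mean squares, following the Shrout and Fleiss formulas. The matrix layout (rows are images, columns are raters such as models or runs) and the small ratings matrix are assumptions for illustration.

```python
from statistics import fmean


def icc_single(ratings):
    """Return (ICC(2,1), ICC(3,1)) for an n-images x k-raters matrix."""
    n, k = len(ratings), len(ratings[0])
    all_values = [v for row in ratings for v in row]
    grand = fmean(all_values)
    row_means = [fmean(row) for row in ratings]
    col_means = [fmean([row[j] for row in ratings]) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in all_values)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-image mean square
    ms_cols = ss_cols / (k - 1)               # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1)) # residual mean square

    # Absolute agreement: rater variance inflates the denominator.
    icc21 = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    # Consistency: rater variance is ignored.
    icc31 = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    return icc21, icc31


# Illustrative matrix: 6 images rated by 4 raters with very different offsets.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc21, icc31 = icc_single(ratings)
```

In this example the raters rank the images similarly but use different parts of the scale, so the consistency form comes out much higher than the absolute-agreement form.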

If you are comparing Lin’s CCC values across dimensions, note that CCC is sensitive to both correlation and mean-level agreement. A model can have a high Pearson r but a low CCC if it applies a systematic scale shift.
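A short sketch shows how the paired statistics and Lin's CCC follow from the definitions in the table, and how a constant shift leaves Pearson r at 1 while lowering CCC. The data are made up, and the population (1/n) form of the variances and covariance in the CCC is an assumption; the dashboard may use a different denominator.

```python
from math import sqrt
from statistics import fmean, stdev


def paired_effect(llm, human):
    """Cohen's d for paired data and the matching paired t statistic."""
    diff = [l - h for l, h in zip(llm, human)]
    d = fmean(diff) / stdev(diff)
    t = d * sqrt(len(diff))  # t = mean(diff) / (SD(diff) / sqrt(n))
    return d, t


def _moments(x, y):
    """Means plus population (1/n) variances and covariance; an assumption."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    var_x = sum((v - mx) ** 2 for v in x) / n
    var_y = sum((v - my) ** 2 for v in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, var_x, var_y, cov


def pearson_r(x, y):
    _, _, var_x, var_y, cov = _moments(x, y)
    return cov / sqrt(var_x * var_y)


def lins_ccc(x, y):
    mx, my, var_x, var_y, cov = _moments(x, y)
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


human = [2.0, 3.0, 4.0, 5.0, 6.0]

# Hypothetical model with a small, variable upward bias.
llm = [2.5, 3.0, 5.0, 5.5, 6.5]
d, t = paired_effect(llm, human)

# Hypothetical model that tracks humans perfectly but sits 1.5 points higher.
shifted = [h + 1.5 for h in human]
r = pearson_r(shifted, human)
ccc = lins_ccc(shifted, human)
```

For the shifted model, r is 1 while CCC drops to 0.64, because the squared mean difference enters the CCC denominator but not the correlation.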

How to access

Local (live)
Hosted

Start the dashboard from your terminal:

oasis-llm dashboard

Then select Analysis in the sidebar. All statistics are computed on demand from your local database.
