The Analysis page is where you compare ratings from completed runs against the OASIS human norms from Kurdi et al. (2017). You select a scope (a set of runs and models), and the page computes eight statistical views live from your local DuckDB store. Every tab shares the same filtered data; only the lens changes.

Access modes

The Analysis page operates in two modes.

| Mode | How you scope the data |
| --- | --- |
| Ad-hoc | Use the sidebar filters to pick one or more image sets and models directly. Best for quick, exploratory comparisons. |
| Curated | Select a saved Analysis bundle. The bundle pins a specific list of run_id values so your scope is reproducible and shareable. |

Both modes feed into the same analytics body below the filter bar. Switch between them using the mode selector at the top of the sidebar.
An Analysis bundle requires that all pinned runs were executed against the same dataset_id. If you try to add a run from a different image set, OASIS-LLM will reject it with a validation error.

Pipeline overview

The following diagram shows the full path from raw images to the comparison statistics you see on screen. The aggregation step collapses all trials for a given (run_id, image_id, dimension) into a single mean before any comparison is made. Human norms are the published Valence_mean and Arousal_mean columns from OASIS.csv.
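
The aggregation step described above can be sketched in a few lines. This is an illustrative example only, not the OASIS-LLM implementation: the row layout, the `aggregate_trials` helper, and the sample ratings are all assumptions made for the sketch. In the real store the same collapse would typically be expressed as an average grouped by (run_id, image_id, dimension).

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical trial rows: (run_id, image_id, dimension, rating).
# The values are made up for illustration.
trials = [
    ("run-a", "I1", "valence", 5.0),
    ("run-a", "I1", "valence", 6.0),
    ("run-a", "I1", "arousal", 3.0),
    ("run-a", "I2", "valence", 2.0),
    ("run-a", "I2", "valence", 4.0),
    ("run-a", "I2", "valence", 3.0),
]


def aggregate_trials(rows):
    """Collapse repeated trials into one mean per (run_id, image_id, dimension)."""
    groups = defaultdict(list)
    for run_id, image_id, dimension, rating in rows:
        groups[(run_id, image_id, dimension)].append(rating)
    # One mean per key; comparisons with human norms use these means.
    return {key: fmean(values) for key, values in groups.items()}


image_means = aggregate_trials(trials)
for key, mean_rating in sorted(image_means.items()):
    print(key, mean_rating)
```

Each resulting mean is then paired with the matching human norm (Valence_mean or Arousal_mean) for the same image.
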

Filter controls

Use the filter bar to narrow the data before any tab renders.

| Control | What it does |
| --- | --- |
| Model | Multi-select. Choose one or more models. “All” pools every model in scope into a single LLM mean. |
| Category | Filter to Animal, Scene, Person, or Object. Categories are colour-coded throughout the UI. |
| Dimension | Toggle between valence and arousal, or view both. |
| Image | Substring search on image_id. Useful for drilling into a specific stimulus. |
| Aggregation scope | Choose one of: Pooled all-LLMs · By model · By category · Model × Category. Controls how the comparison statistics are grouped. |


The eight analysis tabs

Shows N, mean, SD, median, and range for humans, each model individually, and the pooled-LLM mean. Use this tab first to check whether your LLM means sit inside the human range before interpreting any inferential statistics.
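
As a rough illustration of the check described above, the following sketch computes the same summary columns for two lists of per-image means and tests whether the LLM mean falls between the human minimum and maximum. The `describe` helper and the numbers are hypothetical, and the use of the sample SD (n − 1 denominator) is an assumption; the page may define SD or "inside the human range" differently.

```python
from statistics import fmean, median, stdev


def describe(values):
    """Return N, mean, sample SD, median, and (min, max) for a list of ratings."""
    return {
        "n": len(values),
        "mean": fmean(values),
        "sd": stdev(values),  # sample SD (n - 1); an assumption for this sketch
        "median": median(values),
        "range": (min(values), max(values)),
    }


# Made-up per-image means on the same set of images.
human_means = [2.0, 3.5, 4.0, 5.5, 6.0]
llm_means = [2.5, 3.0, 4.5, 5.0, 6.5]

human = describe(human_means)
llm = describe(llm_means)

# Sanity check before any inferential statistics:
# does the LLM mean sit between the human minimum and maximum?
low, high = human["range"]
llm_mean_in_human_range = low <= llm["mean"] <= high
print(human, llm, llm_mean_in_human_range)
```
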

Statistics reference

All metrics are computed live from the DuckDB store. No pre-aggregated caches are used.

| Statistic | Formula / definition |
| --- | --- |
| Paired t | Per-image paired t-test on N images per (model × dimension). The pooled t collapses across models first. |
| Cohen’s d (paired) | mean(diff) / SD(diff), where diff = LLM_image_mean − human_image_mean. |
| Lin’s CCC | 2 · cov(x, y) / (var(x) + var(y) + (mean_x − mean_y)²). Combines precision (correlation) and accuracy (mean agreement). |
| KS / Wasserstein | Two-sample KS test and Wasserstein distance on raw LLM trial ratings vs human image means. |
| ICC(2,1) | Two-way random effects, single rater, absolute agreement (Shrout & Fleiss, 1979). Use when treating the runs as a sample from a population of possible models. |
| ICC(3,1) | Two-way mixed effects, single rater, consistency (Shrout & Fleiss, 1979). Use when the specific models in scope are the only ones of interest. |

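
The paired t and Cohen’s d definitions can be worked through on a toy example. The sketch below is an assumption-laden illustration rather than the page’s own code: `paired_stats` is a hypothetical helper, the data are invented, and SD(diff) is taken to be the sample SD. It reports the t statistic and degrees of freedom only; a p-value needs the t distribution, which a library such as SciPy (`scipy.stats.ttest_rel`) provides.

```python
from math import sqrt
from statistics import fmean, stdev


def paired_stats(llm_means, human_means):
    """Paired t statistic and paired Cohen's d from per-image means."""
    if len(llm_means) != len(human_means):
        raise ValueError("both lists must cover the same images, in the same order")
    # diff = LLM_image_mean - human_image_mean, one value per image
    diffs = [llm - human for llm, human in zip(llm_means, human_means)]
    n = len(diffs)
    mean_diff = fmean(diffs)
    sd_diff = stdev(diffs)  # sample SD of the differences (assumption)
    d = mean_diff / sd_diff  # Cohen's d (paired)
    t = mean_diff / (sd_diff / sqrt(n))  # paired t with n - 1 degrees of freedom
    return {"n": n, "mean_diff": mean_diff, "d": d, "t": t, "df": n - 1}


# Made-up per-image means for four images.
llm = [5.0, 6.0, 4.0, 7.0]
human = [4.0, 4.0, 4.0, 5.0]

stats = paired_stats(llm, human)
print(stats)
```

Note that t = d · √N for the paired case, so a small but consistent per-image offset can be highly significant on a large image set while still being a modest effect size.
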
If you are comparing Lin’s CCC values across dimensions, note that CCC is sensitive to both correlation and mean-level agreement. A model can have a high Pearson r but a low CCC if it applies a systematic scale shift.
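
A small numeric example makes the scale-shift effect concrete. The sketch below applies the CCC formula from the table to invented data in which the LLM means equal the human means plus a constant offset of 2. The helper names are hypothetical, and the use of population (1/N) moments is an assumption that follows Lin’s original definition; the page may use a different denominator.

```python
from math import sqrt
from statistics import fmean


def _moments(x, y):
    """Means, population variances, and population covariance of two lists."""
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return mean_x, mean_y, var_x, var_y, cov


def pearson_r(x, y):
    _, _, var_x, var_y, cov = _moments(x, y)
    return cov / sqrt(var_x * var_y)


def lins_ccc(x, y):
    mean_x, mean_y, var_x, var_y, cov = _moments(x, y)
    # 2 * cov(x, y) / (var(x) + var(y) + (mean_x - mean_y)^2)
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


human = [2.0, 3.0, 4.0, 5.0, 6.0]
llm_shifted = [h + 2.0 for h in human]  # perfect ordering, constant +2 offset

r = pearson_r(llm_shifted, human)
ccc = lins_ccc(llm_shifted, human)
print(round(r, 3), round(ccc, 3))  # 1.0 0.5
```

Here the correlation is perfect, but the squared mean difference in the denominator halves the CCC.
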

How to access

Start the dashboard from your terminal:
oasis-llm dashboard
Then select Analysis in the sidebar. All statistics are computed on demand from your local database.