The Analysis page is where you compare completed run ratings to the OASIS human norms from Kurdi et al. (2017). You select a scope (a set of runs and models), and the page computes eight statistical views live from your local DuckDB store. Every tab shares the same filtered data; only the lens changes.
Access modes
The Analysis page operates in two modes.

| Mode | How you scope the data |
|---|---|
| Ad-hoc | Use the sidebar filters to pick one or more image sets and models directly. Best for quick, exploratory comparisons. |
| Curated | Select a saved Analysis bundle. The bundle pins a specific list of run_id values so your scope is reproducible and shareable. |

An Analysis bundle requires that all pinned runs were executed against the same dataset_id. If you try to add a run from a different image set, OASIS-LLM will reject it with a validation error.
Pipeline overview
The following diagram shows the full path from raw images to the comparison statistics you see on screen. The aggregation step collapses all trials for a given (run_id, image_id, dimension) into a single mean before any comparison is made. Human norms are the published Valence_mean and Arousal_mean columns from OASIS.csv.
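
The aggregation step described above can be sketched in plain Python. This is an illustration only, not the OASIS-LLM implementation: the trial rows, the image IDs, the norm values, and the helper name `aggregate_trials` are all hypothetical, and the real pipeline reads from DuckDB rather than in-memory lists.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial rows: (run_id, image_id, dimension, rating).
trials = [
    ("run-a", "img-1", "valence", 5.0),
    ("run-a", "img-1", "valence", 6.0),
    ("run-a", "img-1", "arousal", 3.0),
    ("run-a", "img-2", "valence", 2.0),
    ("run-a", "img-2", "valence", 4.0),
    ("run-a", "img-2", "valence", 3.0),
]

# Hypothetical human norms keyed by image_id, using the OASIS.csv column names.
human_norms = {
    "img-1": {"Valence_mean": 5.0, "Arousal_mean": 3.5},
    "img-2": {"Valence_mean": 3.5, "Arousal_mean": 2.0},
}


def aggregate_trials(rows):
    """Collapse repeated trials into one mean per (run_id, image_id, dimension)."""
    buckets = defaultdict(list)
    for run_id, image_id, dimension, rating in rows:
        buckets[(run_id, image_id, dimension)].append(rating)
    return {key: mean(values) for key, values in buckets.items()}


image_means = aggregate_trials(trials)

# Pair each aggregated valence mean with its human norm: diff = LLM - human.
valence_diffs = {
    image_id: llm_mean - human_norms[image_id]["Valence_mean"]
    for (run_id, image_id, dimension), llm_mean in image_means.items()
    if dimension == "valence"
}
```

Aggregating before comparing means that every image contributes exactly one LLM value per run and dimension, no matter how many trials were collected for it.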
Filter controls
Use the filter bar to narrow the data before any tab renders.

| Control | What it does |
|---|---|
| Model | Multi-select. Choose one or more models. “All” pools every model in scope into a single LLM mean. |
| Category | Filter to Animal, Scene, Person, or Object. Categories are colour-coded throughout the UI. |
| Dimension | Toggle between valence and arousal, or view both. |
| Image | Substring search on image_id. Useful for drilling into a specific stimulus. |
| Aggregation scope | Choose one of: Pooled all-LLMs · By model · By category · Model × Category. Controls how the comparison statistics are grouped. |
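
As a rough sketch of how the filters and the aggregation scope might combine, the snippet below filters per-image rows and then groups them by the keys each scope implies. The row layout, the `SCOPE_KEYS` mapping, and both function names are assumptions made for illustration. The case-sensitive substring match and the row-level pooling are also simplifications; the dashboard may behave differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-image LLM means (after trial aggregation).
rows = [
    {"model": "model-a", "category": "Animal", "image_id": "Dog 1", "dimension": "valence", "llm_mean": 6.0},
    {"model": "model-a", "category": "Scene", "image_id": "Lake 2", "dimension": "valence", "llm_mean": 5.0},
    {"model": "model-b", "category": "Animal", "image_id": "Dog 1", "dimension": "valence", "llm_mean": 4.0},
    {"model": "model-b", "category": "Scene", "image_id": "Lake 2", "dimension": "valence", "llm_mean": 7.0},
    {"model": "model-a", "category": "Animal", "image_id": "Dog 1", "dimension": "arousal", "llm_mean": 3.0},
]

# Assumed mapping from each aggregation scope to its grouping columns.
SCOPE_KEYS = {
    "Pooled all-LLMs": (),
    "By model": ("model",),
    "By category": ("category",),
    "Model × Category": ("model", "category"),
}


def apply_filters(rows, models=None, category=None, dimension=None, image=None):
    """Keep rows matching every filter that is set; None means 'no filter'."""
    kept = []
    for row in rows:
        if models and row["model"] not in models:
            continue
        if category and row["category"] != category:
            continue
        if dimension and row["dimension"] != dimension:
            continue
        if image and image not in row["image_id"]:  # substring search
            continue
        kept.append(row)
    return kept


def group_by_scope(rows, scope):
    """Average llm_mean within each group defined by the chosen scope."""
    keys = SCOPE_KEYS[scope]
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["llm_mean"])
    return {group: mean(values) for group, values in groups.items()}


valence = apply_filters(rows, dimension="valence")
by_category = group_by_scope(valence, "By category")
```

The empty key tuple for "Pooled all-LLMs" puts every row in a single group, which is one simple way to read "pooled".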
The eight analysis tabs
- Descriptives
- t-tests
- Regression
- Scatter
- Distribution
- Outliers
- Inter-LLM agreement
- Cat × Model ANOVA
The Descriptives tab shows N, mean, SD, median, and range for humans, each model individually, and the pooled-LLM mean. Use this tab first to check whether your LLM means sit inside the human range before interpreting any inferential statistics.
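
A minimal sketch of the descriptive summary and the range check, using only the Python standard library. The rating values and the helper names `describe` and `within_human_range` are hypothetical, and the sample SD (n − 1 denominator) is an assumption; the tab may use a different convention.

```python
from statistics import mean, median, stdev


def describe(values):
    """Return N, mean, sample SD, median, and range for a list of image means."""
    return {
        "n": len(values),
        "mean": mean(values),
        "sd": stdev(values),  # sample SD (n - 1); an assumption of this sketch
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def within_human_range(llm_values, human_values):
    """True if the LLM mean falls between the human minimum and maximum."""
    return min(human_values) <= mean(llm_values) <= max(human_values)


# Hypothetical per-image means on the same rating scale.
human = [2.0, 4.0, 6.0]
llm = [3.0, 4.0, 5.0]

human_summary = describe(human)
llm_summary = describe(llm)
llm_in_range = within_human_range(llm, human)
```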
Statistics reference
All metrics are computed live from the DuckDB store. No pre-aggregated caches are used.

| Statistic | Formula / definition |
|---|---|
| Paired t | Per-image paired t on N images per (model × dimension). Pooled t collapses across models first. |
| Cohen’s d (paired) | mean(diff) / SD(diff) where diff = LLM_image_mean − human_image_mean. |
| Lin’s CCC | 2 · cov(x, y) / (var(x) + var(y) + (mean_x − mean_y)²). Combines precision (correlation) and accuracy (mean agreement). |
| KS / Wasserstein | Two-sample KS test and Wasserstein distance on raw LLM trial ratings vs human image means. |
| ICC(2,1) | Two-way random effects, single rater, absolute agreement (Shrout & Fleiss, 1979). Use when treating the runs as a sample from a population of possible models. |
| ICC(3,1) | Two-way mixed effects, single rater, consistency (Shrout & Fleiss, 1979). Use when the specific models in scope are the only ones of interest. |
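
The formulas in the table can be checked by hand with a short standard-library sketch. It is not the dashboard's code: the function names are hypothetical, the CCC uses population (1/n) moments as in Lin's original definition (an assumption about what the page does), and the ICCs follow the Shrout & Fleiss mean-square formulas for an images × models matrix with no missing cells.

```python
from math import sqrt
from statistics import mean, stdev


def paired_stats(llm, human):
    """Cohen's d (paired) and paired t from per-image means."""
    diffs = [l - h for l, h in zip(llm, human)]
    d = mean(diffs) / stdev(diffs)
    t = d * sqrt(len(diffs))  # t = mean(diff) / (SD(diff) / sqrt(N))
    return d, t


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient with 1/n moments."""
    n = len(x)
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


def icc_single(matrix):
    """Return (ICC(2,1), ICC(3,1)); matrix[i][j] is image i rated by model j."""
    n, k = len(matrix), len(matrix[0])
    grand = mean(v for row in matrix for v in row)
    ss_rows = k * sum((mean(row) - grand) ** 2 for row in matrix)
    ss_cols = n * sum((mean(col) - grand) ** 2 for col in zip(*matrix))
    ss_total = sum((v - grand) ** 2 for row in matrix for v in row)
    msr = ss_rows / (n - 1)  # between-image mean square
    msc = ss_cols / (k - 1)  # between-model mean square
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual
    icc21 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc31 = (msr - mse) / (msr + (k - 1) * mse)
    return icc21, icc31


# Hypothetical data: two models that rank images identically, one unit apart.
d, t = paired_stats([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 5.0, 8.0])
ccc_offset = lins_ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
icc21, icc31 = icc_single([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
```

In this toy example the two raters are perfectly consistent but offset by a constant, so ICC(3,1) stays at 1 while ICC(2,1) and the CCC drop. That is the practical difference between consistency and absolute agreement.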
How to access
- Local (live)
- Hosted
Start the dashboard from your terminal, then select Analysis in the sidebar. All statistics are computed on demand from your local database.