OASIS-LLM replays the affective rating procedure from Kurdi, Lozano, & Banaji (2017) against a vision-language model, holding the original instructions constant and substituting the model for a human rater. This page explains what that procedure is, how the prompts are constructed, how each trial is identified, and where the LLM replication deliberately departs from the human study.
The original OASIS procedure
Kurdi et al. (2017) normed 900 open-access color images on two affective dimensions:

- Valence: the level of positivity or negativity intrinsic to the image.
- Arousal: the level of intensity or calmness intrinsic to the image, treated as orthogonal to valence.
System prompts
OASIS-LLM uses the paper-verbatim instruction wording from the main study (the image-centered variant). The arousal prompt explicitly instructs the model to rate arousal independently of valence, matching the paper’s key methodological requirement.
Trial structure
Each trial is uniquely identified by a four-part primary key:

- `run_id`: the `name` field from your YAML config.
- `image_id`: the OASIS filename stem (e.g. `Alarm clock 1`, `Beach 1`).
- `dimension`: `"valence"` or `"arousal"`.
- `sample_idx`: `0..samples_per_image-1`. With `cache_buster: true` (the default), each index produces a unique per-trial salt appended to the user prompt to force decoding-path divergence even at temperature=0.
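The four-part key can be sketched in Python as follows. This is an illustration only: the `TrialKey` class, the `enumerate_trials` helper, and the salt format in `cache_buster_salt` are hypothetical names and choices, not the harness's actual implementation. The only facts taken from this page are the four key fields and the idea that each trial gets its own salt.

```python
from dataclasses import dataclass
from itertools import product
import hashlib


@dataclass(frozen=True)
class TrialKey:
    """Hypothetical container for the four-part primary key."""
    run_id: str
    image_id: str
    dimension: str  # "valence" or "arousal"
    sample_idx: int


def enumerate_trials(run_id, image_ids, samples_per_image):
    """Yield one key per (image, dimension, sample index) combination."""
    dimensions = ("valence", "arousal")
    for image_id, dimension, idx in product(image_ids, dimensions, range(samples_per_image)):
        yield TrialKey(run_id, image_id, dimension, idx)


def cache_buster_salt(key: TrialKey) -> str:
    """Assumed salt format: a short digest of the full key (the real format may differ)."""
    raw = f"{key.run_id}|{key.image_id}|{key.dimension}|{key.sample_idx}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


trials = list(enumerate_trials("demo-run", ["Alarm clock 1", "Beach 1"], samples_per_image=5))
salts = {cache_buster_salt(t) for t in trials}
print(len(trials), len(salts))
```

With two images, two dimensions, and five samples, the sketch yields 20 distinct keys, which is the same counting that determines the size of a full run.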
How the LLM replication differs from the human protocol
Within-subject instead of between-subject. A single model run rates both valence and arousal for every image. The paper kept these strictly between-subject so that one rating could not anchor the other. OASIS-LLM collapses them to one run because the per-trial cost is dominated by the image upload, and because you want both columns out of every model with one config.

Smaller N per cell. The default is `samples_per_image: 5`, giving 5 ratings per (image, dimension) cell instead of the paper’s ~100. This is a deliberate cost/reliability trade-off.
Reliability at reduced sample sizes
Using the Spearman-Brown prophecy formula, if the paper-level inter-rater reliability is ρ ≈ 0.984 for valence at k = 102 raters, the expected reliability at k = 5 raters is:

$$
\rho_1 = \frac{\rho_{102}}{102 - 101\,\rho_{102}} \approx 0.376, \qquad \rho_5 = \frac{5\,\rho_1}{1 + 4\,\rho_1} \approx 0.75
$$

That is the ceiling of what 5 model samples can achieve as a cell-mean estimate, assuming the model ratings have the same true between-image variance and noise structure as humans. Treat this number as a budgeting tool, not a guarantee.

The table below shows how reliability scales with sample count under the Spearman-Brown formula:

| `samples_per_image` | Estimated ρ (valence) |
|---|---|
| 1 | ~0.38 |
| 5 | ~0.75 |
| 20 | ~0.92 |
| 100 | ~0.98 |
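
The scaling can be recomputed with a few lines of Python. This is a minimal sketch that assumes the inputs quoted above (ρ ≈ 0.984 at k = 102) and the standard Spearman-Brown relation; the function names are illustrative. Because reliabilities close to 1 are very sensitive to rounding, small changes to the inputs shift the low-k estimates noticeably, so treat the output as approximate.

```python
def single_rater_reliability(rho_k: float, k: int) -> float:
    """Invert Spearman-Brown: reliability of one rater given the k-rater value."""
    return rho_k / (k - (k - 1) * rho_k)


def spearman_brown(rho_1: float, n: int) -> float:
    """Predicted reliability of the mean of n raters."""
    return n * rho_1 / (1 + (n - 1) * rho_1)


# Inputs quoted on this page for valence.
rho_1 = single_rater_reliability(0.984, 102)

for n in (1, 5, 20, 100):
    print(f"samples_per_image={n:>3}  estimated rho ~ {spearman_brown(rho_1, n):.2f}")
```

Swapping in a different paper-level ρ or rater count gives the corresponding budget curve for another dimension or dataset.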
Tuning samples_per_image
Because `samples_per_image` is excluded from the run’s `canonical_hash`, you can resume a run later with a higher value and the harness will only enqueue the new sample indices. No data is lost and no new run name is required.
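
The resume behavior can be illustrated with a short sketch. Everything here is an assumption made for illustration: the hashing scheme (SHA-256 over sorted JSON), the `model` config field, and the `new_sample_indices` helper are hypothetical and may not match how the harness computes `canonical_hash` or schedules work. The sketch only shows why leaving `samples_per_image` out of the hash lets a run be extended in place.

```python
import hashlib
import json


def canonical_hash(config: dict) -> str:
    """Hash the config with samples_per_image left out (assumed scheme)."""
    hashed_fields = {k: v for k, v in config.items() if k != "samples_per_image"}
    payload = json.dumps(hashed_fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def new_sample_indices(done: set[int], samples_per_image: int) -> list[int]:
    """Return the sample indices that still need to be enqueued for one cell."""
    return [i for i in range(samples_per_image) if i not in done]


# Hypothetical config before and after raising the sample count.
before = {"name": "demo-run", "model": "example-model", "samples_per_image": 5}
after = {**before, "samples_per_image": 20}

same_run = canonical_hash(before) == canonical_hash(after)
todo = new_sample_indices(done={0, 1, 2, 3, 4}, samples_per_image=20)
print(same_run, todo)
```

In this sketch the two configs hash identically, and only indices 5 through 19 are scheduled, so the five existing ratings per cell are reused.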