OASIS-LLM is a replication of the Kurdi, Lozano, & Banaji (2017) rating protocol using language models instead of human participants. It is not a perfect replication. This page documents every known way the two diverge, what drives each divergence, and what it costs you when interpreting your results. The goal is a complete, honest list — if you find a divergence not listed here, treat it as a bug and report it.

1. Same-model rater pool vs independent human raters

| | Paper | OASIS-LLM |
|---|---|---|
| Who rates valence vs arousal | Different participants (between-subject) | Same model weights, separate stateless calls |
| Within-call context | One dimension per session | One dimension per call (no cross-dimension context) |
| Cross-trial state | Per-participant carryover within their list | None; every trial is a fresh API call |

Each trial in OASIS-LLM is a single stateless API call carrying exactly one dimension. The model never sees a valence and an arousal question in the same conversation, so there is no anchoring path between dimensions within a session, because there is no session.

The remaining divergence is at the population level. The paper drew valence raters and arousal raters from disjoint pools of humans. OASIS-LLM draws both dimensions from the same model (same weights, same provider, same configuration). If that model encodes valence and arousal as a shared internal affect representation, ratings of the two dimensions on the same image will share systematic bias even though no individual call sees both questions.

What this costs you: correlations between LLM valence and LLM arousal cannot be cleanly attributed to two raters agreeing about the world; they may also reflect one rater computing both from the same internal feature. Compare model–human Spearman ρ separately per dimension to keep the interpretation clean.

For stricter parity with the paper’s between-subject design, run different models for valence and arousal — for example, one provider for valence and a different provider for arousal — and treat them as independent rater populations.

2. N = 5 per cell vs ~100 per cell

| | Paper | OASIS-LLM (default) |
|---|---|---|
| Ratings per (image, dimension) | ~100 | 5 (samples_per_image: 5) |

By the Spearman-Brown prophecy formula, going from k = 100 to k = 5 raters drops the cell-mean reliability ceiling roughly from 0.98 to 0.76 for valence and from 0.93 to ~0.40 for arousal. That is the ceiling assuming the model has the same noise structure as humans, which is a strong assumption.

What this costs you: individual cell means are 5-sample point estimates, not stable norms. Treat them accordingly and report the per-cell SD alongside the mean. When you need tighter estimates, increase samples_per_image in your run configuration. Note that samples_per_image is excluded from the canonical run hash, so bumping it creates a new run without invalidating existing cached trials.

Increasing samples_per_image substantially raises API cost and latency. At temperature = 0 with a cache buster, additional samples add noise without sampling new model states — see divergence 6 for context.
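
The Spearman-Brown projection used in this section can be reproduced with a short calculation. The sketch below first inverts the formula to recover an implied single-rater reliability from the k = 100 figure, then projects it to k = 5. The function names are illustrative, and the inputs are the rounded reliabilities quoted in this section rather than values taken from the paper's data.

```python
def single_rater_reliability(r_k, k):
    """Invert Spearman-Brown: reliability of one rater, given the
    reliability r_k of the mean of k raters."""
    return r_k / (k - (k - 1) * r_k)

def spearman_brown(r_1, k):
    """Reliability of the mean of k raters, given single-rater reliability r_1."""
    return k * r_1 / (1 + (k - 1) * r_1)

def project(r_k, k_from, k_to):
    """Project a k_from-rater reliability to a k_to-rater reliability."""
    return spearman_brown(single_rater_reliability(r_k, k_from), k_to)

for label, r_100 in [("valence", 0.98), ("arousal", 0.93)]:
    print(f"{label}: {project(r_100, 100, 5):.2f}")
# prints:
# valence: 0.71
# arousal: 0.40
```

The valence projection is sensitive to rounding in the k = 100 input: with 0.985 instead of 0.98 the same calculation gives about 0.77, so treat the quoted ceilings as approximate.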

3. No list randomization, no order randomization

| | Paper | OASIS-LLM |
|---|---|---|
| List structure | 4 lists × 225 images each | None; every model sees every image |
| Order | Per-participant random | Fixed scheduling order (attempts, sample_idx, image_id, dimension) |

Each model conversation is single-shot per trial: a fresh API call for each (image, dimension, sample_idx) triple. There is no session carrying state across images, so order effects within the model are essentially absent. The trade-off is that you also lose the habituation and calibration effects humans accumulate from earlier trials in their list.

What this costs you: you cannot study order effects, list effects, or rating fatigue. You also cannot replicate variance that arises from “rater has already seen N other images”, the kind of anchoring that may suppress or exaggerate extreme ratings in humans.
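
One way to picture the fixed scheduling order is as a deterministic sort over the full set of trials. The sketch below is an illustration only: it assumes the tuple (attempts, sample_idx, image_id, dimension) acts as a lexicographic sort key, and the `schedule` function and its trial dictionaries are hypothetical rather than the harness's actual data structures.

```python
from itertools import product

def schedule(image_ids, dimensions, samples_per_image, attempts=0):
    """Enumerate every (image, dimension, sample_idx) trial in a fixed order.

    Assumption: trials are ordered lexicographically by
    (attempts, sample_idx, image_id, dimension). No randomization is applied,
    so the same inputs always yield the same sequence.
    """
    trials = [
        {"attempts": attempts, "sample_idx": s, "image_id": img, "dimension": dim}
        for s, img, dim in product(range(samples_per_image), image_ids, dimensions)
    ]
    trials.sort(
        key=lambda t: (t["attempts"], t["sample_idx"], t["image_id"], t["dimension"])
    )
    return trials
```

Because every trial is an independent call, this ordering determines when each request is issued but should not influence what the model returns.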

4. Rating + reasoning vs rating only

| | Paper | OASIS-LLM |
|---|---|---|
| Output | Single integer (1–7) | Integer + one-sentence reasoning (capture_reasoning: true) |

Asking the model to provide reasoning is a deliberate change to the task. It gives you text you can audit, pushes the model toward verbalizing the target dimension, and plausibly changes the rating itself by encouraging chain-of-thought tokens before the final answer.

What this costs you: you cannot fairly compare capture_reasoning: true runs to the human single-integer protocol on equal footing. The reasoning step may inflate inter-item consistency (the model “talks itself into” a rating) or introduce dimension-inappropriate reasoning.

For strict parity with the human protocol, set capture_reasoning: false. The harness then enforces a strict-schema {"rating": int} response, with no chain-of-thought path.

See the Reasoning capture page for the full prompt diff between reasoning-on and reasoning-off modes.
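
To make the strict-schema idea concrete, here is a minimal sketch of what validating a {"rating": int} response could look like on the client side. It is not the harness's implementation: `parse_strict_rating` is a hypothetical helper, and the 1 to 7 bounds come from the rating scale described on this page.

```python
import json

def parse_strict_rating(raw, lo=1, hi=7):
    """Parse a response that must be exactly {"rating": <int in [lo, hi]>}.

    Raises ValueError for malformed JSON (json.JSONDecodeError is a
    ValueError subclass), extra or missing keys, non-integer ratings,
    or out-of-range values.
    """
    obj = json.loads(raw)
    if not isinstance(obj, dict) or set(obj) != {"rating"}:
        raise ValueError("expected a JSON object with exactly one key: 'rating'")
    rating = obj["rating"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int):
        raise ValueError("rating must be an integer")
    if not lo <= rating <= hi:
        raise ValueError(f"rating must be between {lo} and {hi}")
    return rating
```

For example, `parse_strict_rating('{"rating": 4}')` returns `4`, while a response that also carries a reasoning field is rejected, which is the behavior you would want when comparing against the human single-integer protocol.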

5. Single-pass image, no fixation, no timing

The OASIS human protocol presents each image with a fixation cross and enforces per-trial timing, giving participants a defined window to view the stimulus before responding. OASIS-LLM sends the image once as a base64 data URL and lets the model attend to it however it does internally. There is no analogue of “looking at the image for N seconds before responding.” What this costs you: any effects driven by viewing time, eye-movement patterns, or attention allocation are outside the scope of OASIS-LLM. If your research question involves how viewing duration affects affect ratings, this tool cannot address it.
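
For readers unfamiliar with the transport format, the following sketch shows how image bytes are typically turned into a base64 data URL. The helper name is an assumption made for illustration, and the way the harness actually builds its request payload may differ.

```python
import base64
import mimetypes

def image_bytes_to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL of the form
    data:<mime>;base64,<payload>. The MIME type is guessed from the
    file extension only; the bytes themselves are not inspected."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The whole image travels in a single request, which is why there is no notion of exposure duration: the model receives the complete stimulus at once.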

6. Provider determinism vs human variance

| | Paper | OASIS-LLM |
|---|---|---|
| Variance source | Real between-rater human variance | Sampling noise (temperature > 0) or cache-buster noise (temperature = 0) |

The paper’s variance is genuine between-rater variance: different people with different emotional responses. OASIS-LLM mixes two distinct noise sources:
  • Sampling-noise variance (when temperature > 0): real stochasticity in the token decode. This is the closest analogue to between-rater variance.
  • Cache-buster variance (when temperature = 0, the default for many providers): forced by appending a unique salt to each user prompt to defeat provider-side caching. This is not the same noise process as humans. At greedy decoding the rating space is too coarse — 7 integer steps — to produce human-scale spread, and the distribution of cache-buster-driven ratings is not calibrated to anything.
What this costs you: standard deviations from temperature = 0 + cache_buster: true runs are a floor on model uncertainty, not a calibrated estimate of it. Do not compare these SDs directly to human within-condition SDs.
To get between-sample variance that is more comparable in scale to human between-rater spread, set a non-zero temperature (e.g. temperature: 0.7) and increase samples_per_image. Be aware that different providers have different effective temperature scales — a temperature of 1.0 on one provider is not the same as 1.0 on another.
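
The two ideas in this section, salting a prompt to defeat caching and summarizing a cell by its mean and SD, can be sketched as follows. Both helpers are hypothetical: the salt format shown here is an assumption, not the format the harness uses, and the example only illustrates the mechanics.

```python
import uuid
from statistics import mean, stdev

def with_cache_buster(prompt: str) -> str:
    """Append a unique salt so that otherwise identical prompts differ.

    Assumption: the salt is a random UUID on its own trailing line.
    """
    return f"{prompt}\n\n[salt: {uuid.uuid4().hex}]"

def cell_summary(ratings):
    """Summarize one (image, dimension) cell.

    Returns the sample count, the mean, and the sample SD (n - 1
    denominator). The SD is None when fewer than two ratings exist.
    """
    n = len(ratings)
    return {
        "n": n,
        "mean": mean(ratings),
        "sd": stdev(ratings) if n > 1 else None,
    }
```

An SD produced this way from temperature = 0 runs reflects sensitivity to the salt rather than sampling stochasticity, so it should be reported with the run's temperature and cache-buster settings alongside it.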

Summary

| Divergence | Severity for cell-mean estimates | Severity for variance estimates |
|---|---|---|
| Same-model rater pool | Low | Low (calls are independent) |
| N = 5 vs N = 100 | High | High |
| No list/order randomization | Low | Medium |
| Rating + reasoning | Medium | Medium |
| No fixation/timing | Low | Low |
| Provider determinism | Low | High |