OASIS-LLM is a replication of the Kurdi, Lozano, & Banaji (2017) rating protocol using language models instead of human participants. It is not a perfect replication. This page documents every known way the two diverge, what drives each divergence, and what it costs you when interpreting your results. The goal is a complete, honest list — if you find a divergence not listed here, treat it as a bug and report it.
1. Same-model rater pool vs independent human raters
| | Paper | OASIS-LLM |
|---|---|---|
| Who rates valence vs arousal | Different participants (between-subject) | Same model weights — separate stateless calls |
| Within-call context | One dimension per session | One dimension per call (no cross-dimension context) |
| Cross-trial state | Per-participant carryover within their list | None — every trial is a fresh API call |
Each trial in OASIS-LLM is a single stateless API call carrying exactly one dimension. The model never sees a valence and an arousal question in the same conversation, so there is no anchoring path between dimensions within a session — because there is no session.
The remaining divergence is at the population level. The paper drew valence raters and arousal raters from disjoint pools of humans. OASIS-LLM draws both dimensions from the same model (same weights, same provider, same configuration). If that model encodes valence and arousal as a shared internal affect representation, ratings of the two dimensions on the same image will share systematic bias even though no individual call sees both questions.
What this costs you: correlations between LLM valence and LLM arousal cannot be cleanly attributed to two raters agreeing about the world — they may also reflect one rater computing both from the same internal feature. Compare model–human Spearman ρ separately per dimension to keep the interpretation clean.
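The per-dimension comparison suggested above can be sketched as follows. This is a minimal, dependency-free illustration, not the harness's own analysis code: the `human` and `model` dictionaries, their values, and the helper names `average_ranks` and `spearman_rho` are all hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-image cell means, keyed by dimension (illustrative numbers only).
human = {"valence": [1.8, 3.2, 4.1, 5.6, 6.3], "arousal": [4.9, 2.1, 3.0, 3.8, 4.4]}
model = {"valence": [2.0, 3.0, 4.4, 5.2, 6.8], "arousal": [4.0, 2.6, 2.2, 4.2, 5.0]}

# One correlation per dimension; the two dimensions are never pooled.
rho_by_dimension = {dim: spearman_rho(human[dim], model[dim]) for dim in human}
for dim, rho in rho_by_dimension.items():
    print(f"{dim}: rho = {rho:.3f}")
```

Keeping one ρ per dimension means a shared internal affect representation cannot inflate a single pooled agreement figure.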
For stricter parity with the paper’s between-subject design, run different models for valence and arousal — for example, one provider for valence and a different provider for arousal — and treat them as independent rater populations.
2. N = 5 per cell vs ~100 per cell
| | Paper | OASIS-LLM (default) |
|---|---|---|
| Ratings per (image, dimension) | ~100 | 5 (samples_per_image: 5) |
By the Spearman-Brown prophecy formula, going from k = 100 to k = 5 raters drops the cell-mean reliability ceiling roughly from 0.98 to 0.76 for valence and from 0.93 to ~0.40 for arousal. That is the ceiling assuming the model has the same noise structure as humans — a strong assumption.
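The projection can be reproduced with a few lines of arithmetic. The sketch below assumes the standard Spearman-Brown form, with the k = 100 reliabilities taken as the rounded values 0.98 and 0.93; the function names are illustrative.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the number of raters is scaled by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def rescale_reliability(reliability_at_k, k_from, k_to):
    """Project the reliability of a k_from-rater mean onto a k_to-rater mean."""
    return spearman_brown(reliability_at_k, k_to / k_from)


# Rounded k = 100 reliabilities, used here as assumed inputs.
for label, r100 in [("valence", 0.98), ("arousal", 0.93)]:
    r5 = rescale_reliability(r100, k_from=100, k_to=5)
    print(f"{label}: k=100 -> {r100:.2f}, k=5 -> {r5:.2f}")
```

The formula is steep for reliabilities near 1, so the valence projection is sensitive to rounding of the input: 0.98 gives about 0.71, while 0.985 gives about 0.77. The arousal projection is stable at about 0.40.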
What this costs you: individual cell means are 5-sample point estimates, not stable norms. Treat them accordingly and report the per-cell SD alongside the mean. When you need tighter estimates, increase samples_per_image in your run configuration. Note that samples_per_image is excluded from the canonical run hash, so bumping it creates a new run without invalidating existing cached trials.
Increasing samples_per_image substantially raises API cost and latency. At temperature = 0 with a cache buster, additional samples add noise without sampling new model states — see divergence 6 for context.
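Reporting the per-cell SD alongside the mean, as recommended above, might look like the sketch below. The trial record layout `(image_id, dimension, rating)` and the example values are assumptions for illustration, not the harness's actual export format.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical trial records as (image_id, dimension, rating); values are illustrative.
trials = [
    ("img_001", "valence", 5), ("img_001", "valence", 6), ("img_001", "valence", 5),
    ("img_001", "valence", 5), ("img_001", "valence", 6),
    ("img_001", "arousal", 3), ("img_001", "arousal", 3), ("img_001", "arousal", 4),
    ("img_001", "arousal", 2), ("img_001", "arousal", 3),
]


def summarize_cells(trials):
    """Group ratings by (image_id, dimension) and return n, mean, and sample SD per cell."""
    cells = defaultdict(list)
    for image_id, dimension, rating in trials:
        cells[(image_id, dimension)].append(rating)
    summary = {}
    for cell, ratings in cells.items():
        # stdev needs at least two ratings; report None for single-sample cells.
        sd = stdev(ratings) if len(ratings) > 1 else None
        summary[cell] = {"n": len(ratings), "mean": mean(ratings), "sd": sd}
    return summary


cell_summary = summarize_cells(trials)
for (image_id, dimension), stats in sorted(cell_summary.items()):
    print(f"{image_id} {dimension}: n={stats['n']} mean={stats['mean']:.2f} sd={stats['sd']:.2f}")
```

Carrying `n` through with the mean and SD keeps it visible that each cell is a small-sample estimate.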
3. No list randomization, no order randomization
| | Paper | OASIS-LLM |
|---|---|---|
| List structure | 4 lists × 225 images each | None — every model sees every image |
| Order | Per-participant random | Fixed scheduling order (attempts, sample_idx, image_id, dimension) |
Each model conversation is single-shot per trial: a fresh API call for each (image, dimension, sample_idx) triple. There is no session carrying state across images, so order effects within the model are essentially absent. The trade-off is that you also lose the habituation and calibration effects humans accumulate from earlier trials in their list.
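To make the trial structure concrete, here is a minimal sketch that enumerates one trial per (image, dimension, sample_idx) triple and sorts by the fixed key from the table. It is not the harness's scheduler: the identifiers are hypothetical, and treating `attempts` as a retry counter that starts at 0 is an assumption.

```python
from itertools import product


def build_trials(image_ids, dimensions, samples_per_image):
    """Create one independent trial per (image, dimension, sample_idx) triple."""
    trials = [
        # "attempts" is assumed to be a retry counter; every new trial starts at 0.
        {"attempts": 0, "sample_idx": sample_idx, "image_id": image_id, "dimension": dimension}
        for image_id, dimension, sample_idx in product(
            image_ids, dimensions, range(samples_per_image)
        )
    ]
    # Fixed, non-random order: (attempts, sample_idx, image_id, dimension).
    trials.sort(key=lambda t: (t["attempts"], t["sample_idx"], t["image_id"], t["dimension"]))
    return trials


# Hypothetical identifiers for illustration only.
schedule = build_trials(["img_001", "img_002"], ["valence", "arousal"], samples_per_image=2)
for trial in schedule:
    print(trial)
```

Because each trial is an independent call, this ordering determines only when requests are sent, not what the model sees.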
What this costs you: you cannot study order effects, list effects, or rating fatigue. You also cannot replicate variance that arises from “rater has already seen N other images” — the kind of anchoring that may suppress or exaggerate extreme ratings in humans.
4. Rating + reasoning vs rating only
| | Paper | OASIS-LLM |
|---|---|---|
| Output | Single integer (1–7) | Integer + one-sentence reasoning (capture_reasoning: true) |
Asking the model to provide reasoning is a deliberate change to the task. It gives you text you can audit, pushes the model toward verbalizing the target dimension, and plausibly changes the rating itself by encouraging chain-of-thought tokens before the final answer.
What this costs you: you cannot compare capture_reasoning: true runs with the human single-integer protocol on equal footing. The reasoning step may inflate inter-item consistency (the model “talks itself into” a rating) or pull in reasoning about the wrong dimension, such as justifying an arousal rating in terms of pleasantness.
For strict parity with the human protocol, set capture_reasoning: false. The harness then enforces a strict-schema {"rating": int} response, with no chain-of-thought path.
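A client-side check of the strict-schema response could look like the sketch below. It illustrates the kind of validation the {"rating": int} contract implies and is not the harness's own parser: the function name is hypothetical, and the 1–7 bounds come from the rating scale in the table above.

```python
import json


def parse_strict_rating(raw, low=1, high=7):
    """Parse a response that must be exactly {"rating": <int in [low, high]>}."""
    payload = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(payload, dict) or set(payload) != {"rating"}:
        raise ValueError("expected a JSON object with exactly one key: 'rating'")
    rating = payload["rating"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int):
        raise ValueError("'rating' must be an integer")
    if not low <= rating <= high:
        raise ValueError(f"'rating' must be between {low} and {high}")
    return rating


print(parse_strict_rating('{"rating": 4}'))
```

Rejecting extra keys matters here, because a stray reasoning field would reintroduce the free-text path that the strict mode is meant to remove.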
See the Reasoning capture page for the full prompt diff between reasoning-on and reasoning-off modes.
5. Single-pass image, no fixation, no timing
The OASIS human protocol presents each image with a fixation cross and enforces per-trial timing, giving participants a defined window to view the stimulus before responding. OASIS-LLM sends the image once as a base64 data URL and lets the model attend to it however it does internally. There is no analogue of “looking at the image for N seconds before responding.”
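For reference, building a base64 data URL from image bytes takes only the standard library. The sketch below shows the general technique rather than the harness's implementation; the helper name and the placeholder bytes are illustrative.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename):
    """Encode raw image bytes as a base64 data URL, guessing the MIME type from the filename."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes stand in for a real image file read with open(path, "rb").
url = to_data_url(b"\xff\xd8\xff\xe0", "example.jpg")
print(url)
```

The whole image travels in a single request payload, so nothing in the call corresponds to a presentation window or viewing duration.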
What this costs you: any effects driven by viewing time, eye-movement patterns, or attention allocation are outside the scope of OASIS-LLM. If your research question involves how viewing duration affects affect ratings, this tool cannot address it.
6. Provider determinism vs human variance
| | Paper | OASIS-LLM |
|---|---|---|
| Variance source | Real between-rater human variance | Sampling noise (temperature > 0) or cache-buster noise (temperature = 0) |
The paper’s variance is genuine between-rater variance — different people with different emotional responses. OASIS-LLM mixes two distinct noise sources:
- Sampling-noise variance (when temperature > 0): real stochasticity in the token decode. This is the closest analogue to between-rater variance.
- Cache-buster variance (when temperature = 0, the default for many providers): forced by appending a unique salt to each user prompt to defeat provider-side caching. This is not the same noise process as humans. At greedy decoding the rating space is too coarse (7 integer steps) to produce human-scale spread, and the distribution of cache-buster-driven ratings is not calibrated to anything.
What this costs you: standard deviations from temperature = 0 + cache_buster: true runs are a floor on model uncertainty, not a calibrated estimate of it. Do not compare these SDs directly to human within-condition SDs.
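The cache-buster mechanism described above amounts to making each prompt unique at the byte level. Below is a minimal sketch of that idea. The salt format, the function name, and the example prompt are hypothetical, since this page does not specify the exact text the harness appends.

```python
import uuid


def add_cache_buster(prompt):
    """Append a unique salt so that otherwise identical prompts differ byte-for-byte."""
    salt = uuid.uuid4().hex
    # Hypothetical salt format; the harness's actual wording is not specified here.
    return f"{prompt}\n\n[request-id: {salt}]"


# Hypothetical prompt text for illustration only.
base_prompt = "Rate the valence of this image on a scale from 1 to 7."
salted = [add_cache_buster(base_prompt) for _ in range(5)]
print(len(set(salted)))  # number of distinct prompts sent for the same cell
```

Any rating spread across these five calls comes from the model's sensitivity to an irrelevant token string, which is why it should not be read as an analogue of between-rater variance.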
To get between-sample variance that is more comparable in scale to human between-rater spread, set a non-zero temperature (e.g. temperature: 0.7) and increase samples_per_image. Be aware that different providers have different effective temperature scales — a temperature of 1.0 on one provider is not the same as 1.0 on another.
Summary
| Divergence | Severity for cell-mean estimates | Severity for variance estimates |
|---|---|---|
| Same-model rater pool | Low | Low (calls are independent) |
| N = 5 vs N = 100 | High | High |
| No list/order randomization | Low | Medium |
| Rating + reasoning | Medium | Medium |
| No fixation/timing | Low | Low |
| Provider determinism | Low | High |