Known Divergences from the OASIS Human Protocol

OASIS-LLM is a replication of the Kurdi, Lozano, & Banaji (2017) rating protocol using language models instead of human participants. It is not a perfect replication. This page documents every known way the two diverge, what drives each divergence, and what it costs you when interpreting your results. The goal is a complete, honest list — if you find a divergence not listed here, treat it as a bug and report it.

1. Same-model rater pool vs independent human raters

	Paper	OASIS-LLM
Who rates valence vs arousal	Different participants (between-subject)	Same model weights — separate stateless calls
Within-call context	One dimension per session	One dimension per call (no cross-dimension context)
Cross-trial state	Per-participant carryover within their list	None — every trial is a fresh API call

Each trial in OASIS-LLM is a single stateless API call carrying exactly one dimension. The model never sees a valence and an arousal question in the same conversation, so there is no anchoring path between dimensions within a session, because there is no session.

The remaining divergence is at the population level. The paper drew valence raters and arousal raters from disjoint pools of humans. OASIS-LLM draws both dimensions from the same model (same weights, same provider, same configuration). If that model encodes valence and arousal as a shared internal affect representation, ratings of the two dimensions on the same image will share systematic bias even though no individual call sees both questions.

What this costs you: correlations between LLM valence and LLM arousal cannot be cleanly attributed to two raters agreeing about the world; they may also reflect one rater computing both from the same internal feature. Compare model–human Spearman ρ separately per dimension to keep the interpretation clean.

For stricter parity with the paper’s between-subject design, run different models for valence and arousal — for example, one provider for valence and a different provider for arousal — and treat them as independent rater populations.
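The per-dimension comparison recommended above can be sketched in a few lines. This is a minimal standard-library illustration, not part of the OASIS-LLM harness: the function names and the `human` and `model` cell means are made up for the example, and Spearman ρ is computed as the Pearson correlation of tie-averaged ranks.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-image cell means, in the same image order for both sources.
human = {"valence": [6.1, 2.3, 4.0, 5.2, 1.8], "arousal": [3.9, 5.5, 2.1, 4.4, 5.0]}
model = {"valence": [6.4, 2.0, 4.4, 4.9, 2.6], "arousal": [4.0, 5.0, 3.0, 3.2, 5.6]}

# One rho per dimension, never a pooled rho across dimensions.
rho_by_dimension = {dim: spearman_rho(model[dim], human[dim]) for dim in human}
```

Keeping one ρ per dimension means a shared affect representation inside the model cannot inflate a pooled agreement figure.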

2. N = 5 per cell vs ~100 per cell

	Paper	OASIS-LLM (default)
Ratings per (image, dimension)	~100	5 (`samples_per_image: 5`)

By the Spearman-Brown prophecy formula, going from k = 100 to k = 5 raters drops the cell-mean reliability ceiling from roughly 0.98 to roughly 0.76 for valence and from roughly 0.93 to roughly 0.40 for arousal. That ceiling assumes the model has the same noise structure as humans, which is a strong assumption.

What this costs you: individual cell means are 5-sample point estimates, not stable norms. Treat them accordingly and report the per-cell SD alongside the mean.

When you need tighter estimates, increase `samples_per_image` in your run configuration. Note that `samples_per_image` is excluded from the canonical run hash, so bumping it creates a new run without invalidating existing cached trials.
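The Spearman-Brown projection behind those reliability ceilings can be reproduced with a short sketch. The function names are illustrative and not part of OASIS-LLM. The idea is to invert the formula to recover the implied single-rater reliability from the k = 100 figure, then project it to the new k.

```python
def single_rater_reliability(rel_k, k):
    """Invert Spearman-Brown: reliability of one rater implied by a k-rater mean."""
    return rel_k / (k - (k - 1) * rel_k)


def spearman_brown(r1, k):
    """Reliability of the mean of k raters, given single-rater reliability r1."""
    return k * r1 / (1 + (k - 1) * r1)


def project(rel_at_k_old, k_new, k_old=100):
    """Project a reliability observed with k_old raters down (or up) to k_new raters."""
    return spearman_brown(single_rater_reliability(rel_at_k_old, k_old), k_new)


arousal_ceiling = project(0.93, 5)   # about 0.40
valence_ceiling = project(0.98, 5)   # about 0.71
valence_ceiling_alt = project(0.985, 5)  # about 0.77
```

The projection is sensitive to rounding of the k = 100 input when that input is close to 1: an input of exactly 0.98 gives about 0.71 at k = 5, while 0.985 gives about 0.77. The valence ceiling is therefore best read as approximate.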

Increasing `samples_per_image` raises API cost and latency, since every extra sample is another API call. At temperature = 0 with a cache buster, the additional samples add noise without sampling new model states; see divergence 6 for context.

3. No list randomization, no order randomization

	Paper	OASIS-LLM
List structure	4 lists × 225 images each	None — every model sees every image
Order	Per-participant random	Fixed scheduling order (`attempts, sample_idx, image_id, dimension`)

Each model conversation is single-shot per trial: a fresh API call for each (image, dimension, sample_idx) triple. There is no session carrying state across images, so order effects within the model are essentially absent. The trade-off is that you also lose the habituation and calibration effects humans accumulate from earlier trials in their list.

What this costs you: you cannot study order effects, list effects, or rating fatigue. You also cannot replicate variance that arises from “rater has already seen N other images”, the kind of anchoring that may suppress or exaggerate extreme ratings in humans.
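One way to picture the fixed scheduling order is as a lexicographic sort over the tuple named in the table. The sketch below assumes that reading. The `schedule` function and the trial dictionaries are hypothetical, not the harness's actual scheduler.

```python
from itertools import product


def schedule(image_ids, dimensions, samples_per_image, attempts=0):
    """Build one trial per (image, dimension, sample_idx) and sort deterministically."""
    trials = [
        {"attempts": attempts, "sample_idx": s, "image_id": img, "dimension": dim}
        for img, dim, s in product(image_ids, dimensions, range(samples_per_image))
    ]
    # Assumed sort key: (attempts, sample_idx, image_id, dimension), compared left to right.
    trials.sort(
        key=lambda t: (t["attempts"], t["sample_idx"], t["image_id"], t["dimension"])
    )
    return trials


plan = schedule(["I2", "I1"], ["valence", "arousal"], samples_per_image=2)
```

Under this reading, every image receives its first sample before any image receives its second. The order is the same on every run regardless of input order, which is the opposite of the paper's per-participant shuffling.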

4. Rating + reasoning vs rating only

	Paper	OASIS-LLM
Output	Single integer (1–7)	Integer + one-sentence reasoning (`capture_reasoning: true`)

Asking the model to provide reasoning is a deliberate change to the task. It gives you text you can audit, pushes the model toward verbalizing the target dimension, and plausibly changes the rating itself by encouraging chain-of-thought tokens before the final answer.

What this costs you: you cannot compare `capture_reasoning: true` runs to the human single-integer protocol on equal footing. The reasoning step may inflate inter-item consistency (the model “talks itself into” a rating) or introduce dimension-inappropriate reasoning.

For strict parity with the human protocol, set `capture_reasoning: false`. The harness then enforces a strict-schema `{"rating": int}` response, with no chain-of-thought path.

See the Reasoning capture page for the full prompt diff between reasoning-on and reasoning-off modes.
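To make the strict-schema idea concrete, here is a minimal client-side check for a `{"rating": int}` response. It is a sketch only: how the harness actually enforces the schema is not described here, and `parse_strict_rating` is a hypothetical helper. The 1–7 range comes from the rating scale in the table above.

```python
import json


def parse_strict_rating(raw):
    """Accept only a JSON object with exactly one key, 'rating', holding an int in 1..7."""
    payload = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(payload, dict) or set(payload) != {"rating"}:
        raise ValueError("expected a JSON object with exactly one key: 'rating'")
    rating = payload["rating"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int):
        raise ValueError("rating must be an integer")
    if not 1 <= rating <= 7:
        raise ValueError("rating must be between 1 and 7")
    return rating
```

Rejecting extra keys is what closes off the reasoning path: a response that carries a `reasoning` field alongside the rating fails validation instead of being silently accepted.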

5. Single-pass image, no fixation, no timing

The OASIS human protocol presents each image with a fixation cross and enforces per-trial timing, giving participants a defined window to view the stimulus before responding. OASIS-LLM sends the image once as a base64 data URL and leaves attention to the model's internal processing. There is no analogue of “looking at the image for N seconds before responding.”

What this costs you: any effects driven by viewing time, eye-movement patterns, or attention allocation are outside the scope of OASIS-LLM. If your research question involves how viewing duration affects affect ratings, this tool cannot address it.
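For readers unfamiliar with the single-pass encoding, a base64 data URL can be built as follows. This is a generic sketch: `to_data_url` is a hypothetical helper, and the default MIME type is an assumption about the image files, not something taken from the harness.

```python
import base64


def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a data URL (mime_type default is an assumption)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Usage with a file on disk (the path is illustrative):
# with open("image.jpg", "rb") as fh:
#     url = to_data_url(fh.read())
```

The whole image travels in one request payload, so there is no parameter that could stand in for exposure duration.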

6. Provider determinism vs human variance

	Paper	OASIS-LLM
Variance source	Real between-rater human variance	Sampling noise (temperature > 0) or cache-buster noise (temperature = 0)

The paper’s variance is genuine between-rater variance — different people with different emotional responses. OASIS-LLM mixes two distinct noise sources:

Sampling-noise variance (when temperature > 0): real stochasticity in the token decode. This is the closest analogue to between-rater variance.
Cache-buster variance (when temperature = 0, the default for many providers): forced by appending a unique salt to each user prompt to defeat provider-side caching. This is not the same noise process as humans. At greedy decoding the rating space is too coarse — 7 integer steps — to produce human-scale spread, and the distribution of cache-buster-driven ratings is not calibrated to anything.

What this costs you: standard deviations from temperature = 0 + cache_buster: true runs are a floor on model uncertainty, not a calibrated estimate of it. Do not compare these SDs directly to human within-condition SDs.
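The cache-buster mechanism amounts to making every prompt textually unique. The sketch below shows one possible way to do that. The salt format, its placement at the end of the prompt, and the `with_cache_buster` name are all assumptions for illustration, not the format the harness uses.

```python
import uuid


def with_cache_buster(prompt, salt=None):
    """Append a unique salt so identical prompts are not served from a provider cache."""
    if salt is None:
        salt = uuid.uuid4().hex  # random 32-character hex string
    # Hypothetical salt format; any unique suffix would serve the same purpose.
    return f"{prompt}\n\n[run-salt: {salt}]"


first = with_cache_buster("Rate the valence of this image from 1 to 7.")
second = with_cache_buster("Rate the valence of this image from 1 to 7.")
```

Because the salt changes the input tokens, any variation it produces at temperature = 0 reflects the model's sensitivity to irrelevant prompt text, not a draw from its output distribution.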

To get between-sample variance that is more comparable in scale to human between-rater spread, set a non-zero temperature (e.g. temperature: 0.7) and increase samples_per_image. Be aware that different providers have different effective temperature scales — a temperature of 1.0 on one provider is not the same as 1.0 on another.
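When comparing runs at different temperature settings, it helps to compute the per-cell mean and SD the same way for each. A minimal sketch follows, using made-up ratings. The `(image_id, dimension, rating)` tuple layout and the `cell_stats` name are assumptions for the example, not the harness's export format.

```python
from collections import defaultdict
from statistics import mean, stdev


def cell_stats(trials):
    """Group ratings by (image_id, dimension) and report n, mean, and sample SD."""
    cells = defaultdict(list)
    for image_id, dimension, rating in trials:
        cells[(image_id, dimension)].append(rating)
    return {
        cell: {
            "n": len(ratings),
            "mean": mean(ratings),
            # Sample SD needs at least two ratings; report None otherwise.
            "sd": stdev(ratings) if len(ratings) > 1 else None,
        }
        for cell, ratings in cells.items()
    }


# Illustrative ratings only: a near-constant greedy run and a more spread-out sampled run.
greedy_run = [("I1", "valence", r) for r in (4, 4, 4, 4, 5)]
sampled_run = [("I1", "valence", r) for r in (3, 5, 4, 6, 2)]

greedy = cell_stats(greedy_run)[("I1", "valence")]
sampled = cell_stats(sampled_run)[("I1", "valence")]
```

Reporting `n` next to each SD keeps it visible that a five-sample SD is itself a noisy estimate.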

Summary

Divergence	Severity for cell-mean estimates	Severity for variance estimates
Same-model rater pool	Low	Low (calls are independent)
N = 5 vs N = 100	High	High
No list/order randomization	Low	Medium
Rating + reasoning	Medium	Medium
No fixation/timing	Low	Low
Provider determinism	Low	High
