Experiment Design: Valence and Arousal Ratings

OASIS-LLM replays the affective rating procedure from Kurdi, Lozano, & Banaji (2017) against a vision-language model, holding the original instructions constant and substituting the model for a human rater. This page explains what that procedure is, how the prompts are constructed, how each trial is identified, and where the LLM replication deliberately departs from the human study.

The original OASIS procedure

Kurdi et al. (2017) normed 900 open-access color images on two affective dimensions:

Valence — the level of positivity or negativity intrinsic to the image.
Arousal — the level of intensity or calmness intrinsic to the image, treated as orthogonal to valence.

Both dimensions were rated on a 7-point Likert scale with verbal anchors at every step. The paper’s sample included 822 usable MTurk participants, restricted to ≥90% approval and ≥50 prior HITs. Images were split into four lists of 225 images each, and each participant saw one list in individually randomized order. Assignment was between-subject: each participant rated only one dimension — valence or arousal, never both — yielding approximately 100 ratings per (image, dimension) cell after balancing across lists.
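
As a rough check on the cell size, assume the 822 participants were spread evenly over the four lists and the two dimensions. The even split is an assumption made here for illustration, not a figure taken from the paper:

```latex
\frac{822 \text{ participants}}{4 \text{ lists} \times 2 \text{ dimensions}} \approx 103 \text{ raters per (image, dimension) cell}
```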

System prompts

OASIS-LLM uses the instruction wording from the paper’s main study verbatim (the image-centered variant). The arousal prompt explicitly instructs the model to rate arousal independently of valence, matching the paper’s key methodological requirement. The valence system prompt reads:

In this study you will be presented with a series of images. We are interested in
the affective response that these images evoke. The dimension that we are asking
you to rate is VALENCE.

Valence refers to the level of positivity or negativity intrinsic to an image.
At one extreme of the valence scale, an image is very negative; at the other
extreme, an image is very positive.

You will rate each image on a 7-point scale with the following labels:
1 = Very negative
2 = Moderately negative
3 = Somewhat negative
4 = Neutral
5 = Somewhat positive
6 = Moderately positive
7 = Very positive

Please respond with a single integer from 1 to 7.
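
Because the prompt asks for a single integer from 1 to 7, a reply can be validated with a strict parser. The following is a minimal sketch, not OASIS-LLM's actual parsing code; the function name `parse_rating` and the choice to reject anything other than a bare digit are assumptions made for illustration.

```python
import re
from typing import Optional


def parse_rating(reply: str) -> Optional[int]:
    """Return the rating if the reply is a single integer from 1 to 7, else None.

    Hypothetical helper: surrounding whitespace is tolerated, but any extra
    text (for example "5 - somewhat positive") is rejected.
    """
    match = re.fullmatch(r"\s*([1-7])\s*", reply)
    return int(match.group(1)) if match else None


print(parse_rating("4"))       # a valid rating
print(parse_rating("eight"))   # not a rating, so None
```

Rejecting replies with extra text keeps malformed outputs visible as missing values instead of silently guessing a number from them. A more lenient parser is possible, but it would be a different design choice.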

Trial structure

Each trial is uniquely identified by a four-part primary key:

(run_id, image_id, dimension, sample_idx)

run_id — the name field from your YAML config.
image_id — the OASIS filename stem (e.g. Alarm clock 1, Beach 1).
dimension — "valence" or "arousal".
sample_idx — 0..samples_per_image-1. With cache_buster: true (the default), each index produces a unique per-trial salt appended to the user prompt to force decoding-path divergence even at temperature=0.
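
The four-part key can be sketched as a small immutable record, with the full trial list generated as a Cartesian product. This is an illustrative sketch only: `TrialKey` and `enumerate_trials` are hypothetical names, and the per-trial salt is left out because its format is not described here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Sequence

DIMENSIONS = ("valence", "arousal")


@dataclass(frozen=True)
class TrialKey:
    """Hypothetical container mirroring (run_id, image_id, dimension, sample_idx)."""
    run_id: str
    image_id: str
    dimension: str
    sample_idx: int


def enumerate_trials(
    run_id: str, image_ids: Sequence[str], samples_per_image: int
) -> Iterator[TrialKey]:
    """Yield one key per (image, dimension, sample index) combination."""
    for image_id, dimension, sample_idx in product(
        image_ids, DIMENSIONS, range(samples_per_image)
    ):
        yield TrialKey(run_id, image_id, dimension, sample_idx)


trials = list(
    enumerate_trials("demo-run", ["Alarm clock 1", "Beach 1"], samples_per_image=5)
)
print(len(trials))  # 2 images x 2 dimensions x 5 samples
```

Making the record frozen lets keys be used in sets and as dictionary keys, which is convenient for checking which trials have already been completed.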

How the LLM replication differs from the human protocol

Two design choices in the LLM replication differ deliberately from the human protocol. They make the replication cheaper and simpler, but they limit what you can claim by direct comparison to the original norms.

Within-subject instead of between-subject. A single model run rates both valence and arousal for every image. The paper kept these strictly between-subject so that one rating could not anchor the other. OASIS-LLM collapses them into one run because the per-trial cost is dominated by the image upload, and because a single config then yields both columns for every model.

Smaller N per cell. The default is samples_per_image: 5, giving 5 ratings per (image, dimension) cell instead of the paper’s ~100. This is a deliberate cost/reliability trade-off.

Reliability at reduced sample sizes

Using the Spearman-Brown prophecy formula, if the paper-level inter-rater reliability is ρ ≈ 0.984 for valence at k = 102 raters, the expected reliability at k' = 5 raters is:

\rho_{k'} = \frac{(k'/k) \cdot \rho}{1 + ((k'/k) - 1) \cdot \rho} = \frac{(5/102) \cdot 0.984}{1 + ((5/102) - 1) \cdot 0.984} \approx 0.751

That is the reliability 5 model samples can be expected to reach as a cell-mean estimate, assuming the model ratings have the same true between-image variance and noise structure as the human ratings. Treat this number as a budgeting tool, not a guarantee. The table below shows how reliability scales with sample count under the same formula and inputs (ρ ≈ 0.984 at k = 102):

`samples_per_image`	Estimated ρ (valence)
1	~0.38
5	~0.75
20	~0.92
100	~0.98
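
The projection can be recomputed for any sample count with a few lines of Python. This is a minimal sketch of the formula as written above; the function name `spearman_brown` is an assumption, and the inputs (0.984 at 102 raters) are the figures quoted in this section.

```python
def spearman_brown(rho: float, k: float, k_new: float) -> float:
    """Project a reliability rho observed with k raters onto k_new raters."""
    ratio = k_new / k
    return ratio * rho / (1 + (ratio - 1) * rho)


# Reliability quoted in the text: rho = 0.984 for valence at k = 102 raters.
for samples_per_image in (1, 5, 20, 100):
    estimate = spearman_brown(0.984, 102, samples_per_image)
    print(f"{samples_per_image:>3} samples -> estimated rho = {estimate:.3f}")
```

The same function can be used to budget a run: increase `k_new` until the projected reliability reaches the level the comparison needs, keeping in mind that the projection inherits the assumption that model noise behaves like human rater noise.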

Tuning samples_per_image

Because samples_per_image is excluded from the run’s canonical_hash, you can resume a run later with a higher value and the harness will only enqueue the new sample indices — no data is lost and no new run name is required.

Start with the default of 5 to validate your setup and model, then bump samples_per_image to 20 or higher on the same run name if you need tighter cell-mean estimates for publication-quality comparisons.
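
The resume behavior can be pictured as a set difference between the trials a config asks for and the trials already stored. The sketch below is an assumption about how such a check might look, not the harness's actual implementation; `missing_trials` and the in-memory `completed` set are hypothetical stand-ins for whatever storage the harness uses.

```python
from typing import List, Sequence, Set, Tuple

# (run_id, image_id, dimension, sample_idx)
Key = Tuple[str, str, str, int]


def missing_trials(
    run_id: str,
    image_ids: Sequence[str],
    samples_per_image: int,
    completed: Set[Key],
) -> List[Key]:
    """Return the trial keys requested by the config that are not yet completed."""
    wanted = [
        (run_id, image_id, dimension, sample_idx)
        for image_id in image_ids
        for dimension in ("valence", "arousal")
        for sample_idx in range(samples_per_image)
    ]
    return [key for key in wanted if key not in completed]


# First pass at the default of 5 samples: nothing is completed yet.
done = set(missing_trials("demo-run", ["Beach 1"], 5, completed=set()))

# Resuming the same run name with 20 samples only adds indices 5..19.
todo = missing_trials("demo-run", ["Beach 1"], 20, completed=done)
print(len(todo))  # 1 image x 2 dimensions x 15 new samples
```

Because the run name and the first five sample indices are unchanged, the earlier ratings stay valid and only the additional indices need to be collected.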
