Glossary of OASIS-LLM Terms and Technical Concepts

Valence

The level of positivity vs. negativity intrinsic to a stimulus. In OASIS, valence is rated on a 7-point scale from “Very negative” (1) to “Very positive” (7). Valence is one of the two canonical affective dimensions in the dimensional model of emotion (Russell, 1980) — it answers the question “how pleasant or unpleasant is this?”

Arousal

The level of intensity, activation, or excitement intrinsic to a stimulus, treated as orthogonal to valence. In OASIS, arousal is rated 1 (“Very low” — calm, sleepy) to 7 (“Very high” — stimulating, frenzied). A stimulus can be high-valence and low-arousal (a relaxing beach) or low-valence and high-arousal (a car crash).

OASIS

Open Affective Standardized Image Set. A collection of 900 open-access color images normed by Kurdi, Lozano, & Banaji (2017) on 7-point valence and arousal scales using approximately 822 Amazon Mechanical Turk participants. OASIS was designed as a Creative Commons-licensed alternative to the International Affective Picture System (IAPS), which requires a restricted license for research use.

ICC(2,k)

Two-way random-effects, average-measures intraclass correlation coefficient. ICC(2,k) quantifies the reliability of the cell-mean rating when that mean is computed across k raters per image.

The OASIS paper reports ICC(2,k) ≈ 0.984 for valence and ≈ 0.929 for arousal at the paper’s k ≈ 100 raters per image. These values are substantially higher than the single-rater ICC(2,1) because averaging across raters cancels per-rater noise: the more raters you average, the more the idiosyncratic variance washes out, and the more stable the cell mean becomes.
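As a rough illustration of what the statistic measures, the sketch below computes ICC(2,k) from the two-way ANOVA mean squares. It assumes a complete images × raters matrix in which every rater scores every image; that is a simplification for illustration, not a description of how the OASIS paper or this project computes the value. The function name icc2k and the toy data are hypothetical.

```python
def icc2k(ratings):
    """ICC(2,k) for a complete matrix: rows = images, columns = raters."""
    n = len(ratings)       # number of images (targets)
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for the two-way layout without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-image mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (msc - mse) / n)


# Toy data (not OASIS ratings): 6 images rated by 4 raters.
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
toy_icc = icc2k(toy)
```

The (msc - mse) / n term in the denominator is what penalizes systematic offsets between raters; it shrinks as the number of images grows.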

Spearman-Brown prophecy

A formula for predicting reliability when you change the number of raters from k to a new number k':

\rho_{k'} = \frac{(k'/k) \cdot \rho_k}{1 + ((k'/k) - 1) \cdot \rho_k}

In this project, the formula is used to extrapolate what ICC you should expect at samples_per_image=5 if model ratings have a human-like noise structure — that is, if the variance across model samples resembles the variance across human raters. The extrapolation is speculative (model variance may differ structurally from human variance) but provides a useful reference point.
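A minimal worked example of the extrapolation, using the reliabilities quoted in this glossary and treating the paper’s k ≈ 100 as exactly 100 (an assumption made only to keep the arithmetic concrete). The function name spearman_brown is illustrative, not a project API.

```python
def spearman_brown(rho_k, k, k_new):
    """Predict reliability at k_new raters from reliability rho_k at k raters."""
    ratio = k_new / k
    return ratio * rho_k / (1 + (ratio - 1) * rho_k)


# Extrapolate the reported human ICC(2,k) values down to 5 samples per image.
valence_at_5 = spearman_brown(0.984, k=100, k_new=5)
arousal_at_5 = spearman_brown(0.929, k=100, k_new=5)
print(f"valence: {valence_at_5:.3f}, arousal: {arousal_at_5:.3f}")
```

Under these assumptions the predicted reliabilities are roughly 0.75 for valence and 0.40 for arousal, which shows how steeply the expected ICC drops when k falls from about 100 to 5.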

Cronbach's alpha

A measure of internal-consistency reliability for a multi-item scale. When raters are treated as items, it is equivalent to the consistency-type, average-measures intraclass correlation, ICC(3,k). The OASIS paper reports both Cronbach’s alpha and ICC(2,k) for its rating scales. The two statistics differ only in how they treat systematic differences between rater means, so they nearly coincide when those differences are small, and the paper’s reported values are very close. Alpha is sensitive to the number of items (raters) in the scale; the Spearman-Brown prophecy formula predicts how alpha changes as k changes.
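The sketch below computes alpha with raters treated as items, again assuming a complete images × raters matrix. The function name cronbach_alpha and the toy data are hypothetical and are not taken from the OASIS paper or from this project.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha: rows = images, columns = raters (treated as items)."""
    k = len(ratings[0])
    # Sample variance of each rater's column.
    item_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Sample variance of the per-image sum across raters.
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy data (not OASIS ratings): 6 images rated by 4 raters.
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
alpha = cronbach_alpha(toy)

# Adding a constant offset to one rater leaves alpha unchanged, because alpha
# ignores systematic differences between rater means.
shifted = [[row[0] + 3] + row[1:] for row in toy]
alpha_shifted = cronbach_alpha(shifted)
```

That invariance to rater offsets is the practical difference from ICC(2,k), which counts such offsets as disagreement.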

KV cache (prefix caching)

A transformer inference optimization where the key/value tensors for a fixed prompt prefix are computed once and reused across all subsequent calls that share that prefix. Anthropic, OpenAI, vLLM, and SGLang all implement some form of prefix caching.

OASIS-LLM’s cache buster places its per-sample salt at the end of the user turn so that the long image-plus-instruction prefix remains cacheable. If the salt were inserted earlier in the prompt (for example, before the image), it would invalidate the prefix cache for every trial, eliminating the cost and latency savings that caching provides.

JSON schema strict mode

A provider-side feature that constrains the model at decode time to emit only tokens that conform to a given JSON schema. It is available on OpenAI (response_format: json_schema, strict: true), Anthropic (via tool definitions), and a growing set of open-weights inference stacks.

OASIS-LLM uses strict mode when capture_reasoning=false to enforce a simple {"rating": <int 1–7>} shape. When capture_reasoning=true, strict mode is skipped entirely, because some smaller models (e.g. Gemma 4) enter degenerate output loops when a multi-field schema is required under strict enforcement. In that case, the runner falls back to prompt instructions and the _parse_rating regex fallback.
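To make the two paths concrete, here is a hedged sketch of a schema for the single-field case and a lenient parser for the non-strict case. Both are assumptions: the glossary does not show the project’s actual schema or the body of _parse_rating, so RATING_SCHEMA, the use of enum for the 1–7 range, and parse_rating_sketch are hypothetical stand-ins.

```python
import json
import re

# Assumed shape of the single-field schema; the 1-7 range is expressed with
# enum here as an illustrative choice.
RATING_SCHEMA = {
    "type": "object",
    "properties": {"rating": {"type": "integer", "enum": [1, 2, 3, 4, 5, 6, 7]}},
    "required": ["rating"],
    "additionalProperties": False,
}

# Matches text such as `rating: 6` or `"rating": 6`, but not `rating: 65`.
_RATING_RE = re.compile(r'"?rating"?\s*[:=]\s*([1-7])\b', re.IGNORECASE)


def parse_rating_sketch(text):
    """Hypothetical fallback parser: try JSON first, then a regex."""
    try:
        value = json.loads(text)["rating"]
    except (ValueError, KeyError, TypeError):
        match = _RATING_RE.search(text)
        value = int(match.group(1)) if match else None
    # Reject booleans, non-integers, and out-of-range values.
    if isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 7:
        return value
    return None
```

Returning None for anything unparseable lets the caller decide whether to retry or record a failed trial, rather than guessing a rating.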

Between-subject vs. within-subject

Between-subject: each participant contributes data to only one condition. This eliminates within-rater contamination (anchoring, contrast effects, order effects) at the cost of needing more total raters to achieve the same statistical power. The original OASIS paper used a between-subject design for valence vs. arousal: no individual rater rated both dimensions for the same image.

Within-subject: each participant contributes data to all conditions. This yields more statistical power per rater but introduces order effects, anchoring, and potential contamination between conditions. OASIS-LLM defaults to within-subject (the same model rates both valence and arousal for each image, in separate trials) because LLMs do not carry episodic memory across API calls. See Discrepancies for a discussion of how this design choice may affect comparisons with the human norms.

Cache buster (this project)

A per-trial salt appended to the end of the user prompt, deterministic in the tuple (run_name, image_id, dimension, sample_idx). The salt is computed as sha256(f"{run_name}|{image_id}|{dimension}|{sample_idx}")[:10], i.e. the first 10 hex characters of the digest, and appended in the format [trial-id: <hex>].

Its purpose is to force a different decoding path for each sample even when temperature=0, so that repeated samples of the same image and dimension are not identical copies of each other. Because the salt is placed at the end of the user turn, the long image-plus-instruction prefix that precedes it remains intact for KV-cache reuse. See Cache buster for configuration details.
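The salt computation can be sketched in a few lines. The hash input and the [trial-id: <hex>] format follow the description above; the UTF-8 encoding, the newline separator before the tag, and the function names trial_salt and with_cache_buster are assumptions made for illustration.

```python
import hashlib


def trial_salt(run_name, image_id, dimension, sample_idx):
    """First 10 hex characters of sha256 over the pipe-joined trial tuple."""
    key = f"{run_name}|{image_id}|{dimension}|{sample_idx}"
    # Assumption: the string is UTF-8 encoded before hashing.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:10]


def with_cache_buster(user_prompt, run_name, image_id, dimension, sample_idx):
    """Append the salt at the very end so the shared prefix stays unchanged."""
    salt = trial_salt(run_name, image_id, dimension, sample_idx)
    # Assumption: a newline separates the prompt from the trial tag.
    return f"{user_prompt}\n[trial-id: {salt}]"


# Two samples of the same image and dimension share the prompt prefix and
# differ only in the trailing tag.
prompt = "Rate the valence of this image on a 1-7 scale."
first = with_cache_buster(prompt, "demo-run", "I1", "valence", 0)
second = with_cache_buster(prompt, "demo-run", "I1", "valence", 1)
```

Because the salt is a pure function of the trial tuple, re-running the same trial reproduces the same prompt byte for byte.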

Canonical hash (this project)

A 16-character hex digest stored alongside every run record in DuckDB: sha256(model_dump(exclude={name, max_concurrency, request_timeout_s, max_retries, samples_per_image}))[:16].

The hash is checked every time you invoke oasis-llm run for an existing run name. If the hash of the current config does not match the stored hash, the runner raises a RuntimeError rather than silently mixing results from two different experiment configurations. The excluded fields (name, max_concurrency, request_timeout_s, max_retries, samples_per_image) are operational parameters that do not affect what the model sees, so changing them between invocations is safe. Changing the model, prompts, dimensions, or image set requires a new run name. See Configuration for details.
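The check can be illustrated with a plain dict standing in for the output of model_dump. How the project serializes the config before hashing is not stated here, so the sorted-key JSON serialization, the config field names in the example, and the function names canonical_hash and check_resume are all assumptions.

```python
import hashlib
import json

# Fields excluded from the hash, as listed in the glossary entry.
OPERATIONAL_FIELDS = {
    "name", "max_concurrency", "request_timeout_s", "max_retries", "samples_per_image",
}


def canonical_hash(config):
    """16-hex-character digest over the non-operational config fields."""
    relevant = {k: v for k, v in config.items() if k not in OPERATIONAL_FIELDS}
    # Assumption: sorted-key JSON gives a stable byte representation.
    payload = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def check_resume(stored_hash, config):
    """Refuse to resume a run whose experiment-defining config has changed."""
    if canonical_hash(config) != stored_hash:
        raise RuntimeError("config differs from the stored run; use a new run name")


# Hypothetical config; only the excluded field names come from the glossary.
base = {"name": "demo-run", "model": "model-a", "dimensions": ["valence"], "max_concurrency": 4}
stored = canonical_hash(base)

check_resume(stored, {**base, "max_concurrency": 16})   # operational change: accepted
try:
    check_resume(stored, {**base, "model": "model-b"})  # model change: rejected
    rejected = False
except RuntimeError:
    rejected = True
```

Keeping operational fields out of the hash is what allows an interrupted run to be resumed with a different concurrency or timeout.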
