Cache Buster: Forcing Decoding Variance at Temperature 0

Many provider defaults, and many self-hosted vLLM/SGLang setups, use temperature=0 (greedy decoding). Under greedy decoding, two API calls with identical inputs produce identical outputs, up to occasional numerical nondeterminism in the serving stack. This means samples_per_image: 5 collapses to one rating repeated five times: zero variance information at five times the cost. The cache buster is the fix.

The problem

Setting temperature > 0 is the obvious solution, but it has real costs:

It changes a genuine degree of freedom of the experiment, making results incomparable to deterministic baselines.
Many open-weights models behave erratically at non-zero temperature on highly structured tasks like integer rating.
You lose reproducibility: the same config on a resume would produce different outputs.

What is needed is a way to get different decoding paths without changing the temperature.

The fix: per-sample salt

When cache_buster: true (the default), the runner appends a 10-character hex salt to the end of the user turn:

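# Salt only when the feature is on and the trial key is fully known.
if cfg.cache_buster and image_id is not None and sample_idx is not None:
    # Hash the trial key and keep the first 10 hex characters.
    salt = hashlib.sha256(
        f"{cfg.name}|{image_id}|{dimension}|{sample_idx}".encode()
    ).hexdigest()[:10]
    # Append after the instructions so the shared prefix stays cacheable.
    usr_p = f"{usr_p}\n\n[trial-id: {salt}]"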

The salt is deterministic in (run_name, image_id, dimension, sample_idx):

Resuming a run produces the same salts, same prompts, and the same prompt_hash.
Each sample_idx produces a different decoding path, even at temperature=0.
Across reruns that share the same run name, sample 0 always gets the same salt, so you can compare those runs directly.
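The determinism properties listed above can be checked with a small standalone sketch. The function name trial_salt and the example key values are hypothetical and chosen for illustration; only the hashing expression itself comes from the runner snippet.

```python
import hashlib


def trial_salt(run_name: str, image_id: str, dimension: str, sample_idx: int) -> str:
    """Return the 10-character hex salt for one trial key."""
    key = f"{run_name}|{image_id}|{dimension}|{sample_idx}"
    return hashlib.sha256(key.encode()).hexdigest()[:10]


# Hypothetical trial key, for illustration only.
salts = [trial_salt("pilot-v2", "img_0001", "valence", i) for i in range(5)]

# Same key, same salt: a resumed run rebuilds identical prompts.
assert salts == [trial_salt("pilot-v2", "img_0001", "valence", i) for i in range(5)]

# Different sample_idx values give different salts. With 40 bits of hash
# output a collision among five samples is possible in principle but
# extremely unlikely.
assert len(set(salts)) == 5
```

Because the run name is part of the key, renaming a run changes every salt, which is why salts match across reruns only when the name is unchanged.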

Why it works

The salt forces a different decoding path. Greedy decoding is deterministic for a given input, but it is not constant across inputs. A 10-character hex string is enough to perturb the attention pattern over the prefix, and it can produce a different rating than the unsalted prompt would, even though the meaningful instruction content is identical.

It is KV-cache friendly. The salt is appended after the image and after the original instruction text. Providers that do prefix caching (Anthropic, OpenAI structured outputs, vLLM, SGLang) cache the long prefix (image data URL plus paper-verbatim instructions) and only recompute the small tail that includes the salt:

Across the 5 sample_idx values for one (image, dimension), the expensive prefix is cached once and reused.
The salt is the only thing that changes per sample, so you pay for one full call and four cheap tail-only calls.

If the salt were placed at the start of the user turn or in the system message, prefix caching would invalidate on every call and the cost advantage would disappear.
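The effect of salt placement on the reusable prefix can be illustrated with a rough sketch. It measures the shared prefix in characters, which is only a proxy: real prefix caches work on tokens, often in fixed-size blocks. The instruction text, key values, and helper names below are all hypothetical.

```python
import hashlib
import os.path

# Stand-in for the long, expensive prefix (image data URL plus instructions).
INSTRUCTIONS = "Rate the image on a scale from 1 to 7. " * 50


def salt_for(sample_idx: int) -> str:
    # Hypothetical run name, image id, and dimension, for illustration only.
    key = f"pilot-v2|img_0001|valence|{sample_idx}"
    return hashlib.sha256(key.encode()).hexdigest()[:10]


def user_turn(salt: str, salt_first: bool) -> str:
    tag = f"[trial-id: {salt}]"
    if salt_first:
        return f"{tag}\n\n{INSTRUCTIONS}"
    return f"{INSTRUCTIONS}\n\n{tag}"


def shared_prefix_len(prompts: list[str]) -> int:
    # Length of the longest prefix common to every prompt, in characters.
    return len(os.path.commonprefix(prompts))


salts = [salt_for(i) for i in range(5)]
tail = shared_prefix_len([user_turn(s, salt_first=False) for s in salts])
head = shared_prefix_len([user_turn(s, salt_first=True) for s in salts])

# Salt at the end: the whole instruction text is shared across the 5 samples.
assert tail >= len(INSTRUCTIONS)
# Salt at the start: only the literal "[trial-id: " and any matching leading
# hex digits are shared, so almost nothing can be reused.
assert head < len(INSTRUCTIONS)
```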

Limitations

The cache buster does not give you human-scale rating spread. It cannot.

An empirical comparison of cache_buster=false (v1) and cache_buster=true (v2) on the same Gemma-4-31B pilot showed the mean per-cell standard deviation moved from roughly 0.10 to 0.07 rating points. These numbers are near zero in both cases, and even those non-zero values come largely from occasional parsing edge cases rather than from the salt itself.

The deeper issue is that the output space is too coarse. The rating space is 7 integers. Two decoding paths that disagree about which integer is most probable will produce different ratings; two paths that agree on the argmax will produce identical ratings even if their full probability distributions differ. At temperature=0 you only see the argmax. So:

If the model is confident, the salt does nothing — argmax does not move.
If the model is uncertain, the salt occasionally tips the argmax to a neighboring integer — small standard deviation.
You will never see the 1.0–1.5 rating-point standard deviations that 100 humans produce on an ambiguous image.

The cache buster provides floor variance: some information when you would otherwise have zero.
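The argmax argument can be made concrete with a toy model. This is a sketch under a strong assumption, not a measurement: it treats the salt as a small bounded perturbation added to each of the 7 rating logits, and the logit values, noise size, and function names are invented for illustration. It uses the population standard deviation; the pilot analysis may have used a different estimator.

```python
import random
import statistics


def greedy_rating(logits, noise, rng):
    """Perturb each logit by at most +/- noise, then take the argmax (1-7)."""
    perturbed = [x + rng.uniform(-noise, noise) for x in logits]
    best = max(range(len(perturbed)), key=perturbed.__getitem__)
    return best + 1


def cell_std(logits, n_samples=5, noise=0.1, seed=0):
    """Population std of the ratings in one (image, dimension) cell."""
    rng = random.Random(seed)
    ratings = [greedy_rating(logits, noise, rng) for _ in range(n_samples)]
    return statistics.pstdev(ratings)


confident = [0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0]   # clear winner: rating 5
uncertain = [0.0, 0.0, 0.0, 2.0, 2.05, 0.0, 0.0]  # near tie between 4 and 5

stds = [cell_std(confident), cell_std(uncertain)]
mean_per_cell_std = statistics.fmean(stds)

# The perturbation cannot close a 5-logit gap, so the argmax never moves.
assert stds[0] == 0.0
# Ratings can only be 4 or 5, so the std over 5 samples is bounded by 0.5.
assert 0.0 <= stds[1] <= 0.5
```

Even in the near-tie cell the spread cannot exceed half a rating point in this model, which is the sense in which the salt yields floor variance rather than human-scale spread.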

When to use it

Default on. It is nearly free (one extra SHA-256 per trial, and the cached prefix is untouched), and it gives you some variance signal when you would otherwise have none.
Combine with non-zero temperature if you actually need calibrated model uncertainty. temperature: 0.7 combined with cache_buster: true is reasonable for variance-focused runs.
Keep it consistent across resumes. Toggling cache_buster mid-run invalidates the canonical prompt_hash, so flipping the setting requires a new run name to avoid accidentally mixing salted and unsalted trials in analysis.

Config option

cache_buster: true    # default — appends per-sample salt to user turn
cache_buster: false   # all samples for a given (image, dimension) get identical prompts

How the salt is computed

The salt generation uses SHA-256 over the trial’s unique key, truncated to 10 hex characters:

salt = hashlib.sha256(
    f"{cfg.name}|{image_id}|{dimension}|{sample_idx}".encode()
).hexdigest()[:10]
usr_p = f"{usr_p}\n\n[trial-id: {salt}]"

The salt is included in the prompt_hash column stored on each trial record, so you can audit exactly which prompt went out the door for each trial. If you ever need to compare salted vs. unsalted runs, name them differently and use the canonical hash to protect against accidental mixing in your analysis queries.
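One way to guard against accidental mixing is to check the prompt_hash pattern inside each cell: salted samples should all have different hashes, and unsalted samples should all share one. The sketch below assumes trial records are dictionaries with image_id, dimension, and prompt_hash keys; that record layout, the function name, and the example hashes are hypothetical.

```python
from collections import defaultdict


def classify_cells(records):
    """Label each (image_id, dimension) cell by its prompt_hash pattern."""
    hashes = defaultdict(list)
    for rec in records:
        hashes[(rec["image_id"], rec["dimension"])].append(rec["prompt_hash"])

    labels = {}
    for cell, cell_hashes in hashes.items():
        distinct = len(set(cell_hashes))
        if len(cell_hashes) < 2:
            labels[cell] = "unknown"    # one sample cannot show a pattern
        elif distinct == len(cell_hashes):
            labels[cell] = "salted"     # every sample saw a different prompt
        elif distinct == 1:
            labels[cell] = "unsalted"   # every sample saw the same prompt
        else:
            labels[cell] = "mixed"      # salted and unsalted trials combined
    return labels


# Made-up hashes, for illustration only.
records = [
    {"image_id": "img_1", "dimension": "valence", "prompt_hash": "a1"},
    {"image_id": "img_1", "dimension": "valence", "prompt_hash": "b2"},
    {"image_id": "img_2", "dimension": "valence", "prompt_hash": "c3"},
    {"image_id": "img_2", "dimension": "valence", "prompt_hash": "c3"},
    {"image_id": "img_3", "dimension": "valence", "prompt_hash": "d4"},
    {"image_id": "img_3", "dimension": "valence", "prompt_hash": "d4"},
    {"image_id": "img_3", "dimension": "valence", "prompt_hash": "e5"},
]
labels = classify_cells(records)
assert labels[("img_3", "valence")] == "mixed"
```

Any cell labelled mixed is a sign that trials from differently configured runs ended up in the same analysis query.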

Related pages

Configuration: cache_buster field reference
Discrepancies: how cache-buster variance differs from human between-rater variance
