Many provider defaults, and many self-hosted vLLM/SGLang setups, use temperature=0 (greedy decoding). Under greedy decoding, two API calls with identical inputs produce identical outputs. This means samples_per_image: 5 collapses to one rating repeated five times: zero variance information at five times the cost. The cache buster is the fix.
The problem
Setting temperature > 0 is the obvious solution, but it has real costs:
- It changes a genuine degree of freedom of the experiment, making results incomparable to deterministic baselines.
- Many open-weights models behave erratically at non-zero temperature on highly structured tasks like integer rating.
- You lose reproducibility: the same config on a resume would produce different outputs.
The fix: per-sample salt
When cache_buster: true (the default), the runner appends a 10-character hex salt to the end of the user turn. The salt is derived deterministically from (run_name, image_id, dimension, sample_idx):
- Resuming a run produces the same salts, the same prompts, and the same prompt_hash.
- Each sample_idx produces a different decoding path, even at temperature=0.
- Across runs of the same experiment, sample 0 always gets the same salt, so you can compare runs directly.
Why it works
The salt forces a different decoding path. Greedy decoding is deterministic for a given input, not across inputs. A 10-character hex string is enough to perturb the attention pattern over the prefix and can produce a different rating than the unsalted prompt would, even though the meaningful instruction content is identical.
It is KV-cache friendly. The salt is appended after the image and after the original instruction text. Providers that do prefix caching (Anthropic, OpenAI structured outputs, vLLM, SGLang) cache the long prefix (image data URL plus paper-verbatim instructions) and only recompute the small tail that includes the salt:
- Across the 5 sample_idx values for one (image, dimension), the expensive prefix is cached once and reused.
- The salt is the only thing that changes per sample, so you pay for one full call and four cheap tail-only calls.
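The "one full call and four cheap tail-only calls" arithmetic can be made concrete with a small sketch. The function name tokens_computed and the token counts below are hypothetical, chosen only for illustration. Real servers cache at block granularity, so actual savings will differ somewhat.

```python
def tokens_computed(prefix_tokens: int, tail_tokens: int, n_samples: int,
                    prefix_cached: bool) -> int:
    """Count prompt tokens the server must process for one (image, dimension) cell."""
    full_call = prefix_tokens + tail_tokens
    if not prefix_cached:
        # Every sample recomputes the whole prompt.
        return n_samples * full_call
    # The first call fills the cache; later calls only recompute the salted tail.
    return full_call + (n_samples - 1) * tail_tokens


# Hypothetical sizes: a 2,000-token image-plus-instruction prefix, a 12-token salted tail.
without_cache = tokens_computed(2000, 12, 5, prefix_cached=False)
with_cache = tokens_computed(2000, 12, 5, prefix_cached=True)
print(without_cache, with_cache)  # 10060 2060
```

Under these assumed sizes, five salted samples cost roughly as much prompt processing as a single unsalted call.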
Limitations
An empirical comparison of cache_buster=false (v1) and cache_buster=true (v2) on the same Gemma-4-31B pilot showed the mean per-cell standard deviation moved from roughly 0.10 to 0.07 rating points. These numbers are near zero in both cases, and even those non-zero values come largely from occasional parsing edge cases rather than from the salt itself.
The deeper issue is that the output space is too coarse. The rating space is 7 integers. Two decoding paths that disagree about which integer is most probable will produce different ratings; two paths that agree on the argmax will produce identical ratings even if their full probability distributions differ. At temperature=0 you only see the argmax. So:
- If the model is confident, the salt does nothing — argmax does not move.
- If the model is uncertain, the salt occasionally tips the argmax to a neighboring integer — small standard deviation.
- You will never see the 1.0–1.5 rating-point standard deviations that 100 humans produce on an ambiguous image.
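A toy model illustrates why argmax-only output hides most of the variation. It is a sketch under a strong assumption: the effect of the salt is stood in for by a small random perturbation of seven hypothetical logits, which is not a measurement of any real model. The function name greedy_ratings and all numbers are illustrative.

```python
import random
import statistics


def greedy_ratings(logits, n_samples=5, jitter=0.1, seed=0):
    """Return the argmax rating (1-7) for n_samples slightly perturbed copies of logits."""
    rng = random.Random(seed)
    ratings = []
    for _ in range(n_samples):
        # Stand-in for the salt: a small perturbation of every logit (toy assumption).
        perturbed = [x + rng.uniform(-jitter, jitter) for x in logits]
        ratings.append(perturbed.index(max(perturbed)) + 1)
    return ratings


# Confident model: one rating dominates, so the perturbation never moves the argmax.
confident = greedy_ratings([0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0])
# Uncertain model: ratings 4 and 5 are tied, so the argmax can flip between neighbors.
uncertain = greedy_ratings([0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0])

print(confident, statistics.pstdev(confident))  # always rating 5, spread 0.0
print(uncertain, statistics.pstdev(uncertain))  # only ratings 4 or 5, spread at most 0.5
```

Even in the tied case the spread is capped at 0.5 rating points, well below the 1.0 to 1.5 range quoted for human raters.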
When to use it
- Default on. It is free (one extra sha256 per trial, and the cached prefix is left untouched), and it gives you some variance signal when you would otherwise have none.
- Combine with non-zero temperature if you actually need calibrated model uncertainty. temperature: 0.7 combined with cache_buster: true is reasonable for variance-focused runs.
- Keep it consistent across resumes. Toggling cache_buster mid-run invalidates the canonical prompt_hash, so flipping the setting requires a new name to avoid accidentally mixing salted and unsalted trials in analysis.
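Config option
cache_buster: true (the default) enables the per-sample salt; cache_buster: false sends the prompt unsalted.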
How the salt is computed
The salt generation uses SHA-256 over the trial's unique key, truncated to 10 hex characters. The salt is reflected in the prompt_hash column stored on each trial record, so you can audit exactly which prompt went out the door for each trial. If you ever need to compare salted vs. unsalted runs, name them differently and use the canonical hash to protect against accidental mixing in your analysis queries.
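The scheme described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the function names make_salt, salted_user_turn, and prompt_hash_of, the "|" key separator, the way the salt is appended to the instruction, the definition of the prompt hash as SHA-256 of the salted prompt, and the example run and image names are all assumptions.

```python
import hashlib


def make_salt(run_name: str, image_id: str, dimension: str, sample_idx: int) -> str:
    """Derive a 10-character hex salt from the trial's unique key."""
    key = f"{run_name}|{image_id}|{dimension}|{sample_idx}"  # separator is an assumption
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:10]


def salted_user_turn(instruction: str, salt: str) -> str:
    """Append the salt after the instruction text (exact formatting is an assumption)."""
    return f"{instruction}\n\n[{salt}]"


def prompt_hash_of(prompt: str) -> str:
    """Hash the prompt that is actually sent, for auditing (assumed definition)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


# Hypothetical trial key: five samples for one (image, dimension) cell.
salts = [make_salt("pilot_v2", "img_0001", "beauty", i) for i in range(5)]
prompts = [salted_user_turn("Rate this image from 1 to 7.", s) for s in salts]
hashes = [prompt_hash_of(p) for p in prompts]

for idx, (salt, digest) in enumerate(zip(salts, hashes)):
    print(idx, salt, digest[:12])
```

Because the salt depends only on the key, rerunning this for the same run name and trial yields the same salts and hashes, while each sample_idx gets its own prompt.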
Related pages
- Configuration: cache_buster field reference
- Discrepancies: how cache-buster variance differs from human between-rater variance