The Experiments page shows a cost estimate before you press start. Getting that number right on a fresh machine, with no prior run history, turned out to require abandoning the first implementation entirely. This page documents what Phase 4a tried to do, why it was wrong, what Phase 4b replaced it with, and the calibration data that makes day-one estimates accurate to within ~10%.
Phase 4a — empirical-first (abandoned)
The first implementation computed mean cost / trial from each run’s own completed trials, falling back across runs of the same model when the current run had fewer than a minimum history threshold. When even the cross-run history was empty, a synthetic fallback used 1500 input tokens / 150 output tokens multiplied by litellm.model_cost.
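The fallback order described above can be sketched as follows. This is an illustrative reconstruction, not the removed code: the function name, the `MIN_HISTORY` value, and the idea of passing per-token prices in as plain floats (rather than reading them from `litellm.model_cost`) are all assumptions made to keep the example self-contained.

```python
from statistics import mean
from typing import Sequence, Tuple

# Hypothetical value: the page only says "a minimum history threshold".
MIN_HISTORY = 20

# Synthetic token counts quoted on this page for the Phase 4a fallback.
SYNTHETIC_INPUT_TOKENS = 1500
SYNTHETIC_OUTPUT_TOKENS = 150


def phase_4a_cost_per_trial(
    run_costs: Sequence[float],
    same_model_costs: Sequence[float],
    input_price: float,
    output_price: float,
) -> Tuple[float, str]:
    """Return (USD per trial, source) following the Phase 4a fallback order."""
    # 1. Enough completed trials in this run: use the run's own mean.
    if len(run_costs) >= MIN_HISTORY:
        return mean(run_costs), "run"
    # 2. Otherwise pool completed trials from other runs of the same model.
    if same_model_costs:
        return mean(same_model_costs), "cross-run"
    # 3. No history at all: synthetic token counts times per-token prices.
    synthetic = (
        SYNTHETIC_INPUT_TOKENS * input_price
        + SYNTHETIC_OUTPUT_TOKENS * output_price
    )
    return synthetic, "synthetic"
```

On a machine with no stored trials both sequences are empty, so every estimate comes from the third branch.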
Two problems surfaced within the first week:
Cold-install penalty
A new machine with no prior trials always hit the synthetic fallback. The 1500/150 numbers were copy-pasted from a generic chat-completion estimate and had no relation to the OASIS task, which sends a single image plus a short paper-verbatim prompt and expects a one-sentence reasoning string. The fallback systematically over-quoted by 3–5× compared to actual billed cost.
Empirical history was almost free to skip
Once actual token usage was measured (see the calibration section below), it became clear that the OASIS task is so structurally tight that per-model variance does not justify the bookkeeping overhead. A single global token-count pair handles every model within ~10%. The empirical pipeline existed to recover something that did not need recovering.
Phase 4b — calibrated tokens × live pricing (current)
Phase 4a was replaced with two changes, implemented in src/oasis_llm/estimates.py:
- Calibrate the token counts once and hard-code them. Run a SQL query over historical trials and cement the result as module constants.
- Source pricing live from OpenRouter for OpenRouter-routed models; fall back to LiteLLM’s static price table for everything else; return "free" for Ollama.
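The two changes combine into a single multiplication per trial: fixed token counts times whichever price source applies to the provider. The sketch below is a minimal illustration under stated assumptions. The constant values reuse the calibration means reported on this page (the real constants in estimates.py may differ), and the function name, the lookup callables, and the `(cost, source)` return shape are hypothetical.

```python
from typing import Callable, Optional, Tuple

# Assumed values: the calibration means reported on this page.
EST_INPUT_TOKENS = 544
EST_OUTPUT_TOKENS = 31

Prices = Tuple[float, float]  # (USD per input token, USD per output token)
PriceLookup = Callable[[str], Optional[Prices]]


def sketch_cost_per_trial(
    provider: str,
    model: str,
    openrouter_lookup: PriceLookup,
    litellm_lookup: PriceLookup,
) -> Tuple[Optional[float], str]:
    """Return (USD per trial or None, price source label)."""
    # Local Ollama models are never billed.
    if provider == "ollama":
        return 0.0, "free"
    # Only OpenRouter-routed models consult the live OpenRouter prices;
    # every other provider uses the static LiteLLM table.
    if provider == "openrouter":
        prices, source = openrouter_lookup(model), "openrouter"
    else:
        prices, source = litellm_lookup(model), "litellm"
    if prices is None:
        return None, "unknown"
    cost = EST_INPUT_TOKENS * prices[0] + EST_OUTPUT_TOKENS * prices[1]
    # The page treats a zero computed price as unknown rather than free.
    if cost == 0:
        return None, "unknown"
    return cost, source
```

Passing the lookups in as callables keeps the arithmetic testable without network access or a LiteLLM install.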
Calibration data: n=10,598 trials
The calibration query ran against 10,598 completed trials across nine vision models on the OASIS valence/arousal rating task with capture_reasoning=true and cache_buster=true.
| Quantity | Mean | Median | σ | Range |
|---|---|---|---|---|
| Input tokens / trial | 544 | 529 | 51 | 379 – 652 |
| Output tokens / trial | 31 | 32 | 8 | 2 – 256 |
The resulting constants are hard-coded at estimates.py:48.
If the prompt template changes (for example, to support multi-image trials), recalibrate by re-running the SQL in scripts/. The constants in estimates.py are deliberately easy to edit. Update them when the new mean differs from the current constant by more than ~5%.
Reproducing the calibration
Run the calibration query against data/llm_runs.duckdb (or a snapshot if the dashboard holds the lock).
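Once the query has returned per-trial token counts, the ~5% update rule reduces to a relative-difference check. The helpers below are a hypothetical sketch of that check, not code from the repository; the names `summarize` and `needs_recalibration` are invented for illustration.

```python
from statistics import mean, median, pstdev
from typing import Dict, Sequence


def summarize(token_counts: Sequence[int]) -> Dict[str, float]:
    """Mean, median, and population standard deviation of per-trial tokens."""
    return {
        "mean": mean(token_counts),
        "median": median(token_counts),
        "sigma": pstdev(token_counts),
    }


def needs_recalibration(
    new_mean: float, current_constant: float, tolerance: float = 0.05
) -> bool:
    """True when the fresh mean drifts more than `tolerance` from the constant."""
    return abs(new_mean - current_constant) / current_constant > tolerance
```

For example, against a constant of 544, a fresh mean of 550 is a drift of about 1% and needs no change, while a fresh mean of 600 is a drift of about 10% and should trigger an update.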
Live OpenRouter pricing
OpenRouter publishes per-model pricing at /api/v1/models. The harness fetches it with urllib, caches the response for one hour, and indexes into it by both the bare model id and the openrouter/ prefixed form, so whichever way the model name appears in your config will resolve.
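A minimal sketch of the fetch, cache, and dual-index steps is shown below. It is not the harness's code: the class and function names are hypothetical, the `https://openrouter.ai` host and the `data`, `id`, and `pricing.completion` field names are assumptions about the response shape (only `pricing.prompt` is mentioned on this page), and the fetcher and clock are injectable so the logic can be exercised offline.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, Optional, Tuple

MODELS_URL = "https://openrouter.ai/api/v1/models"  # assumed host
TTL_SECONDS = 3600  # one hour, as stated on this page

Prices = Tuple[float, float]  # (USD per input token, USD per output token)


def build_price_index(payload: dict) -> Dict[str, Prices]:
    """Index prices under both the bare id and the openrouter/ prefixed id."""
    index: Dict[str, Prices] = {}
    for entry in payload.get("data", []):
        model_id = entry.get("id")
        pricing = entry.get("pricing") or {}
        try:
            prices = (float(pricing["prompt"]), float(pricing["completion"]))
        except (KeyError, TypeError, ValueError):
            continue  # skip entries without parseable prices
        if not model_id:
            continue
        index[model_id] = prices
        index["openrouter/" + model_id] = prices
    return index


def fetch_payload(url: str = MODELS_URL, timeout: float = 10.0) -> dict:
    """Download and decode the model list with the standard library only."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


class PriceCache:
    """Serve a price index, refreshing it when older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[], dict] = fetch_payload,
        ttl: float = TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._index: Optional[Dict[str, Prices]] = None
        self._fetched_at = 0.0

    def get(self) -> Dict[str, Prices]:
        now = self._clock()
        expired = self._index is None or now - self._fetched_at >= self._ttl
        if expired:
            try:
                fresh = build_price_index(self._fetch())
            except (OSError, ValueError):
                # Keep serving the previous index and leave the timestamp
                # untouched, so the next call retries the fetch.
                return self._index or {}
            self._index, self._fetched_at = fresh, now
        return self._index
```

Storing each price under two keys costs a little memory but removes any need to normalise model names at lookup time.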
Three behaviours fell out of validating the live pricing implementation:
Retain previous cache on transient network failure
The first implementation cleared the cache to None whenever the fetch raised. A single-second network blip wiped the cost column for every Experiments preview until the next user-triggered refresh. The fix: return the existing cache on failure and only update the timestamp on success. The user-facing price stays accurate through brief outages; a re-fetch happens on the next call after the TTL expires.
Treat zero-priced models as unknown
A handful of OpenRouter entries have pricing.prompt = "0" (free routes, deprecated entries). A zero computed price is treated as "unknown" rather than "free", because the harness already has a distinct "free" source for Ollama, and a deprecated OpenRouter entry silently quoting $0 would be misleading. The user sees an indicator when any model in the config returns unknown.
Skip OpenRouter API for non-OpenRouter providers
Anthropic, OpenAI, and Google models that go via LiteLLM directly should not consult OpenRouter pricing. Their bare model ids coincidentally collide with OpenRouter route names and would resolve to the wrong price. The provider field in RunConfig gates this: estimate_cost_per_trial only calls _openrouter_cost_per_trial when provider == "openrouter".
Estimate quality vs reality
The Phase 4b constants validate against actual billed cost for the reference models the harness is exercised against most:
| Model | Phase 4b estimate | Empirical mean (n>500) | Delta |
|---|---|---|---|
| openai/gpt-4o | $0.0018 / trial | $0.00184 / trial | +2% |
| anthropic/claude-sonnet-4.6 | $0.0022 / trial | $0.00242 / trial | −9% |
| google/gemma-4-31b-it (OpenRouter) | $0.000087 / trial | $0.0000867 / trial | <1% |
Live in-progress projection uses empirical mean
The pre-launch estimate is the only place that uses calibrated constants. Once a run starts, the Runs page detail view computes projected_total_usd as cost_so_far + (remaining × empirical mean cost per done trial). Empirical data is free once trials start landing and is strictly more accurate than the calibrated estimate, so it takes precedence automatically.
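The projection formula above can be written as a short function. This is a sketch rather than the Runs page implementation: the parameter names are hypothetical, and falling back to the calibrated per-trial cost while no trial has finished is an assumption added so the function is defined before the first result lands.

```python
def projected_total_usd(
    cost_so_far: float,
    done_trials: int,
    total_trials: int,
    calibrated_cost_per_trial: float,
) -> float:
    """cost_so_far + remaining trials x mean cost per completed trial."""
    remaining = max(total_trials - done_trials, 0)
    if done_trials > 0:
        # Empirical mean takes over as soon as any trial has completed.
        per_trial = cost_so_far / done_trials
    else:
        # Assumed fallback: use the pre-launch calibrated estimate.
        per_trial = calibrated_cost_per_trial
    return cost_so_far + remaining * per_trial
```

With $0.50 spent on 100 of 400 trials, the empirical mean is $0.005 per trial and the projection is $0.50 + 300 × $0.005 = $2.00.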
Latency forecasting was not solved
A pre-launch time estimate was deliberately not shipped. Cold-start latency for Ollama varies by 10× depending on whether the model is already warm, and for OpenRouter it varies with provider routing decisions outside the harness’s control. The Runs detail view shows a live ETA from a 20-trial rolling window once the run starts; the Experiments page shows the placeholder text “shown live on Runs once started” instead.
What was deliberately not done
- Per-model token calibration. Per-model means cluster tightly enough that a single global pair is within ~10%. Maintaining nine separate constants is not worth the bookkeeping.
- Latency forecasting at experiment-design time. Cold-start variance is too high for the number to be useful. Live ETA on the Runs page covers the in-flight case.
- A “refresh prices” button in the UI. The 1-hour TTL refreshes automatically. Adding a button would invite spurious clicks without providing a clear signal distinguishing “cached” from “stale.”