Cost Estimation: From Empirical Model to Token Pricing

The Experiments page shows a cost estimate before you press start. Getting that number right on a fresh machine — with no prior run history — turned out to require abandoning the first implementation entirely. This page documents what Phase 4a tried to do, why it was wrong, what Phase 4b replaced it with, and the calibration data that makes day-one estimates accurate to within ~10%.

Phase 4a — empirical-first (abandoned)

The first implementation computed mean cost / trial from each run’s own completed trials, falling back across runs of the same model when the current run had fewer than a minimum history threshold. When even the cross-run history was empty, a synthetic fallback used 1500 input tokens / 150 output tokens multiplied by litellm.model_cost. Two problems surfaced within the first week:

Cold-install penalty

A new machine with no prior trials always hit the synthetic fallback. The 1500/150 numbers were copy-pasted from a generic chat-completion estimate and had no relation to the OASIS task, which sends a single image plus a short paper-verbatim prompt and expects a one-sentence reasoning string. The fallback systematically over-quoted by 3–5× compared to actual billed cost.
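The size of the over-quote follows largely from the token ratios. A rough check, assuming placeholder per-token prices (not any provider's real rates) and the measured means from the calibration table further down this page:

```python
# Compare the Phase 4a synthetic fallback with the measured token means.
# Prices passed in are illustrative placeholders, not real provider rates.
FALLBACK_IN, FALLBACK_OUT = 1500, 150   # Phase 4a synthetic fallback
MEASURED_IN, MEASURED_OUT = 544, 31     # means from the calibration table


def cost(input_tokens: int, output_tokens: int,
         price_in: float, price_out: float) -> float:
    """Per-trial cost in USD given per-token prices."""
    return input_tokens * price_in + output_tokens * price_out


def overquote_ratio(price_in: float, price_out: float) -> float:
    """How many times larger the fallback quote is than the measured cost."""
    fallback = cost(FALLBACK_IN, FALLBACK_OUT, price_in, price_out)
    measured = cost(MEASURED_IN, MEASURED_OUT, price_in, price_out)
    return fallback / measured


if __name__ == "__main__":
    # Assumed price shape: output tokens cost 4x input tokens.
    print(round(overquote_ratio(1e-6, 4e-6), 2))
```

Whatever the price mix, the ratio is bounded by 1500/544 ≈ 2.8 (input-dominated pricing) and 150/31 ≈ 4.8 (output-dominated pricing), which is broadly consistent with the observed 3–5× over-quote.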

Empirical history was almost free to skip

Once actual token usage was measured (see the calibration section below), it became clear that the OASIS task is so structurally tight that per-model variance does not justify the bookkeeping overhead. A single global token-count pair handles every model within ~10%. The empirical pipeline existed to recover something that did not need recovering.

Phase 4b — calibrated tokens × live pricing (current)

Phase 4a was replaced with two changes, implemented in src/oasis_llm/estimates.py:

Calibrate the token counts once and hard-code them. Run a SQL query over historical trials and cement the result as module constants.
Source pricing live from OpenRouter for OpenRouter-routed models; fall back to LiteLLM’s static price table for everything else; return "free" for Ollama.

The implementation is fewer lines than the Phase 4a version it replaces.

Calibration data: n=10,598 trials

The calibration query ran against 10,598 completed trials across nine vision models on the OASIS valence/arousal rating task with capture_reasoning=true and cache_buster=true.

Quantity	Mean	Median	σ	Range
Input tokens / trial	544	529	51	379 – 652
Output tokens / trial	31	32	8	2 – 256

Per-model means cluster within the global dispersion: input mean ranges 449–606 across the nine models, output mean 30–40. The 256-token output maximum is a parsing-failure outlier where the model emitted a long free-text explanation before recovering — it does not move the mean materially. The shipped constants in estimates.py:48 are:

OASIS_INPUT_TOKENS = 560
OASIS_OUTPUT_TOKENS = 35

Both constants sit slightly above the global means (544 and 31) to avoid systematic under-quoting. The output constant of 35 is the midpoint of the 30–40 per-model range.
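As a worked example of how the constants turn into a dollar figure, here is a minimal sketch. The function name and the prices are hypothetical; the real lookup lives in estimates.py and may be structured differently.

```python
OASIS_INPUT_TOKENS = 560
OASIS_OUTPUT_TOKENS = 35


def cost_per_trial(price_in_per_token: float, price_out_per_token: float) -> float:
    """Calibrated token counts times per-token prices, in USD per trial."""
    return (OASIS_INPUT_TOKENS * price_in_per_token
            + OASIS_OUTPUT_TOKENS * price_out_per_token)


if __name__ == "__main__":
    # Hypothetical model priced at $2.50 per 1M input tokens
    # and $10.00 per 1M output tokens.
    print(f"${cost_per_trial(2.50e-6, 10.00e-6):.5f} / trial")
```

With those assumed prices the input side contributes 560 × $0.0000025 = $0.0014 and the output side 35 × $0.00001 = $0.00035, so input tokens dominate the estimate.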

If the prompt template changes — for example to support multi-image trials — recalibrate by re-running the SQL in scripts/. The constants in estimates.py are deliberately easy to edit. Update them when the new mean differs from the current constant by more than ~5%.
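The ~5% rule can be expressed as a small check. This is a sketch only; the helper name is hypothetical, and it assumes the difference is measured relative to the current constant.

```python
def needs_recalibration(current_constant: int, new_mean: float,
                        tolerance: float = 0.05) -> bool:
    """True when new_mean differs from the constant by more than tolerance.

    The difference is taken relative to the current constant (an assumption).
    """
    return abs(new_mean - current_constant) / current_constant > tolerance


if __name__ == "__main__":
    print(needs_recalibration(560, 544))    # measured input mean from this page
    print(needs_recalibration(560, 1090))   # hypothetical multi-image prompt
```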

Reproducing the calibration

SELECT
  count(*)                          AS n,
  round(avg(input_tokens),  1)      AS input_mean,
  round(stddev(input_tokens), 1)    AS input_sd,
  min(input_tokens)                 AS input_min,
  max(input_tokens)                 AS input_max,
  round(avg(output_tokens), 1)      AS output_mean,
  round(stddev(output_tokens), 1)   AS output_sd,
  min(output_tokens)                AS output_min,
  max(output_tokens)                AS output_max
FROM trials
WHERE status = 'done'
  AND input_tokens IS NOT NULL
  AND output_tokens IS NOT NULL;

Run this against data/llm_runs.duckdb (or a snapshot if the dashboard holds the lock).

Live OpenRouter pricing

OpenRouter publishes per-model pricing at /api/v1/models. The harness fetches it with urllib, caches the response for one hour, and indexes into it by both the bare model id and the openrouter/ prefixed form — whichever way the model name appears in your config will resolve. Three behaviours fell out of validating the live pricing implementation:

Retain previous cache on transient network failure

The first implementation reset the cache to None whenever the fetch raised. A one-second network blip wiped the cost column for every Experiments preview until the next user-triggered refresh. The fix: return the existing cache on failure and update the timestamp only on success. The user-facing price survives brief outages, and because the timestamp is left unchanged, the next call after the TTL has expired retries the fetch.
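The retain-on-failure behaviour can be sketched as below. The class and parameter names are hypothetical, and the fetch function and clock are injected so the example runs without network access; the harness itself fetches with urllib.

```python
import time
from typing import Callable, Optional

PriceTable = dict[str, dict[str, str]]


class PriceCache:
    """TTL cache that keeps the last good value when a refresh fails."""

    def __init__(self, fetch: Callable[[], PriceTable],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[PriceTable] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Optional[PriceTable]:
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl:
            return self._value  # still fresh, no fetch needed
        try:
            value = self._fetch()
        except Exception:
            # Transient failure: keep serving the previous value and leave the
            # timestamp alone, so the next call retries the fetch.
            return self._value
        # Success: only now replace the value and advance the timestamp.
        self._value = value
        self._fetched_at = now
        return self._value
```

On a cold start with a failing fetch, this sketch returns None, which a caller would surface as an unknown price.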

Treat zero-priced models as unknown

A handful of OpenRouter entries have pricing.prompt = "0" (free routes, deprecated entries). A zero computed price is treated as "unknown" rather than "free", because the harness already has a distinct "free" source for Ollama, and a deprecated OpenRouter entry silently quoting $0 for a paid run is worse than showing `≥ $X`. The user sees an indicator when any model in the config returns unknown.
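A minimal sketch of the lookup, covering both the dual-key index and the zero-price rule. The function names are hypothetical, and the entry shape (an id plus string-valued pricing.prompt and pricing.completion per-token prices) is an assumption about the /api/v1/models payload.

```python
from typing import Optional

INPUT_TOKENS, OUTPUT_TOKENS = 560, 35  # shipped calibration constants


def build_index(models: list[dict]) -> dict[str, dict]:
    """Index pricing by bare id and by the openrouter/ prefixed form."""
    index: dict[str, dict] = {}
    for model in models:
        pricing = model.get("pricing", {})
        index[model["id"]] = pricing
        index["openrouter/" + model["id"]] = pricing
    return index


def openrouter_cost(index: dict[str, dict], model: str) -> Optional[float]:
    """USD per trial, or None when the price is missing or zero (unknown)."""
    pricing = index.get(model)
    if pricing is None:
        return None
    try:
        price_in = float(pricing["prompt"])
        price_out = float(pricing["completion"])
    except (KeyError, TypeError, ValueError):
        return None
    cost = INPUT_TOKENS * price_in + OUTPUT_TOKENS * price_out
    # A zero price is reported as unknown, never as free.
    return cost if cost > 0 else None
```

Returning None rather than 0.0 lets the caller distinguish an unknown price from the "free" source reserved for Ollama.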

Skip OpenRouter API for non-OpenRouter providers

Anthropic, OpenAI, and Google models routed directly through LiteLLM should not consult OpenRouter pricing. Their bare model ids can coincidentally collide with OpenRouter route names and would then resolve to the wrong price. The provider field in RunConfig gates this: estimate_cost_per_trial only calls _openrouter_cost_per_trial when provider == "openrouter".
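The gating can be sketched as a dispatch on the provider field. This is an illustration under assumptions, not the code in estimates.py: the function name, the injected lookup callables, the "ollama" provider string, and the "unknown" label for an OpenRouter miss are all hypothetical.

```python
from typing import Callable, Optional

Lookup = Callable[[str], Optional[float]]


def estimate_with_gating(provider: str, model: str,
                         openrouter_lookup: Lookup,
                         litellm_lookup: Lookup) -> tuple[Optional[float], str]:
    """Return (usd_per_trial, source) for one model."""
    if provider == "ollama":
        return 0.0, "free"
    if provider == "openrouter":
        # Only OpenRouter-routed models consult the live price list.
        cost = openrouter_lookup(model)
        return (cost, "openrouter") if cost is not None else (None, "unknown")
    # Everything else uses the static table, even if the bare id
    # also happens to exist as an OpenRouter route.
    cost = litellm_lookup(model)
    return (cost, "litellm") if cost is not None else (None, "unknown")


if __name__ == "__main__":
    # Placeholder prices: the same id resolves differently per provider.
    live = {"gpt-4o": 0.001}.get
    static = {"gpt-4o": 0.002}.get
    print(estimate_with_gating("openai", "gpt-4o", live, static))
    print(estimate_with_gating("openrouter", "gpt-4o", live, static))
```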

Estimate quality vs reality

The Phase 4b constants validate against actual billed cost for the reference models the harness is exercised against most:

Model	Phase 4b estimate	Empirical mean (n>500)	Delta (estimate vs. empirical)
`openai/gpt-4o`	$0.0018 / trial	$0.00184 / trial	−2%
`anthropic/claude-sonnet-4.6`	$0.0022 / trial	$0.00242 / trial	−9%
`google/gemma-4-31b-it` (OpenRouter)	$0.000087 / trial	$0.0000867 / trial	<1%

Three models is not a thorough validation, but the ~10% target is met for all of them and the systematic 3–5× over-quoting from Phase 4a’s 1500/150 fallback is gone.
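The deltas can be recomputed from the two cost columns. The sketch below assumes the delta is the signed relative error (estimate − empirical) / empirical; that sign convention is an assumption, and the helper names are hypothetical.

```python
# (model, Phase 4b estimate, empirical mean), USD per trial, from the table.
ROWS = [
    ("openai/gpt-4o", 0.0018, 0.00184),
    ("anthropic/claude-sonnet-4.6", 0.0022, 0.00242),
    ("google/gemma-4-31b-it", 0.000087, 0.0000867),
]


def signed_delta(estimate: float, empirical: float) -> float:
    """Relative error of the estimate; negative means under-quoting."""
    return (estimate - empirical) / empirical


def within_target(estimate: float, empirical: float, target: float = 0.10) -> bool:
    """True when the estimate is within the ~10% accuracy target."""
    return abs(signed_delta(estimate, empirical)) <= target


if __name__ == "__main__":
    for model, estimate, empirical in ROWS:
        delta = signed_delta(estimate, empirical)
        print(f"{model}: {delta:+.1%} within target: {within_target(estimate, empirical)}")
```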

Live in-progress projection uses empirical mean

The pre-launch estimate is the only place that uses calibrated constants. Once a run starts, the Runs page detail view computes projected_total_usd as cost_so_far + (remaining × empirical mean cost per done trial). Empirical data is free once trials start landing and is strictly more accurate than the calibrated estimate, so it takes precedence automatically.
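The projection formula can be sketched as follows. The function signature is hypothetical, and it assumes cost_so_far is the summed cost of the done trials; returning None before any trial has finished is also an assumption.

```python
from typing import Optional


def projected_total_usd(cost_so_far: float, done_trials: int,
                        remaining_trials: int) -> Optional[float]:
    """cost_so_far + remaining x empirical mean cost per done trial."""
    if done_trials <= 0:
        return None  # no empirical data yet
    mean_cost = cost_so_far / done_trials
    return cost_so_far + remaining_trials * mean_cost


if __name__ == "__main__":
    # 100 trials done for $0.50 in total, 300 still to run.
    print(projected_total_usd(0.50, 100, 300))
```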

Latency forecasting was not solved

A pre-launch time estimate was deliberately not shipped. Cold-start latency for Ollama varies by 10× depending on whether the model is already warm, and for OpenRouter it varies with provider routing decisions outside the harness’s control. The Runs detail view shows a live ETA from a 20-trial rolling window once the run starts; the Experiments page shows the placeholder text “shown live on Runs once started” instead.
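A rolling-window ETA of the kind described can be sketched with a bounded deque. The class is hypothetical, and it assumes trials run sequentially, so remaining time is remaining trials times the mean duration of the last 20.

```python
from collections import deque
from typing import Optional


class RollingEta:
    """ETA from the mean duration of the most recent trials."""

    def __init__(self, window: int = 20) -> None:
        # A deque with maxlen drops the oldest duration automatically.
        self._durations: deque = deque(maxlen=window)

    def record(self, seconds: float) -> None:
        """Record the wall-clock duration of one finished trial."""
        self._durations.append(seconds)

    def eta_seconds(self, remaining_trials: int) -> Optional[float]:
        """Estimated seconds left, or None before any trial has finished."""
        if not self._durations:
            return None
        mean = sum(self._durations) / len(self._durations)
        return remaining_trials * mean
```

Because only the latest window counts, slow cold-start trials stop influencing the ETA once 20 warm trials have landed.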

What was deliberately not done

Per-model token calibration. Per-model means cluster tightly enough that a single global pair is within ~10%. Maintaining nine separate constants is not worth the bookkeeping.
Latency forecasting at experiment-design time. Cold-start variance is too high for the number to be useful. Live ETA on the Runs page covers the in-flight case.
A “refresh prices” button in the UI. The 1-hour TTL refreshes automatically. Adding a button would invite spurious clicks without providing a clear signal distinguishing “cached” from “stale.”
