How OASIS-LLM Documents Research Discoveries

The pages in this group are not reference documentation. They are post-hoc research notes for the non-trivial things this harness has run into: reasoning-capture failures on Gemma 4, the cost-estimation rewrite, the Ollama three-bug stack, and the cache-buster design. Each one started with a contradiction between an expected behaviour and an observed one, and ended with either a code change, an operational workaround, or both. This preface describes the protocol used. It is short on purpose — the value is in the sagas themselves, not in the meta-process.

The 6-step protocol

Reproduce, then narrow

A bug we cannot reproduce is a bug we cannot fix. Before any code change, the investigation requires a deterministic repro: the exact config, the exact model, the exact log line. This is why every saga page opens with The symptom — quoted log output or a numerical anomaly with timestamps — rather than a hypothesis.

Read the primary source

When the failure involves an upstream component (LiteLLM, Ollama, OpenRouter, DuckDB), read its issue tracker before forming a theory. Each saga cites the upstream issues it relied on by number. Two-thirds of the time, the problem being debugged is already filed and partially diagnosed by somebody with a better repro.

Contradict the obvious explanation

The first plausible-sounding theory is usually wrong. The Ollama investigation opens with a “cold-start latency” theory that the data immediately falsified (see Ollama operations). Dead ends are kept in the document. Future readers should be able to follow the same path of elimination instead of just landing at the answer.

Quantify

Every claim in these sagas is paired with a number drawn from a real run, a real log, or a real benchmark — not an invented example. Where calibration matters, the sample size (e.g. n=10,598 trials) and the dispersion (σ=51) are cited, not just the mean. This is the difference between “the model is slow” and “gemma4:e4b p95 was 6.1s on n=6,133 trials, then degraded to >60s after 17:58 once memory pressure crossed an inflection point.”
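As a minimal sketch of what reporting sample size and dispersion alongside the mean can look like, the snippet below summarizes a list of latencies. The function name `summarize`, the nearest-rank percentile method, and the sample values are all assumptions made for illustration. None of them come from the harness or from a real run.

```python
import math
import statistics

def summarize(latencies_s):
    """Return n, mean, sample standard deviation, and nearest-rank p95."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two samples to report dispersion")
    ordered = sorted(latencies_s)
    n = len(ordered)
    # Nearest-rank percentile: the smallest value with at least 95% of
    # the samples at or below it.
    rank = math.ceil(0.95 * n)
    return {
        "n": n,
        "mean": statistics.fmean(ordered),
        "stdev": statistics.stdev(ordered),
        "p95": ordered[rank - 1],
    }

# Hypothetical latencies in seconds, not taken from any OASIS-LLM run.
sample = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 1.2, 1.1, 9.5]
summary = summarize(sample)
print(
    f"n={summary['n']} mean={summary['mean']:.2f}s "
    f"stdev={summary['stdev']:.2f}s p95={summary['p95']:.1f}s"
)
```

A single slow outlier pulls the mean well above the typical value here, which is why the tail percentile and the sample size are reported with it.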

Ship the smallest correct fix

Where possible, the implemented fix is one of:

A code change — the smallest diff that repairs the contract the upstream component breaks. Counter-examples (large rewrites that “would also fix unrelated things”) are explicitly rejected.
An operational workaround (an environment variable, a kill command, a config flag) when the upstream component is the right place for the fix but its code isn't owned here. The workaround is documented together with the upstream issue it depends on, so it can be retired once that issue is resolved.
A documentation change — when the failure is a foot-gun rather than a defect, surfacing it in the docs is the correct fix.

Each saga also lists what was deliberately not done, so future readers do not have to reverse-engineer the choice.

Make the fix self-documenting

Each saga ends with two sections:

Reproducing the diagnostic — the commands a future maintainer should run when they see the same symptom for the first time.
References — the upstream issues, the relevant source modules, the SQL queries used.

If a fix requires a magic environment variable or an obscure terminal command, the saga is the canonical place to find it. Reference pages (Configuration, Quickstart) link back here rather than duplicating the explanation.

What lives in this group

Reasoning capture

The Gemma 4 saga: how required-schema reasoning broke smaller models, and the prompt-rewrite fix that ships today.

Cost estimation

Why the empirical-first cost model (Phase 4a) was abandoned in favour of calibrated tokens × live OpenRouter pricing (Phase 4b).

Ollama operations

The 60s-timeout investigation: a “Stopping…” deadlock, macOS unified-memory pressure, and a confirmed Flash Attention bug stacked together.

Cache buster

Per-sample salts that force decoding variance at temperature=0 without invalidating prefix caching.
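To make the idea concrete, here is a rough sketch of one way a per-sample salt could be attached. It assumes the salt is appended after the shared prompt text so that the common prefix stays byte-identical across samples. The function `salted_prompt`, the `run_id` parameter, the `[sample-id: ...]` marker, and the prompt text are hypothetical. The actual design is described on the Cache buster page.

```python
import hashlib

def salted_prompt(shared_prefix: str, run_id: str, sample_index: int) -> str:
    """Append a deterministic per-sample salt after the shared prefix."""
    digest = hashlib.sha256(f"{run_id}:{sample_index}".encode("utf-8")).hexdigest()
    salt = digest[:8]
    # Assumption: placing the salt last leaves the shared prefix unchanged,
    # so a prefix cache keyed on the leading text can still be reused.
    return f"{shared_prefix}\n[sample-id: {salt}]"

# Hypothetical prompt text, used only for illustration.
prefix = "You are a careful rater. Answer with a single number."
prompts = [salted_prompt(prefix, "demo-run", i) for i in range(3)]

for p in prompts:
    print(p.splitlines()[-1])
```

Deriving the salt from the run identifier and sample index, instead of from a random source, keeps each sample reproducible while still making every request distinct.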

What does not belong here

Configuration knobs and their defaults — those live in Configuration.
Step-by-step usage instructions — those live in Quickstart.
Trial-schema and runner-state reference — those live in Workflow.

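Discovery pages cite reference pages for definitions and defaults; reference pages link back here only for the explanation behind a fix or workaround, and never duplicate it. This keeps the reference docs short and stable while giving the sagas room to be specific about what they actually saw.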
