On a single afternoon, an Ollama-backed run that had been completing trials in 4–8 seconds for hours degraded to back-to-back 60-second timeouts. The naive explanation — “the model is slow” — was wrong. The same model on the same hardware had completed 6,133 prior trials at p95 ≈ 6.1 seconds and continued to report 100% GPU utilisation in ollama ps. Diagnostic work identified three independent defects stacked on top of each other. This page reconstructs the investigation, documents the operational workarounds that recover a healthy state, and lists the implementation work that has not yet been done.

The symptom

A run that had been healthy all morning began producing log entries like:
2025-11-XX 17:58:01 trial 12455 timed out after 1m0s (litellm.Timeout)
2025-11-XX 17:58:14 trial 12456 timed out after 1m0s (litellm.Timeout)
2025-11-XX 17:58:27 trial 12457 timed out after 1m0s (litellm.Timeout)

2025-11-XX 18:04:09 trial 12728 timed out after 1m0s (litellm.Timeout)
273 trials timed out within six minutes. A single subsequent trial completed in 248,239 ms, a >40× slowdown from the day’s running mean. The retry-and-evict path in _worker activated (runner.py:434), the harness posted keep_alive=0 to the Ollama API, and the inference process appeared to obey: ollama ps reported Stopping…. Twenty minutes later, ollama ps still reported Stopping…. The Ollama runner process was using 99.7% CPU and 10.6 GB RSS, and no new trials landed.

By the day’s totals: 12,454 successful trials and 376 failed, with the failures concentrated in a six-minute window after seven hours of healthy operation.
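
The clustering described above can be checked by bucketing timeout lines per minute. The following is a minimal sketch, assuming the harness log uses exactly the line format shown in the excerpt; `timeouts_per_minute` and `TIMEOUT_RE` are illustrative names, not part of the harness.

```python
import re
from collections import Counter

# Matches lines such as:
#   2025-11-XX 17:58:01 trial 12455 timed out after 1m0s (litellm.Timeout)
# The date field is skipped because it is redacted in the excerpt.
TIMEOUT_RE = re.compile(r"^\S+ (\d{2}:\d{2}):\d{2} trial (\d+) timed out after")


def timeouts_per_minute(lines):
    """Count timed-out trials per HH:MM bucket."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "2025-11-XX 17:58:01 trial 12455 timed out after 1m0s (litellm.Timeout)",
    "2025-11-XX 17:58:14 trial 12456 timed out after 1m0s (litellm.Timeout)",
    "2025-11-XX 17:58:27 trial 12457 timed out after 1m0s (litellm.Timeout)",
    "2025-11-XX 18:04:09 trial 12728 timed out after 1m0s (litellm.Timeout)",
]
for minute, count in sorted(timeouts_per_minute(sample).items()):
    print(minute, count)
```

A healthy run produces an empty or near-empty counter; a failure window like this one shows up as a few adjacent minutes carrying almost all of the counts.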

Why the cold-start theory is wrong

The first plausible-sounding hypothesis was that the model had been evicted between 17:42 and 17:58 and was paying re-load cost on every trial. Two data points falsified this within minutes:
  • ollama ps continued to list gemma4:e4b as loaded throughout the failure window, with 100% GPU and no Stopping… flag yet.
  • ~/.ollama/logs/server.log showed the vision encoder running in 17 ms on every failed trial. Encoder timing is sensitive to whether weights are resident; 17 ms means the model was warm.
The timeline reconstruction pointed the same way:
| Time | Latency p50 | Latency p95 | Failure rate |
| --- | --- | --- | --- |
| 09:00 – 17:42 | 4.1 s | 6.1 s | 0.05% |
| 17:58 – 18:04 | (timed out) | 60 s | ~98% |
The model was loaded and the encoder was running — timeouts were happening in the generation phase of an otherwise-warm model. That is not what cold start looks like.
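
A per-window summary like the one in the timeline can be computed from raw trial data. This is a sketch under stated assumptions: latencies are given in seconds for completed trials only, timeouts are counted separately rather than folded into the percentiles, and the nearest-rank percentile definition is used. The function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise_window(latencies_s, timeout_count):
    """Summarise one time window from completed latencies and a timeout count."""
    total = len(latencies_s) + timeout_count
    return {
        "p50_s": percentile(latencies_s, 50) if latencies_s else None,
        "p95_s": percentile(latencies_s, 95) if latencies_s else None,
        "failure_rate": timeout_count / total if total else 0.0,
    }
```

Comparing the summary for the healthy window against the failure window makes the shape of the incident visible: a cold start would raise latency while trials still complete, whereas here almost nothing completes at all.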

The three stacked bugs

Bug 1 — "Stopping…" deadlock

_evict_ollama_model in runner.py:306 requests eviction by POST-ing keep_alive=0 to the Ollama API. This is a soft eviction: it tells Ollama to unload the model after the current request finishes. If the current request is itself stuck (because of Bug 2 or Bug 3), the Ollama runner process enters a Stopping… state from which it never returns.

Symptoms: ollama ps reports Stopping… indefinitely. The Ollama runner PID is pegged at ~99% CPU. New requests queue but do not start. The harness’s own retry path then times out against the same stuck runner, because the API endpoint is still up and accepts new connections.
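
For reference, a soft eviction request looks roughly like the sketch below. It is not the harness’s _evict_ollama_model; it assumes Ollama is listening on its default address and uses the /api/generate endpoint with keep_alive set to 0. `build_evict_request` and `soft_evict` are hypothetical names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default listen address


def build_evict_request(model, base_url=OLLAMA_URL):
    """Build a POST that asks Ollama to unload `model` once it is idle."""
    payload = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def soft_evict(model, timeout_s=10.0):
    """Send the eviction request; True means the API accepted it, not that the model unloaded."""
    with urllib.request.urlopen(build_evict_request(model), timeout=timeout_s) as resp:
        return resp.status == 200
```

The important detail is in the second docstring: a successful response only confirms that the request was accepted. Whether the model actually unloads depends on the in-flight request finishing, which is exactly what fails in this bug.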

Bug 2 — macOS unified-memory pressure

With ~50 GB physical RAM on an M-series machine, a multi-GB model plus Streamlit dashboard plus DuckDB plus normal desktop processes fits comfortably for hours. After 12,000+ trials the system hits a pressure threshold where the Metal compute graph starts touching swapped pages.

vm_stat snapshots told the actual story:
17:00     Pages free:     2,103,449   Swapouts:    0
17:30     Pages free:     1,220,118   Swapouts:    0
17:45     Pages free:       412,302   Swapouts:   42,118
18:00     Pages free:        58,402   Swapouts:  504,282
18:15     Pages free:        58,011   Swapouts:  837,610
                                       Pages purged:  1,988,377
By 18:00, 228 MB of 51.5 GB was free and the system had performed half a million swap-out operations. The encoder ran fast (its working set is small), but generation, which streams through the KV cache, stalled behind disk I/O. The gradient matters more than the snapshot: between 17:00 and 18:00 pages free fell roughly 36-fold, and swapouts went from 0 at 17:30 to over 500,000 at 18:00.
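
To make the gradient explicit, the snapshots can be turned into per-interval deltas. The sketch below uses only the numbers from the vm_stat table above; the `Snapshot` type and `deltas` helper are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    time: str
    pages_free: int
    swapouts: int


# Values copied from the vm_stat snapshots in this section.
SNAPSHOTS = [
    Snapshot("17:00", 2_103_449, 0),
    Snapshot("17:30", 1_220_118, 0),
    Snapshot("17:45", 412_302, 42_118),
    Snapshot("18:00", 58_402, 504_282),
    Snapshot("18:15", 58_011, 837_610),
]


def deltas(snapshots):
    """Yield (start, end, change in pages free, new swapouts) for each interval."""
    for prev, cur in zip(snapshots, snapshots[1:]):
        yield (
            prev.time,
            cur.time,
            cur.pages_free - prev.pages_free,
            cur.swapouts - prev.swapouts,
        )


for start, end, d_free, d_swap in deltas(SNAPSHOTS):
    print(f"{start} -> {end}: pages free {d_free:+,}  swapouts {d_swap:+,}")
```

Read this way, the 17:45 to 18:00 interval stands out: it accounts for 462,164 new swapouts, while the free-page count has little left to lose.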

Bug 3 — Flash Attention on Gemma 4 (upstream)

With OLLAMA_FLASH_ATTENTION=1 (the default in current Ollama builds), Gemma 4 silently runs its compute graph on the CPU on some M-series Metal builds, despite ollama ps continuing to report 100% GPU.

This bug interacts multiplicatively with Bug 2: a CPU-spilled compute graph is far more memory-bandwidth-sensitive than a GPU graph, so the inflection point where memory pressure becomes a problem arrives much earlier than it would on a healthy GPU path.

Confirmed upstream issues:
  • #15237 — Gemma 3/4 garbage output with Flash Attention on Apple Silicon.
  • #15368 — Flash Attention silently spills to CPU; ollama ps reports 100% GPU.
  • #15350 — 3–5× speedup reported by users who set OLLAMA_FLASH_ATTENTION=0.
  • PR #15378 (Ollama v0.20.4) addressed part of #15237 but did not close #15368. Issue #15634 tracks the unresolved CPU-spill case on Apple Silicon.

Why morning-fine and afternoon-broken makes sense

The initial reaction was that a code change must have regressed somewhere between 17:42 and 17:58. There was no such change. The runner started the day with Flash Attention enabled (Bug 3 latent) and ran for seven hours because available memory absorbed the elevated CPU traffic. As trials accumulated, memory pressure (Bug 2) crossed an inflection point. The first 60-second timeout triggered the eviction path, which raced an in-flight stalled inference and produced the deadlock (Bug 1). After that, every retry hit the deadlocked runner and timed out.

No single fix would have fully prevented the failure, but each one changes its character. Fixing Bug 3 (OLLAMA_FLASH_ATTENTION=0) makes Bug 2 much less likely to trigger. Fixing Bug 1 (hard-kill the runner on Stopping…) makes Bug 2 recoverable instead of terminal. Fixing Bug 2 alone is hardest because the macOS memory manager is not under the harness’s control.

Operational workarounds

These are the commands to run today if the same symptom appears.

Disable Flash Attention for Ollama

launchctl setenv OLLAMA_FLASH_ATTENTION 0
osascript -e 'quit app "Ollama"'
open -a Ollama
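launchctl setenv applies to the current login session only and does not survive a reboot, so re-run it after restarting the machine. Verify the setting took effect: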
ps -E -p $(pgrep -f 'ollama serve') | tr ' ' '\n' | grep FLASH
# OLLAMA_FLASH_ATTENTION=0
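
The same check can be done from Python, for example as a building block for a harness-side guard. This is a sketch only: it parses text in the shape that `ps -E` prints (whitespace-separated NAME=value tokens) and does not run any command itself. `flash_attention_setting` is a hypothetical helper.

```python
def flash_attention_setting(ps_output):
    """Return the OLLAMA_FLASH_ATTENTION value found in `ps -E` output, or None if unset."""
    for token in ps_output.split():
        name, sep, value = token.partition("=")
        if sep and name == "OLLAMA_FLASH_ATTENTION":
            return value
    return None
```

A result of None means the variable is not set in the server’s environment, in which case Ollama’s built-in default applies.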

Recover from a Stopping… deadlock

# 1. Identify the stuck runner (high RSS, not `ollama serve`)
ps -axm -o pid,rss,pcpu,command | grep '[o]llama runner'

# 2. Hard-kill it; `ollama serve` will respawn a fresh runner on next call
kill -9 <PID>

# 3. Confirm it is gone
ollama ps   # should be empty or omit the formerly-stuck model
This is safe: the OASIS runner has retry logic, so trials that were in flight at the moment of the kill are retried automatically.
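
The manual steps above can be scripted. The sketch below assumes the output columns of the `ps -axm -o pid,rss,pcpu,command` call in step 1 and that the stuck process has `ollama runner` in its command line, as the grep pattern implies. `find_ollama_runners` and `hard_kill` are hypothetical helpers, not part of the harness.

```python
import os
import signal


def find_ollama_runners(ps_output):
    """Parse `ps -o pid,rss,pcpu,command` output; return runner rows sorted by RSS, largest first."""
    runners = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4 or not parts[0].isdigit():
            continue  # header line or malformed row
        pid, rss_kib, pcpu, command = parts
        if "ollama runner" in command:
            runners.append((int(pid), int(rss_kib), float(pcpu), command))
    return sorted(runners, key=lambda row: row[1], reverse=True)


def hard_kill(pid):
    """Equivalent of `kill -9 <PID>`."""
    os.kill(pid, signal.SIGKILL)
```

Keeping the parse step separate from the kill step makes it possible to log the candidate PID, RSS and CPU figures before anything is terminated.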

Reduce concurrency on Ollama runs

Set max_concurrency: 1 in your run config. It is the safest setting for Ollama-backed runs on M-series hardware. Ollama serialises Metal kernel launches anyway, so the throughput gain from concurrency is small and the memory cost is multiplicative.

Pre-flight memory check

Before a long Ollama run, verify there is headroom:
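# Read the page size from the vm_stat header (16384 on Apple Silicon) instead of assuming 4096
vm_stat | awk '/page size of/ {ps = $8} /Pages free/ {print $3 * ps / 1024 / 1024, "MB free"}'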
A long run on gemma4:e4b should not start with less than ~8 GB free.
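
A scripted version of this check might look like the following sketch. It assumes the usual vm_stat layout (a header naming the page size, then a `Pages free:` row ending in a period) and takes the page size from that header, since it differs between machines. The 8 GB default mirrors the guidance above; the function names are hypothetical.

```python
import re


def free_bytes(vm_stat_output):
    """Return free memory in bytes, using the page size reported in the vm_stat header."""
    page_size = re.search(r"page size of (\d+) bytes", vm_stat_output)
    pages_free = re.search(r"^Pages free:\s+(\d+)\.", vm_stat_output, re.MULTILINE)
    if page_size is None or pages_free is None:
        raise ValueError("unrecognised vm_stat output")
    return int(page_size.group(1)) * int(pages_free.group(1))


def has_headroom(vm_stat_output, min_free_gb=8.0):
    """True if at least `min_free_gb` GiB is free according to vm_stat."""
    return free_bytes(vm_stat_output) >= min_free_gb * 1024**3
```

The text passed in would typically come from something like `subprocess.run(["vm_stat"], capture_output=True, text=True).stdout`.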

What is not yet implemented

The following work has not landed yet. Operational workarounds are sufficient for now; each item has a trade-off the team has not signed off on.
  • Hard-kill on Stopping…. A version of _evict_ollama_model that, after issuing the soft eviction, polls ollama ps and on detecting Stopping… for >N seconds calls kill -9 on the runner PID. This is the closest fix to Bug 1 but introduces a hard dependency on parsing ollama ps output, which is undocumented and has changed format twice.
  • Memory-pressure preflight. A check at run start that fails fast if vm_stat reports less than N GB free. Trade-off: false positives on machines with effective swap.
  • Periodic memory-pressure checkpoint. Inject a vm_stat read between trials and pause the run if pressure crosses a threshold. Trade-off: adds latency to a hot path.
  • Auto-set OLLAMA_FLASH_ATTENTION=0 on launch. The harness could refuse to start an Ollama run when it detects OLLAMA_FLASH_ATTENTION=1 on Apple Silicon. Not shipped because the upstream bug may be closed in a future Ollama release and the harness should not pin a workaround past its useful life.
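
As an illustration of the first item, the timing half of a hard-kill watchdog can be kept separate from the fragile output parsing. The sketch below is not implemented in the harness; it assumes only that the text `Stopping` appears somewhere in `ollama ps` output while a model is stuck, and `StoppingWatchdog` is a hypothetical name.

```python
import time


class StoppingWatchdog:
    """Track how long `ollama ps` output has continuously shown a Stopping state."""

    def __init__(self, limit_s, clock=time.monotonic):
        self.limit_s = limit_s
        self.clock = clock  # injectable so the timing logic can be tested
        self.first_seen = None

    def observe(self, ollama_ps_output):
        """Return True once Stopping has persisted for longer than limit_s seconds."""
        if "Stopping" not in ollama_ps_output:
            self.first_seen = None  # state cleared; reset the timer
            return False
        now = self.clock()
        if self.first_seen is None:
            self.first_seen = now
        return now - self.first_seen > self.limit_s
```

A caller would poll `ollama ps`, feed each output to `observe`, and escalate to a hard kill only when it returns True. Matching on a single substring rather than on column positions limits, but does not remove, the dependency on the undocumented output format.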

Reproducing the diagnostic

If a future run shows the same symptom, the following sequence rebuilds the evidence in this page:
# 1. Confirm the symptom: timeouts clustered in time, runner pegged
grep "timed out after" runner.log | tail -50

# 2. Inspect the runner state
ollama ps
ps -axm -o pid,rss,pcpu,command | grep '[o]llama'

# 3. Memory pressure
vm_stat | sed -n '1,12p'
sysctl vm.swapusage

# 4. Flash Attention setting
ps -E -p $(pgrep -f 'ollama serve') | tr ' ' '\n' | grep FLASH

# 5. Encoder vs generation timing in the Ollama log
grep -E '(image-encoder|generate)' ~/.ollama/logs/server.log | tail -50
If steps 1–3 show timeouts plus high RSS plus low free pages, you are looking at Bug 2 (and likely Bug 1 as a downstream consequence). If step 4 shows OLLAMA_FLASH_ATTENTION=1, you are also looking at Bug 3 — set it to 0 and restart Ollama before debugging further.
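
The decision rule in the paragraph above can be written down directly. This is a small sketch; the boolean inputs are whatever judgement the operator forms from steps 1 to 4, and `suspected_bugs` is a hypothetical helper.

```python
def suspected_bugs(timeouts_clustered, runner_high_rss, low_free_pages, flash_attention):
    """Map diagnostic observations to the bugs described on this page.

    `flash_attention` is the value of OLLAMA_FLASH_ATTENTION as a string, or None if unset.
    """
    suspects = []
    if timeouts_clustered and runner_high_rss and low_free_pages:
        suspects.append("Bug 2: unified-memory pressure")
        suspects.append("Bug 1: Stopping deadlock (likely downstream)")
    if flash_attention == "1":
        suspects.append("Bug 3: Flash Attention CPU spill")
    return suspects
```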

Config options relevant to Ollama runs

| Config field | Default | Effect |
| --- | --- | --- |
| request_timeout_s | 60 | Per-call timeout in seconds. Lowering this catches memory-pressure stalls earlier; raising it does not help with deadlock (Bug 1). |
| ollama_evict_threshold | 3 | Number of consecutive stall errors before the runner requests eviction and reload. |
| max_concurrency | 4 | Set to 1 for Ollama runs on M-series hardware to avoid amplifying memory pressure. |
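
As a compact way to read the options above, the sketch below models them as a dataclass with the listed defaults and derives the recommended Ollama variant. The `RunConfig` type and `for_ollama_on_apple_silicon` helper are illustrative; the harness’s real config format may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    request_timeout_s: int = 60      # per-call timeout in seconds
    ollama_evict_threshold: int = 3  # consecutive stall errors before eviction
    max_concurrency: int = 4         # parallel in-flight trials


def for_ollama_on_apple_silicon(config):
    """Return a copy with concurrency reduced to 1, as recommended for M-series hardware."""
    return replace(config, max_concurrency=1)
```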

References

Upstream issues

  • Ollama issues #15237, #15368, #15350 and #15634, and PR #15378, as described under Bug 3 above

Additional sources

  • ~/.ollama/logs/server.log — encoder vs generate timing per trial; available on your local machine
  • Your local DuckDB trial table — query trials grouped by minute to reproduce the failure-window timeline