On a single afternoon, an Ollama-backed run that had been completing trials in 4–8 seconds for hours degraded to back-to-back 60-second timeouts. The naive explanation, “the model is slow”, was wrong. The same model on the same hardware had completed 6,133 prior trials at p95 ≈ 6.1 seconds and continued to report 100% GPU utilisation in ollama ps. Diagnostic work identified three independent defects stacked on top of each other. This page reconstructs the investigation, documents the operational workarounds that recover a healthy state, and lists the implementation work that has not yet been done.
The symptom
A run that had been healthy all morning began logging 60-second timeouts on trial after trial. The eviction path in _worker then activated (runner.py:434), the runner posted keep_alive=0 to the Ollama API, and the inference process appeared to obey: ollama ps reported Stopping….
Twenty minutes later, ollama ps still reported Stopping…. The runner process was using 99.7% CPU and 10.6 GB RSS. No new trials landed.
By the day’s totals: 12,454 successful trials, 376 failed, with the failures concentrated in a six-minute window after seven hours of healthy operation.
Why the cold-start theory is wrong
The first plausible-sounding hypothesis was that the model had been evicted between 17:42 and 17:58 and was paying re-load cost on every trial. Two data points falsified this within minutes:
- ollama ps continued to list gemma4:e4b as loaded throughout the failure window, with 100% GPU and no Stopping… flag yet.
- ~/.ollama/logs/server.log showed the vision encoder running in 17 ms on every failed trial. Encoder timing is sensitive to whether weights are resident; 17 ms means the model was warm.
| Time | Latency p50 | Latency p95 | Failure rate |
|---|---|---|---|
| 09:00 – 17:42 | 4.1 s | 6.1 s | 0.05% |
| 17:58 – 18:04 | (timed out) | 60 s | ~98% |
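The figures in a table like this can be recomputed from raw trial records. The following is a minimal sketch, assuming each record carries a start time, a latency, and a success flag. The Trial class and window_stats function are hypothetical names for illustration, not part of the harness, and the p95 uses the nearest-rank method, which may differ from the method used to produce the table.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass(frozen=True)
class Trial:
    # Hypothetical record shape; the real trial table schema may differ.
    started_at: datetime
    latency_s: float
    ok: bool


def window_stats(trials, start, end):
    """Return (p50, p95, failure_rate) for trials with start <= started_at < end.

    Returns None when the window contains no trials.
    """
    window = [t for t in trials if start <= t.started_at < end]
    if not window:
        return None
    latencies = sorted(t.latency_s for t in window)
    n = len(latencies)
    p50 = median(latencies)
    # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
    rank = (95 * n + 99) // 100  # integer ceiling of 0.95 * n
    p95 = latencies[rank - 1]
    failure_rate = sum(1 for t in window if not t.ok) / n
    return p50, p95, failure_rate
```

Calling window_stats once for the healthy window and once for the failure window yields the two rows to compare.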
The three stacked bugs
Bug 1 — "Stopping…" deadlock
_evict_ollama_model in runner.py:306 requests eviction by POST-ing keep_alive=0 to the Ollama API. This is a soft eviction: it tells the runner to unload after the current request finishes. If the current request is itself stuck (because of Bug 2 or Bug 3), the runner enters a Stopping… state from which it never returns.
Symptoms: ollama ps reports Stopping… indefinitely. Runner PID pegged at ~99% CPU. New requests queue but do not start. The harness’s own retry path then times out against the same stuck runner because the API endpoint is up and accepts new connections.
Bug 2 — macOS unified-memory pressure
With ~50 GB physical RAM on an M-series machine, a multi-GB model plus Streamlit dashboard plus DuckDB plus normal desktop processes fits comfortably for hours. After 12,000+ trials the system hits a pressure threshold where the Metal compute graph starts touching swapped pages.
vm_stat snapshots told the actual story. By 18:00, 228 MB of 51.5 GB was free and the system had performed half a million swap-out operations. The encoder ran fast (its working set is small), but generation, which streams through the KV cache, stalled behind disk I/O. The gradient matters more than the snapshot: pages-free dropped two orders of magnitude and swapouts went from 0 to 500,000 in fifteen minutes.
Bug 3 — Flash Attention on Gemma 4 (upstream)
With OLLAMA_FLASH_ATTENTION=1 (the default in current Ollama builds), Gemma 4 silently runs its compute graph on the CPU on some M-series Metal builds, despite ollama ps continuing to report 100% GPU.
This bug interacts multiplicatively with Bug 2: a CPU-spilled compute graph is far more memory-bandwidth-sensitive than a GPU graph, so the inflection point where memory pressure becomes a problem arrives much earlier than it would on a healthy GPU path.
Confirmed upstream issues:
- #15237 — Gemma 3/4 garbage output with Flash Attention on Apple Silicon.
- #15368 — Flash Attention silently spills to CPU; ollama ps reports 100% GPU.
- #15350 — 3–5× speedup reported by users who set OLLAMA_FLASH_ATTENTION=0.
- PR #15378 (Ollama v0.20.4) addressed part of #15237 but did not close #15368. Issue #15634 tracks the unresolved CPU-spill case on Apple Silicon.
Why morning-fine and afternoon-broken makes sense
The initial reaction was that a code change must have regressed somewhere between 17:42 and 17:58. There was no such change. The runner started the day with Flash Attention enabled (Bug 3 latent) and ran for seven hours because available memory absorbed the elevated CPU traffic. As trials accumulated, memory pressure (Bug 2) crossed an inflection point. The first 60-second timeout triggered the eviction path, which raced an in-flight stalled inference and produced the deadlock (Bug 1). After that, every retry hit the deadlocked runner and timed out.
No single fix would have been enough on its own to rule out the failure, but each one changes the outcome. Fixing Bug 3 (OLLAMA_FLASH_ATTENTION=0) makes Bug 2 much less likely to trigger. Fixing Bug 1 (hard-kill the runner on Stopping…) makes Bug 2 recoverable instead of terminal. Fixing Bug 2 alone is hardest because the macOS memory manager is not under the harness’s control.
Operational workarounds
These are the commands to run today if the same symptom appears.
Disable Flash Attention for Ollama
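As a rough illustration, a Python launcher could start the Ollama server with Flash Attention disabled. This is a sketch under assumptions: the helper names env_without_flash_attention and start_ollama_server are hypothetical, and the approach only affects a server process that the script itself launches. An Ollama instance that is already running must be restarted with the variable set in its own environment.

```python
import os
import subprocess


def env_without_flash_attention(base_env=None):
    """Return a copy of the environment with OLLAMA_FLASH_ATTENTION forced to 0.

    The input mapping is not mutated.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_FLASH_ATTENTION"] = "0"
    return env


def start_ollama_server():
    """Start `ollama serve` as a child process with Flash Attention disabled.

    Assumes the `ollama` binary is on PATH and no other server holds the port.
    """
    return subprocess.Popen(["ollama", "serve"], env=env_without_flash_attention())
```

Building the environment in a separate function keeps the override easy to check without starting a server.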
Recover from a Stopping… deadlock
Killing the stuck runner is safe: the OASIS runner has retry logic, so trials that were in flight at the moment of the kill are retried automatically.
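A hard kill can also be issued from Python. The sketch below assumes the operator has already identified the stuck runner's PID (for example, the process pegged at ~99% CPU). The function name hard_kill is hypothetical and the harness does not currently ship it.

```python
import os
import signal


def hard_kill(pid: int) -> bool:
    """Send SIGKILL to the given PID.

    Returns True if the signal was delivered, False if no such process exists.
    A PermissionError is left to propagate so the operator sees it.
    """
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False
    return True
```

SIGKILL cannot be caught by the target process, which is the point here: a runner stuck in Stopping… has already failed to act on the soft eviction.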
Reduce concurrency on Ollama runs
Set max_concurrency: 1 in your run config. It is the safest setting for Ollama-backed runs on M-series hardware. Ollama serialises Metal kernel launches anyway, so the throughput gain from concurrency is small and the memory cost is multiplicative.
Pre-flight memory check
Before a long Ollama run, verify there is headroom. gemma4:e4b should not start with less than ~8 GB free.
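One way to script this check is to parse vm_stat output. The following is a minimal sketch, assuming the usual vm_stat layout (a header stating the page size, followed by "Name: count." lines). The function names are hypothetical. It counts only "Pages free", which understates what macOS can reclaim from inactive pages, so treat the result as a conservative signal.

```python
import re
import subprocess

PAGE_SIZE_RE = re.compile(r"page size of (\d+) bytes")
# Matches lines such as `Pages free:      14592.` or `"Translation faults":  123.`
COUNTER_RE = re.compile(r'^"?([^":]+)"?:\s+(\d+)\.?\s*$')


def parse_vm_stat(text):
    """Return (page_size_bytes, {counter_name: value}) from vm_stat output."""
    match = PAGE_SIZE_RE.search(text)
    if match is None:
        raise ValueError("page size not found in vm_stat output")
    counters = {}
    for line in text.splitlines():
        counter = COUNTER_RE.match(line)
        if counter:
            counters[counter.group(1).strip()] = int(counter.group(2))
    return int(match.group(1)), counters


def free_gb(text):
    """Free memory in GiB, computed from the 'Pages free' counter only."""
    page_size, counters = parse_vm_stat(text)
    return counters["Pages free"] * page_size / 1024**3


def has_headroom(text, min_free_gb=8.0):
    """True if free memory meets the ~8 GB guideline for gemma4:e4b."""
    return free_gb(text) >= min_free_gb


def read_vm_stat():
    """Run vm_stat (macOS only) and return its stdout."""
    return subprocess.run(
        ["vm_stat"], capture_output=True, text=True, check=True
    ).stdout
```

On macOS, has_headroom(read_vm_stat()) gives a yes/no answer before the run starts.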
What is not yet implemented
The following work has not been landed. Operational workarounds are sufficient for now; each item has a trade-off the team has not signed off on.
- Hard-kill on Stopping…. A version of _evict_ollama_model that, after issuing the soft eviction, polls ollama ps and on detecting Stopping… for >N seconds calls kill -9 on the runner PID. This is the closest fix to Bug 1 but introduces a hard dependency on parsing ollama ps output, which is undocumented and has changed format twice.
- Memory-pressure preflight. A check at run start that fails fast if vm_stat reports less than N GB free. Trade-off: false positives on machines with effective swap.
- Periodic memory-pressure checkpoint. Inject a vm_stat read between trials and pause the run if pressure crosses a threshold. Trade-off: adds latency to a hot path.
- Auto-set OLLAMA_FLASH_ATTENTION=0 on launch. The harness could refuse to start an Ollama run when it detects OLLAMA_FLASH_ATTENTION=1 on Apple Silicon. Not shipped because the upstream bug may be closed in a future Ollama release and the harness should not pin a workaround past its useful life.
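To make the first item concrete, the "Stopping… for >N seconds" decision could be isolated from process handling as sketched below. This is an illustration, not the planned implementation. The StoppingWatchdog class is a hypothetical name, and the substring match on "Stopping" is an assumption about ollama ps output, which is exactly the undocumented dependency noted above.

```python
import time


class StoppingWatchdog:
    """Tracks how long `ollama ps` output has continuously shown a Stopping state."""

    def __init__(self, max_stopping_s, clock=time.monotonic):
        self.max_stopping_s = max_stopping_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._stopping_since = None

    def observe(self, ps_output):
        """Feed one `ollama ps` output.

        Returns True once Stopping has persisted for longer than max_stopping_s,
        which is the point where a hard kill would be issued.
        """
        now = self._clock()
        # Assumption: a stuck runner is recognisable by this substring.
        if "Stopping" not in ps_output:
            self._stopping_since = None
            return False
        if self._stopping_since is None:
            self._stopping_since = now
        return now - self._stopping_since > self.max_stopping_s
```

Keeping the timing logic separate from the kill means a change in ollama ps formatting only touches the one line that recognises the state.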
Reproducing the diagnostic
If a future run shows the same symptom, rebuild the evidence in this page from ollama ps, ~/.ollama/logs/server.log, and vm_stat snapshots. If OLLAMA_FLASH_ATTENTION=1, you are also looking at Bug 3; set it to 0 and restart Ollama before debugging further.
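Two of these checks can be sketched in Python: the memory-pressure gradient between two vm_stat snapshots, and the Flash Attention flag. The sketch assumes the snapshots have already been parsed into dictionaries keyed by vm_stat counter name. All function names are hypothetical. Treating an unset variable as enabled follows this page's statement that 1 is the default, and the environment inspected must be the Ollama server's, not just the shell's.

```python
def swapouts_per_minute(earlier, later, elapsed_s):
    """Swap-out operations per minute between two parsed vm_stat snapshots."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (later["Swapouts"] - earlier["Swapouts"]) * 60.0 / elapsed_s


def free_page_drop(earlier, later):
    """Factor by which 'Pages free' shrank (earlier / later).

    A value near 100 corresponds to a drop of two orders of magnitude.
    """
    return earlier["Pages free"] / max(later["Pages free"], 1)


def flash_attention_enabled(env):
    """True unless OLLAMA_FLASH_ATTENTION is explicitly set to "0".

    Assumption: an unset variable means enabled, per the default described here.
    """
    return env.get("OLLAMA_FLASH_ATTENTION", "1") != "0"
```

A swap-out rate that jumps from zero to tens of thousands per minute is the Bug 2 signature; a single snapshot cannot show that.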
Config options relevant to Ollama runs
| Config field | Default | Effect |
|---|---|---|
| request_timeout_s | 60 | Per-call timeout in seconds. Lowering this catches memory-pressure stalls earlier; raising it does not help with deadlock (Bug 1). |
| ollama_evict_threshold | 3 | Number of consecutive stall errors before the runner requests eviction and reload. |
| max_concurrency | 4 | Set to 1 for Ollama runs on M-series hardware to avoid amplifying memory pressure. |
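The sketch below shows how these fields might fit together, using the defaults from the table. It is an assumption-laden illustration: OllamaRunConfig and StallCounter are hypothetical names, and resetting the counter after an eviction request is a guess at the behaviour rather than a description of runner.py.

```python
from dataclasses import dataclass


@dataclass
class OllamaRunConfig:
    # Defaults mirror the table above.
    request_timeout_s: float = 60
    ollama_evict_threshold: int = 3
    max_concurrency: int = 4


class StallCounter:
    """Counts consecutive stall errors and signals when eviction should be requested."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.consecutive = 0

    def record(self, stalled):
        """Record one trial outcome; return True when the threshold is reached."""
        if not stalled:
            # Any successful trial breaks the streak.
            self.consecutive = 0
            return False
        self.consecutive += 1
        if self.consecutive >= self.threshold:
            # Assumption: the streak restarts after an eviction is requested.
            self.consecutive = 0
            return True
        return False
```

For an Ollama run on M-series hardware, the config would be built as OllamaRunConfig(max_concurrency=1), leaving the other defaults in place.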
References
Upstream issues
- #15237 — Gemma 3/4 garbage output with Flash Attention on Apple Silicon
- #15258 — M4 reproduction of #15237
- #15368 — Flash Attention silently spills to CPU; ollama ps reports 100% GPU
- #15350 — 3–5× speedup with OLLAMA_FLASH_ATTENTION=0
- #15378 — Partial fix in Ollama v0.20.4
- #15634 — Open: unresolved CPU-spill on Apple Silicon
Additional sources
- ~/.ollama/logs/server.log — encoder vs generate timing per trial; available on your local machine
- Your local DuckDB trial table — query trials grouped by minute to reproduce the failure-window timeline