capture_reasoning: true (the default) asks the model for a {rating, reasoning} JSON object on every trial. This sounds trivial. Implementing it across providers without breaking smaller models was not — three approaches were tried before the current one, and understanding why the first two failed explains why the fix is structured the way it is.
The symptom
The first sign of trouble was a run against gemma-4-31b-it on OpenRouter producing a flood of status='failed' trials. The raw responses were long whitespace-only completions, all truncated with finish_reason: length. Every other model in the same experiment (GPT-4o, Claude, Gemini) returned clean {"rating": 5, "reasoning": "..."} objects. The same Gemma model on the same provider, with the same image and the same prompt but without the reasoning schema, rated fine.
The investigation
Three approaches were tried in order.
Approach 1: required-reasoning JSON schema (broken)
The first implementation used LiteLLM’s response_format with a strict JSON schema where reasoning was in the required array. On Gemma 4, this triggered a degenerate output loop: long whitespace-only completions ending in length-truncation and unparseable output. Same model, same provider, same image, same prompt, no required-reasoning: rated fine. The schema constraint was the trigger.
Approach 2: optional-reasoning JSON schema (silent loss)
reasoning was made optional in the schema, keeping only rating required. The whitespace loops stopped, but Gemma now silently ignored the field. Every completion came back as {"rating": 5} with no reasoning string. The system prompt still said “respond with a single integer,” and the model took that literally even when the schema allowed for more.
Approach 3: prompt rewrite + drop response_format (current)
The current path rewrites the prompts and drops response_format entirely when capture_reasoning=true. Free-form output goes through _parse_rating’s JSON-then-regex fallback. This is the only approach that produced consistent {rating, reasoning} output from Gemma 4 without triggering the whitespace loop.
What capture_reasoning=true actually does
The fix is three coordinated changes: a system-prompt rewrite, a user-prompt suffix, and removing the strict response_format constraint.
1. System-prompt rewrite. The paper’s “respond with a single integer” instruction is replaced with a JSON-with-reasoning instruction.
2. User-prompt suffix. A suffix is appended to the user prompt.
3. No response_format. The response_format argument is omitted entirely when capture_reasoning=true.
With capture_reasoning: false, the harness flips back to a strict response_format enforcing {rating: int} only. That small schema is reliable across every model tested.
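The switch between the two request shapes can be sketched as follows. This is a minimal illustration, not the harness's actual code: build_completion_kwargs and RATING_ONLY_FORMAT are hypothetical names, and the schema is written in the OpenAI-style response_format shape on the assumption that this is what gets forwarded to the provider.

```python
# Hypothetical rating-only schema (assumption: OpenAI-style json_schema shape).
RATING_ONLY_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "rating",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {"rating": {"type": "integer"}},
            "required": ["rating"],
            "additionalProperties": False,
        },
    },
}


def build_completion_kwargs(model, messages, capture_reasoning=True):
    """Assemble request kwargs; hypothetical helper, not the harness's own."""
    kwargs = {"model": model, "messages": messages}
    if not capture_reasoning:
        # Strict path: enforce {rating: int} only.
        kwargs["response_format"] = RATING_ONLY_FORMAT
    # With capture_reasoning=True no response_format key is set at all,
    # so the model answers free-form and the parser handles the text.
    return kwargs
```

The resulting dictionary could then be passed on as litellm.completion(**kwargs). Leaving the key out entirely, rather than setting it to None, keeps the request identical to one that never mentioned a schema.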
The prompt diff
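The before/after for the system prompt (valence dimension), in summary:
Before (capture_reasoning: false): the paper’s instruction to respond with a single integer.
After (capture_reasoning: true): an instruction to respond with a JSON object containing rating and reasoning.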
Why reasoning stays optional in the schema
Even when the strict schema is active, it marks only rating as required.
This applies to the capture_reasoning=false path, where the schema is enforced strictly via _schema_for(cfg) (which strips reasoning entirely). Marking reasoning required under strict mode is exactly what triggered the Gemma whitespace loops, so keeping it optional is a defense-in-depth choice for any path that does use the schema.
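The difference between the schema variants discussed here can be made concrete with a small sketch. All names below (RATING_SCHEMA, with_required_reasoning, strip_reasoning, missing_required) are hypothetical and only illustrate the shapes; the real _schema_for(cfg) may be structured differently.

```python
import copy

# Hypothetical base schema: both fields are described, only "rating" is required.
RATING_SCHEMA = {
    "type": "object",
    "properties": {
        "rating": {"type": "integer"},
        "reasoning": {"type": "string"},
    },
    "required": ["rating"],
}


def with_required_reasoning(schema):
    """Approach 1 shape: the variant that triggered whitespace loops on Gemma 4."""
    out = copy.deepcopy(schema)
    out["required"] = ["rating", "reasoning"]
    return out


def strip_reasoning(schema):
    """Rating-only shape, in the spirit of what _schema_for(cfg) is described as doing."""
    out = copy.deepcopy(schema)
    out["properties"].pop("reasoning", None)
    out["required"] = [key for key in out["required"] if key != "reasoning"]
    return out


def missing_required(obj, schema):
    """List the required keys that a parsed response lacks."""
    return [key for key in schema["required"] if key not in obj]
```

Under the base schema a bare {"rating": 5} response has no missing keys, which matches the silent-loss behaviour seen in Approach 2; under the required-reasoning variant the same response would be rejected.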
Parsing fallback
The rating parser is forgiving by design. Given the model’s raw text output, it tries three passes in order:
1. Try strict json.loads. Works when the model returns a clean JSON object. Returns (rating, reasoning).
2. Extract first { ... } substring. Handles fenced output like ```json\n{...}\n``` and prefixed prose. Slices from the first { to the last } and re-parses. Returns (rating, reasoning) if the extracted object is valid.
3. Regex fallback. The last pass is the regex step of _parse_rating’s JSON-then-regex fallback, applied to the free-form text.
A trial is marked status='failed' only if all three fail (or the model returned an empty completion). That outcome is tracked in the error column for postmortem queries.
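The three passes can be sketched as below. This is an illustration under stated assumptions, not the actual _parse_rating: the function names are hypothetical, and the regex used in the third pass (first integer found in the text) is a guess, since the text does not spell out the pattern.

```python
import json
import re


def _from_json(candidate):
    """Return (rating, reasoning) if candidate is a JSON object with an int rating."""
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    rating = obj.get("rating")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int):
        return None
    reasoning = obj.get("reasoning")
    return (rating, reasoning if isinstance(reasoning, str) else None)


def parse_rating_sketch(text):
    """Hypothetical stand-in for _parse_rating; returns None when every pass fails."""
    if not text or not text.strip():
        return None  # empty or whitespace-only completion
    # Pass 1: strict json.loads on the whole text.
    parsed = _from_json(text)
    if parsed is not None:
        return parsed
    # Pass 2: slice from the first "{" to the last "}" and re-parse.
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        parsed = _from_json(text[start:end + 1])
        if parsed is not None:
            return parsed
    # Pass 3: regex fallback (assumed pattern: first integer in the text).
    match = re.search(r"-?\d+", text)
    if match:
        return (int(match.group()), None)
    return None
```

A caller would treat a None result as the status='failed' case. Note that the third pass can only recover a rating, so reasoning comes back as None on that path.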
Config option
The capture_reasoning field in your run config controls this behaviour. It defaults to true.
capture_reasoning: true is worth it. The reasoning text is genuinely useful for spotting cases where the model is rating the content differently than you’d expect — for example, anthropomorphizing an alarm clock as “stressful” because it implies “waking up early.”
Reproducing the diagnostic
If you see a run with many status='failed' trials and finish_reason: length in the error column, query your DuckDB file to confirm the pattern.
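One way to run such a check from Python is sketched below. The column names status, raw_response and error follow the text; the table name trials and the model column are assumptions, as is the idea that the error column contains the word "length" for truncated completions. Adjust these to your actual schema.

```python
# Hypothetical query: table name "trials" and column "model" are assumptions.
DIAGNOSTIC_SQL = """
SELECT model, raw_response, error
FROM trials
WHERE status = 'failed'
  AND error LIKE '%length%'
"""


def is_whitespace_loop(raw_response, error):
    """True for a non-empty, whitespace-only completion truncated on length."""
    if not raw_response or not error:
        return False
    return raw_response.strip() == "" and "length" in error


def find_whitespace_loops(db_path):
    """Return failed rows that match the whitespace-loop pattern."""
    import duckdb  # third-party package; imported lazily so the helper above works without it

    con = duckdb.connect(db_path, read_only=True)
    try:
        rows = con.execute(DIAGNOSTIC_SQL).fetchall()
    finally:
        con.close()
    return [row for row in rows if is_whitespace_loop(row[1], row[2])]
```

The whitespace test is done in Python rather than SQL so that newlines and tabs are counted as whitespace regardless of how the database's trim function behaves.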
Whitespace-only raw_response values paired with finish_reason = 'length' confirm the Gemma whitespace-loop pattern. The fix is to ensure capture_reasoning: true is set (which drops response_format) or to switch to a model that handles required-reasoning schemas reliably.
Related pages
- Configuration: capture_reasoning field reference
- Discrepancies: how reasoning capture changes the comparison to human norms