Reasoning Capture: Fix for Gemma 4 Schema Failures

capture_reasoning: true (the default) asks the model for a {rating, reasoning} JSON object on every trial. This sounds trivial. Implementing it across providers without breaking smaller models was not — three approaches were tried before the current one, and understanding why the first two failed explains why the fix is structured the way it is.

The symptom

The first sign of trouble was a run against gemma-4-31b-it on OpenRouter producing a flood of status='failed' trials. The raw responses were long whitespace-only completions, all truncated with finish_reason: length. Every other model in the same experiment — GPT-4o, Claude, Gemini — returned clean {"rating": 5, "reasoning": "..."} objects. The same Gemma model on the same provider, same image, same prompt, but without the reasoning schema: it rated fine.

The investigation

Three approaches were tried in order:

Approach 1 — required-reasoning JSON schema (broken)

The first implementation used LiteLLM’s response_format with a strict JSON schema where reasoning was in the required array:

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "rating",
        "schema": {
            "properties": {"rating": ..., "reasoning": ...},
            "required": ["rating", "reasoning"],
            "strict": True,
        },
    },
}

On Gemma 4, this triggered a degenerate output loop — long whitespace-only completions ending in length-truncation and unparseable output. Same model, same provider, same image, same prompt, no required-reasoning: rated fine. The schema constraint was the trigger.

Approach 2 — optional-reasoning JSON schema (silent loss)

reasoning was made optional in the schema, keeping only rating required. The whitespace loops stopped — but Gemma now silently ignored the field. Every completion came back as {"rating": 5} with no reasoning string. The system prompt still said “respond with a single integer,” and the model took that literally even when the schema allowed for more.

Approach 3 — prompt rewrite + drop response_format (current)

The current path rewrites the prompts and drops response_format entirely when capture_reasoning=true. Free-form output goes through _parse_rating’s JSON-then-regex fallback. This is the only approach that produced consistent {rating, reasoning} output from Gemma 4 without triggering the whitespace loop.

What `capture_reasoning=true` actually does

The fix is three coordinated changes: a system-prompt rewrite, a user-prompt suffix, and removing the strict response_format constraint. 1. System-prompt rewrite. The paper’s “respond with a single integer” instruction is replaced with a JSON-with-reasoning instruction:

sys_p = sys_p.replace(
    "Please respond with a single integer from 1 to 7.",
    'Respond with a JSON object containing your integer `rating` (1-7) '
    'and a brief one-sentence `reasoning` (<=30 words) describing what '
    'about the image drove the rating.\n'
    'Example: {"rating": 5, "reasoning": "A bright sunset evokes mild positive affect."}',
)

2. User-prompt suffix. A “respond ONLY with a JSON object” instruction is appended to the user turn:

usr_p = (
    f"{usr_p} Respond ONLY with a JSON object: "
    '{"rating": <int 1-7>, "reasoning": "<one-sentence rationale>"}.'
)

3. No strict schema. The response_format argument is omitted entirely when capture_reasoning=true:

if cfg.capture_reasoning:
    resp = await acompletion(**call_kwargs)          # no response_format
else:
    # strict {rating: int} schema with try/except fallback
    ...

With capture_reasoning: false, the harness flips back to a strict response_format enforcing {rating: int} only. That small schema is reliable across every model tested.

The prompt diff

Here is the before/after for the system prompt (valence dimension shown): Before (capture_reasoning: false):

...
Please respond with a single integer from 1 to 7.

After (capture_reasoning: true):

...
Respond with a JSON object containing your integer `rating` (1-7)
and a brief one-sentence `reasoning` (<=30 words) describing what
about the image drove the rating.
Example: {"rating": 5, "reasoning": "A bright sunset evokes mild positive affect."}

The user turn gains a suffix:

Please rate the VALENCE of this image on the 1-7 scale. Respond ONLY with a JSON object: {"rating": <int 1-7>, "reasoning": "<one-sentence rationale>"}.

Why `reasoning` stays optional in the schema

Even when the strict schema is active, it marks only rating as required:

RATING_SCHEMA = {
    "type": "object",
    "properties": {
        "rating":    {"type": "integer", "minimum": 1, "maximum": 7, ...},
        "reasoning": {"type": "string", ...},
    },
    "required": ["rating"],
    "additionalProperties": False,
}

This matters for the capture_reasoning=false path, where the schema is enforced strictly via _schema_for(cfg) (which strips reasoning entirely). Marking reasoning required under strict mode is exactly what triggered the Gemma whitespace loops — keeping it optional is a defense-in-depth choice for any path that does use the schema.

Parsing fallback

The rating parser is forgiving by design. Given the model’s raw text output, it tries three passes in order:

Try strict json.loads

Works when the model returns a clean JSON object. Returns (rating, reasoning).

Extract first { ... } substring

Handles fenced output like ```json\n{...}\n``` and prefixed prose. Slices from the first { to the last } and re-parses. Returns (rating, reasoning) if the extracted object is valid.

Regex first integer 1–7

Last resort: pulls the first standalone digit 1–7 from the text using re.search(r"\b([1-7])\b", content). Reasoning is returned as None in this branch — the rating is salvaged but the reasoning is lost.

A trial reaches status='failed' only if all three fail (or the model returned an empty completion). That outcome is tracked in the error column for postmortem queries.

Config option

The capture_reasoning field in your run config controls this behaviour. It defaults to true.

capture_reasoning: true   # default — rewrites prompts, drops response_format
capture_reasoning: false  # strict {rating: int} schema, no reasoning column

For strict comparison to the human OASIS norms, set capture_reasoning: false. Asking for reasoning is a deliberate task change — the model is now doing something different from what the human participants did. See Discrepancies for the full discussion.

For exploratory work, capture_reasoning: true is worth it. The reasoning text is genuinely useful for spotting cases where the model is rating the content differently than you’d expect — for example, anthropomorphizing an alarm clock as “stressful” because it implies “waking up early.”

Reproducing the diagnostic

If you see a run with many status='failed' trials and finish_reason: length in the error column, run the following against your DuckDB file to confirm the pattern:

SELECT
    model,
    count(*) FILTER (WHERE status = 'failed')            AS failed,
    count(*) FILTER (WHERE finish_reason = 'length')     AS length_truncated,
    count(*)                                              AS total
FROM trials
JOIN runs USING (run_id)
WHERE run_id = '<your_run_id>'
GROUP BY model;

Then inspect a sample of the raw responses:

SELECT raw_response, error, finish_reason
FROM trials
WHERE run_id = '<your_run_id>'
  AND status = 'failed'
LIMIT 10;

Whitespace-only or near-empty raw_response values paired with finish_reason = 'length' confirm the Gemma whitespace-loop pattern. The fix is to ensure capture_reasoning: true is set (which drops response_format) or to switch to a model that handles required-reasoning schemas reliably.

Configuration — capture_reasoning field reference
Discrepancies — how reasoning capture changes the comparison to human norms

Documentation Index

​The symptom

​The investigation

​What capture_reasoning=true actually does

​The prompt diff

​Why reasoning stays optional in the schema

​Parsing fallback

​Config option

​Reproducing the diagnostic

​Related pages

The symptom

The investigation

What `capture_reasoning=true` actually does

The prompt diff

Why `reasoning` stays optional in the schema

Parsing fallback

Config option

Reproducing the diagnostic

Related pages