Run Lifecycle: From YAML Config to Completed Trials

When you run oasis-llm run configs/runs/my-run.yaml, a precise sequence of steps unfolds: the CLI reads your config, creates or resumes a run record in DuckDB, enqueues every (image, dimension, sample) combination that hasn’t been attempted yet, then drains those trials through an async worker pool that calls your LLM provider and writes results back to the database. Understanding this lifecycle helps you reason about resumption, cost, and failure recovery.

End-to-end sequence

The diagram below traces a full run from your terminal command through to the final status update.

Key invariants

Three guarantees hold throughout every run:

Atomic claim. _claim_one executes under a single asyncio.Lock, so no two workers in the same process can grab the same trial. The runner is single-process by design — there is no cross-process locking mechanism.
Stale-claim recovery. Before claiming a new trial, _claim_one resets any trial whose status='running' and claimed_at < now() - 10 min back to pending. If a worker is killed mid-flight, a later invocation of the runner reclaims its trials once their claims are more than 10 minutes old.
Retry budget. A trial is eligible to be claimed again while status='pending' OR (status='failed' AND attempts < 3). Every call to _record_result increments attempts regardless of whether the trial succeeded or failed.

The ordering inside _claim_one is ORDER BY attempts, sample_idx, image_id, dimension LIMIT 1, so fresh trials (attempts = 0) drain before retries. Within an attempt tier, lower sample_idx values go first.
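
The three invariants and the claim ordering can be sketched in memory. This is a minimal illustration, not the runner's code: the Trial dataclass and InMemoryQueue class are hypothetical stand-ins for the trials table, and only the 10-minute cutoff, the attempts < 3 rule, and the ORDER BY tuple come from the description above.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(minutes=10)  # stale-claim cutoff from the text
MAX_ATTEMPTS = 3                     # retry budget from the text


@dataclass
class Trial:
    image_id: str
    dimension: str
    sample_idx: int
    status: str = "pending"          # pending / running / done / failed
    attempts: int = 0
    claimed_at: Optional[datetime] = None


class InMemoryQueue:
    """Hypothetical stand-in for the trials table (assumption, not the real runner)."""

    def __init__(self, trials):
        self.trials = list(trials)
        self._lock = asyncio.Lock()

    async def claim_one(self, now: datetime) -> Optional[Trial]:
        async with self._lock:  # atomic claim within a single process
            # 1. Stale-claim recovery: long-running claims go back to pending.
            for t in self.trials:
                if (t.status == "running" and t.claimed_at is not None
                        and t.claimed_at < now - STALE_AFTER):
                    t.status = "pending"
            # 2. Eligibility: pending, or failed with retry budget left.
            eligible = [
                t for t in self.trials
                if t.status == "pending"
                or (t.status == "failed" and t.attempts < MAX_ATTEMPTS)
            ]
            if not eligible:
                return None
            # 3. ORDER BY attempts, sample_idx, image_id, dimension LIMIT 1
            chosen = min(eligible, key=lambda t: (t.attempts, t.sample_idx,
                                                  t.image_id, t.dimension))
            chosen.status = "running"
            chosen.claimed_at = now
            return chosen


async def demo():
    now = datetime(2024, 1, 1, 12, 0, 0)
    queue = InMemoryQueue([
        Trial("img_b", "valence", 0, status="failed", attempts=1),
        Trial("img_a", "arousal", 1),
        Trial("img_a", "valence", 0, status="running",
              claimed_at=now - timedelta(minutes=11)),      # stale claim
        Trial("img_c", "valence", 0, status="failed", attempts=3),  # exhausted
    ])
    order = []
    while (t := await queue.claim_one(now)) is not None:
        order.append((t.image_id, t.dimension, t.sample_idx))
    return order


claimed = asyncio.run(demo())
```

In this toy run the recovered stale trial (attempts = 0, sample_idx = 0) is claimed first, the other fresh trial second, and the retry last; the trial that has used all three attempts is never claimed.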

Trial lifecycle

Every trial moves through a defined set of states. The diagram below shows all valid transitions, including the stale-recovery path and the two terminal states. A trial reaches a terminal state when it is either done, or failed with attempts >= 3. There is no automatic promotion out of a terminal state — to re-run a finished trial you must start a new run with a different name.
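
The transitions described in this section can be written down as a small table. The sketch below is illustrative only: the names TRANSITIONS, is_terminal, and can_transition are hypothetical, and the transition set is inferred from the text above rather than taken from the runner's source.

```python
MAX_ATTEMPTS = 3  # retry budget from the text

# Allowed next statuses, keyed by current status (inferred from the prose).
TRANSITIONS = {
    "pending": {"running"},                    # claimed by a worker
    "running": {"done", "failed", "pending"},  # result recorded, or stale recovery
    "failed": {"running"},                     # re-claimed while attempts < MAX_ATTEMPTS
    "done": set(),                             # terminal
}


def is_terminal(status: str, attempts: int) -> bool:
    """A trial is terminal when done, or failed with the retry budget spent."""
    return status == "done" or (status == "failed" and attempts >= MAX_ATTEMPTS)


def can_transition(status: str, attempts: int, new_status: str) -> bool:
    """Check a proposed status change against the table above."""
    if is_terminal(status, attempts):
        return False
    return new_status in TRANSITIONS[status]
```

For example, can_transition("failed", 2, "running") is true, while the same call with attempts=3 is false because the trial is already terminal.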

Trial record

Every row in the trials table is identified by the composite key (run_id, image_id, dimension, sample_idx). The table below documents every column you may encounter when querying results directly.

Column	Type	Meaning
`status`	TEXT	`pending` / `running` / `done` / `failed`
`rating`	INTEGER	Parsed 1–7 rating; NULL on failure
`raw_response`	TEXT	Verbatim model output
`reasoning`	TEXT	Parsed JSON `reasoning` field; NULL if absent or not captured
`prompt_hash`	TEXT	`sha256(model + system + user)[:16]` — lets you detect prompt-version changes across runs
`latency_ms`	INTEGER	Wall-clock time from the start of `_call_model`
`input_tokens`	INTEGER	Token count from the provider usage block
`output_tokens`	INTEGER	Token count from the provider usage block
`cost_usd`	DOUBLE	LiteLLM cost estimate; falls back to OpenRouter native `usage.cost`
`error`	TEXT	Error message; NULL on success
`finish_reason`	TEXT	Provider’s finish reason (e.g. `stop`, `length`)
`response_id`	TEXT	Provider’s response ID — useful for correlating with provider-side trace logs and dashboards
`attempts`	INTEGER	Incremented by every call to `_record_result`, whether success or failure
`claimed_at`	TIMESTAMP	Set when a worker claims the trial; used for stale-recovery cutoff
`completed_at`	TIMESTAMP	Set when `_record_result` writes the outcome; useful for latency analysis
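
As an illustration of the prompt_hash column, the following sketch computes a 16-character digest from the three strings named in the table. The plain string concatenation and the UTF-8 encoding are assumptions; the real implementation may join or encode the parts differently, so do not expect these values to match stored hashes.

```python
import hashlib


def prompt_hash(model: str, system: str, user: str) -> str:
    """First 16 hex chars of sha256 over model + system + user (encoding assumed)."""
    payload = (model + system + user).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


# Hypothetical prompts, used only to show that any change alters the hash.
h_v1 = prompt_hash("some/model", "Rate the image.", "Dimension: valence")
h_v2 = prompt_hash("some/model", "Rate the image carefully.", "Dimension: valence")
```

Comparing prompt_hash values between two runs is then a cheap way to confirm whether they saw the same prompt text.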

Query cost_usd and latency_ms across a finished run to estimate per-model pricing and throughput before scaling up samples_per_image.
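
A query along these lines might look like the sketch below. To keep the example runnable without the project database, it uses Python's built-in sqlite3 module with an in-memory table as a stand-in for the DuckDB trials table; the reduced schema, the run_id value demo-run, and all numbers are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the real DuckDB database (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trials (run_id TEXT, status TEXT, latency_ms INTEGER, cost_usd REAL)"
)
conn.executemany(
    "INSERT INTO trials VALUES (?, ?, ?, ?)",
    [
        ("demo-run", "done", 1200, 0.0010),
        ("demo-run", "done", 1800, 0.0014),
        ("demo-run", "failed", 30000, None),  # failures are excluded below
    ],
)

n_done, mean_latency_ms, total_cost = conn.execute(
    """
    SELECT COUNT(*), AVG(latency_ms), SUM(cost_usd)
    FROM trials
    WHERE run_id = ? AND status = 'done'
    """,
    ("demo-run",),
).fetchone()

cost_per_trial = total_cost / n_done

# Rough projection for a larger run: images x dimensions x samples_per_image.
projected_trials = 30 * 2 * 20
projected_cost = cost_per_trial * projected_trials
```

The projection is a simple linear extrapolation; it ignores retries and any variation in token counts between images.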

Resumption semantics

upsert_run stores a canonical config hash alongside every run record and checks it on every subsequent invocation of the same run_id. If the hash doesn’t match, the runner raises an error rather than silently mixing results from two different experiment configurations:

if existing[0] != cfg_hash:
    raise RuntimeError(
        f"Run '{run_id}' exists with different config hash "
        f"(stored={existing[0]}, new={cfg_hash}). Use a new --name or --new-run."
    )

The hash is computed as sha256(model_dump(exclude={name, max_concurrency, request_timeout_s, max_retries, samples_per_image}))[:16]. Because those five fields are excluded, you can safely change them between invocations of the same run without invalidating earlier results.

What you CAN change between resumes:

max_concurrency — tune parallelism up or down freely.
request_timeout_s — extend the per-call timeout if you’re seeing timeouts on large images.
samples_per_image — increase from 5 to 20 and only the new samples will be enqueued; already-completed samples are skipped by the anti-join in enqueue_trials.
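
The effect of raising samples_per_image can be sketched as a set difference between the trials the config asks for and the trials that already exist. This is a minimal in-memory illustration of the anti-join idea, not the SQL used by enqueue_trials; the function name trials_to_enqueue and the image IDs are hypothetical.

```python
from itertools import product


def trials_to_enqueue(image_ids, dimensions, samples_per_image, existing_keys):
    """Return (image_id, dimension, sample_idx) keys that have no row yet."""
    wanted = product(image_ids, dimensions, range(samples_per_image))
    return [key for key in wanted if key not in existing_keys]


images = ["img_a", "img_b"]
dims = ["valence", "arousal"]

# First invocation: nothing exists, so 2 x 2 x 5 = 20 trials are enqueued.
first = trials_to_enqueue(images, dims, 5, existing_keys=set())

# Resume with samples_per_image raised to 20: only sample_idx 5..19 are new.
second = trials_to_enqueue(images, dims, 20, existing_keys=set(first))
```

Here the second call returns 60 keys, all with sample_idx of 5 or higher, which mirrors the behaviour described for increasing the sample count.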

What you CANNOT silently change:

model — changing the model changes what the experiment measures.
Prompt overrides (system_prompt_override, format_hint_suffix) — these affect what the model sees.
dimensions — adding or removing valence/arousal changes the scope of the experiment.
image_set — changing the image set changes the population being rated.

Attempting to resume a run after changing any of the above fields will raise a RuntimeError. Create a new run with a different name instead.
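
The hash check can be illustrated with a plain dictionary standing in for the config model. This is a sketch under assumptions: it serialises the remaining fields as sorted-key JSON, whereas the real code hashes the output of model_dump and may canonicalise it differently, so the digests here will not match stored ones. The field values are hypothetical.

```python
import hashlib
import json

# Fields excluded from the hash, as listed in the text above.
EXCLUDED = {"name", "max_concurrency", "request_timeout_s",
            "max_retries", "samples_per_image"}


def config_hash(cfg: dict) -> str:
    """16-char sha256 digest over the non-excluded fields (serialisation assumed)."""
    hashed_fields = {k: v for k, v in cfg.items() if k not in EXCLUDED}
    canonical = json.dumps(hashed_fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


base = {
    "name": "my-run",
    "model": "some/model",
    "dimensions": ["valence", "arousal"],
    "image_set": "pilot_30",
    "samples_per_image": 5,
    "max_concurrency": 4,
}

tuned = {**base, "max_concurrency": 16, "samples_per_image": 20}
other_model = {**base, "model": "another/model"}

same_hash = config_hash(base) == config_hash(tuned)          # resume allowed
changed_hash = config_hash(base) != config_hash(other_model)  # would raise
```

Changing only excluded fields leaves the digest untouched, while changing model produces a different digest, which is the condition that triggers the RuntimeError.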

Exporting results

Once a run is complete, you have three export options from the CLI.

Export raw trials

Write all completed trials for a run to a CSV file:

oasis-llm export <run_id> outputs/<run_id>.csv

The output contains every row from the trials table where status='done', including rating, reasoning, cost_usd, latency_ms, input_tokens, and output_tokens.
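
Once exported, the CSV can be summarised with the standard library. The sketch below assumes the CSV header uses the same names as the trials columns (image_id, dimension, sample_idx, rating); the inline sample data is invented so the example runs without a real export.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Invented stand-in for outputs/<run_id>.csv (header names are an assumption).
sample_csv = io.StringIO(
    "image_id,dimension,sample_idx,rating\n"
    "img_a,valence,0,6\n"
    "img_a,valence,1,5\n"
    "img_a,arousal,0,3\n"
)


def mean_ratings(fh):
    """Average the rating column per (image_id, dimension) pair."""
    by_key = defaultdict(list)
    for row in csv.DictReader(fh):
        by_key[(row["image_id"], row["dimension"])].append(int(row["rating"]))
    return {key: mean(values) for key, values in by_key.items()}


means = mean_ratings(sample_csv)
```

To run this against a real export, replace sample_csv with an open file handle, for example open("outputs/my-run.csv", newline="").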

Generate paper-style plots

Produce a summary PDF and plots in the style of a research paper:

oasis-llm paper-plots <run_id>
oasis-llm paper-plots <run_id> --out-dir outputs/paper_plots

The output directory (default outputs/paper_plots/<run_id>) receives:

Valence and arousal distribution plots
LLM vs human scatter plots
A summary JSON with image count, samples per image, and human-overlap correlations (if human norms are available)

Generate participant-style dataset

Reconstruct a wide participant-style CSV that mirrors the original OASIS data format, where each row represents one “pseudo-participant” that rated a fixed number of images:

oasis-llm participant-dataset <run_id>
oasis-llm participant-dataset <run_id> --out-dir outputs/participant_dataset --images-per-participant 20

Flag	Default	Description
`--out-dir`	`outputs/participant_dataset/<run_id>`	Directory to write the CSV and per-image plots.
`--images-per-participant`	`20`	Number of images assigned to each pseudo-participant row.

Rows in the participant-style dataset are reconstructed from sample_idx values, not real participant sessions. They are structurally compatible with analysis scripts written for the original OASIS format but do not represent independent participants.

Dataset management

Datasets are named, versioned subsets of the 900 OASIS images. They let you define a reusable stimulus pool and attach it to multiple experiment configs without re-specifying the image list each time. Datasets progress through a lifecycle: draft → approved → archived.

oasis-llm dataset list                        # list all datasets with status and image counts
oasis-llm dataset generate "my-set" --n 30   # create a 30-image stratified draft
oasis-llm dataset show <dataset_id>           # inspect images and metadata
oasis-llm dataset approve <dataset_id>        # lock as immutable (approved)
oasis-llm dataset archive <dataset_id>        # mark as archived (no longer active)

Editing a draft dataset

Before approving, you can refine the image list:

oasis-llm dataset exclude <dataset_id> <image_id> --note "reason"  # exclude an image
oasis-llm dataset include <dataset_id> <image_id>                   # re-include excluded image
oasis-llm dataset add <dataset_id> <image_id>                       # add a new image
oasis-llm dataset duplicate <dataset_id> "new-name"                 # clone to a new draft
oasis-llm dataset delete <dataset_id>                               # delete (built-ins protected)

The generate command supports three sampling strategies:

Strategy	Behaviour
`stratified` (default)	Proportionally allocates images across OASIS categories (Animal, Scene, Person, Object).
`uniform`	Draws images uniformly at random from the full pool.
`all`	Selects all 900 images (equivalent to `full_900`).

oasis-llm dataset generate "arousal-focus" --n 50 --strategy uniform --seed 99

Once a dataset is approved, you cannot add or exclude images. Duplicate the dataset to a new draft if you need to modify it.
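
Proportional allocation across categories, as the stratified strategy is described above, can be sketched as follows. The exact rounding and tie-breaking rules used by oasis-llm are not documented here, so this example uses a largest-remainder rule as an assumption; the function name, image IDs, and category sizes are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(images, n, seed=0):
    """images: list of (image_id, category); returns n image IDs, n <= len(images)."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for image_id, category in images:
        by_cat[category].append(image_id)
    total = len(images)

    # Integer floor of each category's proportional share, plus its remainder.
    alloc = {c: n * len(ids) // total for c, ids in by_cat.items()}
    remainders = {c: n * len(ids) % total for c, ids in by_cat.items()}

    # Hand out the leftover slots to the largest remainders (ties by name).
    leftover = n - sum(alloc.values())
    for c in sorted(remainders, key=lambda c: (-remainders[c], c))[:leftover]:
        alloc[c] += 1

    picked = []
    for c in sorted(by_cat):
        picked.extend(rng.sample(sorted(by_cat[c]), alloc[c]))
    return picked


# Toy pool of 40 images with uneven category sizes.
pool = (
    [(f"animal_{i}", "Animal") for i in range(10)]
    + [(f"scene_{i}", "Scene") for i in range(20)]
    + [(f"person_{i}", "Person") for i in range(6)]
    + [(f"object_{i}", "Object") for i in range(4)]
)
category_of = dict(pool)

picked = stratified_sample(pool, 10, seed=99)
counts = Counter(category_of[image_id] for image_id in picked)
```

With this toy pool the ten slots split 3 Animal, 5 Scene, 1 Person, and 1 Object, and repeating the call with the same seed returns the same images.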

Experiment management

An experiment groups multiple run configs against a shared dataset, letting you compare several models or parameter combinations in one organised unit. Each config in the experiment gets its own run_id and runs sequentially.

oasis-llm experiment list                      # list all experiments with status
oasis-llm experiment create experiment.yaml    # create from a YAML definition
oasis-llm experiment show <experiment_id>      # per-config progress and cost
oasis-llm experiment run <experiment_id>       # execute all configs sequentially
oasis-llm experiment delete <experiment_id>    # delete experiment and all its trials

Experiment YAML format

name: compare-models-pilot30
dataset: my-dataset-id
description: "Compare Gemma 4 vs Qwen3 on pilot_30"
configs:
  - name: gemma4-pilot30
    provider: openrouter
    model: google/gemma-4-31b-it
    modality: vision
    dimensions: [valence, arousal]
    samples_per_image: 5
    max_concurrency: 4
    capture_reasoning: true

  - name: qwen3-pilot30
    provider: openrouter
    model: qwen/qwen3-vl-7b
    modality: vision
    dimensions: [valence, arousal]
    samples_per_image: 5
    max_concurrency: 4
    capture_reasoning: true

oasis-llm experiment run executes each config sequentially. If one config is already complete (all trials done), it is skipped automatically. You can re-run the command after a failure and it resumes from where it left off, because each underlying run is idempotent.

Use oasis-llm experiment show <id> to check per-config progress and cost mid-run. The table shows done/total and cumulative cost_usd for each config.
