When you run `oasis-llm run configs/runs/my-run.yaml`, a precise sequence of steps unfolds: the CLI reads your config, creates or resumes a run record in DuckDB, enqueues every (image, dimension, sample) combination that hasn’t been attempted yet, then drains those trials through an async worker pool that calls your LLM provider and writes results back to the database. Understanding this lifecycle helps you reason about resumption, cost, and failure recovery.
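
The drain step of this lifecycle can be pictured with a small worker pool. This is only a sketch under stated assumptions: the in-memory `asyncio.Queue` stands in for the DuckDB-backed claim step, and `drain` and `fake_model` are hypothetical names, not functions from the runner.

```python
import asyncio


async def drain(trials, call_model, max_concurrency=4):
    """Run call_model over every trial with at most max_concurrency in flight."""
    queue = asyncio.Queue()  # stand-in for the database-backed trial queue
    for trial in trials:
        queue.put_nowait(trial)
    results = {}

    async def worker():
        while True:
            try:
                trial = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to claim; this worker exits
            try:
                results[trial] = ("done", await call_model(trial))
            except Exception as exc:  # record the failure instead of crashing the pool
                results[trial] = ("failed", str(exc))

    await asyncio.gather(*(worker() for _ in range(max_concurrency)))
    return results


async def fake_model(trial):
    """Hypothetical provider call: fails for one image, returns a rating otherwise."""
    await asyncio.sleep(0)
    image_id, dimension, sample_idx = trial
    if image_id == "bad":
        raise RuntimeError("provider error")
    return 4


if __name__ == "__main__":
    trials = [("I1", "valence", 0), ("I1", "arousal", 0), ("bad", "valence", 0)]
    print(asyncio.run(drain(trials, fake_model, max_concurrency=2)))
```

A failed call is recorded as a result rather than raised, so one bad trial does not stop the other workers.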
End-to-end sequence
The diagram below traces a full run from your terminal command through to the final status update.
Key invariants
Three guarantees hold throughout every run:
- Atomic claim. `_claim_one` executes under a single `asyncio.Lock`, so no two workers in the same process can grab the same trial. The runner is single-process by design; there is no cross-process locking mechanism.
- Stale-claim recovery. Before claiming a new trial, `_claim_one` resets any trial whose `status='running'` and `claimed_at < now() - 10 min` back to `pending`. If a worker is killed mid-flight, a later invocation of the runner automatically reclaims its trials once their claim is more than 10 minutes old.
- Retry budget. A trial is eligible to be claimed again while `status='pending'` OR (`status='failed'` AND `attempts < 3`). Every call to `_record_result` increments `attempts` regardless of whether the trial succeeded or failed.
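
The three guarantees can be sketched in a few lines. This is a minimal illustration, not the runner's code: `sqlite3` stands in for DuckDB so the example is self-contained, timestamps are plain epoch seconds, and `make_db` and `claim_one` are hypothetical names. The `ORDER BY` clause is the one the runner is documented to use.

```python
import asyncio
import sqlite3
import time

STALE_AFTER_S = 10 * 60  # stale-claim cutoff described in the text
MAX_ATTEMPTS = 3         # retry budget described in the text


def make_db():
    """In-memory stand-in for the trials table (only the columns used here)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE trials (
               run_id TEXT, image_id TEXT, dimension TEXT, sample_idx INTEGER,
               status TEXT DEFAULT 'pending', attempts INTEGER DEFAULT 0,
               claimed_at REAL,
               PRIMARY KEY (run_id, image_id, dimension, sample_idx))"""
    )
    return db


async def claim_one(db, lock, run_id, now=None):
    """Claim the next eligible trial, or return None when nothing is claimable."""
    now = time.time() if now is None else now
    async with lock:  # one claimer at a time within this process
        # Stale-claim recovery: abandoned 'running' trials go back to 'pending'.
        db.execute(
            "UPDATE trials SET status='pending', claimed_at=NULL "
            "WHERE run_id=? AND status='running' AND claimed_at < ?",
            (run_id, now - STALE_AFTER_S),
        )
        # Retry budget: pending trials, or failed trials with attempts left.
        row = db.execute(
            "SELECT image_id, dimension, sample_idx FROM trials "
            "WHERE run_id=? AND (status='pending' "
            "OR (status='failed' AND attempts < ?)) "
            "ORDER BY attempts, sample_idx, image_id, dimension LIMIT 1",
            (run_id, MAX_ATTEMPTS),
        ).fetchone()
        if row is None:
            return None
        db.execute(
            "UPDATE trials SET status='running', claimed_at=? "
            "WHERE run_id=? AND image_id=? AND dimension=? AND sample_idx=?",
            (now, run_id, *row),
        )
        return row
```

The lock is created by the caller inside the running event loop and passed in, which keeps the sketch independent of how the event loop is started.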
The ordering inside `_claim_one` is `ORDER BY attempts, sample_idx, image_id, dimension LIMIT 1`, so fresh trials (`attempts = 0`) drain before retries. Within an attempt tier, lower `sample_idx` values go first.
Trial lifecycle
Every trial moves through a defined set of states. The diagram below shows all valid transitions, including the stale-recovery path and the two terminal states.
A trial reaches a terminal state when it is either `done`, or `failed` with `attempts >= 3`. There is no automatic promotion out of a terminal state; to re-run a finished trial you must start a new run with a different name.
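
The states and transitions described here can be written down as a small table. The sketch below is an illustration under assumptions: the names `Status`, `TRANSITIONS`, `is_terminal`, and `can_transition` are hypothetical, and it assumes a failed trial with retries left is claimed directly (`failed` to `running`), as the claim condition suggests.

```python
from enum import Enum

MAX_ATTEMPTS = 3  # retry budget described in the text


class Status(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


# Moves allowed from each state, ignoring the attempts check.
TRANSITIONS = {
    Status.PENDING: {Status.RUNNING},                              # claimed by a worker
    Status.RUNNING: {Status.DONE, Status.FAILED, Status.PENDING},  # result recorded, or stale reset
    Status.FAILED: {Status.RUNNING},                               # retried while budget remains
    Status.DONE: set(),                                            # terminal
}


def is_terminal(status, attempts):
    """A trial is finished when it is done, or failed with no retries left."""
    return status is Status.DONE or (
        status is Status.FAILED and attempts >= MAX_ATTEMPTS
    )


def can_transition(src, dst, attempts):
    """True if a trial in state src with the given attempts may move to dst."""
    if is_terminal(src, attempts):
        return False
    return dst in TRANSITIONS[src]
```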
Trial record
Every row in the `trials` table is identified by the composite key `(run_id, image_id, dimension, sample_idx)`. The table below documents every column you may encounter when querying results directly.
| Column | Type | Meaning |
|---|---|---|
| `status` | TEXT | `pending` / `running` / `done` / `failed` |
| `rating` | INTEGER | Parsed 1–7 rating; NULL on failure |
| `raw_response` | TEXT | Verbatim model output |
| `reasoning` | TEXT | Parsed JSON reasoning field; NULL if absent or not captured |
| `prompt_hash` | TEXT | `sha256(model + system + user)[:16]`; lets you detect prompt-version changes across runs |
| `latency_ms` | INTEGER | Wall-clock time from the start of `_call_model` |
| `input_tokens` | INTEGER | Token count from the provider usage block |
| `output_tokens` | INTEGER | Token count from the provider usage block |
| `cost_usd` | DOUBLE | LiteLLM cost estimate; falls back to OpenRouter native `usage.cost` |
| `error` | TEXT | Error message; NULL on success |
| `finish_reason` | TEXT | Provider’s finish reason (e.g. `stop`, `length`) |
| `response_id` | TEXT | Provider’s response ID; useful for correlating with provider-side trace logs and dashboards |
| `attempts` | INTEGER | Incremented by every call to `_record_result`, whether success or failure |
| `claimed_at` | TIMESTAMP | Set when a worker claims the trial; used for stale-recovery cutoff |
| `completed_at` | TIMESTAMP | Set when `_record_result` writes the outcome; useful for latency analysis |
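
As an illustration of the `prompt_hash` column, the formula `sha256(model + system + user)[:16]` can be sketched as follows. The exact concatenation and the UTF-8 encoding are assumptions, and the `prompt_hash` function name is hypothetical, so hashes produced here may not match those stored by the runner.

```python
import hashlib


def prompt_hash(model: str, system: str, user: str) -> str:
    """First 16 hex characters of the SHA-256 of the concatenated prompt parts."""
    # Assumption: plain string concatenation, encoded as UTF-8.
    digest = hashlib.sha256((model + system + user).encode("utf-8")).hexdigest()
    return digest[:16]


if __name__ == "__main__":
    a = prompt_hash("example/model-a", "You rate images.", "Rate valence 1-7.")
    b = prompt_hash("example/model-a", "You rate images carefully.", "Rate valence 1-7.")
    print(a != b)  # any change to the prompt text yields a different hash
```

Comparing this column across two runs is a quick way to confirm whether they were sent the same prompt.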
Resumption semantics
`upsert_run` stores a canonical config hash alongside every run record and checks it on every subsequent invocation of the same `run_id`. If the hash doesn’t match, the runner raises an error rather than silently mixing results from two different experiment configurations.
The hash is computed as `sha256(model_dump(exclude={name, max_concurrency, request_timeout_s, max_retries, samples_per_image}))[:16]`. Because those five fields are excluded, you can safely change them between invocations of the same run without invalidating earlier results.
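
A minimal sketch of this check, assuming the config is a plain dict serialised with `json.dumps` (the runner itself uses `model_dump`, so the exact bytes hashed will differ). `config_hash` and `check_resume` are hypothetical names, and the config values are made up for illustration.

```python
import hashlib
import json

# Fields excluded from the hash, as listed in the text.
EXCLUDED = {"name", "max_concurrency", "request_timeout_s", "max_retries", "samples_per_image"}


def config_hash(config: dict) -> str:
    """Hash every field that defines the experiment, ignoring tunable ones."""
    kept = {k: v for k, v in config.items() if k not in EXCLUDED}
    canonical = json.dumps(kept, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def check_resume(stored_hash, config: dict) -> None:
    """Raise if a run is resumed with a config that changes the experiment."""
    if stored_hash is not None and stored_hash != config_hash(config):
        raise ValueError("config hash mismatch: start a new run instead")


if __name__ == "__main__":
    base = {"name": "my-run", "model": "example/model-a", "image_set": "full_900",
            "dimensions": ["valence", "arousal"], "max_concurrency": 4,
            "samples_per_image": 5}
    stored = config_hash(base)
    check_resume(stored, {**base, "max_concurrency": 16})  # accepted
    try:
        check_resume(stored, {**base, "model": "example/model-b"})
    except ValueError as exc:
        print(exc)
```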
What you CAN change between resumes:
- `max_concurrency`: tune parallelism up or down freely.
- `request_timeout_s`: extend the per-call timeout if you’re seeing timeouts on large images.
- `samples_per_image`: increase from 5 to 20 and only the new samples will be enqueued; already-completed samples are skipped by the anti-join in `enqueue_trials`.

What you CANNOT change between resumes:
- `model`: changing the model changes what the experiment measures.
- Prompt overrides (`system_prompt_override`, `format_hint_suffix`): these affect what the model sees.
- `dimensions`: adding or removing valence/arousal changes the scope of the experiment.
- `image_set`: changing the image set changes the population being rated.
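
The effect of raising `samples_per_image` can be illustrated with an anti-join written as a set difference. This is a sketch only: `missing_trials` is a hypothetical helper, and the real `enqueue_trials` performs the equivalent step against the database.

```python
from itertools import product


def missing_trials(image_ids, dimensions, samples_per_image, existing):
    """Return the (image, dimension, sample_idx) keys that are not enqueued yet."""
    wanted = product(image_ids, dimensions, range(samples_per_image))
    return [key for key in wanted if key not in existing]


if __name__ == "__main__":
    images = ["I1", "I2"]
    dims = ["valence", "arousal"]
    # First invocation: 5 samples per image, nothing enqueued yet.
    existing = set(missing_trials(images, dims, 5, set()))
    # Resume with 20 samples: only sample_idx 5..19 are new.
    new = missing_trials(images, dims, 20, existing)
    print(len(existing), len(new))
```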
Exporting results
Once a run is complete, you have three export options from the CLI.
Export raw trials
Write all completed trials for a run to a CSV file.
The export contains the rows of the `trials` table where `status='done'`, including `rating`, `reasoning`, `cost_usd`, `latency_ms`, `input_tokens`, and `output_tokens`.
Generate paper-style plots
Produce a summary PDF and plots in the style of a research paper.
The output directory (`outputs/paper_plots/<run_id>`) receives:
- Valence and arousal distribution plots
- LLM vs human scatter plots
- A summary JSON with image count, samples per image, and human-overlap correlations (if human norms are available)
Generate participant-style dataset
Reconstruct a wide participant-style CSV that mirrors the original OASIS data format, where each row represents one “pseudo-participant” that rated a fixed number of images.
| Flag | Default | Description |
|---|---|---|
| `--out-dir` | `outputs/participant_dataset/<run_id>` | Directory to write the CSV and per-image plots. |
| `--images-per-participant` | 20 | Number of images assigned to each pseudo-participant row. |
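
One way to picture the reshaping is to group ratings by `sample_idx` and cut each group into blocks of `--images-per-participant` images. The tool's actual assignment of images to rows is not described here, so the chunking rule below is an assumption and `pseudo_participants` is a hypothetical helper.

```python
def pseudo_participants(ratings, images_per_participant=20):
    """Turn {(image_id, sample_idx): rating} into wide rows of {image_id: rating}.

    Assumption: images are sorted by ID within each sample_idx and split into
    consecutive blocks; each block becomes one pseudo-participant row.
    """
    by_sample = {}
    for (image_id, sample_idx), rating in ratings.items():
        by_sample.setdefault(sample_idx, []).append((image_id, rating))

    rows = []
    for sample_idx in sorted(by_sample):
        items = sorted(by_sample[sample_idx])
        for start in range(0, len(items), images_per_participant):
            rows.append(dict(items[start:start + images_per_participant]))
    return rows


if __name__ == "__main__":
    # 40 images rated twice each give 4 rows of 20 images.
    ratings = {(f"I{i:03d}", s): 4 for i in range(40) for s in range(2)}
    rows = pseudo_participants(ratings, images_per_participant=20)
    print(len(rows), len(rows[0]))
```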
Rows in the participant-style dataset are reconstructed from `sample_idx` values, not real participant sessions. They are structurally compatible with analysis scripts written for the original OASIS format but do not represent independent participants.
Dataset management
Datasets are named, versioned subsets of the 900 OASIS images. They let you define a reusable stimulus pool and attach it to multiple experiment configs without re-specifying the image list each time. Datasets progress through a lifecycle: draft → approved → archived.
Editing a draft dataset
Before approving, you can refine the image list.
The `generate` command supports three sampling strategies:
| Strategy | Behaviour |
|---|---|
| `stratified` (default) | Proportionally allocates images across OASIS categories (Animal, Scene, Person, Object). |
| `uniform` | Draws images uniformly at random from the full pool. |
| `all` | Selects all 900 images (equivalent to `full_900`). |
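
Proportional allocation, as in the `stratified` strategy, can be sketched with a largest-remainder rule. This illustrates the idea only; the rounding rule, the seed handling, and the `stratified_sample` name are assumptions rather than the tool's documented behaviour.

```python
import random
from collections import defaultdict


def stratified_sample(categories, n, seed=0):
    """Pick n image IDs so each category's share matches its share of the pool.

    categories maps image_id -> category name.
    """
    total = len(categories)
    if n > total:
        raise ValueError("cannot sample more images than the pool contains")

    by_cat = defaultdict(list)
    for image_id, cat in sorted(categories.items()):
        by_cat[cat].append(image_id)

    # Integer floor of each quota, then hand out the leftover slots to the
    # categories with the largest remainders (largest-remainder method).
    alloc = {c: n * len(ids) // total for c, ids in by_cat.items()}
    remainder = {c: n * len(ids) % total for c, ids in by_cat.items()}
    leftover = n - sum(alloc.values())
    for c in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        alloc[c] += 1

    rng = random.Random(seed)
    picked = []
    for c in sorted(by_cat):
        picked.extend(rng.sample(by_cat[c], alloc[c]))
    return picked
```

With a pool split 40/30/20/10 across four categories, asking for 10 images yields 4, 3, 2, and 1 images respectively.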
Once a dataset is approved, you cannot add or exclude images. Duplicate the dataset to a new draft if you need to modify it.
Experiment management
An experiment groups multiple run configs against a shared dataset, letting you compare several models or parameter combinations in one organised unit. Each config in the experiment gets its own `run_id` and runs sequentially.
Experiment YAML format
oasis-llm experiment run executes each config sequentially. If one config is already complete (all trials done), it is skipped automatically. You can re-run the command after a failure and it resumes from where it left off, because each underlying run is idempotent.
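
The skip-and-resume behaviour can be sketched as a plain loop. The helper names `run_experiment`, `is_complete`, and `run_one` are hypothetical; in the real tool the completeness check would be a query over the run's trials.

```python
def run_experiment(run_ids, is_complete, run_one):
    """Run each config in order, skipping runs whose trials are all done."""
    executed = []
    for run_id in run_ids:
        if is_complete(run_id):
            continue  # already finished in an earlier invocation
        run_one(run_id)  # resumable: picks up where the run left off
        executed.append(run_id)
    return executed


if __name__ == "__main__":
    finished = {"model-a"}
    ran = run_experiment(
        ["model-a", "model-b", "model-c"],
        is_complete=lambda run_id: run_id in finished,
        run_one=finished.add,
    )
    print(ran)  # model-a is skipped; a second call would run nothing
```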