When you run oasis-llm run configs/runs/my-run.yaml, a precise sequence of steps unfolds: the CLI reads your config, creates or resumes a run record in DuckDB, enqueues every (image, dimension, sample) combination that hasn’t been attempted yet, then drains those trials through an async worker pool that calls your LLM provider and writes results back to the database. Understanding this lifecycle helps you reason about resumption, cost, and failure recovery.
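The enqueue step can be pictured as a set difference between the trials the config asks for and the trials already stored. The sketch below is only an illustration of that idea: `missing_trials` is a hypothetical helper, not part of oasis-llm, and it assumes `sample_idx` counts up from 0.

```python
from itertools import product


def missing_trials(image_ids, dimensions, samples_per_image, existing):
    """Return every (image_id, dimension, sample_idx) key not already in `existing`.

    Hypothetical helper; assumes sample_idx runs from 0 to samples_per_image - 1.
    """
    wanted = product(image_ids, dimensions, range(samples_per_image))
    return [key for key in wanted if key not in existing]


images = ["I1", "I2"]
dims = ["valence", "arousal"]

# First invocation: nothing is stored yet, so all 2 * 2 * 2 = 8 keys are enqueued.
first = missing_trials(images, dims, 2, existing=set())

# Second invocation with samples_per_image raised to 3:
# only the keys with sample_idx == 2 are new.
second = missing_trials(images, dims, 3, existing=set(first))
```

Because the second call only returns keys that are not yet stored, re-running the same command never duplicates work.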

End-to-end sequence

[Diagram: a full run traced from your terminal command through to the final status update.]

Key invariants

Three guarantees hold throughout every run:
  • Atomic claim. _claim_one executes under a single asyncio.Lock, so no two workers in the same process can grab the same trial. The runner is single-process by design — there is no cross-process locking mechanism.
  • Stale-claim recovery. Before claiming a new trial, _claim_one resets any trial whose status='running' and claimed_at < now() - 10 min back to pending. If a worker is killed mid-flight, the next invocation of the runner automatically reclaims those trials after at most 10 minutes.
  • Retry budget. A trial is eligible to be claimed again while status='pending' OR (status='failed' AND attempts < 3). Every call to _record_result increments attempts regardless of whether the trial succeeded or failed.
The ordering inside _claim_one is ORDER BY attempts, sample_idx, image_id, dimension LIMIT 1, so fresh trials (attempts = 0) drain before retries. Within an attempt tier, lower sample_idx values go first.
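To make the three invariants and the ordering concrete, here is a minimal in-memory sketch. It is not the real `_claim_one`, which runs against DuckDB; the list of dicts, the `claim_one` and `eligible` names, and the `demo` data are all assumptions made for illustration.

```python
import asyncio
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=10)  # stale-claim cutoff described above
MAX_ATTEMPTS = 3                     # retry budget described above


def eligible(trial):
    """pending, or failed with retry budget left."""
    return trial["status"] == "pending" or (
        trial["status"] == "failed" and trial["attempts"] < MAX_ATTEMPTS
    )


async def claim_one(trials, lock, now):
    """Claim the next trial, or return None if nothing is eligible."""
    async with lock:  # atomic claim: one worker at a time within this process
        # Stale-claim recovery: reset long-running claims back to pending.
        for t in trials:
            stale = t["claimed_at"] is not None and t["claimed_at"] < now - STALE_AFTER
            if t["status"] == "running" and stale:
                t["status"] = "pending"
        candidates = [t for t in trials if eligible(t)]
        if not candidates:
            return None
        # Same ordering as ORDER BY attempts, sample_idx, image_id, dimension LIMIT 1.
        chosen = min(
            candidates,
            key=lambda t: (t["attempts"], t["sample_idx"], t["image_id"], t["dimension"]),
        )
        chosen["status"] = "running"
        chosen["claimed_at"] = now
        return chosen


def demo():
    now = datetime(2024, 1, 1, 12, 0)

    def trial(image_id, sample_idx, status, attempts, claimed_at=None):
        return {"image_id": image_id, "dimension": "valence", "sample_idx": sample_idx,
                "status": status, "attempts": attempts, "claimed_at": claimed_at}

    trials = [
        trial("I1", 0, "failed", 1),                                # retry tier
        trial("I2", 1, "pending", 0),                               # fresh, later sample
        trial("I3", 0, "running", 0, now - timedelta(minutes=11)),  # stale claim
        trial("I4", 0, "failed", 3),                                # budget exhausted
    ]

    async def run():
        lock = asyncio.Lock()
        claimed = []
        for _ in range(4):
            claimed.append(await claim_one(trials, lock, now))
        return claimed

    return asyncio.run(run())
```

In this demo the stale trial I3 is recovered and claimed first, then the fresh trial I2, then the retry I1. I4 is never claimed because it has used all three attempts.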

Trial lifecycle

Every trial moves through a defined set of states. [Diagram: all valid transitions, including the stale-recovery path and the two terminal states.] A trial reaches a terminal state when it is either done, or failed with attempts >= 3. There is no automatic promotion out of a terminal state; to re-run a finished trial you must start a new run with a different name.
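The lifecycle can also be written down as a small transition table. The sketch below encodes the transitions implied by the claim, stale-recovery, and retry rules on this page; `TRANSITIONS`, `is_terminal`, and `can_transition` are illustrative names, not functions from the oasis-llm codebase.

```python
MAX_ATTEMPTS = 3

# Status transitions implied by the rules described on this page.
TRANSITIONS = {
    ("pending", "running"),  # a worker claims the trial
    ("running", "done"),     # result recorded successfully
    ("running", "failed"),   # result recorded with an error
    ("running", "pending"),  # stale-claim recovery after 10 minutes
    ("failed", "running"),   # retry, only while attempts < MAX_ATTEMPTS
}


def is_terminal(status, attempts):
    """done, or failed with the retry budget exhausted."""
    return status == "done" or (status == "failed" and attempts >= MAX_ATTEMPTS)


def can_transition(old, new, attempts):
    """True if a trial with `attempts` recorded may move from `old` to `new`."""
    if is_terminal(old, attempts):
        return False
    return (old, new) in TRANSITIONS
```

For example, `can_transition("failed", "running", 2)` is true, while the same call with `attempts=3` is false because the trial is already terminal.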

Trial record

Every row in the trials table is identified by the composite key (run_id, image_id, dimension, sample_idx). The table below documents every column you may encounter when querying results directly.
| Column | Type | Meaning |
| --- | --- | --- |
| status | TEXT | pending / running / done / failed |
| rating | INTEGER | Parsed 1–7 rating; NULL on failure |
| raw_response | TEXT | Verbatim model output |
| reasoning | TEXT | Parsed JSON reasoning field; NULL if absent or not captured |
| prompt_hash | TEXT | sha256(model + system + user)[:16]; lets you detect prompt-version changes across runs |
| latency_ms | INTEGER | Wall-clock time from the start of _call_model |
| input_tokens | INTEGER | Token count from the provider usage block |
| output_tokens | INTEGER | Token count from the provider usage block |
| cost_usd | DOUBLE | LiteLLM cost estimate; falls back to OpenRouter native usage.cost |
| error | TEXT | Error message; NULL on success |
| finish_reason | TEXT | Provider’s finish reason (e.g. stop, length) |
| response_id | TEXT | Provider’s response ID, useful for correlating with provider-side trace logs and dashboards |
| attempts | INTEGER | Incremented by every call to _record_result, whether success or failure |
| claimed_at | TIMESTAMP | Set when a worker claims the trial; used for stale-recovery cutoff |
| completed_at | TIMESTAMP | Set when _record_result writes the outcome; useful for latency analysis |
Query cost_usd and latency_ms across a finished run to estimate per-model pricing and throughput before scaling up samples_per_image.
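A summary query along these lines might look like the sketch below. The SQL uses only the columns documented above. DuckDB is not in Python's standard library, so the demo runs the query against an in-memory sqlite3 database as a stand-in, with a cut-down table that has only the four columns the query needs. `summarize`, `projected_cost`, and the sample rows are made up for illustration.

```python
import sqlite3

SUMMARY_SQL = """
SELECT COUNT(*)        AS n_done,
       SUM(cost_usd)   AS total_cost_usd,
       AVG(cost_usd)   AS mean_cost_usd,
       AVG(latency_ms) AS mean_latency_ms
FROM trials
WHERE run_id = ? AND status = 'done'
"""

FIELDS = ("n_done", "total_cost_usd", "mean_cost_usd", "mean_latency_ms")


def summarize(conn, run_id):
    """Aggregate cost and latency over the completed trials of one run."""
    row = conn.execute(SUMMARY_SQL, (run_id,)).fetchone()
    return dict(zip(FIELDS, row))


def projected_cost(mean_cost_usd, n_images, n_dimensions, samples_per_image):
    """Rough estimate: one trial per (image, dimension, sample) at the observed mean cost."""
    return mean_cost_usd * n_images * n_dimensions * samples_per_image


def demo():
    # Stand-in database with made-up rows; the real trials table lives in DuckDB.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trials (run_id TEXT, status TEXT, cost_usd DOUBLE, latency_ms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO trials VALUES (?, ?, ?, ?)",
        [
            ("my-run", "done", 0.5, 1000),
            ("my-run", "done", 0.25, 3000),
            ("my-run", "failed", None, 500),  # excluded by the status filter
        ],
    )
    return summarize(conn, "my-run")
```

The projection ignores retries and failed trials, so treat it as a lower bound rather than a budget.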

Resumption semantics

upsert_run stores a canonical config hash alongside every run record and checks it on every subsequent invocation of the same run_id. If the hash doesn’t match, the runner raises an error rather than silently mixing results from two different experiment configurations:
if existing[0] != cfg_hash:
    raise RuntimeError(
        f"Run '{run_id}' exists with different config hash "
        f"(stored={existing[0]}, new={cfg_hash}). Use a new --name or --new-run."
    )
The hash is computed as sha256(model_dump(exclude={name, max_concurrency, request_timeout_s, max_retries, samples_per_image}))[:16]. Because those five fields are excluded, you can safely change them between invocations of the same run without invalidating earlier results.
What you CAN change between resumes:
  • max_concurrency — tune parallelism up or down freely.
  • request_timeout_s — extend the per-call timeout if you’re seeing timeouts on large images.
  • samples_per_image — increase from 5 to 20 and only the new samples will be enqueued; already-completed samples are skipped by the anti-join in enqueue_trials.
What you CANNOT silently change:
  • model — changing the model changes what the experiment measures.
  • Prompt overrides (system_prompt_override, format_hint_suffix) — these affect what the model sees.
  • dimensions — adding or removing valence/arousal changes the scope of the experiment.
  • image_set — changing the image set changes the population being rated.
Attempting to resume a run after changing any of the above fields will raise a RuntimeError. Create a new run with a different name instead.
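The hashing rule can be approximated with the standard library alone. The sketch below treats the config as a plain dict and serializes it with sorted-key JSON; the real code hashes a model_dump, and its exact canonical serialization is not documented here, so both of those choices are assumptions. `config_hash` and `check_resume` are illustrative names.

```python
import hashlib
import json

# The five fields excluded from the hash, as listed above.
EXCLUDED = {"name", "max_concurrency", "request_timeout_s", "max_retries", "samples_per_image"}


def config_hash(cfg: dict) -> str:
    """Hash every field that defines the experiment, ignoring the excluded ones."""
    kept = {k: v for k, v in cfg.items() if k not in EXCLUDED}
    # Assumption: sorted-key JSON stands in for the canonical serialization.
    canonical = json.dumps(kept, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def check_resume(run_id: str, stored_hash: str, cfg: dict) -> None:
    """Raise if the stored hash no longer matches the config being resumed."""
    new_hash = config_hash(cfg)
    if stored_hash != new_hash:
        raise RuntimeError(
            f"Run '{run_id}' exists with different config hash "
            f"(stored={stored_hash}, new={new_hash})."
        )


base = {
    "name": "my-run",
    "model": "google/gemma-4-31b-it",
    "dimensions": ["valence", "arousal"],
    "max_concurrency": 4,
    "samples_per_image": 5,
}
stored = config_hash(base)

# Tuning excluded fields keeps the hash, so resuming is allowed.
check_resume("my-run", stored, {**base, "max_concurrency": 16, "samples_per_image": 20})
```

Changing `model` or `dimensions` in the same dict produces a different hash, and `check_resume` then raises.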

Exporting results

Once a run is complete, you have three export options from the CLI.

Export raw trials

Write all completed trials for a run to a CSV file:
oasis-llm export <run_id> outputs/<run_id>.csv
The output contains every row from the trials table where status='done', including rating, reasoning, cost_usd, latency_ms, input_tokens, and output_tokens.
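Once exported, the CSV can be analysed with the standard library. The sketch below averages ratings per image and dimension. It assumes the export has a header row with image_id, dimension, and rating columns named as in the trials table; `mean_ratings` and the sample rows in `demo` are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_ratings(csv_file):
    """Return {(image_id, dimension): mean rating} from an open CSV file object."""
    buckets = defaultdict(list)
    for row in csv.DictReader(csv_file):
        if row["rating"]:  # skip rows with an empty rating cell
            buckets[(row["image_id"], row["dimension"])].append(int(row["rating"]))
    return {key: mean(values) for key, values in buckets.items()}


def demo():
    # Made-up sample in the assumed export layout.
    sample = io.StringIO(
        "image_id,dimension,sample_idx,rating\n"
        "I1,valence,0,4\n"
        "I1,valence,1,6\n"
        "I1,arousal,0,2\n"
    )
    return mean_ratings(sample)
```

To use it on a real export, pass a handle opened with `open("outputs/<run_id>.csv", newline="")`.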

Generate paper-style plots

Produce a summary PDF and plots in the style of a research paper:
oasis-llm paper-plots <run_id>
oasis-llm paper-plots <run_id> --out-dir outputs/paper_plots
The output directory (default outputs/paper_plots/<run_id>) receives:
  • Valence and arousal distribution plots
  • LLM vs human scatter plots
  • A summary JSON with image count, samples per image, and human-overlap correlations (if human norms are available)

Generate participant-style dataset

Reconstruct a wide participant-style CSV that mirrors the original OASIS data format, where each row represents one “pseudo-participant” that rated a fixed number of images:
oasis-llm participant-dataset <run_id>
oasis-llm participant-dataset <run_id> --out-dir outputs/participant_dataset --images-per-participant 20
| Flag | Default | Description |
| --- | --- | --- |
| --out-dir | outputs/participant_dataset/<run_id> | Directory to write the CSV and per-image plots. |
| --images-per-participant | 20 | Number of images assigned to each pseudo-participant row. |
Rows in the participant-style dataset are reconstructed from sample_idx values, not real participant sessions. They are structurally compatible with analysis scripts written for the original OASIS format but do not represent independent participants.
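As a rough picture of what reconstruction from sample_idx could look like, the sketch below groups ratings by sample_idx and cuts each group into fixed-size chunks of images. This is one plausible scheme, not necessarily what the participant-dataset command does; the function name, the input shape, and the participant labels are all hypothetical.

```python
def pseudo_participants(ratings, images_per_participant=20):
    """Group ratings into pseudo-participant rows.

    `ratings` maps (image_id, sample_idx) -> rating for a single dimension.
    Hypothetical scheme: each row holds up to `images_per_participant` images
    that share one sample_idx.
    """
    by_sample = {}
    for (image_id, sample_idx), rating in ratings.items():
        by_sample.setdefault(sample_idx, {})[image_id] = rating

    rows = []
    for sample_idx in sorted(by_sample):
        image_ids = sorted(by_sample[sample_idx])
        for start in range(0, len(image_ids), images_per_participant):
            chunk = image_ids[start:start + images_per_participant]
            rows.append({
                "participant": f"s{sample_idx}-p{start // images_per_participant}",
                "ratings": {i: by_sample[sample_idx][i] for i in chunk},
            })
    return rows


# Made-up ratings: 4 images x 2 samples.
example = {(f"I{i}", s): 4 for i in range(1, 5) for s in range(2)}
rows = pseudo_participants(example, images_per_participant=2)
```

Every row produced this way draws on the same model and prompt, which is why the rows should not be treated as independent participants.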

Dataset management

Datasets are named, versioned subsets of the 900 OASIS images. They let you define a reusable stimulus pool and attach it to multiple experiment configs without re-specifying the image list each time. Datasets progress through a lifecycle: draft → approved → archived.
oasis-llm dataset list                        # list all datasets with status and image counts
oasis-llm dataset generate "my-set" --n 30   # create a 30-image stratified draft
oasis-llm dataset show <dataset_id>           # inspect images and metadata
oasis-llm dataset approve <dataset_id>        # lock as immutable (approved)
oasis-llm dataset archive <dataset_id>        # mark as archived (no longer active)

Editing a draft dataset

Before approving, you can refine the image list:
oasis-llm dataset exclude <dataset_id> <image_id> --note "reason"  # exclude an image
oasis-llm dataset include <dataset_id> <image_id>                   # re-include excluded image
oasis-llm dataset add <dataset_id> <image_id>                       # add a new image
oasis-llm dataset duplicate <dataset_id> "new-name"                 # clone to a new draft
oasis-llm dataset delete <dataset_id>                               # delete (built-ins protected)
The generate command supports three sampling strategies:
| Strategy | Behaviour |
| --- | --- |
| stratified (default) | Proportionally allocates images across OASIS categories (Animal, Scene, Person, Object). |
| uniform | Draws images uniformly at random from the full pool. |
| all | Selects all 900 images (equivalent to full_900). |
oasis-llm dataset generate "arousal-focus" --n 50 --strategy uniform --seed 99
Once a dataset is approved, you cannot add or exclude images. Duplicate the dataset to a new draft if you need to modify it.
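Proportional allocation for the stratified strategy can be sketched with a largest-remainder rule followed by a seeded draw within each category. The rounding rule, the function names, and the category sizes used in the usage note are assumptions for illustration; the generate command may round or sample differently.

```python
import random


def stratified_allocation(category_sizes, n):
    """Split n picks across categories in proportion to their sizes.

    Uses largest-remainder rounding (an assumption, not confirmed behaviour).
    """
    total = sum(category_sizes.values())
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the pool size")
    exact = {c: n * size / total for c, size in category_sizes.items()}
    alloc = {c: int(quota) for c, quota in exact.items()}  # floor of each quota
    leftover = n - sum(alloc.values())
    # Hand the remaining picks to the categories with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: (exact[c] - alloc[c], c), reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


def draw_stratified(images_by_category, n, seed):
    """Draw n image ids, allocated proportionally, reproducibly for a given seed."""
    rng = random.Random(seed)
    sizes = {c: len(ids) for c, ids in images_by_category.items()}
    alloc = stratified_allocation(sizes, n)
    picked = []
    for category in sorted(images_by_category):
        picked.extend(rng.sample(sorted(images_by_category[category]), alloc[category]))
    return picked
```

With made-up category sizes of 10, 20, 40, and 30 images for Animal, Object, Person, and Scene, asking for 7 images allocates 1, 1, 3, and 2 respectively. Passing the same seed twice returns the same selection.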

Experiment management

An experiment groups multiple run configs against a shared dataset, letting you compare several models or parameter combinations in one organised unit. Each config in the experiment gets its own run_id and runs sequentially.
oasis-llm experiment list                      # list all experiments with status
oasis-llm experiment create experiment.yaml    # create from a YAML definition
oasis-llm experiment show <experiment_id>      # per-config progress and cost
oasis-llm experiment run <experiment_id>       # execute all configs sequentially
oasis-llm experiment delete <experiment_id>    # delete experiment and all its trials

Experiment YAML format

name: compare-models-pilot30
dataset: my-dataset-id
description: "Compare Gemma 4 vs Qwen3 on pilot_30"
configs:
  - name: gemma4-pilot30
    provider: openrouter
    model: google/gemma-4-31b-it
    modality: vision
    dimensions: [valence, arousal]
    samples_per_image: 5
    max_concurrency: 4
    capture_reasoning: true

  - name: qwen3-pilot30
    provider: openrouter
    model: qwen/qwen3-vl-7b
    modality: vision
    dimensions: [valence, arousal]
    samples_per_image: 5
    max_concurrency: 4
    capture_reasoning: true
oasis-llm experiment run executes each config sequentially. If one config is already complete (all trials done), it is skipped automatically. You can re-run the command after a failure and it resumes from where it left off, because each underlying run is idempotent.
Use oasis-llm experiment show <id> to check per-config progress and cost mid-run. The table shows done/total and cumulative cost_usd for each config.
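The skip-and-resume behaviour amounts to a plain sequential loop. In the sketch below, `is_complete` and `run_one` are hypothetical callables standing in for the completeness check and the per-config runner; neither is an oasis-llm API.

```python
def run_experiment(configs, is_complete, run_one):
    """Run configs in order, skipping any whose trials are all done.

    `is_complete(name)` and `run_one(cfg)` are hypothetical stand-ins.
    Returns (executed, skipped) as lists of config names.
    """
    executed, skipped = [], []
    for cfg in configs:
        if is_complete(cfg["name"]):
            skipped.append(cfg["name"])
            continue
        run_one(cfg)  # an exception here stops the loop; rerun to resume
        executed.append(cfg["name"])
    return executed, skipped


def demo():
    configs = [{"name": "gemma4-pilot30"}, {"name": "qwen3-pilot30"}]
    finished = {"gemma4-pilot30"}  # pretend the first config completed earlier

    def run_one(cfg):
        finished.add(cfg["name"])

    first_pass = run_experiment(configs, finished.__contains__, run_one)
    second_pass = run_experiment(configs, finished.__contains__, run_one)
    return first_pass, second_pass
```

On the first pass only the unfinished config runs. On the second pass both configs are skipped, which is what makes re-running the command after a failure safe.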