multi-table · cross-session

Skill development eval workspace

One logbook, one job — improving this skill across iterations — structured as several connected tables rather than one. The container has a single address, single owner, single lifetime. Individual tables are sub-addresses within it.

Shape

Multi-table

Storage

SQLite (or directory of CSVs)

Owner

Skill author

Lifetime

Active while skill is being improved

Join keys

eval_id, run_id, iteration_id

Corrections

Mixed — see below

Tables

evals

Column	Meaning
eval_id	Stable identifier. Patch-in-place.
prompt	Input prompt given to the skill.
assertions	List of checks the output must satisfy.

runs

Column	Meaning
run_id	One run = one (`eval_id`, `iteration_id`) pair.
eval_id	FK → evals.
iteration_id	FK → iterations.
assertion_results	Per-assertion pass/fail + evidence.
duration_ms	Wall-clock time.
token_count	Total model tokens consumed.
transcript_ref	Path to the full transcript file.

iterations

Column	Meaning
iteration_id	Sequential, one per skill revision.
skill_version	Git SHA or version tag.
pass_rate	Computed aggregate across runs.
avg_tokens	Mean token count per run.
notes	Short prose from the author about what changed.

feedback

Column	Meaning
feedback_id	Sequential.
run_id	FK → runs.
author	Human reviewer identity.
comment	Freeform observation.
timestamp	When written.

Common queries

-- Which evals regressed between iteration 7 and 8?
SELECT r1.eval_id, r1.assertion_results AS prev, r2.assertion_results AS curr
FROM runs r1 JOIN runs r2 USING (eval_id)
WHERE r1.iteration_id = 7 AND r2.iteration_id = 8
  AND r1.assertion_results != r2.assertion_results;

-- Pass rate trend
SELECT iteration_id, pass_rate FROM iterations ORDER BY iteration_id;

-- Runs with substantive human feedback
SELECT r.run_id, f.comment FROM runs r JOIN feedback f USING (run_id);

Actions

Trigger next iteration. The most recent iteration's failing runs become the input context for the next improvement pass. The next iteration reads by path, not by re-running the whole eval.

Generate report. A concise diff across iterations: pass-rate trend, token-cost trend, top failing evals with linked transcripts.

Visualize. Pass rate over iteration_id as a line chart, grouped by eval category.

Why this is still one logbook

Shared job (improving the skill), shared owner, shared lifetime, shared address (the workspace directory). Tables are sub-addresses; joins live in tools, not in mental overhead. When any of those four diverges — different lifetime, different owner — split into two cooperating logbooks. Don't build a universal workspace schema.

← Previous example Retro collector Next example Background agent log →