multi-table · cross-session

Skill development eval workspace

One logbook, one job — improving this skill across iterations — structured as several connected tables rather than one. The container has a single address, single owner, single lifetime. Individual tables are sub-addresses within it.

Shape
Multi-table
Storage
SQLite (or directory of CSVs)
Owner
Skill author
Lifetime
Active while skill is being improved
Join keys
eval_id, run_id, iteration_id
Corrections
Mixed — see below

Tables

evals

ColumnMeaning
eval_idStable identifier. Patch-in-place.
promptInput prompt given to the skill.
assertionsList of checks the output must satisfy.

runs

ColumnMeaning
run_idOne run = one (eval_id, iteration_id) pair.
eval_idFK → evals.
iteration_idFK → iterations.
assertion_resultsPer-assertion pass/fail + evidence.
duration_msWall-clock time.
token_countTotal model tokens consumed.
transcript_refPath to the full transcript file.

iterations

ColumnMeaning
iteration_idSequential, one per skill revision.
skill_versionGit SHA or version tag.
pass_rateComputed aggregate across runs.
avg_tokensMean token count per run.
notesShort prose from the author about what changed.

feedback

ColumnMeaning
feedback_idSequential.
run_idFK → runs.
authorHuman reviewer identity.
commentFreeform observation.
timestampWhen written.

Common queries

-- Which evals regressed between iteration 7 and 8?
SELECT r1.eval_id, r1.assertion_results AS prev, r2.assertion_results AS curr
FROM runs r1 JOIN runs r2 USING (eval_id)
WHERE r1.iteration_id = 7 AND r2.iteration_id = 8
  AND r1.assertion_results != r2.assertion_results;

-- Pass rate trend
SELECT iteration_id, pass_rate FROM iterations ORDER BY iteration_id;

-- Runs with substantive human feedback
SELECT r.run_id, f.comment FROM runs r JOIN feedback f USING (run_id);

Actions

Trigger next iteration. The most recent iteration's failing runs become the input context for the next improvement pass. The next iteration reads by path, not by re-running the whole eval.

Generate report. A concise diff across iterations: pass-rate trend, token-cost trend, top failing evals with linked transcripts.

Visualize. Pass rate over iteration_id as a line chart, grouped by eval category.

Why this is still one logbook

Shared job (improving the skill), shared owner, shared lifetime, shared address (the workspace directory). Tables are sub-addresses; joins live in tools, not in mental overhead. When any of those four diverges — different lifetime, different owner — split into two cooperating logbooks. Don't build a universal workspace schema.