Skill development eval workspace
One logbook, one job — improving this skill across iterations — structured as several connected tables rather than one. The container has a single address, single owner, single lifetime. Individual tables are sub-addresses within it.
eval_id, run_id, iteration_idTables
evals
| Column | Meaning |
|---|---|
| eval_id | Stable identifier. Patch-in-place. |
| prompt | Input prompt given to the skill. |
| assertions | List of checks the output must satisfy. |
runs
| Column | Meaning |
|---|---|
| run_id | One run = one (eval_id, iteration_id) pair. |
| eval_id | FK → evals. |
| iteration_id | FK → iterations. |
| assertion_results | Per-assertion pass/fail + evidence. |
| duration_ms | Wall-clock time. |
| token_count | Total model tokens consumed. |
| transcript_ref | Path to the full transcript file. |
iterations
| Column | Meaning |
|---|---|
| iteration_id | Sequential, one per skill revision. |
| skill_version | Git SHA or version tag. |
| pass_rate | Computed aggregate across runs. |
| avg_tokens | Mean token count per run. |
| notes | Short prose from the author about what changed. |
feedback
| Column | Meaning |
|---|---|
| feedback_id | Sequential. |
| run_id | FK → runs. |
| author | Human reviewer identity. |
| comment | Freeform observation. |
| timestamp | When written. |
Common queries
-- Which evals regressed between iteration 7 and 8?
SELECT r1.eval_id, r1.assertion_results AS prev, r2.assertion_results AS curr
FROM runs r1 JOIN runs r2 USING (eval_id)
WHERE r1.iteration_id = 7 AND r2.iteration_id = 8
AND r1.assertion_results != r2.assertion_results;
-- Pass rate trend
SELECT iteration_id, pass_rate FROM iterations ORDER BY iteration_id;
-- Runs with substantive human feedback
SELECT r.run_id, f.comment FROM runs r JOIN feedback f USING (run_id);
Actions
Trigger next iteration. The most recent iteration's failing runs become the input context for the next improvement pass. The next iteration reads by path, not by re-running the whole eval.
Generate report. A concise diff across iterations: pass-rate trend, token-cost trend, top failing evals with linked transcripts.
Visualize. Pass rate over iteration_id as a line chart, grouped by eval category.
Shared job (improving the skill), shared owner, shared lifetime, shared address (the workspace directory). Tables are sub-addresses; joins live in tools, not in mental overhead. When any of those four diverges — different lifetime, different owner — split into two cooperating logbooks. Don't build a universal workspace schema.