---
name: summarize-experiment
description: Create a lightweight summary of experiment results from a completed (fine-tuned and evaluated) experiment. Use after run-experiment to capture key metrics from the experiment in textual form.
---

# Summarize Experiment

Generate a `summary.md` file capturing key metrics from a completed experiment. Think R's `summary()` for experiment results.
## Your Task

Create a lightweight summary of experiment results:

- Parse run status from `experiment_summary.yaml`
- Extract final training loss from SLURM stdout
- Extract accuracy from inspect-ai `.eval` files
- Generate `summary.md` in the experiment directory
- Log the process in `logs/summarize-experiment.log`
## Prerequisites

- `experiment_summary.yaml` exists
- At least some runs have completed (partial results acceptable)
- run-experiment has been executed (or manual SLURM jobs run)
- Conda environment activated: the `parse_eval_log.py` script requires inspect-ai. Activate the conda environment from `claude.local.md` before running extraction commands.
## Workflow

### 1. Locate Experiment

Find the experiment directory (see the sketch below):

- If in an experiment directory (contains `experiment_summary.yaml`): use the current directory
- Otherwise: ask the user for the path
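A minimal sketch of this check, assuming only the standard library; `locate_experiment` is a hypothetical helper, not part of cruijff_kit:

```python
from pathlib import Path


def locate_experiment(start: Path | None = None) -> Path | None:
    """Return the experiment directory if `start` (default: cwd) contains
    experiment_summary.yaml, else None so the caller can ask the user."""
    start = start or Path.cwd()
    if (start / "experiment_summary.yaml").is_file():
        return start
    return None
```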
### 2. Parse Run Status

Read `experiment_summary.yaml` to identify runs.

From the `runs:` section:

- `name`: Run identifier
- `type`: "fine-tuned" or "control"
- `model`: Model name
- `parameters`: Dict of hyperparameters (empty for control runs)

From the `evaluation.matrix:` section:

- `run`: Run name
- `tasks`: List of evaluation task names
- `epochs`: List of epochs to evaluate (null for control runs)

Determine status by checking the filesystem (see the sketch below):

- Fine-tuning: check for `{output_base}/{run_name}/artifacts/` and SLURM outputs
- Evaluation: check for `{run_dir}/eval/logs/*.eval` files
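A minimal sketch of the status check. The `runs` layout follows the description above; exposing the Output section as `output.output_dir_base` is an assumption, and `determine_status` is a hypothetical helper:

```python
import glob
from pathlib import Path

import yaml


def determine_status(experiment_dir: Path) -> dict:
    """Map each run name to whether fine-tuning and evaluation artifacts exist."""
    summary = yaml.safe_load((experiment_dir / "experiment_summary.yaml").read_text())
    output_base = Path(summary["output"]["output_dir_base"])  # assumed key path
    status = {}
    for run in summary["runs"]:
        name = run["name"]
        status[name] = {
            "fine_tuning": (output_base / name / "artifacts").is_dir(),
            "evaluation": bool(
                glob.glob(str(experiment_dir / name / "eval" / "logs" / "*.eval"))
            ),
        }
    return status
```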
### 3. Extract Training Loss

For each COMPLETED fine-tuning run:

1. Find the SLURM stdout in the output directory:
   - Parse the experiment_summary.yaml "Output" section for `output_dir_base`
   - Look in: `{output_dir_base}/{run_name}/artifacts/slurm-*.out`
   - If there are multiple files, use the most recent by modification time
2. Extract the final loss using `cruijff_kit.tools.torchtune.extract_loss`:

   ```python
   from cruijff_kit.tools.torchtune.extract_loss import final_loss

   result = final_loss(slurm_text)  # returns (epoch, step, loss) or None
   ```

   - The canonical regex and helpers live in `src/tools/torchtune/extract_loss.py`
   - `final_loss()` returns the last match (epoch, step, loss) or None
   - `extract_losses()` returns all matches as a list
   - The step number from the last match is the total training steps
3. Record: run_name, final_loss, total_steps, epoch, step

Note: Training SLURM outputs are in the output directory, NOT the run directory.

If SLURM stdout is missing:

- Log a warning
- Record "N/A" for loss
- Continue with other runs

A minimal end-to-end sketch of this step follows.
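The sketch below wires the pieces together; `find_final_loss` is a hypothetical wrapper and the directory layout is the one described above:

```python
from pathlib import Path

from cruijff_kit.tools.torchtune.extract_loss import final_loss


def find_final_loss(output_dir_base: Path, run_name: str):
    """Return (epoch, step, loss) from the newest slurm-*.out, or None."""
    outs = sorted(
        (output_dir_base / run_name / "artifacts").glob("slurm-*.out"),
        key=lambda p: p.stat().st_mtime,
    )
    if not outs:
        return None  # caller logs a warning and records "N/A"
    return final_loss(outs[-1].read_text())
```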
### 4. Extract Evaluation Accuracy

For each COMPLETED evaluation:

1. Find `.eval` files: `{run_dir}/eval/logs/*.eval`
2. For each `.eval` file, run:

   ```bash
   python -m cruijff_kit.tools.inspect.parse_eval_log {path}
   ```
3. Parse the JSON output for accuracy
4. Map to epoch using SLURM job names (see below)
5. For binary tasks, also run `summary_binary.py` to get balanced accuracy and F1
6. Record: run_name, task, epoch, accuracy, balanced_accuracy, f1, samples
Script output format:

```json
{
  "status": "success",
  "task": "capitalization",
  "accuracy": 0.85,
  "samples": 100,
  "scorer": "exact_match",
  "model": "..."
}
```
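A minimal sketch of invoking the script and parsing that output, assuming the JSON shape above; run it inside the conda environment from `claude.local.md` so inspect-ai is available:

```python
import json
import subprocess
import sys


def parse_eval(path: str) -> dict:
    """Run parse_eval_log on one .eval file and return its parsed JSON."""
    proc = subprocess.run(
        [sys.executable, "-m", "cruijff_kit.tools.inspect.parse_eval_log", path],
        capture_output=True,
        text=True,
    )
    result = json.loads(proc.stdout)
    if result.get("status") != "success":
        # Caller logs the message and records "ERROR" for accuracy.
        raise RuntimeError(result.get("message", "unknown error"))
    return result
```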
#### Mapping Epochs via SLURM Job Names

The `.eval` files don't currently store epoch information directly. To reliably map each evaluation to its epoch (a Python sketch follows at the end of this subsection):

1. Find SLURM output files in the eval directory: `{run_dir}/eval/slurm-*.out`
2. Extract job IDs from filenames (e.g., `slurm-2773062.out` → job ID 2773062)
3. Query job names via sacct:

   ```bash
   sacct -j {job_ids} --format=JobID,JobName%50
   ```
4. Parse the epoch from the job name. scaffold-inspect names jobs like `eval-{task}-{run}-ep{N}`:
   - `eval-general_eval-lowlr-ep0` → epoch 0
   - `eval-general_eval-lowlr-ep9` → epoch 9
5. Extract accuracy from the SLURM output:

   ```bash
   grep -oP 'match/accuracy: \K[0-9.]+' slurm-{jobid}.out
   ```

Example workflow:

```bash
# Get job names for all eval jobs
sacct -j 2773062,2773063,2773065 --format=JobID,JobName%50

# Output shows epoch in job name:
# 2773062  eval-general_eval-lowlr-ep0
# 2773063  eval-general_eval-lowlr-ep1
# 2773065  eval-general_eval-lowlr-ep2
```

This approach is reliable because:

- Job names are set by scaffold-inspect and include epoch info
- It works regardless of submission order or timing
- It survives job failures and resubmissions
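A minimal sketch of the mapping, assuming the `eval-{task}-{run}-ep{N}` naming convention above; `map_epochs` is a hypothetical helper:

```python
import re
import subprocess
from pathlib import Path


def map_epochs(eval_dir: Path) -> dict[str, int]:
    """Map each SLURM job ID found in eval_dir to the epoch in its job name."""
    job_ids = [
        m.group(1)
        for p in eval_dir.glob("slurm-*.out")
        if (m := re.fullmatch(r"slurm-(\d+)\.out", p.name))
    ]
    out = subprocess.run(
        ["sacct", "-j", ",".join(job_ids), "--format=JobID,JobName%50"],
        capture_output=True,
        text=True,
    ).stdout
    epochs = {}
    for line in out.splitlines():
        m = re.match(r"\s*(\d+)\s+\S+-ep(\d+)", line)
        if m:  # skips header lines and .batch/.extern sub-steps
            epochs[m.group(1)] = int(m.group(2))
    return epochs
```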
If extraction fails:

- The script returns `{"status": "error", "message": "..."}`
- Log the error
- Record "ERROR" for accuracy
- Continue with other evaluations
#### Computing Balanced Accuracy and F1 (Binary Classification)

For binary classification tasks (0/1 targets), use `summary_binary.py` to compute additional metrics:

```bash
python -m cruijff_kit.tools.inspect.summary_binary {path_to_eval_file} --json
```
JSON output format:

```json
{
  "status": "success",
  "path": "/path/to/file.eval",
  "samples": 100,
  "accuracy": 0.85,
  "balanced_accuracy": 0.83,
  "f1": 0.82,
  "precision_1": 0.80,
  "recall_1": 0.84,
  "recall_0": 0.82,
  "confusion_matrix": {"tp": 42, "tn": 43, "fp": 7, "fn": 8, "other": 0}
}
```
Why these metrics matter for imbalanced data (see the worked sketch below):

- Balanced Accuracy = (Recall_0 + Recall_1) / 2, so it is not inflated by the majority class
- F1 Score = harmonic mean of precision and recall, which penalizes class imbalance

Note: For non-binary tasks, only accuracy is reported (Bal. Acc and F1 shown as "-").
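A worked sketch of both formulas, computed from a confusion matrix in the shape `summary_binary.py` reports (the numbers below are illustrative):

```python
def balanced_accuracy_and_f1(cm: dict) -> tuple[float, float]:
    """Compute the two metrics above from a {tp, tn, fp, fn} dict."""
    tp, tn, fp, fn = cm["tp"], cm["tn"], cm["fp"], cm["fn"]
    recall_1 = tp / (tp + fn)      # recall on the positive class
    recall_0 = tn / (tn + fp)      # recall on the negative class
    precision_1 = tp / (tp + fp)
    balanced_acc = (recall_0 + recall_1) / 2
    f1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
    return balanced_acc, f1


print(balanced_accuracy_and_f1({"tp": 42, "tn": 43, "fp": 7, "fn": 8}))
# prints approximately (0.85, 0.848)
```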
### 5. Generate summary.md

Create `{experiment_dir}/summary.md` with the following structure:

```markdown
# Experiment Summary
**Experiment:** `{experiment_name}` | **Generated:** {timestamp} | **Status:** {X}/{Y} complete
## Run Status
| Run | Type | Fine-tuning | Evaluation |
|-----|------|-------------|------------|
| rank4_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| rank8_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| base_model | Control | N/A | COMPLETED |
## Training Results
| Run | Final Loss | Total Steps | Epochs | Duration |
|-----|------------|-------------|--------|----------|
| rank4_lr1e-5 | 0.234 | 250 | 2 | 8m 15s |
| rank8_lr1e-5 | 0.198 | 250 | 2 | 9m 02s |
**Notes:**
- Base model runs have no training loss (control)
- Duration from SLURM elapsed time (if available)
## Evaluation Results
| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |
|-----|------|-------|----------|----------|------|---------|
| rank4_lr1e-5 | capitalization | 0 | 0.85 | 0.83 | 0.82 | 100 |
| rank4_lr1e-5 | capitalization | 1 | 0.88 | 0.86 | 0.85 | 100 |
| rank8_lr1e-5 | capitalization | 0 | 0.82 | 0.80 | 0.78 | 100 |
| rank8_lr1e-5 | capitalization | 1 | 0.91 | 0.89 | 0.88 | 100 |
| base_model | capitalization | - | 0.45 | 0.50 | 0.31 | 100 |
**Best performing:** rank8_lr1e-5 (epoch 1) with 89% balanced accuracy
```

#### Datasets Used

Include a `## Datasets Used` section recording the literal file that backed each training run and each evaluation. For generated-text experiments these are the `{condition}_{split}_{hash8}.json` files scaffold resolved from `data.data_generation`; for standard experiments this is `data.training.path` and any per-task `dataset`/`eval_condition`.

Read from (in order of preference):

1. `{experiment_dir}/{run_name}/setup_finetune.yaml` → `input_dir_base` + `dataset_label` + `dataset_ext` = training dataset path for that run (see the sketch below). This is authoritative: it is what torchtune actually loaded.
2. `{experiment_dir}/{run_name}/eval/eval_config.yaml` (or the inspect `.slurm` script's `DATA_PATH`) → test dataset path for that evaluation.
3. `logs/scaffold-torchtune.log` and `logs/scaffold-inspect.log`, which include `resolve_dataset_path` entries recording the exact path scaffold chose.

| Run | Training dataset | Eval dataset(s) |
|-----|------------------|-----------------|
| rank4_lr1e-5 | `.../ck-data/generated/dict_subset_train_f18f10eb.json` | `.../dict_subset_test_f18f10eb.json` |
| rank8_lr1e-5 | `.../ck-data/generated/dict_full_train_d064ec15.json` | `.../dict_full_test_d064ec15.json` |

If the dataset file has a `.meta.json` sidecar (generated by convert-tabular-to-text), the sidecar contains the full canonical `data_generation` config that produced it; cite its `config_hash_short` in the table so the entire generation provenance is captured.
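A minimal sketch of reading option 1, assuming the three keys sit at the top level of `setup_finetune.yaml` and that `dataset_ext` includes the leading dot:

```python
from pathlib import Path

import yaml


def training_dataset_path(run_dir: Path) -> Path:
    """Reconstruct the dataset file torchtune actually loaded for this run."""
    cfg = yaml.safe_load((run_dir / "setup_finetune.yaml").read_text())
    return Path(cfg["input_dir_base"]) / f"{cfg['dataset_label']}{cfg['dataset_ext']}"
```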
The remaining template sections:

```markdown
## Incomplete Runs
| Run | Stage | Status | Notes |
|-----|-------|--------|-------|
| rank16_lr1e-5 | Fine-tuning | FAILED | Check slurm-12345.out |
## Next Steps
1. View detailed evaluation results: `inspect view --port=$(get_free_port)`
2. Export raw data: `inspect log export {run_dir}/eval/logs/*.eval --format csv`
3. Full analysis: `analyze-experiment`
---
*Generated by summarize-experiment skill*
```
### 6. Create Log

Document the process in `{experiment_dir}/logs/summarize-experiment.log`.

See logging.md for action types and format.
## Error Handling

### If SLURM stdout is missing

- Log a warning with action type `EXTRACT_LOSS`
- Record "N/A" for loss in the summary
- Continue with other runs

### If a .eval file cannot be parsed

- Log the error with the file path
- Record "ERROR" for accuracy in the summary
- Continue with other evaluations

### If all runs failed

- Generate a summary noting all failures
- Include failure states in the "Incomplete Runs" section
- Suggest troubleshooting steps

### If results are partial

- Generate the summary with available data
- Clearly indicate which runs are missing in the "Incomplete Runs" section
- Still identify the best performing run from the available data
## Idempotency

Running summarize-experiment multiple times overwrites `summary.md`. This is intentional:

- It allows re-running after fixing failed runs
- The summary always reflects the current state
## Output Files

```text
{experiment_dir}/
├── summary.md                     # Human-readable summary (new)
└── logs/
    └── summarize-experiment.log   # Process log (new)
```
## Relationship to Other Skills

- **After:** run-experiment (or manual execution)
- **Before:** analyze-experiment
- **Optional hook:** run-experiment can invoke this at completion