summarize-experiment

SKILL.md Frontmatter

name: summarize-experiment
description: Create a lightweight summary of experiment results from a completed (fine-tuned and evaluated) experiment. Use after run-experiment to capture key metrics from the experiment in textual form.

Summarize Experiment

Generate a summary.md file capturing key metrics from a completed experiment. Think R's summary() for experiment results.

Your Task

Create a lightweight summary of experiment results:

  1. Parse run status from experiment_summary.yaml
  2. Extract final training loss from SLURM stdout
  3. Extract accuracy from inspect-ai .eval files
  4. Generate summary.md in experiment directory
  5. Log the process in logs/summarize-experiment.log

Prerequisites

  • experiment_summary.yaml exists
  • At least some runs have completed (partial results acceptable)
  • run-experiment has been executed (or manual SLURM jobs run)
  • Conda environment activated - The parse_eval_log.py script requires inspect-ai. Activate the conda environment from claude.local.md before running extraction commands.

Workflow

1. Locate Experiment

Find the experiment directory:

  • If in an experiment directory (contains experiment_summary.yaml): use current directory
  • Otherwise: ask user for path
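
A minimal sketch of this check (the helper name locate_experiment is illustrative):

from pathlib import Path

def locate_experiment() -> Path | None:
    """Use the current directory if it holds experiment_summary.yaml."""
    cwd = Path.cwd()
    if (cwd / "experiment_summary.yaml").exists():
        return cwd
    return None  # otherwise, ask the user for the experiment path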

2. Parse Run Status

Read experiment_summary.yaml to identify runs:

From runs: section:

  • name: Run identifier
  • type: "fine-tuned" or "control"
  • model: Model name
  • parameters: Dict of hyperparameters (empty for control runs)

From evaluation.matrix: section:

  • run: Run name
  • tasks: List of evaluation task names
  • epochs: List of epochs to evaluate (null for control runs)
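
A sketch of reading both sections, assuming PyYAML and the nesting implied by the names above (runs: at the top level, matrix: under evaluation:):

import yaml

with open("experiment_summary.yaml") as f:
    summary = yaml.safe_load(f)

for run in summary["runs"]:                        # name, type, model, parameters
    print(run["name"], run["type"], run.get("parameters") or {})

for entry in summary["evaluation"]["matrix"]:      # run, tasks, epochs
    print(entry["run"], entry["tasks"], entry["epochs"])  # epochs is null for controls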

Determine status by checking filesystem:

  • Fine-tuning: Check for {output_dir_base}/{run_name}/artifacts/ and SLURM outputs
  • Evaluation: Check for {run_dir}/eval/logs/*.eval files
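
A sketch of these checks (boolean helpers are illustrative; distinguishing FAILED from still-running requires inspecting the SLURM output itself):

from pathlib import Path

def finetune_done(output_dir_base: str, run_name: str) -> bool:
    """Artifacts directory and at least one SLURM stdout file exist."""
    artifacts = Path(output_dir_base) / run_name / "artifacts"
    return artifacts.is_dir() and bool(list(artifacts.glob("slurm-*.out")))

def eval_done(run_dir: str) -> bool:
    """At least one .eval file has been written."""
    return bool(list(Path(run_dir).glob("eval/logs/*.eval")))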

3. Extract Training Loss

For each COMPLETED fine-tuning run:

  1. Find SLURM stdout in the output directory:
    • Parse experiment_summary.yaml "Output" section for output_dir_base
    • Look in: {output_dir_base}/{run_name}/artifacts/slurm-*.out
    • If multiple files, use most recent by modification time
  2. Extract final loss using cruijff_kit.tools.torchtune.extract_loss:
    from cruijff_kit.tools.torchtune.extract_loss import final_loss
    result = final_loss(slurm_text)  # returns (epoch, step, loss) or None
    
    • The canonical regex and helpers live in src/tools/torchtune/extract_loss.py
    • final_loss() returns the last match (epoch, step, loss) or None
    • extract_losses() returns all matches as a list
    • The step number from the last match is the total training steps
  3. Record: run_name, final_loss, total_steps, epoch, step

Note: Training SLURM outputs are in the output directory, NOT the run directory.

If SLURM stdout missing:

  • Log warning
  • Record "N/A" for loss
  • Continue with other runs
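
Putting this together, a sketch (the helper name training_result is illustrative):

from pathlib import Path
from cruijff_kit.tools.torchtune.extract_loss import final_loss

def training_result(output_dir_base: str, run_name: str):
    """Return (epoch, step, loss) from the newest SLURM stdout, or None."""
    artifacts = Path(output_dir_base) / run_name / "artifacts"
    outs = sorted(artifacts.glob("slurm-*.out"), key=lambda p: p.stat().st_mtime)
    if not outs:
        return None  # SLURM stdout missing: log a warning, record "N/A"
    return final_loss(outs[-1].read_text())  # last (epoch, step, loss) match, or None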

4. Extract Evaluation Accuracy

For each COMPLETED evaluation:

  1. Find .eval files: {run_dir}/eval/logs/*.eval
  2. For each .eval file, run:
    python -m cruijff_kit.tools.inspect.parse_eval_log {path}
    
  3. Parse JSON output for accuracy
  4. Map to epoch using SLURM job names (see below)
  5. For binary tasks, also run summary_binary.py to get balanced accuracy and F1
  6. Record: run_name, task, epoch, accuracy, balanced_accuracy, f1, samples

Script output format:

{
  "status": "success",
  "task": "capitalization",
  "accuracy": 0.85,
  "samples": 100,
  "scorer": "exact_match",
  "model": "..."
}
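
A sketch of invoking the script and handling its JSON contract (the wrapper name parse_eval is illustrative):

import json
import subprocess

def parse_eval(path: str) -> dict:
    """Run parse_eval_log on one .eval file and decode its JSON output."""
    proc = subprocess.run(
        ["python", "-m", "cruijff_kit.tools.inspect.parse_eval_log", path],
        capture_output=True, text=True,
    )
    result = json.loads(proc.stdout)
    # on failure the script returns {"status": "error", "message": "..."};
    # log the message, record "ERROR" for accuracy, and continue
    return result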

Mapping Epochs via SLURM Job Names

The .eval files don't currently store epoch information directly. To reliably map each evaluation to its epoch:

  1. Find SLURM output files in the eval directory: {run_dir}/eval/slurm-*.out
  2. Extract job IDs from filenames (e.g., slurm-2773062.out → job ID 2773062)
  3. Query job names via sacct:
    sacct -j {job_ids} --format=JobID,JobName%50
    
  4. Parse epoch from job name - scaffold-inspect names jobs like eval-{task}-{run}-ep{N}:
    • eval-general_eval-lowlr-ep0 → epoch 0
    • eval-general_eval-lowlr-ep9 → epoch 9
  5. Extract accuracy from SLURM output:
    grep -oP 'match/accuracy: \K[0-9.]+' slurm-{jobid}.out
    

Example workflow:

# Get job names for all eval jobs
sacct -j 2773062,2773063,2773065 --format=JobID,JobName%50

# Output shows epoch in job name:
# 2773062  eval-general_eval-lowlr-ep0
# 2773063  eval-general_eval-lowlr-ep1
# 2773065  eval-general_eval-lowlr-ep2

This approach is reliable because:

  • Job names are set by scaffold-inspect and include epoch info
  • Works regardless of submission order or timing
  • Survives job failures and resubmissions
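
A sketch of the mapping (the regexes assume the filename and job-name patterns shown above):

import re
import subprocess
from pathlib import Path

def epochs_by_job(run_dir: str) -> dict[str, int]:
    """Map SLURM job IDs to epochs parsed from scaffold-inspect job names."""
    job_ids = [m.group(1) for p in Path(run_dir).glob("eval/slurm-*.out")
               if (m := re.match(r"slurm-(\d+)\.out", p.name))]
    if not job_ids:
        return {}
    out = subprocess.run(
        ["sacct", "-j", ",".join(job_ids), "--format=JobID,JobName%50"],
        capture_output=True, text=True,
    ).stdout
    mapping = {}
    for line in out.splitlines():
        m = re.match(r"\s*(\d+)\s+\S+-ep(\d+)\b", line)  # e.g. eval-general_eval-lowlr-ep0
        if m:
            mapping[m.group(1)] = int(m.group(2))
    return mapping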

If extraction fails:

  • Script returns {"status": "error", "message": "..."}
  • Log the error
  • Record "ERROR" for accuracy
  • Continue with other evaluations

Computing Balanced Accuracy and F1 (Binary Classification)

For binary classification tasks (0/1 targets), use summary_binary.py to compute additional metrics:

python -m cruijff_kit.tools.inspect.summary_binary {path_to_eval_file} --json

JSON output format:

{
  "status": "success",
  "path": "/path/to/file.eval",
  "samples": 100,
  "accuracy": 0.85,
  "balanced_accuracy": 0.85,
  "f1": 0.85,
  "precision_1": 0.86,
  "recall_1": 0.84,
  "recall_0": 0.86,
  "confusion_matrix": {"tp": 42, "tn": 43, "fp": 7, "fn": 8, "other": 0}
}

Why these metrics matter for imbalanced data:

  • Balanced Accuracy = (Recall_0 + Recall_1) / 2 — not inflated by majority class
  • F1 Score = harmonic mean of precision and recall — penalizes class imbalance

Note: For non-binary tasks, only accuracy is reported (Bal. Acc and F1 shown as "-").
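
As a sanity check, the headline metrics can be recomputed from the confusion matrix alone (values from the example output above):

cm = {"tp": 42, "tn": 43, "fp": 7, "fn": 8}
recall_1 = cm["tp"] / (cm["tp"] + cm["fn"])                  # 42/50 = 0.84
recall_0 = cm["tn"] / (cm["tn"] + cm["fp"])                  # 43/50 = 0.86
precision_1 = cm["tp"] / (cm["tp"] + cm["fp"])               # 42/49 ≈ 0.86
balanced_accuracy = (recall_0 + recall_1) / 2                # 0.85
f1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # ≈ 0.85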

5. Generate summary.md

Create {experiment_dir}/summary.md with the following structure:

# Experiment Summary

**Experiment:** `{experiment_name}` | **Generated:** {timestamp} | **Status:** {X}/{Y} complete

## Run Status

| Run | Type | Fine-tuning | Evaluation |
|-----|------|-------------|------------|
| rank4_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| rank8_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| base_model | Control | N/A | COMPLETED |

## Training Results

| Run | Final Loss | Total Steps | Epochs | Duration |
|-----|------------|-------------|--------|----------|
| rank4_lr1e-5 | 0.234 | 250 | 2 | 8m 15s |
| rank8_lr1e-5 | 0.198 | 250 | 2 | 9m 02s |

**Notes:**
- Base model runs have no training loss (control)
- Duration from SLURM elapsed time (if available)

## Evaluation Results

| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |
|-----|------|-------|----------|----------|------|---------|
| rank4_lr1e-5 | capitalization | 0 | 0.85 | 0.83 | 0.82 | 100 |
| rank4_lr1e-5 | capitalization | 1 | 0.88 | 0.86 | 0.85 | 100 |
| rank8_lr1e-5 | capitalization | 0 | 0.82 | 0.80 | 0.78 | 100 |
| rank8_lr1e-5 | capitalization | 1 | 0.91 | 0.89 | 0.88 | 100 |
| base_model | capitalization | - | 0.45 | 0.50 | 0.31 | 100 |

**Best performing:** rank8_lr1e-5 (epoch 1) with 89% balanced accuracy

## Datasets Used

Record the literal file that backed each training run and each evaluation. For generated-text experiments these are the `{condition}_{split}_{hash8}.json` files that scaffold resolved from `data.data_generation`; for standard experiments this is `data.training.path` plus any per-task `dataset`/`eval_condition`.

Read from (in order of preference):

1. `{experiment_dir}/{run_name}/setup_finetune.yaml` → `input_dir_base` + `dataset_label` + `dataset_ext` = training dataset path for that run. This is authoritative — it's what torchtune actually loaded.
2. `{experiment_dir}/{run_name}/eval/eval_config.yaml` (or the inspect `.slurm` script's `DATA_PATH`) → test dataset path for that evaluation.
3. `logs/scaffold-torchtune.log` and `logs/scaffold-inspect.log` — include `resolve_dataset_path` entries recording the exact path scaffold chose.

| Run | Training dataset | Eval dataset(s) |
|-----|------------------|-----------------|
| rank4_lr1e-5 | `.../ck-data/generated/dict_subset_train_f18f10eb.json` | `.../dict_subset_test_f18f10eb.json` |
| rank8_lr1e-5 | `.../ck-data/generated/dict_full_train_d064ec15.json`   | `.../dict_full_test_d064ec15.json`   |

If the dataset file has a `.meta.json` sidecar (generated by `convert-tabular-to-text`), the sidecar contains the full canonical `data_generation` config that produced it — cite its `config_hash_short` in the table so the entire generation provenance is captured.

## Incomplete Runs

| Run | Stage | Status | Notes |
|-----|-------|--------|-------|
| rank16_lr1e-5 | Fine-tuning | FAILED | Check slurm-12345.out |

## Next Steps

1. View detailed evaluation results: `inspect view --port=$(get_free_port)`
2. Export raw data: `inspect log export {run_dir}/eval/logs/*.eval --format csv`
3. Full analysis: `analyze-experiment`

---
*Generated by summarize-experiment skill*
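
A sketch of emitting the file, assuming evaluation rows were collected as dicts earlier in the workflow (the field names here are illustrative):

from datetime import datetime
from pathlib import Path

def write_summary(experiment_dir: str, name: str, eval_rows: list[dict]) -> None:
    """Render the evaluation table and write summary.md, overwriting any old copy."""
    lines = [
        "# Experiment Summary", "",
        f"**Experiment:** `{name}` | **Generated:** {datetime.now():%Y-%m-%d %H:%M}", "",
        "## Evaluation Results", "",
        "| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |",
        "|-----|------|-------|----------|----------|----|---------|",
    ]
    for r in eval_rows:
        lines.append(f"| {r['run']} | {r['task']} | {r['epoch']} | {r['accuracy']} "
                     f"| {r['bal_acc']} | {r['f1']} | {r['samples']} |")
    Path(experiment_dir, "summary.md").write_text("\n".join(lines) + "\n")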

6. Create Log

Document the process in {experiment_dir}/logs/summarize-experiment.log.

See logging.md for action types and format.

Error Handling

If SLURM stdout missing

  • Log warning with action type EXTRACT_LOSS
  • Record "N/A" for loss in summary
  • Continue with other runs

If .eval file cannot be parsed

  • Log error with file path
  • Record "ERROR" for accuracy in summary
  • Continue with other evaluations

If all runs failed

  • Generate summary noting all failures
  • Include failure states in "Incomplete Runs" section
  • Suggest troubleshooting steps

If partial results

  • Generate summary with available data
  • Clearly indicate which runs are missing in "Incomplete Runs" section
  • Still identify best performing run from available data

Idempotency

Running summarize-experiment multiple times overwrites summary.md. This is intentional:

  • Allows re-running after fixing failed runs
  • Summary always reflects current state

Output Files

{experiment_dir}/
├── summary.md                    # Human-readable summary (new)
└── logs/
    └── summarize-experiment.log  # Process log (new)

Relationship to Other Skills

  • After: run-experiment (or manual execution)
  • Before: analyze-experiment
  • Optional hook: run-experiment can invoke this at completion