Назад към всички

Fine-Tuning

// Fine-tune LLMs with data preparation, provider selection, cost estimation, evaluation, and compliance checks.

$ git log --oneline --stat
stars:1,933
forks:367
updated:March 4, 2026
SKILL.mdreadonly
SKILL.md Frontmatter
nameFine-Tuning
slugfine-tuning
descriptionFine-tune LLMs with data preparation, provider selection, cost estimation, evaluation, and compliance checks.

When to Use

User wants to fine-tune a language model, evaluate if fine-tuning is worth it, or debug training issues.

Quick Reference

TopicFile
Provider comparison & pricingproviders.md
Data preparation & validationdata-prep.md
Training configurationtraining.md
Evaluation & debuggingevaluation.md
Cost estimation & ROIcosts.md
Compliance & securitycompliance.md

Core Capabilities

  1. Decide fit — Analyze if fine-tuning beats prompting for the use case
  2. Prepare data — Convert raw data to JSONL, deduplicate, validate format
  3. Select provider — Compare OpenAI, Anthropic (Bedrock), Google, open source based on constraints
  4. Estimate costs — Calculate training cost, inference savings, break-even point
  5. Configure training — Set hyperparameters (learning rate, epochs, LoRA rank)
  6. Run evaluation — Compare fine-tuned vs base model on task-specific metrics
  7. Debug failures — Diagnose loss curves, overfitting, catastrophic forgetting
  8. Handle compliance — Scan for PII, configure on-premise training, generate audit logs

Decision Checklist

Before recommending fine-tuning, ask:

  • What's the failure mode with prompting? (format, style, knowledge, cost)
  • How many training examples available? (minimum 50-100)
  • Expected inference volume? (affects ROI calculation)
  • Privacy constraints? (determines provider options)
  • Budget for training + ongoing inference?

Fine-Tune vs Prompt Decision

SignalRecommendation
Format/style inconsistencyFine-tune ✓
Missing domain knowledgeRAG first, then fine-tune if needed
High inference volume (>100K/mo)Fine-tune for cost savings
Requirements change frequentlyStick with prompting
<50 quality examplesPrompting + few-shot

Critical Rules

  • Data quality > quantity — 100 great examples beat 1000 noisy ones
  • LoRA first — Never jump to full fine-tuning; LoRA is 10-100x cheaper
  • Hold out eval set — Always 80/10/10 split; never peek at test data
  • Same precision — Train and serve at identical precision (4-bit, 16-bit)
  • Baseline first — Run eval on base model before training to measure actual improvement
  • Expect iteration — First attempt rarely optimal; plan for 2-3 cycles

Common Pitfalls

MistakeFix
Training on inconsistent dataManual review of 100+ samples before training
Learning rate too highStart with 2e-4 for SFT, 5e-6 for RLHF
Expecting new knowledgeFine-tuning adjusts behavior, not knowledge — use RAG
No baseline comparisonAlways test base model on same eval set
Ignoring forgettingMix 20% general data to preserve capabilities