
Text

Transform, format, and process text with patterns for writing, data cleaning, localization, citations, and copywriting.

Stars: 1,933 · Forks: 367 · Updated: March 4, 2026
SKILL.md Frontmatter
name: Text
description: Transform, format, and process text with patterns for writing, data cleaning, localization, citations, and copywriting.

Quick Reference

| Task | Load |
| --- | --- |
| Creative writing (voice, dialogue, POV) | writing.md |
| Data processing (CSV, regex, encoding) | data.md |
| Academic/citations (APA, MLA, Chicago) | academic.md |
| Marketing copy (headlines, CTA, email) | copy.md |
| Translation/localization | localization.md |

Universal Text Rules

Encoding

  • Always verify encoding first: file -bi document.txt
  • Normalize line endings: tr -d '\r'
  • Remove BOM if present: sed -i '1s/^\xEF\xBB\xBF//'
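The three steps above chain into a single cleanup pipeline (a sketch; `input.txt` and `clean.txt` are placeholder filenames, GNU sed assumed for the `\x` escapes):

```shell
# Strip a leading BOM, drop carriage returns, write a clean copy
# (filenames are placeholders; GNU sed assumed)
sed '1s/^\xEF\xBB\xBF//' input.txt | tr -d '\r' > clean.txt
```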

Whitespace

  • Collapse multiple spaces: sed 's/[[:space:]]\+/ /g'
  • Trim leading/trailing: sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
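Both sed expressions combine into one pass; a quick sanity check (GNU sed assumed, since `\+` is a GNU extension):

```shell
# Collapse internal runs first, then trim the single space left at each end
echo "  hello    world  " | sed 's/[[:space:]]\+/ /g; s/^ //; s/ $//'
# → hello world
```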

Common Traps

  • Smart quotes (“ ”) break parsers → normalize to "
  • Em/en dashes (— –) break ASCII → normalize to -
  • Zero-width chars invisible but break comparisons → strip them
  • String length ≠ byte length in UTF-8 ("café" = 4 chars, 5 bytes)
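The last trap is easy to verify on the command line: `wc -c` counts bytes while `wc -m` counts characters (a UTF-8 locale is assumed for `wc -m`):

```shell
printf 'café' | wc -c   # 5 bytes (é is 2 bytes in UTF-8)
printf 'café' | wc -m   # 4 characters in a UTF-8 locale
```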

Format Detection

# Detect encoding
file -I document.txt

# Detect line endings
cat -A document.txt | head -1
# ^M at end = Windows (CRLF)
# No ^M = Unix (LF)

# Detect delimiter (CSV/TSV)
head -1 file | tr -cd ',;\t|' | wc -c
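The raw count above only says that delimiter characters exist; to see which candidate dominates, count each one separately (a sketch; `file.csv` is a placeholder filename):

```shell
# awk's gsub() returns the number of replacements, i.e. the delimiter count
head -1 file.csv | awk '{
  print "comma",     gsub(/,/, ",")
  print "semicolon", gsub(/;/, ";")
  print "tab",       gsub(/\t/, "\t")
  print "pipe",      gsub(/\|/, "|")
}'
```

The candidate with the highest count is usually the delimiter.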

Quick Transformations

| Task | Command |
| --- | --- |
| Lowercase | tr '[:upper:]' '[:lower:]' |
| Remove punctuation | tr -d '[:punct:]' |
| Count words | wc -w |
| Count unique lines | sort -u \| wc -l |
| Find duplicates | sort \| uniq -d |
| Extract emails | grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' |
| Extract URLs | grep -oE 'https?://[^[:space:]<>"{}\|]+' |

Before Processing Checklist

  • Encoding verified (UTF-8?)
  • Line endings normalized
  • Delimiter identified (for structured text)
  • Target format/style defined
  • Edge cases considered (empty, Unicode, special chars)
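The first three checklist items can be scripted as a preflight pass over a file (a sketch; POSIX sh assumed, and the checks are heuristics, not guarantees):

```shell
#!/bin/sh
# Preflight: report encoding, CRLF line endings, and a leading BOM
f="$1"
file -bi "$f"
grep -q "$(printf '\r')" "$f" && echo "warning: CRLF line endings"
head -c 3 "$f" | od -An -tx1 | grep -q 'ef bb bf' && echo "warning: BOM present"
exit 0
```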