format-specific-extraction
updated:March 4, 2026
priority: high
Format-Specific Extraction Workflows
Office XML (DOCX/PPTX/ODT)
ZIP archive → Security validation → XML parsing → Text + tables + metadata
- `ZipBombValidator::new(limits).validate(&mut archive)?`
- Extract XML files from archive (`word/document.xml`, `ppt/slides/*.xml`, `content.xml`)
- Parse with `quick_xml::Reader` (streaming) + `DepthValidator` + `StringGrowthValidator`
- Extract metadata via `crate::extraction::office_metadata::extract_metadata()`
- See: `extractors/docx.rs`, `extractors/pptx.rs`, `extractors/odt.rs`
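The depth check applied during streaming XML parsing can be sketched as follows. This is a minimal, std-only illustration: `DepthGuard` and `DepthError` are hypothetical names, not the project's `DepthValidator`, and the caller is assumed to invoke `open()` and `close()` from its own start-tag and end-tag events.

```rust
/// Hypothetical depth guard (an assumption, not the project's `DepthValidator`).
#[derive(Debug)]
pub struct DepthGuard {
    depth: usize,
    max_depth: usize,
}

#[derive(Debug, PartialEq)]
pub enum DepthError {
    TooDeep { limit: usize },
    UnbalancedClose,
}

impl DepthGuard {
    pub fn new(max_depth: usize) -> Self {
        Self { depth: 0, max_depth }
    }

    /// Call on every start tag; fails once nesting would exceed the limit.
    pub fn open(&mut self) -> Result<(), DepthError> {
        if self.depth >= self.max_depth {
            return Err(DepthError::TooDeep { limit: self.max_depth });
        }
        self.depth += 1;
        Ok(())
    }

    /// Call on every end tag; fails if there is no matching start tag.
    pub fn close(&mut self) -> Result<(), DepthError> {
        self.depth = self
            .depth
            .checked_sub(1)
            .ok_or(DepthError::UnbalancedClose)?;
        Ok(())
    }

    pub fn depth(&self) -> usize {
        self.depth
    }
}
```

Because the guard holds a single counter, it fits a streaming reader: no tree is built, and a hostile document is rejected as soon as it nests past the limit.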
PDF
Bytes → pdfium-render → Per-page text + OCR fallback → Tables → Metadata
- `pdfium.create_document_from_bytes(content, None)?`
- Check if needs OCR: `config.force_ocr || !has_searchable_text()`
- Extract text per page, tables if `config.pages` enabled
- Feature-gated: `#[cfg(feature = "pdf")]`
- See: `extractors/pdf/mod.rs`
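The OCR decision can be illustrated in isolation. In this sketch `PdfConfig` is a hypothetical stand-in for the real config type, and the definition of "searchable" (at least one page with non-whitespace text) is an assumption; the project's `has_searchable_text()` may use a different heuristic.

```rust
/// Hypothetical config; only the `force_ocr` field is named in the workflow.
pub struct PdfConfig {
    pub force_ocr: bool,
}

/// Assumption: a document is searchable when at least one page yields
/// non-whitespace text from its native text layer.
pub fn has_searchable_text(page_texts: &[String]) -> bool {
    page_texts.iter().any(|t| !t.trim().is_empty())
}

/// Mirrors the check `config.force_ocr || !has_searchable_text()`.
pub fn needs_ocr(config: &PdfConfig, page_texts: &[String]) -> bool {
    config.force_ocr || !has_searchable_text(page_texts)
}
```

With this shape, a scanned PDF whose pages yield only whitespace falls back to OCR, while `force_ocr` overrides the text layer even when one exists.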
Archives (ZIP/TAR/7z/GZIP)
Validate → Extract metadata → Extract plaintext files only
- `ZipBombValidator` BEFORE any extraction
- Extract metadata (file list, sizes)
- Extract text content from plaintext files
- Use `build_archive_result()` helper
- See: `extractors/archive.rs`, `extraction/archive/*.rs`
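To show why validation runs before any extraction, here is a rough sketch of a pre-extraction check over declared entry sizes. `ArchiveLimits`, `EntryInfo`, and `validate_entries` are hypothetical names and the limit fields are assumptions; the project's `ZipBombValidator` may check different properties.

```rust
/// Hypothetical limits; field names and semantics are assumptions.
pub struct ArchiveLimits {
    pub max_entries: usize,
    pub max_total_uncompressed: u64,
    pub max_ratio: u64,
}

/// Sizes as declared in the archive directory, read without decompressing.
pub struct EntryInfo {
    pub compressed: u64,
    pub uncompressed: u64,
}

pub fn validate_entries(entries: &[EntryInfo], limits: &ArchiveLimits) -> Result<(), String> {
    if entries.len() > limits.max_entries {
        return Err(format!("too many entries: {}", entries.len()));
    }
    let mut total: u64 = 0;
    for (i, e) in entries.iter().enumerate() {
        // Compare by multiplication so a zero compressed size cannot divide by zero.
        if e.uncompressed > e.compressed.max(1).saturating_mul(limits.max_ratio) {
            return Err(format!("entry {i}: compression ratio exceeds {}", limits.max_ratio));
        }
        total = total.saturating_add(e.uncompressed);
        if total > limits.max_total_uncompressed {
            return Err(format!(
                "total uncompressed size exceeds {}",
                limits.max_total_uncompressed
            ));
        }
    }
    Ok(())
}
```

Declared sizes can be forged, so a check like this is only a first gate; capping the bytes actually read during decompression is still needed.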
Structured Text (JSON/YAML/TOML/XML)
Detect format from MIME → Parse → Pretty-print → Metadata
A single `StructuredExtractor` handles multiple MIME types. Parse with the format-specific library, then pretty-print to text.
See: `extractors/structured.rs`
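The "detect format from MIME" step might look like the sketch below. `StructuredFormat` and `detect_format` are hypothetical names, and the MIME strings are common spellings chosen for illustration, not the list the project registers.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum StructuredFormat {
    Json,
    Yaml,
    Toml,
    Xml,
}

/// Maps a MIME type to a parser choice. The MIME strings here are assumptions.
pub fn detect_format(mime: &str) -> Option<StructuredFormat> {
    // Drop parameters such as "; charset=utf-8" and normalize case.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match essence.as_str() {
        "application/json" => Some(StructuredFormat::Json),
        "application/yaml" | "application/x-yaml" | "text/yaml" => Some(StructuredFormat::Yaml),
        "application/toml" | "text/toml" => Some(StructuredFormat::Toml),
        "application/xml" | "text/xml" => Some(StructuredFormat::Xml),
        _ => None,
    }
}
```

Dispatching on an enum keeps the single extractor small: each variant selects one parser, and everything after parsing (pretty-printing, metadata) is shared.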
Email (EML/MSG)
Parse headers → Extract body (text/html) → Process attachments
See: `extraction/email.rs`, `extractors/email.rs`
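As a rough, std-only illustration of the first two steps for EML input, the sketch below splits a raw message at the first blank line and unfolds continuation lines. `split_message` is a hypothetical helper; it does not handle multipart bodies, encoded words, or attachments, and nothing here describes how the project's email modules are actually implemented.

```rust
/// Splits a raw message into (headers, body) at the first blank line and
/// unfolds continuation lines (lines that start with a space or tab).
pub fn split_message(raw: &str) -> (Vec<(String, String)>, String) {
    // Normalize CRLF so the blank-line search works on either line ending.
    let normalized = raw.replace("\r\n", "\n");
    let (head, body) = match normalized.split_once("\n\n") {
        Some((h, b)) => (h, b),
        None => (normalized.as_str(), ""),
    };

    let mut headers: Vec<(String, String)> = Vec::new();
    for line in head.lines() {
        if line.starts_with(' ') || line.starts_with('\t') {
            // Folded header: append to the previous header's value.
            if let Some((_, value)) = headers.last_mut() {
                value.push(' ');
                value.push_str(line.trim());
            }
        } else if let Some((name, value)) = line.split_once(':') {
            headers.push((name.trim().to_string(), value.trim().to_string()));
        }
    }
    (headers, body.to_string())
}
```

Headers are kept as an ordered list rather than a map because names such as `Received` can legitimately repeat.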
Common Helpers
| Helper | Location | Purpose |
|---|---|---|
| `office_metadata::extract_metadata()` | `extraction/office.rs` | Office XML metadata |
| `cells_to_markdown()` | `extraction/mod.rs` | Convert cell grid to GFM table |
| `build_archive_result()` | `extraction/archive/mod.rs` | Standard archive result |
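To make the cell-grid-to-GFM conversion concrete, here is a small stand-in. The signature (`&[Vec<String>] -> String`), the treatment of the first row as the header, and the escaping rules are all assumptions; the real `cells_to_markdown()` may differ.

```rust
/// Renders one row, padding short rows to `cols` cells.
fn render_row(row: &[String], cols: usize) -> String {
    let mut line = String::from("|");
    for i in 0..cols {
        let cell = row.get(i).map(String::as_str).unwrap_or("");
        line.push(' ');
        // Escape pipes and flatten newlines so each cell stays on one line.
        line.push_str(&cell.replace('|', "\\|").replace('\n', " "));
        line.push_str(" |");
    }
    line.push('\n');
    line
}

/// Hypothetical stand-in for `cells_to_markdown()`: first row is the header.
pub fn cells_to_markdown(cells: &[Vec<String>]) -> String {
    let cols = cells.iter().map(Vec::len).max().unwrap_or(0);
    if cols == 0 {
        return String::new();
    }
    let mut out = render_row(&cells[0], cols);
    // GFM requires a delimiter row directly under the header.
    out.push('|');
    out.push_str(&"---|".repeat(cols));
    out.push('\n');
    for row in &cells[1..] {
        out.push_str(&render_row(row, cols));
    }
    out
}
```

Padding ragged rows to the widest row keeps the column count constant on every line, so a row with missing cells still renders as part of the table.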
Adding a New Format
- Add MIME type to `EXT_TO_MIME` in `core/mime.rs`
- Create extractor implementing `DocumentExtractor` trait
- Set `supported_mime_types()` and `priority()` (default: 50)
- Register in `extractors/mod.rs` → `register_default_extractors()`
- Feature-gate if optional: `#[cfg(feature = "my-format")]`
- Apply security validators for user content
- Add tests with fixture files
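The steps above can be sketched end to end with a toy trait and registry. Only the method names `supported_mime_types()` and `priority()` and the default of 50 come from the list; the trait's exact shape, the `extract` signature, `MyFormatExtractor`, `Registry`, and the rule that a higher priority wins are all assumptions made for illustration.

```rust
/// Hypothetical shape of the trait; the real `DocumentExtractor` may differ.
pub trait DocumentExtractor {
    fn supported_mime_types(&self) -> &[&'static str];
    /// Default priority of 50, as stated in the steps above.
    fn priority(&self) -> i32 {
        50
    }
    fn extract(&self, content: &[u8]) -> Result<String, String>;
}

/// Hypothetical extractor for an imaginary UTF-8 text format.
pub struct MyFormatExtractor;

impl DocumentExtractor for MyFormatExtractor {
    fn supported_mime_types(&self) -> &[&'static str] {
        &["application/x-my-format"]
    }
    fn extract(&self, content: &[u8]) -> Result<String, String> {
        // Security validation of `content` would run here, before parsing.
        String::from_utf8(content.to_vec()).map_err(|e| e.to_string())
    }
}

/// Toy registry; assumes the highest priority wins for a given MIME type.
#[derive(Default)]
pub struct Registry {
    extractors: Vec<Box<dyn DocumentExtractor>>,
}

impl Registry {
    pub fn register(&mut self, extractor: Box<dyn DocumentExtractor>) {
        self.extractors.push(extractor);
    }

    pub fn find(&self, mime: &str) -> Option<&dyn DocumentExtractor> {
        let mut best: Option<&dyn DocumentExtractor> = None;
        for e in &self.extractors {
            let handles = e.supported_mime_types().iter().any(|m| *m == mime);
            if handles && best.map_or(true, |b| e.priority() > b.priority()) {
                best = Some(e.as_ref());
            }
        }
        best
    }
}
```

In a real crate, the `MyFormatExtractor` definition and its registration call would each sit behind `#[cfg(feature = "my-format")]` so the format stays optional.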