Format Support¶

Kreuzberg supports 75+ file formats across major categories, providing comprehensive document intelligence capabilities through native Rust extractors.

Overview¶

Kreuzberg v4 uses a high-performance Rust core with a single extraction path:

Native Rust Extractors: Fast, memory-efficient extractors for all supported formats

Note: LibreOffice was a required system dependency for legacy .doc/.ppt extraction in Kreuzberg < 4.3. Since 4.3, these formats are extracted natively without any external tools.

All formats support async/await and batch processing. Image formats and PDFs support optional OCR when configured.

Format Support Matrix¶

Office Documents¶

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
PDF	`.pdf`	`application/pdf`	Native Rust (pdfium-render)	Yes	Metadata extraction, image extraction, text layer detection
Excel	`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xlam`, `.xla`, `.ods`	Various Excel MIME types	Native Rust (calamine)	No	Multi-sheet support, formula preservation
PowerPoint	`.pptx`, `.pptm`, `.ppsx`	`application/vnd.openxmlformats-officedocument.presentationml.presentation`	Native Rust (roxmltree)	Yes (for embedded images)	Slide extraction, image OCR, table detection
Word (Modern)	`.docx`	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	Native Rust	No	Preserves formatting, extracts metadata
Word (Legacy)	`.doc`	`application/msword`	Native OLE/CFB	Yes	Direct binary parsing
PowerPoint (Legacy)	`.ppt`	`application/vnd.ms-powerpoint`	Native OLE/CFB	Yes	Direct binary parsing
OpenDocument Text	`.odt`	`application/vnd.oasis.opendocument.text`	Native Rust	No	Full OpenDocument support
OpenDocument Spreadsheet	`.ods`	`application/vnd.oasis.opendocument.spreadsheet`	Native Rust (calamine)	No	Multi-sheet support

Text & Markup¶

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
Plain Text	`.txt`	`text/plain`	Native Rust (streaming)	No	Line/word/character counting, memory-efficient streaming
Markdown	`.md`, `.markdown`	`text/markdown`, `text/x-markdown`	Native Rust (streaming)	No	Header extraction, link detection, code block detection
HTML	`.html`, `.htm`	`text/html`, `application/xhtml+xml`	Native Rust (html-to-markdown-rs)	No	Converts to Markdown, metadata extraction
XML	`.xml`	`application/xml`, `text/xml`	Native Rust (quick-xml streaming)	No	Element counting, unique element tracking
SVG	`.svg`	`image/svg+xml`	Native Rust (XML parser)	No	Treated as XML document
reStructuredText	`.rst`	`text/x-rst`	Native (rst-parser)	No	Full reST syntax support
Org Mode	`.org`	`text/x-org`	Native (org)	No	Emacs Org mode support
Rich Text Format	`.rtf`	`application/rtf`, `text/rtf`	Native (rtf-parser)	No	RTF 1.x support
Djot	`.djot`	`text/x-djot`	Native Rust (jotdown)	No	Smart punctuation, tables, code blocks, YAML frontmatter, footnotes, math blocks
MDX	`.mdx`	`text/mdx`	Native Rust (pulldown-cmark)	No	JSX-in-Markdown, component-based documents

Structured Data¶

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
JSON	`.json`	`application/json`, `text/json`	Native Rust (serde_json)	No	Field counting, nested structure extraction
YAML	`.yaml`, `.yml`	`application/x-yaml`, `text/yaml`, `text/x-yaml`	Native Rust (serde_yaml)	No	Multi-document support, field counting
TOML	`.toml`	`application/toml`, `text/toml`	Native Rust (toml crate)	No	Configuration file support
CSV	`.csv`	`text/csv`	Native Rust	No	Tabular data extraction
TSV	`.tsv`	`text/tab-separated-values`	Native Rust	No	Tab-separated data extraction

Email¶

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
EML	`.eml`	`message/rfc822`	Native Rust (mail-parser)	No	Header extraction, attachment listing, body text, UTF-16 support
MSG	`.msg`	`application/vnd.ms-outlook`	Native Rust (mail-parser)	No	Outlook message support, metadata extraction

Images¶

All image formats support OCR when the `ocr` parameter is set in `ExtractionConfig`.

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
PNG	`.png`	`image/png`	Native Rust (image-rs)	Yes	EXIF metadata extraction
JPEG	`.jpg`, `.jpeg`	`image/jpeg`, `image/jpg`	Native Rust (image-rs)	Yes	EXIF metadata extraction
WebP	`.webp`	`image/webp`	Native Rust (image-rs)	Yes	Modern format support
BMP	`.bmp`	`image/bmp`, `image/x-bmp`, `image/x-ms-bmp`	Native Rust (image-rs)	Yes	Uncompressed format
TIFF	`.tiff`, `.tif`	`image/tiff`, `image/x-tiff`	Native Rust (image-rs)	Yes	Multi-page support
GIF	`.gif`	`image/gif`	Native Rust (image-rs)	Yes	Animation frame extraction
JPEG 2000	`.jp2`, `.jpx`, `.jpm`, `.mj2`	`image/jp2`, `image/jpx`, `image/jpm`, `image/mj2`	Native Rust (hayro-jpeg2000)	Yes	OCR: Pure Rust, memory-safe decoder for JP2 container and J2K codestream formats, table detection, format-specific metadata
JBIG2	`.jbig2`, `.jb2`	`image/x-jbig2`	Native Rust (hayro-jbig2)	Yes	OCR: Pure Rust bi-level decoder, commonly found in scanned PDFs
PNM Family	`.pnm`, `.pbm`, `.pgm`, `.ppm`	`image/x-portable-anymap`, etc.	Native Rust (image-rs)	Yes	NetPBM formats

Archives¶

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
ZIP	`.zip`	`application/zip`, `application/x-zip-compressed`	Native Rust (zip crate)	No	File listing, text content extraction
TAR	`.tar`, `.tgz`	`application/x-tar`, `application/tar`, `application/x-gtar`, `application/x-ustar`	Native Rust (tar crate)	No	Unix archive support, gzip compression detection
7-Zip	`.7z`	`application/x-7z-compressed`	Native Rust (sevenz-rust)	No	High compression format support
Gzip	`.gz`	`application/gzip`, `application/x-gzip`	Native Rust (flate2)	No	Gzip decompression with text extraction

Academic & Publishing (Native)¶

Format	Extensions	MIME Type	Extraction Method	OCR Support	Special Features
LaTeX	`.tex`, `.latex`	`application/x-latex`, `text/x-tex`	Native (manual parser)	No	Full LaTeX document support
EPUB	`.epub`	`application/epub+zip`	Native (zip + roxmltree + html-to-markdown-rs)	No	E-book format, metadata extraction
BibTeX	`.bib`	`application/x-bibtex`, `application/x-biblatex`	Native (biblatex)	No	Bibliography database support
Typst	`.typst`, `.typ`	`application/x-typst`	Native (typst-syntax)	No	Modern typesetting format
Jupyter Notebook	`.ipynb`	`application/x-ipynb+json`	Native (JSON parsing)	No	Code cells, markdown cells, output extraction
FictionBook	`.fb2`	`application/x-fictionbook+xml`	Native (fb2)	No	XML-based e-book format
DocBook	`.docbook`, `.dbk`	`application/docbook+xml`	Native (roxmltree)	No	Technical documentation format
JATS	`.jats`	`application/x-jats+xml`	Native (roxmltree)	No	Journal article XML format
OPML	`.opml`	`application/x-opml+xml`	Native (roxmltree)	No	Outline format
RIS	`.ris`	`application/x-research-info-systems`	Native (biblib)	No	Structured citation parsing with title, authors, DOI, and abstract extraction
EndNote XML	`.enw`	`application/x-endnote+xml`	Native (biblib)	No	Structured citation parsing with title, authors, DOI, and keywords extraction
PubMed/MEDLINE	`.nbib`	`application/x-pubmed`	Native (biblib)	No	Structured citation parsing with author affiliations, MeSH terms, and abstract
CSL JSON	`.csl`	`application/csl+json`	Native (JSON parser)	No	Citation Style Language JSON

Markdown Variants (Native)¶

Format	MIME Type	Extraction Method	Special Features
CommonMark	`text/x-commonmark`	Native (pulldown-cmark)	Standard Markdown spec
GitHub Flavored Markdown	`text/x-gfm`	Native (pulldown-cmark)	GFM extensions (tables, strikethrough, etc.)
MultiMarkdown	`text/x-multimarkdown`	Native (pulldown-cmark)	MMD extensions
Markdown Extra	`text/x-markdown-extra`	Native (pulldown-cmark)	PHP Markdown Extra extensions
MDX	`text/mdx`	Native (pulldown-cmark)	JSX-in-Markdown format
Djot	`text/x-djot`	Native (jotdown)	Djot markup format with extended features

Other Formats¶

Format	MIME Type	Extraction Method	Special Features
Man Pages	`text/x-mdoc`	Native (mdoc-parser)	Unix manual page format
Troff	`text/troff`	Native (troff-parser)	Unix document format
POD	`text/x-pod`	Native (pod-parser)	Perl documentation format
DokuWiki	`text/x-dokuwiki`	Native (dokuwiki-parser)	Wiki markup format

Architecture Diagram¶

graph TD
    A[File Input] --> B{MIME Detection}
    B --> C{Extraction Method}

    C -->|Native Format| D[Rust Core Extractors]

    D --> G[PDF Extractor]
    D --> H[Excel Extractor]
    D --> I[Image Extractor]
    D --> J[XML/Text/HTML Extractors]
    D --> K[Email Extractor]
    D --> L[Archive Extractor]
    D --> M[OLE/CFB Parser for .doc/.ppt]

    G --> P{OCR Needed?}
    I --> P
    P -->|Yes| Q[Tesseract OCR]
    P -->|No| R[Text Output]
    Q --> R

    H --> R
    J --> R
    K --> R
    L --> R
    M --> R

    R --> S[Post-Processing Pipeline]
    S --> T[Final Result]
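
The flow in the diagram can be read as a dispatch table followed by an optional OCR step. The sketch below is illustrative only and is not Kreuzberg's implementation: `EXTRACTORS`, `OCR_CAPABLE`, `run_pipeline`, and the stub extractors are hypothetical names, and post-processing is reduced to trimming whitespace.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-ins for the extractor families in the diagram.
# Each stub returns the native text it finds ("" means none).
def pdf_extractor(data: bytes) -> str:
    return data.decode("utf-8", errors="ignore")

def image_extractor(data: bytes) -> str:
    return ""  # an image has no native text layer

def text_extractor(data: bytes) -> str:
    return data.decode("utf-8", errors="ignore")

# MIME type -> extractor, mirroring the "Extraction Method" branch.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "application/pdf": pdf_extractor,
    "image/png": image_extractor,
    "text/plain": text_extractor,
}

# Only these branches feed the "OCR Needed?" decision in the diagram.
OCR_CAPABLE = {"application/pdf", "image/png"}

def run_pipeline(data: bytes, mime_type: str,
                 ocr: Optional[Callable[[bytes], str]] = None) -> str:
    extractor = EXTRACTORS.get(mime_type)
    if extractor is None:
        raise ValueError(f"unsupported MIME type: {mime_type}")
    text = extractor(data)
    # In this sketch OCR runs only for OCR-capable formats, only when an
    # OCR callable is configured, and only when no native text was found.
    if mime_type in OCR_CAPABLE and ocr is not None and not text.strip():
        text = ocr(data)
    # Post-processing stage, reduced here to whitespace trimming.
    return text.strip()
```

Calling `run_pipeline(image_bytes, "image/png", ocr=my_ocr)` routes through the image stub and then the OCR callable, while a `text/plain` input never reaches the OCR step.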

Feature Flags¶

Kreuzberg uses Cargo feature flags to enable optional format support:

Feature Flag	Formats Enabled	Default
`pdf`	PDF documents	No
`excel`	Excel spreadsheets (all variants)	No
`office`	PowerPoint and Office formats	No
`ocr`	OCR for images and PDFs	No
`email`	EML, MSG email formats	No
`html`	HTML to Markdown conversion	No
`xml`	XML document parsing	No
`archives`	ZIP, TAR, 7z archive support	No
`markdown`	Markdown documents	No
`djot`	Djot documents	No
`mdx`	MDX documents	No

Note: No features are enabled by default (default = []). You must explicitly enable the features you need.

To enable specific features:

Cargo.toml

[dependencies]
# Enable only PDF and Excel format support
kreuzberg = { version = "4.0", features = ["pdf", "excel"] }

To build with every feature enabled, pass --all-features:

Terminal

# Build with all format extraction features enabled
cargo build --all-features

Or use the convenience bundles:

Cargo.toml

[dependencies]
# Pick ONE of the lines below; a crate can be declared only once per table.

# All format extraction features (no server components)
kreuzberg = { version = "4.0", features = ["full"] }

# Server features (API, MCP) with common format support
kreuzberg = { version = "4.0", features = ["server"] }

# CLI features with commonly used formats
kreuzberg = { version = "4.0", features = ["cli"] }

System Dependencies¶

OCR relies on an external system tool:

Tesseract OCR (Optional)¶

Required for OCR on images and PDFs:

Terminal

# Install Tesseract OCR on macOS
brew install tesseract

# Install Tesseract OCR on Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Install Tesseract OCR on RHEL/CentOS/Fedora
sudo dnf install tesseract

# Install Tesseract OCR on Windows (using Scoop)
scoop install tesseract

Docker Note: All system dependencies are pre-installed in official Kreuzberg Docker images.

Format Detection¶

Kreuzberg automatically detects file formats using:

File Extension Mapping: 75+ formats mapped to MIME types
mime_guess Crate: Fallback for unknown extensions
Manual Override: Explicit MIME type can be provided
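
The three mechanisms above can be sketched as a simple lookup chain. This is a minimal illustration, not Kreuzberg's code: `EXTENSION_TO_MIME` and `detect_mime_type` are hypothetical names, the table holds only a few entries copied from the matrix, Python's `mimetypes` module stands in for the mime_guess crate, and the assumption that an explicit override takes precedence over the other two steps is mine.

```python
import mimetypes
from pathlib import Path
from typing import Optional

# A small excerpt of an extension table; entries come from the format matrix.
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".md": "text/markdown",
    ".djot": "text/x-djot",
    ".eml": "message/rfc822",
}

def detect_mime_type(path: str, override: Optional[str] = None) -> Optional[str]:
    # 1. An explicitly supplied MIME type wins (assumed precedence).
    if override is not None:
        return override
    # 2. Look the extension up in the built-in table, case-insensitively.
    ext = Path(path).suffix.lower()
    if ext in EXTENSION_TO_MIME:
        return EXTENSION_TO_MIME[ext]
    # 3. Fall back to a generic guesser for extensions the table lacks.
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed
```

With this sketch, `detect_mime_type("document.dat", override="application/pdf")` returns the override unchanged, which is the situation the examples below address.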

Example with manual override:


format_detection.cs

using System.IO;
using Kreuzberg;

// Automatic format detection from file extension
var result = KreuzbergClient.ExtractFileSync("document.pdf");

// Manual MIME type override for files without extensions:
// read the raw bytes, then state the MIME type explicitly
var rawBytes = File.ReadAllBytes("document");
var result2 = KreuzbergClient.ExtractFileAsBytes(rawBytes, "application/pdf", null);

format_detection.go

import (
    "log"
    "os"

    "kreuzberg"
)

// Automatic format detection from file extension
result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
if err != nil {
    log.Fatal(err)
}

// Manual MIME type override for ambiguous files:
// read the bytes, then state the MIME type explicitly
config := &kreuzberg.ExtractionConfig{}
data, err := os.ReadFile("document.dat")
if err != nil {
    log.Fatal(err)
}
result2, err := kreuzberg.ExtractBytesSync(data, "application/pdf", config)
if err != nil {
    log.Fatal(err)
}

FormatDetection.java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import java.nio.file.Files;
import java.nio.file.Path;

// Automatic format detection from file extension
ExtractionResult result = Kreuzberg.extractFile("document.pdf");

// Detect the MIME type from the raw bytes, then pass it explicitly
byte[] rawBytes = Files.readAllBytes(Path.of("document.dat"));
String mimeType = Kreuzberg.detectMimeType(rawBytes);
ExtractionResult result2 = Kreuzberg.extractFileAsBytes(rawBytes, mimeType, null);

format_detection.py

from kreuzberg import extract_file_sync

# Automatic format detection from file extension
# (extract_file is the async variant and must be awaited)
result = extract_file_sync("document.pdf")

# Manual MIME type override for unknown extensions
result = extract_file_sync("document.dat", mime_type="application/pdf")

format_detection.rb

require 'kreuzberg'

# Automatic format detection from file extension
result = Kreuzberg.extract_file_sync('document.pdf')

# Manual MIME type override for files with ambiguous extensions
config = Kreuzberg::Config::Extraction.new
result = Kreuzberg.extract_file_sync('document.dat', mime_type: 'application/pdf', config: config)

format_detection.rs

use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();

    // Automatic format detection from file extension
    let result = extract_file("document.pdf", None, &config).await?;

    // Manual MIME type override for extensionless files
    let result = extract_file("document.dat", Some("application/pdf"), &config).await?;

    Ok(())
}

format_detection.ts

import { extractFile } from '@kreuzberg/node';

// Automatic format detection from file extension
const result = await extractFile('document.pdf');

// Manual MIME type override for files with no extension
const result2 = await extractFile('document.dat', { mimeType: 'application/pdf' });

OCR Support¶

OCR is available for:

All image formats (PNG, JPEG, WebP, BMP, TIFF, GIF, etc.)
PDF documents (with automatic fallback for scanned PDFs)
Embedded images in PowerPoint presentations

Configuration¶

ocr_configuration.py

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig

# Configure OCR with multi-language support and custom Tesseract settings
config = ExtractionConfig(
    ocr=OcrConfig(
        tesseract_config=TesseractConfig(
            lang="eng+deu",  # Multiple languages: English and German
            psm=3,           # Page segmentation mode: Auto
            oem=1            # OCR Engine mode: LSTM neural net
        )
    ),
    force_ocr=False  # Only use OCR when native text extraction is insufficient
)

# Synchronous call; use `await extract_file(...)` inside async code
result = extract_file_sync("scanned_document.pdf", config=config)

Automatic OCR Decision¶

For PDFs, Kreuzberg automatically decides whether OCR is needed by analyzing native text:

No OCR: Document has substantial, meaningful text (>64 non-whitespace chars, >32 chars/page average)
OCR Fallback: Document appears scanned (mostly punctuation, very low alphanumeric ratio)

Override with force_ocr=True to always use OCR regardless of native text quality.
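
The decision rule above can be approximated in a few lines. This is a rough sketch under stated assumptions, not the library's implementation: `needs_ocr` is a hypothetical helper, the 64-character and 32-characters-per-page thresholds come from the text, the per-page average is assumed to count non-whitespace characters, and the 0.5 alphanumeric-ratio cutoff is an illustrative guess.

```python
def needs_ocr(text: str, page_count: int, force_ocr: bool = False,
              min_alnum_ratio: float = 0.5) -> bool:
    """Return True when the native text looks too poor to trust."""
    if force_ocr:
        return True  # caller asked for OCR regardless of text quality

    non_ws = [ch for ch in text if not ch.isspace()]
    if not non_ws:
        return True  # no native text at all

    # Mostly punctuation or symbols suggests a scanned or garbled text layer.
    # The cutoff value is an assumption made for this sketch.
    alnum_ratio = sum(ch.isalnum() for ch in non_ws) / len(non_ws)
    if alnum_ratio < min_alnum_ratio:
        return True

    # "Substantial" text: more than 64 non-whitespace characters overall
    # and more than 32 per page on average.
    per_page = len(non_ws) / max(page_count, 1)
    substantial = len(non_ws) > 64 and per_page > 32
    return not substantial
```

Note how the per-page average matters: the same amount of text that passes for a one-page document can fall below the threshold when spread across ten pages, which is typical of scans with a thin, partial text layer.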

Performance Characteristics¶

Native Rust Extractors¶

PDF: Significantly faster than Python libraries due to native Rust implementation
Excel: Streaming parser, handles multi-GB files
XML: Streaming parser, memory-efficient for large documents
Text/Markdown: Streaming parser with lazy regex compilation
Archives: Efficient extraction without full decompression

OLE/CFB Extractors¶

Direct binary parsing of OLE2/CFB compound files
Used for legacy formats (.doc, .ppt)
No external tool dependencies, native Rust implementation

Batch Processing¶

All formats support concurrent batch processing:

batch_processing.py

from kreuzberg import batch_extract_file, ExtractionConfig

# Process multiple files concurrently for better throughput
paths = ["file1.pdf", "file2.docx", "file3.xlsx"]
config = ExtractionConfig(max_concurrent_extractions=8)

results = batch_extract_file(paths, config=config)
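
To make the effect of a concurrency cap such as `max_concurrent_extractions=8` concrete, here is a standard-library sketch of bounded concurrent processing. It does not call Kreuzberg and may differ from how the Rust core schedules work: `fake_extract` is a hypothetical stand-in for a real extraction call and `batch_extract` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def fake_extract(path: str) -> str:
    # Hypothetical stand-in for a real extraction call.
    return f"content of {path}"

def batch_extract(paths: List[str],
                  extract: Callable[[str], str] = fake_extract,
                  max_concurrent: int = 8) -> List[str]:
    # At most `max_concurrent` extractions run at the same time;
    # pool.map yields results in the same order as `paths`.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(extract, paths))
```

A lower cap trades throughput for a smaller peak memory footprint, which can matter when the batch contains large PDFs or images.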

Format Limitations¶

Known Limitations¶

Password-Protected PDFs: Requires crypto extra (pip install kreuzberg[crypto])
Legacy Excel (.xls): Formula evaluation not supported (values only)
Encrypted Office Documents: Password protection not supported
Multi-page TIFF: OCR processes first page only (configurable)
Animated GIF: Extracts first frame only

Unsupported Formats¶

Video formats (MP4, AVI, MOV, etc.)
Audio formats (MP3, WAV, FLAC, etc.)
CAD formats (DWG, DXF, etc.)
Database files (MDB, ACCDB, etc.)
Compressed Office formats without proper headers

Adding New Formats¶

Kreuzberg's plugin system allows adding custom format extractors: