Kreuzberg vs Unstructured: Feature Comparison¶

A comprehensive comparison of Kreuzberg and Unstructured.io for document intelligence workloads.

Executive Summary¶

Aspect	Kreuzberg	Unstructured
Core Language	Rust	Python
Performance	Rust-based native speed	Python-based
Formats Supported	56+	~30
Language Bindings	10 (Python, TS, Ruby, PHP, Go, Java, C#, Elixir, Rust, WASM)	Python + API
Deployment	Self-hosted (CLI, API, library)	Cloud API + self-hosted
Pricing	Free & open-source	Free tier + paid plans
Best For	High-performance, polyglot stacks, self-hosted	Rapid prototyping, managed service

Feature Matrix¶

Document Processing¶

Feature	Kreuzberg	Unstructured	Notes
PDF Extraction	✅ Full support	✅ Full support	Kreuzberg has native hierarchy detection
PDF Hierarchy (h1-h6)	✅ Font-size clustering	✅ ML-based layout	Kreuzberg uses statistical clustering
OCR (Tesseract)	✅ Built-in	✅ Built-in	Both support Tesseract
Table Detection	✅ Native	✅ ML-based	Unstructured has better complex table support
Image Extraction	✅ Full support	✅ Full support	Both extract images with metadata
Bounding Boxes	✅ Native (PDF)	✅ Available	Kreuzberg preserves from source
Multi-Page Support	✅ Per-page content	✅ Page numbers	Kreuzberg has richer per-page metadata

Output Formats¶

Format	Kreuzberg	Unstructured
Unified Text	✅ Default	❌
Element-Based	✅ Optional	✅ Default
Per-Page JSON	✅ Native	⚠️ Via elements
Markdown	✅ Native (HTML→MD)	❌
Structured Data	✅ JSON/YAML/TOML	✅ JSON

Element Types¶

Element Type	Kreuzberg	Unstructured	Notes
Title	✅	✅	Kreuzberg adds hierarchy level metadata
NarrativeText	✅	✅	Both detect paragraphs
ListItem	✅	✅	Kreuzberg: bullets, numbered, lettered, indented
Table	✅	✅	Kreuzberg: tab-separated text
Image	✅	✅	Both include dimensions, format
PageBreak	✅	✅	Between multi-page content
Header	⚠️ → Title	✅	Kreuzberg maps to title
Footer	⚠️ → NarrativeText	✅	Kreuzberg treats as narrative
Address	❌	✅	Unstructured-specific
EmailAddress	❌	✅	Unstructured-specific
Formula	❌	✅	Unstructured-specific

Supported File Formats¶

Kreuzberg (56+ formats): - Documents: PDF, DOCX, DOC, ODT, RTF, TXT, Markdown, RST, LaTeX, Typst - Presentations: PPTX, PPT, ODP, Keynote - Spreadsheets: XLSX, XLS, ODS, CSV - Web: HTML, XML, EPUB, FictionBook - Code: Jupyter Notebooks, Source code (via tree-sitter) - Data: JSON, YAML, TOML, BibTeX, OPML, OrgMode - Images: PNG, JPEG, TIFF, WebP (via OCR) - Email: EML, MSG

Unstructured (~30 formats): - Documents: PDF, DOCX, DOC, ODT, RTF, TXT - Presentations: PPTX, PPT - Spreadsheets: XLSX, XLS, CSV - Web: HTML, XML, EPUB - Data: JSON, Markdown - Images: PNG, JPEG, TIFF (via OCR) - Email: EML, MSG

Winner: Kreuzberg (broader format coverage)

Metadata Richness¶

Kreuzberg Metadata (format-specific discriminated unions):

{
  "title": "Document Title",
  "authors": ["Author 1", "Author 2"],
  "created_at": "2024-01-15T10:30:00Z",
  "modified_at": "2024-01-20T14:45:00Z",
  "language": "en",
  "format": {
    "format_type": "pdf",
    "page_count": 10,
    "version": "1.7",
    "is_encrypted": false,
    "permissions": {
      "print": true,
      "modify": false
    }
  }
}

Unstructured Metadata:

{
  "filename": "document.pdf",
  "page_number": 1,
  "filetype": "application/pdf"
}

Winner: Kreuzberg (richer, format-specific metadata)

Chunking & Embeddings¶

Feature	Kreuzberg	Unstructured	Notes
Text Chunking	✅ Basic (fixed-size)	✅ Advanced (by_title)	Unstructured has smarter strategies
Chunk Overlap	✅ Configurable	✅ Configurable	Both support overlap
Embedding Generation	✅ Built-in (ONNX)	⚠️ External API	Kreuzberg: local ONNX models
Embedding Models	✅ fastembed presets	✅ OpenAI, Cohere, etc.	Kreuzberg: offline, Unstructured: API-based
Page Range Tracking	✅ Native	✅ Via metadata	Kreuzberg tracks first_page/last_page

Language Bindings & Integrations¶

Kreuzberg: - ✅ Python (PyO3) - ✅ TypeScript (NAPI-RS) - ✅ Ruby (Magnus) - ✅ PHP (ext-php-rs) - ✅ Go (cgo FFI) - ✅ Java (JNI FFI) - ✅ C# (P/Invoke FFI) - ✅ Elixir (Rustler NIFs) - ✅ Rust (native) - ✅ WASM (browser/Node/Deno/Workers)

Unstructured: - ✅ Python (native) - ✅ REST API (language-agnostic) - ⚠️ Other languages via API only

Winner: Kreuzberg (native bindings for 10 languages)

Deployment Options¶

Kreuzberg: - ✅ CLI (single binary) - ✅ Self-hosted API (Docker, native) - ✅ Library (embedded in applications) - ✅ WASM (browser-based processing) - ❌ Managed cloud service

Unstructured: - ✅ Managed API (cloud-hosted) - ✅ Self-hosted API (Docker) - ✅ Python library - ❌ CLI - ❌ Browser-based

Cost Analysis¶

Kreuzberg: - License: Apache 2.0 (free, open-source) - Infrastructure: Self-hosted only (compute costs) - Total Cost: Infrastructure + maintenance

Unstructured: - License: Apache 2.0 (free, open-source) - Managed API: Free tier (100 pages/month) + paid plans ($0.01-0.10/page) - Self-hosted: Infrastructure costs only - Total Cost: API fees OR infrastructure + maintenance

Security & Compliance¶

Feature	Kreuzberg	Unstructured
Data Privacy	✅ 100% on-prem	⚠️ Cloud API or on-prem
GDPR Compliance	✅ Self-managed	⚠️ Varies (cloud API)
SOC 2	N/A (self-hosted)	✅ (managed API)
Air-Gapped	✅ Fully supported	⚠️ Self-hosted only
Audit Logs	⚠️ Basic (via API logs)	✅ Advanced (managed)

Use Case Recommendations¶

Choose Kreuzberg If:¶

✅ You need maximum performance (Rust-based native speed)
✅ You're building a polyglot stack (Python, TS, Go, etc.)
✅ You require strict data privacy (on-prem processing)
✅ You need to process 56+ file formats
✅ You want zero API fees (fully self-hosted)
✅ You need native bindings for your language
✅ You're processing large document volumes (high throughput)
✅ You need offline embeddings (no external API calls)

Choose Unstructured If:¶

✅ You need ML-based layout detection (GPU-accelerated)
✅ You want a managed cloud service (zero ops)
✅ You need advanced chunking strategies (by_title, semantic)
✅ You're prototyping and want fast setup
✅ You need more granular element types (Address, Formula, etc.)
✅ You're already using OpenAI/Cohere APIs for embeddings
✅ You have low document volume (free tier sufficient)

Migration Path¶

From Unstructured to Kreuzberg: 1. Deploy Kreuzberg API (Docker or native) 2. Update endpoint URLs in your code 3. Add output_format=element_based for Unstructured-compatible output 4. Test with sample documents 5. Optimize with Kreuzberg-specific features (hierarchy, per-page, embeddings)

From Kreuzberg to Unstructured: 1. Sign up for Unstructured API key 2. Update endpoint URLs 3. Remove output_format parameter (element-based is default) 4. Adjust for different metadata structure

Roadmap & Future Features¶

Kreuzberg Planned Features:¶

⏳ Enhanced chunking strategies (by_title, semantic)
⏳ Layout detection models (optional GPU acceleration)
⏳ More element types (Header, Footer, Formula)
⏳ Cloud-hosted option (for non-self-hosters)

Unstructured Strengths:¶

✅ Mature ML models (layout, tables)
✅ Large community & integrations
✅ Managed service with SLA

Verdict¶

Kreuzberg excels at: - Performance (Rust native) - Polyglot support (10 language bindings) - Format coverage (56+ formats) - Self-hosted deployments - Cost efficiency (zero API fees)

Unstructured excels at: - ML-powered layout analysis - Managed cloud service - Advanced chunking strategies - Larger ecosystem

Recommendation: - High-volume, polyglot, self-hosted → Kreuzberg - Rapid prototyping, managed service → Unstructured - Hybrid approach: Use both (Kreuzberg for bulk processing, Unstructured for complex layouts)