Skip to content

Migrating from Unstructured to Kreuzberg

This guide helps you migrate from Unstructured.io to Kreuzberg for document intelligence workloads.

Quick Start

Unstructured API:

curl -X POST "https://api.unstructured.io/general/v0/general" \
  -F 'files=@document.pdf'

Kreuzberg API:

curl -X POST "http://localhost:8080/extract" \
  -F 'files=@document.pdf' \
  -F 'output_format=element_based'

Output Format Comparison

Unified Output (Default)

Kreuzberg's default output provides richer metadata than Unstructured:

Kreuzberg Unified:

{
  "content": "Full document text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["Author Name"],
    "created_at": "2024-01-15T10:30:00Z",
    "format": {
      "format_type": "pdf",
      "page_count": 10,
      "version": "1.7"
    }
  },
  "tables": [...],
  "images": [...],
  "pages": [...]
}

Element-Based Output

Kreuzberg (when output_format=element_based):

{
  "elements": [
    {
      "element_id": "elem-a3f2b1c4",
      "element_type": "title",
      "text": "Introduction",
      "metadata": {
        "page_number": 1,
        "filename": "Document Title",
        "coordinates": {
          "x0": 72.0,
          "y0": 100.0,
          "x1": 540.0,
          "y1": 130.0
        },
        "element_index": 0,
        "additional": {
          "level": "h1",
          "font_size": "24.0"
        }
      }
    },
    {
      "element_type": "narrative_text",
      "text": "This is a paragraph...",
      "metadata": {
        "page_number": 1
      }
    }
  ]
}

Unstructured:

[
  {
    "type": "Title",
    "text": "Introduction",
    "metadata": {
      "page_number": 1,
      "filename": "document.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "text": "This is a paragraph...",
    "metadata": {
      "page_number": 1
    }
  }
]

API Endpoint Mapping

Unstructured Kreuzberg Notes
POST /general/v0/general POST /extract Single/batch extraction
N/A POST /embed Built-in embeddings (ONNX models)
N/A GET /health Health check
N/A GET /cache/stats Cache statistics

Element Type Mapping

Unstructured Kreuzberg Notes
Title title PDF hierarchy (h1-h6) detection
NarrativeText narrative_text Paragraphs split on double newlines
ListItem list_item Bullets, numbered, lettered
Table table Tab-separated text representation
Image image Format, dimensions in metadata
PageBreak page_break Between pages in multi-page docs
Header header Page header text
Footer footer Page footer text
N/A heading Section headings (beyond title)
N/A code_block Code snippets
N/A block_quote Quoted text blocks

Code Examples

Python

Unstructured:

from unstructured.partition.auto import partition

elements = partition(filename="document.pdf")
for element in elements:
    print(f"{element.category}: {element.text}")

Kreuzberg:

from kreuzberg import extract_bytes

# Option 1: Element-based output
config = {"output_format": "element_based"}
result = extract_bytes(pdf_bytes, "application/pdf", config)

for element in result.elements:
    print(f"{element.element_type}: {element.text}")
    if element.metadata.page_number:
        print(f"  Page: {element.metadata.page_number}")

# Option 2: Unified output (default, richer metadata)
result = extract_bytes(pdf_bytes, "application/pdf")
print(result.content)  # Full text
print(result.metadata.title)  # Document metadata
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:100]}")

TypeScript

Unstructured (via API):

const formData = new FormData();
formData.append('files', fileBlob);

const response = await fetch('https://api.unstructured.io/general/v0/general', {
  method: 'POST',
  body: formData
});
const elements = await response.json();

Kreuzberg:

import { extractBytes } from 'kreuzberg';

// Option 1: Element-based output
const result = await extractBytes(pdfBuffer, 'application/pdf', {
  output_format: 'element_based'
});

for (const element of result.elements) {
  console.log(`${element.element_type}: ${element.text}`);
}

// Option 2: Unified output with pages
const result = await extractBytes(pdfBuffer, 'application/pdf', {
  pages: { extract_pages: true }
});

for (const page of result.pages) {
  console.log(`Page ${page.page_number}:`, page.content);
}

cURL

Unstructured:

curl -X POST "https://api.unstructured.io/general/v0/general" \
  -H "unstructured-api-key: $API_KEY" \
  -F 'files=@document.pdf' \
  -F 'strategy=hi_res'

Kreuzberg:

# Element-based output
curl -X POST "http://localhost:8080/extract" \
  -F 'files=@document.pdf' \
  -F 'output_format=element_based'

# With configuration JSON
curl -X POST "http://localhost:8080/extract" \
  -F 'files=@document.pdf' \
  -F 'config={"output_format":"element_based","pages":{"extract_pages":true}}'

Feature Comparison

What Kreuzberg Adds

  1. Richer Metadata: Format-specific discriminated unions (PDF, Excel, Email, etc.)
  2. Native Per-Page: PageContent with byte offsets, hierarchy, tables, images per page
  3. 56+ Formats: vs Unstructured's ~30 formats
  4. Performance: Rust-based native implementation (vs Python-based)
  5. 10 Language Bindings: Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, Rust, WASM
  6. Built-in Embeddings: ONNX models via /embed endpoint (no external API)
  7. Smart Hierarchy: PDF font-size clustering for h1-h6 detection
  8. Bounding Boxes: Preserved from PDF source in element coordinates

What Unstructured Has

  1. Layout Detection Models: ML-based layout analysis (GPU-accelerated)
  2. Cloud API: Hosted service (Kreuzberg requires self-hosting)
  3. More Element Types: More granular element classification
  4. Mature Ecosystem: Larger community, more integrations

Configuration Mapping

Unstructured Parameter Kreuzberg Config Notes
strategy=hi_res pdf_options.hierarchy.enabled=true PDF hierarchy extraction
coordinates=true Always included when available Bounding boxes in element metadata
languages=["eng"] ocr.language="eng" OCR language
extract_image_block_types=["image"] images.extract_images=true Image extraction
chunking_strategy="by_title" chunking.max_chars=1000 Text chunking (basic)
embedding_model="..." chunking.embedding.model="..." Embedding generation

Migration Checklist

  • Update API endpoint URLs (Unstructured → Kreuzberg)
  • Add output_format=element_based if using element-based workflow
  • Update element type references (Titletitle, camelCase → snake_case)
  • Update metadata field references (Kreuzberg has richer metadata structure)
  • Test with sample documents to verify output equivalence
  • Update error handling (Kreuzberg uses HTTP 422 for validation errors)
  • Configure caching if needed (Kreuzberg has built-in file-based cache)
  • Set up embeddings if using RAG pipeline (Kreuzberg has built-in ONNX support)

Advanced: Hybrid Approach

You can use both formats simultaneously:

from kreuzberg import extract_bytes

result = extract_bytes(pdf_bytes, "application/pdf", {
    "output_format": "element_based",  # Get elements
    "pages": {"extract_pages": true}   # Also get per-page content
})

# Element-based processing
for element in result.elements:
    if element.element_type == "title":
        index_heading(element.text)

# Page-based processing
for page in result.pages:
    if page.hierarchy:
        for block in page.hierarchy.blocks:
            if block.level == "h1":
                process_section(block.text)

Performance Tips

  1. Enable Caching: use_cache: true (default) for repeated extractions
  2. Disable OCR: If documents are searchable PDFs, set force_ocr: false
  3. Limit Page Extraction: Only enable pages if you need per-page content
  4. Batch Processing: Send multiple files in single request (up to 10MB total)
  5. Use Embeddings Wisely: Enable only for chunked content destined for vector DB

Getting Help

  • Documentation: https://github.com/kreuzberg-dev/kreuzberg
  • Issues: https://github.com/kreuzberg-dev/kreuzberg/issues
  • Examples: See examples/ directory for full workflow samples
  • API Reference: See docs/api/ for endpoint documentation

Next Steps

After migration: 1. Review the Kreuzberg vs Unstructured Comparison 2. Explore Kreuzberg-specific features (hierarchy, per-page metadata, embeddings) 3. Optimize your pipeline with native Rust performance