Migrating from Unstructured to Kreuzberg¶

This guide helps you migrate from Unstructured.io to Kreuzberg for document intelligence workloads.

Quick Start¶

Unstructured API:

curl -X POST "https://api.unstructured.io/general/v0/general" \
  -F 'files=@document.pdf'

Kreuzberg API:

curl -X POST "http://localhost:8080/extract" \
  -F 'files=@document.pdf' \
  -F 'output_format=element_based'

Output Format Comparison¶

Unified Output (Default)¶

Kreuzberg's default output provides richer metadata than Unstructured:

Kreuzberg Unified:

{
  "content": "Full document text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["Author Name"],
    "created_at": "2024-01-15T10:30:00Z",
    "format": {
      "format_type": "pdf",
      "page_count": 10,
      "version": "1.7"
    }
  },
  "tables": [...],
  "images": [...],
  "pages": [...]
}
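
As a rough illustration of consuming the unified output, the sketch below pulls a few fields out of a payload shaped like the sample above. The helper name `summarize_unified` is hypothetical and not part of Kreuzberg; the elided `tables`, `images`, and `pages` lists are left empty because the guide does not show their contents.

```python
import json

# Sample payload shaped like the unified output above. The list fields are
# empty here because their contents are elided in the guide.
SAMPLE = json.loads("""
{
  "content": "Full document text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["Author Name"],
    "created_at": "2024-01-15T10:30:00Z",
    "format": {"format_type": "pdf", "page_count": 10, "version": "1.7"}
  },
  "tables": [],
  "images": [],
  "pages": []
}
""")


def summarize_unified(result: dict) -> dict:
    """Collect a few commonly used fields, tolerating missing keys."""
    metadata = result.get("metadata", {})
    fmt = metadata.get("format", {})
    return {
        "title": metadata.get("title"),
        "authors": metadata.get("authors", []),
        "format_type": fmt.get("format_type"),
        "page_count": fmt.get("page_count"),
        "content_chars": len(result.get("content", "")),
    }


summary = summarize_unified(SAMPLE)
print(summary["title"], summary["page_count"])
```

Using `.get()` with defaults keeps the helper usable on documents where format-specific metadata is absent.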

Element-Based Output¶

Kreuzberg (when output_format=element_based):

{
  "elements": [
    {
      "element_id": "elem-a3f2b1c4",
      "element_type": "title",
      "text": "Introduction",
      "metadata": {
        "page_number": 1,
        "filename": "Document Title",
        "coordinates": {
          "x0": 72.0,
          "y0": 100.0,
          "x1": 540.0,
          "y1": 130.0
        },
        "element_index": 0,
        "additional": {
          "level": "h1",
          "font_size": "24.0"
        }
      }
    },
    {
      "element_type": "narrative_text",
      "text": "This is a paragraph...",
      "metadata": {
        "page_number": 1
      }
    }
  ]
}

 Unstructured: 
[
  {
    "type": "Title",
    "text": "Introduction",
    "metadata": {
      "page_number": 1,
      "filename": "document.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "text": "This is a paragraph...",
    "metadata": {
      "page_number": 1
    }
  }
]
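
If downstream code still expects the Unstructured shape, a thin adapter can bridge the two formats during migration. The following is a minimal sketch, assuming the Kreuzberg response is a dict with an `elements` list as in the sample above; `to_unstructured_shape` is a hypothetical helper, not a Kreuzberg function, and it keeps only the metadata keys that appear in the Unstructured sample.

```python
def to_unstructured_shape(payload: dict) -> list:
    """Convert a Kreuzberg element-based payload into an Unstructured-style list."""
    converted = []
    for element in payload.get("elements", []):
        metadata = element.get("metadata", {})
        # snake_case -> PascalCase, e.g. "narrative_text" -> "NarrativeText"
        type_name = "".join(
            part.capitalize() for part in element["element_type"].split("_")
        )
        converted.append({
            "type": type_name,
            "text": element.get("text", ""),
            # Keep only the keys shown in the Unstructured sample output.
            "metadata": {
                key: metadata[key]
                for key in ("page_number", "filename")
                if key in metadata
            },
        })
    return converted


kreuzberg_payload = {
    "elements": [
        {
            "element_id": "elem-a3f2b1c4",
            "element_type": "title",
            "text": "Introduction",
            "metadata": {"page_number": 1, "filename": "Document Title"},
        },
        {
            "element_type": "narrative_text",
            "text": "This is a paragraph...",
            "metadata": {"page_number": 1},
        },
    ]
}

legacy_elements = to_unstructured_shape(kreuzberg_payload)
for item in legacy_elements:
    print(item["type"], "-", item["text"])
```

Kreuzberg-only types such as `code_block` would come out as `CodeBlock`, which this guide lists as having no Unstructured counterpart, so downstream code may need an explicit fallback for them.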
 API Endpoint Mapping¶
Unstructured              Kreuzberg         Notes
------------------------  ----------------  ---------------------------------
POST /general/v0/general  POST /extract     Single/batch extraction
N/A                       POST /embed       Built-in embeddings (ONNX models)
N/A                       GET /health       Health check
N/A                       GET /cache/stats  Cache statistics
 
 
 Element Type Mapping¶
Unstructured   Kreuzberg       Notes
-------------  --------------  -----------------------------------
Title          title           PDF hierarchy (h1-h6) detection
NarrativeText  narrative_text  Paragraphs split on double newlines
ListItem       list_item       Bullets, numbered, lettered
Table          table           Tab-separated text representation
Image          image           Format, dimensions in metadata
PageBreak      page_break      Between pages in multi-page docs
Header         header          Page header text
Footer         footer          Page footer text
N/A            heading         Section headings (beyond title)
N/A            code_block      Code snippets
N/A            block_quote     Quoted text blocks
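
Because every mapped pair in the table differs only in casing, existing type filters can be migrated mechanically. The sketch below is one way to do that; `pascal_to_snake` and the example filter set are hypothetical illustrations, not part of either library.

```python
import re


def pascal_to_snake(name: str) -> str:
    """Convert an Unstructured type name to the Kreuzberg spelling."""
    # Insert "_" before every capital letter that is not at the start,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# Types the table lists for Kreuzberg with no Unstructured equivalent.
KREUZBERG_ONLY = {"heading", "code_block", "block_quote"}

# Example: a filter set from an existing Unstructured pipeline.
old_filter = {"Title", "NarrativeText", "ListItem"}
new_filter = {pascal_to_snake(name) for name in old_filter}

print(sorted(new_filter))
```

After converting, it may be worth deciding explicitly whether the types in `KREUZBERG_ONLY` should be added to the filter, since an Unstructured-era filter will never include them.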
 
 
 Code Examples¶
 Python¶
 Unstructured: 
from unstructured.partition.auto import partition

elements = partition(filename="document.pdf")
for element in elements:
    print(f"{element.category}: {element.text}")
 Kreuzberg: 
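from kreuzberg import extract_bytes

# Read the document into memory; extract_bytes takes raw bytes plus a MIME type
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

# Option 1: Element-based output
config = {"output_format": "element_based"}
result = extract_bytes(pdf_bytes, "application/pdf", config)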

for element in result.elements:
    print(f"{element.element_type}: {element.text}")
    if element.metadata.page_number:
        print(f"  Page: {element.metadata.page_number}")

# Option 2: Unified output (default, richer metadata)
result = extract_bytes(pdf_bytes, "application/pdf")
print(result.content)  # Full text
print(result.metadata.title)  # Document metadata
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:100]}")
 TypeScript¶
 Unstructured (via API): 
const formData = new FormData();
formData.append('files', fileBlob);

const response = await fetch('https://api.unstructured.io/general/v0/general', {
  method: 'POST',
  body: formData
});
const elements = await response.json();
 Kreuzberg: 
import { extractBytes } from 'kreuzberg';

// Option 1: Element-based output
const result = await extractBytes(pdfBuffer, 'application/pdf', {
  output_format: 'element_based'
});

for (const element of result.elements) {
  console.log(`${element.element_type}: ${element.text}`);
}

// Option 2: Unified output with pages
// (named pagedResult because `result` is already declared above)
const pagedResult = await extractBytes(pdfBuffer, 'application/pdf', {
  pages: { extract_pages: true }
});

for (const page of pagedResult.pages) {
  console.log(`Page ${page.page_number}:`, page.content);
}
 cURL¶
 Unstructured: 
curl -X POST "https://api.unstructured.io/general/v0/general" \
  -H "unstructured-api-key: $API_KEY" \
  -F 'files=@document.pdf' \
  -F 'strategy=hi_res'
 Kreuzberg: 
# Element-based output
curl -X POST "http://localhost:8080/extract" \
  -F 'files=@document.pdf' \
  -F 'output_format=element_based'

# With configuration JSON
curl -X POST "http://localhost:8080/extract" \
  -F 'files=@document.pdf' \
  -F 'config={"output_format":"element_based","pages":{"extract_pages":true}}'
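
For pipelines that cannot shell out to cURL, the same multipart request can be assembled with the Python standard library. This is a minimal sketch, assuming the `files` and `config` form fields and the `/extract` path shown in the cURL examples; `build_extract_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import uuid
from urllib.request import Request


def build_extract_request(base_url: str, filename: str, file_bytes: bytes,
                          config: dict, mime_type: str = "application/pdf") -> Request:
    """Build a multipart/form-data POST mirroring the cURL call above."""
    boundary = uuid.uuid4().hex
    # File part; assumes the filename contains no double quotes.
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    ).encode("utf-8") + file_bytes + b"\r\n"
    # Config part: the JSON document passed as a plain form field.
    config_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="config"\r\n\r\n'
        f"{json.dumps(config)}\r\n"
    ).encode("utf-8")
    closing = f"--{boundary}--\r\n".encode("utf-8")
    return Request(
        f"{base_url}/extract",
        data=file_part + config_part + closing,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


req = build_extract_request(
    "http://localhost:8080",
    "document.pdf",
    b"%PDF-1.7",  # placeholder bytes; read the real file in practice
    {"output_format": "element_based", "pages": {"extract_pages": True}},
)
print(req.get_method(), req.full_url, len(req.data), "bytes")
```

Passing `req` to `urllib.request.urlopen` would send it to a running server. Building the config with `json.dumps` avoids the shell-quoting mistakes that are easy to make in the inline `-F 'config=...'` form.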
 Feature Comparison¶
 What Kreuzberg Adds¶
  Richer Metadata: Format-specific discriminated unions (PDF, Excel, Email, etc.)
 Native Per-Page: PageContent with byte offsets, hierarchy, tables, images per page
 56+ Formats: vs Unstructured's ~30 formats
 Performance: Rust-based native implementation (vs Python-based)
 10 Language Bindings: Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, Rust, WASM
 Built-in Embeddings: ONNX models via /embed endpoint (no external API)
 Smart Hierarchy: PDF font-size clustering for h1-h6 detection
 Bounding Boxes: Preserved from PDF source in element coordinates
 
 What Unstructured Has¶
  Layout Detection Models: ML-based layout analysis (GPU-accelerated)
 Cloud API: Hosted service (Kreuzberg requires self-hosting)
 More Element Types: More granular element classification
 Mature Ecosystem: Larger community, more integrations
 
 Configuration Mapping¶
Unstructured Parameter               Kreuzberg Config                    Notes
-----------------------------------  ----------------------------------  ----------------------------------
strategy=hi_res                      pdf_options.hierarchy.enabled=true  PDF hierarchy extraction
coordinates=true                     Always included when available      Bounding boxes in element metadata
languages=["eng"]                    ocr.language="eng"                  OCR language
extract_image_block_types=["image"]  images.extract_images=true          Image extraction
chunking_strategy="by_title"         chunking.max_chars=1000             Text chunking (basic)
embedding_model="..."                chunking.embedding.model="..."      Embedding generation
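
The table can be applied programmatically when many call sites pass Unstructured parameters. The sketch below translates a parameter dict into a nested Kreuzberg config using only the keys listed above; `translate_params` is a hypothetical helper, and taking just the first entry of `languages` is an assumption made for illustration.

```python
def translate_params(params: dict) -> dict:
    """Map Unstructured request parameters to a Kreuzberg config dict."""
    config = {}
    if params.get("strategy") == "hi_res":
        config["pdf_options"] = {"hierarchy": {"enabled": True}}
    languages = params.get("languages")
    if languages:
        # Assumption: use the first language only; the table shows a single code.
        config["ocr"] = {"language": languages[0]}
    if params.get("extract_image_block_types"):
        config["images"] = {"extract_images": True}
    if params.get("chunking_strategy"):
        # The table maps by_title to basic size-based chunking.
        config.setdefault("chunking", {})["max_chars"] = 1000
    if params.get("embedding_model"):
        config.setdefault("chunking", {})["embedding"] = {
            "model": params["embedding_model"]
        }
    # coordinates=true needs no mapping: bounding boxes are always included
    # when available.
    return config


legacy_params = {
    "strategy": "hi_res",
    "coordinates": True,
    "languages": ["eng"],
    "chunking_strategy": "by_title",
}
print(translate_params(legacy_params))
```

Note that `by_title` chunking has no exact equivalent here, so chunk boundaries may differ from what the Unstructured pipeline produced.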
 
 
 Migration Checklist¶
   Update API endpoint URLs (Unstructured → Kreuzberg)
  Add output_format=element_based if using element-based workflow
  Update element type references (Title → title, PascalCase → snake_case)
  Update metadata field references (Kreuzberg has richer metadata structure)
  Test with sample documents to verify output equivalence
  Update error handling (Kreuzberg uses HTTP 422 for validation errors)
  Configure caching if needed (Kreuzberg has built-in file-based cache)
  Set up embeddings if using RAG pipeline (Kreuzberg has built-in ONNX support)
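
For the error-handling item, one possible approach is to classify response status codes before deciding whether to retry. This is a sketch only: the 422 case comes from the checklist above, the remaining ranges follow general HTTP conventions rather than anything Kreuzberg-specific, and both function names are hypothetical.

```python
def classify_status(status: int) -> str:
    """Bucket an HTTP status code into a coarse category."""
    if 200 <= status < 300:
        return "ok"
    if status == 422:
        # Validation error: the request itself was rejected.
        return "validation_error"
    if 400 <= status < 500:
        return "client_error"
    if 500 <= status < 600:
        return "server_error"
    return "unexpected"


def should_retry(status: int) -> bool:
    """Retry only server-side failures; resending invalid input will not help."""
    return classify_status(status) == "server_error"


for code in (200, 422, 503):
    print(code, classify_status(code), should_retry(code))
```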
 
 Advanced: Hybrid Approach¶
 You can use both formats simultaneously:
from kreuzberg import extract_bytes

result = extract_bytes(pdf_bytes, "application/pdf", {
    "output_format": "element_based",  # Get elements
    "pages": {"extract_pages": True}   # Also get per-page content
})

# Element-based processing
for element in result.elements:
    if element.element_type == "title":
        index_heading(element.text)

# Page-based processing
for page in result.pages:
    if page.hierarchy:
        for block in page.hierarchy.blocks:
            if block.level == "h1":
                process_section(block.text)
 Performance Tips¶
  Enable Caching: use_cache: true (default) for repeated extractions
 Disable OCR: If documents are searchable PDFs, set force_ocr: false
 Limit Page Extraction: Only enable pages if you need per-page content
 Batch Processing: Send multiple files in a single request (up to 10MB total)
 Use Embeddings Wisely: Enable only for chunked content destined for vector DB
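
To stay under the batch size limit mentioned above, files can be grouped greedily before upload. The sketch below assumes the 10MB limit means 10 * 1024 * 1024 bytes and that it applies to the sum of file sizes; `batch_by_size` is a hypothetical helper.

```python
MAX_BATCH_BYTES = 10 * 1024 * 1024  # assumption: 10MB interpreted as 10 MiB


def batch_by_size(files, limit=MAX_BATCH_BYTES):
    """Group (name, size_in_bytes) pairs into batches whose total stays within limit."""
    batches, current, current_size = [], [], 0
    for name, size in files:
        if size > limit:
            raise ValueError(f"{name} exceeds the batch limit on its own")
        # Start a new batch when adding this file would exceed the limit.
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


MB = 1024 * 1024
batches = batch_by_size([("a.pdf", 6 * MB), ("b.pdf", 5 * MB), ("c.pdf", 4 * MB)])
print(batches)
```

In a real pipeline the sizes could come from `os.path.getsize`. Multipart overhead is not counted here, so leaving some headroom below the limit may be prudent.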
 
 Getting Help¶
  Documentation: https://github.com/kreuzberg-dev/kreuzberg
 Issues: https://github.com/kreuzberg-dev/kreuzberg/issues
 Examples: See examples/ directory for full workflow samples
 API Reference: See docs/api/ for endpoint documentation
 
 Next Steps¶
 After migration:
 1. Review the Kreuzberg vs Unstructured Comparison
 2. Explore Kreuzberg-specific features (hierarchy, per-page metadata, embeddings)
 3. Optimize your pipeline with native Rust performance
      2026-01-18      2026-01-18
