Migrating from Unstructured to Kreuzberg¶
This guide helps you migrate from Unstructured.io to Kreuzberg for document intelligence workloads.
Quick Start¶
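Unstructured API:
curl -X POST "https://api.unstructured.io/general/v0/general" \
-H "unstructured-api-key: $API_KEY" \
-F 'files=@document.pdf'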
Kreuzberg API:
curl -X POST "http://localhost:8080/extract" \
-F 'files=@document.pdf' \
-F 'output_format=element_based'
Output Format Comparison¶
Unified Output (Default)¶
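Kreuzberg's default (unified) output returns the full text together with document-level metadata, tables, images, and pages, which is richer than Unstructured's element list: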
Kreuzberg Unified:
{
  "content": "Full document text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["Author Name"],
    "created_at": "2024-01-15T10:30:00Z",
    "format": {
      "format_type": "pdf",
      "page_count": 10,
      "version": "1.7"
    }
  },
  "tables": [...],
  "images": [...],
  "pages": [...]
}
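A minimal sketch of reading the unified response shown above, treating it as a plain dictionary parsed from JSON. The helper name `summarize_unified` is hypothetical, and every field is read defensively on the assumption that not all documents carry a title, authors, or format details.

```python
import json


def summarize_unified(doc: dict) -> dict:
    """Pull a few commonly used fields out of a unified-output dictionary."""
    metadata = doc.get("metadata") or {}
    fmt = metadata.get("format") or {}
    return {
        "title": metadata.get("title"),
        "authors": metadata.get("authors") or [],
        "format_type": fmt.get("format_type"),
        "page_count": fmt.get("page_count"),
        "characters": len(doc.get("content") or ""),
    }


# Sample payload trimmed from the unified output shown above.
sample = json.loads("""
{
  "content": "Full document text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["Author Name"],
    "format": {"format_type": "pdf", "page_count": 10, "version": "1.7"}
  }
}
""")

summary = summarize_unified(sample)
print(summary)
```

Reading through `.get()` keeps the helper usable for formats other than PDF, where the `format` block may carry different keys.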
Element-Based Output¶
Kreuzberg (when output_format=element_based):
{
  "elements": [
    {
      "element_id": "elem-a3f2b1c4",
      "element_type": "title",
      "text": "Introduction",
      "metadata": {
        "page_number": 1,
        "filename": "Document Title",
        "coordinates": {
          "x0": 72.0,
          "y0": 100.0,
          "x1": 540.0,
          "y1": 130.0
        },
        "element_index": 0,
        "additional": {
          "level": "h1",
          "font_size": "24.0"
        }
      }
    },
    {
      "element_type": "narrative_text",
      "text": "This is a paragraph...",
      "metadata": {
        "page_number": 1
      }
    }
  ]
}
Unstructured:
[
  {
    "type": "Title",
    "text": "Introduction",
    "metadata": {
      "page_number": 1,
      "filename": "document.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "text": "This is a paragraph...",
    "metadata": {
      "page_number": 1
    }
  }
]
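During an incremental migration it can help to keep downstream code on the old shape for a while. The sketch below converts a Kreuzberg element-based response (as a parsed dictionary) into the Unstructured-style list shown above. The function names are hypothetical, and the sketch assumes only `page_number` and `filename` need to survive in the metadata; coordinates and other Kreuzberg-specific fields are dropped.

```python
def to_pascal_case(snake: str) -> str:
    """Convert 'narrative_text' to 'NarrativeText'."""
    return "".join(part.capitalize() for part in snake.split("_"))


def to_unstructured_shape(response: dict) -> list[dict]:
    """Flatten {'elements': [...]} into a bare list with Unstructured-style keys."""
    converted = []
    for element in response.get("elements", []):
        metadata = element.get("metadata") or {}
        # Keep only the metadata keys that the Unstructured example exposes.
        legacy_metadata = {
            key: metadata[key]
            for key in ("page_number", "filename")
            if key in metadata
        }
        converted.append({
            "type": to_pascal_case(element["element_type"]),
            "text": element.get("text", ""),
            "metadata": legacy_metadata,
        })
    return converted


kreuzberg_response = {
    "elements": [
        {
            "element_id": "elem-a3f2b1c4",
            "element_type": "title",
            "text": "Introduction",
            "metadata": {"page_number": 1, "filename": "Document Title", "element_index": 0},
        },
        {
            "element_type": "narrative_text",
            "text": "This is a paragraph...",
            "metadata": {"page_number": 1},
        },
    ]
}

legacy = to_unstructured_shape(kreuzberg_response)
print(legacy)
```

A shim like this is best treated as temporary: it discards the bounding boxes and hierarchy hints that are part of the reason to migrate.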
API Endpoint Mapping¶
| Unstructured | Kreuzberg | Notes |
|---|---|---|
| POST /general/v0/general | POST /extract | Single/batch extraction |
| N/A | POST /embed | Built-in embeddings (ONNX models) |
| N/A | GET /health | Health check |
| N/A | GET /cache/stats | Cache statistics |
Element Type Mapping¶
| Unstructured | Kreuzberg | Notes |
|---|---|---|
| Title | title | PDF hierarchy (h1-h6) detection |
| NarrativeText | narrative_text | Paragraphs split on double newlines |
| ListItem | list_item | Bullets, numbered, lettered |
| Table | table | Tab-separated text representation |
| Image | image | Format, dimensions in metadata |
| PageBreak | page_break | Between pages in multi-page docs |
| Header | header | Page header text |
| Footer | footer | Page footer text |
| N/A | heading | Section headings (beyond title) |
| N/A | code_block | Code snippets |
| N/A | block_quote | Quoted text blocks |
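Existing pipelines often filter elements by type name. The sketch below encodes the table above as a dictionary and translates such a filter, reporting any names the table does not cover instead of guessing. `translate_type_filter` and the constant names are hypothetical helpers, not part of either library.

```python
# Direct transcription of the element type mapping table above.
UNSTRUCTURED_TO_KREUZBERG = {
    "Title": "title",
    "NarrativeText": "narrative_text",
    "ListItem": "list_item",
    "Table": "table",
    "Image": "image",
    "PageBreak": "page_break",
    "Header": "header",
    "Footer": "footer",
}

# Types the table lists for Kreuzberg with no Unstructured counterpart.
KREUZBERG_ONLY = {"heading", "code_block", "block_quote"}


def translate_type_filter(unstructured_types):
    """Return (translated Kreuzberg names, names with no entry in the table)."""
    translated = set()
    unmapped = set()
    for name in unstructured_types:
        if name in UNSTRUCTURED_TO_KREUZBERG:
            translated.add(UNSTRUCTURED_TO_KREUZBERG[name])
        else:
            unmapped.add(name)
    return translated, unmapped


wanted, unmapped = translate_type_filter(["Title", "NarrativeText", "Formula"])
print(sorted(wanted), sorted(unmapped))
```

Code that used `Title` to mean "any heading" may also need to accept `heading`, since the table lists it separately for section headings; checking this against sample output is safer than assuming.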
Code Examples¶
Python¶
Unstructured:
from unstructured.partition.auto import partition
elements = partition(filename="document.pdf")
for element in elements:
    print(f"{element.category}: {element.text}")
Kreuzberg:
from pathlib import Path
from kreuzberg import extract_bytes
# Read the PDF into memory
pdf_bytes = Path("document.pdf").read_bytes()
# Option 1: Element-based output
config = {"output_format": "element_based"}
result = extract_bytes(pdf_bytes, "application/pdf", config)
for element in result.elements:
    print(f"{element.element_type}: {element.text}")
    if element.metadata.page_number:
        print(f"  Page: {element.metadata.page_number}")
# Option 2: Unified output (default, richer metadata)
result = extract_bytes(pdf_bytes, "application/pdf")
print(result.content)  # Full text
print(result.metadata.title)  # Document metadata
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:100]}")
TypeScript¶
Unstructured (via API):
const formData = new FormData();
formData.append('files', fileBlob);
const response = await fetch('https://api.unstructured.io/general/v0/general', {
  method: 'POST',
  body: formData
});
const elements = await response.json();
Kreuzberg:
import { readFile } from 'node:fs/promises';
import { extractBytes } from 'kreuzberg';
// Read the PDF into memory
const pdfBuffer = await readFile('document.pdf');
// Option 1: Element-based output
const elementResult = await extractBytes(pdfBuffer, 'application/pdf', {
  output_format: 'element_based'
});
for (const element of elementResult.elements) {
  console.log(`${element.element_type}: ${element.text}`);
}
// Option 2: Unified output with pages
// (named separately so both options can live in the same scope)
const pageResult = await extractBytes(pdfBuffer, 'application/pdf', {
  pages: { extract_pages: true }
});
for (const page of pageResult.pages) {
  console.log(`Page ${page.page_number}:`, page.content);
}
cURL¶
Unstructured:
curl -X POST "https://api.unstructured.io/general/v0/general" \
-H "unstructured-api-key: $API_KEY" \
-F 'files=@document.pdf' \
-F 'strategy=hi_res'
Kreuzberg:
# Element-based output
curl -X POST "http://localhost:8080/extract" \
-F 'files=@document.pdf' \
-F 'output_format=element_based'
# With configuration JSON
curl -X POST "http://localhost:8080/extract" \
-F 'files=@document.pdf' \
-F 'config={"output_format":"element_based","pages":{"extract_pages":true}}'
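Hand-writing JSON inside a shell-quoted form field is easy to get wrong. The sketch below, using only the Python standard library, serializes a configuration dictionary into the `config` field value and into a shell-safe `-F` argument. The helper names are hypothetical; the field name `config` is taken from the cURL example above.

```python
import json
import shlex


def build_config_field(config: dict) -> str:
    """Serialize the configuration compactly for the multipart 'config' field."""
    return json.dumps(config, separators=(",", ":"))


def curl_config_arg(config: dict) -> str:
    """Return a shell-quoted '-F' argument carrying the configuration."""
    return "-F " + shlex.quote("config=" + build_config_field(config))


config = {"output_format": "element_based", "pages": {"extract_pages": True}}
field_value = build_config_field(config)
curl_arg = curl_config_arg(config)
print(field_value)
print(curl_arg)
```

Note that Python's `True` becomes JSON `true` during serialization, which is the form the endpoint receives.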
Feature Comparison¶
What Kreuzberg Adds¶
- Richer Metadata: Format-specific discriminated unions (PDF, Excel, Email, etc.)
- Native Per-Page: PageContent with byte offsets, hierarchy, tables, images per page
- 56+ Formats: vs Unstructured's ~30 formats
- Performance: Rust-based native implementation (vs Python-based)
- 10 Language Bindings: Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, Rust, WASM
- Built-in Embeddings: ONNX models via /embed endpoint (no external API)
- Smart Hierarchy: PDF font-size clustering for h1-h6 detection
- Bounding Boxes: Preserved from PDF source in element coordinates
What Unstructured Has¶
- Layout Detection Models: ML-based layout analysis (GPU-accelerated)
- Cloud API: Hosted service (Kreuzberg requires self-hosting)
- More Element Types: More granular element classification
- Mature Ecosystem: Larger community, more integrations
Configuration Mapping¶
| Unstructured Parameter | Kreuzberg Config | Notes |
|---|---|---|
| strategy=hi_res | pdf_options.hierarchy.enabled=true | PDF hierarchy extraction |
| coordinates=true | Always included when available | Bounding boxes in element metadata |
| languages=["eng"] | ocr.language="eng" | OCR language |
| extract_image_block_types=["image"] | images.extract_images=true | Image extraction |
| chunking_strategy="by_title" | chunking.max_chars=1000 | Text chunking (basic) |
| embedding_model="..." | chunking.embedding.model="..." | Embedding generation |
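The table above can be applied mechanically when porting request parameters. The sketch below is one way to do that, under several assumptions: the dotted keys in the table are taken to mean nested dictionaries, only the first entry of `languages` is used, and `1000` is simply the example value from the table rather than an equivalent of `by_title`. `translate_params` is a hypothetical helper.

```python
def translate_params(params: dict) -> dict:
    """Map Unstructured request parameters to a nested Kreuzberg-style config dict."""
    config: dict = {}

    if params.get("strategy") == "hi_res":
        config.setdefault("pdf_options", {})["hierarchy"] = {"enabled": True}

    # coordinates=true needs no mapping: the table says they are always
    # included when available.

    languages = params.get("languages") or []
    if languages:
        # Assumption: a single OCR language; multi-language input is not handled.
        config["ocr"] = {"language": languages[0]}

    block_types = [t.lower() for t in params.get("extract_image_block_types") or []]
    if "image" in block_types:
        config["images"] = {"extract_images": True}

    if params.get("chunking_strategy"):
        # Size-based chunking stands in for by_title; 1000 is the table's example.
        config.setdefault("chunking", {})["max_chars"] = 1000

    if params.get("embedding_model"):
        config.setdefault("chunking", {})["embedding"] = {
            "model": params["embedding_model"]
        }

    return config


old_params = {
    "strategy": "hi_res",
    "coordinates": True,
    "languages": ["eng"],
    "extract_image_block_types": ["Image"],
    "chunking_strategy": "by_title",
}
new_config = translate_params(old_params)
print(new_config)
```

Because title-based chunking has no direct counterpart in the table, chunk boundaries will likely differ after migration, and downstream retrieval quality is worth re-checking.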
Migration Checklist¶
- Update API endpoint URLs (Unstructured → Kreuzberg)
- Add output_format=element_based if using element-based workflow
- Update element type references (Title → title, PascalCase → snake_case)
- Update metadata field references (Kreuzberg has richer metadata structure)
- Test with sample documents to verify output equivalence
- Update error handling (Kreuzberg uses HTTP 422 for validation errors)
- Configure caching if needed (Kreuzberg has built-in file-based cache)
- Set up embeddings if using RAG pipeline (Kreuzberg has built-in ONNX support)
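For the "verify output equivalence" step, a small comparison script can flag where the two outputs diverge. The sketch below compares element type and whitespace-normalized text position by position. It is deliberately naive: it assumes both tools emit elements in the same order, and the function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so formatting noise is not reported."""
    return re.sub(r"\s+", " ", text).strip()


def snake_case(pascal: str) -> str:
    """Convert 'NarrativeText' to 'narrative_text'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", pascal).lower()


def compare_outputs(unstructured_elements: list, kreuzberg_response: dict) -> list:
    """Return (index, old, new) tuples for every position where outputs differ."""
    old = [
        (snake_case(e["type"]), normalize(e.get("text", "")))
        for e in unstructured_elements
    ]
    new = [
        (e["element_type"], normalize(e.get("text", "")))
        for e in kreuzberg_response.get("elements", [])
    ]
    differences = []
    for index in range(max(len(old), len(new))):
        before = old[index] if index < len(old) else None
        after = new[index] if index < len(new) else None
        if before != after:
            differences.append((index, before, after))
    return differences


old_output = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "This is  a paragraph..."},
]
new_output = {
    "elements": [
        {"element_type": "title", "text": "Introduction"},
        {"element_type": "narrative_text", "text": "This is a paragraph..."},
        {"element_type": "heading", "text": "Methods"},
    ]
}

differences = compare_outputs(old_output, new_output)
print(differences)
```

Expect some differences on real documents, since the two tools segment text differently; the useful signal is whether the differences are ones your pipeline can tolerate.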
Advanced: Hybrid Approach¶
You can request element-based output and per-page content in the same call:
from kreuzberg import extract_bytes
# pdf_bytes holds the raw PDF; index_heading and process_section are your own handlers
result = extract_bytes(pdf_bytes, "application/pdf", {
    "output_format": "element_based",  # Get elements
    "pages": {"extract_pages": True}  # Also get per-page content
})
# Element-based processing
for element in result.elements:
    if element.element_type == "title":
        index_heading(element.text)
# Page-based processing
for page in result.pages:
    if page.hierarchy:
        for block in page.hierarchy.blocks:
            if block.level == "h1":
                process_section(block.text)
Performance Tips¶
- Enable Caching: use_cache: true (default) for repeated extractions
- Disable OCR: If documents are searchable PDFs, set force_ocr: false
- Limit Page Extraction: Only enable pages if you need per-page content
- Batch Processing: Send multiple files in a single request (up to 10MB total)
- Use Embeddings Wisely: Enable only for chunked content destined for vector DB
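The batch-size tip above can be sketched as a simple greedy planner that groups files so each request stays under the limit. This assumes "10MB" means 10 × 1024 × 1024 bytes and that a file larger than the limit cannot be sent in a batch at all; both are assumptions, and `plan_batches` is a hypothetical helper.

```python
MB = 1024 * 1024
MAX_BATCH_BYTES = 10 * MB  # Assumption: binary megabytes


def plan_batches(sizes: dict, limit: int = MAX_BATCH_BYTES) -> list:
    """Group file names into batches whose combined size stays within limit.

    `sizes` maps file name to size in bytes; input order is preserved.
    """
    batches = []
    current = []
    current_bytes = 0
    for name, size in sizes.items():
        if size > limit:
            raise ValueError(f"{name} alone exceeds the batch limit")
        # Start a new batch when adding this file would overflow the current one.
        if current and current_bytes + size > limit:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


batches = plan_batches({
    "a.pdf": 4 * MB,
    "b.pdf": 5 * MB,
    "c.pdf": 2 * MB,
    "d.pdf": 8 * MB,
})
print(batches)
```

In practice the sizes would come from something like `Path(name).stat().st_size`, with each planned batch sent as one multipart request.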
Getting Help¶
- Documentation: https://github.com/kreuzberg-dev/kreuzberg
- Issues: https://github.com/kreuzberg-dev/kreuzberg/issues
- Examples: See examples/ directory for full workflow samples
- API Reference: See docs/api/ for endpoint documentation
Next Steps¶
After migration:
1. Review the Kreuzberg vs Unstructured Comparison
2. Explore Kreuzberg-specific features (hierarchy, per-page metadata, embeddings)
3. Optimize your pipeline with native Rust performance