Features¶

Kreuzberg is a comprehensive document intelligence library supporting 75+ file formats with advanced extraction, OCR, and processing capabilities. This page documents all features and their availability across language bindings.

Core Extraction Features¶

File Format Support¶

Kreuzberg extracts text, tables, and metadata from 75+ file formats:

Documents - PDF (.pdf) - Native text extraction with optional OCR fallback - Microsoft Word (.docx, .doc) - Modern and legacy formats - OpenDocument Text (.odt) - OpenDocument text - Plain text (.txt, .md, .markdown, .djot, .mdx) - With metadata extraction for Markdown, Djot, and MDX

Spreadsheets - Excel (.xlsx, .xls, .xlsm, .xlsb) - Modern and legacy formats - OpenDocument Spreadsheet (.ods) - OpenDocument spreadsheet - CSV (.csv) - Comma-separated values - TSV (.tsv) - Tab-separated values

Presentations - PowerPoint (.pptx, .ppt) - Modern and legacy formats

Images - Common formats: JPEG, PNG, GIF, BMP, TIFF, WebP - Advanced formats: JPEG 2000 (.jp2, .jpx, .jpm, .mj2) - Portable formats: PNM, PBM, PGM, PPM

Email - EML (.eml) - RFC 822 email format - MSG (.msg) - Microsoft Outlook format

Web & Markup - HTML (.html, .htm) - Converted to Markdown - XML (.xml) - Streaming parser for large files - SVG (.svg) - Scalable vector graphics

Structured Data - JSON (.json) - JavaScript Object Notation - YAML (.yaml) - YAML Ain't Markup Language - TOML (.toml) - Tom's Obvious Minimal Language

Archives - ZIP (.zip) - ZIP archives - TAR (.tar, .tgz) - Tape archives - GZIP (.gz) - GNU zip - 7-Zip (.7z) - 7-Zip archives

Extraction Capabilities¶

Text Extraction - Native text extraction from all supported formats - Preserves formatting and structure where applicable - Handles multi-byte character encodings (UTF-8, UTF-16, etc.) - Mojibake detection and correction

Table Extraction - Structured table data from PDFs, spreadsheets, and Word documents - Cell-level extraction with row/column indexing - Markdown and JSON output formats - Merged cell support

Metadata Extraction - Document properties (title, author, creation date, etc.) - Page count, word count, character count - MIME type detection - Format-specific metadata (Excel sheets, PDF annotations, etc.)

Image Extraction - Extract embedded images from PDFs and Office documents - Image preprocessing for OCR optimization - Format conversion and resolution normalization

OCR (Optical Character Recognition)¶

Tesseract OCR¶

Native Tesseract integration available in all language bindings.

Features: - 100+ language support via Tesseract language packs - Page segmentation modes (PSM) for different layouts - OCR Engine Modes (OEM) for accuracy tuning - Confidence scoring per word/line - hOCR output format support - Automatic image preprocessing

Configuration: - Language selection (single or multi-language) - PSM and OEM mode selection - Custom Tesseract configuration strings - Whitelist/blacklist character sets

PaddleOCR (Native)¶

PaddleOCR is available as a native Rust backend in all non-WASM bindings via the paddle-ocr feature flag. Models are automatically downloaded on first use.

Production-ready OCR using ONNX Runtime
PP-OCRv5 server detection + per-family recognition models
80+ language support across 11 script families: English, Chinese, Latin, Korean, Slavic, Thai, Greek, Arabic, Devanagari, Tamil, Telugu
Per-family models downloaded on demand
Concurrent multi-language OCR via engine pool
Excellent CJK (Chinese, Japanese, Korean) accuracy
No Python dependency required
Also available as a Python package (pip install kreuzberg[paddleocr], requires Python <3.14)

Python-Specific OCR Backend¶

EasyOCR (pip install kreuzberg[easyocr]) - Deep learning-based OCR engine (Python only) - 80+ language support - GPU acceleration support (CUDA) - Better accuracy for certain scripts (CJK, Arabic, etc.) - Requires Python <3.14

OCR Features¶

Automatic fallback: Use OCR when native text extraction fails
Force OCR mode: Override native extraction with OCR
Caching: OCR results cached to disk for performance
Image preprocessing: Automatic contrast, deskew, and noise reduction
Multi-language detection: Process documents with mixed languages

Advanced Processing Features¶

Language Detection¶

Automatic language detection for extracted text using fast-langdetect.

Capabilities: - 60+ language detection - Confidence scoring - Multi-language detection (detect all languages in document) - Configurable confidence thresholds - ISO 639-1 and ISO 639-3 code support

Configuration:

language_detection_config.py

# Configure language detection with multiple language support
LanguageDetectionConfig(
    detect_multiple=True,
    confidence_threshold=0.7
)

Content Chunking¶

Split extracted text into semantic chunks for LLM processing.

Chunking Strategies: - Recursive: Split by paragraphs, sentences, then words - Semantic: Preserve semantic boundaries - Token-aware: Respect token limits for LLMs

Features: - Configurable chunk size and overlap - Metadata preservation per chunk - Character position tracking - Optional embedding generation

Configuration:

chunking_config.py

# Configure content chunking with size and overlap settings
ChunkingConfig(
    max_chars=1000,
    max_overlap=200
)

Embeddings¶

Generate vector embeddings for chunks using FastEmbed.

Embedding Models: - Preset models: "fast", "balanced", "quality" - FastEmbed models: Any model from FastEmbed catalog - Custom models: Bring your own embedding model

Features: - Local embedding generation (no API calls) - Automatic model download and caching - Multiple embedding dimensions (384, 512, 768, 1024) - Batch processing for performance - Optional L2 normalization

Configuration:

embedding_config.py

# Configure embedding generation with balanced preset model
EmbeddingConfig(
    model=EmbeddingModelType.preset("balanced"),
    normalize=True
)

Token Reduction¶

Reduce token count while preserving semantic meaning using extractive summarization.

Reduction Modes: - Light ("light"): ~15% reduction, minimal information loss - Moderate ("moderate"): ~30% reduction, balanced approach - Aggressive ("aggressive"): ~50% reduction, maximum compression

Algorithm: - TF-IDF based sentence scoring - Stopword filtering with language-specific lists - Position-aware scoring (preserve important sections) - Configurable reduction targets

Configuration:

token_reduction_config.py

# Configure token reduction with moderate compression
TokenReductionConfig(
    mode="moderate"
)

Quality Processing¶

Enhance extraction quality with text normalization and cleanup.

Processing Steps: - Unicode normalization (NFC/NFD/NFKC/NFKD) - Whitespace normalization - Line break standardization - Encoding detection and correction - Mojibake fixing - Character set validation

Configuration:

extraction_config.py

# Enable quality processing for text normalization and cleanup
ExtractionConfig(
    enable_quality_processing=True
)

Keyword Extraction¶

Extract keywords and key phrases from documents.

Algorithms: - YAKE (Yet Another Keyword Extractor): Unsupervised, language-independent - RAKE (Rapid Automatic Keyword Extraction): Fast statistical method

Features: - Configurable number of keywords - N-gram support (1-3 word phrases) - Language-specific stopword filtering - Relevance scoring

Configuration:

keyword_extraction_config.py

# Configure keyword extraction using YAKE algorithm
KeywordConfig(
    algorithm=KeywordAlgorithm.Yake,
    max_keywords=10,
    ngram_range=(1,3)
)

Page Tracking and Boundaries¶

Extract per-page content and track precise page boundaries with byte-accurate offsets.

Capabilities: - Per-page content extraction (text, tables, images per page) - Byte-offset page boundaries for O(1) page lookups - Automatic chunk-to-page mapping when chunking is enabled - Page markers in combined text for LLM context - Format-specific page types (Page/Slide/Sheet)

Supported Formats: - PDF: Full byte-accurate tracking, O(1) performance - PPTX: Slide boundary tracking - DOCX: Best-effort page break detection

Configuration:

page_tracking.py

config = ExtractionConfig(
    pages=PageConfig(
        extract_pages=True,          # Get pages array
        insert_page_markers=True,     # Add markers in content
        marker_format="--- Page {page_num} ---"
    )
)

Use Cases: - Precise location references in RAG systems - Page-aware embeddings and retrieval - Per-page processing workflows - Document structure analysis - Page-filtered search results

PDF Hierarchy Detection¶

Automatically detect and extract document structure from PDF documents using K-means clustering to identify semantic hierarchies of blocks.

Algorithm Overview:

The PDF hierarchy detection system analyzes PDF content blocks to infer document structure without relying on explicit heading tags or format markers. The approach uses K-means clustering to identify semantic levels and group related content:

Block Analysis: Extracts all text blocks from PDF with position, size, and styling information
Feature Extraction: Computes features for each block including size, indentation, font characteristics, and position
K-means Clustering: Groups blocks into semantic levels (typically 3-5 levels) representing document hierarchy
Hierarchy Inference: Maps clustered blocks to hierarchical levels (title, section, subsection, paragraph, etc.)
Relationship Detection: Links related blocks using spatial proximity and semantic similarity

K-means Clustering Details:

The clustering algorithm identifies optimal semantic levels by analyzing block characteristics:

Number of Clusters: Automatically determined (typically 3-5 levels) based on content distribution
Features Used: Font size, text weight, indentation, position, text length
Convergence: Iterative refinement until cluster stability
Output: Each block assigned to a semantic level with confidence scores

Configuration Options:

hierarchy_config.py

from kreuzberg import ExtractionConfig, PdfConfig, HierarchyConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy_detection=HierarchyConfig(
            enabled=True,                      # Enable hierarchy detection
            k_clusters=6,                      # Number of clusters for semantic levels
            include_bbox=True,                 # Include bounding box in output
            ocr_coverage_threshold=None        # OCR coverage threshold (None = auto)
        )
    )
)

Output Structure:

hierarchy_output.py

# Access hierarchy from extraction result
result = extract_file("document.pdf", config=config)

# Hierarchy blocks available at page level
for page in result.pages:
    for block in page.hierarchy.blocks:
        print(f"Level {block.level}: {block.text}")
        print(f"Font Size: {block.font_size}")
        print(f"Bounding Box: {block.bbox}")

Use Cases:

1. Retrieval Augmented Generation (RAG) - Build hierarchical knowledge bases with semantic structure - Improve retrieval precision by understanding document context - Support structured queries using hierarchy levels - Enable context-aware chunk selection during retrieval

2. Document Indexing - Create multi-level indexes for better navigation - Support breadcrumb-style navigation in user interfaces - Index content by semantic section for faster lookup - Generate table of contents automatically

3. Semantic Chunking - Respect logical document structure when splitting content - Keep related sections together despite content length - Assign semantic metadata to chunks based on hierarchy - Preserve hierarchical context for embeddings

4. Content Analysis - Identify document structure patterns - Analyze section-level statistics and summaries - Generate outline or abstract from hierarchy - Detect anomalies in document structure

5. Knowledge Base Construction - Organize extracted content into structured collections - Map hierarchy to graph databases or knowledge graphs - Support structured navigation and exploration - Enable hierarchy-aware search and filtering

Integration with Other Features:

The hierarchy detection integrates seamlessly with other Kreuzberg features:

combined_features.py

from kreuzberg import ExtractionConfig, ChunkingConfig, EmbeddingConfig, PdfConfig, HierarchyDetectionConfig

config = ExtractionConfig(
    # Enable hierarchy detection
    pdf_options=PdfConfig(
        hierarchy_detection=HierarchyDetectionConfig(enabled=True)
    ),
    # Combine with semantic chunking
    chunking=ChunkingConfig(
        max_chars=1000,
        max_overlap=200
    ),
    # Generate embeddings for chunks
    embeddings=EmbeddingConfig(
        model=EmbeddingModelType.preset("balanced")
    )
)

result = extract_file("document.pdf", config=config)

# Use hierarchy information to enrich chunks
for chunk in result.chunks:
    # Find which hierarchy block this chunk belongs to
    containing_block = find_hierarchy_block(chunk, result)
    print(f"Chunk belongs to: {containing_block.text}")
    print(f"Semantic level: {containing_block.level}")

Performance Characteristics:

Time Complexity: O(n·k·i) where n = blocks, k = clusters, i = iterations (typically < 100ms for average documents)
Memory Usage: Linear with document size (O(n) for n blocks)
GPU Support: Optional GPU acceleration for large documents (100+ pages)
Caching: Results cached based on document content hash

Batch Processing¶

Parallel Extraction¶

Process multiple documents concurrently using async/await or thread pools.

Async API:

batch_extraction.py

# Process multiple documents concurrently using async batch extraction
results = await batch_extract_file(
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    config=config
)

Features: - Automatic concurrency based on CPU count - Configurable worker limits - Error handling per document - Progress tracking - Memory-efficient streaming for large batches

Caching¶

Intelligent caching system for expensive operations.

Cached Operations: - OCR results (per image hash) - Language detection results - Embedding vectors - Extracted metadata

Cache Features: - Disk-based storage - Automatic cache invalidation - Configurable cache directory - Cache statistics and management - LRU eviction policy

Configuration:

cache_config.py

# Enable caching with custom directory
ExtractionConfig(
    use_cache=True,
    cache_dir="/custom/cache/path"
)

Configuration & Discovery¶

Configuration Methods¶

Kreuzberg supports four configuration methods:

Programmatic: Create configuration objects in code
TOML files: kreuzberg.toml
YAML files: kreuzberg.yaml
JSON files: kreuzberg.json

Automatic Discovery¶

Configuration files automatically discovered in order:

Current directory: ./kreuzberg.{toml,yaml,json}
User config: ~/.config/kreuzberg/config.{toml,yaml,json}
System config: /etc/kreuzberg/config.{toml,yaml,json}

Discovery API:

config_discovery.py

# Automatically discover and load configuration from filesystem
config = ExtractionConfig.discover()

Environment Variables¶

Override configuration via environment variables:

KREUZBERG_CONFIG_PATH: Path to config file
KREUZBERG_CACHE_DIR: Cache directory
KREUZBERG_OCR_BACKEND: OCR backend selection
KREUZBERG_OCR_LANGUAGE: OCR language

Plugin System¶

Plugin Types¶

Extensible architecture supporting four plugin types:

Document Extractors - Add support for custom file formats - Override default extractors - Priority-based selection

OCR Backends - Integrate cloud OCR services - Custom OCR engines - Preprocessing pipelines

Post Processors - Transform extraction results - Add custom metadata - Filter or enhance content

Validators - Validate extraction results - Enforce quality standards - Custom error handling

Plugin Registration¶

Rust:

plugin_registration.rs

// Register custom document extractor with priority 50
let registry = get_document_extractor_registry();
registry.register("custom", Arc::new(MyExtractor), 50)?;

Python:

plugin_registration.py

# Register custom document extractor plugin
from kreuzberg.plugins import register_extractor

register_extractor(MyExtractor(), priority=50)

Plugin Discovery¶

Automatic plugin discovery from: - Python entry points - Configuration files - Environment variables

Server Modes¶

HTTP REST API Server¶

Production-ready RESTful API server.

Endpoints: - POST /extract - Extract from uploaded files - GET /health - Health check - GET /info - Server information - GET /cache/stats - Cache statistics - POST /cache/clear - Clear cache

Features: - File upload support - JSON/multipart request handling - CORS configuration - Request logging and metrics - Graceful shutdown

Start Server:

Terminal

kreuzberg serve --host 0.0.0.0 --port 8000

Model Context Protocol (MCP) Server¶

Stdio-based MCP server for AI agent integration.

Tools: - extract_file - Extract from file path - extract_bytes - Extract from base64 bytes - batch_extract - Extract from multiple files

Features: - Stdio transport (Claude Desktop, Continue.dev, etc.) - JSON-RPC 2.0 protocol - Streaming results - Error handling

Start Server:

Terminal

kreuzberg mcp

Claude Desktop Configuration:

claude_desktop_config.json

// Configure kreuzberg MCP server in Claude Desktop
{
  "mcpServers": {
    "kreuzberg": {
      "command": "kreuzberg",
      "args": ["mcp"]
    }
  }
}

AI Coding Assistants¶

Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. Install it into any project with the Vercel Skills CLI:

npx skills add kreuzberg-dev/kreuzberg

Supported Tools:

Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills open standard.

What It Covers:

Complete API knowledge for Python, Node.js/TypeScript, Rust, and CLI
Extraction flows (sync/async, file/bytes, single/batch)
Configuration (OCR, chunking, output format, PDF options)
Embedding generation and RAG pipeline setup
Error handling patterns per language
Plugin system (post-processors, validators, OCR backends)
Correct field names and signatures (avoids common pitfalls)

Reference Files:

The skill includes 8 detailed reference files covering Python API, Node.js API, Rust API, CLI commands, configuration schema, supported formats, advanced features, and other language bindings (Go, Ruby, Java, C#, PHP, Elixir).

See AI Coding Assistants Guide for details.

Language Binding Comparison¶

Feature Availability¶

Feature	C#	Go	Python	Ruby	Rust	TypeScript (Native)	TypeScript (WASM)
Core Extraction	✓	✓	✓	✓	✓	✓	✓
All file formats	✓	✓	✓	✓	✓	✓	✓
Table extraction	✓	✓	✓	✓	✓	✓	✓
Metadata extraction	✓	✓	✓	✓	✓	✓	✓
OCR
Tesseract	✓	✓	✓	✓	✓	✓	✓
EasyOCR	✗	✗	✓ (optional)	✗	✗	✗	✗
PaddleOCR	✓	✓	✓ (optional)	✓	✓	✓	✗
Processing
Language detection	✓	✓	✓	✓	✓	✓	✓
Content chunking	✓	✓	✓	✓	✓	✓	✓
Embeddings	✓	✓*	✓	✓	✓	✓	✓
Token reduction	✓	✓	✓	✓	✓	✓	✓
Quality processing	✓	✓	✓	✓	✓	✓	✓
Keyword extraction	✓	✓	✓	✓	✓	✓	✓
Configuration
Programmatic config	✓	✓	✓	✓	✓	✓	✓
File-based config	✓	✓	✓	✓	✓	✓	✗
Config discovery	✓	✓	✓	✓	✓	✓	✗
Plugin System
Document extractors	✓	✓	✓	✓	✓	✓	Limited
OCR backends	✓	✓	✓	✓	✓	✓	Limited
Post processors	✓	✓	✓	✓	✓	✓	Limited
Validators	✓	✓	✓	✓	✓	✓	Limited
Servers
HTTP REST API	✓	✓	✓	✓	✓	✓	✗
MCP Server	✓	✓	✓	✓	✓	✓	✗
APIs
Sync API	✓	✓	✓	✓	✓	✓	✓
Async API	✗	✗	✓	✗	✓	✓	✓
Batch processing	✓	✓	✓	✓	✓	✓	✓
Streaming	✗	✗	✗	✗	✓	✗	✗
File I/O
File system access	✓	✓	✓	✓	✓	✓	Limited*

Platform Notes: - * Go embeddings not available on Windows (MinGW cannot link ONNX Runtime which requires MSVC) - * WASM: File system access limited to browser File API (read-only), Cloudflare Workers, or Deno (with permissions)

TypeScript Binding Differences¶

Native (@kreuzberg/node): - Fastest performance (100% of native speed) - Full feature parity with other language bindings - Full file I/O capabilities - Server-side HTTP/MCP servers supported - File-based configuration discovery - Plugin system with custom implementations

WASM (@kreuzberg/wasm): - Cross-platform browser compatibility (60-80% of native speed) - Zero native dependencies - Limited file system access (browser File API only) - No server mode support (use worker/edge runtime instead) - Plugins limited to in-memory registration (no filesystem) - Configuration via programmatic API only

Choose Native for server-side Node.js applications. Choose WASM for browser/edge environments.

Package Distribution¶

Language	Package Manager	Modular Features	Full Package
C#	`kreuzberg.dev`	✗	✓ (default)
Go	go.pkg.dev	✗	✓ (default)
Python	PyPI (`pip`)	✗	✓ (default)
Ruby	RubyGems (`gem`)	✗	✓ (default)
Rust	crates.io	✓	✗ (opt-in)
TypeScript	npm	✗	✓ (default)

Rust Feature Flags¶

Rust provides fine-grained control over included components via Cargo features:

Format Extractors: - pdf - PDF extraction (pdfium) - excel - Excel/spreadsheet support - office - Office document support (Word, PowerPoint) - email - Email extraction (EML, MSG) - html - HTML to Markdown conversion - xml - XML streaming parser - archives - Archive extraction (ZIP, TAR, 7z) - markdown - Markdown extraction - djot - Djot extraction - mdx - MDX extraction

Processing Features: - ocr - Tesseract OCR integration - language-detection - Language detection - chunking - Content chunking - embeddings - Embedding generation (requires chunking) - quality - Quality processing and text normalization - keywords - Keyword extraction (YAKE + RAKE) - stopwords - Stopword filtering

Server Features: - api - HTTP REST API server - mcp - Model Context Protocol server

Convenience Bundles: - full - All format extractors + all processing features - server - Server features + common extractors - cli - CLI features + common extractors

Example Cargo.toml:

Cargo.toml

[dependencies]
kreuzberg = { version = "4.0", features = ["pdf", "ocr", "chunking"] }

Default: No features enabled (minimal build)

Python Optional Dependencies¶

Python bindings include all core features by default. Optional OCR backends require separate installation:

Terminal

# Core package (Tesseract OCR only)
pip install kreuzberg

# With EasyOCR
pip install kreuzberg[easyocr]

# With PaddleOCR
pip install kreuzberg[paddleocr]

# All optional features
pip install kreuzberg[all]

Note: EasyOCR requires Python <3.14 due to PyTorch dependencies. The Python PaddleOCR package also requires Python <3.14, but the native Rust PaddleOCR backend has no Python dependency and works on all platforms.

TypeScript/Ruby Packages¶

TypeScript provides two packages with different feature sets:

Terminal

# Native TypeScript - full features (Node.js/Bun)
npm install @kreuzberg/node

# WASM TypeScript - browser/edge compatible (60-80% of native speed)
npm install @kreuzberg/wasm

Ruby includes all features in a single package:

Terminal

# Ruby - full package
gem install kreuzberg

Performance Characteristics¶

Rust Core¶

Kreuzberg's Rust core provides efficient performance for document processing:

Key Optimizations: - Streaming parsers for large documents - Concurrent extraction with configurable worker pools - Intelligent caching of expensive operations

Memory Efficiency¶

Streaming Support: - XML: Constant memory regardless of file size - Plain text: Line-by-line streaming for large files - Archives: Extract on-demand without loading entire archive

Caching¶

Disk-based caching improves performance for repeated operations:

Cache results for OCR, language detection, and embeddings
Automatic cache invalidation based on file content hash
Configurable cache directory and eviction policy

CLI Tools¶

Extract Command¶

Primary CLI for document extraction.

Terminal

kreuzberg extract document.pdf --ocr true --format json

Features: - Batch processing with glob patterns - Output format selection (text, JSON) - OCR configuration - Config file and inline JSON (--config, --config-json, --config-json-base64)

Serve Command¶

Start HTTP REST API server.

Terminal

kreuzberg serve --host 0.0.0.0 --port 8000 --config production.toml

MCP Command¶

Start Model Context Protocol server.

Terminal

kreuzberg mcp --config kreuzberg.toml

Cache Management¶

Terminal

# View cache statistics
kreuzberg cache stats

# Clear cache
kreuzberg cache clear

See CLI Usage for complete documentation.

System Requirements¶

Runtime Dependencies¶

Required (all platforms): - Tesseract OCR (4.0+) for OCR functionality

Installation:

Terminal

# macOS
brew install tesseract

# Ubuntu/Debian
apt-get install tesseract-ocr

# RHEL/CentOS/Fedora
dnf install tesseract

# Windows (Chocolatey)
choco install tesseract

Python Requirements¶

Python 3.10+
Optional: CUDA toolkit for GPU-accelerated OCR (EasyOCR)

TypeScript/Node.js Requirements¶

Node.js 18+
Native module support (node-gyp)

Rust Requirements¶

Rust 1.80+ (edition 2024)
Cargo for building from source

Ruby Requirements¶

Ruby 3.2.0 or higher (including Ruby 4.x)
Ruby 4.0+ is fully supported with no code changes required
Native extension support

Docker Images¶

Pre-built Docker images available on Docker Hub:

Variants: - ghcr.io/kreuzberg-dev/kreuzberg:latest - Core + Tesseract - ghcr.io/kreuzberg-dev/kreuzberg:latest - All features

Usage:

Terminal

docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf --ocr

See Installation Guide for detailed instructions.

Next Steps¶

Installation - Install Kreuzberg
Quick Start - Get started in 5 minutes
Configuration - Configure extraction behavior
API Reference - Complete API documentation