Skip to content

Features

Kreuzberg is a document intelligence library built on a Rust core with bindings for 12 languages. It extracts text, tables, and metadata from 91+ file formats, runs OCR when needed, and feeds the results through a configurable post-processing pipeline -- chunking, embeddings, keyword extraction, and more.

This page is a map of what Kreuzberg can do. Each section links to the guide or reference page where you will find configuration details and code examples.

Kreuzberg features overview -- 91+ input formats flow through extraction, OCR, and processing to produce text, tables, chunks, and metadata


Format Support

Kreuzberg handles 91+ file formats through native Rust extractors. No external tools such as LibreOffice are required.

PDF .pdf Word .docx .doc Pages .pages PowerPoint .pptx .ppt Keynote .key OpenDocument .odt Plain text .txt Markdown .md Djot .djot MDX .mdx RTF .rtf reStructuredText .rst Org .org Hangul .hwp .hwpx

Excel .xlsx .xls .xlsm .xlsb Numbers .numbers OpenDocument .ods CSV .csv TSV .tsv dBASE .dbf

JPEG .jpg .jpeg PNG .png GIF .gif BMP .bmp TIFF .tiff .tif WebP .webp JPEG 2000 .jp2 .jpx .jpm .mj2 JBIG2 .jbig2 PNM .pnm .pbm .pgm .ppm

EML .eml MSG .msg

HTML .html .htm XHTML .xhtml XML .xml SVG .svg

JSON .json YAML .yaml TOML .toml

ZIP .zip TAR .tar .tgz GZIP .gz 7-Zip .7z

EPUB .epub BibTeX .bib RIS .ris CSL .csl LaTeX .tex Typst .typ JATS .jats DocBook .docbook OPML .opml

For the full format matrix with MIME types, extraction methods, and special capabilities, see the Format Support Reference.


Extraction Pipeline

Every file -- whether a single PDF or a batch of thousands -- flows through the same multi-stage pipeline:

flowchart LR
    A[Input File] --> B[MIME Detection]
    B --> C[Format Extractor]
    C --> D{OCR Needed?}
    D -->|Yes| E[OCR Engine]
    D -->|No| F[Post-Processing]
    E --> F
    F --> G[ExtractionResult]
  1. MIME detection -- Kreuzberg identifies the file type from magic bytes and extension, then selects the matching native extractor from the registry.
  2. Format extraction -- The extractor pulls text, tables, metadata, and optionally images from the file. PDF extraction uses pdfium; Office formats use streaming XML parsers; images pass directly to OCR.
  3. OCR -- When the extractor finds no text layer (or force_ocr is set), the file is routed to the configured OCR backend. The OCR result replaces or supplements the extracted text.
  4. Post-processing -- Validators, quality processing, chunking, embeddings, keyword extraction, and any registered post-processor plugins run in sequence.
  5. Caching -- If caching is enabled, results are stored keyed by a content hash so repeated extractions skip the entire pipeline.

For a deep dive into each stage, see Extraction Pipeline.

Output Formats

Kreuzberg supports five output formats: Plain text, Markdown, Djot, HTML, and Structured (JSON). The HTML format includes a styled renderer with semantic kb-* CSS classes, five built-in themes, and CSS custom properties for full customization. See HTML Output for details.


OCR Engines

Kreuzberg supports three OCR backends. You can use one backend, or chain multiple backends into a fallback pipeline that automatically retries with the next engine when quality is low.

Backend Comparison

Tesseract PaddleOCR EasyOCR
Languages 100+ 80+ (11 script families) 80+
Best for General purpose, broad language coverage CJK, complex scripts, high accuracy GPU-accelerated workloads
Platform All bindings including WASM All non-WASM bindings Python only
Install System package (tesseract-ocr) Cargo feature paddle-ocr or pip install kreuzberg[paddleocr] pip install kreuzberg[easyocr]
Runtime C library (Tesseract 4.0+) ONNX Runtime (models downloaded on first use) PyTorch (optional CUDA)
Python version Any Native: any. Python package: <3.14 <3.14

Multi-Backend Pipeline

Added in v4.5.0

When the paddle-ocr feature is enabled, Kreuzberg automatically constructs a fallback pipeline: Tesseract runs first, and if the output falls below configurable quality thresholds (16 tunable parameters), PaddleOCR takes over. You can also define a custom ordering across all three backends.

The pipeline supports auto-rotate for page orientation detection (0/90/180/270 degrees) and per-stage language and backend-specific settings.

flowchart TD
    A[Image / Scanned Page] --> B[Primary Backend]
    B --> C{Quality Above Threshold?}
    C -->|Yes| D[Return Result]
    C -->|No| E[Fallback Backend]
    E --> F{Quality Above Threshold?}
    F -->|Yes| D
    F -->|No| G[Return Best Result]

Document-Level Optimization

Added in v4.5.3

Some OCR backends (including EasyOCR) now support document-level processing. When a file path is provided, the extractor can bypass the expensive page-by-page rendering stage and delegate the entire document to the OCR engine. This significantly reduces memory overhead and improves throughput for large PDFs and multi-page images.

For backend configuration, language selection, and PSM/OEM modes, see the OCR Guide.


Processing Features

After extraction, Kreuzberg can run a chain of processing steps. Each is optional and configured independently through ExtractionConfig.

For RAG Pipelines

Content Chunking -- Split extracted text into sized chunks for LLM consumption. Strategies include recursive (paragraph/sentence/word splitting), semantic, and Markdown-aware chunking that preserves heading hierarchy. Chunks can be sized by character count or by token count using any HuggingFace tokenizer.

Embeddings -- Generate vector embeddings locally using FastEmbed. Choose from preset models ("fast", "balanced", "quality") or any FastEmbed-compatible model. Embeddings are generated in-process with no external API calls.

Page Tracking -- Extract per-page content with byte-accurate offsets for O(1) page lookups. Chunks are automatically mapped to their source pages, enabling precise citations in retrieval systems. Supported for PDF (byte-accurate), PPTX (slide boundaries), and DOCX (best-effort page breaks). See Extraction Basics for usage.

PDF Hierarchy Detection -- Detect document structure from PDFs using K-means clustering on block characteristics (font size, weight, indentation, position). Blocks are assigned to semantic levels (title, section, subsection, paragraph) without relying on explicit heading tags. See the PDF Hierarchy Guide.

PDF Page Rendering -- Render individual PDF pages as PNG images for thumbnails, vision model input, or custom processing pipelines. Memory-efficient iterator renders one page at a time. Configurable DPI (default 150). Available across all language bindings. See PDF Rendering Guide.

Added in v4.6.2

LLM-Powered Intelligence

Added in v4.8.0

Kreuzberg integrates with 146 LLM providers including local inference (Ollama, LM Studio, vLLM, llama.cpp) via liter-llm to unlock three new capabilities that complement the local extraction pipeline.

VLM OCR -- Vision language models as an OCR backend Use OpenAI GPT-4o, Anthropic Claude, Google Gemini, or any vision-capable model as an OCR engine. VLM OCR delivers superior accuracy on low-quality scans, handwriting, Arabic/Farsi scripts, and complex layouts where traditional OCR struggles. Configure via `ocr.backend = "vlm"` with `ocr.vlm_config` in your extraction config or `kreuzberg.toml`.
Structured Extraction -- Extract typed JSON from documents using a schema Provide a JSON schema and an optional Jinja2 prompt template; the LLM returns conforming structured data. Supports strict mode (OpenAI) with automatic `additionalProperties` sanitization for cross-provider compatibility. Available through the `kreuzberg extract-structured` CLI command, `POST /extract-structured` API endpoint, and `extract_structured` MCP tool.
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "total": { "type": "number" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "amount": { "type": "number" }
        }
      }
    }
  }
}
VLM Embeddings -- Provider-hosted embedding models Use provider-hosted embedding models (for example, `openai/text-embedding-3-small`, `mistral/mistral-embed`) as an alternative to local ONNX models. Works through the existing `/embed` API endpoint, `embed_text` MCP tool, and `embed` CLI command with `--provider llm`.
Custom Jinja2 Prompts -- Minijinja template engine for LLM prompts Customize the prompts sent to LLMs with Minijinja templates. Available variables for structured extraction: `{{ content }}`, `{{ schema }}`, `{{ schema_name }}`, `{{ schema_description }}`. For VLM OCR prompts: `{{ language }}`. Override the default prompt per-request or in configuration.

LlmConfig and StructuredExtractionConfig types are exposed in Python, Node.js, and PHP bindings. Five new environment variables (KREUZBERG_LLM_MODEL, KREUZBERG_LLM_API_KEY, KREUZBERG_LLM_BASE_URL, KREUZBERG_VLM_OCR_MODEL, KREUZBERG_VLM_EMBEDDING_MODEL) provide zero-code configuration.

For Search and Indexing

Keyword Extraction -- Extract key phrases using YAKE (unsupervised, language-independent) or RAKE (fast statistical method). Configurable n-gram ranges and language-specific stopword filtering. See the Keyword Extraction Guide.

Language Detection -- Identify 60+ languages with confidence scoring using fast-langdetect. Supports multi-language detection for documents with mixed content.

Metadata Extraction -- Pull document properties (title, author, creation date), page/word/character counts, and format-specific metadata (Excel sheet names, PDF annotations).

For Code

Code Intelligence -- Extract functions, classes, imports, exports, symbols, docstrings, and diagnostics from 248 programming languages via tree-sitter. Results are available in ExtractionResult.code_intelligence as a ProcessResult. Code files produce semantic chunks (function/class-aware) that bypass the text-splitter entirely. Configure content mode with CodeContentMode: chunks (default, semantic TSLP chunks), raw (source as-is), or structure (headings + docstrings only).

For Data Quality

Quality Processing -- Unicode normalization (NFC/NFD/NFKC/NFKD), whitespace and line break standardization, encoding detection, and mojibake correction.

Token Reduction -- Reduce token count while preserving meaning through TF-IDF-based extractive summarization. Three modes: light (~15% reduction), moderate (~30%), and aggressive (~50%).

Table Extraction -- Structured table data from PDFs, spreadsheets, and Word documents with cell-level row/column indexing, merged cell support, and Markdown or JSON output.


Layout Detection

Added in v4.5.0

Detect and classify document regions using ONNX-based deep learning models. Layout detection identifies tables, figures, headers, text blocks, code sections, forms, and more, then feeds that structure into extraction for better accuracy.

Model Presets

Preset Model Element Classes Use Case
"fast" YOLO 11 High-throughput pipelines, general documents
"accurate" RT-DETR v2 17 Complex layouts, forms, mixed-content pages

Layout detection includes SLANet table structure recognition for HTML table recovery with colspan/rowspan, heading detection with confidence-based overrides, and GPU acceleration via ONNX Runtime (CUDA, CoreML, TensorRT).

Models are automatically downloaded and cached from HuggingFace on first use. Available across all language bindings except WebAssembly -- WASM does not support layout detection because ONNX Runtime is unavailable in browser environments.

For configuration and usage examples, see the Layout Detection Guide.


Plugin System

Kreuzberg's extraction pipeline is extensible through four plugin types, each hooking into a different stage:

flowchart LR
    A[File Input] --> B[Document Extractor Plugin]
    B --> C[OCR Backend Plugin]
    C --> D[Validator Plugin]
    D --> E[Post-Processor Plugin]
    E --> F[Output]
Plugin Type Purpose Example
Document Extractors Add support for custom file formats or override defaults Proprietary format parser
OCR Backends Integrate cloud OCR services or custom engines AWS Textract, Google Vision
Validators Enforce quality standards on extraction results Minimum word count check
Post-Processors Transform or enrich results after extraction PII redaction, custom metadata

Plugins are registered with a priority value that determines execution order. Discovery works through Python entry points, configuration files, or environment variables.

For the architecture overview, see Plugin System. For implementation guidance, see Creating Plugins.


Deployment Modes

Mode When to Use Details
Library Embedding extraction into your application Import the package in Python, TypeScript, Rust, Go, Ruby, C#, Java, PHP, Elixir, R, or C
CLI One-off extractions, scripting, CI pipelines kreuzberg extract document.pdf --format json -- see CLI Usage
REST API Multi-service architectures, language-agnostic access kreuzberg serve --port 8000 -- see API Server Guide
MCP Server AI agent integration (Claude Desktop, Continue.dev) kreuzberg mcp -- stdio transport with JSON-RPC 2.0
Docker Reproducible deployments with all dependencies bundled ghcr.io/kreuzberg-dev/kreuzberg:latest -- see Docker Guide

Language Bindings

Kreuzberg's Rust core is exposed through native bindings for 12 languages. All bindings share the same extraction engine and produce identical results.

Binding Tiers

Full feature parity with async API -- Python (PyO3), TypeScript/Node.js (NAPI-RS), Rust

Full features, synchronous API -- Go, Ruby, C#, Java

Subset or constrained environment -- TypeScript/WASM (browser/edge, no filesystem or server modes), PHP, Elixir, R, C (FFI)

TypeScript: Native vs WASM

Native (@kreuzberg/node) runs at full speed with complete feature parity including servers, plugins, and config file discovery. WASM (@kreuzberg/wasm) runs in browsers and edge runtimes at 60-80% of native speed with no native dependencies, but lacks filesystem access, server modes, file-based configuration, layout detection (requires ONNX Runtime), hardware acceleration config, concurrency config, and email config. Choose Native for server-side Node.js; choose WASM for browser or edge deployments.

Rust Feature Flags

Rust builds are modular through Cargo features. Nothing is enabled by default -- you opt into exactly what you need:

Category Features
Format extractors pdf, excel, office, email, html, xml, archives, markdown, djot, mdx
Processing ocr, paddle-ocr, language-detection, chunking, embeddings, quality, keywords, stopwords
Servers api, mcp
Bundles full (all extractors + processing), server, cli

Package Installation

pip install kreuzberg                  # Core + Tesseract
pip install kreuzberg[paddleocr]       # + PaddleOCR
pip install kreuzberg[easyocr]         # + EasyOCR
pip install kreuzberg[all]             # Everything
npm install @kreuzberg/node            # Native (Node.js/Bun)
npm install @kreuzberg/wasm            # WASM (browser/edge)
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf", "ocr", "chunking"] }
gem install kreuzberg                  # Ruby
go get github.com/kreuzberg-dev/kreuzberg/packages/go/v4  # Go
dotnet add package kreuzberg.dev       # C#

For API details per language, see the API Reference.


Configuration

Kreuzberg supports four configuration methods, checked in this order:

  1. Programmatic -- Construct ExtractionConfig objects in code (all bindings)
  2. TOML -- kreuzberg.toml
  3. YAML -- kreuzberg.yaml
  4. JSON -- kreuzberg.json

Config files are auto-discovered from the current directory, ~/.config/kreuzberg/, and /etc/kreuzberg/. Environment variables (KREUZBERG_CONFIG_PATH, KREUZBERG_CACHE_DIR, KREUZBERG_OCR_BACKEND, KREUZBERG_OCR_LANGUAGE) override file-based settings.

For the full configuration schema and examples, see the Configuration Guide.


AI Coding Assistants

Added in v4.2.15

Kreuzberg ships with an Agent Skill that teaches AI coding assistants the complete API across Python, TypeScript, Rust, and CLI. Install it with:

npx skills add kreuzberg-dev/kreuzberg

Compatible with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard. See the AI Coding Assistants Guide.


Next Steps