Kreuzberg vs Docling¶

Kreuzberg and Docling are both open-source document extraction libraries, but they come at the problem from different angles. Kreuzberg is a Rust library focused on speed and broad format coverage across many languages. Docling is an IBM-backed Python library that leans heavily on deep learning models for document understanding. Here's how they compare.

At a Glance¶

	Kreuzberg	Docling
Written in	Rust	Python
File formats	91+	~38 extensions (15+ types)
Use from	Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, WASM	Python
License	Apache-2.0	MIT
OCR	Tesseract + PaddleOCR (local, multi-backend fallback)	Tesseract + EasyOCR
Sweet spot	High-throughput pipelines, polyglot stacks, broad format coverage	ML-powered document understanding, scientific papers

How They Differ¶

Architecture¶

Different foundations lead to different trade-offs.

Kreuzberg -- Rust core with native bindings for each language. Your Python or TypeScript code calls directly into compiled Rust. No subprocess overhead, no model loading delays for basic extraction.
Docling -- Python library built around deep learning models (DocLayNet for layout, TableFormer for tables). It produces a rich DoclingDocument object with full structural understanding, but it needs to load ML models on startup.

If you need raw extraction speed without ML overhead, Kreuzberg is faster out of the box. If you need deep structural understanding of complex layouts, Docling's ML pipeline is purpose-built for that.

Format Coverage¶

What each tool can ingest.

Kreuzberg (91+ formats) -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data (JSON/YAML/TOML), plus LaTeX, Typst, BibTeX, Jupyter notebooks, EPUB, OrgMode, and more.
Docling (~38 extensions) -- PDFs, DOCX, PPTX, XLSX, HTML, Markdown, AsciiDoc, CSV, images, and JATS (scientific article XML). Focused on the formats that benefit most from layout analysis.

Docling covers the core document types well. Kreuzberg handles the long tail -- archives, email files, structured data, code, and niche markup formats.

Output Model¶

How extracted content is structured.

Kreuzberg -- Outputs unified text (default), element-based structures, or per-page JSON. You choose the level of detail you need. Markdown output is built in via HTML-to-Markdown conversion.
Docling -- Outputs a DoclingDocument object with rich structural metadata: reading order, table cells, figure captions, section hierarchy. Can export to Markdown, JSON, or other formats. The structural model is deeper but Python-specific.

OCR¶

Both handle image-based documents, with different engine choices.

Kreuzberg -- Tesseract + native PaddleOCR (ONNX-based, no Python dependency). Supports a multi-backend OCR pipeline that auto-falls back between engines based on output quality.
Docling -- Tesseract + EasyOCR. EasyOCR offers good accuracy on CJK and Arabic scripts but requires PyTorch.

Language Support¶

How you integrate each tool into your stack.

Kreuzberg -- Native bindings for 10 languages (Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, WASM). Same performance and features from every language.
Docling -- Python only. If your backend is in Go, Java, or TypeScript, you'd need to wrap Docling in an HTTP service.

When to Use Kreuzberg¶

You're building a pipeline in Go, TypeScript, Ruby, Java, or any language beyond Python
You need to process high volumes quickly without ML model loading overhead
Your pipeline ingests diverse formats beyond PDFs and Office docs
You want local embeddings and chunking built into the extraction step
You need to run in the browser or on edge runtimes via WASM

When to Use Docling¶

You need deep structural understanding of complex document layouts (reading order, nested tables, figure captions)
You're working with scientific papers or technical documents where layout analysis matters
Your stack is Python-only and you want a rich document object model
You need TableFormer-based table extraction for complex tables with merged cells and spanning rows
You value IBM's ongoing investment in document AI research

Benchmarks

For extraction speed and quality comparisons between Kreuzberg and Docling, see the live benchmark dashboard.