Comparisons

Kreuzberg sits in a crowded space of document extraction tools. Some are general-purpose libraries that handle dozens of formats, others are laser-focused on PDFs. This page maps the landscape so you can find the right tool for your project.

For performance and quality numbers across all of these tools, see the live benchmarks.


Full-Scope Extraction Libraries

These libraries handle multiple document formats, not just PDFs.

| Library | Language | Formats | License | Focus | Deep Dive |
| --- | --- | --- | --- | --- | --- |
| Kreuzberg | Rust | 91+ | Apache-2.0 | High-throughput extraction with native bindings for 10 languages | -- |
| Unstructured | Python | ~31 | Apache-2.0 | Element-based output, managed cloud API | Read more |
| Docling | Python | ~38 | MIT | IBM-backed, ML-powered layout analysis | Read more |
| Apache Tika | Java | 1500+ detected | Apache-2.0 | Enterprise standard, broadest format detection | Read more |
| MarkItDown | Python | ~25 | MIT | Microsoft-backed, outputs Markdown for LLM prep | Read more |
| MinerU | Python | PDF + images | AGPL-3.0 | Heavy ML models for scientific document layout | Read more |
| Pandoc | Haskell | 45+ input | GPL-2.0 | Universal document converter (cannot read PDFs) | -- |

PDF-Specific Libraries

These focus on PDF extraction only. They're not direct competitors to Kreuzberg's full format coverage, but you'll often see them in PDF-heavy pipelines.

| Library | Language | License | Focus |
| --- | --- | --- | --- |
| PyMuPDF / PyMuPDF4LLM | Python (C core) | AGPL-3.0 | Fast PDF extraction via MuPDF. The AGPL restricts closed-source commercial use. |
| pdfplumber | Python | MIT | Good table extraction, built on pdfminer.six |
| pdfminer.six | Python | MIT | Fine-grained text positioning, pure Python |
| pypdf | Python | BSD-3 | Lightweight, pure Python, no C dependencies |
| playa-pdf | Python | MIT | Modern pure-Python PDF library |
| pdftotext | C (Python binding) | GPL-2.0 | Thin wrapper around poppler's pdftotext |

License matters

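Libraries marked AGPL-3.0 (PyMuPDF, MinerU) generally require that any application built on them also be released under the AGPL, including applications that are only offered to users over a network, unless you purchase a commercial license where one is offered. GPL-2.0 tools (Pandoc, pdftotext/poppler) carry similar copyleft requirements when you distribute software that incorporates them. If you're building a commercial product, check the license of every extraction dependency before integrating it.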
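
As a rough first pass on that check, the sketch below reads the license metadata that installed Python packages declare and flags anything that mentions a GPL-family license. It uses only the standard-library `importlib.metadata` module. The helper names `declared_license` and `needs_review` and the marker list are illustrative assumptions, not part of Kreuzberg or any library listed here, and a match only means "read the license", not a legal conclusion.

```python
from importlib import metadata

# Substrings that suggest a GPL-family license (GPL, LGPL, AGPL).
# Illustrative assumption: a match is a prompt to read the license, nothing more.
COPYLEFT_MARKERS = ("GPL", "GENERAL PUBLIC LICENSE")


def declared_license(dist_name: str) -> str:
    """Collect the license hints an installed distribution declares.

    Returns an empty string if the distribution is not installed
    or declares no license information.
    """
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return ""
    hints = [meta.get("License-Expression") or "", meta.get("License") or ""]
    # Trove classifiers such as "License :: OSI Approved :: MIT License".
    hints += [c for c in (meta.get_all("Classifier") or []) if c.startswith("License ::")]
    return " | ".join(h for h in hints if h)


def needs_review(license_text: str) -> bool:
    """Return True if the text mentions a GPL-family license."""
    upper = license_text.upper()
    return any(marker in upper for marker in COPYLEFT_MARKERS)


if __name__ == "__main__":
    # Distribution names as published on PyPI; swap in your own dependencies.
    for name in ("PyMuPDF", "pypdf"):
        text = declared_license(name)
        if not text:
            print(f"{name}: not installed or no license metadata")
        elif needs_review(text):
            print(f"{name}: GPL-family license declared, review before shipping")
        else:
            print(f"{name}: no GPL-family marker found")
```

Package metadata can be missing, outdated, or incomplete, and it says nothing about non-Python dependencies such as poppler, so treat the output as a prompt to read the actual license files.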

Benchmarks

Kreuzberg is benchmarked against all of the libraries listed above. For extraction speed, quality scores, and format-by-format comparisons, see the live benchmark dashboard.