Comparisons
Kreuzberg sits in a crowded space of document extraction tools. Some are general-purpose libraries that handle dozens of formats; others focus solely on PDFs. This page maps the landscape so you can find the right tool for your project.
For performance and quality numbers across all of these tools, see the live benchmarks.
Full-Scope Extraction Libraries
These handle multiple document formats, not just PDFs.
| Library | Language | Formats | License | Focus | Deep Dive |
|---|---|---|---|---|---|
| Kreuzberg | Rust | 91+ | Apache-2.0 | High-throughput extraction with native bindings for 10 languages | -- |
| Unstructured | Python | ~31 | Apache-2.0 | Element-based output, managed cloud API | Read more |
| Docling | Python | ~38 | MIT | IBM-backed, ML-powered layout analysis | Read more |
| Apache Tika | Java | 1500+ detected | Apache-2.0 | Enterprise standard, broadest format detection | Read more |
| MarkItDown | Python | ~25 | MIT | Microsoft-backed, outputs Markdown for LLM prep | Read more |
| MinerU | Python | PDF + images | AGPL-3.0 | Heavy ML models for scientific document layout | Read more |
| Pandoc | Haskell | 45+ input | GPL-2.0 | Universal document converter (cannot read PDFs) | -- |
PDF-Specific Libraries
These focus on PDF extraction only. They're not direct competitors to Kreuzberg's full format coverage, but you'll often see them in PDF-heavy pipelines.
| Library | Language | License | Focus |
|---|---|---|---|
| PyMuPDF / PyMuPDF4LLM | Python (C core) | AGPL-3.0 | Fast PDF extraction via MuPDF. AGPL copyleft terms apply unless you buy a commercial license. |
| pdfplumber | Python | MIT | Good table extraction, built on pdfminer.six |
| pdfminer.six | Python | MIT | Fine-grained text positioning, pure Python |
| pypdf | Python | BSD-3 | Lightweight, pure Python, no C dependencies |
| playa-pdf | Python | MIT | Modern pure-Python PDF library |
| pdftotext | C (Python binding) | GPL-2.0 | Thin wrapper around poppler's pdftotext |
License matters
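Libraries marked AGPL-3.0 (PyMuPDF, MinerU) are copyleft: if you distribute an application built on them, or let users interact with it over a network, you generally have to release that application's source under the AGPL, unless the vendor offers a commercial license and you purchase one. GPL-2.0 tools (Pandoc, pdftotext/poppler) carry similar copyleft obligations when you distribute software that incorporates them. If you're building a commercial product, check the license before integrating.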
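As a rough illustration of that check, the sketch below flags copyleft-licensed libraries in a shortlist. The license identifiers are copied from the tables on this page; the `LICENSES` mapping, the `COPYLEFT_PREFIXES` policy, and the `needs_license_review` helper are hypothetical names invented for this example and are not part of Kreuzberg or any other library. It is a starting point for a review, not a substitute for reading the actual license terms.

```python
# License identifiers copied from the comparison tables on this page.
LICENSES = {
    "Kreuzberg": "Apache-2.0",
    "Unstructured": "Apache-2.0",
    "Docling": "MIT",
    "Apache Tika": "Apache-2.0",
    "MarkItDown": "MIT",
    "MinerU": "AGPL-3.0",
    "Pandoc": "GPL-2.0",
    "PyMuPDF": "AGPL-3.0",
    "pdfplumber": "MIT",
    "pdfminer.six": "MIT",
    "pypdf": "BSD-3",
    "playa-pdf": "MIT",
    "pdftotext": "GPL-2.0",
}

# Assumed policy for this sketch: any identifier starting with one of
# these prefixes is treated as copyleft and needs a closer look.
COPYLEFT_PREFIXES = ("AGPL", "GPL")


def needs_license_review(candidates):
    """Return (name, license) pairs for copyleft candidates, sorted by name."""
    flagged = []
    for name in candidates:
        license_id = LICENSES[name]
        if license_id.startswith(COPYLEFT_PREFIXES):
            flagged.append((name, license_id))
    return sorted(flagged)


if __name__ == "__main__":
    shortlist = ["pypdf", "PyMuPDF", "Pandoc", "Docling"]
    for name, license_id in needs_license_review(shortlist):
        print(f"{name}: {license_id} (review before shipping)")
```

Running a check like this against your candidate list early keeps a license conflict from surfacing after the library is already wired into the pipeline.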
Benchmarks
Kreuzberg benchmarks against all of the libraries listed above. For extraction speed, quality scores, and format-by-format comparisons, see the live benchmark dashboard.