Kreuzberg vs MinerU¶

MinerU is an open-source tool from OpenDataLab designed for high-quality PDF extraction, especially for scientific and academic documents. Kreuzberg is a Rust-based general-purpose extraction library covering 91+ formats. They overlap on PDF extraction but differ significantly in scope, licensing, and architecture.

License

MinerU is licensed under AGPL-3.0. This means any application that uses MinerU must also be released under AGPL, or you need a commercial license. Kreuzberg is Apache-2.0 -- no copyleft restrictions.

At a Glance¶

	Kreuzberg	MinerU
Written in	Rust	Python
File formats	91+	PDF + PNG/JPG only
Use from	Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, WASM	Python CLI / library
License	Apache-2.0	AGPL-3.0
GPU	Optional (ONNX Runtime -- CUDA, CoreML, TensorRT)	Recommended for best results
Sweet spot	Broad format extraction, high throughput, polyglot stacks	Scientific PDFs, academic papers, complex layout analysis

How They Differ¶

Scope¶

The biggest difference is what each tool is designed to handle.

Kreuzberg (91+ formats) -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data, LaTeX, Typst, Jupyter notebooks, EPUB, and more. A general-purpose extraction library.
MinerU (PDF + images) -- Handles PDFs and PNG/JPG images. Nothing else. It's a specialist tool, not a general-purpose library.

If your pipeline processes only PDFs, both work. If you also need to handle Word docs, email files, or spreadsheets, Kreuzberg is the only option.

ML Approach¶

Different levels of ML investment.

Kreuzberg -- Offers optional ONNX-based layout detection (YOLO for speed, RT-DETR v2 for accuracy) covering 17 element classes. Also works without any ML models for pure extraction -- useful when speed matters more than layout understanding.
MinerU -- ML-first. Uses heavy deep learning models for layout detection, formula recognition, and table structure extraction. This produces excellent results on complex scientific documents but comes with significant model loading time, memory usage, and a GPU recommendation.

If you're extracting from scientific papers with complex multi-column layouts, equations, and nested tables, MinerU's deep ML pipeline is purpose-built for that. For general document extraction where you don't need heavy layout analysis, Kreuzberg is faster and lighter.

Architecture and Performance¶

Different runtime characteristics.

Kreuzberg -- Rust core, runs on CPU by default. GPU acceleration available via ONNX Runtime when layout detection is enabled. Fast startup, low memory footprint for basic extraction.
MinerU -- Python, with heavy PyTorch model loading on startup. GPU recommended for acceptable performance. First-run model downloads can be several gigabytes.

Language Support¶

How you integrate each tool.

Kreuzberg -- Native bindings for 10 languages. Same performance from Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, or WASM in the browser.
MinerU -- Python CLI and library only. No language bindings, no API server out of the box.

Embeddings and Chunking¶

Downstream pipeline support.

Kreuzberg -- Built-in chunking (recursive, semantic, markdown-aware), local embeddings via ONNX models, token reduction, and keyword extraction. Ready for RAG pipelines out of the box.
MinerU -- Outputs extracted content (Markdown, JSON). Chunking and embeddings are left to downstream tools.

When to Use Kreuzberg¶

You need to process more than just PDFs -- Office docs, email, archives, code, structured data
You're building a commercial product and need a permissive license (Apache-2.0)
You want native bindings in Go, TypeScript, Ruby, Java, or other languages
You need fast extraction without heavy ML model loading
You want built-in chunking and embeddings for RAG pipelines

When to Use MinerU¶

You're working exclusively with scientific papers and academic PDFs
You need deep layout analysis -- formulas, multi-column layouts, nested tables
The AGPL-3.0 license is compatible with your project (open-source, research, or you'll purchase a commercial license)
You have GPU resources available and can accept the startup cost of loading large models
You need the highest possible extraction quality on complex PDF layouts and throughput is secondary

Benchmarks

For extraction speed and quality comparisons between Kreuzberg and MinerU, see the live benchmark dashboard.