Kreuzberg vs MarkItDown

MarkItDown is a Microsoft-backed Python library that converts documents to Markdown -- purpose-built for feeding content into LLMs. Kreuzberg is a Rust-based extraction library with broader format support, native language bindings, and built-in RAG pipeline features. Both are permissively licensed (MIT and Apache-2.0, respectively) and suit document processing in LLM and RAG workflows.

At a Glance

	Kreuzberg	MarkItDown
Written in	Rust	Python
File formats	91+	~25
Use from	Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, WASM	Python
Output	Unified text, element-based, per-page JSON, Markdown	Markdown
License	Apache-2.0	MIT
Sweet spot	Full extraction pipelines with chunking and embeddings	Quick Markdown conversion for LLM context

How They Differ

Philosophy

Different tools for different stages of a pipeline.

Kreuzberg -- A full extraction library. Extracts text, tables, metadata, and images. Offers multiple output formats, built-in chunking, local embeddings, and OCR. Designed to be the complete document-to-vectors pipeline.
MarkItDown -- A converter. Takes documents in, outputs Markdown. Intentionally lightweight and focused on one job: turning files into clean Markdown that LLMs can consume. Downstream processing is left to you.

If you need a complete pipeline (extract, chunk, embed), Kreuzberg handles the full chain. If you just need Markdown for a prompt, MarkItDown does that with minimal setup.

Format Coverage

Both cover common formats, with different long-tail reach.

Kreuzberg (91+ formats) -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data (JSON/YAML/TOML), plus LaTeX, Typst, BibTeX, Jupyter notebooks, EPUB, OrgMode, and more.
MarkItDown (~25 formats) -- PDFs, DOCX, PPTX, XLSX, HTML, XML, CSV, JSON, EPUB, Jupyter notebooks, MSG email, images, and ZIP archives. Covers the essentials.

MarkItDown handles the formats you'll encounter most often. Kreuzberg also handles the ones you won't expect -- until you do.

OCR

Different approaches to image-based text extraction.

Kreuzberg -- Tesseract + native PaddleOCR (ONNX-based, runs locally, no Python needed). Multi-backend pipeline with automatic quality-based fallback. All processing happens on your machine.
MarkItDown -- Can use Azure Document Intelligence for image and PDF extraction. Powerful when enabled, but requires an Azure account and sends documents to Microsoft's cloud. Without it, image OCR is limited.
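
To make "quality-based fallback" concrete, here is a minimal Python sketch of the general idea: try OCR backends in order, score each result, and accept the first one that clears a threshold. This is not Kreuzberg's actual logic. The names `text_quality` and `ocr_with_fallback`, the scoring heuristic, and the 0.8 threshold are all assumptions made for illustration, and the backends are plain callables standing in for real OCR engines.

```python
from typing import Callable, Sequence

# Hypothetical type: an OCR backend takes image bytes and returns text.
OcrBackend = Callable[[bytes], str]


def text_quality(text: str) -> float:
    """Crude illustrative score: share of alphanumeric or whitespace characters."""
    if not text:
        return 0.0
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return good / len(text)


def ocr_with_fallback(
    image: bytes,
    backends: Sequence[tuple[str, OcrBackend]],
    min_quality: float = 0.8,  # assumed threshold, not a documented default
) -> tuple[str, str]:
    """Return (backend_name, text) from the first backend that scores well enough.

    If no backend reaches min_quality, return the best-scoring attempt.
    If every backend raises, return ("", "").
    """
    best_name, best_text, best_score = "", "", -1.0
    for name, backend in backends:
        try:
            text = backend(image)
        except Exception:
            continue  # a failing backend simply hands over to the next one
        score = text_quality(text)
        if score >= min_quality:
            return name, text
        if score > best_score:
            best_name, best_text, best_score = name, text, score
    return best_name, best_text
```

A real pipeline would likely score on engine confidence values rather than character classes; the ordering-plus-threshold structure is the part this sketch is meant to show.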

Language Support

A significant difference in how you integrate each tool.

Kreuzberg -- Native support for 10 targets: nine languages plus WASM. Same performance and API from Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, or WASM in the browser.
MarkItDown -- Python only. If your backend is in Go or TypeScript, you'd need to wrap MarkItDown in an HTTP service or call it as a subprocess.
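
For a sense of what the HTTP-service route involves, the following is a rough Python sketch of such a wrapper using only the standard library. The `MarkItDown().convert(path).text_content` call follows the usage shown in MarkItDown's README; everything else (the function names, the `X-File-Suffix` header, the port) is hypothetical and chosen for this example. It is a starting point under those assumptions, not a production service.

```python
import re
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from typing import Callable


def markitdown_convert(path: str) -> str:
    """Convert one file with MarkItDown (imported lazily so this module loads without it)."""
    from markitdown import MarkItDown
    return MarkItDown().convert(path).text_content


def convert_bytes(data: bytes, suffix: str, convert: Callable[[str], str]) -> str:
    """Write uploaded bytes to a temp file and run the converter on it."""
    # Only accept simple suffixes such as ".pdf"; anything else is dropped.
    safe_suffix = suffix if re.fullmatch(r"\.[A-Za-z0-9]+", suffix) else ""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"upload{safe_suffix}"
        path.write_bytes(data)
        return convert(str(path))


def make_handler(convert: Callable[[str], str]) -> type:
    """Build a request handler that converts POST bodies to Markdown."""
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self) -> None:
            length = int(self.headers.get("Content-Length", 0))
            # Hypothetical header carrying the file extension, e.g. ".docx".
            suffix = self.headers.get("X-File-Suffix", "")
            try:
                body = convert_bytes(self.rfile.read(length), suffix, convert).encode("utf-8")
                status = 200
            except Exception as exc:
                body, status = str(exc).encode("utf-8"), 500
            self.send_response(status)
            self.send_header("Content-Type", "text/markdown; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(port: int = 8000) -> None:
    """Run the wrapper on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(markitdown_convert)).serve_forever()
```

A Go or TypeScript backend would then POST the raw file bytes to the service and read Markdown from the response body. The converter is passed in as a callable so the request handling can be exercised without MarkItDown installed.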

Downstream Processing

What happens after extraction.

Kreuzberg -- Built-in chunking (recursive, semantic, markdown-aware), local embeddings (ONNX models, no API keys), token reduction, keyword extraction, and quality processing. Extraction output is ready for RAG pipelines.
MarkItDown -- Outputs Markdown and stops. Chunking, embeddings, and vector storage are your responsibility. This is by design -- it's a converter, not a pipeline.
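
To illustrate the kind of downstream step that falls to you with a converter-only tool, here is a small Python sketch of heading-aware chunking over Markdown text. It is a simplified stand-in, not how Kreuzberg chunks and not part of MarkItDown; the function name `chunk_markdown` and the `max_chars` default are assumptions for this example.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into chunks of at most max_chars, preferring heading boundaries."""
    # Split just before each ATX heading ("# ", "## ", ...) so that every
    # section keeps its own heading as the first line.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    current = ""
    for section in sections:
        if not section.strip():
            continue
        # A section longer than max_chars is cut into fixed-size pieces.
        pieces = [section[i:i + max_chars] for i in range(0, len(section), max_chars)]
        for piece in pieces:
            # Start a new chunk when adding this piece would overflow the limit.
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

The sketch ignores details a real chunker has to handle, such as `#` lines inside fenced code blocks, overlap between chunks, and token-based rather than character-based limits.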

When to Use Kreuzberg

You need a complete pipeline from document to embeddings
Your stack includes Go, TypeScript, Ruby, Java, or other languages beyond Python
You want local OCR without cloud API dependencies
You need to handle niche formats like LaTeX, Typst, email files, or archives
You need multiple output formats (text, elements, per-page JSON), not just Markdown

When to Use MarkItDown

You just need clean Markdown to feed into an LLM prompt
You're in a Python-only environment and want the simplest possible setup
You're already using Azure Document Intelligence and want to leverage it for OCR
Your use case is document-to-prompt conversion without further processing
You value minimal dependencies and a small footprint

Benchmarks

For extraction speed and quality comparisons between Kreuzberg and MarkItDown, see the live benchmark dashboard.