Kreuzberg vs MarkItDown

MarkItDown is a Microsoft-backed Python library that converts documents to Markdown -- purpose-built for feeding content into LLMs. Kreuzberg is a Rust-based extraction library with broader format support, native language bindings, and built-in RAG pipeline features. Both are permissively licensed and work well for AI-adjacent document processing.

At a Glance

Feature | Kreuzberg | MarkItDown
Written in | Rust | Python
File formats | 91+ | ~25
Use from | Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, WASM | Python
Output | Unified text, element-based, per-page JSON, Markdown | Markdown
License | Apache-2.0 | MIT
Sweet spot | Full extraction pipelines with chunking and embeddings | Quick Markdown conversion for LLM context

How They Differ

Philosophy

Different tools for different stages of a pipeline.

  • Kreuzberg -- A full extraction library. Extracts text, tables, metadata, and images. Offers multiple output formats, built-in chunking, local embeddings, and OCR. Designed to be the complete document-to-vectors pipeline.
  • MarkItDown -- A converter. Takes documents in, outputs Markdown. Intentionally lightweight and focused on one job: turning files into clean Markdown that LLMs can consume. Downstream processing is left to you.

If you need a complete pipeline (extract, chunk, embed), Kreuzberg handles the full chain. If you just need Markdown for a prompt, MarkItDown does that with minimal setup.
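To make the "Markdown for a prompt" case concrete, here is a minimal sketch in Python. It assumes the `markitdown` package is installed and uses its documented `MarkItDown().convert(path).text_content` call; `build_prompt`, its prompt wording, and the `max_chars` limit are hypothetical helpers invented for this illustration, not part of either library.

```python
def convert_to_markdown(path: str) -> str:
    """Convert a document to Markdown using MarkItDown."""
    # Imported lazily so the prompt helper below works without the package.
    # Assumes `pip install markitdown` has been run.
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(path)
    return result.text_content


def build_prompt(markdown: str, question: str, max_chars: int = 8000) -> str:
    """Embed the Markdown in a prompt (hypothetical helper, not a library API)."""
    # Naive character-based truncation; a real pipeline would count tokens.
    context = markdown[:max_chars]
    return (
        "Use the document below to answer.\n\n"
        f"<document>\n{context}\n</document>\n\n"
        f"Question: {question}"
    )
```

A typical call would be `build_prompt(convert_to_markdown("report.pdf"), "Summarize the key points.")`, where `report.pdf` is a placeholder path. Anything beyond this (chunking, embedding, storage) is outside what MarkItDown provides.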

Format Coverage

Both cover common formats, with different long-tail reach.

  • Kreuzberg (91+ formats) -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data (JSON/YAML/TOML), plus LaTeX, Typst, BibTeX, Jupyter notebooks, EPUB, OrgMode, and more.
  • MarkItDown (~25 formats) -- PDFs, DOCX, PPTX, XLSX, HTML, XML, CSV, JSON, EPUB, Jupyter notebooks, MSG email, images, and ZIP archives. Covers the essentials.

MarkItDown handles the formats you'll encounter most often. Kreuzberg also handles the ones you won't expect -- until you do.

OCR

Different approaches to image-based text extraction.

  • Kreuzberg -- Tesseract + native PaddleOCR (ONNX-based, runs locally, no Python needed). Multi-backend pipeline with automatic quality-based fallback. All processing happens on your machine.
  • MarkItDown -- Can use Azure Document Intelligence for image and PDF extraction. Powerful when enabled, but requires an Azure account and sends documents to Microsoft's cloud. Without it, image OCR is limited.
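
The idea of a quality-based fallback across OCR backends can be sketched generically in Python. This is not Kreuzberg's implementation: the `text_quality` heuristic, the `threshold` value, and the backend callables are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# A backend is any callable that takes image bytes and returns text (assumed shape).
OcrBackend = Callable[[bytes], str]


def text_quality(text: str) -> float:
    """Crude score: fraction of characters that are alphanumeric or whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return good / len(text)


def ocr_with_fallback(
    image: bytes,
    backends: Sequence[OcrBackend],
    threshold: float = 0.8,
) -> str:
    """Try backends in order; return the first result that meets the threshold,
    otherwise the best-scoring result seen."""
    best_text, best_score = "", -1.0
    for backend in backends:
        try:
            text = backend(image)
        except Exception:
            continue  # a failing backend simply falls through to the next one
        score = text_quality(text)
        if score >= threshold:
            return text
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```

In practice the callables would wrap real engines (for example Tesseract first, then a second engine), and the scoring function would be something more robust than a character-class ratio.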

Language Support

A significant difference in how you integrate each tool.

  • Kreuzberg -- Native bindings for 10 languages. Same performance and API from Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, or WASM in the browser.
  • MarkItDown -- Python only. If your backend is in Go or TypeScript, you'd need to wrap MarkItDown in an HTTP service or call it as a subprocess.
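
The HTTP-service workaround mentioned above could look roughly like the following, using only Python's standard library plus MarkItDown. This is a minimal sketch, not a production server: the `X-Filename` header, the port, and the function names are hypothetical choices for this example, and the only MarkItDown call assumed is `MarkItDown().convert(path).text_content`.

```python
import os
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def markitdown_convert(path: str) -> str:
    """Default converter; assumes the `markitdown` package is installed."""
    from markitdown import MarkItDown

    return MarkItDown().convert(path).text_content


def make_handler(convert: Callable[[str], str]):
    """Build a request handler that converts POSTed file bytes to Markdown."""

    class ConvertHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            # Hypothetical header: the client sends the original file name so
            # the converter can infer the type from the extension.
            name = os.path.basename(self.headers.get("X-Filename", "upload"))
            suffix = os.path.splitext(name)[1]
            fd, tmp_path = tempfile.mkstemp(suffix=suffix)
            try:
                with os.fdopen(fd, "wb") as fh:
                    fh.write(body)
                status, payload = 200, convert(tmp_path).encode("utf-8")
            except Exception as exc:
                status, payload = 500, str(exc).encode("utf-8")
            finally:
                os.unlink(tmp_path)
            self.send_response(status)
            self.send_header("Content-Type", "text/markdown; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            pass  # keep the example quiet

    return ConvertHandler


def serve(port: int = 8000, convert: Callable[[str], str] = markitdown_convert):
    """Run the service on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(convert)).serve_forever()
```

A Go or TypeScript backend would then POST raw file bytes to this service and read Markdown from the response body. The converter is injected as a callable so the handler can be exercised without MarkItDown installed.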

Downstream Processing

What happens after extraction.

  • Kreuzberg -- Built-in chunking (recursive, semantic, markdown-aware), local embeddings (ONNX models, no API keys), token reduction, keyword extraction, and quality processing. Extraction output is ready for RAG pipelines.
  • MarkItDown -- Outputs Markdown and stops. Chunking, embeddings, and vector storage are your responsibility. This is by design -- it's a converter, not a pipeline.
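
If you take the converter route, the chunking step is yours to write. The following is a minimal sketch of a Markdown-aware chunker in plain Python, assuming ATX-style headings (`#` through `######`) and a character budget rather than a token budget. It is not Kreuzberg's chunking algorithm; `chunk_markdown` and `max_chars` are names made up for this example.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown at headings, then pack sections into chunks of at most
    max_chars characters."""
    # Split just before each heading line so a heading stays with its body.
    parts = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    sections = [part.strip() for part in parts if part.strip()]

    chunks: list[str] = []
    current = ""
    for section in sections:
        # A section longer than the budget is hard-split into fixed-size pieces.
        pieces = [section[i:i + max_chars] for i in range(0, len(section), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # piece still fits in the open chunk
            else:
                chunks.append(current)  # close the open chunk, start a new one
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

The output of a converter such as MarkItDown can be passed straight in, and each returned chunk can then go to whatever embedding model and vector store you choose.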

When to Use Kreuzberg

  • You need a complete pipeline from document to embeddings
  • Your stack includes Go, TypeScript, Ruby, Java, or other languages beyond Python
  • You want local OCR without cloud API dependencies
  • You need to handle niche formats like LaTeX, Typst, email files, or archives
  • You need multiple output formats (text, elements, per-page JSON), not just Markdown

When to Use MarkItDown

  • You just need clean Markdown to feed into an LLM prompt
  • You're in a Python-only environment and want the simplest possible setup
  • You're already using Azure Document Intelligence and want to leverage it for OCR
  • Your use case is document-to-prompt conversion without further processing
  • You value minimal dependencies and a small footprint

Benchmarks

For extraction speed and quality comparisons between Kreuzberg and MarkItDown, see the live benchmark dashboard.