Kreuzberg Documentation
Kreuzberg is a document intelligence platform with a high-performance Rust core and native bindings for Python, TypeScript/Node.js, WebAssembly, Java, C#, PHP, Ruby, Go, and Elixir, alongside the Rust crate itself. Use it as an SDK, CLI, Docker image, REST API server, or MCP tool to extract text, tables, and metadata from 75+ file formats (PDF, Office, images, HTML, XML, archives, email, and more) with optional OCR and post-processing pipelines.
What You Can Do
- Single API across languages – Binding idioms follow each ecosystem, but features (extraction, OCR, chunking, embeddings, plugins) map 1:1.
- Structured extraction – Convert PDFs, Office docs, images, emails, HTML, XML, and archives into clean Markdown/JSON, preserving tables and metadata.
- Multi-engine OCR – Built-in Tesseract and PaddleOCR support in all bindings, with an EasyOCR extension for Python.
- Plugin ecosystem – Register post-processors, validators, and OCR backends, then run them from any binding or via the CLI/API server.
- Deployment flexibility – Ship as a library, run the CLI, or host the API server/MCP adapter inside containers.
- AI coding assistant support – Ships with an Agent Skill for Claude Code, Codex, Gemini CLI, Cursor, and other AI tools.
Documentation Map
- Getting Started – First extraction in each language.
- Installation – Dependency matrix for Rust, Python, Ruby, Node.js, CLI, and Docker users.
- Guides – How to configure extraction, OCR, advanced features, plugins, and Docker/API deployments.
- Concepts – Architecture, extraction pipeline, MIME detection, plugin runtime, and performance strategies.
- Features directory – Exhaustive capability list per format/binding plus OCR and chunking options.
- Reference – API references for all supported languages, configuration schema, supported formats, types, and errors.
- CLI – Command syntax, flags, exit codes, and automation tips.
- API Server – Running the REST service and integrating with MCP.
- AI Coding Assistants – Agent Skill for Claude Code, Codex, Gemini CLI, Cursor, and more.
- Migration and Changelog – Track breaking changes and release history.
Supported Platforms
| Binding / Interface | Package | Use Case | Docs |
|---|---|---|---|
| Python | pip install kreuzberg | Server-side, data processing | Python API Reference |
| TypeScript/Node.js (Native) | npm install @kreuzberg/node | Node.js servers, command-line tools, native performance | TypeScript API Reference |
| WebAssembly (WASM) | npm install @kreuzberg/wasm | Browsers, Cloudflare Workers, Deno, serverless | WASM API Reference |
| Java | dev.kreuzberg:kreuzberg (Maven) | Server-side Java, FFM API | Java API Reference |
| C# | dotnet add package Kreuzberg | .NET applications, Windows servers | C# API Reference |
| PHP | kreuzberg/kreuzberg (Composer) | PHP applications, ext-ffi | PHP API Reference |
| Ruby | gem install kreuzberg | Server-side, Rails applications | Ruby API Reference |
| Go | go get github.com/kreuzberg-dev/kreuzberg/packages/go/v4@latest | Server-side, systems tools | Go API Reference |
| Elixir | {:kreuzberg, "~> 4.0"} | BEAM applications, Phoenix apps | Elixir API Reference |
| Rust | cargo add kreuzberg | System libraries, performance-critical | Rust API Reference |
| CLI | brew install kreuzberg-dev/tap/kreuzberg or cargo install kreuzberg-cli | Terminal automation, scripting | CLI Usage |
| API Server / MCP | Docker image ghcr.io/kreuzberg-dev/kreuzberg:core | Containerized services, MCP integration | API Server Guide |
Choosing Between TypeScript Packages
Kreuzberg provides two distinct TypeScript packages optimized for different runtimes:
Native TypeScript/Node.js (@kreuzberg/node)
Use @kreuzberg/node if you're targeting:
- Node.js servers and applications
- Command-line tools and scripts
- Environments requiring maximum performance (near-native speeds)
- Server-side batch processing and data pipelines
Native bindings load as a compiled addon through Node-API (N-API) and deliver the best performance of the two packages.
WebAssembly (@kreuzberg/wasm)
Use @kreuzberg/wasm if you're targeting:
- Web browsers (Chrome, Firefox, Safari, Edge)
- Cloudflare Workers and other edge computing platforms
- Deno and other JavaScript runtimes
- Serverless environments (AWS Lambda, Vercel, etc.)
- In-browser document processing without server dependencies
WASM bindings run entirely in WebAssembly and work in any JavaScript runtime with WASM support. See Performance for tradeoffs.
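The two target lists above amount to a lookup from deployment target to package name. The sketch below encodes that lookup in TypeScript. It is only an illustration: `DeploymentTarget`, `PACKAGE_FOR_TARGET`, and `pickKreuzbergPackage` are hypothetical names and are not part of either Kreuzberg package.

```typescript
// Hypothetical helper for illustration; not part of any Kreuzberg package.
type KreuzbergPackage = "@kreuzberg/node" | "@kreuzberg/wasm";

// Target names are assumptions that mirror the bullet lists above.
type DeploymentTarget =
  | "node-server"
  | "node-cli"
  | "node-batch"
  | "browser"
  | "edge-worker"
  | "deno"
  | "serverless";

// Node.js targets map to the native package; everything else maps to WASM.
const PACKAGE_FOR_TARGET: Record<DeploymentTarget, KreuzbergPackage> = {
  "node-server": "@kreuzberg/node",
  "node-cli": "@kreuzberg/node",
  "node-batch": "@kreuzberg/node",
  "browser": "@kreuzberg/wasm",
  "edge-worker": "@kreuzberg/wasm",
  "deno": "@kreuzberg/wasm",
  "serverless": "@kreuzberg/wasm",
};

function pickKreuzbergPackage(target: DeploymentTarget): KreuzbergPackage {
  return PACKAGE_FOR_TARGET[target];
}

console.log(pickKreuzbergPackage("node-batch"));  // @kreuzberg/node
console.log(pickKreuzbergPackage("edge-worker")); // @kreuzberg/wasm
```

Using a `Record` keyed by the target union means the compiler flags any target that is added later without a package assigned to it.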
Performance Comparison
| Binding | Speed Relative to Native | Memory | Platform Support | Use Case |
|---|---|---|---|---|
| Native (@kreuzberg/node) | 100% (baseline) | Efficient | Node.js only | Server-side, high-performance |
| WASM (@kreuzberg/wasm) | 60-80% | Higher | Browsers, Workers, Deno, Bun | In-browser, edge, serverless |
WASM provides broad platform compatibility at the cost of performance. For server-side Node.js applications, always use native @kreuzberg/node.
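To see what the 60-80% figure means for a real job, the rough sketch below converts a measured native duration into an estimated WASM duration range. It assumes "speed relative to native" means throughput, so a job at 80% speed takes 100/80 times as long. `estimateWasmDuration` is a hypothetical helper, and the 1200 ms input is an arbitrary example value, not a benchmark result.

```typescript
// Hypothetical helper; not part of any Kreuzberg package.
// Assumption: the 60-80% range from the table is throughput relative to native.
interface DurationRange {
  bestCaseMs: number;  // WASM running at 80% of native speed
  worstCaseMs: number; // WASM running at 60% of native speed
}

function estimateWasmDuration(nativeMs: number): DurationRange {
  if (!Number.isFinite(nativeMs) || nativeMs < 0) {
    throw new RangeError("nativeMs must be a non-negative finite number");
  }
  // At 80% of native speed the same work takes 100/80 = 1.25x as long.
  const bestCaseMs = (nativeMs * 100) / 80;
  // At 60% of native speed the same work takes 100/60, roughly 1.67x as long.
  const worstCaseMs = (nativeMs * 100) / 60;
  return { bestCaseMs, worstCaseMs };
}

// Example input: a job that takes 1200 ms with the native binding.
const estimate = estimateWasmDuration(1200);
console.log(estimate); // { bestCaseMs: 1500, worstCaseMs: 2000 }
```

Actual numbers depend on document type, OCR settings, and the host runtime, so treat the output as a planning estimate rather than a measurement.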
Getting Help
- Questions / bugs: open an issue at github.com/kreuzberg-dev/kreuzberg.
- Chat: join the community Discord (invite in README).
- Contributing: see Contributing for coding standards, environment setup, and testing instructions.
Happy extracting!