CLI Usage¶

The Kreuzberg CLI provides command-line access to all extraction features. This guide covers installation, basic usage, and advanced features.

Installation¶

Install Script (Linux/macOS):

Bash

curl -fsSL https://raw.githubusercontent.com/kreuzberg-dev/kreuzberg/main/scripts/install.sh | bash

Homebrew (macOS/Linux):

Bash

brew install kreuzberg-dev/tap/kreuzberg

Cargo (Cross-platform):

Bash

cargo install kreuzberg-cli

Docker:

Bash

docker pull ghcr.io/kreuzberg-dev/kreuzberg-cli:latest
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest extract /data/document.pdf

Go (SDK):

Bash

go get github.com/kreuzberg-dev/kreuzberg/packages/go/v4@latest

Feature Availability

**Homebrew Installation:**

- ✅ Text extraction (PDF, Office, images, 75+ formats)
- ✅ OCR with Tesseract
- ✅ HTTP API server (`serve` command)
- ✅ MCP protocol server (`mcp` command)
- ✅ Chunking, quality scoring, language detection
- ❌ **Embeddings** - Not available via CLI flags. Use config file or Docker image.

**Docker Images:**

- All features enabled including embeddings (ONNX Runtime included)

Basic Usage¶

Extract from Single File¶

Terminal

# Extract text content to stdout
kreuzberg extract document.pdf

# Specify MIME type (auto-detected if not provided)
kreuzberg extract document.pdf --mime-type application/pdf

Batch Extract Multiple Files¶

Use the batch command to extract from multiple files:

Terminal

# Extract from multiple files
kreuzberg batch doc1.pdf doc2.docx doc3.txt

# Batch extract all PDFs in directory
kreuzberg batch documents/*.pdf

# Batch extract recursively (in bash, ** only recurses after: shopt -s globstar)
kreuzberg batch documents/**/*.pdf

Output Formats¶

Terminal

# Output as plain text (default for extract)
kreuzberg extract document.pdf --format text

# Output as JSON (default for batch)
kreuzberg batch documents/*.pdf --format json

# Extract single file as JSON
kreuzberg extract document.pdf --format json

Content Output Format¶

Control the formatting of extracted text content:

Terminal

# Extract as plain text (default)
kreuzberg extract document.pdf --output-format plain

# Extract as Markdown
kreuzberg extract document.pdf --output-format markdown

# Extract as Djot markup
kreuzberg extract document.pdf --output-format djot

# Extract as HTML
kreuzberg extract document.pdf --output-format html

The --output-format flag controls how the extracted text is formatted. This is different from --format which controls the output structure (text vs JSON).
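
As a minimal sketch of how the content format can be used on its own, the helper below writes Markdown-formatted content next to each source file. The function names `md_path_for` and `extract_markdown` are hypothetical and not part of Kreuzberg; only the `--output-format` flag comes from this guide.

```shell
# Derive the output path: reports/q3.pdf -> reports/q3.md (hypothetical helper).
md_path_for() {
  printf '%s\n' "${1%.*}.md"
}

# Extract a document as Markdown and save it beside the source file.
# --format is left at its default (text), so only the content is written.
extract_markdown() {
  kreuzberg extract "$1" --output-format markdown > "$(md_path_for "$1")"
}

# Usage: extract_markdown reports/q3.pdf   # writes reports/q3.md
```

Adding `--format json` to the same command would wrap the Markdown content in the JSON structure instead of printing it directly.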

OCR Extraction¶

Enable OCR¶

Terminal

# Enable OCR (overrides config file setting)
kreuzberg extract scanned.pdf --ocr true

# Disable OCR
kreuzberg extract document.pdf --ocr false

Force OCR¶

Force OCR even for PDFs that already have a text layer:

Terminal

# Force OCR to run regardless of existing text
kreuzberg extract document.pdf --force-ocr true

OCR Language Selection¶

Set the OCR language using the --ocr-language flag. This flag is backend-agnostic and works with all supported OCR backends (Tesseract, PaddleOCR, EasyOCR).

Language Code Formats:

- Tesseract: Uses ISO 639-3 codes (three-letter codes). Examples: eng (English), fra (French), deu (German), spa (Spanish), jpn (Japanese)
- PaddleOCR: Accepts flexible language codes and full language names. Examples: en, ch, french, korean, thai, greek, cyrillic, etc.
- EasyOCR: Similar flexible format to PaddleOCR

When used with --ocr true, the language flag overrides the default language. When used without --ocr, it overrides the language specified in your config file.

Terminal

# French OCR with Tesseract (default backend)
kreuzberg extract --ocr true --ocr-language fra document.pdf

# Chinese OCR with PaddleOCR
kreuzberg extract --ocr true --ocr-backend paddle-ocr --ocr-language ch document.pdf

# Thai OCR with PaddleOCR
kreuzberg extract --ocr true --ocr-backend paddle-ocr --ocr-language thai document.pdf

# German OCR with Tesseract
kreuzberg extract --ocr true --ocr-language deu document.pdf

# Override config file language with Spanish
kreuzberg extract document.pdf --config kreuzberg.toml --ocr-language spa
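
Because Tesseract expects three-letter codes while other tools often carry two-letter ones, a small mapping helper can be convenient in scripts. The sketch below covers only the five languages listed above. `tesseract_lang` is a hypothetical helper name, not a Kreuzberg feature.

```shell
# Map a two-letter ISO 639-1 code to the three-letter code Tesseract expects.
# Only the languages listed above are covered; other values pass through unchanged.
tesseract_lang() {
  case "$1" in
    en) echo eng ;;
    fr) echo fra ;;
    de) echo deu ;;
    es) echo spa ;;
    ja) echo jpn ;;
    *)  echo "$1" ;;
  esac
}

# Usage: kreuzberg extract scan.pdf --ocr true --ocr-language "$(tesseract_lang de)"
```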

OCR Configuration¶

OCR options are configured via config file. CLI flags override config settings:

Terminal

# Extract with OCR enabled via config file
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true

Configure the OCR backend, language, and Tesseract options in your config file (see the Configuration Files section).

Configuration Files¶

Using Config Files¶

Kreuzberg automatically discovers a configuration file by searching the current directory and parent directories for kreuzberg.toml only. If you use YAML or JSON, specify the file explicitly with --config.

Terminal

# Extract using discovered configuration (finds kreuzberg.toml)
kreuzberg extract document.pdf
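
To check which file discovery would pick up, the lookup described above can be approximated in the shell. This is an illustration of the documented behaviour (current directory first, then each parent), not Kreuzberg's actual implementation. `find_kreuzberg_config` is a hypothetical name.

```shell
# Walk from a start directory (default: current directory) up to the
# filesystem root and print the first kreuzberg.toml found.
# Returns non-zero if no such file exists on the way up.
find_kreuzberg_config() {
  dir=$(cd "${1:-.}" && pwd) || return 1
  while :; do
    if [ -f "${dir%/}/kreuzberg.toml" ]; then
      printf '%s\n' "${dir%/}/kreuzberg.toml"
      return 0
    fi
    [ "$dir" = "/" ] && return 1
    dir=$(dirname "$dir")
  done
}

# Usage: find_kreuzberg_config || echo "no kreuzberg.toml found, pass --config explicitly"
```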

Specify Config File¶

You can load TOML, YAML (.yaml or .yml), or JSON via --config:

Terminal

kreuzberg extract document.pdf --config my-config.toml
kreuzberg extract document.pdf --config kreuzberg.yaml
kreuzberg extract document.pdf --config my-config.json

Inline JSON Config¶

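Supply or override configuration without a file by passing inline JSON. Inline JSON is merged after the config file and before individual flags, so flags override inline JSON, and inline JSON overrides the file: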

Terminal

# Inline JSON (applied after config file)
kreuzberg extract document.pdf --config-json '{"ocr":{"backend":"tesseract"},"chunking":{"max_chars":1000}}'

# Base64-encoded JSON (useful in shells where quoting is awkward)
kreuzberg extract document.pdf --config-json-base64 eyJvY3IiOnsiYmFja2VuZCI6InRlc3NlcmFjdCJ9fQ==

Both extract and batch support --config-json and --config-json-base64.
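
The Base64 value can be produced with standard tools. The sketch below encodes the same JSON used in the example above. The variable names are illustrative, and `tr` removes the line wrapping that some `base64` implementations add to long output.

```shell
# Inline JSON config to encode.
config_json='{"ocr":{"backend":"tesseract"}}'

# Encode it; tr strips any line breaks inserted by base64.
config_b64=$(printf '%s' "$config_json" | base64 | tr -d '\n')

printf '%s\n' "$config_b64"
# prints eyJvY3IiOnsiYmFja2VuZCI6InRlc3NlcmFjdCJ9fQ==

# Usage: kreuzberg extract document.pdf --config-json-base64 "$config_b64"
```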

Example Config Files¶

kreuzberg.toml:

kreuzberg.toml

use_cache = true
enable_quality_processing = true

[ocr]
backend = "tesseract"
language = "eng"

[ocr.tesseract_config]
psm = 3

[chunking]
max_characters = 1000
overlap = 100

kreuzberg.yaml:

kreuzberg.yaml

use_cache: true
enable_quality_processing: true

ocr:
  backend: tesseract
  language: eng
  tesseract_config:
    psm: 3

chunking:
  max_characters: 1000
  overlap: 100

kreuzberg.json:

kreuzberg.json

{
  "use_cache": true,
  "enable_quality_processing": true,
  "ocr": {
    "backend": "tesseract",
    "language": "eng",
    "tesseract_config": {
      "psm": 3
    }
  },
  "chunking": {
    "max_characters": 1000,
    "overlap": 100
  }
}

Batch Processing¶

Use the batch command to process multiple files:

Terminal

# Extract all PDFs in directory
kreuzberg batch documents/*.pdf

# Extract PDFs recursively from subdirectories (in bash, ** only recurses after: shopt -s globstar)
kreuzberg batch documents/**/*.pdf

# Extract multiple file types
kreuzberg batch documents/**/*.{pdf,docx,txt}
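
Recursive `**` patterns depend on shell support, so a `find`-based variant may be more portable across shells. The following is a sketch under that assumption. `find_documents` is a hypothetical helper, and the extension list mirrors the example above.

```shell
# Print matching documents under a directory, NUL-separated so that
# file names containing spaces or newlines survive the pipe.
find_documents() {
  find "$1" -type f \( -name '*.pdf' -o -name '*.docx' -o -name '*.txt' \) -print0
}

# Usage: find_documents documents | xargs -0 kreuzberg batch
```

Note that `xargs` may split a very long file list across several `kreuzberg batch` invocations.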

Batch with Output Formats¶

Terminal

# Output as JSON (default for batch command)
kreuzberg batch documents/*.pdf --format json

# Output as plain text
kreuzberg batch documents/*.pdf --format text

Batch with OCR¶

Terminal

# Batch extract with OCR enabled
kreuzberg batch scanned/*.pdf --ocr true

# Batch extract with force OCR
kreuzberg batch documents/*.pdf --force-ocr true

# Batch extract with quality processing
kreuzberg batch documents/*.pdf --quality true

Batch with Content Format¶

Terminal

# Batch extract with djot formatting
kreuzberg batch documents/*.pdf --output-format djot --format json

# Batch extract as Markdown
kreuzberg batch documents/*.pdf --output-format markdown --format json

# Batch extract as HTML
kreuzberg batch documents/*.pdf --output-format html --format json

Advanced Features¶

Language Detection¶

Terminal

# Extract with automatic language detection
kreuzberg extract document.pdf --detect-language true

# Disable language detection
kreuzberg extract document.pdf --detect-language false

Content Chunking¶

Terminal

# Split content into chunks for LLM processing
kreuzberg extract document.pdf --chunk true

# Specify chunk size and overlap
kreuzberg extract document.pdf --chunk true --chunk-size 1000 --chunk-overlap 100

# Output chunked content as JSON
kreuzberg extract document.pdf --chunk true --format json

Quality Processing¶

Terminal

# Apply quality processing for improved formatting
kreuzberg extract document.pdf --quality true

# Disable quality processing
kreuzberg extract document.pdf --quality false

# Batch extraction with quality processing
kreuzberg batch documents/*.pdf --quality true

Caching¶

Terminal

# Extract with result caching enabled (default)
kreuzberg extract document.pdf

# Extract without caching results
kreuzberg extract document.pdf --no-cache true

# Clear all cached results
kreuzberg cache clear

# View cache statistics
kreuzberg cache stats

Output Options¶

Standard Output (Text Format)¶

Terminal

# Extract and print content to stdout
kreuzberg extract document.pdf

# Extract and redirect output to file
kreuzberg extract document.pdf > output.txt

# Batch extract as text
kreuzberg batch documents/*.pdf --format text

JSON Output¶

Terminal

# Output as JSON
kreuzberg extract document.pdf --format json

# Batch extract as JSON (default format)
kreuzberg batch documents/*.pdf --format json

JSON Output Structure:

The JSON output includes extracted content and related metadata:

JSON Response

{
  "content": "Extracted text content...",
  "metadata": {
    "mime_type": "application/pdf"
  }
}
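
For scripting, individual fields can be pulled out of this structure with a JSON processor. The sketch below assumes `jq` is installed (this guide does not require it) and relies only on the field names shown in the sample response. `json_mime_type` is a hypothetical helper.

```shell
# Read Kreuzberg JSON output from stdin and print the detected MIME type.
# Field names follow the sample response above.
json_mime_type() {
  jq -r '.metadata.mime_type'
}

# Usage: kreuzberg extract document.pdf --format json | json_mime_type
```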

Error Handling¶

The CLI exits with a non-zero status when a command fails, so standard shell constructs such as || can be used for basic error handling:

Terminal

# Check for extraction errors
kreuzberg extract document.pdf || echo "Extraction failed"

# Continue processing even if one file fails, and report which one (bash)
for file in documents/*.pdf; do
  kreuzberg batch "$file" || echo "Skipping $file: extraction failed" >&2
done
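
When a script needs to know whether any file failed, the loop can be wrapped in a function that tracks failures. This is a sketch that assumes, as the examples above do, that a failed extraction yields a non-zero exit status. `extract_all` and the `.txt` naming scheme are illustrative choices, not CLI behaviour.

```shell
# Extract each file to "<name>.txt", report failures on stderr,
# and return non-zero if at least one file failed.
extract_all() {
  failed=0
  for file in "$@"; do
    out="$file.txt"
    if ! kreuzberg extract "$file" > "$out"; then
      echo "Extraction failed: $file" >&2
      rm -f "$out"   # do not leave an empty or partial output file behind
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Usage: extract_all documents/*.pdf || echo "Some files could not be extracted"
```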

Examples¶

Extract Single PDF¶

Extract text from PDF

kreuzberg extract document.pdf

Batch Extract All PDFs in Directory¶

Extract all PDFs from directory as JSON

kreuzberg batch documents/*.pdf --format json

OCR Scanned Documents¶

OCR extraction from scanned documents

kreuzberg batch scans/*.pdf --ocr true --format json

Extract with Quality Processing¶

Extract with quality processing enabled

kreuzberg extract document.pdf --quality true --format json

Extract with Chunking¶

Extract with chunking for LLM processing

kreuzberg extract document.pdf --config kreuzberg.toml --chunk true --chunk-size 1000 --chunk-overlap 100 --format json

Batch Extract Multiple File Types¶

Extract multiple file types in batch

kreuzberg batch documents/**/*.{pdf,docx,txt} --format json

Extract with Config File¶

Extract using configuration file

kreuzberg extract document.pdf --config /path/to/kreuzberg.toml

Detect MIME Type¶

Detect file MIME type

kreuzberg detect document.pdf

Docker Usage¶

Use the CLI image ghcr.io/kreuzberg-dev/kreuzberg-cli:latest for command-line usage. The full image ghcr.io/kreuzberg-dev/kreuzberg:latest also includes the CLI.

Basic Docker¶

Terminal

# Extract document using Docker with mounted directory
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
  extract /data/document.pdf

# Extract and save output to host directory using shell redirection
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
  extract /data/document.pdf > output.txt

Docker with OCR¶

Terminal

# Extract with OCR using Docker
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
  extract /data/scanned.pdf --ocr true
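
To avoid retyping the `docker run` prefix, the invocation can be wrapped in a shell function. This is a convenience sketch. `kreuzberg_docker` is a hypothetical name, and `--rm` (a standard Docker flag that removes the container after it exits) is an addition not used in the examples above.

```shell
# Run the containerized CLI against files in the current directory.
# Paths passed to the CLI must use the /data prefix inside the container.
kreuzberg_docker() {
  docker run --rm -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest "$@"
}

# Usage: kreuzberg_docker extract /data/scanned.pdf --ocr true
```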

Docker Compose¶

docker-compose.yaml:

docker-compose.yaml

version: "3.8"

services:
  kreuzberg:
    image: ghcr.io/kreuzberg-dev/kreuzberg-cli:latest
    volumes:
      - ./documents:/input
    command: extract /input/document.pdf --ocr true

Run:

Terminal

docker-compose up

Performance Tips¶

Optimize Extraction Speed¶

Terminal

# Extract without quality processing for faster speed
kreuzberg extract large.pdf --quality false

# Use batch for processing multiple files
kreuzberg batch large_files/*.pdf --format json

Manage Memory Usage¶

Terminal

# Disable caching to reduce memory footprint
kreuzberg extract large_file.pdf --no-cache true

# Compress output to save disk space
kreuzberg extract document.pdf | gzip > output.txt.gz

Troubleshooting¶

Check Installation¶

Terminal

# Display installed version
kreuzberg --version

# Display help for commands
kreuzberg --help

Common Issues¶

Issue: "Tesseract not found"

When using OCR, Tesseract must be installed:

Terminal

# Install Tesseract OCR engine on macOS
brew install tesseract

# Install Tesseract OCR engine on Ubuntu
sudo apt-get install tesseract-ocr

Issue: "File not found"

Ensure the file path is correct and accessible:

Terminal

# Check if file exists and is readable
ls -la document.pdf

# Extract with absolute path
kreuzberg extract /absolute/path/to/document.pdf

Server Commands¶

Start API Server¶

The serve command starts a RESTful HTTP API server:

Terminal

# Start server on default host (127.0.0.1) and port (8000)
kreuzberg serve

# Start server on specific host and port (-H / -p are short forms)
kreuzberg serve --host 0.0.0.0 --port 8000
kreuzberg serve -H 0.0.0.0 -p 8000

# Start server with custom configuration file
kreuzberg serve --config kreuzberg.toml --host 0.0.0.0 --port 8000

Server Endpoints¶

The server provides the following endpoints:

- POST /extract - Extract text from uploaded files
- POST /batch - Batch extract from multiple files
- GET /detect - Detect MIME type of file
- GET /health - Health check
- GET /info - Server information
- GET /cache/stats - Cache statistics
- POST /cache/clear - Clear cache

See API Server Guide for full API details.
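
When the server is started from a script, it can help to wait until the health endpoint responds before sending requests. The sketch below polls GET /health with `curl`. It assumes only that the endpoint returns a success status once the server is ready. `wait_for_server` and its arguments are hypothetical.

```shell
# Poll GET /health until the server answers with a success status.
# Arguments: base URL, number of attempts (default 10), seconds between attempts (default 1).
wait_for_server() {
  url=$1 attempts=${2:-10} delay=${3:-1}
  while [ "$attempts" -gt 0 ]; do
    if curl -fsS --max-time 2 "$url/health" > /dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    [ "$attempts" -gt 0 ] && sleep "$delay"
  done
  return 1
}

# Usage:
#   kreuzberg serve --port 8000 &
#   wait_for_server http://127.0.0.1:8000 && echo "server is ready"
```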

Start MCP Server¶

The mcp command starts a Model Context Protocol server for AI integration:

Terminal

# Start MCP server with stdio transport (default for Claude Desktop)
kreuzberg mcp

# Start MCP server with HTTP transport
kreuzberg mcp --transport http

# Start MCP server on specific HTTP host and port
kreuzberg mcp --transport http --host 0.0.0.0 --port 8001

# Start MCP server with custom configuration file
kreuzberg mcp --config kreuzberg.toml --transport stdio

The MCP server provides tools for AI agents:

- extract_file - Extract text from a file path
- extract_bytes - Extract text from base64-encoded bytes
- batch_extract - Extract from multiple files

See API Server Guide for MCP integration details.

Cache Management¶

View Cache Statistics¶

Terminal

# Display cache usage statistics
kreuzberg cache stats

# Display statistics for specific cache directory
kreuzberg cache stats --cache-dir /path/to/cache

# Output cache statistics as JSON
kreuzberg cache stats --format json

Clear Cache¶

Terminal

# Remove all cached extraction results
kreuzberg cache clear

# Clear specific cache directory
kreuzberg cache clear --cache-dir /path/to/cache

# Clear cache and output removal details as JSON
kreuzberg cache clear --format json

Getting Help¶

CLI Help¶

Terminal

# Display general CLI help
kreuzberg --help

# Display command-specific help
kreuzberg extract --help
kreuzberg batch --help
kreuzberg detect --help
kreuzberg version --help
kreuzberg serve --help
kreuzberg mcp --help
kreuzberg cache --help
kreuzberg cache stats --help
kreuzberg cache clear --help

Version Information¶

Terminal

# Display version number
kreuzberg --version

# Show version with JSON output
kreuzberg version --format json

The version command displays the Kreuzberg version. Use --format json for machine-readable output.

Next Steps¶

- API Server Guide - API and MCP server setup
- Advanced Features - Advanced Kreuzberg features
- Plugin Development - Extend Kreuzberg functionality
- API Reference - Programmatic access