CLI Usage

The Kreuzberg CLI provides command-line access to all extraction features. This guide covers installation, basic usage, and advanced features.

Installation

Bash
# Homebrew
brew install kreuzberg-dev/tap/kreuzberg

Bash
# Cargo
cargo install kreuzberg-cli

Bash
# Docker
docker pull ghcr.io/kreuzberg-dev/kreuzberg:latest
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest extract /data/document.pdf

Bash
# Go package
go get github.com/kreuzberg-dev/kreuzberg/packages/go/v4@latest

Feature Availability

Homebrew Installation:

  • ✅ Text extraction (PDF, Office, images, 75+ formats)
  • ✅ OCR with Tesseract
  • ✅ HTTP API server (serve command)
  • ✅ MCP protocol server (mcp command)
  • ✅ Chunking, quality scoring, language detection
  • Embeddings - Not available via CLI flags. Use config file or Docker image.

Docker Images:

  • All features enabled including embeddings (ONNX Runtime included)
  • Use kreuzberg/kreuzberg:full for LibreOffice support
  • Use kreuzberg/kreuzberg:core for smaller image without LibreOffice

Basic Usage

Extract from Single File

Terminal
# Extract text content to stdout
kreuzberg extract document.pdf

# Specify MIME type (auto-detected if not provided)
kreuzberg extract document.pdf --mime-type application/pdf

Batch Extract Multiple Files

Use the batch command to extract from multiple files:

Terminal
# Extract from multiple files
kreuzberg batch doc1.pdf doc2.docx doc3.txt

# Batch extract all PDFs in directory
kreuzberg batch documents/*.pdf

# Batch extract recursively (in bash, ** only recurses after `shopt -s globstar`)
kreuzberg batch documents/**/*.pdf

Output Formats

Terminal
# Output as plain text (default for extract)
kreuzberg extract document.pdf --format text

# Output as JSON (default for batch)
kreuzberg batch documents/*.pdf --format json

# Extract single file as JSON
kreuzberg extract document.pdf --format json

Content Output Format

Control the formatting of extracted text content:

Terminal
# Extract as plain text (default)
kreuzberg extract document.pdf --output-format plain

# Extract as Markdown
kreuzberg extract document.pdf --output-format markdown

# Extract as Djot markup
kreuzberg extract document.pdf --output-format djot

# Extract as HTML
kreuzberg extract document.pdf --output-format html

The --output-format flag controls how the extracted text is formatted. It is independent of --format, which controls the output structure (text vs JSON).

OCR Extraction

Enable OCR

Terminal
# Enable OCR (overrides config file setting)
kreuzberg extract scanned.pdf --ocr true

# Disable OCR
kreuzberg extract document.pdf --ocr false

Force OCR

Force OCR even for PDFs that already have a text layer:

Terminal
# Force OCR to run regardless of existing text
kreuzberg extract document.pdf --force-ocr true

OCR Configuration

OCR options are configured via config file. CLI flags override config settings:

Terminal
# Load OCR settings from the config file and enable OCR from the command line
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true

Configure OCR backend, language, and Tesseract options in your config file (see Configuration Files section).

Configuration Files

Using Config Files

Kreuzberg automatically discovers configuration files by searching the current directory and parent directories for:

  1. ./kreuzberg.{toml,yaml,yml,json} in the current directory
  2. ../kreuzberg.{toml,yaml,yml,json} in the parent directory (and so on, up the directory tree)

Terminal
# Extract using discovered configuration
kreuzberg extract document.pdf
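
To check which file would be picked up from a given directory, the search described above can be approximated in plain shell. This is only a sketch: find_kreuzberg_config is a hypothetical helper, not a Kreuzberg command, and the order in which extensions are tried within one directory is an assumption.

```shell
# find_kreuzberg_config [START_DIR]  (hypothetical helper, not part of Kreuzberg)
# Walks from START_DIR (default: current directory) up to the filesystem root
# and prints the first kreuzberg.{toml,yaml,yml,json} it finds.
find_kreuzberg_config() {
  local dir ext
  dir=$(cd "${1:-.}" && pwd) || return 1
  while :; do
    # Assumed extension order within a single directory.
    for ext in toml yaml yml json; do
      if [ -f "${dir%/}/kreuzberg.$ext" ]; then
        printf '%s\n' "${dir%/}/kreuzberg.$ext"
        return 0
      fi
    done
    [ "$dir" = "/" ] && return 1   # reached the root without a match
    dir=$(dirname "$dir")
  done
}
```

Passing the result to --config makes the choice explicit, for example: kreuzberg extract document.pdf --config "$(find_kreuzberg_config)".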

Specify Config File

Terminal
kreuzberg extract document.pdf --config my-config.toml

Example Config Files

kreuzberg.toml:

kreuzberg.toml
use_cache = true
enable_quality_processing = true

[ocr]
backend = "tesseract"
language = "eng"

[ocr.tesseract_config]
psm = 3

[chunking]
max_characters = 1000
overlap = 100

kreuzberg.yaml:

kreuzberg.yaml
use_cache: true
enable_quality_processing: true

ocr:
  backend: tesseract
  language: eng
  tesseract_config:
    psm: 3

chunking:
  max_characters: 1000
  overlap: 100

kreuzberg.json:

kreuzberg.json
{
  "use_cache": true,
  "enable_quality_processing": true,
  "ocr": {
    "backend": "tesseract",
    "language": "eng",
    "tesseract_config": {
      "psm": 3
    }
  },
  "chunking": {
    "max_characters": 1000,
    "overlap": 100
  }
}

Batch Processing

Use the batch command to process multiple files:

Terminal
# Extract all PDFs in directory
kreuzberg batch documents/*.pdf

# Extract PDFs recursively from subdirectories
kreuzberg batch documents/**/*.pdf

# Extract multiple file types
kreuzberg batch documents/**/*.{pdf,docx,txt}
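
The patterns above are expanded by the shell before kreuzberg runs. In bash, ** only descends into subdirectories when the globstar option is enabled (zsh recurses by default), and brace expansion is not available in plain POSIX sh. Where that matters, a find-based sketch such as the following may be more portable. find_documents is a hypothetical helper name, and -print0 assumes GNU or BSD find.

```shell
# find_documents DIR EXT...  (hypothetical helper, not part of Kreuzberg)
# Prints every regular file under DIR whose name ends in one of the given
# extensions, NUL-terminated so that paths containing spaces survive.
find_documents() {
  local dir ext first
  dir=$1
  shift
  [ "$#" -gt 0 ] || return 1   # need at least one extension
  first=1
  # Rebuild the positional parameters as: -name '*.pdf' -o -name '*.docx' ...
  for ext in "$@"; do
    if [ "$first" -eq 1 ]; then
      set -- -name "*.$ext"
      first=0
    else
      set -- "$@" -o -name "*.$ext"
    fi
  done
  find "$dir" -type f \( "$@" \) -print0
}

# Usage: hand all matches to a single batch invocation.
# find_documents documents pdf docx txt | xargs -0 kreuzberg batch
```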

Batch with Output Formats

Terminal
# Output as JSON (default for batch command)
kreuzberg batch documents/*.pdf --format json

# Output as plain text
kreuzberg batch documents/*.pdf --format text

Batch with OCR

Terminal
# Batch extract with OCR enabled
kreuzberg batch scanned/*.pdf --ocr true

# Batch extract with force OCR
kreuzberg batch documents/*.pdf --force-ocr true

# Batch extract with quality processing
kreuzberg batch documents/*.pdf --quality true

Batch with Content Format

Terminal
# Batch extract with djot formatting
kreuzberg batch documents/*.pdf --output-format djot --format json

# Batch extract as Markdown
kreuzberg batch documents/*.pdf --output-format markdown --format json

# Batch extract as HTML
kreuzberg batch documents/*.pdf --output-format html --format json

Advanced Features

Language Detection

Terminal
# Extract with automatic language detection
kreuzberg extract document.pdf --detect-language true

# Disable language detection
kreuzberg extract document.pdf --detect-language false

Content Chunking

Terminal
# Split content into chunks for LLM processing
kreuzberg extract document.pdf --chunk true

# Specify chunk size and overlap
kreuzberg extract document.pdf --chunk true --chunk-size 1000 --chunk-overlap 100

# Output chunked content as JSON
kreuzberg extract document.pdf --chunk true --format json
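
As a rough guide to how --chunk-size and --chunk-overlap interact, the arithmetic below estimates the chunk count for a plain sliding window, where each chunk starts (size - overlap) characters after the previous one. This is an assumption made for illustration: estimate_chunks is a hypothetical helper, and Kreuzberg's chunker may split on other boundaries, so real counts can differ.

```shell
# estimate_chunks LENGTH SIZE OVERLAP  (hypothetical helper, not part of Kreuzberg)
# Estimates the number of chunks for LENGTH characters under a naive
# sliding window of SIZE characters with OVERLAP shared characters.
estimate_chunks() {
  local len size overlap stride
  len=$1
  size=$2
  overlap=$3
  stride=$((size - overlap))
  if [ "$stride" -le 0 ]; then
    echo "overlap must be smaller than chunk size" >&2
    return 1
  fi
  if [ "$len" -le 0 ]; then
    echo 0
  elif [ "$len" -le "$size" ]; then
    echo 1
  else
    # Ceiling of (len - overlap) / stride using integer arithmetic.
    echo $(( (len - overlap + stride - 1) / stride ))
  fi
}

estimate_chunks 2500 1000 100   # prints 3
```

With a size of 1000 and an overlap of 100, the windows in this model start at offsets 0, 900, and 1800, which is why a 2500-character document yields three chunks.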

Quality Processing

Terminal
# Apply quality processing for improved formatting
kreuzberg extract document.pdf --quality true

# Disable quality processing
kreuzberg extract document.pdf --quality false

# Batch extraction with quality processing
kreuzberg batch documents/*.pdf --quality true

Caching

Terminal
# Extract with result caching enabled (default)
kreuzberg extract document.pdf

# Extract without caching results
kreuzberg extract document.pdf --no-cache true

# Clear all cached results
kreuzberg cache clear

# View cache statistics
kreuzberg cache stats

Output Options

Standard Output (Text Format)

Terminal
# Extract and print content to stdout
kreuzberg extract document.pdf

# Extract and redirect output to file
kreuzberg extract document.pdf > output.txt

# Batch extract as text
kreuzberg batch documents/*.pdf --format text

JSON Output

Terminal
# Output as JSON
kreuzberg extract document.pdf --format json

# Batch extract as JSON (default format)
kreuzberg batch documents/*.pdf --format json

JSON Output Structure:

The JSON output includes extracted content and related metadata:

JSON Response
{
  "content": "Extracted text content...",
  "metadata": {
    "mime_type": "application/pdf"
  }
}
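
Individual fields can be pulled out of this structure with a JSON tool such as jq, which is a separate program and not bundled with Kreuzberg. The sketch below assumes that extract --format json prints a single object shaped like the example above. json_content and json_mime_type are hypothetical helper names.

```shell
# Both helpers read one JSON extraction result on stdin. Requires jq.

# json_content: print the extracted text.
json_content() {
  jq -r '.content'
}

# json_mime_type: print metadata.mime_type, or "unknown" when it is absent.
json_mime_type() {
  jq -r '.metadata.mime_type // "unknown"'
}

# Usage:
# kreuzberg extract document.pdf --format json | json_mime_type
```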

Error Handling

The CLI exits with a non-zero status when a command fails, so errors can be handled with standard shell constructs:

Terminal
# Check for extraction errors
kreuzberg extract document.pdf || echo "Extraction failed"

# Continue processing even if one file fails (bash)
for file in documents/*.pdf; do
  kreuzberg extract "$file" || echo "Extraction failed: $file" >&2
done
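
For scripts that need to know how many files failed, the same per-file loop can keep a tally and report it through its own exit status. This is a sketch, not a Kreuzberg feature: extract_each is a hypothetical helper, and KREUZBERG_BIN is a hypothetical variable introduced here so the loop can be tried with a stand-in command.

```shell
# extract_each FILE...  (hypothetical helper, not part of Kreuzberg)
# Runs `kreuzberg extract` on each file, reports failures on stderr,
# and returns non-zero if any extraction failed.
extract_each() {
  local bin failed file
  bin=${KREUZBERG_BIN:-kreuzberg}   # hypothetical override for testing
  failed=0
  for file in "$@"; do
    if ! "$bin" extract "$file"; then
      echo "extraction failed: $file" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ] || echo "$failed of $# files failed" >&2
  [ "$failed" -eq 0 ]   # becomes the function's exit status
}

# Usage:
# extract_each documents/*.pdf > all.txt || echo "some files could not be extracted"
```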

Examples

Extract Single PDF

Extract text from PDF
kreuzberg extract document.pdf

Batch Extract All PDFs in Directory

Extract all PDFs from directory as JSON
kreuzberg batch documents/*.pdf --format json

OCR Scanned Documents

OCR extraction from scanned documents
kreuzberg batch scans/*.pdf --ocr true --format json

Extract with Quality Processing

Extract with quality processing enabled
kreuzberg extract document.pdf --quality true --format json

Extract with Chunking

Extract with chunking for LLM processing
kreuzberg extract document.pdf --config kreuzberg.toml --chunk true --chunk-size 1000 --chunk-overlap 100 --format json

Batch Extract Multiple File Types

Extract multiple file types in batch
kreuzberg batch documents/**/*.{pdf,docx,txt} --format json

Extract with Config File

Extract using configuration file
kreuzberg extract document.pdf --config /path/to/kreuzberg.toml

Detect MIME Type

Detect file MIME type
kreuzberg detect document.pdf

Docker Usage

Basic Docker

Terminal
# Extract document using Docker with mounted directory
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf

# Extract and save output to host directory using shell redirection
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf > output.txt
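
The container only sees files under the mounted directory, so the host path has to be translated to its /data equivalent. A small wrapper can do that translation for any file. This is a sketch under stated assumptions: kreuzberg_docker_extract is a hypothetical helper, and DOCKER_BIN is a hypothetical variable that lets a stand-in command (for example echo, for a dry run) replace docker.

```shell
# kreuzberg_docker_extract FILE [EXTRA_ARGS...]  (hypothetical helper)
# Mounts the directory containing FILE at /data and runs `extract` on it.
# Any extra arguments (for example --ocr true) are passed through unchanged.
kreuzberg_docker_extract() {
  local file dir name docker
  file=$1
  shift
  dir=$(cd "$(dirname "$file")" && pwd) || return 1   # absolute host directory
  name=$(basename "$file")
  docker=${DOCKER_BIN:-docker}                        # hypothetical override
  "$docker" run --rm -v "$dir:/data" \
    ghcr.io/kreuzberg-dev/kreuzberg:latest \
    extract "/data/$name" "$@"
}

# Usage:
# kreuzberg_docker_extract ~/scans/invoice.pdf --ocr true > invoice.txt
```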

Docker with OCR

Terminal
# Extract with OCR using Docker
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/scanned.pdf --ocr true

Docker Compose

docker-compose.yaml:

docker-compose.yaml
version: '3.8'

services:
  kreuzberg:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest
    volumes:
      - ./documents:/input
    command: extract /input/document.pdf --ocr true

Run:

Terminal
docker-compose up

Performance Tips

Optimize Extraction Speed

Terminal
# Extract without quality processing for faster speed
kreuzberg extract large.pdf --quality false

# Use batch for processing multiple files
kreuzberg batch large_files/*.pdf --format json

Manage Memory Usage

Terminal
# Disable caching to reduce memory footprint
kreuzberg extract large_file.pdf --no-cache true

# Compress output to save disk space
kreuzberg extract document.pdf | gzip > output.txt.gz

Troubleshooting

Check Installation

Terminal
# Display installed version
kreuzberg --version

# Display help for commands
kreuzberg --help

Common Issues

Issue: "Tesseract not found"

When using OCR, Tesseract must be installed:

Terminal
# Install Tesseract OCR engine on macOS
brew install tesseract

# Install Tesseract OCR engine on Ubuntu
sudo apt-get install tesseract-ocr

Issue: "File not found"

Ensure the file path is correct and accessible:

Terminal
# Check if file exists and is readable
ls -la document.pdf

# Extract with absolute path
kreuzberg extract /absolute/path/to/document.pdf

Server Commands

Start API Server

The serve command starts a RESTful HTTP API server:

Terminal
# Start server on default host (127.0.0.1) and port (8000)
kreuzberg serve

# Start server on specific host and port
kreuzberg serve --host 0.0.0.0 --port 8000

# Start server with custom configuration file
kreuzberg serve --config kreuzberg.toml --host 0.0.0.0 --port 8000

Server Endpoints

The server provides the following endpoints:

  • POST /extract - Extract text from uploaded files
  • POST /batch - Batch extract from multiple files
  • GET /detect - Detect MIME type of file
  • GET /health - Health check
  • GET /info - Server information
  • GET /cache/stats - Cache statistics
  • POST /cache/clear - Clear cache

See API Server Guide for full API details.
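
When the server is started from a script, it can help to wait until GET /health answers before sending work. The sketch below polls that endpoint with curl. It assumes the endpoint returns a successful HTTP status once the server is ready and does not inspect the response body. wait_for_health is a hypothetical helper name.

```shell
# wait_for_health BASE_URL [ATTEMPTS]  (hypothetical helper; requires curl)
# Polls BASE_URL/health once per second until it returns a successful
# HTTP status, giving up after ATTEMPTS tries (default: 10).
wait_for_health() {
  local url attempts i
  url=${1%/}/health
  attempts=${2:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: fail on HTTP errors, -s: silent, --max-time: bound each probe
    if curl -fs --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
  done
  return 1
}

# Usage:
# kreuzberg serve --port 8000 &
# wait_for_health http://127.0.0.1:8000 || echo "server did not become healthy" >&2
```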

Start MCP Server

The mcp command starts a Model Context Protocol server for AI integration:

Terminal
# Start MCP server with stdio transport (default for Claude Desktop)
kreuzberg mcp

# Start MCP server with HTTP transport
kreuzberg mcp --transport http

# Start MCP server on specific HTTP host and port
kreuzberg mcp --transport http --host 0.0.0.0 --port 8001

# Start MCP server with custom configuration file
kreuzberg mcp --config kreuzberg.toml --transport stdio

The MCP server provides tools for AI agents:

  • extract_file - Extract text from a file path
  • extract_bytes - Extract text from base64-encoded bytes
  • batch_extract - Extract from multiple files

See API Server Guide for MCP integration details.

Cache Management

View Cache Statistics

Terminal
# Display cache usage statistics
kreuzberg cache stats

# Display statistics for specific cache directory
kreuzberg cache stats --cache-dir /path/to/cache

# Output cache statistics as JSON
kreuzberg cache stats --format json

Clear Cache

Terminal
# Remove all cached extraction results
kreuzberg cache clear

# Clear specific cache directory
kreuzberg cache clear --cache-dir /path/to/cache

# Clear cache and display removal details
kreuzberg cache clear --format json

Getting Help

CLI Help

Terminal
# Display general CLI help
kreuzberg --help

# Display command-specific help
kreuzberg extract --help
kreuzberg batch --help
kreuzberg detect --help
kreuzberg serve --help
kreuzberg mcp --help
kreuzberg cache --help

Version Information

Terminal
# Display version number
kreuzberg --version

# Show version with JSON output
kreuzberg version --format json

The version command displays the Kreuzberg version. Use --format json for machine-readable output.

Next Steps