CLI Usage

The Kreuzberg CLI provides command-line access to all extraction features. This guide covers installation, basic usage, and advanced features.

Installation

Install script:
curl -fsSL https://raw.githubusercontent.com/kreuzberg-dev/kreuzberg/main/scripts/install.sh | bash

Homebrew:
brew install kreuzberg-dev/tap/kreuzberg

Cargo:
cargo install kreuzberg-cli

Docker:
docker pull ghcr.io/kreuzberg-dev/kreuzberg-cli:latest
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest extract /data/document.pdf

Go:
go get github.com/kreuzberg-dev/kreuzberg/packages/go/v4@latest

Feature Availability

Homebrew Installation:

- ✅ Text extraction (PDF, Office, images, 75+ formats)
- ✅ OCR with Tesseract
- ✅ HTTP API server (`serve` command)
- ✅ MCP protocol server (`mcp` command)
- ✅ Chunking, quality scoring, language detection
- ❌ **Embeddings** - Not available via CLI flags. Use config file or Docker image.

**Docker Images:**

- All features enabled including embeddings (ONNX Runtime included)

Basic Usage

Extract from Single File

Terminal
# Extract text content to stdout
kreuzberg extract document.pdf

# Specify MIME type (auto-detected if not provided)
kreuzberg extract document.pdf --mime-type application/pdf

Batch Extract Multiple Files

Use the batch command to extract from multiple files:

Terminal
# Extract from multiple files
kreuzberg batch doc1.pdf doc2.docx doc3.txt

# Batch extract all PDFs in directory
kreuzberg batch documents/*.pdf

# Batch extract recursively (bash: run `shopt -s globstar` first; zsh expands ** by default)
kreuzberg batch documents/**/*.pdf
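
Where `**` is not available, `find` can collect the files instead. The sketch below passes every match to a single `batch` call. `batch_recursive` and the `KREUZBERG_BIN` variable are hypothetical names used only for this illustration; they are not part of the CLI.

```shell
# Recursively pass every file with the given extension to `kreuzberg batch`.
# KREUZBERG_BIN is a hypothetical override for this sketch only; it defaults
# to the `kreuzberg` binary on PATH.
batch_recursive() {
  dir=$1
  ext=$2
  # `-exec ... {} +` appends as many matching paths as fit to one invocation.
  find "$dir" -type f -name "*.$ext" -exec "${KREUZBERG_BIN:-kreuzberg}" batch {} +
}

# Usage:
#   batch_recursive documents pdf
#   KREUZBERG_BIN=echo batch_recursive documents pdf   # dry run: print the command
```

Setting `KREUZBERG_BIN=echo` gives a dry run that shows which files would be processed.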

Output Formats

Terminal
# Output as plain text (default for extract)
kreuzberg extract document.pdf --format text

# Output as JSON (default for batch)
kreuzberg batch documents/*.pdf --format json

# Extract single file as JSON
kreuzberg extract document.pdf --format json

Content Output Format

Control the formatting of extracted text content:

Terminal
# Extract as plain text (default)
kreuzberg extract document.pdf --output-format plain

# Extract as Markdown
kreuzberg extract document.pdf --output-format markdown

# Extract as Djot markup
kreuzberg extract document.pdf --output-format djot

# Extract as HTML
kreuzberg extract document.pdf --output-format html

The --output-format flag controls how the extracted text is formatted. This is different from --format which controls the output structure (text vs JSON).

OCR Extraction

Enable OCR

Terminal
# Enable OCR (overrides config file setting)
kreuzberg extract scanned.pdf --ocr true

# Disable OCR
kreuzberg extract document.pdf --ocr false

Force OCR

Force OCR even for PDFs with text layer:

Terminal
# Force OCR to run regardless of existing text
kreuzberg extract document.pdf --force-ocr true

OCR Language Selection

Set the OCR language using the --ocr-language flag. This flag is backend-agnostic and works with all supported OCR backends (Tesseract, PaddleOCR, EasyOCR).

Language Code Formats:

  • Tesseract: Uses ISO 639-3 (three-letter) codes
    Examples: eng (English), fra (French), deu (German), spa (Spanish), jpn (Japanese)
  • PaddleOCR: Accepts flexible language codes and full language names
    Examples: en, ch, french, korean, thai, greek, cyrillic
  • EasyOCR: Similar flexible format to PaddleOCR

When used with --ocr true, the language flag overrides the default language. When used without --ocr, it overrides the language specified in your config file.

Terminal
# French OCR with Tesseract (default backend)
kreuzberg extract --ocr true --ocr-language fra document.pdf

# Chinese OCR with PaddleOCR
kreuzberg extract --ocr true --ocr-backend paddle-ocr --ocr-language ch document.pdf

# Thai OCR with PaddleOCR
kreuzberg extract --ocr true --ocr-backend paddle-ocr --ocr-language thai document.pdf

# German OCR with Tesseract
kreuzberg extract --ocr true --ocr-language deu document.pdf

# Override config file language with Spanish
kreuzberg extract document.pdf --config kreuzberg.toml --ocr-language spa
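
Because the expected code depends on the backend, a script that switches backends may want a small lookup. The following is a minimal sketch covering only codes listed in this guide; `ocr_language_code` is a hypothetical helper, and treating `en` as English for PaddleOCR is an assumption.

```shell
# Map a language name to the code the chosen OCR backend expects.
# Only values that appear in this guide are included; extend as needed.
ocr_language_code() {
  backend=$1
  language=$2
  case "$backend:$language" in
    tesseract:english)   echo eng ;;
    tesseract:french)    echo fra ;;
    tesseract:german)    echo deu ;;
    tesseract:spanish)   echo spa ;;
    tesseract:japanese)  echo jpn ;;
    paddle-ocr:english)  echo en ;;
    paddle-ocr:chinese)  echo ch ;;
    # PaddleOCR also accepts these full language names as-is.
    paddle-ocr:french|paddle-ocr:korean|paddle-ocr:thai) echo "$language" ;;
    *)
      echo "unknown backend/language: $backend/$language" >&2
      return 1
      ;;
  esac
}

# Usage:
#   kreuzberg extract --ocr true --ocr-backend paddle-ocr \
#     --ocr-language "$(ocr_language_code paddle-ocr thai)" document.pdf
```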

OCR Configuration

OCR options are configured via config file. CLI flags override config settings:

Terminal
# Load OCR settings from the config file; --ocr true overrides the file's OCR on/off setting
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true

Configure OCR backend, language, and Tesseract options in your config file (see Configuration Files section).

Configuration Files

Using Config Files

Kreuzberg automatically discovers a configuration file by searching the current directory and parent directories for kreuzberg.toml only. If you use YAML or JSON, specify the file explicitly with --config.

Terminal
# Extract using discovered configuration (finds kreuzberg.toml)
kreuzberg extract document.pdf
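
The discovery rule can be pictured as a walk up the directory tree. The following is a minimal sketch of that search order, not Kreuzberg's actual implementation; `find_kreuzberg_config` is a hypothetical helper that can be handy for checking which file would be picked up.

```shell
# Walk from a starting directory up to the filesystem root and print the
# first kreuzberg.toml found. Exits non-zero if none exists.
# The body runs in a subshell, so no variables leak into the caller.
find_kreuzberg_config() (
  dir=$(cd "${1:-.}" && pwd) || exit 1
  while :; do
    candidate=${dir%/}/kreuzberg.toml
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      exit 0
    fi
    if [ "$dir" = "/" ]; then
      exit 1
    fi
    dir=$(dirname "$dir")
  done
)

# Usage:
#   find_kreuzberg_config            # search from the current directory
#   find_kreuzberg_config ./reports  # search from another directory
```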

Specify Config File

You can load TOML, YAML (.yaml or .yml), or JSON via --config:

Terminal
kreuzberg extract document.pdf --config my-config.toml
kreuzberg extract document.pdf --config kreuzberg.yaml
kreuzberg extract document.pdf --config my-config.json

Inline JSON Config

Override or supply config without a file using inline JSON (merged after config file, before individual flags):

Terminal
# Inline JSON (applied after config file)
kreuzberg extract document.pdf --config-json '{"ocr":{"backend":"tesseract"},"chunking":{"max_chars":1000}}'

# Base64-encoded JSON (useful in shells where quoting is awkward)
kreuzberg extract document.pdf --config-json-base64 eyJvY3IiOnsiYmFja2VuZCI6InRlc3NlcmFjdCJ9fQ==

Both extract and batch support --config-json and --config-json-base64.
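
The Base64 value can be produced with the standard `base64` tool. The sketch below is one way to do it; `encode_config` is a hypothetical helper name, not a Kreuzberg command.

```shell
# Encode an inline JSON config for --config-json-base64.
# `tr -d '\n'` strips the line wrapping that some base64 implementations add.
encode_config() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Usage:
#   kreuzberg extract document.pdf \
#     --config-json-base64 "$(encode_config '{"ocr":{"backend":"tesseract"}}')"
```

Encoding `{"ocr":{"backend":"tesseract"}}` this way yields the same string used in the example above.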

Example Config Files

kreuzberg.toml:

# OCR configuration
use_cache = true
enable_quality_processing = true

[ocr]
backend = "tesseract"
language = "eng"

[ocr.tesseract_config]
psm = 3

[chunking]
max_characters = 1000
overlap = 100

kreuzberg.yaml:

use_cache: true
enable_quality_processing: true

ocr:
  backend: tesseract
  language: eng
  tesseract_config:
    psm: 3

chunking:
  max_characters: 1000
  overlap: 100

kreuzberg.json:

{
  "use_cache": true,
  "enable_quality_processing": true,
  "ocr": {
    "backend": "tesseract",
    "language": "eng",
    "tesseract_config": {
      "psm": 3
    }
  },
  "chunking": {
    "max_characters": 1000,
    "overlap": 100
  }
}

Batch Processing

Use the batch command to process multiple files:

Terminal
# Extract all PDFs in directory
kreuzberg batch documents/*.pdf

# Extract PDFs recursively from subdirectories (bash: run `shopt -s globstar` first)
kreuzberg batch documents/**/*.pdf

# Extract multiple file types
kreuzberg batch documents/**/*.{pdf,docx,txt}

Batch with Output Formats

Terminal
# Output as JSON (default for batch command)
kreuzberg batch documents/*.pdf --format json

# Output as plain text
kreuzberg batch documents/*.pdf --format text

Batch with OCR

Terminal
# Batch extract with OCR enabled
kreuzberg batch scanned/*.pdf --ocr true

# Batch extract with force OCR
kreuzberg batch documents/*.pdf --force-ocr true

# Batch extract with quality processing
kreuzberg batch documents/*.pdf --quality true

Batch with Content Format

Terminal
# Batch extract with djot formatting
kreuzberg batch documents/*.pdf --output-format djot --format json

# Batch extract as Markdown
kreuzberg batch documents/*.pdf --output-format markdown --format json

# Batch extract as HTML
kreuzberg batch documents/*.pdf --output-format html --format json

Advanced Features

Language Detection

Terminal
# Extract with automatic language detection
kreuzberg extract document.pdf --detect-language true

# Disable language detection
kreuzberg extract document.pdf --detect-language false

Content Chunking

Terminal
# Split content into chunks for LLM processing
kreuzberg extract document.pdf --chunk true

# Specify chunk size and overlap
kreuzberg extract document.pdf --chunk true --chunk-size 1000 --chunk-overlap 100

# Output chunked content as JSON
kreuzberg extract document.pdf --chunk true --format json
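
To get a feel for how chunk size and overlap interact, the arithmetic below estimates a chunk count. It assumes fixed-size windows in which each chunk starts `size - overlap` characters after the previous one. That is an assumption made for illustration: the actual chunker may split on different boundaries and produce a different count. `estimate_chunks` is a hypothetical helper.

```shell
# Estimate how many chunks a text of `length` characters yields, assuming
# fixed windows of `size` characters that overlap by `overlap` characters.
estimate_chunks() {
  length=$1
  size=$2
  overlap=$3
  if [ "$overlap" -ge "$size" ]; then
    echo "overlap must be smaller than chunk size" >&2
    return 1
  fi
  if [ "$length" -le 0 ]; then
    echo 0
  elif [ "$length" -le "$size" ]; then
    echo 1
  else
    # Each chunk advances by `step`; round up so the tail is covered.
    step=$((size - overlap))
    echo $(( (length - overlap + step - 1) / step ))
  fi
}

# Usage:
#   estimate_chunks 2500 1000 100
```

Under these assumptions, a 2500-character text with `--chunk-size 1000 --chunk-overlap 100` gives 3 chunks, starting at offsets 0, 900, and 1800.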

Quality Processing

Terminal
# Apply quality processing for improved formatting
kreuzberg extract document.pdf --quality true

# Disable quality processing
kreuzberg extract document.pdf --quality false

# Batch extraction with quality processing
kreuzberg batch documents/*.pdf --quality true

Caching

Terminal
# Extract with result caching enabled (default)
kreuzberg extract document.pdf

# Extract without caching results
kreuzberg extract document.pdf --no-cache true

# Clear all cached results
kreuzberg cache clear

# View cache statistics
kreuzberg cache stats

Output Options

Standard Output (Text Format)

Terminal
# Extract and print content to stdout
kreuzberg extract document.pdf

# Extract and redirect output to file
kreuzberg extract document.pdf > output.txt

# Batch extract as text
kreuzberg batch documents/*.pdf --format text

JSON Output

Terminal
# Output as JSON
kreuzberg extract document.pdf --format json

# Batch extract as JSON (default format)
kreuzberg batch documents/*.pdf --format json

JSON Output Structure:

The JSON output includes extracted content and related metadata:

JSON Response
{
  "content": "Extracted text content...",
  "metadata": {
    "mime_type": "application/pdf"
  }
}
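
Individual fields can be pulled out of this structure with a JSON tool. The sketch below uses `jq`, which is a separate program and must be installed. It relies only on the two fields shown above, and the helper names are hypothetical.

```shell
# Read a JSON result on stdin and print the extracted text.
json_content() {
  jq -r '.content'
}

# Read a JSON result on stdin and print the detected MIME type.
json_mime_type() {
  jq -r '.metadata.mime_type'
}

# Usage:
#   kreuzberg extract document.pdf --format json | json_content
```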

Error Handling

The CLI returns a non-zero exit code when a command fails, so errors can be handled with standard shell constructs:

Terminal
# Check for extraction errors
kreuzberg extract document.pdf || echo "Extraction failed"

# Continue processing even if one file fails (bash)
for file in documents/*.pdf; do
  kreuzberg extract "$file" || echo "Failed: $file" >&2
done
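
For scripts that need an overall status, the same idea can be wrapped in a function that reports each failure and returns non-zero if any file failed. This is a minimal sketch: `extract_each` and the `<file>.txt` output naming are choices made for this example, not CLI behavior.

```shell
# Extract each file in turn, writing the text to <file>.txt.
# Failures are reported on stderr; the function returns non-zero if any
# file failed, but keeps going so one bad file does not stop the run.
extract_each() {
  failed=0
  for file in "$@"; do
    if ! kreuzberg extract "$file" > "$file.txt"; then
      echo "extraction failed: $file" >&2
      rm -f "$file.txt"   # drop the empty or partial output
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Usage:
#   extract_each documents/*.pdf || echo "some files could not be extracted"
```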

Examples

Extract Single PDF

# Extract text from PDF
kreuzberg extract document.pdf

Batch Extract All PDFs in Directory

# Extract all PDFs from directory as JSON
kreuzberg batch documents/*.pdf --format json

OCR Scanned Documents

# OCR extraction from scanned documents
kreuzberg batch scans/*.pdf --ocr true --format json

Extract with Quality Processing

# Extract with quality processing enabled
kreuzberg extract document.pdf --quality true --format json

Extract with Chunking

# Extract with chunking for LLM processing
kreuzberg extract document.pdf --config kreuzberg.toml --chunk true --chunk-size 1000 --chunk-overlap 100 --format json

Batch Extract Multiple File Types

# Extract multiple file types in batch
kreuzberg batch documents/**/*.{pdf,docx,txt} --format json

Extract with Config File

# Extract using configuration file
kreuzberg extract document.pdf --config /path/to/kreuzberg.toml

Detect MIME Type

# Detect file MIME type
kreuzberg detect document.pdf

Docker Usage

Use the CLI image ghcr.io/kreuzberg-dev/kreuzberg-cli:latest for command-line usage. The full image ghcr.io/kreuzberg-dev/kreuzberg:latest also includes the CLI.

Basic Docker

Terminal
# Extract document using Docker with mounted directory
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
  extract /data/document.pdf

# Extract and save output to host directory using shell redirection
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
  extract /data/document.pdf > output.txt

Docker with OCR

Terminal
# Extract with OCR using Docker
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
  extract /data/scanned.pdf --ocr true
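
Typing the mount and the `/data/...` path by hand gets repetitive. The following sketch wraps the pattern in a shell function that mounts the file's own directory and passes any extra flags through. `kreuzberg_docker_extract` is a hypothetical helper, and `--rm` (remove the container when it exits) is an addition not used in the examples above.

```shell
# Run `kreuzberg extract` in the CLI image against a file on the host.
# The file's directory is mounted at /data; remaining arguments are passed
# through to the CLI unchanged.
kreuzberg_docker_extract() {
  file=$1
  shift
  # Resolve the directory to an absolute path, as `docker run -v` expects.
  host_dir=$(cd "$(dirname "$file")" && pwd) || return 1
  docker run --rm -v "$host_dir:/data" \
    ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
    extract "/data/$(basename "$file")" "$@"
}

# Usage:
#   kreuzberg_docker_extract scans/scanned.pdf --ocr true
```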

Docker Compose

docker-compose.yaml:

docker-compose.yaml
version: "3.8"

services:
  kreuzberg:
    image: ghcr.io/kreuzberg-dev/kreuzberg-cli:latest
    volumes:
      - ./documents:/input
    command: extract /input/document.pdf --ocr true

Run:

Terminal
docker-compose up

Performance Tips

Optimize Extraction Speed

Terminal
# Extract without quality processing for faster speed
kreuzberg extract large.pdf --quality false

# Use batch for processing multiple files
kreuzberg batch large_files/*.pdf --format json

Manage Memory Usage

Terminal
# Disable caching to reduce memory footprint
kreuzberg extract large_file.pdf --no-cache true

# Compress output to save disk space
kreuzberg extract document.pdf | gzip > output.txt.gz

Troubleshooting

Check Installation

Terminal
# Display installed version
kreuzberg --version

# Display help for commands
kreuzberg --help

Common Issues

Issue: "Tesseract not found"

When using OCR, Tesseract must be installed:

Terminal
# Install Tesseract OCR engine on macOS
brew install tesseract

# Install Tesseract OCR engine on Ubuntu
sudo apt-get install tesseract-ocr
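
A related problem is that Tesseract is installed but the language data for `--ocr-language` is missing. The sketch below checks for a language with `tesseract --list-langs`; `tesseract_has_language` is a hypothetical helper name.

```shell
# Succeed if Tesseract is installed and lists the given language code.
# `tesseract --list-langs` prints one installed language per line; some
# older releases print the list on stderr, hence the 2>&1.
tesseract_has_language() {
  command -v tesseract >/dev/null 2>&1 || {
    echo "tesseract is not installed" >&2
    return 1
  }
  # -x matches the whole line, -F treats the code as a literal string.
  tesseract --list-langs 2>&1 | grep -qxF "$1"
}

# Usage:
#   tesseract_has_language fra || echo "French language data is missing"
```

On Ubuntu, language data is typically packaged separately, for example `tesseract-ocr-fra` for French.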

Issue: "File not found"

Ensure the file path is correct and accessible:

Terminal
# Check if file exists and is readable
ls -la document.pdf

# Extract with absolute path
kreuzberg extract /absolute/path/to/document.pdf

Server Commands

Start API Server

The serve command starts a RESTful HTTP API server:

Terminal
# Start server on default host (127.0.0.1) and port (8000)
kreuzberg serve

# Start server on specific host and port (-H / -p are short forms)
kreuzberg serve --host 0.0.0.0 --port 8000
kreuzberg serve -H 0.0.0.0 -p 8000

# Start server with custom configuration file
kreuzberg serve --config kreuzberg.toml --host 0.0.0.0 --port 8000

Server Endpoints

The server provides the following endpoints:

  • POST /extract - Extract text from uploaded files
  • POST /batch - Batch extract from multiple files
  • GET /detect - Detect MIME type of file
  • GET /health - Health check
  • GET /info - Server information
  • GET /cache/stats - Cache statistics
  • POST /cache/clear - Clear cache

See API Server Guide for full API details.
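
When the server is started from a script, it can help to wait until `GET /health` responds before sending work. The following is a minimal sketch using `curl`. `wait_for_server` and its retry parameters are hypothetical, and it assumes only that the health endpoint returns a success status once the server is ready.

```shell
# Poll GET /health until the server answers with a success status.
# Arguments: base URL, number of attempts, delay in seconds between attempts.
wait_for_server() {
  url=${1:-http://127.0.0.1:8000}
  attempts=${2:-10}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
    if curl -fs "$url/health" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   kreuzberg serve &
#   wait_for_server http://127.0.0.1:8000 20 1 || echo "server did not start"
```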

Start MCP Server

The mcp command starts a Model Context Protocol server for AI integration:

Terminal
# Start MCP server with stdio transport (default for Claude Desktop)
kreuzberg mcp

# Start MCP server with HTTP transport
kreuzberg mcp --transport http

# Start MCP server on specific HTTP host and port
kreuzberg mcp --transport http --host 0.0.0.0 --port 8001

# Start MCP server with custom configuration file
kreuzberg mcp --config kreuzberg.toml --transport stdio

The MCP server provides tools for AI agents:

  • extract_file - Extract text from a file path
  • extract_bytes - Extract text from base64-encoded bytes
  • batch_extract - Extract from multiple files

See API Server Guide for MCP integration details.

Cache Management

View Cache Statistics

Terminal
# Display cache usage statistics
kreuzberg cache stats

# Display statistics for specific cache directory
kreuzberg cache stats --cache-dir /path/to/cache

# Output cache statistics as JSON
kreuzberg cache stats --format json

Clear Cache

Terminal
# Remove all cached extraction results
kreuzberg cache clear

# Clear specific cache directory
kreuzberg cache clear --cache-dir /path/to/cache

# Clear cache and display removal details
kreuzberg cache clear --format json

Getting Help

CLI Help

Terminal
# Display general CLI help
kreuzberg --help

# Display command-specific help
kreuzberg extract --help
kreuzberg batch --help
kreuzberg detect --help
kreuzberg version --help
kreuzberg serve --help
kreuzberg mcp --help
kreuzberg cache --help
kreuzberg cache stats --help
kreuzberg cache clear --help

Version Information

Terminal
# Display version number
kreuzberg --version

# Show version with JSON output
kreuzberg version --format json

The version command displays the Kreuzberg version. Use --format json for machine-readable output.

Next Steps