CLI Usage¶
The Kreuzberg CLI provides command-line access to all extraction features. This guide covers installation, basic usage, and advanced features.
Installation¶
Feature Availability
Homebrew Installation:
- ✅ Text extraction (PDF, Office, images, 75+ formats)
- ✅ OCR with Tesseract
- ✅ HTTP API server (serve command)
- ✅ MCP protocol server (mcp command)
- ✅ Chunking, quality scoring, language detection
- ❌ Embeddings - Not available via CLI flags. Use config file or Docker image.
Docker Images:
- All features enabled including embeddings (ONNX Runtime included)
- Use kreuzberg/kreuzberg:full for LibreOffice support
- Use kreuzberg/kreuzberg:core for a smaller image without LibreOffice
Basic Usage¶
Extract from Single File¶
# Extract text content to stdout
kreuzberg extract document.pdf
# Specify MIME type (auto-detected if not provided)
kreuzberg extract document.pdf --mime-type application/pdf
Batch Extract Multiple Files¶
Use the batch command to extract from multiple files:
# Extract from multiple files
kreuzberg batch doc1.pdf doc2.docx doc3.txt
# Batch extract all PDFs in directory
kreuzberg batch documents/*.pdf
# Batch extract recursively
kreuzberg batch documents/**/*.pdf
Output Formats¶
# Output as plain text (default for extract)
kreuzberg extract document.pdf --format text
# Output as JSON (default for batch)
kreuzberg batch documents/*.pdf --format json
# Extract single file as JSON
kreuzberg extract document.pdf --format json
Content Output Format¶
Control the formatting of extracted text content:
# Extract as plain text (default)
kreuzberg extract document.pdf --output-format plain
# Extract as Markdown
kreuzberg extract document.pdf --output-format markdown
# Extract as Djot markup
kreuzberg extract document.pdf --output-format djot
# Extract as HTML
kreuzberg extract document.pdf --output-format html
The --output-format flag controls how the extracted text is formatted. This is different from --format which controls the output structure (text vs JSON).
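Because the two flags are independent, they can be combined. The sketch below assumes that extract accepts both flags together, as the batch examples in this guide do; extract_markdown_json is a hypothetical helper name, not part of the CLI.

```shell
# Hypothetical helper: render the content as Markdown (--output-format)
# and wrap the result in a JSON structure (--format).
extract_markdown_json() {
    kreuzberg extract "$1" --output-format markdown --format json
}
```

Calling extract_markdown_json document.pdf would then print a JSON document whose content field holds Markdown text.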
OCR Extraction¶
Enable OCR¶
# Enable OCR (overrides config file setting)
kreuzberg extract scanned.pdf --ocr true
# Disable OCR
kreuzberg extract document.pdf --ocr false
Force OCR¶
Force OCR even for PDFs with text layer:
# Force OCR to run regardless of existing text
kreuzberg extract document.pdf --force-ocr true
OCR Configuration¶
OCR options are configured via config file. CLI flags override config settings:
# Extract with OCR enabled via config file
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true
Configure OCR backend, language, and Tesseract options in your config file (see Configuration Files section).
Configuration Files¶
Using Config Files¶
Kreuzberg automatically discovers configuration files by searching the current directory and parent directories for:
- ./kreuzberg.{toml,yaml,yml,json} in the current directory
- ../kreuzberg.{toml,yaml,yml,json} in the parent directory (and so on, up the directory tree)
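The upward search described above can be approximated in plain shell, which is handy for checking which file would be picked up from a given directory. This is a sketch, not Kreuzberg's own implementation: find_kreuzberg_config is a hypothetical helper, and the order in which extensions are tried within one directory is an assumption.

```shell
# find_kreuzberg_config [START_DIR]  (hypothetical helper)
# Walks from START_DIR (default: current directory) up to the filesystem
# root and prints the first kreuzberg.{toml,yaml,yml,json} it finds.
# The function body runs in a subshell, so `exit` only ends the search.
find_kreuzberg_config() (
    dir=$(cd "${1:-.}" && pwd) || exit 1
    while :; do
        # Assumed extension order; the real precedence may differ.
        for ext in toml yaml yml json; do
            if [ -f "$dir/kreuzberg.$ext" ]; then
                printf '%s\n' "$dir/kreuzberg.$ext"
                exit 0
            fi
        done
        # Stop once the filesystem root has been checked.
        if [ "$dir" = "/" ]; then
            exit 1
        fi
        dir=$(dirname "$dir")
    done
)
```

The function prints the path of the nearest config file and returns a non-zero status if none exists between the starting directory and the root.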
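Specify Config File¶
# Pass a config file explicitly with --config
kreuzberg extract document.pdf --config kreuzberg.toml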
Example Config Files¶
kreuzberg.toml:
use_cache = true
enable_quality_processing = true
[ocr]
backend = "tesseract"
language = "eng"
[ocr.tesseract_config]
psm = 3
[chunking]
max_characters = 1000
overlap = 100
kreuzberg.yaml:
use_cache: true
enable_quality_processing: true
ocr:
  backend: tesseract
  language: eng
  tesseract_config:
    psm: 3
chunking:
  max_characters: 1000
  overlap: 100
kreuzberg.json:
{
  "use_cache": true,
  "enable_quality_processing": true,
  "ocr": {
    "backend": "tesseract",
    "language": "eng",
    "tesseract_config": {
      "psm": 3
    }
  },
  "chunking": {
    "max_characters": 1000,
    "overlap": 100
  }
}
Batch Processing¶
Use the batch command to process multiple files:
# Extract all PDFs in directory
kreuzberg batch documents/*.pdf
# Extract PDFs recursively from subdirectories
kreuzberg batch documents/**/*.pdf
# Extract multiple file types
kreuzberg batch documents/**/*.{pdf,docx,txt}
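The ** pattern only recurses in shells that support it (bash needs shopt -s globstar; zsh enables it by default), and brace expansion is not available in plain POSIX sh. Where those features are missing, find can collect the files instead. The following is a sketch under that assumption; batch_recursive is a hypothetical helper, not part of the CLI.

```shell
# batch_recursive DIR EXT...  (hypothetical helper)
# Finds regular files under DIR whose names end in one of the given
# extensions and passes them to `kreuzberg batch` in as few calls as possible.
batch_recursive() (
    dir=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: batch_recursive DIR EXT..." >&2
        exit 2
    fi
    # Turn "pdf docx" into: -name '*.pdf' -o -name '*.docx'
    # The for-loop word list is expanded once, so resetting "$@" is safe.
    first=1
    for ext in "$@"; do
        if [ "$first" -eq 1 ]; then
            set -- -name "*.$ext"
            first=0
        else
            set -- "$@" -o -name "*.$ext"
        fi
    done
    find "$dir" -type f \( "$@" \) -exec kreuzberg batch {} +
)
```

For example, batch_recursive documents pdf docx txt covers the same files as the brace pattern above. Note that -name matches case-sensitively, so files ending in .PDF would need their own entry.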
Batch with Output Formats¶
# Output as JSON (default for batch command)
kreuzberg batch documents/*.pdf --format json
# Output as plain text
kreuzberg batch documents/*.pdf --format text
Batch with OCR¶
# Batch extract with OCR enabled
kreuzberg batch scanned/*.pdf --ocr true
# Batch extract with force OCR
kreuzberg batch documents/*.pdf --force-ocr true
# Batch extract with quality processing
kreuzberg batch documents/*.pdf --quality true
Batch with Content Format¶
# Batch extract with djot formatting
kreuzberg batch documents/*.pdf --output-format djot --format json
# Batch extract as Markdown
kreuzberg batch documents/*.pdf --output-format markdown --format json
# Batch extract as HTML
kreuzberg batch documents/*.pdf --output-format html --format json
Advanced Features¶
Language Detection¶
# Extract with automatic language detection
kreuzberg extract document.pdf --detect-language true
# Disable language detection
kreuzberg extract document.pdf --detect-language false
Content Chunking¶
# Split content into chunks for LLM processing
kreuzberg extract document.pdf --chunk true
# Specify chunk size and overlap
kreuzberg extract document.pdf --chunk true --chunk-size 1000 --chunk-overlap 100
# Output chunked content as JSON
kreuzberg extract document.pdf --chunk true --format json
Quality Processing¶
# Apply quality processing for improved formatting
kreuzberg extract document.pdf --quality true
# Disable quality processing
kreuzberg extract document.pdf --quality false
# Batch extraction with quality processing
kreuzberg batch documents/*.pdf --quality true
Caching¶
# Extract with result caching enabled (default)
kreuzberg extract document.pdf
# Extract without caching results
kreuzberg extract document.pdf --no-cache true
# Clear all cached results
kreuzberg cache clear
# View cache statistics
kreuzberg cache stats
Output Options¶
Standard Output (Text Format)¶
# Extract and print content to stdout
kreuzberg extract document.pdf
# Extract and redirect output to file
kreuzberg extract document.pdf > output.txt
# Batch extract as text
kreuzberg batch documents/*.pdf --format text
JSON Output¶
# Output as JSON
kreuzberg extract document.pdf --format json
# Batch extract as JSON (default format)
kreuzberg batch documents/*.pdf --format json
JSON Output Structure:
The JSON output includes extracted content and related metadata:
{
  "content": "Extracted text content...",
  "metadata": {
    "mime_type": "application/pdf"
  }
}
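Individual fields of this structure can be pulled out with a JSON processor. The sketch below assumes jq is installed (it is a separate tool, not part of Kreuzberg) and relies only on the content and metadata.mime_type fields shown above; the helper names are hypothetical.

```shell
# Print only the extracted text from a JSON result read on stdin.
json_content() {
    jq -r '.content'
}

# Print the MIME type recorded in the metadata of a JSON result read on stdin.
json_mime_type() {
    jq -r '.metadata.mime_type'
}
```

A typical pipeline would be kreuzberg extract document.pdf --format json | json_mime_type.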
Error Handling¶
The CLI exits with a non-zero status when a command fails, so errors can be handled with standard shell constructs:
# Check for extraction errors
kreuzberg extract document.pdf || echo "Extraction failed"
# Continue processing even if one file fails (bash)
for file in documents/*.pdf; do
  kreuzberg batch "$file" || continue
done
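When the failures need to be reported rather than silently skipped, the loop can record them. This is a sketch that assumes only what is stated above, namely that a failed extraction yields a non-zero exit status; extract_each and the .txt naming scheme are hypothetical.

```shell
# extract_each FILE...  (hypothetical helper)
# Runs `kreuzberg extract` on each file, writes the text to FILE.txt,
# and reports every file that failed on stderr.
extract_each() (
    failed=0
    for file in "$@"; do
        if ! kreuzberg extract "$file" > "$file.txt"; then
            echo "extraction failed: $file" >&2
            rm -f "$file.txt"   # do not leave an empty or partial output file
            failed=$((failed + 1))
        fi
    done
    # Return non-zero if at least one file failed.
    [ "$failed" -eq 0 ]
)
```

Calling extract_each documents/*.pdf processes every file and still lets a surrounding script detect that something went wrong.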
Examples¶
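Extract Single PDF¶
kreuzberg extract document.pdf
Batch Extract All PDFs in Directory¶
kreuzberg batch documents/*.pdf
OCR Scanned Documents¶
kreuzberg extract scanned.pdf --ocr true
Extract with Quality Processing¶
kreuzberg extract document.pdf --quality true
Extract with Chunking¶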
kreuzberg extract document.pdf --config kreuzberg.toml --chunk true --chunk-size 1000 --chunk-overlap 100 --format json
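Batch Extract Multiple File Types¶
kreuzberg batch documents/**/*.{pdf,docx,txt}
Extract with Config File¶
kreuzberg extract document.pdf --config kreuzberg.toml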
Detect MIME Type¶
Docker Usage¶
Basic Docker¶
# Extract document using Docker with mounted directory
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf
# Extract and save output on the host using shell redirection
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf > output.txt
Docker with OCR¶
# Extract with OCR using Docker
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/scanned.pdf --ocr true
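The repeated docker run prefix can be wrapped in a small function so the container is invoked much like the native CLI. This is a sketch: kreuzberg_docker is a hypothetical helper that is not shipped with Kreuzberg, and the --rm flag (which removes the container after it exits) is an addition not used in the examples above.

```shell
# kreuzberg_docker ARGS...  (hypothetical helper)
# Mounts the current directory at /data and forwards all arguments
# to the Kreuzberg container.
kreuzberg_docker() {
    docker run --rm -v "$PWD:/data" ghcr.io/kreuzberg-dev/kreuzberg:latest "$@"
}
```

File arguments still have to use the path inside the container, for example kreuzberg_docker extract /data/scanned.pdf --ocr true.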
Docker Compose¶
docker-compose.yaml:
version: '3.8'
services:
  kreuzberg:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest
    volumes:
      - ./documents:/input
    command: extract /input/document.pdf --ocr true
Run:
# Start the service defined in docker-compose.yaml
docker compose up
Performance Tips¶
Optimize Extraction Speed¶
# Extract without quality processing for faster speed
kreuzberg extract large.pdf --quality false
# Use batch for processing multiple files
kreuzberg batch large_files/*.pdf --format json
Manage Memory Usage¶
# Disable caching to reduce memory footprint
kreuzberg extract large_file.pdf --no-cache true
# Compress output to save disk space
kreuzberg extract document.pdf | gzip > output.txt.gz
Troubleshooting¶
Check Installation¶
# Display installed version
kreuzberg --version
# Display help for commands
kreuzberg --help
Common Issues¶
Issue: "Tesseract not found"
When using OCR, Tesseract must be installed:
# Install Tesseract OCR engine on macOS
brew install tesseract
# Install Tesseract OCR engine on Ubuntu
sudo apt-get install tesseract-ocr
Issue: "File not found"
Ensure the file path is correct and accessible:
# Check if file exists and is readable
ls -la document.pdf
# Extract with absolute path
kreuzberg extract /absolute/path/to/document.pdf
Server Commands¶
Start API Server¶
The serve command starts a RESTful HTTP API server:
# Start server on default host (127.0.0.1) and port (8000)
kreuzberg serve
# Start server on specific host and port
kreuzberg serve --host 0.0.0.0 --port 8000
# Start server with custom configuration file
kreuzberg serve --config kreuzberg.toml --host 0.0.0.0 --port 8000
Server Endpoints¶
The server provides the following endpoints:
- POST /extract - Extract text from uploaded files
- POST /batch - Batch extract from multiple files
- GET /detect - Detect MIME type of file
- GET /health - Health check
- GET /info - Server information
- GET /cache/stats - Cache statistics
- POST /cache/clear - Clear cache
See API Server Guide for full API details.
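The read-only endpoints are easy to probe from the command line. The sketch below assumes curl is installed and that the server runs on the default host and port shown above; kreuzberg_get and the KREUZBERG_URL variable are hypothetical names, and the response format is not covered here.

```shell
# kreuzberg_get PATH  (hypothetical helper)
# Sends a GET request to a running `kreuzberg serve` instance.
# KREUZBERG_URL is an assumed variable; it defaults to the default
# host and port of `kreuzberg serve`.
kreuzberg_get() {
    curl --fail --silent --show-error "${KREUZBERG_URL:-http://127.0.0.1:8000}$1"
}
```

For example, kreuzberg_get /health checks that the server is up, and kreuzberg_get /cache/stats requests the cache statistics. With --fail, curl returns a non-zero status on HTTP error responses.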
Start MCP Server¶
The mcp command starts a Model Context Protocol server for AI integration:
# Start MCP server with stdio transport (default for Claude Desktop)
kreuzberg mcp
# Start MCP server with HTTP transport
kreuzberg mcp --transport http
# Start MCP server on specific HTTP host and port
kreuzberg mcp --transport http --host 0.0.0.0 --port 8001
# Start MCP server with custom configuration file
kreuzberg mcp --config kreuzberg.toml --transport stdio
The MCP server provides tools for AI agents:
- extract_file - Extract text from a file path
- extract_bytes - Extract text from base64-encoded bytes
- batch_extract - Extract from multiple files
See API Server Guide for MCP integration details.
Cache Management¶
View Cache Statistics¶
# Display cache usage statistics
kreuzberg cache stats
# Display statistics for specific cache directory
kreuzberg cache stats --cache-dir /path/to/cache
# Output cache statistics as JSON
kreuzberg cache stats --format json
Clear Cache¶
# Remove all cached extraction results
kreuzberg cache clear
# Clear specific cache directory
kreuzberg cache clear --cache-dir /path/to/cache
# Clear cache and display removal details
kreuzberg cache clear --format json
Getting Help¶
CLI Help¶
# Display general CLI help
kreuzberg --help
# Display command-specific help
kreuzberg extract --help
kreuzberg batch --help
kreuzberg detect --help
kreuzberg serve --help
kreuzberg mcp --help
kreuzberg cache --help
Version Information¶
# Display version number
kreuzberg --version
# Show version with JSON output
kreuzberg version --format json
The version command displays the Kreuzberg version. Use --format json for machine-readable output.
Next Steps¶
- API Server Guide - API and MCP server setup
- Advanced Features - Advanced Kreuzberg features
- Plugin Development - Extend Kreuzberg functionality
- API Reference - Programmatic access