CLI Usage¶
The Kreuzberg CLI provides command-line access to all extraction features. This guide covers installation, basic usage, and advanced features.
Installation¶
Feature Availability
**Homebrew Installation:**
- ✅ Text extraction (PDF, Office, images, 75+ formats)
- ✅ OCR with Tesseract
- ✅ HTTP API server (`serve` command)
- ✅ MCP protocol server (`mcp` command)
- ✅ Chunking, quality scoring, language detection
- ❌ **Embeddings** - Not available via CLI flags. Use config file or Docker image.
**Docker Images:**
- All features enabled including embeddings (ONNX Runtime included)
Basic Usage¶
Extract from Single File¶
# Extract text content to stdout
kreuzberg extract document.pdf
# Specify MIME type (auto-detected if not provided)
kreuzberg extract document.pdf --mime-type application/pdf
Batch Extract Multiple Files¶
Use the batch command to extract from multiple files:
# Extract from multiple files
kreuzberg batch doc1.pdf doc2.docx doc3.txt
# Batch extract all PDFs in directory
kreuzberg batch documents/*.pdf
# Batch extract recursively (in bash, run `shopt -s globstar` first so ** matches subdirectories)
kreuzberg batch documents/**/*.pdf
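Whether `**` recurses into subdirectories depends on the shell: zsh expands it by default, while bash only does so with `globstar` enabled. A shell-independent alternative is to let `find` collect the files. The sketch below assumes only that `kreuzberg batch` accepts several paths, as shown above; `batch_recursive` is a hypothetical helper name, not part of the CLI.

```shell
# Hypothetical helper: batch-extract every PDF under a directory tree
# without relying on the shell's ** glob support.
batch_recursive() {
  dir=${1:-documents}
  # -exec ... {} + hands as many paths as fit to each kreuzberg invocation,
  # so a very large tree may result in more than one batch call.
  find "$dir" -type f -name '*.pdf' -exec kreuzberg batch {} +
}

# Usage:
# batch_recursive documents
```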
Output Formats¶
# Output as plain text (default for extract)
kreuzberg extract document.pdf --format text
# Output as JSON (default for batch)
kreuzberg batch documents/*.pdf --format json
# Extract single file as JSON
kreuzberg extract document.pdf --format json
Content Output Format¶
Control the formatting of extracted text content:
# Extract as plain text (default)
kreuzberg extract document.pdf --output-format plain
# Extract as Markdown
kreuzberg extract document.pdf --output-format markdown
# Extract as Djot markup
kreuzberg extract document.pdf --output-format djot
# Extract as HTML
kreuzberg extract document.pdf --output-format html
The --output-format flag controls how the extracted text is formatted. This is different from --format which controls the output structure (text vs JSON).
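Because the two flags are independent, they can be combined freely. The wrapper below is a minimal sketch of that idea; `extract_as` is a hypothetical name, and the defaults it uses (`plain` and `text`) are the ones stated above for the `extract` command.

```shell
# Hypothetical wrapper: choose the content markup and the output structure
# independently of each other.
extract_as() {
  file=$1
  markup=${2:-plain}     # --output-format: plain, markdown, djot, or html
  structure=${3:-text}   # --format: text or json
  kreuzberg extract "$file" --output-format "$markup" --format "$structure"
}

# Markdown-formatted content wrapped in a JSON document:
# extract_as document.pdf markdown json
```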
OCR Extraction¶
Enable OCR¶
# Enable OCR (overrides config file setting)
kreuzberg extract scanned.pdf --ocr true
# Disable OCR
kreuzberg extract document.pdf --ocr false
Force OCR¶
Force OCR even for PDFs with text layer:
# Force OCR to run regardless of existing text
kreuzberg extract document.pdf --force-ocr true
OCR Language Selection¶
Set the OCR language using the --ocr-language flag. This flag is backend-agnostic and works with all supported OCR backends (Tesseract, PaddleOCR, EasyOCR).
Language Code Formats:
- Tesseract: Uses ISO 639-3 codes (three-letter codes)
  - Examples: `eng` (English), `fra` (French), `deu` (German), `spa` (Spanish), `jpn` (Japanese)
- PaddleOCR: Accepts flexible language codes and full language names
  - Examples: `en`, `ch`, `french`, `korean`, `thai`, `greek`, `cyrillic`, etc.
- EasyOCR: Similar flexible format to PaddleOCR
When used with --ocr true, the language flag overrides the default language. When used without --ocr, it overrides the language specified in your config file.
# French OCR with Tesseract (default backend)
kreuzberg extract --ocr true --ocr-language fra document.pdf
# Chinese OCR with PaddleOCR
kreuzberg extract --ocr true --ocr-backend paddle-ocr --ocr-language ch document.pdf
# Thai OCR with PaddleOCR
kreuzberg extract --ocr true --ocr-backend paddle-ocr --ocr-language thai document.pdf
# German OCR with Tesseract
kreuzberg extract --ocr true --ocr-language deu document.pdf
# Override config file language with Spanish
kreuzberg extract document.pdf --config kreuzberg.toml --ocr-language spa
OCR Configuration¶
OCR options are configured via config file. CLI flags override config settings:
# Extract with OCR enabled via config file
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true
Configure OCR backend, language, and Tesseract options in your config file (see Configuration Files section).
Configuration Files¶
Using Config Files¶
Kreuzberg automatically discovers a configuration file by searching the current directory and parent directories for kreuzberg.toml only. If you use YAML or JSON, specify the file explicitly with --config.
# Extract using discovered configuration (finds kreuzberg.toml)
kreuzberg extract document.pdf
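To check which file discovery would pick up from a given directory, the lookup described above can be approximated in the shell. This is only a sketch of the documented behavior (current directory first, then each parent), not Kreuzberg's own implementation; `find_kreuzberg_config` is a hypothetical helper name.

```shell
# Walk from a starting directory up to / and print the first kreuzberg.toml
# found. Mirrors the documented discovery order; not Kreuzberg's actual code.
find_kreuzberg_config() {
  dir=$(cd "${1:-.}" && pwd) || return 1
  while :; do
    if [ -f "$dir/kreuzberg.toml" ]; then
      printf '%s\n' "$dir/kreuzberg.toml"
      return 0
    fi
    # Reached the filesystem root without finding a config file
    [ "$dir" = "/" ] && return 1
    dir=$(dirname "$dir")
  done
}

# Usage:
# find_kreuzberg_config || echo "No kreuzberg.toml found; pass --config explicitly"
```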
Specify Config File¶
You can load TOML, YAML (.yaml or .yml), or JSON via --config:
kreuzberg extract document.pdf --config my-config.toml
kreuzberg extract document.pdf --config kreuzberg.yaml
kreuzberg extract document.pdf --config my-config.json
Inline JSON Config¶
Override or supply config without a file using inline JSON (merged after config file, before individual flags):
# Inline JSON (applied after config file)
kreuzberg extract document.pdf --config-json '{"ocr":{"backend":"tesseract"},"chunking":{"max_chars":1000}}'
# Base64-encoded JSON (useful in shells where quoting is awkward)
kreuzberg extract document.pdf --config-json-base64 eyJvY3IiOnsiYmFja2VuZCI6InRlc3NlcmFjdCJ9fQ==
Both extract and batch support --config-json and --config-json-base64.
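The Base64 value can be produced with standard tools. The sketch below encodes the JSON `{"ocr":{"backend":"tesseract"}}`, which is what the Base64 string in the example above decodes to. The variable names are illustrative, and it assumes a `base64` utility is on the PATH.

```shell
# Encode an inline JSON config for --config-json-base64.
config_json='{"ocr":{"backend":"tesseract"}}'

# printf avoids the trailing newline that echo would add; tr removes any
# line wrapping that some base64 implementations insert.
config_b64=$(printf '%s' "$config_json" | base64 | tr -d '\n')

printf '%s\n' "$config_b64"
# prints eyJvY3IiOnsiYmFja2VuZCI6InRlc3NlcmFjdCJ9fQ==

# Pass the encoded value to the CLI:
# kreuzberg extract document.pdf --config-json-base64 "$config_b64"
```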
Example Config Files¶
kreuzberg.toml:
use_cache = true
enable_quality_processing = true
[ocr]
backend = "tesseract"
language = "eng"
[ocr.tesseract_config]
psm = 3
[chunking]
max_characters = 1000
overlap = 100
kreuzberg.yaml:
use_cache: true
enable_quality_processing: true
ocr:
  backend: tesseract
  language: eng
  tesseract_config:
    psm: 3
chunking:
  max_characters: 1000
  overlap: 100
kreuzberg.json:
{
  "use_cache": true,
  "enable_quality_processing": true,
  "ocr": {
    "backend": "tesseract",
    "language": "eng",
    "tesseract_config": {
      "psm": 3
    }
  },
  "chunking": {
    "max_characters": 1000,
    "overlap": 100
  }
}
Batch Processing¶
Use the batch command to process multiple files:
# Extract all PDFs in directory
kreuzberg batch documents/*.pdf
# Extract PDFs recursively from subdirectories
kreuzberg batch documents/**/*.pdf
# Extract multiple file types
kreuzberg batch documents/**/*.{pdf,docx,txt}
Batch with Output Formats¶
# Output as JSON (default for batch command)
kreuzberg batch documents/*.pdf --format json
# Output as plain text
kreuzberg batch documents/*.pdf --format text
Batch with OCR¶
# Batch extract with OCR enabled
kreuzberg batch scanned/*.pdf --ocr true
# Batch extract with force OCR
kreuzberg batch documents/*.pdf --force-ocr true
# Batch extract with quality processing
kreuzberg batch documents/*.pdf --quality true
Batch with Content Format¶
# Batch extract with djot formatting
kreuzberg batch documents/*.pdf --output-format djot --format json
# Batch extract as Markdown
kreuzberg batch documents/*.pdf --output-format markdown --format json
# Batch extract as HTML
kreuzberg batch documents/*.pdf --output-format html --format json
Advanced Features¶
Language Detection¶
# Extract with automatic language detection
kreuzberg extract document.pdf --detect-language true
# Disable language detection
kreuzberg extract document.pdf --detect-language false
Content Chunking¶
# Split content into chunks for LLM processing
kreuzberg extract document.pdf --chunk true
# Specify chunk size and overlap
kreuzberg extract document.pdf --chunk true --chunk-size 1000 --chunk-overlap 100
# Output chunked content as JSON
kreuzberg extract document.pdf --chunk true --format json
Quality Processing¶
# Apply quality processing for improved formatting
kreuzberg extract document.pdf --quality true
# Disable quality processing
kreuzberg extract document.pdf --quality false
# Batch extraction with quality processing
kreuzberg batch documents/*.pdf --quality true
Caching¶
# Extract with result caching enabled (default)
kreuzberg extract document.pdf
# Extract without caching results
kreuzberg extract document.pdf --no-cache true
# Clear all cached results
kreuzberg cache clear
# View cache statistics
kreuzberg cache stats
Output Options¶
Standard Output (Text Format)¶
# Extract and print content to stdout
kreuzberg extract document.pdf
# Extract and redirect output to file
kreuzberg extract document.pdf > output.txt
# Batch extract as text
kreuzberg batch documents/*.pdf --format text
JSON Output¶
# Output as JSON
kreuzberg extract document.pdf --format json
# Batch extract as JSON (default format)
kreuzberg batch documents/*.pdf --format json
JSON Output Structure:
The JSON output includes extracted content and related metadata:
{
"content": "Extracted text content...",
"metadata": {
"mime_type": "application/pdf"
}
}
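Individual fields can be pulled out of this structure with a JSON tool. The sketch below uses `jq`, a separate utility that is not part of Kreuzberg and is assumed to be installed. The sample string mirrors the structure shown above, and `json_field` is a hypothetical helper name.

```shell
# Sample shaped like the JSON structure above; in practice it would come from
#   kreuzberg extract document.pdf --format json
sample_json='{"content":"Extracted text content...","metadata":{"mime_type":"application/pdf"}}'

# Hypothetical helper: read one field from extraction JSON on stdin.
# -r prints strings without surrounding quotes.
json_field() {
  jq -r "$1"
}

# Usage:
# printf '%s' "$sample_json" | json_field '.content'
# printf '%s' "$sample_json" | json_field '.metadata.mime_type'
```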
Error Handling¶
The CLI returns appropriate exit codes on error. Basic error handling can be done with standard shell commands:
# Check for extraction errors
kreuzberg extract document.pdf || echo "Extraction failed"
# Continue processing even if one file fails (bash)
for file in documents/*.pdf; do
  kreuzberg extract "$file" || echo "Failed: $file" >&2
done
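For scripts that need an overall pass/fail result, the same exit-code check can be wrapped in a function that counts failures. This is a sketch under the assumption that `kreuzberg extract` exits with a non-zero status whenever extraction fails; `extract_all` is a hypothetical helper name.

```shell
# Hypothetical helper: extract each file, keep going after a failure, and
# signal through the function's exit status whether any file failed.
extract_all() {
  failed=0
  for file in "$@"; do
    if ! kreuzberg extract "$file"; then
      echo "Extraction failed: $file" >&2
      failed=$((failed + 1))
    fi
  done
  # Succeeds only if every file was extracted
  [ "$failed" -eq 0 ]
}

# Usage:
# extract_all documents/*.pdf || echo "Some files could not be extracted" >&2
```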
Examples¶
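Extract Single PDF¶
kreuzberg extract document.pdf
Batch Extract All PDFs in Directory¶
kreuzberg batch documents/*.pdf --format json
OCR Scanned Documents¶
kreuzberg batch scanned/*.pdf --ocr true
Extract with Quality Processing¶
kreuzberg extract document.pdf --quality true
Extract with Chunking¶
kreuzberg extract document.pdf --config kreuzberg.toml --chunk true --chunk-size 1000 --chunk-overlap 100 --format json
Batch Extract Multiple File Types¶
kreuzberg batch documents/**/*.{pdf,docx,txt}
Extract with Config File¶
kreuzberg extract document.pdf --config kreuzberg.toml
Detect MIME Type¶
kreuzberg detect document.pdf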
Docker Usage¶
Use the CLI image ghcr.io/kreuzberg-dev/kreuzberg-cli:latest for command-line usage. The full image ghcr.io/kreuzberg-dev/kreuzberg:latest also includes the CLI.
Basic Docker¶
# Extract document using Docker with mounted directory
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
extract /data/document.pdf
# Extract and save output to host directory using shell redirection
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
extract /data/document.pdf > output.txt
Docker with OCR¶
# Extract with OCR using Docker
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
extract /data/scanned.pdf --ocr true
Docker Compose¶
docker-compose.yaml:
version: "3.8"
services:
  kreuzberg:
    image: ghcr.io/kreuzberg-dev/kreuzberg-cli:latest
    volumes:
      - ./documents:/input
    command: extract /input/document.pdf --ocr true
Run:
docker compose up
Performance Tips¶
Optimize Extraction Speed¶
# Extract without quality processing for faster speed
kreuzberg extract large.pdf --quality false
# Use batch for processing multiple files
kreuzberg batch large_files/*.pdf --format json
Manage Memory Usage¶
# Disable caching to reduce memory footprint
kreuzberg extract large_file.pdf --no-cache true
# Compress output to save disk space
kreuzberg extract document.pdf | gzip > output.txt.gz
Troubleshooting¶
Check Installation¶
# Display installed version
kreuzberg --version
# Display help for commands
kreuzberg --help
Common Issues¶
Issue: "Tesseract not found"
When using OCR, Tesseract must be installed:
# Install Tesseract OCR engine on macOS
brew install tesseract
# Install Tesseract OCR engine on Ubuntu
sudo apt-get install tesseract-ocr
Issue: "File not found"
Ensure the file path is correct and accessible:
# Check if file exists and is readable
ls -la document.pdf
# Extract with absolute path
kreuzberg extract /absolute/path/to/document.pdf
Server Commands¶
Start API Server¶
The serve command starts a RESTful HTTP API server:
# Start server on default host (127.0.0.1) and port (8000)
kreuzberg serve
# Start server on specific host and port (-H / -p are short forms)
kreuzberg serve --host 0.0.0.0 --port 8000
kreuzberg serve -H 0.0.0.0 -p 8000
# Start server with custom configuration file
kreuzberg serve --config kreuzberg.toml --host 0.0.0.0 --port 8000
Server Endpoints¶
The server provides the following endpoints:
- `POST /extract` - Extract text from uploaded files
- `POST /batch` - Batch extract from multiple files
- `GET /detect` - Detect MIME type of file
- `GET /health` - Health check
- `GET /info` - Server information
- `GET /cache/stats` - Cache statistics
- `POST /cache/clear` - Clear cache
See API Server Guide for full API details.
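When the server is started from a script, it helps to wait until it answers before sending requests. The sketch below polls the `/health` endpoint with `curl`. It assumes the endpoint returns a successful HTTP status once the server is ready, and it makes no assumption about the response body. `wait_for_server` is a hypothetical helper name; the default URL matches the default host and port of `serve`.

```shell
# Hypothetical helper: poll GET /health until the server answers or the
# attempts run out.
wait_for_server() {
  base_url=${1:-http://127.0.0.1:8000}
  attempts=${2:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP error statuses as failure, -s: no progress output
    if curl -fs "$base_url/health" >/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage:
# kreuzberg serve &
# wait_for_server && curl -s http://127.0.0.1:8000/info
```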
Start MCP Server¶
The mcp command starts a Model Context Protocol server for AI integration:
# Start MCP server with stdio transport (default for Claude Desktop)
kreuzberg mcp
# Start MCP server with HTTP transport
kreuzberg mcp --transport http
# Start MCP server on specific HTTP host and port
kreuzberg mcp --transport http --host 0.0.0.0 --port 8001
# Start MCP server with custom configuration file
kreuzberg mcp --config kreuzberg.toml --transport stdio
The MCP server provides tools for AI agents:
- `extract_file` - Extract text from a file path
- `extract_bytes` - Extract text from base64-encoded bytes
- `batch_extract` - Extract from multiple files
See API Server Guide for MCP integration details.
Cache Management¶
View Cache Statistics¶
# Display cache usage statistics
kreuzberg cache stats
# Display statistics for specific cache directory
kreuzberg cache stats --cache-dir /path/to/cache
# Output cache statistics as JSON
kreuzberg cache stats --format json
Clear Cache¶
# Remove all cached extraction results
kreuzberg cache clear
# Clear specific cache directory
kreuzberg cache clear --cache-dir /path/to/cache
# Clear cache and display removal details
kreuzberg cache clear --format json
Getting Help¶
CLI Help¶
# Display general CLI help
kreuzberg --help
# Display command-specific help
kreuzberg extract --help
kreuzberg batch --help
kreuzberg detect --help
kreuzberg version --help
kreuzberg serve --help
kreuzberg mcp --help
kreuzberg cache --help
kreuzberg cache stats --help
kreuzberg cache clear --help
Version Information¶
# Display version number
kreuzberg --version
# Show version with JSON output
kreuzberg version --format json
The version command displays the Kreuzberg version. Use --format json for machine-readable output.
Next Steps¶
- API Server Guide - API and MCP server setup
- Advanced Features - Advanced Kreuzberg features
- Plugin Development - Extend Kreuzberg functionality
- API Reference - Programmatic access