Docker Deployment¶
Kreuzberg provides official Docker images built on a high-performance Rust core with Debian 13 (Trixie). Each image supports three execution modes through a flexible entrypoint pattern, enabling deployment as an API server, CLI tool, or MCP server.
Image Variants¶
Kreuzberg offers two Docker image variants optimized for different use cases:
Core Image¶
Size: ~1.0-1.3GB
Image: ghcr.io/kreuzberg-dev/kreuzberg:latest
Included Features:
- Tesseract OCR with 12 language packs (eng, spa, fra, deu, ita, por, chi-sim, chi-tra, jpn, ara, rus, hin)
- pdfium for PDF rendering
- Full support for modern file formats
Supported Formats:
- PDF, DOCX, PPTX, XLSX (modern Office formats)
- Images (PNG, JPG, TIFF, BMP, etc.)
- HTML, XML, JSON, YAML, TOML
- Email (EML, MSG)
- Archives (ZIP, TAR, GZ)
Best For:
- Production deployments where image size matters
- Cloud environments with size/bandwidth constraints
- Kubernetes deployments with frequent pod scaling
- Workflows that don't require legacy Office format support
Full Image¶
Size: ~1.0-1.3GB
Image: ghcr.io/kreuzberg-dev/kreuzberg:latest
Included Features:
- All Core image features
- Native OLE/CFB parsing for legacy formats
Additional Formats:
- Legacy Word (.doc)
- Legacy PowerPoint (.ppt)
- Legacy Excel (.xls)
Best For:
- Complete document intelligence pipelines
- Processing legacy MS Office files
- Development and testing environments
- When image size is not a constraint
Quick Start¶
Pull Image¶
docker pull ghcr.io/kreuzberg-dev/kreuzberg:latest
Basic Usage¶
# Extract a single file
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf
# Batch process multiple files (pattern is quoted so the host shell passes it through unexpanded)
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  batch '/data/*.pdf' --output-format json
# Detect MIME type
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  detect /data/unknown-file.bin
Execution Modes¶
Kreuzberg Docker images use a flexible ENTRYPOINT pattern that supports three execution modes:
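One way an entrypoint can route its first argument to a mode is sketched below. This is a hypothetical illustration, not the image's actual entrypoint: the function name `select_mode` is invented here, and the subcommand names are simply those used in the examples in this guide.

```shell
#!/bin/sh
# Hypothetical sketch: classify the first container argument into a mode.
# select_mode is an illustrative name, not part of Kreuzberg.
select_mode() {
    case "${1:-}" in
        "" | serve)                       echo "api" ;;  # no arguments: default API server
        mcp)                              echo "mcp" ;;
        extract | batch | detect | cache) echo "cli" ;;
        *)                                echo "unknown" ;;
    esac
}

select_mode                         # prints: api
select_mode extract /data/file.pdf  # prints: cli
select_mode mcp                     # prints: mcp
```

The point of the pattern is that the same image serves all three roles; only the arguments after the image name change.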
1. API Server Mode (Default)¶
The default mode starts an HTTP REST API server.
Default Behavior:
Custom Configuration:
# Change host and port
docker run -p 9000:9000 ghcr.io/kreuzberg-dev/kreuzberg:latest \
  serve --host 0.0.0.0 --port 9000
# With environment variables
docker run -p 8000:8000 \
  -e KREUZBERG_CORS_ORIGINS="https://myapp.com" \
  -e KREUZBERG_MAX_UPLOAD_SIZE_MB=200 \
  ghcr.io/kreuzberg-dev/kreuzberg:latest
# With configuration file
docker run -p 8000:8000 \
  -v "$(pwd)/kreuzberg.toml":/config/kreuzberg.toml \
  ghcr.io/kreuzberg-dev/kreuzberg:latest \
  serve --config /config/kreuzberg.toml
See API Server Guide for complete API documentation.
2. CLI Mode¶
Run Kreuzberg as a command-line tool for file processing.
Extract Files:
# Mount directory and extract file
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf
# Extract with OCR
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/scanned.pdf --ocr true
# Output as JSON (CLI output format)
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf --format json > result.json
Batch Processing:
# Process multiple files (default batch output is JSON)
# The pattern is quoted so the host shell passes it through unexpanded
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  batch '/data/*.pdf' --format json
MIME Detection:
docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
  detect /data/unknown-file.bin
Cache Management:
# View cache statistics
docker run ghcr.io/kreuzberg-dev/kreuzberg:latest cache stats
# Clear cache
docker run ghcr.io/kreuzberg-dev/kreuzberg:latest cache clear
See CLI Usage Guide for complete CLI documentation.
3. MCP Server Mode¶
Run Kreuzberg as a Model Context Protocol server for AI agent integration.
Start MCP Server:
With Configuration:
docker run \
  -v "$(pwd)/kreuzberg.toml":/config/kreuzberg.toml \
  ghcr.io/kreuzberg-dev/kreuzberg:latest \
  mcp --config /config/kreuzberg.toml
See API Server Guide - MCP Section for integration examples.
Architecture¶
Multi-Stage Build¶
Kreuzberg Docker images use multi-stage builds for optimal size and security:
- Builder Stage: Compiles Rust binary with all dependencies
- Runtime Stage: Minimal Debian Trixie slim base with only runtime dependencies
Benefits:
- No build tools or intermediate artifacts in final image
- Smaller image size (builder stage not included)
- Reduced attack surface
Rust Core¶
Docker images use the native Rust core directly, providing:
- Memory efficiency through streaming parsers for large files
- Async processing with Tokio runtime
- Zero-copy operations where possible
Multi-Architecture Support¶
Images are built for multiple architectures:
- linux/amd64 (x86_64)
- linux/arm64 (aarch64)
Architecture-specific binaries are automatically selected during build.
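To see which of the two published platforms matches a given host, the machine name reported by `uname -m` can be mapped to a Docker platform string. A minimal sketch, assuming the helper name `platform_for` (invented for this example):

```shell
#!/bin/sh
# Map a machine name (as printed by `uname -m`) to a Docker platform string.
# platform_for is an illustrative helper name.
platform_for() {
    case "$1" in
        x86_64 | amd64)  echo "linux/amd64" ;;
        aarch64 | arm64) echo "linux/arm64" ;;
        *)               echo "unsupported"; return 1 ;;
    esac
}

# Report the platform for the current host.
platform_for "$(uname -m)" || echo "no published image for this architecture" >&2
```

Docker normally picks the matching platform on its own when pulling; the resulting string is only needed when a platform has to be requested explicitly, for example with `docker pull --platform linux/arm64`.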
Security Features¶
Non-Root User:
Security Options:
# Run with additional security constraints
docker run --security-opt no-new-privileges \
  --read-only \
  --tmpfs /tmp \
  -p 8000:8000 \
  ghcr.io/kreuzberg-dev/kreuzberg:latest
Production Deployment¶
Docker Compose¶
Basic Configuration:
version: "3.8"
services:
  kreuzberg-api:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest
    ports:
      - "8000:8000"
    environment:
      - KREUZBERG_CORS_ORIGINS=https://myapp.com,https://api.myapp.com
      - KREUZBERG_MAX_UPLOAD_SIZE_MB=500
      - RUST_LOG=info
    volumes:
      - ./config:/config
      - cache-data:/app/.kreuzberg
    command: serve --host 0.0.0.0 --port 8000 --config /config/kreuzberg.toml
    restart: unless-stopped
    healthcheck:
      # Confirms the binary runs; it does not probe the HTTP server itself
      test: ["CMD", "kreuzberg", "--version"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s
volumes:
  cache-data:
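The retry and interval settings of the healthcheck can be mirrored in a deployment script that waits for the service before sending traffic. A minimal sketch, assuming a helper named `wait_until_ok` (invented for this example) that reruns any probe command until it succeeds:

```shell
#!/bin/sh
# Retry a probe command until it succeeds or the retry budget is used up.
# wait_until_ok is an illustrative helper name.
# Usage: wait_until_ok <retries> <interval-seconds> <command> [args...]
wait_until_ok() {
    retries="$1"
    interval="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$retries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Sleep only if another attempt follows.
        if [ "$attempt" -le "$retries" ]; then
            sleep "$interval"
        fi
    done
    return 1
}

# Example probe against the /health endpoint used by the Kubernetes probes
# in this guide (assumes curl is installed on the host):
#   wait_until_ok 3 30 curl -fsS http://localhost:8000/health
```

Passing the probe as a command keeps the helper independent of how health is checked, so the same loop works with `curl`, `docker inspect`, or any other check.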
With Full Features (Full Image):
services:
  kreuzberg-full:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest
    ports:
      - "8000:8000"
    environment:
      - KREUZBERG_CORS_ORIGINS=https://myapp.com
    volumes:
      - cache-data:/app/.kreuzberg
    restart: unless-stopped
# Named volumes referenced by a service must be declared at the top level
volumes:
  cache-data:
Start Services:
Kubernetes Deployment¶
For complete Kubernetes deployment guidance including OCR configuration, permissions, and troubleshooting, see the Kubernetes Deployment Guide.
Quick Deployment Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kreuzberg-api
  labels:
    app: kreuzberg
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kreuzberg
  template:
    metadata:
      labels:
        app: kreuzberg
    spec:
      containers:
        - name: kreuzberg
          image: ghcr.io/kreuzberg-dev/kreuzberg:latest
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: KREUZBERG_CORS_ORIGINS
              value: "https://myapp.com"
            - name: KREUZBERG_MAX_UPLOAD_SIZE_MB
              value: "500"
            - name: RUST_LOG
              value: "info"
            - name: TESSDATA_PREFIX
              value: "/usr/share/tesseract-ocr/5/tessdata"
          args: ["serve", "--host", "0.0.0.0", "--port", "8000"]
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          volumeMounts:
            - name: cache
              mountPath: /app/.kreuzberg
      volumes:
        - name: cache
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: kreuzberg-api
spec:
  selector:
    app: kreuzberg
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer
Apply Configuration:
Kubernetes-Specific Configuration
This quick example includes the critical TESSDATA_PREFIX environment variable needed for OCR. The path /usr/share/tesseract-ocr/5/tessdata is for Tesseract 5.x (shipped with Debian Trixie). If using a different base image, verify your Tesseract version with tesseract --version and adjust the path accordingly. For production deployments with custom configurations, permissions handling, and health checks, refer to the comprehensive Kubernetes guide.
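A quick way to confirm that a TESSDATA_PREFIX value is usable is to check that the directory contains the trained-data file for a language you need. Tesseract stores these as `<lang>.traineddata`. The sketch below assumes a helper named `has_traineddata` (invented for this example) and would be run inside the container or pod:

```shell
#!/bin/sh
# Check that a tessdata directory contains the trained data for a language.
# has_traineddata is an illustrative helper, not a Kreuzberg or Tesseract command.
has_traineddata() {
    dir="$1"
    lang="$2"
    [ -f "$dir/$lang.traineddata" ]
}

# Fall back to the path used in this guide when the variable is unset.
prefix="${TESSDATA_PREFIX:-/usr/share/tesseract-ocr/5/tessdata}"
if has_traineddata "$prefix" eng; then
    echo "ok: eng.traineddata found in $prefix"
else
    echo "missing: eng.traineddata not found in $prefix"
fi
```

If the file is reported missing, the path most likely does not match the installed Tesseract version and should be adjusted as described above.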
Environment Variables¶
Configure Docker containers via environment variables:
Upload Limits:
CORS Configuration:
# Comma-separated list of allowed origins
KREUZBERG_CORS_ORIGINS="https://app.example.com,https://api.example.com"
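Because the value is a single comma-separated string, a stray space or a missing scheme is easy to overlook. The sketch below checks a list before it is handed to the container. The helper name `valid_origins` is invented for this example, and the `http://` or `https://` rule is an assumption about what a well-formed origin looks like, not Kreuzberg's own validation.

```shell
#!/bin/sh
# Validate a comma-separated origin list before passing it to the container.
# valid_origins is an illustrative helper; the scheme check is an assumption,
# not Kreuzberg's own validation logic.
valid_origins() {
    [ -n "$1" ] || return 1     # reject an empty list
    old_ifs="$IFS"
    IFS=','
    set -f                      # disable globbing while splitting
    for origin in $1; do
        case "$origin" in
            http://?* | https://?*) ;;   # looks like scheme://host
            *) IFS="$old_ifs"; set +f; return 1 ;;
        esac
    done
    IFS="$old_ifs"
    set +f
    return 0
}

origins="https://app.example.com,https://api.example.com"
if valid_origins "$origins"; then
    echo "KREUZBERG_CORS_ORIGINS=$origins"
fi
```

A list written as `https://a.example.com, https://b.example.com` fails this check because the second entry starts with a space.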
Logging:
Cache Configuration:
KREUZBERG_CACHE_DIR=/app/.kreuzberg # Main cache directory
HF_HOME=/app/.kreuzberg/huggingface # HuggingFace/ONNX model cache
Cache Directory Structure:
/app/.kreuzberg/
├── huggingface/ # Embedding models (downloaded on first use, ~90MB-1.2GB)
├── embeddings/ # ONNX runtime cache
└── ocr/ # OCR result cache
Model Downloads
Embedding models are downloaded on first use when embeddings features are enabled. The download size varies by preset (~90MB for small models, ~1.2GB for large models). For production deployments, consider using a persistent volume for the cache directory.
Note: Server host and port are configured via CLI arguments (serve --host 0.0.0.0 --port 8000), not environment variables.
Volume Mounts¶
Cache Persistence:
# Mount cache directory for persistence
docker run -p 8000:8000 \
  -v kreuzberg-cache:/app/.kreuzberg \
  ghcr.io/kreuzberg-dev/kreuzberg:latest
Configuration Files:
# Mount configuration file
docker run -p 8000:8000 \
  -v "$(pwd)/kreuzberg.toml":/config/kreuzberg.toml \
  ghcr.io/kreuzberg-dev/kreuzberg:latest \
  serve --config /config/kreuzberg.toml
File Processing:
# Mount documents directory (read-only)
docker run -v "$(pwd)/documents":/data:ro \
  ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf
Image Comparison¶
| Feature | Core | Full | Difference |
|---|---|---|---|
| Base Image | debian:trixie-slim | debian:trixie-slim | - |
| Size | ~1.0-1.3GB | ~1.5-2.1GB | ~500-800MB |
| Tesseract OCR | ✅ 12 languages | ✅ 12 languages | - |
| pdfium | ✅ | ✅ | - |
| Modern Office | ✅ DOCX, PPTX, XLSX | ✅ DOCX, PPTX, XLSX | - |
| Legacy Office | ✅ DOC, PPT, XLS (native) | ✅ DOC, PPT, XLS (native) | - |
| Pull Time | ~30s | ~45s | ~15s slower |
| Startup Time | ~1s | ~1s | Negligible |
Building Custom Images¶
Building from Source¶
Clone the repository and build:
Custom Dockerfiles¶
Create a custom Dockerfile based on official images:
FROM ghcr.io/kreuzberg-dev/kreuzberg:latest
# Install additional system dependencies
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      your-package-here && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Switch back to non-root user
USER kreuzberg
# Custom configuration
COPY kreuzberg.toml /app/kreuzberg.toml
# Default arguments passed to the image's entrypoint
CMD ["serve", "--config", "/app/kreuzberg.toml"]
Performance Tuning¶
Resource Allocation¶
Recommended Resources:
| Workload | Memory | CPU | Notes |
|---|---|---|---|
| Light | 512MB | 0.5 cores | Small documents, low concurrency |
| Medium | 1GB | 1 core | Typical documents, moderate concurrency |
| Heavy | 2GB+ | 2+ cores | Large documents, OCR, high concurrency |
Docker Run:
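A minimal sketch that maps the workload tiers in the table above to `docker run` resource flags. The helper name `resource_flags` is invented for this example, and the heavy tier uses the table's lower bound (2GB, 2 cores) as a starting point.

```shell
#!/bin/sh
# Translate a workload tier from the table into `docker run` resource flags.
# resource_flags is an illustrative helper name.
resource_flags() {
    case "$1" in
        light)  echo "--memory=512m --cpus=0.5" ;;
        medium) echo "--memory=1g --cpus=1" ;;
        heavy)  echo "--memory=2g --cpus=2" ;;
        *)      echo "unknown workload: $1" >&2; return 1 ;;
    esac
}

# Print the full command for review instead of executing it.
echo "docker run -p 8000:8000 $(resource_flags medium) ghcr.io/kreuzberg-dev/kreuzberg:latest"
```

Printing the command first makes it easy to review the limits before starting a container with them.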
Docker Compose:
services:
  kreuzberg:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: "1"
        reservations:
          memory: 512M
          cpus: "0.5"
Scaling¶
Horizontal Scaling:
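# Scale to 5 replicas
# A fixed host port mapping such as "8000:8000" can be bound by only one replica per host;
# drop the host port or put a reverse proxy in front before scaling
docker-compose up -d --scale kreuzberg-api=5
# Kubernetes
kubectl scale deployment kreuzberg-api --replicas=5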
Load Balancing:
- Use reverse proxy (Nginx, Caddy, Traefik)
- Kubernetes Service with LoadBalancer type
- Docker Swarm mode
Troubleshooting¶
Container Won't Start¶
Check logs:
Common Issues:
- Port already in use: Change the -p mapping
- Insufficient permissions: Ensure volume mounts have correct permissions
- Memory limit too low: Increase the --memory limit
Permission Errors¶
Images run as non-root user kreuzberg (UID 1000). Ensure mounted volumes have correct permissions:
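Before mounting a host directory, it can help to compare its numeric owner with the container user's UID. A minimal sketch, assuming helper names `owner_uid` and `check_owner` (both invented for this example):

```shell
#!/bin/sh
# Report whether a host directory is owned by the container user's UID.
# owner_uid and check_owner are illustrative helper names.
owner_uid() {
    # `ls -nd` prints the numeric owner UID in the third column.
    ls -nd "$1" | awk '{print $3}'
}

check_owner() {
    dir="$1"
    want="${2:-1000}"   # UID of the kreuzberg user, per this guide
    have="$(owner_uid "$dir")"
    if [ "$have" = "$want" ]; then
        echo "ok"
    else
        echo "mismatch: $dir is owned by UID $have, expected $want"
        return 1
    fi
}
```

For example, `check_owner ./cache` reports a mismatch when the directory belongs to another user. In that case, changing ownership on the host (for instance `sudo chown -R 1000:1000 ./cache`, where `./cache` is a placeholder path) is one common fix.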
Large File Processing¶
Increase memory limit:
Increase upload size:
docker run -p 8000:8000 \
  -e KREUZBERG_MAX_UPLOAD_SIZE_MB=1000 \
  ghcr.io/kreuzberg-dev/kreuzberg:latest
Legacy Office Format Support
Since Kreuzberg 4.3, legacy Office formats (.doc, .ppt, .xls) are extracted natively via OLE/CFB parsing without requiring external tools. Both Core and Full images support these formats.
Next Steps¶
- Kubernetes Deployment - Production Kubernetes deployments with OCR configuration and troubleshooting
- API Server Guide - Complete API documentation
- CLI Usage - Command-line interface
- Configuration - Configuration options
- Advanced Features - Chunking, language detection, token reduction