Environment Variables Reference¶
Configuration precedence in Kreuzberg follows this order (highest to lowest):
- Environment Variables - Highest priority, overrides all other sources
- Configuration Files - TOML, YAML, or JSON config files
- Defaults - Built-in sensible defaults
This document covers all KREUZBERG_* environment variables for version 4.3.8.
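The precedence order above can be pictured as a three-way fallback. The following is a minimal sketch, not Kreuzberg's implementation: resolve_setting is a hypothetical helper, and treating an empty string as "not set" is an assumption the source does not confirm.

```bash
# resolve_setting ENV_VALUE FILE_VALUE DEFAULT_VALUE
# Hypothetical helper: prints the first non-empty argument,
# mirroring the order env var > config file > built-in default.
resolve_setting() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  elif [ -n "$2" ]; then
    printf '%s\n' "$2"
  else
    printf '%s\n' "$3"
  fi
}

# If KREUZBERG_OCR_LANGUAGE is unset, the file value "deu" beats the default "eng".
resolve_setting "${KREUZBERG_OCR_LANGUAGE:-}" "deu" "eng"
```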
When to Use Environment Variables¶
Environment variables are ideal for:
- Container/Cloud Deployments: Docker, Kubernetes, serverless environments where config files are impractical
- CI/CD Pipelines: Override settings per environment (dev, staging, production)
- Simple Overrides: Changing one or two settings without managing a config file
- Secrets Management: Using secret management systems that inject values as env vars
For complex configurations with many settings, configuration files are recommended:
# kreuzberg.toml is cleaner for multiple settings
[ocr]
language = "eng"
backend = "tesseract"
[chunking]
max_chars = 2000
max_overlap = 300
API Server Configuration¶
These variables control the Kreuzberg server's network behavior and request handling.
KREUZBERG_HOST¶
Type: String
Default: 127.0.0.1
Valid Values: Any IPv4 or IPv6 address, or hostname
The server bind address. Use 0.0.0.0 to listen on all interfaces.
# Listen only on localhost (default)
export KREUZBERG_HOST=127.0.0.1
# Listen on all interfaces (Docker, cloud deployments)
export KREUZBERG_HOST=0.0.0.0
# Listen on specific interface
export KREUZBERG_HOST=192.168.1.100
KREUZBERG_PORT¶
Type: u16 (1-65535)
Default: 8000
The server port number.
Error: A value that is not a valid u16 is rejected (port must be a valid u16 number).
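A deployment script can catch a bad port before the server starts. The sketch below assumes the documented range of 1-65535; is_valid_port is a hypothetical helper and its messages are not Kreuzberg's.

```bash
# is_valid_port VALUE
# Hypothetical pre-flight check: succeeds only for a decimal integer in 1-65535.
is_valid_port() {
  local n
  case "$1" in
    '' | *[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  # Strip leading zeros; an all-zero value becomes empty and is rejected.
  n="${1#"${1%%[!0]*}"}"
  [ -n "$n" ] && [ "${#n}" -le 5 ] && [ "$n" -le 65535 ]
}

if is_valid_port "${KREUZBERG_PORT:-8000}"; then
  echo "port ok"
else
  echo "invalid KREUZBERG_PORT: ${KREUZBERG_PORT:-}" >&2
fi
```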
KREUZBERG_CORS_ORIGINS¶
Type: String (comma-separated list)
Default: Empty (allows all origins)
Whitelist of allowed CORS origins. When empty, the server accepts requests from any origin.
# Allow all origins (default)
# unset KREUZBERG_CORS_ORIGINS
# Allow specific origins
export KREUZBERG_CORS_ORIGINS="https://api.example.com, https://app.example.com"
# Single origin
export KREUZBERG_CORS_ORIGINS="https://trusted.com"
Security Warning: Be explicit with CORS origins in production. Allowing all origins (*) means any website can call your API on behalf of users. In Kreuzberg, an empty list allows all origins - be intentional about this choice.
# Production: Restrict to known origins
export KREUZBERG_CORS_ORIGINS="https://app.mycompany.com, https://admin.mycompany.com"
# Development: Can use wildcard, but understand the security implications
# Don't use wildcard in production unless absolutely necessary
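Before deploying, it can help to list the configured origins one per line and flag any that are not HTTPS. This is a sketch under assumptions: list_origins is a hypothetical helper, and trimming the spaces around each entry is a choice made here, not documented Kreuzberg behavior.

```bash
# list_origins RAW
# Hypothetical helper: prints one origin per line with surrounding spaces removed.
list_origins() {
  local rest="$1," origin
  while [ -n "$rest" ]; do
    origin="${rest%%,*}"                    # text before the first comma
    rest="${rest#*,}"                       # everything after it
    origin="${origin#"${origin%%[! ]*}"}"   # trim leading spaces
    origin="${origin%"${origin##*[! ]}"}"   # trim trailing spaces
    if [ -n "$origin" ]; then
      printf '%s\n' "$origin"
    fi
  done
}

# Warn about origins that are not served over HTTPS.
list_origins "${KREUZBERG_CORS_ORIGINS:-}" | while IFS= read -r origin; do
  case "$origin" in
    https://*) ;;
    *) echo "warning: non-HTTPS origin: $origin" >&2 ;;
  esac
done
```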
KREUZBERG_MAX_REQUEST_BODY_BYTES¶
Type: usize (bytes)
Default: 104857600 (100 MB)
Maximum size of HTTP request bodies. Prevents oversized requests from consuming server resources.
# 50 MB
export KREUZBERG_MAX_REQUEST_BODY_BYTES=52428800
# 200 MB
export KREUZBERG_MAX_REQUEST_BODY_BYTES=209715200
# 500 MB
export KREUZBERG_MAX_REQUEST_BODY_BYTES=524288000
Note: Both KREUZBERG_MAX_REQUEST_BODY_BYTES and KREUZBERG_MAX_MULTIPART_FIELD_BYTES control upload limits. Adjust both for consistent behavior.
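Byte values are easier to review when they are computed from megabytes. A small sketch (mb_to_bytes is a hypothetical helper) that sets both limits to the same value:

```bash
# mb_to_bytes MEGABYTES
# Hypothetical helper: converts mebibytes to bytes (1 MB = 1048576 bytes here).
mb_to_bytes() {
  printf '%s\n' "$(( $1 * 1024 * 1024 ))"
}

limit="$(mb_to_bytes 200)"   # 209715200
export KREUZBERG_MAX_REQUEST_BODY_BYTES="$limit"
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES="$limit"
```

Keeping the two variables tied to one value avoids the case where a request passes one limit and is rejected by the other.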
KREUZBERG_MAX_MULTIPART_FIELD_BYTES¶
Type: usize (bytes)
Default: 104857600 (100 MB)
Maximum size of individual multipart form fields. Controls the size of file uploads in multipart requests.
# 100 MB (default)
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=104857600
# 500 MB for large document processing
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=524288000
# 1 GB for extreme cases
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=1073741824
Extraction Configuration¶
These variables control document extraction behavior, including OCR, text chunking, and caching.
KREUZBERG_OCR_LANGUAGE¶
Type: String (ISO 639-1 or 639-3 language code)
Default: eng (English)
OCR language for scanned documents. Must be a valid language code recognized by the OCR backend.
# English (default)
export KREUZBERG_OCR_LANGUAGE=eng
# German
export KREUZBERG_OCR_LANGUAGE=deu
# French
export KREUZBERG_OCR_LANGUAGE=fra
# Spanish
export KREUZBERG_OCR_LANGUAGE=spa
# Chinese (Simplified)
export KREUZBERG_OCR_LANGUAGE=chi_sim
# Japanese
export KREUZBERG_OCR_LANGUAGE=jpn
Supported Codes: Language codes are backend-agnostic. Kreuzberg accepts any of the following formats regardless of the configured backend and maps them to the form that backend expects:
- Tesseract codes (ISO 639-3): eng, deu, fra, spa, ita, por, rus, chi_sim, chi_tra, jpn, kor
- PaddleOCR codes: en, ch, french, german, korean, thai, greek, cyrillic, latin, arabic, devanagari, tamil, telugu
- ISO 639-1 codes: en, de, fr, es, ja, ko, zh, ru, ar, th, el
KREUZBERG_OCR_BACKEND¶
Type: String
Default: tesseract
Valid Values: tesseract, easyocr, paddleocr
OCR engine to use for text extraction from images and scanned documents.
# Tesseract (open source, good for English)
export KREUZBERG_OCR_BACKEND=tesseract
# EasyOCR (better multilingual support, slower)
export KREUZBERG_OCR_BACKEND=easyocr
# PaddleOCR (fast, good accuracy across languages)
export KREUZBERG_OCR_BACKEND=paddleocr
Performance Notes:
- tesseract: Fastest, best for English and Latin scripts
- easyocr: Slower, excellent multilingual support
- paddleocr: Fast with good accuracy for many languages
KREUZBERG_CHUNKING_MAX_CHARS¶
Type: usize (positive integer)
Default: 1000 (characters)
Maximum number of characters per text chunk. Smaller chunks fit more easily into limited LLM context windows.
# Small chunks for token-constrained LLMs
export KREUZBERG_CHUNKING_MAX_CHARS=512
# Default: balanced for most use cases
export KREUZBERG_CHUNKING_MAX_CHARS=1000
# Larger chunks for fewer splits
export KREUZBERG_CHUNKING_MAX_CHARS=2000
# Very large chunks for comprehensive context
export KREUZBERG_CHUNKING_MAX_CHARS=4000
Validation: Must be greater than 0. Must be greater than KREUZBERG_CHUNKING_MAX_OVERLAP.
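The two rules can be checked in a startup script before Kreuzberg runs. In this sketch, check_chunking is a hypothetical helper and the error text is illustrative, not the message Kreuzberg prints; the fallback values are the documented defaults.

```bash
# check_chunking MAX_CHARS MAX_OVERLAP
# Hypothetical pre-flight check for the two documented rules:
# max_chars > 0 and max_overlap < max_chars.
check_chunking() {
  local max_chars="$1" max_overlap="$2"
  if [ "$max_chars" -le 0 ]; then
    echo "max_chars must be greater than 0" >&2
    return 1
  fi
  if [ "$max_overlap" -ge "$max_chars" ]; then
    echo "max_overlap ($max_overlap) must be less than max_chars ($max_chars)" >&2
    return 1
  fi
}

check_chunking "${KREUZBERG_CHUNKING_MAX_CHARS:-1000}" "${KREUZBERG_CHUNKING_MAX_OVERLAP:-200}"
```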
KREUZBERG_CHUNKING_MAX_OVERLAP¶
Type: usize (non-negative integer)
Default: 200 (characters)
Character overlap between consecutive chunks. Maintains context across chunk boundaries.
# No overlap (creates discontinuities)
export KREUZBERG_CHUNKING_MAX_OVERLAP=0
# Default: 20% overlap with 1000-char chunks
export KREUZBERG_CHUNKING_MAX_OVERLAP=200
# More overlap: 30% for better context continuity
export KREUZBERG_CHUNKING_MAX_OVERLAP=300
# High overlap for sensitive documents
export KREUZBERG_CHUNKING_MAX_OVERLAP=500
Validation: Must be less than KREUZBERG_CHUNKING_MAX_CHARS.
Example Error:
KREUZBERG_CACHE_ENABLED¶
Type: Boolean (true or false, case-insensitive)
Default: true
Enable or disable extraction result caching. The cache stores results so that identical documents are not reprocessed.
# Enable cache (default, recommended for production)
export KREUZBERG_CACHE_ENABLED=true
# Disable cache (development, testing, or when cache is problematic)
export KREUZBERG_CACHE_ENABLED=false
# Case insensitive
export KREUZBERG_CACHE_ENABLED=TRUE
export KREUZBERG_CACHE_ENABLED=False
KREUZBERG_OUTPUT_FORMAT¶
Type: String
Default: plain
Valid Values: plain, markdown, djot, html
Controls how the extracted text content is formatted in the result output.
# Plain text content only (default)
export KREUZBERG_OUTPUT_FORMAT=plain
# Markdown formatted output
export KREUZBERG_OUTPUT_FORMAT=markdown
# Djot markup format
export KREUZBERG_OUTPUT_FORMAT=djot
# HTML formatted output
export KREUZBERG_OUTPUT_FORMAT=html
Use Cases:
| Format | Use Case |
|---|---|
| plain | Raw extracted text without formatting |
| markdown | Structured text with headings, lists, emphasis (RAG, LLM input) |
| djot | Lightweight markup, alternative to Markdown |
| html | Rich formatted output for web display |
Example:
KREUZBERG_TOKEN_REDUCTION_MODE¶
Type: String
Default: off
Valid Values: off, light, moderate, aggressive, maximum
Token reduction aggressiveness for compressing extracted text while preserving meaning. Useful when working with token-limited LLMs.
# No reduction (keep all text as-is)
export KREUZBERG_TOKEN_REDUCTION_MODE=off
# Light reduction: Remove common stopwords, minimal impact
export KREUZBERG_TOKEN_REDUCTION_MODE=light
# Moderate reduction: Balance between compression and meaning preservation
export KREUZBERG_TOKEN_REDUCTION_MODE=moderate
# Aggressive reduction: Significant compression, some detail loss
export KREUZBERG_TOKEN_REDUCTION_MODE=aggressive
# Maximum reduction: Extreme compression for token-constrained scenarios
export KREUZBERG_TOKEN_REDUCTION_MODE=maximum
Impact on Tokens:
| Mode | Typical Reduction | Use Case |
|---|---|---|
| off | 0% | Full preservation, no compression |
| light | 10-15% | Minimal impact, clean up obvious redundancy |
| moderate | 25-35% | Balanced approach for most scenarios |
| aggressive | 40-50% | Significant compression, still readable |
| maximum | 50-70% | Extreme compression, lose some detail |
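To see what these percentages mean for a token budget, the ranges in the table can be applied to a token count. The sketch below is plain arithmetic on the "typical" figures; estimate_remaining is a hypothetical helper, and actual reduction depends on the document.

```bash
# estimate_remaining TOKENS MODE
# Hypothetical helper: prints the expected range of remaining tokens,
# using the typical reduction percentages from the table.
estimate_remaining() {
  local tokens="$1" mode="$2" lo hi
  case "$mode" in
    off)        lo=0;  hi=0 ;;
    light)      lo=10; hi=15 ;;
    moderate)   lo=25; hi=35 ;;
    aggressive) lo=40; hi=50 ;;
    maximum)    lo=50; hi=70 ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  # The highest reduction leaves the fewest tokens, so it gives the lower bound.
  printf '%s-%s\n' "$(( tokens * (100 - hi) / 100 ))" "$(( tokens * (100 - lo) / 100 ))"
}

estimate_remaining 10000 moderate   # prints 6500-7500
```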
Runtime Configuration¶
Control cache location, debug output, and runtime behavior.
KREUZBERG_CACHE_DIR¶
Type: String (file system path)
Default: Platform-specific global cache directory
Override the default cache directory for storing extraction cache, models, and intermediate files. When unset, Kreuzberg uses a platform-appropriate global cache:
- Linux: ~/.cache/kreuzberg/ (or $XDG_CACHE_HOME/kreuzberg/)
- macOS: ~/Library/Caches/kreuzberg/
- Windows: %LOCALAPPDATA%/kreuzberg/
If the platform cache directory cannot be determined, Kreuzberg falls back to ~/.cache/kreuzberg/, then .kreuzberg/ in the current working directory as a last resort.
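To predict where the cache will land on a given host, the Linux and macOS rules can be approximated in a few lines. This is a sketch, not Kreuzberg's lookup code: resolve_cache_dir is a hypothetical helper, it ignores Windows and the last-resort .kreuzberg/ fallback, and it assumes an empty KREUZBERG_CACHE_DIR behaves like an unset one.

```bash
# resolve_cache_dir [OS_NAME]
# Hypothetical helper approximating the Linux and macOS rules listed above.
# OS_NAME defaults to the output of `uname -s`.
resolve_cache_dir() {
  local os="${1:-$(uname -s)}"
  if [ -n "${KREUZBERG_CACHE_DIR:-}" ]; then
    printf '%s\n' "$KREUZBERG_CACHE_DIR"                        # explicit override wins
  elif [ "$os" = "Darwin" ]; then
    printf '%s\n' "$HOME/Library/Caches/kreuzberg"              # macOS
  else
    printf '%s\n' "${XDG_CACHE_HOME:-$HOME/.cache}/kreuzberg"   # Linux and others
  fi
}

resolve_cache_dir
```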
# Default: uses platform-specific global cache (recommended)
# unset KREUZBERG_CACHE_DIR
# Store cache in specific location
export KREUZBERG_CACHE_DIR=/var/cache/kreuzberg
# Docker: Use volume mount
export KREUZBERG_CACHE_DIR=/data/kreuzberg-cache
# Development: Quick local cleanup
export KREUZBERG_CACHE_DIR=/tmp/kreuzberg-cache
Directory Structure: Kreuzberg creates subdirectories for different cache types:
$KREUZBERG_CACHE_DIR/
  ocr/          # OCR result cache
  embeddings/   # Chunk embedding cache
  extractions/  # Full extraction cache
KREUZBERG_CI_DEBUG¶
Type: Boolean (presence check: set to any value to enable)
Default: Disabled (unset)
Enable detailed debug logging for CI environments. Outputs step-by-step timing and parameter information for OCR operations.
# Enable CI debug output
export KREUZBERG_CI_DEBUG=1
export KREUZBERG_CI_DEBUG=true
export KREUZBERG_CI_DEBUG=yes
# Output example:
# [kreuzberg::ocr] perform_ocr:start bytes=1024000 language=eng output=text use_cache=true
# [kreuzberg::ocr] perform_ocr:end duration_ms=2534
Use Cases:
- Debugging slow OCR operations
- Tracing cache hits/misses
- Performance profiling in CI pipelines
- Understanding extraction pipeline behavior
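For profiling in CI, the durations can be pulled out of the debug log with standard tools. The sketch below assumes log lines look like the output example above, which may not cover every line Kreuzberg emits; extract_durations is a hypothetical helper.

```bash
# extract_durations
# Hypothetical helper: reads log lines on stdin and prints the duration_ms
# value of every perform_ocr:end line, one per line.
extract_durations() {
  sed -n 's/.*perform_ocr:end.*duration_ms=\([0-9][0-9]*\).*/\1/p'
}

printf '%s\n' \
  '[kreuzberg::ocr] perform_ocr:start bytes=1024000 language=eng output=text use_cache=true' \
  '[kreuzberg::ocr] perform_ocr:end duration_ms=2534' | extract_durations   # prints 2534
```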
KREUZBERG_DEBUG_OCR¶
Type: Boolean (presence check: set to any value to enable)
Default: Disabled (unset)
Enable OCR-specific debug output. Outputs diagnostic information about OCR decisions, fallbacks, and text coverage metrics.
# Enable OCR debug logging
export KREUZBERG_DEBUG_OCR=1
# Output example:
# [kreuzberg::pdf::ocr] fallback=true non_whitespace=8543 alnum=7234 meaningful_words=312
# [kreuzberg::pdf::ocr] avg_non_whitespace=45.2 avg_alnum=38.1 alnum_ratio=0.847
Diagnostic Information:
- Whether OCR fallback was triggered
- Character counts (non-whitespace, alphanumeric)
- Word counts and coverage ratios
- Coverage thresholds and decisions
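The numbers in the output example are consistent with alnum_ratio being the alphanumeric count divided by the non-whitespace count (7234 / 8543 ≈ 0.847). That relationship is inferred from the sample values, not stated by the source, so treat the following as a worked check rather than a definition; alnum_ratio here is a hypothetical shell helper.

```bash
# alnum_ratio ALNUM NON_WHITESPACE
# Hypothetical helper: divides the two counts and rounds to three decimals.
# Assumes NON_WHITESPACE is greater than zero.
alnum_ratio() {
  awk -v alnum="$1" -v non_ws="$2" 'BEGIN { printf "%.3f\n", alnum / non_ws }'
}

alnum_ratio 7234 8543   # prints 0.847
```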
Memory & Performance¶
Configure caching for string encoding operations to optimize performance.
KREUZBERG_ENCODING_CACHE_MAX_ENTRIES¶
Type: usize (positive integer)
Default: 10000
Maximum number of strings cached in the encoding cache. Each entry consumes memory proportional to string length.
# Default: reasonable for most applications
export KREUZBERG_ENCODING_CACHE_MAX_ENTRIES=10000
# Higher for very large batches
export KREUZBERG_ENCODING_CACHE_MAX_ENTRIES=50000
# Lower to reduce memory usage
export KREUZBERG_ENCODING_CACHE_MAX_ENTRIES=1000
KREUZBERG_ENCODING_CACHE_MAX_BYTES¶
Type: usize (bytes)
Default: 104857600 (100 MB)
Maximum total size of cached strings in bytes. Once exceeded, least-used entries are evicted.
# Default: 100 MB
export KREUZBERG_ENCODING_CACHE_MAX_BYTES=104857600
# Larger cache for high-throughput scenarios
export KREUZBERG_ENCODING_CACHE_MAX_BYTES=524288000 # 500 MB
# Smaller cache for memory-constrained environments
export KREUZBERG_ENCODING_CACHE_MAX_BYTES=10485760 # 10 MB
LLM Integration¶
Configure LLM-powered features such as structured extraction, vision-based OCR, and provider-hosted embeddings.
KREUZBERG_LLM_MODEL¶
Type: String
Default: None (must be set explicitly or via config)
Default LLM model for structured extraction. Uses liter-llm model format (provider/model-name).
# OpenAI
export KREUZBERG_LLM_MODEL=openai/gpt-4o-mini
# Anthropic
export KREUZBERG_LLM_MODEL=anthropic/claude-sonnet-4-20250514
# Local provider
export KREUZBERG_LLM_MODEL=ollama/llama3
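When a script needs to branch on the provider, the provider/model-name format can be split at the first slash. A sketch with two hypothetical helpers, model_provider and model_name; if the value contains no slash, both print the whole string.

```bash
# model_provider VALUE: prints the text before the first slash.
model_provider() {
  printf '%s\n' "${1%%/*}"
}

# model_name VALUE: prints the text after the first slash.
model_name() {
  printf '%s\n' "${1#*/}"
}

model="${KREUZBERG_LLM_MODEL:-openai/gpt-4o-mini}"
echo "provider=$(model_provider "$model") model=$(model_name "$model")"
```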
KREUZBERG_LLM_API_KEY¶
Type: String
Default: None
API key for the structured extraction LLM provider. When not set, liter-llm falls back to provider-standard environment variables (for example, OPENAI_API_KEY, ANTHROPIC_API_KEY).
Security Warning: Prefer using provider-standard environment variables or a secrets manager over setting this directly. This variable is provided for cases where multiple providers are used and explicit key routing is needed.
KREUZBERG_LLM_BASE_URL¶
Type: String
Default: None (uses provider default)
Custom base URL for the structured extraction LLM provider. Useful for self-hosted models, proxies, or alternative API-compatible endpoints.
# Custom OpenAI-compatible endpoint
export KREUZBERG_LLM_BASE_URL=https://api.example.com
# Local Ollama instance
export KREUZBERG_LLM_BASE_URL=http://localhost:11434
KREUZBERG_VLM_OCR_MODEL¶
Type: String
Default: None (must be set explicitly or via config)
Vision language model (VLM) used for vision-based OCR. When configured, Kreuzberg can use a vision model as an OCR backend, sending document images directly to the VLM for text extraction.
# OpenAI GPT-4o for vision OCR
export KREUZBERG_VLM_OCR_MODEL=openai/gpt-4o
# Anthropic Claude for vision OCR
export KREUZBERG_VLM_OCR_MODEL=anthropic/claude-sonnet-4-20250514
KREUZBERG_VLM_EMBEDDING_MODEL¶
Type: String
Default: None (must be set explicitly or via config)
Model used for provider-hosted embeddings. Instead of running local ONNX embedding models, Kreuzberg can delegate embedding generation to a cloud provider's embedding API.
# OpenAI embeddings
export KREUZBERG_VLM_EMBEDDING_MODEL=openai/text-embedding-3-small
# Cohere embeddings
export KREUZBERG_VLM_EMBEDDING_MODEL=cohere/embed-english-v3.0
Note: When api_key is not set in config, liter-llm falls back to provider-standard environment variables (for example, OPENAI_API_KEY, ANTHROPIC_API_KEY).
| Variable | Description | Example |
|---|---|---|
| KREUZBERG_LLM_MODEL | Default LLM model for structured extraction | openai/gpt-4o-mini |
| KREUZBERG_LLM_API_KEY | API key for structured extraction LLM provider | sk-... |
| KREUZBERG_LLM_BASE_URL | Custom base URL for structured extraction provider | https://api.example.com |
| KREUZBERG_VLM_OCR_MODEL | VLM model for vision-based OCR | openai/gpt-4o |
| KREUZBERG_VLM_EMBEDDING_MODEL | LLM model for provider-hosted embeddings | openai/text-embedding-3-small |
Testing Variables¶
Variables for development, testing, and quality assurance.
KREUZBERG_RUN_FULL_OCR¶
Type: Boolean (presence check: set to any value to enable)
Default: Disabled (skips expensive tests)
Status: Testing only
Enable expensive OCR quality tests. These tests perform full OCR on large documents and are slow: a run can take 10 minutes or more.
# Skip expensive OCR tests (default, fast test runs)
# unset KREUZBERG_RUN_FULL_OCR
# Run full OCR quality tests
export KREUZBERG_RUN_FULL_OCR=1
# In test output:
# test test_ocr_quality_multi_page_consistency ... SKIPPED
# Skipping test_ocr_quality_multi_page_consistency: set KREUZBERG_RUN_FULL_OCR=1 to enable
Warning:
- These tests can take 10+ minutes
- Require OCR backends to be installed and working
- Produce large temporary files
- Use only in CI/CD for comprehensive validation
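Because KREUZBERG_RUN_FULL_OCR is a presence check (as are KREUZBERG_CI_DEBUG and KREUZBERG_DEBUG_OCR), a value such as false or 0 still enables it; only unsetting the variable turns it off. The sketch below illustrates that distinction with a hypothetical helper, env_is_set, built on printenv.

```bash
# env_is_set NAME
# Hypothetical helper: succeeds if NAME exists in the exported environment,
# whatever its value.
env_is_set() {
  printenv "$1" >/dev/null 2>&1
}

export KREUZBERG_RUN_FULL_OCR=false
if env_is_set KREUZBERG_RUN_FULL_OCR; then
  echo "full OCR tests would run"   # reached even though the value is "false"
fi

# To disable, remove the variable instead of assigning a falsy value.
unset KREUZBERG_RUN_FULL_OCR
```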
Docker Compose Examples¶
Basic Configuration¶
version: "3.8"
services:
  kreuzberg:
    image: kreuzberg:latest
    ports:
      - "3000:3000"
    environment:
      KREUZBERG_HOST: "0.0.0.0"
      KREUZBERG_PORT: "3000"
      KREUZBERG_OCR_LANGUAGE: "eng"
      KREUZBERG_CACHE_ENABLED: "true"
Production Configuration¶
version: "3.8"
services:
  kreuzberg:
    image: kreuzberg:latest
    ports:
      - "8000:8000"
    volumes:
      - kreuzberg_cache:/data/cache
    environment:
      KREUZBERG_HOST: "0.0.0.0"
      KREUZBERG_PORT: "8000"
      KREUZBERG_CORS_ORIGINS: "https://app.example.com, https://admin.example.com"
      KREUZBERG_MAX_REQUEST_BODY_BYTES: "209715200" # 200 MB
      KREUZBERG_MAX_MULTIPART_FIELD_BYTES: "209715200"
      KREUZBERG_CACHE_DIR: "/data/cache"
      KREUZBERG_OCR_LANGUAGE: "eng"
      KREUZBERG_OCR_BACKEND: "tesseract"
      KREUZBERG_CHUNKING_MAX_CHARS: "2000"
      KREUZBERG_CHUNKING_MAX_OVERLAP: "300"
      KREUZBERG_TOKEN_REDUCTION_MODE: "moderate"
volumes:
  kreuzberg_cache:
    driver: local
Multilingual Configuration¶
version: "3.8"
services:
  kreuzberg:
    image: kreuzberg:latest
    ports:
      - "8000:8000"
    environment:
      KREUZBERG_HOST: "0.0.0.0"
      KREUZBERG_PORT: "8000"
      KREUZBERG_OCR_BACKEND: "easyocr" # Better multilingual support
      KREUZBERG_OCR_LANGUAGE: "fra" # French
      KREUZBERG_CACHE_ENABLED: "true"
Development Configuration¶
version: "3.8"
services:
  kreuzberg:
    image: kreuzberg:latest
    ports:
      - "127.0.0.1:8000:8000" # Publish on the host's loopback interface only
    environment:
      KREUZBERG_HOST: "0.0.0.0" # Bind all interfaces inside the container so the published port is reachable
      KREUZBERG_PORT: "8000"
      KREUZBERG_CACHE_ENABLED: "false" # Disable for fresh testing
      KREUZBERG_CI_DEBUG: "1" # Enable debug output
      KREUZBERG_DEBUG_OCR: "1"
      KREUZBERG_CACHE_DIR: "/tmp/kreuzberg"
Environment Variable Loading Order¶
Kreuzberg builds its configuration in this order:
- Load configuration file (TOML/YAML/JSON) if specified
- Apply environment variable overrides using apply_env_overrides()
- Validate all settings
This ensures environment variables always win over file configuration:
let mut config = ExtractionConfig::from_file("kreuzberg.toml")?;
config.apply_env_overrides()?; // Overrides file values
Common Patterns¶
Using with Config Files¶
Combine files with environment overrides for flexibility:
# Override specific values for this deployment
export KREUZBERG_OCR_LANGUAGE=deu
export KREUZBERG_CACHE_DIR=/mnt/cache
# Load the base config from file; the exported values above take precedence
kreuzberg --config kreuzberg.toml
Shell Script Initialization¶
#!/bin/bash
# Load deployment-specific settings
if [ "$ENVIRONMENT" = "production" ]; then
  export KREUZBERG_HOST="0.0.0.0"
  export KREUZBERG_CORS_ORIGINS="https://app.example.com"
  export KREUZBERG_CACHE_ENABLED="true"
  export KREUZBERG_MAX_REQUEST_BODY_BYTES=$((200 * 1048576))
elif [ "$ENVIRONMENT" = "development" ]; then
  export KREUZBERG_HOST="127.0.0.1"
  export KREUZBERG_CACHE_ENABLED="false"
  export KREUZBERG_CI_DEBUG="1"
fi
kreuzberg
Kubernetes ConfigMap¶
apiVersion: v1
kind: ConfigMap
metadata:
  name: kreuzberg-config
data:
  KREUZBERG_HOST: "0.0.0.0"
  KREUZBERG_PORT: "8000"
  KREUZBERG_CORS_ORIGINS: "https://api.example.com"
  KREUZBERG_CACHE_DIR: "/data/cache"
  KREUZBERG_OCR_BACKEND: "tesseract"
  KREUZBERG_TOKEN_REDUCTION_MODE: "moderate"
---
apiVersion: v1
kind: Pod
metadata:
  name: kreuzberg-server
spec:
  containers:
    - name: kreuzberg
      image: kreuzberg:latest
      ports:
        - containerPort: 8000
      envFrom:
        - configMapRef:
            name: kreuzberg-config
      volumeMounts:
        - name: cache
          mountPath: /data/cache
  volumes:
    - name: cache
      persistentVolumeClaim:
        claimName: kreuzberg-cache-pvc
See Also¶
- Configuration Guide - Detailed configuration file format and options
- File Size Limits - Upload and processing limits
- Types Reference - API type definitions and structures