OCR (Optical Character Recognition)¶
Extract text from images and scanned PDFs using OCR.
When OCR is Needed¶
flowchart TD
Start[Document File] --> FileType{File Type}
FileType -->|Image| ImageOCR[Always Use OCR]
FileType -->|PDF| CheckPDF{Check PDF}
FileType -->|Other| NoOCR[No OCR Needed]
CheckPDF --> ForceOCR{force_ocr=True?}
ForceOCR -->|Yes| AllPagesOCR[OCR All Pages]
ForceOCR -->|No| TextLayer{Has Text Layer?}
TextLayer -->|No Text| ScannedOCR[OCR Required]
TextLayer -->|Some Text| HybridPDF[Hybrid PDF]
TextLayer -->|All Text| NativeExtract[Native Extraction]
HybridPDF --> PageByPage[Process Pages]
PageByPage --> CheckPage{Page Has Text?}
CheckPage -->|No| PageOCR[OCR This Page]
CheckPage -->|Yes| PageNative[Native Extraction]
ImageOCR --> OCRBackend[OCR Backend Configured?]
ScannedOCR --> OCRBackend
AllPagesOCR --> OCRBackend
PageOCR --> OCRBackend
OCRBackend -->|Yes| ProcessOCR[Process with OCR]
OCRBackend -->|No| Error[MissingDependencyError]
style ImageOCR fill:#FFB6C1
style ScannedOCR fill:#FFB6C1
style AllPagesOCR fill:#FFB6C1
style PageOCR fill:#FFB6C1
style ProcessOCR fill:#90EE90
style Error fill:#FF6B6B
Kreuzberg automatically determines when OCR is required:
- Images (.png, .jpg, .tiff, .bmp, .webp) - Always require OCR
- PDFs with no text layer - Scanned documents automatically trigger OCR
- Hybrid PDFs - Pages without text are processed with OCR; others use native extraction
- Force OCR - Use force_ocr=True to OCR all pages regardless of text layer
Automatic Detection
You don't need to manually enable OCR for images. Kreuzberg detects the file type and applies OCR automatically when an OCR backend is configured.
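The decision flow above can be sketched in a few lines of Python. This is only an illustration of the documented rules, not Kreuzberg's internal implementation; the function name `plan_ocr` and its inputs are hypothetical.

```python
from typing import List


def plan_ocr(file_type: str, page_has_text: List[bool], force_ocr: bool = False) -> List[str]:
    """Return "ocr" or "native" per page, following the documented rules.

    Hypothetical helper for illustration only; not part of the Kreuzberg API.
    `page_has_text` holds one flag per PDF page (ignored for images).
    """
    if file_type == "image":
        return ["ocr"]  # images always require OCR
    if file_type != "pdf":
        return []  # other formats never need OCR
    if force_ocr:
        return ["ocr"] * len(page_has_text)  # force_ocr overrides the text layer
    # Scanned, hybrid, and native PDFs all reduce to a per-page check.
    return ["native" if has_text else "ocr" for has_text in page_has_text]


print(plan_ocr("pdf", [True, False, True]))           # ['native', 'ocr', 'native']
print(plan_ocr("pdf", [True, True], force_ocr=True))  # ['ocr', 'ocr']
```

Note that a fully scanned PDF and a fully native PDF are just the two extremes of the hybrid case: every page either has text or does not.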
OCR Backend Comparison¶
flowchart TD
Start[Choose OCR Backend] --> Platform{Platform Support}
Platform -->|All Platforms| Tesseract
Platform -->|All except WASM| PaddleOCR[PaddleOCR]
Platform -->|Python Only| EasyOCR[EasyOCR]
Tesseract --> TessPriority{Priority}
TessPriority -->|Speed| TessSpeed[Tesseract: Fast]
TessPriority -->|Accuracy| TessAccuracy[Tesseract: Good]
TessPriority -->|Production| TessProd[Tesseract: Best Choice]
PaddleOCR --> PaddlePriority{Priority}
PaddlePriority -->|Speed + Accuracy| PaddleMain[PaddleOCR: Very Fast + Excellent]
PaddlePriority -->|CJK Languages| PaddleCJK[PaddleOCR: Best for CJK]
EasyOCR --> EasyPriority{Priority}
EasyPriority -->|Highest Accuracy| Easy[EasyOCR: Excellent Accuracy]
EasyPriority -->|GPU Available| GPU[EasyOCR with GPU]
style Tesseract fill:#90EE90
style Easy fill:#FFD700
style PaddleOCR fill:#87CEEB
Kreuzberg supports three OCR backends with different strengths:
| Feature | Tesseract | EasyOCR | PaddleOCR |
|---|---|---|---|
| Speed | Fast | Moderate | Very Fast |
| Accuracy | Good | Excellent | Excellent |
| Languages | 100+ | 80+ | 80+ (11 script families) |
| Installation | System package | Python package | Feature flag (native) or Python package |
| Model Size | Small (~10MB) | Large (~100MB) | Medium (~120MB base + ~8MB per family) |
| CPU/GPU | CPU only | CPU + GPU | CPU + GPU |
| Platform Support | All | Python only | All (except WASM) |
| Best For | General use, production | High accuracy needs | Speed + accuracy, CJK languages |
Recommendation¶
- Production/CLI: Use Tesseract for simplicity and broad platform support
- Speed + Accuracy (any binding): Use PaddleOCR for fast processing with excellent accuracy, especially for CJK languages
- Python + Accuracy: Use EasyOCR for best accuracy with deep learning models (Python only)
Installation¶
Tesseract (Recommended)¶
Available on all platforms (Python, TypeScript, Rust, Ruby):
Download from GitHub releases
Additional Languages¶
# macOS
brew install tesseract-lang
# Ubuntu/Debian
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-spa # Spanish
# List all installed languages
tesseract --list-langs
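Before running OCR with a combined language string such as eng+deu+fra, it can help to check that every requested language is actually installed. The sketch below parses the text printed by `tesseract --list-langs`; the helper names are hypothetical, and the sample output is an assumption (the exact header wording differs between Tesseract versions).

```python
from typing import List, Set


def parse_list_langs(output: str) -> Set[str]:
    """Parse the text printed by `tesseract --list-langs` into language codes.

    The first line is a header ("List of available languages ..."); each
    following non-empty line is one installed language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return set(lines[1:])


def missing_languages(requested: str, installed: Set[str]) -> List[str]:
    """Return codes from a "+"-joined language string that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


# Sample output; the header line varies between Tesseract versions.
sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\ndeu\nosd\n'
installed = parse_list_langs(sample)
print(missing_languages("eng+deu+fra", installed))  # ['fra']
```

In practice the `sample` string would be replaced by the captured output of the command, for example via `subprocess.run`.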
EasyOCR (Python Only)¶
Available only in Python with deep learning models.
Python 3.14 Compatibility
EasyOCR is not supported on Python 3.14 due to upstream PyTorch compatibility. Use Python 3.10-3.13 or use Tesseract on Python 3.14.
PaddleOCR¶
PaddleOCR is available as a native Rust backend in all non-WASM bindings, and also as a Python package.
PaddleOCR is built into the native bindings via the paddle-ocr feature flag. Models are automatically downloaded on first use. No additional installation is required.
PaddleOCR Script Families¶
PaddleOCR supports 80+ languages across 11 script families (all PP-OCRv5). Recognition models are downloaded on demand from HuggingFace on first use:
- English - English, numbers, punctuation
- Chinese - Simplified Chinese, Traditional Chinese, Japanese
- Latin - French, German, Spanish, Portuguese, Italian, Polish, Dutch, Turkish, Vietnamese, etc.
- Korean - Korean (Hangul)
- Slavic - Russian, Ukrainian, Belarusian, Bulgarian, Serbian, etc.
- Thai - Thai script
- Greek - Greek script
- Arabic - Arabic, Persian, Urdu
- Devanagari - Hindi, Marathi, Sanskrit, Nepali
- Tamil - Tamil script
- Telugu - Telugu script
Per-family models are downloaded automatically and cached locally when first needed. This lazy-loading approach keeps startup time fast while supporting full multilingual capabilities.
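The lazy-loading idea can be illustrated with a small cache keyed by script family. This is a sketch under assumptions: the language codes, family names, and the `load_family_model` stand-in are hypothetical and do not reflect Kreuzberg's actual lookup table or download code.

```python
from functools import lru_cache

# Illustrative subset of a language-to-family table. The codes and family
# names here are assumptions for this sketch, not Kreuzberg's actual table.
LANGUAGE_TO_FAMILY = {
    "en": "english",
    "ch": "chinese",
    "ja": "chinese",  # Japanese is listed under the Chinese family above
    "fr": "latin",
    "de": "latin",
    "ru": "slavic",
    "ko": "korean",
}

download_log = []  # records which families were "downloaded"


@lru_cache(maxsize=None)
def load_family_model(family: str) -> str:
    """Stand-in for an on-demand model download; runs once per family."""
    download_log.append(family)
    return f"model:{family}"


def model_for_language(code: str) -> str:
    """Resolve a language code to its lazily loaded, cached family model."""
    return load_family_model(LANGUAGE_TO_FAMILY[code])


for code in ["fr", "de", "ja", "fr"]:
    model_for_language(code)

print(download_log)  # ['latin', 'chinese']
```

French and German share the Latin model, so four requests trigger only two loads, and families that are never requested are never fetched.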
Configuration¶
Basic Configuration¶
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
lang := "eng"
cfg := &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: &lang,
},
}
result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println(len(result.Content))
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import java.io.IOException;
public class Main {
public static void main(String[] args) {
try {
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.language("eng")
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
System.out.println(result.getContent());
} catch (IOException | KreuzbergException e) {
System.err.println("Extraction failed: " + e.getMessage());
}
}
}
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(backend="tesseract", language="eng")
)
result = extract_file_sync("scanned.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
library(kreuzberg)
# Configure Tesseract OCR
ocr <- ocr_config(backend = "tesseract", language = "eng", dpi = 300L)
config <- extraction_config(force_ocr = TRUE, ocr = ocr)
# Extract text from a scanned image
result <- extract_file_sync("scan.png", config = config)
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat(sprintf("Quality score: %s\n", result$quality_score))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: "eng".to_string(),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("scanned.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';
await initWasm();
await enableOcr();
const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];
if (file) {
const result = await extractFromFile(file, file.type, {
ocr: {
backend: 'kreuzberg-tesseract',
language: 'eng',
},
});
console.log(result.content);
}
import { enableOcr, extractFile, initWasm } from '@kreuzberg/wasm';
await initWasm();
await enableOcr(); // Uses native kreuzberg-tesseract backend
const result = await extractFile('./scanned_document.png', 'image/png', {
ocr: {
backend: 'kreuzberg-tesseract',
language: 'eng',
},
});
console.log(result.content);
Multiple Languages¶
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
lang := "eng+deu+fra"
result, err := kreuzberg.ExtractFileSync("multilingual.pdf", &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: &lang,
},
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println(result.Content)
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.language("eng+deu+fra")
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("multilingual.pdf", config);
System.out.println(result.getContent());
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(backend="tesseract", language="eng+deu+fra")
)
result = extract_file_sync("multilingual.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
library(kreuzberg)
# Configure multi-language OCR (English, French, German)
ocr <- ocr_config(backend = "tesseract", language = "eng+fra+deu")
config <- extraction_config(force_ocr = TRUE, ocr = ocr)
# Extract from a multilingual document
result <- extract_file_sync("multilingual.png", config = config)
cat(sprintf("Detected language: %s\n", detected_language(result)))
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: "eng+deu+fra".to_string(),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("multilingual.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';
await initWasm();
await enableOcr();
const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];
if (file) {
const result = await extractFromFile(file, file.type, {
ocr: {
backend: 'tesseract-wasm',
language: 'eng+deu', // Multiple languages
},
});
console.log(result.content);
}
Force OCR on All Pages¶
Process PDFs with OCR even when they have a text layer:
package main
import (
"fmt"
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
force := true
result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
},
ForceOCR: &force,
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
fmt.Println(result.Content)
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.build())
.forceOcr(true)
.build();
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
System.out.println(result.getContent());
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(backend="tesseract"),
force_ocr=True,
)
result = extract_file_sync("document.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
library(kreuzberg)
config <- extraction_config(force_ocr = TRUE)
result <- extract_file_sync("multipage_document.pdf", "application/pdf", config)
cat(sprintf("Total pages: %d\n", result$pages))
cat(sprintf("Content extracted via OCR: %d characters\n",
nchar(result$content)))
cat(sprintf("Detected language: %s\n", result$detected_language))
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
..Default::default()
}),
force_ocr: true,
..Default::default()
};
let result = extract_file_sync("document.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';
await initWasm();
await enableOcr();
const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];
if (file) {
const result = await extractFromFile(file, file.type, {
force_ocr: true,
ocr: {
backend: 'tesseract-wasm',
language: 'eng',
},
});
console.log(result.content);
}
Using EasyOCR (Python Only)¶
EasyOCR is only available in Python.
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(backend="easyocr", language="en", use_gpu=True)
)
result = extract_file_sync("scanned.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
library(kreuzberg)
# Note: EasyOCR backend requires Python to be installed
ocr_cfg <- ocr_config(backend = "easyocr", language = "en")
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)
result <- extract_file_sync("document.pdf", "application/pdf", config)
cat(sprintf("EasyOCR extraction:\n"))
cat(sprintf("Content length: %d characters\n", nchar(result$content)))
cat(sprintf("Detected language: %s\n", result$detected_language))
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "easyocr".to_string(),
language: "en".to_string(),
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("Extracted text: {}", result.content);
Ok(())
}
GPU Acceleration
EasyOCR and PaddleOCR support GPU acceleration via PyTorch and PaddlePaddle, respectively. Set use_gpu=True to enable it.
Using PaddleOCR¶
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
lang := "en"
cfg := &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "paddle-ocr",
Language: &lang,
},
}
result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println(len(result.Content))
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import java.io.IOException;
public class Main {
public static void main(String[] args) {
try {
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("paddle-ocr")
.language("en")
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
System.out.println(result.getContent());
} catch (IOException | KreuzbergException e) {
System.err.println("Extraction failed: " + e.getMessage());
}
}
}
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(backend="paddleocr", language="en")
)
result = extract_file_sync("scanned.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
library(kreuzberg)
# Configure PaddleOCR backend
ocr <- ocr_config(backend = "paddle-ocr", language = "en")
config <- extraction_config(force_ocr = TRUE, ocr = ocr)
# Extract text from an image using PaddleOCR
result <- extract_file_sync("document.jpg", config = config)
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat(sprintf("MIME type: %s\n", result$mime_type))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "paddleocr".to_string(),
language: "en".to_string(),
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("Extracted text: {}", result.content);
Ok(())
}
Advanced OCR Options¶
DPI Configuration¶
Control image resolution for OCR processing:
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
targetDPI := 300
result, err := kreuzberg.ExtractFileSync("scanned.pdf", &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Tesseract: &kreuzberg.TesseractConfig{
Preprocessing: &kreuzberg.ImagePreprocessingConfig{
TargetDPI: &targetDPI,
},
},
},
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println("content length:", len(result.Content))
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.ImagePreprocessingConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.build())
.imagePreprocessing(ImagePreprocessingConfig.builder()
.targetDpi(300)
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
System.out.println(result.getContent());
from kreuzberg import (
extract_file_sync,
ExtractionConfig,
OcrConfig,
TesseractConfig,
ImagePreprocessingConfig,
)
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(
backend="tesseract",
tesseract_config=TesseractConfig(
preprocessing=ImagePreprocessingConfig(target_dpi=300),
),
),
)
result = extract_file_sync("scanned.pdf", config=config)
content_length: int = len(result.content)
table_count: int = len(result.tables)
print(f"Content length: {content_length} characters")
print(f"Tables detected: {table_count}")
library(kreuzberg)
dpi_values <- c(150L, 300L, 600L)
results <- list()
for (dpi in dpi_values) {
ocr_cfg <- ocr_config(backend = "tesseract", language = "eng", dpi = dpi)
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)
results[[as.character(dpi)]] <- extract_file_sync("document.pdf", "application/pdf", config)
}
for (dpi in dpi_values) {
content_len <- nchar(results[[as.character(dpi)]]$content)
cat(sprintf("DPI %d: %d characters extracted\n", dpi, content_len))
}
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, PdfConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
..Default::default()
}),
pdf_options: Some(PdfConfig {
dpi: Some(300),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("scanned.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
DPI Recommendations
- 150 DPI: Fast processing, lower accuracy
- 300 DPI (default): Balanced speed and accuracy
- 600 DPI: High accuracy, slower processing
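The speed and accuracy trade-off follows from simple arithmetic: doubling the DPI quadruples the number of pixels the OCR engine has to process. The sketch below assumes a US Letter page (8.5 x 11 inches) and an uncompressed 8-bit grayscale bitmap; actual memory use depends on the backend and is not specified here.

```python
from typing import Tuple


def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at `dpi` (defaults: US Letter)."""
    return round(width_in * dpi), round(height_in * dpi)


def grayscale_megabytes(dpi: int) -> float:
    """Approximate size of one uncompressed 8-bit grayscale page, in MB."""
    width, height = page_pixels(dpi)
    return width * height / 1_000_000


for dpi in (150, 300, 600):
    print(dpi, page_pixels(dpi), f"{grayscale_megabytes(dpi):.1f} MB")
# 150 (1275, 1650) 2.1 MB
# 300 (2550, 3300) 8.4 MB
# 600 (5100, 6600) 33.7 MB
```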
Image Preprocessing¶
Kreuzberg automatically preprocesses images for better OCR results:
- Grayscale conversion - Reduces noise
- Contrast enhancement - Improves text visibility
- Noise reduction - Removes artifacts
- Deskewing - Corrects rotation
These are applied automatically and require no configuration.
Concurrent Multi-Language OCR¶
Kreuzberg maintains an engine pool for concurrent OCR processing of multiple languages. When processing documents with different languages, instances are reused efficiently:
- Language-specific engines - Each language creates its own engine instance
- Engine pooling - Engines are cached and reused for subsequent calls with the same language
- Concurrent processing - Multiple language files can be processed in parallel
- Memory efficient - Lazy initialization means unused languages don't consume memory
This is particularly useful when batch processing diverse multilingual documents with PaddleOCR or EasyOCR.
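A minimal sketch of such a pool, assuming one engine per language that is created lazily and shared across threads. The `EnginePool` class and `fake_engine` factory are hypothetical stand-ins, not Kreuzberg's implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class EnginePool:
    """Lazily creates one engine per language and reuses it afterwards."""

    def __init__(self, factory):
        self._factory = factory  # callable: language -> engine
        self._engines = {}
        self._lock = threading.Lock()

    def get(self, language: str):
        # The lock ensures two threads never build the same engine twice.
        with self._lock:
            if language not in self._engines:
                self._engines[language] = self._factory(language)
            return self._engines[language]


created = []


def fake_engine(language: str) -> str:
    """Stand-in for an expensive OCR engine constructor."""
    created.append(language)
    return f"engine<{language}>"


pool = EnginePool(fake_engine)
jobs = ["eng", "deu", "eng", "fra", "deu", "eng"]

with ThreadPoolExecutor(max_workers=4) as executor:
    engines = list(executor.map(pool.get, jobs))

print(sorted(created))  # ['deu', 'eng', 'fra']
```

Six jobs across three languages construct only three engines. A single lock keeps the sketch short; a real pool would likely avoid serializing unrelated languages behind one lock.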
Troubleshooting¶
Tesseract not found
Error: MissingDependencyError: tesseract
Solution: Install Tesseract OCR (see Installation above).
Language not found
Error: Failed to initialize tesseract with language 'deu'
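Solution: Install the language data (see Additional Languages above), for example sudo apt-get install tesseract-ocr-deu on Ubuntu/Debian.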
Poor OCR accuracy
Problem: Extracted text has many errors
Solutions:
- Increase DPI: Try 600 DPI for better quality
- Try a different backend: EasyOCR often has better accuracy
- Specify the correct language: Use the document's language
OCR is very slow
Problem: Processing takes too long
Solutions:
- Reduce DPI: Use 150 DPI for faster processing
- Use GPU acceleration (EasyOCR/PaddleOCR): Set use_gpu=True
- Use batch processing: Process multiple files concurrently
Out of memory with large PDFs
Problem: Memory errors when processing large scanned PDFs
Solutions:
- Reduce DPI: Lower resolution uses less memory
- Process pages separately: Extract specific page ranges
- Increase system memory: OCR is memory-intensive
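To process pages separately, a large document can be split into fixed-size page ranges that are handled one at a time. The helper below only computes the ranges; `page_batches` is a hypothetical name, and how a range is passed to the extraction call is not covered in this section.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first, last) ranges of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


print(list(page_batches(total_pages=23, batch_size=10)))
# [(1, 10), (11, 20), (21, 23)]
```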
EasyOCR/PaddleOCR Python packages not working on Python 3.14
Error: Installation of Python EasyOCR/PaddleOCR packages fails on Python 3.14
Solution: Use Python 3.10-3.13, switch to Tesseract, or use the native PaddleOCR backend (which has no Python dependency).
CLI Usage¶
Extract with OCR using the command-line interface:
# Basic OCR extraction (uses config file for language/settings)
kreuzberg extract scanned.pdf --ocr true
# Extract with specific language (Tesseract)
kreuzberg extract french_doc.pdf --ocr true --ocr-language fra
# Extract with specific language and backend (PaddleOCR for Chinese)
kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch
# Force OCR on all pages (even if text layer exists)
kreuzberg extract document.pdf --force-ocr true
# Use config file to specify language and other OCR settings
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true
CLI Flags:
- --ocr true - Enable OCR processing
- --ocr-language <code> - Language code (e.g., eng, deu, fra, ch, ja, ru)
- --ocr-backend <backend> - OCR engine (tesseract, paddle-ocr, easyocr)
- --force-ocr true - OCR all pages regardless of text layer
Example config file (kreuzberg.toml) for OCR settings:
[ocr]
backend = "tesseract"
language = "eng" # Single language
# language = "eng+deu" # Multiple languages
[ocr.tesseract_config]
psm = 3 # Page segmentation mode
Next Steps¶
- Configuration - All configuration options
- Advanced Features - Chunking, language detection, and more
- Extraction Basics - Core extraction API