OCR (Optical Character Recognition)¶
Extract text from images and scanned PDFs. Kreuzberg automatically determines when OCR is needed — images always require it, scanned PDFs trigger it per-page, and hybrid PDFs only OCR the pages that lack a text layer. Set force_ocr=True to OCR all pages regardless.
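The decision rule described above can be sketched in a few lines. This is an illustrative model only, not Kreuzberg's actual implementation; the `Page` class and `pages_to_ocr` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for one page of a document."""
    number: int
    has_text_layer: bool


def pages_to_ocr(pages, is_image=False, force_ocr=False):
    """Return the page numbers that would be sent to OCR under the rule above."""
    # Images always need OCR, and force_ocr overrides any existing text layer.
    if is_image or force_ocr:
        return [p.number for p in pages]
    # Otherwise only pages without a text layer are OCRed.
    return [p.number for p in pages if not p.has_text_layer]


# A hybrid PDF: page 2 is a scan, pages 1 and 3 carry embedded text.
hybrid = [Page(1, True), Page(2, False), Page(3, True)]
print(pages_to_ocr(hybrid))                  # [2]
print(pages_to_ocr(hybrid, force_ocr=True))  # [1, 2, 3]
```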
Backend Comparison¶
Kreuzberg supports four OCR backends. Pick based on your platform, accuracy needs, and language coverage.
| | Tesseract | PaddleOCR | EasyOCR | VLM |
|---|---|---|---|---|
| Speed | Fast | Very fast | Moderate | Slow (API latency) |
| Accuracy | Good | Excellent | Excellent | Highest |
| Languages | 100+ | 80+ (11 script families) | 80+ | All (provider-dependent) |
| Installation | System package | Built-in (native) or Python package | Python package only | API key only |
| Model size | ~10 MB | Mobile ~8 MB, Server ~120 MB | ~100 MB | None (cloud-hosted) |
| GPU support | No | Yes | Yes | N/A (server-side) |
| Platform | All (including WASM) | All except WASM | Python only | All |
| Cost | Free | Free | Free | Per-token API cost |
When to use which:
- Tesseract — Default choice. Works everywhere, low overhead, broadest platform support.
- PaddleOCR — Best speed-to-accuracy ratio. Preferred for CJK languages. Mobile tier is fast; server tier maximizes accuracy with GPU.
- EasyOCR — Highest accuracy with deep learning models. Python-only, heavier dependency.
- VLM — Best for handwritten text, poor scans, Arabic/Farsi, and complex layouts. Requires an API key and incurs per-token costs. See LLM Integration for full details.
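The guidance above can be condensed into a small selection helper. This is a rough sketch of one possible reading of the list, not a rule that Kreuzberg applies itself; `pick_backend` and its parameters are hypothetical, and the returned strings follow the backend names used in the Python examples on this page.

```python
def pick_backend(*, wasm=False, cjk=False, handwriting=False, has_api_key=False):
    """Suggest an OCR backend name from a few coarse requirements."""
    # VLM handles handwriting and poor scans best, but needs an API key.
    if handwriting and has_api_key:
        return "vlm"
    # PaddleOCR and EasyOCR are not available on WASM, so fall back to Tesseract.
    if wasm:
        return "tesseract"
    # PaddleOCR is the preferred choice for CJK languages.
    if cjk:
        return "paddleocr"
    # Tesseract is the default: works everywhere with low overhead.
    return "tesseract"


print(pick_backend())                                     # tesseract
print(pick_backend(cjk=True))                             # paddleocr
print(pick_backend(handwriting=True, has_api_key=True))   # vlm
```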
Installation¶
Tesseract¶
Download from GitHub releases.
Additional language packs:
# macOS — all languages
brew install tesseract-lang
# Ubuntu/Debian — individual languages
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-fra # French
# Verify installed languages
tesseract --list-langs
PaddleOCR¶
Built in via the paddle-ocr feature flag. Models download automatically on first use — no extra installation needed.
EasyOCR (Python only)¶
Python 3.14
EasyOCR is not supported on Python 3.14 because of an upstream PyTorch incompatibility. Use Python 3.10–3.13.
Configuration¶
Basic OCR¶
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng")
)
result = extract_file_sync("scanned.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: "eng".to_string(),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("scanned.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
lang := "eng"
cfg := &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: &lang,
},
}
result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println(len(result.Content))
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import java.io.IOException;
public class Main {
public static void main(String[] args) {
try {
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.language("eng")
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
System.out.println(result.getContent());
} catch (IOException | KreuzbergException e) {
System.err.println("Extraction failed: " + e.getMessage());
}
}
}
library(kreuzberg)
# Configure Tesseract OCR
ocr <- ocr_config(backend = "tesseract", language = "eng", dpi = 300L)
config <- extraction_config(force_ocr = TRUE, ocr = ocr)
# Extract text from a scanned image
result <- extract_file_sync("scan.png", config = config)
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat(sprintf("Quality score: %s\n", result$quality_score))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';
await initWasm();
await enableOcr();
const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];
if (file) {
const result = await extractFromFile(file, file.type, {
ocr: {
backend: 'kreuzberg-tesseract',
language: 'eng',
},
});
console.log(result.content);
}
import { enableOcr, extractFile, initWasm } from '@kreuzberg/wasm';
await initWasm();
await enableOcr(); // Uses native kreuzberg-tesseract backend
const result = await extractFile('./scanned_document.png', 'image/png', {
ocr: {
backend: 'kreuzberg-tesseract',
language: 'eng',
},
});
console.log(result.content);
Multiple Languages¶
Specify multiple language codes separated by + (Tesseract) or as a list (EasyOCR/PaddleOCR):
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng+deu+fra")
)
result = extract_file_sync("multilingual.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: "eng+deu+fra".to_string(),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("multilingual.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
lang := "eng+deu+fra"
result, err := kreuzberg.ExtractFileSync("multilingual.pdf", &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: &lang,
},
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println(result.Content)
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.language("eng+deu+fra")
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("multilingual.pdf", config);
System.out.println(result.getContent());
library(kreuzberg)
# Configure multi-language OCR (English, French, German)
ocr <- ocr_config(backend = "tesseract", language = "eng+fra+deu")
config <- extraction_config(force_ocr = TRUE, ocr = ocr)
# Extract from a multilingual document
result <- extract_file_sync("multilingual.png", config = config)
cat(sprintf("Detected language: %s\n", detected_language(result)))
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
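The two language formats mentioned at the start of this section (a `+`-joined string for Tesseract, a list for EasyOCR/PaddleOCR) can be produced from one list of codes. The helper below is a hypothetical convenience function, not part of Kreuzberg, and it does not translate between Tesseract codes such as `eng` and the shorter codes such as `en` used by the other backends.

```python
def format_languages(backend, codes):
    """Format language codes for the given backend, per the convention above."""
    if not codes:
        raise ValueError("at least one language code is required")
    if backend == "tesseract":
        # Tesseract expects a single string such as "eng+deu+fra".
        return "+".join(codes)
    # EasyOCR and PaddleOCR take the codes as a list.
    return list(codes)


print(format_languages("tesseract", ["eng", "deu", "fra"]))  # eng+deu+fra
print(format_languages("easyocr", ["en", "de"]))             # ['en', 'de']
```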
Force OCR¶
Process PDFs with OCR even when they have a text layer:
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract"),
    force_ocr=True,
)
result = extract_file_sync("document.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
..Default::default()
}),
force_ocr: true,
..Default::default()
};
let result = extract_file_sync("document.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
package main
import (
"fmt"
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
force := true
result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
},
ForceOCR: &force,
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
fmt.Println(result.Content)
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.build())
.forceOcr(true)
.build();
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
System.out.println(result.getContent());
library(kreuzberg)
config <- extraction_config(force_ocr = TRUE)
result <- extract_file_sync("multipage_document.pdf", "application/pdf", config)
cat(sprintf("Total pages: %d\n", result$pages))
cat(sprintf("Content extracted via OCR: %d characters\n",
nchar(result$content)))
cat(sprintf("Detected language: %s\n", result$detected_language))
Using EasyOCR¶
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="en")
)
# EasyOCR-specific options (use_gpu, beam_width, etc.) go in easyocr_kwargs,
# not in OcrConfig — OcrConfig only accepts backend, language, and backend-specific configs.
result = extract_file_sync("scanned.pdf", config=config, easyocr_kwargs={"use_gpu": True})
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
EasyOCR is only available in Python.
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "easyocr".to_string(),
language: "en".to_string(),
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("Extracted text: {}", result.content);
Ok(())
}
Disable OCR¶
Added in v4.7.0
Skip OCR entirely, even for image files that would normally require it. When disable_ocr is set, image files return empty content instead of raising a MissingDependencyError:
Using EasyOCR (Python Only)¶
EasyOCR is only available in Python.
library(kreuzberg)
# Note: EasyOCR backend requires Python to be installed
ocr_cfg <- ocr_config(backend = "easyocr", language = "en")
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)
result <- extract_file_sync("document.pdf", "application/pdf", config)
cat(sprintf("EasyOCR extraction:\n"))
cat(sprintf("Content length: %d characters\n", nchar(result$content)))
cat(sprintf("Detected language: %s\n", result$detected_language))
Using PaddleOCR¶
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="en")  # model_tier="server" for max accuracy
)
result = extract_file_sync("scanned.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "paddleocr".to_string(),
language: "en".to_string(),
// paddle_ocr_config: Some(serde_json::json!({"model_tier": "server"})), // for max accuracy
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("Extracted text: {}", result.content);
Ok(())
}
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
lang := "en"
cfg := &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "paddle-ocr",
Language: &lang,
// PaddleOcr: &kreuzberg.PaddleOcrConfig{ModelTier: "server"}, // for max accuracy
},
}
result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println(len(result.Content))
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import java.io.IOException;
public class Main {
public static void main(String[] args) {
try {
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("paddle-ocr")
.language("en")
// .paddleOcrConfig(PaddleOcrConfig.builder().modelTier("server").build()) // for max accuracy
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
System.out.println(result.getContent());
} catch (IOException | KreuzbergException e) {
System.err.println("Extraction failed: " + e.getMessage());
}
}
}
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
ocr: Kreuzberg::Config::OCR.new(
backend: 'paddleocr',
language: 'eng'
# model_tier: 'server' # for max accuracy
)
)
result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts result.content[0..100]
puts "Total length: #{result.content.length}"
library(kreuzberg)
# Configure PaddleOCR backend (defaults to mobile tier; use model_tier = "server" for max accuracy)
ocr <- ocr_config(backend = "paddle-ocr", language = "en")
config <- extraction_config(force_ocr = TRUE, ocr = ocr)
# Extract text from an image using PaddleOCR
result <- extract_file_sync("document.jpg", config = config)
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat(sprintf("MIME type: %s\n", result$mime_type))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
Using VLM OCR¶
Added in v4.8.0
Use a vision-language model (for example, GPT-4o, Claude) as the OCR backend. Each page is rendered as an image and sent to the VLM for text extraction. Cloud providers require an API key; local engines like Ollama do not — just start the server and use the ollama/ prefix (for example, ollama/llama3.2-vision). See Local LLM Support for setup details.
import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, LlmConfig
async def main() -> None:
    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(
            backend="vlm",
            vlm_config=LlmConfig(model="openai/gpt-4o-mini"),
        ),
    )
    result = await extract_file("scan.pdf", config=config)
    print(result.content)

asyncio.run(main())
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig};
let config = ExtractionConfig {
force_ocr: true,
ocr: Some(OcrConfig {
backend: "vlm".to_string(),
vlm_config: Some(LlmConfig {
model: "openai/gpt-4o-mini".to_string(),
..Default::default()
}),
..Default::default()
}),
..Default::default()
};
let result = extract_file("scan.pdf", None, &config).await?;
For more on VLM OCR, including custom prompts, supported providers, and API key configuration, see LLM Integration.
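The model strings in this section follow a `provider/model` pattern (for example, `openai/gpt-4o-mini` or `ollama/llama3.2-vision`). The sketch below shows how such a string could be split to decide whether an API key is needed. It is an illustration only: `split_model`, `needs_api_key`, and the `LOCAL_PROVIDERS` set are hypothetical, and only Ollama is listed as local because it is the only local engine named on this page.

```python
# Assumption: only providers named as local on this page are listed here.
LOCAL_PROVIDERS = {"ollama"}


def split_model(model):
    """Split 'provider/model-name' into its two parts."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name


def needs_api_key(model):
    """Cloud providers need an API key; local engines such as Ollama do not."""
    provider, _ = split_model(model)
    return provider not in LOCAL_PROVIDERS


print(split_model("ollama/llama3.2-vision"))    # ('ollama', 'llama3.2-vision')
print(needs_api_key("ollama/llama3.2-vision"))  # False
print(needs_api_key("openai/gpt-4o-mini"))      # True
```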
GPU Acceleration
EasyOCR and PaddleOCR support GPU acceleration through the use_gpu=True option. For EasyOCR in Python, pass it via easyocr_kwargs as shown under Using EasyOCR. PaddleOCR's model_tier="server" gives the best accuracy with GPU.
DPI Configuration¶
Image resolution affects both accuracy and speed. Higher DPI improves accuracy but increases processing time and memory usage.
| DPI | Trade-off |
|---|---|
| 150 | Fastest — lower accuracy, less memory |
| 300 (default) | Balanced — good accuracy, reasonable speed |
| 600 | Best accuracy — slower, more memory |
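To see why memory grows so quickly with DPI, it helps to work out the raster size of a single page. The sketch below assumes a US Letter page (8.5 × 11 inches) rendered as uncompressed 8-bit grayscale; these are illustrative assumptions, and Kreuzberg's real memory use depends on its internal image format and preprocessing.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def raster_bytes(width_in, height_in, dpi, bytes_per_pixel=1):
    """Uncompressed raster size in bytes (1 byte per pixel = 8-bit grayscale)."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * bytes_per_pixel


for dpi in (150, 300, 600):
    width_px, height_px = page_pixels(8.5, 11, dpi)
    megabytes = raster_bytes(8.5, 11, dpi) / 1_000_000
    print(f"{dpi} DPI: {width_px} x {height_px} px, about {megabytes:.1f} MB")
# 150 DPI: 1275 x 1650 px, about 2.1 MB
# 300 DPI: 2550 x 3300 px, about 8.4 MB
# 600 DPI: 5100 x 6600 px, about 33.7 MB
```

Doubling the DPI quadruples the pixel count, which is why moving from 300 to 600 DPI costs roughly four times the memory per page.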
from kreuzberg import (
    extract_file_sync,
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ImagePreprocessingConfig,
)
config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        tesseract_config=TesseractConfig(
            preprocessing=ImagePreprocessingConfig(target_dpi=300),
        ),
    ),
)
result = extract_file_sync("scanned.pdf", config=config)
content_length: int = len(result.content)
table_count: int = len(result.tables)
print(f"Content length: {content_length} characters")
print(f"Tables detected: {table_count}")
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, PdfConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
..Default::default()
}),
pdf_options: Some(PdfConfig {
dpi: Some(300),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("scanned.pdf", None, &config)?;
Ok(())
}
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
targetDPI := 300
result, err := kreuzberg.ExtractFileSync("scanned.pdf", &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Tesseract: &kreuzberg.TesseractConfig{
Preprocessing: &kreuzberg.ImagePreprocessingConfig{
TargetDPI: &targetDPI,
},
},
},
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println("content length:", len(result.Content))
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.ImagePreprocessingConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.build())
.imagePreprocessing(ImagePreprocessingConfig.builder()
.targetDpi(300)
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
library(kreuzberg)
dpi_values <- c(150L, 300L, 600L)
results <- list()
for (dpi in dpi_values) {
ocr_cfg <- ocr_config(backend = "tesseract", language = "eng", dpi = dpi)
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)
results[[as.character(dpi)]] <- extract_file_sync("document.pdf", "application/pdf", config)
}
for (dpi in dpi_values) {
content_len <- nchar(results[[as.character(dpi)]]$content)
cat(sprintf("DPI %d: %d characters extracted\n", dpi, content_len))
}
PaddleOCR Script Families¶
PaddleOCR supports 80+ languages across 11 script families (PP-OCRv5). Recognition models are downloaded on demand from HuggingFace:
| Family | Languages |
|---|---|
| English | English, numbers, punctuation |
| Chinese | Simplified/Traditional Chinese, Japanese |
| Latin | French, German, Spanish, Portuguese, Italian, Polish, Dutch, Turkish, Vietnamese, and others |
| Korean | Korean (Hangul) |
| Slavic | Russian, Ukrainian, Belarusian, Bulgarian, Serbian, and others |
| Thai | Thai script |
| Greek | Greek script |
| Arabic | Arabic, Persian, Urdu |
| Devanagari | Hindi, Marathi, Sanskrit, Nepali |
| Tamil | Tamil script |
| Telugu | Telugu script |
Models are cached locally after first download, so subsequent runs start immediately.
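As a quick reference, the table above can be expressed as a lookup from language name to script family. This is a sketch built only from the languages the table names explicitly; the Latin and Slavic rows are open-ended in the table, so an unlisted language returns `None` here even if PaddleOCR supports it. `SCRIPT_FAMILIES` and `family_for` are hypothetical names.

```python
# Languages exactly as listed in the table above (partial for Latin and Slavic).
SCRIPT_FAMILIES = {
    "English": ["English"],
    "Chinese": ["Simplified Chinese", "Traditional Chinese", "Japanese"],
    "Latin": ["French", "German", "Spanish", "Portuguese", "Italian",
              "Polish", "Dutch", "Turkish", "Vietnamese"],
    "Korean": ["Korean"],
    "Slavic": ["Russian", "Ukrainian", "Belarusian", "Bulgarian", "Serbian"],
    "Thai": ["Thai"],
    "Greek": ["Greek"],
    "Arabic": ["Arabic", "Persian", "Urdu"],
    "Devanagari": ["Hindi", "Marathi", "Sanskrit", "Nepali"],
    "Tamil": ["Tamil"],
    "Telugu": ["Telugu"],
}


def family_for(language):
    """Return the script family for a language name, or None if not listed."""
    wanted = language.strip().lower()
    for family, languages in SCRIPT_FAMILIES.items():
        if wanted in (name.lower() for name in languages):
            return family
    return None


print(family_for("Hindi"))     # Devanagari
print(family_for("japanese"))  # Chinese
print(family_for("Persian"))   # Arabic
```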
CLI Usage¶
# Basic OCR extraction
kreuzberg extract scanned.pdf --ocr true
# Specific language
kreuzberg extract french_doc.pdf --ocr true --ocr-language fra
# Specific backend
kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch
# Force OCR on all pages
kreuzberg extract document.pdf --force-ocr true
# VLM OCR backend
kreuzberg extract handwritten.pdf --force-ocr true --vlm-model openai/gpt-4o-mini
# Use a config file
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true
| Flag | Description |
|---|---|
| --ocr true | Enable OCR processing |
| --ocr-language <code> | Language code (eng, deu, fra, ch, ja, ru, etc.) |
| --ocr-backend <backend> | Engine: tesseract, paddle-ocr, easyocr, or vlm |
| --force-ocr true | OCR all pages regardless of text layer |
| --vlm-model <model> | VLM model for OCR (for example, openai/gpt-4o-mini). Implies --ocr-backend vlm |
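When driving the CLI from a script, the flags above can be assembled into an argument list rather than a hand-built string. The helper below is a sketch: `build_extract_args` is a hypothetical function, it only uses flags documented in this section, and it builds the command without running it.

```python
def build_extract_args(path, *, ocr=False, language=None, backend=None,
                       force_ocr=False, vlm_model=None, config=None):
    """Build an argv list for `kreuzberg extract` from the documented flags."""
    args = ["kreuzberg", "extract", path]
    if config:
        args += ["--config", config]
    if ocr:
        args += ["--ocr", "true"]
    if language:
        args += ["--ocr-language", language]
    if backend:
        args += ["--ocr-backend", backend]
    if force_ocr:
        args += ["--force-ocr", "true"]
    if vlm_model:
        # Per the table above, this implies --ocr-backend vlm.
        args += ["--vlm-model", vlm_model]
    return args


print(build_extract_args("chinese_doc.pdf", ocr=True,
                         backend="paddle-ocr", language="ch"))
```

The resulting list can be passed to `subprocess.run(args, check=True)`, which avoids shell quoting problems with file names that contain spaces.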
Troubleshooting¶
Tesseract not found
Install Tesseract (see Installation) and verify that it is on your PATH, for example by running tesseract --list-langs.
Language not found
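Install the language data pack for the missing language, for example sudo apt-get install tesseract-ocr-deu on Ubuntu/Debian or brew install tesseract-lang on macOS.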
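Both checks can be scripted with the Python standard library. The sketch below looks for the `tesseract` binary on PATH and compares a requested language string against the output of `tesseract --list-langs`. The helper names are hypothetical, and the sample output is an assumed example of what recent Tesseract versions print; the exact header line and path vary by installation.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.startswith("List of"):
            continue
        langs.add(line)
    return langs


def missing_languages(requested: str, installed: set[str]) -> list[str]:
    """Return codes from a 'eng+deu+fra' style string that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


# Assumed sample output, used so the example works without Tesseract installed.
sample = (
    'List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):\n'
    "eng\ndeu\nosd\n"
)
installed = parse_list_langs(sample)
print(missing_languages("eng+deu+fra", installed))  # ['fra']

# Live check against the local installation, if there is one.
if shutil.which("tesseract") is None:
    print("tesseract is not on PATH")
else:
    proc = subprocess.run(["tesseract", "--list-langs"],
                          capture_output=True, text=True, check=False)
    print(sorted(parse_list_langs(proc.stdout)))
```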
Poor accuracy
- Increase DPI to 600 for better quality
- Try a different backend — PaddleOCR and EasyOCR often outperform Tesseract on complex layouts
- Specify the correct language code for your document
- Use force_ocr=True if a PDF's embedded text layer is low quality
- For handwritten text or very poor scans, try the VLM backend with a vision-capable model (see LLM Integration)
Slow processing
- Reduce DPI to 150 for faster throughput
- Enable GPU acceleration with EasyOCR or PaddleOCR (use_gpu=True)
- Use batch extraction to process multiple files concurrently
Out of memory on large PDFs
- Reduce DPI — lower resolution uses significantly less memory
- Process pages in smaller batches
- Use PaddleOCR's mobile tier (model_tier="mobile") for a smaller memory footprint
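For the "smaller batches" suggestion above, one simple approach is to compute inclusive page ranges up front and process them one at a time. The generator below is a hypothetical helper that only does the range arithmetic. How each range is fed to Kreuzberg (for example, by splitting the PDF beforehand) is not covered on this page and is left open here.

```python
def page_batches(total_pages, batch_size):
    """Yield inclusive, 1-based (start, end) page ranges of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        yield (start, end)


print(list(page_batches(10, 4)))  # [(1, 4), (5, 8), (9, 10)]
```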
Next Steps¶
- LLM Integration — VLM OCR, structured extraction, and LLM embeddings
- Configuration — all configuration options
- Extraction Basics — core extraction API and supported formats
- Advanced Features — chunking, language detection, embeddings