OCR (Optical Character Recognition)

Extract text from images and scanned PDFs using OCR.

When OCR is Needed

flowchart TD
    Start[Document File] --> FileType{File Type}

    FileType -->|Image| ImageOCR[Always Use OCR]
    FileType -->|PDF| CheckPDF{Check PDF}
    FileType -->|Other| NoOCR[No OCR Needed]

    CheckPDF --> ForceOCR{force_ocr=True?}
    ForceOCR -->|Yes| AllPagesOCR[OCR All Pages]
    ForceOCR -->|No| TextLayer{Has Text Layer?}

    TextLayer -->|No Text| ScannedOCR[OCR Required]
    TextLayer -->|Some Text| HybridPDF[Hybrid PDF]
    TextLayer -->|All Text| NativeExtract[Native Extraction]

    HybridPDF --> PageByPage[Process Pages]
    PageByPage --> CheckPage{Page Has Text?}
    CheckPage -->|No| PageOCR[OCR This Page]
    CheckPage -->|Yes| PageNative[Native Extraction]

    ImageOCR --> OCRBackend{OCR Backend Configured?}
    ScannedOCR --> OCRBackend
    AllPagesOCR --> OCRBackend
    PageOCR --> OCRBackend

    OCRBackend -->|Yes| ProcessOCR[Process with OCR]
    OCRBackend -->|No| Error[MissingDependencyError]

    style ImageOCR fill:#FFB6C1
    style ScannedOCR fill:#FFB6C1
    style AllPagesOCR fill:#FFB6C1
    style PageOCR fill:#FFB6C1
    style ProcessOCR fill:#90EE90
    style Error fill:#FF6B6B

Kreuzberg automatically determines when OCR is required:

Images (.png, .jpg, .tiff, .bmp, .webp) - Always require OCR
PDFs with no text layer - Scanned documents automatically trigger OCR
Hybrid PDFs - Pages without text are processed with OCR, others use native extraction
Force OCR - Use force_ocr=True to OCR all pages regardless of text layer

Automatic Detection

You don't need to manually enable OCR for images. Kreuzberg detects the file type and applies OCR automatically when an OCR backend is configured.
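
The decision flow above can be summarized as a small function. This is a minimal sketch, not Kreuzberg's actual implementation: `ocr_plan`, its parameters, and the per-page text flags are hypothetical names chosen for illustration. Only the image extensions and the `force_ocr` behavior come from this page.

```python
from pathlib import Path
from typing import Sequence

# Image extensions from the list above; all other names here are illustrative.
IMAGE_EXTENSIONS = {".png", ".jpg", ".tiff", ".bmp", ".webp"}


def ocr_plan(
    filename: str,
    page_has_text: Sequence[bool] = (),
    force_ocr: bool = False,
) -> tuple[str, list[int]]:
    """Return (strategy, zero-based page indexes that need OCR)."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return ("ocr", [])  # images always go through OCR
    if suffix != ".pdf":
        return ("none", [])  # other formats need no OCR
    if force_ocr:
        return ("ocr", list(range(len(page_has_text))))  # OCR every page
    missing = [i for i, has_text in enumerate(page_has_text) if not has_text]
    if not missing:
        return ("native", [])  # every page has a text layer
    if len(missing) == len(page_has_text):
        return ("ocr", missing)  # fully scanned PDF
    return ("hybrid", missing)  # OCR only the pages without text
```

For a three-page PDF whose second page is a scan, `ocr_plan("report.pdf", [True, False, True])` yields the hybrid strategy with page index 1 marked for OCR.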

OCR Backend Comparison

flowchart TD
    Start[Choose OCR Backend] --> Platform{Platform Support}
    Platform -->|All Platforms| Tesseract
    Platform -->|All except WASM| PaddleOCR[PaddleOCR]
    Platform -->|Python Only| EasyOCR[EasyOCR]

    Tesseract --> TessPriority{Priority}
    TessPriority -->|Speed| TessSpeed[Tesseract: Fast]
    TessPriority -->|Accuracy| TessAccuracy[Tesseract: Good]
    TessPriority -->|Production| TessProd[Tesseract: Best Choice]

    PaddleOCR --> PaddlePriority{Priority}
    PaddlePriority -->|Speed + Accuracy| PaddleMain[PaddleOCR: Very Fast + Excellent]
    PaddlePriority -->|CJK Languages| PaddleCJK[PaddleOCR: Best for CJK]

    EasyOCR --> EasyPriority{Priority}
    EasyPriority -->|Highest Accuracy| Easy[EasyOCR: Excellent Accuracy]
    EasyPriority -->|GPU Available| GPU[EasyOCR with GPU]

    style Tesseract fill:#90EE90
    style Easy fill:#FFD700
    style PaddleOCR fill:#87CEEB

Kreuzberg supports three OCR backends with different strengths:

Feature	Tesseract	EasyOCR	PaddleOCR
Speed	Fast	Moderate	Very Fast
Accuracy	Good	Excellent	Excellent
Languages	100+	80+	80+ (11 script families)
Installation	System package	Python package	Feature flag (native) or Python package
Model Size	Small (~10MB)	Large (~100MB)	Medium (~120MB base + ~8MB per family)
CPU/GPU	CPU only	CPU + GPU	CPU + GPU
Platform Support	All	Python only	All (except WASM)
Best For	General use, production	High accuracy needs	Speed + accuracy, CJK languages

Recommendation

Production/CLI: Use Tesseract for simplicity and broad platform support
Speed + Accuracy (any binding): Use PaddleOCR for fast processing with excellent accuracy, especially for CJK languages
Python + Accuracy: Use EasyOCR for best accuracy with deep learning models (Python only)
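
These recommendations can be encoded as a simple lookup. The sketch below is only one reading of the comparison table: `recommend_backend`, the `binding` values, and the `priority` values are hypothetical. The backend identifiers match the ones accepted by `--ocr-backend` in the CLI section.

```python
def recommend_backend(binding: str, priority: str = "simplicity") -> str:
    """Pick a backend name from the binding in use and the main priority.

    binding:  e.g. "python", "rust", "typescript", "wasm" (illustrative values)
    priority: "simplicity", "speed", "accuracy", or "cjk" (illustrative values)
    """
    if binding == "wasm":
        # PaddleOCR is unavailable on WASM and EasyOCR is Python only.
        return "tesseract"
    if priority in ("speed", "cjk"):
        return "paddle-ocr"
    if priority == "accuracy":
        # EasyOCR is Python only; other bindings fall back to PaddleOCR.
        return "easyocr" if binding == "python" else "paddle-ocr"
    return "tesseract"
```

Treat the result as a starting point; the right backend still depends on the documents being processed.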

Installation

Tesseract (Recommended)

Available on all platforms (Python, TypeScript, Rust, Ruby):

macOS

Terminal

brew install tesseract

Ubuntu/Debian

Terminal

sudo apt-get install tesseract-ocr

RHEL/CentOS/Fedora

Terminal

sudo dnf install tesseract

Windows

Download from GitHub releases

Additional Languages

Terminal

# macOS
brew install tesseract-lang

# Ubuntu/Debian
sudo apt-get install tesseract-ocr-deu  # German
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-spa  # Spanish

# List all installed languages
tesseract --list-langs
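
Before requesting a language combination, it can help to check which language packs are installed. The following is a rough sketch that parses the output of `tesseract --list-langs`. The helper names are hypothetical, and the parser assumes the output starts with a "List of available languages" header line followed by one language code per line.

```python
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line that precedes the codes.
        if line and not line.startswith("List of available languages"):
            codes.add(line)
    return codes


def missing_languages(requested: str, installed: set[str]) -> list[str]:
    """Return the codes from a string like "eng+deu" that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


def installed_languages() -> set[str]:
    """Run Tesseract and return the installed language codes."""
    completed = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Depending on the Tesseract version, the list is printed to stdout or stderr.
    return parse_list_langs(completed.stdout or completed.stderr)
```

Calling `missing_languages("eng+deu+fra", installed_languages())` then lists the packs that still need to be installed.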

EasyOCR (Python Only)

Available only in Python with deep learning models:

Terminal

pip install "kreuzberg[easyocr]"

Python 3.14 Compatibility

EasyOCR is not supported on Python 3.14 due to upstream PyTorch compatibility. Use Python 3.10-3.13 or use Tesseract on Python 3.14.

PaddleOCR

PaddleOCR is available as a native Rust backend in all non-WASM bindings, and also as a Python package:

Native (Rust/Go/TypeScript/Ruby/Java/C#/PHP/Elixir)

PaddleOCR is built into the native bindings via the paddle-ocr feature flag. Models are automatically downloaded on first use. No additional installation is required.

Cargo.toml (Rust)

[dependencies]
kreuzberg = { version = "4.0", features = ["paddle-ocr"] }

Python

Terminal

pip install "kreuzberg[paddleocr]"

Python 3.14 Compatibility

The Python PaddleOCR package is not supported on Python 3.14 due to upstream compatibility issues. Use Python 3.10-3.13, or use the native Rust backend which has no Python dependency.

PaddleOCR Script Families

PaddleOCR supports 80+ languages across 11 script families (all PP-OCRv5). Recognition models are downloaded on demand from HuggingFace on first use:

English - English, numbers, punctuation
Chinese - Simplified Chinese, Traditional Chinese, Japanese
Latin - French, German, Spanish, Portuguese, Italian, Polish, Dutch, Turkish, Vietnamese, etc.
Korean - Korean (Hangul)
Slavic - Russian, Ukrainian, Belarusian, Bulgarian, Serbian, etc.
Thai - Thai script
Greek - Greek script
Arabic - Arabic, Persian, Urdu
Devanagari - Hindi, Marathi, Sanskrit, Nepali
Tamil - Tamil script
Telugu - Telugu script

Per-family models are downloaded automatically and cached locally when first needed. This lazy-loading approach keeps startup time fast while supporting full multilingual capabilities.
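
The lazy-loading idea can be illustrated with a cached loader keyed by script family. This is a conceptual sketch only: the family table is copied from the list above (without the "etc." entries), while `load_family_model`, `model_for_language`, and `download_log` are hypothetical stand-ins for the real download and cache logic.

```python
from functools import lru_cache

# Script families and example languages, taken from the list above.
SCRIPT_FAMILIES = {
    "English": ["English"],
    "Chinese": ["Simplified Chinese", "Traditional Chinese", "Japanese"],
    "Latin": ["French", "German", "Spanish", "Portuguese", "Italian",
              "Polish", "Dutch", "Turkish", "Vietnamese"],
    "Korean": ["Korean"],
    "Slavic": ["Russian", "Ukrainian", "Belarusian", "Bulgarian", "Serbian"],
    "Thai": ["Thai"],
    "Greek": ["Greek"],
    "Arabic": ["Arabic", "Persian", "Urdu"],
    "Devanagari": ["Hindi", "Marathi", "Sanskrit", "Nepali"],
    "Tamil": ["Tamil"],
    "Telugu": ["Telugu"],
}

# Reverse index: lower-cased language name -> script family.
LANGUAGE_TO_FAMILY = {
    language.lower(): family
    for family, languages in SCRIPT_FAMILIES.items()
    for language in languages
}

download_log: list[str] = []  # records each simulated download


@lru_cache(maxsize=None)
def load_family_model(family: str) -> str:
    """Simulate a one-time model download; later calls hit the cache."""
    download_log.append(family)
    return f"<recognition model for {family}>"


def model_for_language(language: str) -> str:
    """Resolve a language to its script family and load that model lazily."""
    return load_family_model(LANGUAGE_TO_FAMILY[language.lower()])
```

Requesting German and then French triggers a single simulated download, because both resolve to the Latin family.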

Configuration

Basic Configuration

Go

package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    lang := "eng"
    cfg := &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend:  "tesseract",
            Language: &lang,
        },
    }

    result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }
    log.Println(len(result.Content))
}

Java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        try {
            ExtractionConfig config = ExtractionConfig.builder()
                .ocr(OcrConfig.builder()
                    .backend("tesseract")
                    .language("eng")
                    .build())
                .build();

            ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
            System.out.println(result.getContent());
        } catch (IOException | KreuzbergException e) {
            System.err.println("Extraction failed: " + e.getMessage());
        }
    }
}

Python

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng")
)

result = extract_file_sync("scanned.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")

Ruby

require 'kreuzberg'

ocr_config = Kreuzberg::Config::OCR.new(
  backend: 'tesseract',
  language: 'eng'
)

config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts result.content

R

library(kreuzberg)

# Configure Tesseract OCR
ocr <- ocr_config(backend = "tesseract", language = "eng", dpi = 300L)
config <- extraction_config(force_ocr = TRUE, ocr = ocr)

# Extract text from a scanned image
result <- extract_file_sync("scan.png", config = config)

cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat(sprintf("Quality score: %s\n", result$quality_score))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))

Rust

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("scanned.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

TypeScript

import { extractFileSync } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
        language: 'eng',
    },
};

const result = extractFileSync('scanned.pdf', null, config);
console.log(result.content);

WASM (Browser)

import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';

await initWasm();
await enableOcr();

const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];

if (file) {
    const result = await extractFromFile(file, file.type, {
        ocr: {
            backend: 'kreuzberg-tesseract',
            language: 'eng',
        },
    });
    console.log(result.content);
}

WASM (Node.js / Deno / Bun)

import { enableOcr, extractFile, initWasm } from '@kreuzberg/wasm';

await initWasm();
await enableOcr(); // Uses native kreuzberg-tesseract backend

const result = await extractFile('./scanned_document.png', 'image/png', {
    ocr: {
        backend: 'kreuzberg-tesseract',
        language: 'eng',
    },
});
console.log(result.content);

Multiple Languages

Go

package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    lang := "eng+deu+fra"
    result, err := kreuzberg.ExtractFileSync("multilingual.pdf", &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend:  "tesseract",
            Language: &lang,
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println(result.Content)
}

Java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .language("eng+deu+fra")
        .build())
    .build();

ExtractionResult result = Kreuzberg.extractFile("multilingual.pdf", config);
System.out.println(result.getContent());

Python

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng+deu+fra")
)

result = extract_file_sync("multilingual.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")

Ruby

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    backend: 'tesseract',
    language: 'eng+deu+fra'
  )
)

result = Kreuzberg.extract_file_sync('multilingual.pdf', config: config)
puts result.content

R

library(kreuzberg)

# Configure multi-language OCR (English, French, German)
ocr <- ocr_config(backend = "tesseract", language = "eng+fra+deu")
config <- extraction_config(force_ocr = TRUE, ocr = ocr)

# Extract from a multilingual document
result <- extract_file_sync("multilingual.png", config = config)

cat(sprintf("Detected language: %s\n", detected_language(result)))
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))

Rust

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng+deu+fra".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("multilingual.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

TypeScript

import { extractFileSync } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
        language: 'eng+deu+fra',
    },
};

const result = extractFileSync('multilingual.pdf', null, config);
console.log(result.content);

WASM (Browser)

import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';

await initWasm();
await enableOcr();

const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];

if (file) {
  const result = await extractFromFile(file, file.type, {
    ocr: {
      backend: 'tesseract-wasm',
      language: 'eng+deu', // Multiple languages
    },
  });
  console.log(result.content);
}
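
In the examples above, multiple Tesseract languages are passed as a single string joined with `+`. The helper below is a small sketch of building that string from separate codes. `tesseract_languages` is a hypothetical name and is not part of the Kreuzberg API.

```python
def tesseract_languages(*codes: str) -> str:
    """Join language codes into the "eng+deu+fra" form, dropping duplicates."""
    cleaned: list[str] = []
    for code in codes:
        code = code.strip()
        if not code or "+" in code:
            raise ValueError(f"invalid language code: {code!r}")
        if code not in cleaned:  # keep the first occurrence, preserve order
            cleaned.append(code)
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)
```

The result, for example `tesseract_languages("eng", "deu", "fra")`, can be passed as the `language` value in any of the configurations shown above.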

Force OCR on All Pages

Process PDFs with OCR even when they have a text layer:

Go

package main

import (
    "fmt"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    force := true
    result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend: "tesseract",
        },
        ForceOCR: &force,
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    fmt.Println(result.Content)
}

Java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .build())
    .forceOcr(true)
    .build();

ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
System.out.println(result.getContent());

Python

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract"),
    force_ocr=True,
)

result = extract_file_sync("document.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")

Ruby

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(backend: 'tesseract'),
  force_ocr: true
)

result = Kreuzberg.extract_file_sync('document.pdf', config: config)
puts result.content

R

library(kreuzberg)

config <- extraction_config(force_ocr = TRUE)

result <- extract_file_sync("multipage_document.pdf", "application/pdf", config)

cat(sprintf("Total pages: %d\n", result$pages))
cat(sprintf("Content extracted via OCR: %d characters\n",
            nchar(result$content)))
cat(sprintf("Detected language: %s\n", result$detected_language))

Rust

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            ..Default::default()
        }),
        force_ocr: true,
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

TypeScript

import { extractFileSync } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
    },
    forceOcr: true,
};

const result = extractFileSync('document.pdf', null, config);
console.log(result.content);

WASM (Browser)

import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';

await initWasm();
await enableOcr();

const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];

if (file) {
  const result = await extractFromFile(file, file.type, {
    force_ocr: true,
    ocr: {
      backend: 'tesseract-wasm',
      language: 'eng',
    },
  });
  console.log(result.content);
}

Using EasyOCR (Python Only)

Go / Java

EasyOCR is only available in Python.

Python

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="en", use_gpu=True)
)

result = extract_file_sync("scanned.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")

Ruby

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    backend: 'easyocr',
    language: 'en'
  )
)

result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts result.content[0..100]
puts "Total length: #{result.content.length}"

R

library(kreuzberg)

# Note: EasyOCR backend requires Python to be installed
ocr_cfg <- ocr_config(backend = "easyocr", language = "en")
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("EasyOCR extraction:\n"))
cat(sprintf("Content length: %d characters\n", nchar(result$content)))
cat(sprintf("Detected language: %s\n", result$detected_language))

Rust

use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "easyocr".to_string(),
            language: "en".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file("document.pdf", None, &config).await?;
    println!("Extracted text: {}", result.content);
    Ok(())
}

TypeScript

EasyOCR is only available in Python.

GPU Acceleration

EasyOCR and PaddleOCR support GPU acceleration through PyTorch and PaddlePaddle, respectively. Set use_gpu=True in the OCR configuration to enable it.

Using PaddleOCR

Go

package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    lang := "en"
    cfg := &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend:  "paddle-ocr",
            Language: &lang,
        },
    }

    result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }
    log.Println(len(result.Content))
}

Java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        try {
            ExtractionConfig config = ExtractionConfig.builder()
                .ocr(OcrConfig.builder()
                    .backend("paddle-ocr")
                    .language("en")
                    .build())
                .build();

            ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
            System.out.println(result.getContent());
        } catch (IOException | KreuzbergException e) {
            System.err.println("Extraction failed: " + e.getMessage());
        }
    }
}

Python

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="en")
)

result = extract_file_sync("scanned.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")

Ruby

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    backend: 'paddle-ocr',
    language: 'en'
  )
)

result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts result.content[0..100]
puts "Total length: #{result.content.length}"

R

library(kreuzberg)

# Configure PaddleOCR backend
ocr <- ocr_config(backend = "paddle-ocr", language = "en")
config <- extraction_config(force_ocr = TRUE, ocr = ocr)

# Extract text from an image using PaddleOCR
result <- extract_file_sync("document.jpg", config = config)

cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat(sprintf("MIME type: %s\n", result$mime_type))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))

Rust

use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "paddle-ocr".to_string(),
            language: "en".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file("document.pdf", None, &config).await?;
    println!("Extracted text: {}", result.content);
    Ok(())
}

TypeScript

import { extractFileSync } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'paddle-ocr',
        language: 'en',
    },
};

const result = extractFileSync('scanned.pdf', null, config);
console.log(result.content);

Advanced OCR Options

DPI Configuration

Control image resolution for OCR processing:

Go

package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    targetDPI := 300
    result, err := kreuzberg.ExtractFileSync("scanned.pdf", &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend: "tesseract",
            Tesseract: &kreuzberg.TesseractConfig{
                Preprocessing: &kreuzberg.ImagePreprocessingConfig{
                    TargetDPI: &targetDPI,
                },
            },
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}

Java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.ImagePreprocessingConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .build())
    .imagePreprocessing(ImagePreprocessingConfig.builder()
        .targetDpi(300)
        .build())
    .build();

ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);

Python

from kreuzberg import (
    extract_file_sync,
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ImagePreprocessingConfig,
)

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        tesseract_config=TesseractConfig(
            preprocessing=ImagePreprocessingConfig(target_dpi=300),
        ),
    ),
)

result = extract_file_sync("scanned.pdf", config=config)

content_length: int = len(result.content)
table_count: int = len(result.tables)

print(f"Content length: {content_length} characters")
print(f"Tables detected: {table_count}")

Ruby

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(backend: 'tesseract'),
  pdf: Kreuzberg::Config::PDF.new(dpi: 300)
)

result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)

R

library(kreuzberg)

dpi_values <- c(150L, 300L, 600L)
results <- list()

for (dpi in dpi_values) {
  ocr_cfg <- ocr_config(backend = "tesseract", language = "eng", dpi = dpi)
  config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)
  results[[as.character(dpi)]] <- extract_file_sync("document.pdf", "application/pdf", config)
}

for (dpi in dpi_values) {
  content_len <- nchar(results[[as.character(dpi)]]$content)
  cat(sprintf("DPI %d: %d characters extracted\n", dpi, content_len))
}

Rust

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, PdfConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            ..Default::default()
        }),
        pdf_options: Some(PdfConfig {
            dpi: Some(300),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("scanned.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

TypeScript

import { extractFileSync } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
    },
    pdfOptions: {
        extractImages: true,
    },
};

const result = extractFileSync('scanned.pdf', null, config);
console.log(result.content);

DPI Recommendations

150 DPI: Fast processing, lower accuracy
300 DPI (default): Balanced speed and accuracy
600 DPI: High accuracy, slower processing
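
The speed and memory trade-off follows from simple arithmetic: a page rendered at a given DPI has width × height × DPI² pixels. The sketch below works this out for a US Letter page (8.5 × 11 in). `page_pixels` is an illustrative helper, and the memory figure assumes an uncompressed 8-bit grayscale buffer, which may not match how Kreuzberg holds images internally.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int, int]:
    """Return (width_px, height_px, total_pixels) for a page rendered at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px


# US Letter (8.5 x 11 in) at the three DPI values listed above.
for dpi in (150, 300, 600):
    width_px, height_px, total = page_pixels(8.5, 11.0, dpi)
    megabytes = total / 1_000_000  # assumes 1 byte per pixel (8-bit grayscale)
    print(f"{dpi} DPI: {width_px} x {height_px} px, about {megabytes:.1f} MB per page")
```

Doubling the DPI quadruples the pixel count, so 600 DPI means 16 times as many pixels per page as 150 DPI.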

Image Preprocessing

Kreuzberg automatically preprocesses images for better OCR results:

Grayscale conversion - Reduces noise
Contrast enhancement - Improves text visibility
Noise reduction - Removes artifacts
Deskewing - Corrects rotation

These are applied automatically and require no configuration.

Concurrent Multi-Language OCR

Kreuzberg maintains an engine pool for concurrent OCR processing of multiple languages. When processing documents with different languages, instances are reused efficiently:

Language-specific engines - Each language creates its own engine instance
Engine pooling - Engines are cached and reused for subsequent calls with the same language
Concurrent processing - Multiple language files can be processed in parallel
Memory efficient - Lazy initialization means unused languages don't consume memory

This is particularly useful when batch processing diverse multilingual documents with PaddleOCR or EasyOCR.
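
The pooling behavior described above can be pictured as a cache of engines keyed by language and created on first use. The class below is a conceptual sketch, not Kreuzberg's internal code: `EnginePool`, `get`, and the `factory` callback are all hypothetical names.

```python
import threading
from typing import Callable, Dict, Generic, List, TypeVar

Engine = TypeVar("Engine")


class EnginePool(Generic[Engine]):
    """Create one engine per language on demand and reuse it afterwards."""

    def __init__(self, factory: Callable[[str], Engine]) -> None:
        self._factory = factory
        self._engines: Dict[str, Engine] = {}
        self._lock = threading.Lock()  # lets several threads call get() safely

    def get(self, language: str) -> Engine:
        with self._lock:
            engine = self._engines.get(language)
            if engine is None:
                # Lazy initialization: languages never requested cost nothing.
                engine = self._factory(language)
                self._engines[language] = engine
            return engine

    def loaded_languages(self) -> List[str]:
        with self._lock:
            return sorted(self._engines)
```

A second `get("eng")` call returns the engine created by the first one, while a language that is never requested never has an engine created for it.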

Troubleshooting

Tesseract not found

Error: MissingDependencyError: tesseract

Solution: Install Tesseract OCR:

Terminal

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Verify installation
tesseract --version

Language not found

Error: Failed to initialize tesseract with language 'deu'

Solution: Install the language data:

Terminal

# macOS
brew install tesseract-lang

# Ubuntu/Debian
sudo apt-get install tesseract-ocr-deu

# List installed languages
tesseract --list-langs

Poor OCR accuracy

Problem: Extracted text has many errors

Solutions:

Increase DPI: Try 600 DPI for better quality

ocr_high_quality.py

from kreuzberg import ExtractionConfig, OcrConfig, PdfConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract"),
    pdf=PdfConfig(dpi=600),
)

Try different backend: EasyOCR often has better accuracy

ocr_easyocr.py

from kreuzberg import ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="en")
)

Specify correct language: Use the document's language

ocr_german.py

from kreuzberg import ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="deu")
)

OCR is very slow

Problem: Processing takes too long

Solutions:

Reduce DPI: Use 150 DPI for faster processing

ocr_fast.py

from kreuzberg import ExtractionConfig, OcrConfig, PdfConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract"),
    pdf=PdfConfig(dpi=150),
)

Use GPU acceleration (EasyOCR/PaddleOCR):

ocr_gpu.py

from kreuzberg import ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", use_gpu=True)
)

Use batch processing: Process multiple files concurrently

batch_ocr.py

results = batch_extract_files_sync(files, config=config)

Out of memory with large PDFs

Problem: Memory errors when processing large scanned PDFs

Solutions:

Reduce DPI: Lower resolution uses less memory

ocr_low_memory.py

from kreuzberg import ExtractionConfig, OcrConfig, PdfConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract"),
    pdf=PdfConfig(dpi=150),
)

Process pages separately: Extract specific page ranges
Increase system memory: OCR is memory-intensive
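
To process pages separately, a large document first has to be split into page ranges. The helper below is a minimal sketch that only computes those ranges. `page_ranges` is a hypothetical name, and how a range is then passed to the extraction call is not covered on this page, so that step is left out.

```python
def page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split 1..total_pages into inclusive (first, last) ranges of chunk_size pages."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    ranges = []
    for first in range(1, total_pages + 1, chunk_size):
        last = min(first + chunk_size - 1, total_pages)  # final chunk may be shorter
        ranges.append((first, last))
    return ranges
```

For a 10-page scan and a chunk size of 4, this yields the ranges 1-4, 5-8, and 9-10, so only a few page images need to be in memory at a time.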

EasyOCR/PaddleOCR Python packages not working on Python 3.14

Error: Installation of Python EasyOCR/PaddleOCR packages fails on Python 3.14

Solution: Use Python 3.10-3.13, switch to Tesseract, or use the native PaddleOCR backend (which has no Python dependency):

Terminal

# Option 1: Use Tesseract (works on all Python versions)
pip install kreuzberg
brew install tesseract  # or apt-get install tesseract-ocr

# Option 2: Use native PaddleOCR backend (no Python dependency)
# Set backend to "paddle-ocr" in your config - models download automatically

CLI Usage

Extract with OCR using the command-line interface:

Terminal

# Basic OCR extraction (uses config file for language/settings)
kreuzberg extract scanned.pdf --ocr true

# Extract with specific language (Tesseract)
kreuzberg extract french_doc.pdf --ocr true --ocr-language fra

# Extract with specific language and backend (PaddleOCR for Chinese)
kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch

# Force OCR on all pages (even if text layer exists)
kreuzberg extract document.pdf --force-ocr true

# Use config file to specify language and other OCR settings
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true

CLI Flags:

--ocr true - Enable OCR processing
--ocr-language <code> - Language code (e.g., eng, deu, fra, ch, ja, ru)
--ocr-backend <backend> - OCR engine (tesseract, paddle-ocr, easyocr)
--force-ocr true - OCR all pages regardless of text layer
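
When the CLI is driven from a script, the flags above can be assembled programmatically. This sketch builds the argument list using only the flags documented here. `build_extract_command` is a hypothetical helper, and running the result assumes the `kreuzberg` executable is on the PATH.

```python
import shlex
from typing import List, Optional


def build_extract_command(
    path: str,
    ocr: bool = True,
    backend: Optional[str] = None,
    language: Optional[str] = None,
    force_ocr: bool = False,
) -> List[str]:
    """Build the argument list for `kreuzberg extract` from the documented flags."""
    command = ["kreuzberg", "extract", path]
    if ocr:
        command += ["--ocr", "true"]
    if backend:
        command += ["--ocr-backend", backend]
    if language:
        command += ["--ocr-language", language]
    if force_ocr:
        command += ["--force-ocr", "true"]
    return command


command = build_extract_command("chinese_doc.pdf", backend="paddle-ocr", language="ch")
print(shlex.join(command))  # shell-quoted form, handy for logging
```

The list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting problems with file names that contain spaces.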

Example config file (kreuzberg.toml) for OCR settings:

OCR Configuration Example

[ocr]
backend = "tesseract"
language = "eng"           # Single language
# language = "eng+deu"     # Multiple languages

[ocr.tesseract_config]
psm = 3                    # Page segmentation mode

Next Steps

Configuration - All configuration options
Advanced Features - Chunking, language detection, and more
Extraction Basics - Core extraction API