OCR (Optical Character Recognition)

Extract text from images and scanned PDFs using OCR.

When OCR is Needed

flowchart TD
    Start[Document File] --> FileType{File Type}

    FileType -->|Image| ImageOCR[Always Use OCR]
    FileType -->|PDF| CheckPDF{Check PDF}
    FileType -->|Other| NoOCR[No OCR Needed]

    CheckPDF --> ForceOCR{force_ocr=True?}
    ForceOCR -->|Yes| AllPagesOCR[OCR All Pages]
    ForceOCR -->|No| TextLayer{Has Text Layer?}

    TextLayer -->|No Text| ScannedOCR[OCR Required]
    TextLayer -->|Some Text| HybridPDF[Hybrid PDF]
    TextLayer -->|All Text| NativeExtract[Native Extraction]

    HybridPDF --> PageByPage[Process Pages]
    PageByPage --> CheckPage{Page Has Text?}
    CheckPage -->|No| PageOCR[OCR This Page]
    CheckPage -->|Yes| PageNative[Native Extraction]

    ImageOCR --> OCRBackend[OCR Backend Configured?]
    ScannedOCR --> OCRBackend
    AllPagesOCR --> OCRBackend
    PageOCR --> OCRBackend

    OCRBackend -->|Yes| ProcessOCR[Process with OCR]
    OCRBackend -->|No| Error[MissingDependencyError]

    style ImageOCR fill:#FFB6C1
    style ScannedOCR fill:#FFB6C1
    style AllPagesOCR fill:#FFB6C1
    style PageOCR fill:#FFB6C1
    style ProcessOCR fill:#90EE90
    style Error fill:#FF6B6B

Kreuzberg automatically determines when OCR is required:

  • Images (.png, .jpg, .tiff, .bmp, .webp) - Always require OCR
  • PDFs with no text layer - Scanned documents automatically trigger OCR
  • Hybrid PDFs - Pages without text are processed with OCR, others use native extraction
  • Force OCR - Use force_ocr=True to OCR all pages regardless of text layer
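
The decision flow above can be sketched as a small pure-Python function. This is only an illustration of the rules listed here, not Kreuzberg's internal implementation; the names `Strategy` and `plan_extraction` are hypothetical, and the per-page text flags are assumed to come from whatever PDF inspection step runs first.

```python
from collections.abc import Sequence
from enum import Enum

# Image extensions listed in this guide; images always go through OCR.
IMAGE_EXTENSIONS = {".png", ".jpg", ".tiff", ".bmp", ".webp"}


class Strategy(Enum):
    OCR_ALL = "ocr_all_pages"
    OCR_MISSING = "ocr_pages_without_text"
    NATIVE = "native_extraction"
    NONE = "no_ocr_needed"


def plan_extraction(
    extension: str,
    pages_have_text: Sequence[bool] = (),
    force_ocr: bool = False,
) -> Strategy:
    """Mirror the flowchart: file type first, then force_ocr, then text layer."""
    ext = extension.lower()
    if ext in IMAGE_EXTENSIONS:
        return Strategy.OCR_ALL
    if ext != ".pdf":
        return Strategy.NONE
    # force_ocr wins over any text layer; a PDF with no text at all is a scan.
    if force_ocr or not any(pages_have_text):
        return Strategy.OCR_ALL
    if all(pages_have_text):
        return Strategy.NATIVE
    # Hybrid PDF: only the pages without text need OCR.
    return Strategy.OCR_MISSING


print(plan_extraction(".pdf", [True, False, True]))
```

Every strategy other than `NATIVE` and `NONE` still needs a configured OCR backend, which is where the flowchart's MissingDependencyError branch applies.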

Automatic Detection

You don't need to manually enable OCR for images. Kreuzberg detects the file type and applies OCR automatically when an OCR backend is configured.

OCR Backend Comparison

flowchart TD
    Start[Choose OCR Backend] --> Platform{Platform Support}
    Platform -->|All Platforms| Tesseract
    Platform -->|Python Only| PythonBackends[EasyOCR/PaddleOCR]

    Tesseract --> TessPriority{Priority}
    TessPriority -->|Speed| TessSpeed[Tesseract: Fast]
    TessPriority -->|Accuracy| TessAccuracy[Tesseract: Good]
    TessPriority -->|Production| TessProd[Tesseract: Best Choice]

    PythonBackends --> PyPriority{Priority}
    PyPriority -->|Highest Accuracy| Easy[EasyOCR: Excellent Accuracy]
    PyPriority -->|Speed + Accuracy| Paddle[PaddleOCR: Very Fast + Excellent]
    PyPriority -->|GPU Available| GPU[EasyOCR/PaddleOCR with GPU]

    style Tesseract fill:#90EE90
    style Easy fill:#FFD700
    style Paddle fill:#87CEEB

Kreuzberg supports three OCR backends with different strengths:

Feature            Tesseract                 EasyOCR               PaddleOCR
Speed              Fast                      Moderate              Very Fast
Accuracy           Good                      Excellent             Excellent
Languages          100+                      80+                   80+
Installation       System package            Python package        Python package
Model Size         Small (~10MB)             Large (~100MB)        Medium (~50MB)
CPU/GPU            CPU only                  CPU + GPU             CPU + GPU
Platform Support   All                       Python only           Python only
Best For           General use, production   High accuracy needs   Speed + accuracy

Recommendation

  • Production/CLI: Use Tesseract for simplicity and broad platform support
  • Python + Accuracy: Use EasyOCR for best accuracy with deep learning models
  • Python + Speed: Use PaddleOCR for very fast processing with excellent accuracy
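
As a rough sketch, the recommendation can be written as a lookup that returns one of the backend names used in this guide. The function name `recommend_backend` and the `priority` values are hypothetical and exist only for illustration.

```python
def recommend_backend(python_available: bool, priority: str = "simplicity") -> str:
    """Return a backend name following the recommendation list.

    priority is one of "simplicity", "accuracy" or "speed" (illustrative values).
    """
    # EasyOCR and PaddleOCR are Python-only, so other bindings and the CLI use Tesseract.
    if not python_available:
        return "tesseract"
    if priority == "accuracy":
        return "easyocr"
    if priority == "speed":
        return "paddleocr"
    return "tesseract"


print(recommend_backend(python_available=True, priority="speed"))
```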

Installation

Tesseract is available on all platforms (Python, TypeScript, Rust, Ruby):

Terminal
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Fedora/RHEL
sudo dnf install tesseract

Download from GitHub releases

Additional Languages

Terminal
# macOS
brew install tesseract-lang

# Ubuntu/Debian
sudo apt-get install tesseract-ocr-deu  # German
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-spa  # Spanish

# List all installed languages
tesseract --list-langs
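
Before running a multi-language job, it can help to check that every requested language is installed. The sketch below parses the output of `tesseract --list-langs`; the helper names are hypothetical, and the sample output (including its path) is illustrative rather than copied from a real machine.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def installed_languages() -> set[str]:
    """Return installed languages, or an empty set if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return set()
    proc = subprocess.run([exe, "--list-langs"], capture_output=True, text=True, check=False)
    # Assumption: if stdout is empty, fall back to stderr for the listing.
    return parse_list_langs(proc.stdout or proc.stderr)


def missing_languages(language: str, installed: set[str]) -> list[str]:
    """Return the codes in a spec such as "eng+deu+fra" that are not installed."""
    return [code for code in language.split("+") if code and code not in installed]


sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\ndeu\nosd\n'
print(missing_languages("eng+deu+fra", parse_list_langs(sample)))
```

On a real system, `missing_languages("eng+deu+fra", installed_languages())` would give the codes that still need a language pack.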

EasyOCR (Python Only)

Available only in Python with deep learning models:

Terminal
pip install "kreuzberg[easyocr]"

Python 3.14 Compatibility

EasyOCR is not supported on Python 3.14 because of upstream PyTorch compatibility issues. Use Python 3.10-3.13, or use Tesseract on Python 3.14.

PaddleOCR (Python Only)

Available only in Python with optimized deep learning:

Terminal
pip install "kreuzberg[paddleocr]"

Python 3.14 Compatibility

PaddleOCR is not supported on Python 3.14 due to upstream compatibility issues. Use Python 3.10-3.13.
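
A minimal sketch of a version guard based on the two compatibility notes above: it falls back to Tesseract when a Python-only backend is requested on Python 3.14 or later. The helper name `pick_backend` is hypothetical, and treating every version from 3.14 upward as unsupported is an assumption.

```python
import sys

# Backends this guide describes as unsupported on Python 3.14.
PYTHON_ONLY_BACKENDS = {"easyocr", "paddleocr"}


def pick_backend(preferred: str, version: tuple[int, int] = sys.version_info[:2]) -> str:
    """Return the preferred backend, or "tesseract" if it cannot run on this Python."""
    if preferred in PYTHON_ONLY_BACKENDS and version >= (3, 14):
        return "tesseract"
    return preferred


print(pick_backend("easyocr"))
```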

Configuration

Basic Configuration

Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    lang := "eng"
    cfg := &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend:  "tesseract",
            Language: &lang,
        },
    }

    result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }
    log.Println(len(result.Content))
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        try {
            ExtractionConfig config = ExtractionConfig.builder()
                .ocr(OcrConfig.builder()
                    .backend("tesseract")
                    .language("eng")
                    .build())
                .build();

            ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
            System.out.println(result.getContent());
        } catch (IOException | KreuzbergException e) {
            System.err.println("Extraction failed: " + e.getMessage());
        }
    }
}
Python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng")
)

result = extract_file_sync("scanned.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
Ruby
require 'kreuzberg'

ocr_config = Kreuzberg::Config::OCR.new(
  backend: 'tesseract',
  language: 'eng'
)

config = Kreuzberg::Config::Extraction.new(ocr: ocr_config)
result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts result.content
Rust
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: Some("eng".to_string()),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("scanned.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
TypeScript
import { extractFileSync } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
        language: 'eng',
    },
};

const result = extractFileSync('scanned.pdf', null, config);
console.log(result.content);
WASM
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';

await initWasm();
await enableOcr();

const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];

if (file) {
    const result = await extractFromFile(file, file.type, {
        ocr: {
            backend: 'tesseract-wasm',
            language: 'eng',
        },
    });
    console.log(result.content);
}

Multiple Languages

Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    lang := "eng+deu+fra"
    result, err := kreuzberg.ExtractFileSync("multilingual.pdf", &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend:  "tesseract",
            Language: &lang,
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println(result.Content)
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .language("eng+deu+fra")
        .build())
    .build();

ExtractionResult result = Kreuzberg.extractFile("multilingual.pdf", config);
System.out.println(result.getContent());
Python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng+deu+fra")
)

result = extract_file_sync("multilingual.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    backend: 'tesseract',
    language: 'eng+deu+fra'
  )
)

result = Kreuzberg.extract_file_sync('multilingual.pdf', config: config)
puts result.content
Rust
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: Some("eng+deu+fra".to_string()),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("multilingual.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
TypeScript
import { extractFileSync } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
        language: 'eng+deu+fra',
    },
};

const result = extractFileSync('multilingual.pdf', null, config);
console.log(result.content);
WASM
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';

await initWasm();
await enableOcr();

const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];

if (file) {
  const result = await extractFromFile(file, file.type, {
    ocr: {
      backend: 'tesseract-wasm',
      language: 'eng+deu', // Multiple languages
    },
  });
  console.log(result.content);
}

Force OCR on All Pages

Process PDFs with OCR even when they have a text layer:

Go
package main

import (
    "fmt"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    force := true
    result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend: "tesseract",
        },
        ForceOCR: &force,
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    fmt.Println(result.Content)
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .build())
    .forceOcr(true)
    .build();

ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
System.out.println(result.getContent());
Python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract"),
    force_ocr=True,
)

result = extract_file_sync("document.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(backend: 'tesseract'),
  force_ocr: true
)

result = Kreuzberg.extract_file_sync('document.pdf', config: config)
puts result.content
Rust
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            ..Default::default()
        }),
        force_ocr: true,
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
TypeScript
import { extractFileSync } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
    },
    forceOcr: true,
};

const result = extractFileSync('document.pdf', null, config);
console.log(result.content);
WASM
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';

await initWasm();
await enableOcr();

const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];

if (file) {
  const result = await extractFromFile(file, file.type, {
    force_ocr: true,
    ocr: {
      backend: 'tesseract-wasm',
      language: 'eng',
    },
  });
  console.log(result.content);
}

Using EasyOCR (Python Only)

EasyOCR is only available in Python.

Python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="en", use_gpu=True)
)

result = extract_file_sync("scanned.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    backend: 'easyocr',
    language: 'en'
  )
)

result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts result.content[0..100]
puts "Total length: #{result.content.length}"
Rust
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "easyocr".to_string(),
            language: Some("en".to_string()),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file("document.pdf", None, &config).await?;
    println!("Extracted text: {}", result.content);
    Ok(())
}

GPU Acceleration

EasyOCR and PaddleOCR support GPU acceleration via PyTorch/PaddlePaddle. Set use_gpu=True to enable.
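
Rather than hard-coding use_gpu=True, the flag can be set from a runtime check. The sketch below covers only the PyTorch (EasyOCR) case using `torch.cuda.is_available()`; the helper name `cuda_available` is hypothetical, and a PaddlePaddle check would need its own equivalent, which is not shown.

```python
def cuda_available() -> bool:
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so the GPU path cannot be used.
        return False
    return bool(torch.cuda.is_available())


# Keyword arguments matching the OcrConfig fields used in this guide.
ocr_kwargs = {"backend": "easyocr", "language": "en", "use_gpu": cuda_available()}
print(ocr_kwargs)
```

The dictionary can then be unpacked into the config, as in `OcrConfig(**ocr_kwargs)`.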

Using PaddleOCR (Python Only)

PaddleOCR is only available in Python.

Python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="en", use_gpu=True)
)

result = extract_file_sync("scanned.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    backend: 'paddleocr',
    language: 'en'
  )
)

result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts result.content[0..100]
puts "Total length: #{result.content.length}"
Rust
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "paddleocr".to_string(),
            language: Some("en".to_string()),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file("document.pdf", None, &config).await?;
    println!("Extracted text: {}", result.content);
    Ok(())
}

Advanced OCR Options

DPI Configuration

Control image resolution for OCR processing:

Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    targetDPI := 300
    result, err := kreuzberg.ExtractFileSync("scanned.pdf", &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend: "tesseract",
            Tesseract: &kreuzberg.TesseractConfig{
                Preprocessing: &kreuzberg.ImagePreprocessingConfig{
                    TargetDPI: &targetDPI,
                },
            },
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.ImagePreprocessingConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .build())
    .imagePreprocessing(ImagePreprocessingConfig.builder()
        .targetDpi(300)
        .build())
    .build();

ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
Python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig, PdfConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract"),
    pdf_options=PdfConfig(dpi=300),
)

result = extract_file_sync("scanned.pdf", config=config)

content_length: int = len(result.content)
table_count: int = len(result.tables)

print(f"Content length: {content_length} characters")
print(f"Tables detected: {table_count}")
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(backend: 'tesseract'),
  pdf: Kreuzberg::Config::PDF.new(dpi: 300)
)

result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
Rust
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, PdfConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            ..Default::default()
        }),
        pdf_options: Some(PdfConfig {
            dpi: Some(300),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("scanned.pdf", None, &config)?;
    Ok(())
}
TypeScript
import { extractFileSync } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
    },
    pdfOptions: {
        extractImages: true,
    },
};

const result = extractFileSync('scanned.pdf', null, config);
console.log(result.content);

DPI Recommendations

  • 150 DPI: Fast processing, lower accuracy
  • 300 DPI (default): Balanced speed and accuracy
  • 600 DPI: High accuracy, slower processing
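
The trade-off follows from simple arithmetic: pixel count grows with the square of the DPI. The sketch below works this out for a US Letter page (8.5 x 11 inches), assuming the page is rasterized at the target DPI into an uncompressed RGB image at 3 bytes per pixel. The helper names are hypothetical, and actual memory use inside any OCR backend will differ.

```python
def raster_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def raster_bytes(width_in: float, height_in: float, dpi: int, bytes_per_pixel: int = 3) -> int:
    """Size in bytes of the uncompressed raster (3 bytes per pixel for RGB)."""
    width_px, height_px = raster_size(width_in, height_in, dpi)
    return width_px * height_px * bytes_per_pixel


for dpi in (150, 300, 600):
    width_px, height_px = raster_size(8.5, 11, dpi)
    mib = raster_bytes(8.5, 11, dpi) / 2**20
    print(f"{dpi} DPI: {width_px} x {height_px} px, about {mib:.1f} MiB per page")
```

Under these assumptions, doubling the DPI quadruples the work per page: roughly 6 MiB at 150 DPI, 24 MiB at 300 DPI and 96 MiB at 600 DPI.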

Image Preprocessing

Kreuzberg automatically preprocesses images for better OCR results:

  • Grayscale conversion - Reduces noise
  • Contrast enhancement - Improves text visibility
  • Noise reduction - Removes artifacts
  • Deskewing - Corrects rotation

These are applied automatically and require no configuration.
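
To make the first two steps concrete, here is a dependency-free sketch of grayscale conversion and contrast stretching on a tiny image held as nested lists. It illustrates the general techniques only and is not Kreuzberg's preprocessing code. The BT.601 luma weights and the min-max stretch are assumptions chosen for the example.

```python
Pixel = tuple[int, int, int]


def to_grayscale(image: list[list[Pixel]]) -> list[list[int]]:
    """Convert RGB pixels to 8-bit luma using the ITU-R BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


def stretch_contrast(gray: list[list[int]]) -> list[list[int]]:
    """Linearly rescale so the darkest pixel maps to 0 and the brightest to 255."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:
        # A flat image has no contrast to stretch; return a copy unchanged.
        return [row[:] for row in gray]
    scale = 255 / (hi - lo)
    return [[round((value - lo) * scale) for value in row] for row in gray]


faded_scan = [[(90, 90, 90), (120, 120, 120)], [(150, 150, 150), (200, 200, 200)]]
print(stretch_contrast(to_grayscale(faded_scan)))
```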

Troubleshooting

Tesseract not found

Error: MissingDependencyError: tesseract

Solution: Install Tesseract OCR:

Terminal
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Verify installation
tesseract --version

Language not found

Error: Failed to initialize tesseract with language 'deu'

Solution: Install the language data:

Terminal
# macOS
brew install tesseract-lang

# Ubuntu/Debian
sudo apt-get install tesseract-ocr-deu

# List installed languages
tesseract --list-langs

Poor OCR accuracy

Problem: Extracted text has many errors

Solutions:

  1. Increase DPI: Try 600 DPI for better quality

    ocr_high_quality.py
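    from kreuzberg import ExtractionConfig, OcrConfig, PdfConfig

    config = ExtractionConfig(
        ocr=OcrConfig(backend="tesseract"),
        pdf_options=PdfConfig(dpi=600),  # same keyword as the DPI Configuration example
    )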
    

  2. Try different backend: EasyOCR often has better accuracy

    ocr_easyocr.py
    from kreuzberg import ExtractionConfig, OcrConfig

    config = ExtractionConfig(
        ocr=OcrConfig(backend="easyocr", language="en")
    )
    

  3. Specify correct language: Use the document's language

    ocr_german.py
    from kreuzberg import ExtractionConfig, OcrConfig

    config = ExtractionConfig(
        ocr=OcrConfig(backend="tesseract", language="deu")  # German
    )
    

OCR is very slow

Problem: Processing takes too long

Solutions:

  1. Reduce DPI: Use 150 DPI for faster processing

    ocr_fast.py
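    from kreuzberg import ExtractionConfig, OcrConfig, PdfConfig

    config = ExtractionConfig(
        ocr=OcrConfig(backend="tesseract"),
        pdf_options=PdfConfig(dpi=150),  # same keyword as the DPI Configuration example
    )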
    

  2. Use GPU acceleration (EasyOCR/PaddleOCR):

    ocr_gpu.py
    from kreuzberg import ExtractionConfig, OcrConfig

    config = ExtractionConfig(
        ocr=OcrConfig(backend="paddleocr", use_gpu=True)
    )
    

  3. Use batch processing: Process multiple files concurrently

    batch_ocr.py
    from kreuzberg import batch_extract_files_sync
    results = batch_extract_files_sync(files, config=config)  # files: list of paths; config: an ExtractionConfig
    

Out of memory with large PDFs

Problem: Memory errors when processing large scanned PDFs

Solutions:

  1. Reduce DPI: Lower resolution uses less memory

    ocr_low_memory.py
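    from kreuzberg import ExtractionConfig, OcrConfig, PdfConfig

    config = ExtractionConfig(
        ocr=OcrConfig(backend="tesseract"),
        pdf_options=PdfConfig(dpi=150),  # same keyword as the DPI Configuration example
    )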
    

  2. Process pages separately: Extract specific page ranges

  3. Increase system memory: OCR is memory-intensive
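
For the second option, the page ranges can be computed up front. The sketch below only generates the ranges; the helper name `page_chunks` is hypothetical. How each range is then extracted is left open: one assumption is that the PDF is first split into smaller files with a separate tool.

```python
from collections.abc import Iterator


def page_chunks(total_pages: int, chunk_size: int) -> Iterator[tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    for first in range(1, total_pages + 1, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield first, min(first + chunk_size - 1, total_pages)


for first, last in page_chunks(total_pages=25, chunk_size=10):
    print(f"process pages {first}-{last}")
```

Processing one range at a time keeps only a bounded number of rasterized pages in memory.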

EasyOCR/PaddleOCR not working on Python 3.14

Error: Installation fails on Python 3.14

Solution: Use Python 3.10-3.13 or switch to Tesseract:

Terminal
# Use Tesseract (works on all Python versions)
pip install kreuzberg
brew install tesseract  # or apt-get install tesseract-ocr

CLI Usage

Extract with OCR using the command-line interface:

Terminal
# Basic OCR extraction (uses config file for language/settings)
kreuzberg extract scanned.pdf --ocr true

# Force OCR on all pages (even if text layer exists)
kreuzberg extract document.pdf --force-ocr true

# Use config file to specify language and other OCR settings
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true

Example config file (kreuzberg.toml) for OCR settings:

[ocr]
backend = "tesseract"
language = "eng"           # Single language
# language = "eng+deu"     # Multiple languages

[ocr.tesseract_config]
psm = 3                    # Page segmentation mode

Next Steps