Skip to content

Python API Reference

Complete reference for the Kreuzberg Python API.

Installation

Terminal
pip install kreuzberg

With EasyOCR:

Terminal
pip install "kreuzberg[easyocr]"

With PaddleOCR:

Terminal
pip install "kreuzberg[paddleocr]"

With API server:

Terminal
pip install "kreuzberg[api]"

With all features:

Terminal
pip install "kreuzberg[all]"

Core Functions

extract_file_sync()

Extract content from a file (synchronous).

Signature:

Python
def extract_file_sync(
    file_path: str | Path,
    mime_type: str | None = None,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult

Parameters:

  • file_path (str | Path): Path to the file to extract
  • mime_type (str | None): Optional MIME type hint. If None, MIME type is auto-detected from file extension and content
  • config (ExtractionConfig | None): Extraction configuration. Uses defaults if None
  • easyocr_kwargs (dict | None): EasyOCR initialization options (languages, use_gpu, beam_width, etc.)
  • paddleocr_kwargs (dict | None): PaddleOCR initialization options (lang, use_angle_cls, show_log, etc.)

Returns:

  • ExtractionResult: Extraction result containing content, metadata, and tables

Raises:

  • KreuzbergError: Base exception for all extraction errors
  • ValidationError: Invalid configuration or file path
  • ParsingError: Document parsing failure
  • OCRError: OCR processing failure
  • MissingDependencyError: Required system dependency not found

Example - Basic usage:

basic_extraction.py
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)
print(f"Pages: {result.metadata['page_count']}")

Example - With OCR:

with_ocr.py
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng")
)
result = extract_file_sync("scanned.pdf", config=config)

Example - With EasyOCR custom options:

easyocr_custom.py
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="eng")
)
result = extract_file_sync(
    "scanned.pdf",
    config=config,
    easyocr_kwargs={"use_gpu": True, "beam_width": 10}
)

extract_file()

Extract content from a file (asynchronous).

Signature:

Python
async def extract_file(
    file_path: str | Path,
    mime_type: str | None = None,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult

Parameters:

Same as extract_file_sync().

Returns:

  • ExtractionResult: Extraction result containing content, metadata, and tables

Examples:

basic_extraction.py
import asyncio
from kreuzberg import extract_file

async def main():
    result = await extract_file("document.pdf")
    print(result.content)

asyncio.run(main())

extract_bytes_sync()

Extract content from bytes (synchronous).

Signature:

Python
def extract_bytes_sync(
    data: bytes | bytearray,
    mime_type: str,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult

Parameters:

  • data (bytes | bytearray): File content as bytes or bytearray
  • mime_type (str): MIME type of the data (required for format detection)
  • config (ExtractionConfig | None): Extraction configuration. Uses defaults if None
  • easyocr_kwargs (dict | None): EasyOCR initialization options
  • paddleocr_kwargs (dict | None): PaddleOCR initialization options

Returns:

  • ExtractionResult: Extraction result containing content, metadata, and tables

Examples:

basic_extraction.py
from kreuzberg import extract_bytes_sync

with open("document.pdf", "rb") as f:
    data = f.read()

result = extract_bytes_sync(data, "application/pdf")
print(result.content)

extract_bytes()

Extract content from bytes (asynchronous).

Signature:

Python
async def extract_bytes(
    data: bytes | bytearray,
    mime_type: str,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult

Parameters:

Same as extract_bytes_sync().

Returns:

  • ExtractionResult: Extraction result containing content, metadata, and tables

batch_extract_files_sync()

Extract content from multiple files in parallel (synchronous).

Signature:

Python
def batch_extract_files_sync(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]

Parameters:

  • paths (list[str | Path]): List of file paths to extract
  • config (ExtractionConfig | None): Extraction configuration applied to all files
  • easyocr_kwargs (dict | None): EasyOCR initialization options
  • paddleocr_kwargs (dict | None): PaddleOCR initialization options

Returns:

  • list[ExtractionResult]: List of extraction results (one per file)

Examples:

basic_extraction.py
from kreuzberg import batch_extract_files_sync

paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
results = batch_extract_files_sync(paths)

for path, result in zip(paths, results):
    print(f"{path}: {len(result.content)} characters")

batch_extract_files()

Extract content from multiple files in parallel (asynchronous).

Signature:

Python
async def batch_extract_files(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]

Parameters:

Same as batch_extract_files_sync().

Returns:

  • list[ExtractionResult]: List of extraction results (one per file)

batch_extract_bytes_sync()

Extract content from multiple byte arrays in parallel (synchronous).

Signature:

Python
def batch_extract_bytes_sync(
    data_list: list[bytes | bytearray],
    mime_types: list[str],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]

Parameters:

  • data_list (list[bytes | bytearray]): List of file contents as bytes/bytearray
  • mime_types (list[str]): List of MIME types (one per data item, same length as data_list)
  • config (ExtractionConfig | None): Extraction configuration applied to all items
  • easyocr_kwargs (dict | None): EasyOCR initialization options
  • paddleocr_kwargs (dict | None): PaddleOCR initialization options

Returns:

  • list[ExtractionResult]: List of extraction results (one per data item)

batch_extract_bytes()

Extract content from multiple byte arrays in parallel (asynchronous).

Signature:

Python
async def batch_extract_bytes(
    data_list: list[bytes | bytearray],
    mime_types: list[str],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]

Parameters:

Same as batch_extract_bytes_sync().

Returns:

  • list[ExtractionResult]: List of extraction results (one per data item)

Configuration

ExtractionConfig

Main configuration class for extraction operations.

Fields:

  • ocr (OcrConfig | None): OCR configuration. Default: None (no OCR)
  • force_ocr (bool): Force OCR even for text-based PDFs. Default: False
  • pdf_options (PdfConfig | None): PDF-specific configuration. Default: None
  • chunking (ChunkingConfig | None): Text chunking configuration. Default: None
  • language_detection (LanguageDetectionConfig | None): Language detection configuration. Default: None
  • token_reduction (TokenReductionConfig | None): Token reduction configuration. Default: None
  • image_extraction (ImageExtractionConfig | None): Image extraction from documents. Default: None
  • post_processor (PostProcessorConfig | None): Post-processing configuration. Default: None

Example:

config.py
from kreuzberg import ExtractionConfig, OcrConfig, PdfConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng"),
    force_ocr=False,
    pdf_options=PdfConfig(
        passwords=["password1", "password2"],
        extract_images=True
    )
)

result = extract_file_sync("document.pdf", config=config)

OcrConfig

OCR processing configuration.

Fields:

  • backend (str): OCR backend to use. Options: "tesseract", "easyocr", "paddleocr". Default: "tesseract"
  • language (str): Language code for OCR (ISO 639-3). Default: "eng"
  • tesseract_config (TesseractConfig | None): Tesseract-specific configuration. Default: None

Example - Basic OCR:

with_ocr.py
from kreuzberg import OcrConfig

ocr_config = OcrConfig(backend="tesseract", language="eng")

Example - With EasyOCR:

with_ocr.py
from kreuzberg import OcrConfig

ocr_config = OcrConfig(backend="easyocr", language="en")

TesseractConfig

Tesseract OCR backend configuration.

Fields:

  • psm (int): Page segmentation mode (0-13). Default: 3 (auto)
  • oem (int): OCR engine mode (0-3). Default: 3 (LSTM only)
  • enable_table_detection (bool): Enable table detection and extraction. Default: False
  • tessedit_char_whitelist (str | None): Character whitelist (e.g., "0123456789" for digits only). Default: None
  • tessedit_char_blacklist (str | None): Character blacklist. Default: None

Example:

basic_extraction.py
from kreuzberg import OcrConfig, TesseractConfig

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(
            psm=6,
            enable_table_detection=True,
            tessedit_char_whitelist="0123456789"
        )
    )
)

PdfConfig

PDF-specific configuration.

Fields:

  • passwords (list[str] | None): List of passwords to try for encrypted PDFs. Default: None
  • extract_images (bool): Extract images from PDF. Default: False
  • image_dpi (int): DPI for image extraction. Default: 300

Example:

basic_extraction.py
from kreuzberg import PdfConfig

pdf_config = PdfConfig(
    passwords=["password1", "password2"],
    extract_images=True,
    image_dpi=300
)

ChunkingConfig

Text chunking configuration for splitting long documents.

Fields:

  • chunk_size (int): Maximum chunk size in tokens. Default: 512
  • chunk_overlap (int): Overlap between chunks in tokens. Default: 50
  • chunking_strategy (str): Chunking strategy. Options: "fixed", "semantic". Default: "fixed"

Example:

basic_extraction.py
from kreuzberg import ChunkingConfig

chunking_config = ChunkingConfig(
    chunk_size=1024,
    chunk_overlap=100,
    chunking_strategy="semantic"
)

LanguageDetectionConfig

Language detection configuration.

Fields:

  • enabled (bool): Enable language detection. Default: True
  • confidence_threshold (float): Minimum confidence threshold (0.0-1.0). Default: 0.5

Example:

basic_extraction.py
from kreuzberg import LanguageDetectionConfig

lang_config = LanguageDetectionConfig(
    enabled=True,
    confidence_threshold=0.7
)

ImageExtractionConfig

Image extraction configuration.

Fields:

  • enabled (bool): Enable image extraction from documents. Default: False
  • min_width (int): Minimum image width in pixels. Default: 100
  • min_height (int): Minimum image height in pixels. Default: 100

TokenReductionConfig

Token reduction configuration for compressing extracted text.

Fields:

  • enabled (bool): Enable token reduction. Default: False
  • strategy (str): Reduction strategy. Options: "whitespace", "stemming". Default: "whitespace"

PostProcessorConfig

Post-processing configuration.

Fields:

  • enabled (bool): Enable post-processing. Default: True
  • processors (list[str]): List of processor names to enable. Default: all registered processors

ImagePreprocessingConfig

Image preprocessing configuration for OCR.

Fields:

  • target_dpi (int): Target DPI for image preprocessing. Default: 300
  • auto_rotate (bool): Auto-rotate images based on orientation. Default: True
  • denoise (bool): Apply denoising filter. Default: False

Results & Types

ExtractionResult

Result object returned by all extraction functions.

Type Definition:

Python
class ExtractionResult(TypedDict):
    content: str
    mime_type: str
    metadata: Metadata
    tables: list[Table]
    detected_languages: list[str] | None

Fields:

  • content (str): Extracted text content
  • mime_type (str): MIME type of the processed document
  • metadata (Metadata): Document metadata (format-specific fields)
  • tables (list[Table]): List of extracted tables
  • detected_languages (list[str] | None): List of detected language codes (ISO 639-1) if language detection is enabled
  • pages (list[PageContent] | None): Per-page extracted content when page extraction is enabled via PageConfig.extract_pages = true

Example:

basic_extraction.py
result = extract_file_sync("document.pdf")

print(f"Content: {result.content}")
print(f"MIME type: {result.mime_type}")
print(f"Page count: {result.metadata.get('page_count')}")
print(f"Tables: {len(result.tables)}")

if result.detected_languages:
    print(f"Languages: {', '.join(result.detected_languages)}")

pages

Type: list[PageContent] | None

Per-page extracted content when page extraction is enabled via PageConfig.extract_pages = true.

Each page contains: - Page number (1-indexed) - Text content for that page - Tables on that page - Images on that page

Example:

page_extraction.py
from kreuzberg import extract_file_sync, ExtractionConfig, PageConfig

config = ExtractionConfig(
    pages=PageConfig(extract_pages=True)
)

result = extract_file_sync("document.pdf", config=config)

if result.pages:
    for page in result.pages:
        print(f"Page {page.page_number}:")
        print(f"  Content: {len(page.content)} chars")
        print(f"  Tables: {len(page.tables)}")
        print(f"  Images: {len(page.images)}")

Accessing Per-Page Content

When page extraction is enabled, access individual pages and iterate over them:

iterate_pages.py
from kreuzberg import extract_file_sync, ExtractionConfig, PageConfig

config = ExtractionConfig(
    pages=PageConfig(
        extract_pages=True,
        insert_page_markers=True,
        marker_format="\n\n--- Page {page_num} ---\n\n"
    )
)

result = extract_file_sync("document.pdf", config=config)

# Access combined content with page markers
print("Combined content with markers:")
print(result.content[:500])
print()

# Access per-page content
if result.pages:
    for page in result.pages:
        print(f"Page {page.page_number}:")
        print(f"  {page.content[:100]}...")
        if page.tables:
            print(f"  Found {len(page.tables)} table(s)")
        if page.images:
            print(f"  Found {len(page.images)} image(s)")

Metadata

Strongly-typed metadata dictionary. Fields vary by document format.

Common Fields:

  • language (str): Document language (ISO 639-1 code)
  • date (str): Document date (ISO 8601 format)
  • subject (str): Document subject
  • format_type (str): Format discriminator ("pdf", "excel", "email", etc.)

PDF-Specific Fields (when format_type == "pdf"):

  • title (str): PDF title
  • author (str): PDF author
  • page_count (int): Number of pages
  • creation_date (str): Creation date (ISO 8601)
  • modification_date (str): Modification date (ISO 8601)
  • creator (str): Creator application
  • producer (str): Producer application
  • keywords (str): PDF keywords
  • subject (str): PDF subject

Excel-Specific Fields (when format_type == "excel"):

  • sheet_count (int): Number of sheets
  • sheet_names (list[str]): List of sheet names

Email-Specific Fields (when format_type == "email"):

  • from_email (str): Sender email address
  • from_name (str): Sender name
  • to_emails (list[str]): Recipient email addresses
  • cc_emails (list[str]): CC email addresses
  • bcc_emails (list[str]): BCC email addresses
  • message_id (str): Email message ID
  • attachments (list[str]): List of attachment filenames

Example:

basic_extraction.py
result = extract_file_sync("document.pdf")
metadata = result.metadata

if metadata.get("format_type") == "pdf":
    print(f"Title: {metadata.get('title')}")
    print(f"Author: {metadata.get('author')}")
    print(f"Pages: {metadata.get('page_count')}")

See the Types Reference for complete metadata field documentation.


Table

Extracted table structure.

Type Definition:

Python
class Table(TypedDict):
    cells: list[list[str]]
    markdown: str
    page_number: int

Fields:

  • cells (list[list[str]]): 2D array of table cells (rows x columns)
  • markdown (str): Table rendered as markdown
  • page_number (int): Page number where table was found

Example:

basic_extraction.py
result = extract_file_sync("invoice.pdf")

for table in result.tables:
    print(f"Table on page {table.page_number}:")
    print(table.markdown)
    print()

ChunkMetadata

Metadata for a single text chunk.

Type Definition:

Python
class ChunkMetadata(TypedDict, total=False):
    byte_start: int
    byte_end: int
    char_count: int
    token_count: int | None
    first_page: int | None
    last_page: int | None

Fields:

  • byte_start (int): UTF-8 byte offset in content (inclusive)
  • byte_end (int): UTF-8 byte offset in content (exclusive)
  • char_count (int): Number of characters in chunk
  • token_count (int | None): Estimated token count (if configured)
  • first_page (int | None): First page this chunk appears on (1-indexed, only when page boundaries available)
  • last_page (int | None): Last page this chunk appears on (1-indexed, only when page boundaries available)

Page tracking: When PageStructure.boundaries is available and chunking is enabled, first_page and last_page are automatically calculated based on byte offsets.

Example:

chunk_metadata.py
from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig, PageConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(chunk_size=500, overlap=50),
    pages=PageConfig(extract_pages=True)
)

result = extract_file_sync("document.pdf", config=config)

if result.chunks:
    for chunk in result.chunks:
        meta = chunk.metadata
        page_info = ""
        if meta.get('first_page'):
            if meta['first_page'] == meta.get('last_page'):
                page_info = f" (page {meta['first_page']})"
            else:
                page_info = f" (pages {meta['first_page']}-{meta.get('last_page')})"

        print(f"Chunk [{meta['byte_start']}:{meta['byte_end']}]: {len(chunk.text)} chars{page_info}")

ExtractedTable

Deprecated alias for Table. Use Table instead.


Extensibility

Custom Post-Processors

Create custom post-processors to add processing logic to the extraction pipeline.

Protocol:

Python
from kreuzberg import PostProcessorProtocol, ExtractionResult

class PostProcessorProtocol:
    def name(self) -> str:
        """Return unique processor name"""
        ...

    def process(self, result: ExtractionResult) -> ExtractionResult:
        """Process extraction result and return modified result"""
        ...

    def processing_stage(self) -> str:
        """Return processing stage: 'early', 'middle', or 'late'"""
        ...

Example:

basic_extraction.py
from kreuzberg import (
    PostProcessorProtocol,
    ExtractionResult,
    register_post_processor
)

class CustomProcessor:
    def name(self) -> str:
        return "custom_processor"

    def process(self, result: ExtractionResult) -> ExtractionResult:
        # Add custom field to metadata
        result["metadata"]["custom_field"] = "custom_value"
        return result

    def processing_stage(self) -> str:
        return "middle"

# Register the processor
register_post_processor(CustomProcessor())

# Now all extractions will use this processor
result = extract_file_sync("document.pdf")
print(result.metadata["custom_field"])  # "custom_value"

Managing Processors:

basic_extraction.py
from kreuzberg import (
    register_post_processor,
    unregister_post_processor,
    clear_post_processors
)

# Register
register_post_processor(CustomProcessor())

# Unregister by name
unregister_post_processor("custom_processor")

# Clear all processors
clear_post_processors()

Custom Validators

Create custom validators to validate extraction results.

Functions:

custom_validator.py
from kreuzberg import register_validator, unregister_validator, clear_validators

# Register a validator
register_validator(validator)

# Unregister by name
unregister_validator("validator_name")

# Clear all validators
clear_validators()

Error Handling

All errors inherit from KreuzbergError. See Error Handling Reference for complete documentation.

Exception Hierarchy:

  • KreuzbergError — Base exception for all extraction errors
  • ValidationError — Invalid configuration or input
  • ParsingError — Document parsing failure
  • OCRError — OCR processing failure
  • MissingDependencyError — Missing optional dependency

Example:

error_handling.py
from kreuzberg import (
    extract_file_sync,
    KreuzbergError,
    ValidationError,
    ParsingError,
    MissingDependencyError
)

try:
    result = extract_file_sync("document.pdf")
except ValidationError as e:
    print(f"Invalid input: {e}")
except ParsingError as e:
    print(f"Failed to parse document: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
    print(f"Install with: {e.install_command}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")

See Error Handling Reference for detailed error documentation and best practices.


Version Information

basic_extraction.py
import kreuzberg

print(kreuzberg.__version__)