Python API Reference¶
Complete reference for the Kreuzberg Python API.
Installation¶
With EasyOCR:
With API server:
With all features:
Core Functions¶
extract_file_sync()¶
Extract content from a file (synchronous).
Signature:
def extract_file_sync(
    file_path: str | Path,
    mime_type: str | None = None,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
Parameters:
- file_path (str | Path): Path to the file to extract
- mime_type (str | None): Optional MIME type hint. If None, the MIME type is auto-detected from the file extension and content
- config (ExtractionConfig | None): Extraction configuration. Uses defaults if None
- easyocr_kwargs (dict | None): EasyOCR initialization options (languages, use_gpu, beam_width, etc.)
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Raises:
- KreuzbergError: Base exception for all extraction errors
- ValidationError: Invalid configuration or file path
- ParsingError: Document parsing failure
- OCRError: OCR processing failure
- MissingDependencyError: Required system dependency not found
Example - Basic usage:
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
print(result.content)
print(f"Pages: {result.metadata['page_count']}")
Example - With OCR:
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng")
)
result = extract_file_sync("scanned.pdf", config=config)
Example - With EasyOCR custom options:
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="eng")
)
result = extract_file_sync(
    "scanned.pdf",
    config=config,
    easyocr_kwargs={"use_gpu": True, "beam_width": 10}
)
extract_file()¶
Extract content from a file (asynchronous).
Signature:
async def extract_file(
    file_path: str | Path,
    mime_type: str | None = None,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
Parameters:
Same as extract_file_sync().
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Examples:
import asyncio
from kreuzberg import extract_file
async def main():
    result = await extract_file("document.pdf")
    print(result.content)

asyncio.run(main())
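When many files are extracted from async code one call at a time, the calls can run concurrently with a cap on how many are in flight. The sketch below is a hedged illustration: `extract_all` and its `limit` parameter are hypothetical helpers, not part of Kreuzberg, and the extractor is passed in as an argument so that `extract_file` (or any coroutine function with the same call shape) can be supplied.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def extract_all(
    paths: list[str],
    extract: Callable[[str], Awaitable[Any]],
    limit: int = 4,
) -> list[Any]:
    """Run `extract` on every path, with at most `limit` calls in flight.

    Hypothetical helper: `extract` is expected to be a coroutine function
    such as kreuzberg.extract_file; it is injected rather than imported.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(path: str) -> Any:
        async with semaphore:
            return await extract(path)

    # gather() returns results in the same order as `paths`.
    return await asyncio.gather(*(run_one(p) for p in paths))
```

With Kreuzberg installed, a call might look like `asyncio.run(extract_all(["a.pdf", "b.pdf"], extract_file))`. For plain batches, the batch_extract_files() function documented below covers the same need directly.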
extract_bytes_sync()¶
Extract content from bytes (synchronous).
Signature:
def extract_bytes_sync(
    data: bytes | bytearray,
    mime_type: str,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
Parameters:
- data (bytes | bytearray): File content as bytes or bytearray
- mime_type (str): MIME type of the data (required for format detection)
- config (ExtractionConfig | None): Extraction configuration. Uses defaults if None
- easyocr_kwargs (dict | None): EasyOCR initialization options
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Examples:
from kreuzberg import extract_bytes_sync
with open("document.pdf", "rb") as f:
    data = f.read()
result = extract_bytes_sync(data, "application/pdf")
print(result.content)
extract_bytes()¶
Extract content from bytes (asynchronous).
Signature:
async def extract_bytes(
    data: bytes | bytearray,
    mime_type: str,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
Parameters:
Same as extract_bytes_sync().
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
batch_extract_files_sync()¶
Extract content from multiple files in parallel (synchronous).
Signature:
def batch_extract_files_sync(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
Parameters:
- paths (list[str | Path]): List of file paths to extract
- config (ExtractionConfig | None): Extraction configuration applied to all files
- easyocr_kwargs (dict | None): EasyOCR initialization options
Returns:
list[ExtractionResult]: List of extraction results (one per file)
Examples:
from kreuzberg import batch_extract_files_sync
paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
results = batch_extract_files_sync(paths)
for path, result in zip(paths, results):
    print(f"{path}: {len(result.content)} characters")
batch_extract_files()¶
Extract content from multiple files in parallel (asynchronous).
Signature:
async def batch_extract_files(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
Parameters:
Same as batch_extract_files_sync().
Returns:
list[ExtractionResult]: List of extraction results (one per file)
batch_extract_bytes_sync()¶
Extract content from multiple byte arrays in parallel (synchronous).
Signature:
def batch_extract_bytes_sync(
    data_list: list[bytes | bytearray],
    mime_types: list[str],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
Parameters:
- data_list (list[bytes | bytearray]): List of file contents as bytes/bytearray
- mime_types (list[str]): List of MIME types (one per data item, same length as data_list)
- config (ExtractionConfig | None): Extraction configuration applied to all items
- easyocr_kwargs (dict | None): EasyOCR initialization options
Returns:
list[ExtractionResult]: List of extraction results (one per data item)
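Because data_list and mime_types are parallel lists, it helps to build them together so they cannot drift out of step. A minimal sketch, assuming MIME types are guessed from file names with the standard-library mimetypes module; `build_batch_inputs` is a hypothetical helper, not a Kreuzberg function.

```python
import mimetypes


def build_batch_inputs(
    named_blobs: list[tuple[str, bytes]],
) -> tuple[list[bytes], list[str]]:
    """Split (filename, data) pairs into parallel data and MIME type lists.

    Hypothetical helper: the MIME type is guessed from the file name only,
    and unknown extensions raise instead of being silently mislabeled.
    """
    data_list: list[bytes] = []
    mime_types: list[str] = []
    for filename, data in named_blobs:
        mime_type, _encoding = mimetypes.guess_type(filename)
        if mime_type is None:
            raise ValueError(f"Cannot guess MIME type for {filename!r}")
        data_list.append(data)
        mime_types.append(mime_type)
    return data_list, mime_types


data_list, mime_types = build_batch_inputs(
    [("report.pdf", b"%PDF-1.7"), ("notes.txt", b"hello")]
)
# The two lists can then be passed as
# batch_extract_bytes_sync(data_list, mime_types).
```

Guessing from the file name is only a heuristic; when the bytes themselves are available, detecting the type from content is usually more reliable.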
batch_extract_bytes()¶
Extract content from multiple byte arrays in parallel (asynchronous).
Signature:
async def batch_extract_bytes(
    data_list: list[bytes | bytearray],
    mime_types: list[str],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
Parameters:
Same as batch_extract_bytes_sync().
Returns:
list[ExtractionResult]: List of extraction results (one per data item)
Configuration¶
ExtractionConfig¶
Deprecated API
The force_ocr parameter has been deprecated in favor of the new ocr configuration object.
**Old pattern (no longer supported):**
```python
config = ExtractionConfig(force_ocr=True)
```
**New pattern:**
```python
config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract")
)
```
The new approach provides more granular control over OCR behavior through the `OcrConfig` object.
Main configuration class for extraction operations.
Fields:
- use_cache (bool): Enable caching of extraction results to improve performance on repeated extractions. Default: True
- enable_quality_processing (bool): Enable quality post-processing to clean and normalize extracted text. Default: True
- ocr (OcrConfig | None): OCR configuration for extracting text from images and scanned documents. None = OCR disabled. Default: None
- force_ocr (bool): Force OCR processing even for searchable PDFs that contain extractable text. Useful for ensuring consistent formatting. Default: False
- chunking (ChunkingConfig | None): Text chunking configuration for dividing content into manageable chunks. None = chunking disabled. Default: None
- images (ImageExtractionConfig | None): Image extraction configuration for extracting images FROM documents (not for OCR preprocessing). None = no image extraction. Default: None
- pdf_options (PdfConfig | None): PDF-specific options like password handling and metadata extraction. None = use defaults. Default: None
- token_reduction (TokenReductionConfig | None): Token reduction configuration for reducing token count in extracted content (useful for LLM APIs). None = no token reduction. Default: None
- language_detection (LanguageDetectionConfig | None): Language detection configuration for identifying the language(s) in documents. None = no language detection. Default: None
- pages (PageConfig | None): Page extraction configuration for tracking and extracting page boundaries. None = no page tracking. Default: None
- keywords (KeywordConfig | None): Keyword extraction configuration for identifying important terms and phrases in content. None = no keyword extraction. Default: None
- postprocessor (PostProcessorConfig | None): Post-processor configuration for custom text processing. None = use defaults. Default: None
- max_concurrent_extractions (int | None): Maximum concurrent extractions in batch operations. None = num_cpus * 2. Default: None
- html_options (HtmlConversionOptions | None): HTML conversion options for converting documents to markdown. Default: None
- result_format (str): Result format for extraction output. Specifies whether results use unified format (all content in content field) or element-based format (with semantic elements for Unstructured-compatible output). Values: "unified" (default), "element_based". Default: "unified"
- output_format (str): Output content format. Controls the format of the extracted content. Values: "plain" (default), "markdown", "djot", "html". Default: "plain"
Example:
from kreuzberg import ExtractionConfig, OcrConfig, PdfConfig, extract_file_sync
config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng"),
    pdf_options=PdfConfig(
        passwords=["password1", "password2"],
        extract_images=True
    )
)
result = extract_file_sync("document.pdf", config=config)
Configuration loading:
- ExtractionConfig.from_file(path: str | Path) → ExtractionConfig: Load configuration from a file (.toml, .yaml, or .json by extension).
- ExtractionConfig.discover() → ExtractionConfig: Discover config from KREUZBERG_CONFIG_PATH or search for kreuzberg.toml / kreuzberg.yaml / kreuzberg.json in current and parent directories (raises if not found).
Module-level:
- load_extraction_config_from_file(path) → ExtractionConfig
- discover_extraction_config() → ExtractionConfig | None (returns None if no config file found)
OcrConfig¶
OCR processing configuration.
Fields:
- backend (str): OCR backend to use. Options: "tesseract", "easyocr", "paddleocr". Default: "tesseract"
- language (str): Language code for OCR (ISO 639-3). Default: "eng"
- tesseract_config (TesseractConfig | None): Tesseract-specific configuration. Default: None
Example - Basic OCR:
from kreuzberg import OcrConfig
ocr_config = OcrConfig(backend="tesseract", language="eng")
Example - With EasyOCR:
TesseractConfig¶
Tesseract OCR backend configuration.
Fields (common):
- psm (int): Page segmentation mode (0-13). Default: 3 (auto)
- oem (int): OCR engine mode (0-3). Default: 3 (Auto - Tesseract chooses based on build)
- enable_table_detection (bool): Enable table detection and extraction. Default: True
- tessedit_char_whitelist (str): Character whitelist (e.g., "0123456789" for digits only). Empty string = all characters. Default: ""
- tessedit_char_blacklist (str): Character blacklist. Empty string = none. Default: ""
- language (str): OCR language (ISO 639-3). Default: "eng"
- min_confidence (float): Minimum confidence (0.0-1.0) for accepting OCR results. Default: 0.0
- preprocessing (ImagePreprocessingConfig | None): Image preprocessing before OCR. Default: None
- output_format (str): OCR output format. Default: "markdown"
Additional fields (table thresholds, cache, tessedit options, etc.) are available; see the type stub for the full list.
Example:
from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig
config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(
            psm=6,
            enable_table_detection=True,
            tessedit_char_whitelist="0123456789"
        )
    )
)
PdfConfig¶
PDF-specific configuration.
Fields:
- extract_images (bool): Extract images from PDF documents. Default: False
- passwords (list[str] | None): List of passwords to try when opening encrypted PDFs. Try each password in order until one succeeds. Default: None
- extract_metadata (bool): Extract PDF metadata (title, author, creation date, etc.). Default: True
- hierarchy (HierarchyConfig | None): Document hierarchy detection configuration for detecting document structure and organization. None = no hierarchy detection. Default: None
Example:
from kreuzberg import PdfConfig
pdf_config = PdfConfig(
    passwords=["password1", "password2"],
    extract_images=True,
    extract_metadata=True
)
HierarchyConfig¶
Document hierarchy detection configuration (used with PdfConfig.hierarchy).
Fields:
- enabled (bool): Enable hierarchy detection. Default: True
- k_clusters (int): Number of clusters for k-means clustering. Default: 6
- include_bbox (bool): Include bounding box information in hierarchy output. Default: True
- ocr_coverage_threshold (float | None): Optional threshold for OCR coverage before enabling hierarchy detection. Default: None
PageConfig¶
Page extraction and tracking configuration.
Fields:
- extract_pages (bool): Enable page tracking and per-page extraction. Default: False
- insert_page_markers (bool): Insert page markers into content. Default: False
- marker_format (str): Marker template containing {page_num}. Default: "\n\n<!-- PAGE {page_num} -->\n\n"
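Since marker_format is a template containing {page_num}, the marker for a given page can be previewed with plain string handling, and marked-up content can be split back into pages. A minimal sketch, assuming the placeholder is filled by straightforward substitution of the page number; `render_marker` and `split_on_markers` are hypothetical helpers, not Kreuzberg functions.

```python
import re

DEFAULT_MARKER = "\n\n<!-- PAGE {page_num} -->\n\n"


def render_marker(marker_format: str, page_num: int) -> str:
    """Fill the {page_num} placeholder (assumed to be plain substitution)."""
    return marker_format.replace("{page_num}", str(page_num))


def split_on_markers(content: str, marker_format: str) -> dict[int, str]:
    """Map page number -> text that follows that page's marker."""
    # Escape the template, then turn the placeholder into a capture group.
    pattern = re.escape(marker_format).replace(
        re.escape("{page_num}"), r"(\d+)"
    )
    parts = re.split(pattern, content)
    # re.split yields [text_before_first_marker, num, text, num, text, ...]
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}


content = (
    render_marker(DEFAULT_MARKER, 1) + "First page"
    + render_marker(DEFAULT_MARKER, 2) + "Second page"
)
pages = split_on_markers(content, DEFAULT_MARKER)
```

When structured per-page data is needed, enabling extract_pages and reading result.pages avoids parsing markers at all; splitting on markers is mainly useful for content that has already been stored as a single string.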
ChunkingConfig¶
Text chunking configuration for splitting long documents.
Fields:
- max_chars (int): Maximum characters per chunk. Default: 1000
- max_overlap (int): Overlap between chunks in characters. Default: 200
- embedding (EmbeddingConfig | None): Embedding configuration for generating embeddings. Default: None
- preset (str | None): Chunking preset to use (e.g. from list_embedding_presets()). Default: None
Example:
from kreuzberg import ChunkingConfig
chunking_config = ChunkingConfig(
    max_chars=1000,
    max_overlap=200
)
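To see how max_chars and max_overlap interact, the following simplified sketch splits text into fixed-size windows where each chunk repeats the tail of the previous one. It is only an illustration under that assumption: `chunk_text` is hypothetical, and Kreuzberg's actual chunker may choose boundaries differently (for example at sentence or paragraph breaks).

```python
def chunk_text(text: str, max_chars: int = 1000, max_overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars, overlapping by max_overlap."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")

    step = max_chars - max_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end
        start += step
    return chunks


chunks = chunk_text("abcdefghij", max_chars=4, max_overlap=1)
# chunks == ["abcd", "defg", "ghij"]: each chunk starts with the last
# character of the previous one.
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same text.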
LanguageDetectionConfig¶
Language detection configuration.
Fields:
- enabled (bool): Enable language detection. Default: True
- min_confidence (float): Minimum confidence threshold (0.0-1.0). Default: 0.8
- detect_multiple (bool): Detect multiple languages in the document. When False, only the most confident language is returned. Default: False
Example:
from kreuzberg import LanguageDetectionConfig
lang_config = LanguageDetectionConfig(
    enabled=True,
    min_confidence=0.7
)
KeywordConfig¶
Keyword extraction configuration (used with ExtractionConfig.keywords).
Fields:
- algorithm (KeywordAlgorithm): Algorithm to use. Values: KeywordAlgorithm.Yake, KeywordAlgorithm.Rake. Default: Yake
- max_keywords (int): Maximum number of keywords to extract. Default: 10
- min_score (float): Minimum score threshold. Default: 0.0
- ngram_range (tuple[int, int]): N-gram range (min, max). Default: (1, 3)
- language (str | None): Optional language hint. Default: "en"
- yake_params (YakeParams | None): YAKE-specific tuning (e.g. window_size). Default: None
- rake_params (RakeParams | None): RAKE-specific tuning (min_word_length, max_words_per_phrase). Default: None
ImageExtractionConfig¶
Image extraction configuration.
Fields:
- extract_images (bool): Enable image extraction from documents. Default: True
- target_dpi (int): Target DPI for image normalization. Default: 300
- max_image_dimension (int): Maximum width or height for extracted images. Default: 4096
- auto_adjust_dpi (bool): Automatically adjust DPI based on image content. Default: True
- min_dpi (int): Minimum DPI threshold. Default: 72
- max_dpi (int): Maximum DPI threshold. Default: 600
TokenReductionConfig¶
Token reduction configuration for compressing extracted text.
Fields:
- mode (str): Token reduction mode. Options: "off", "light", "moderate", "aggressive", "maximum". Default: "off"
  - "off": No token reduction
  - "light": Remove extra whitespace and redundant punctuation
  - "moderate": Also remove common filler words and some formatting
  - "aggressive": Also remove longer stopwords and collapse similar phrases
  - "maximum": Maximum reduction while preserving semantic content
- preserve_important_words (bool): Preserve important words (capitalized, technical terms) even in aggressive reduction modes. Default: True
PostProcessorConfig¶
Post-processing configuration.
Fields:
- enabled (bool): Enable post-processors in the extraction pipeline. Default: True
- enabled_processors (list[str] | None): Whitelist of processor names to run. If specified, only these processors are executed. None = run all enabled. Default: None
- disabled_processors (list[str] | None): Blacklist of processor names to skip. If specified, these processors are not executed. None = none disabled. Default: None
ImagePreprocessingConfig¶
Image preprocessing configuration for OCR (used with TesseractConfig.preprocessing).
Fields:
- target_dpi (int): Target DPI for image preprocessing. Default: 300
- auto_rotate (bool): Auto-rotate images based on orientation. Default: True
- deskew (bool): Correct skewed images. Default: True
- denoise (bool): Apply denoising filter. Default: False
- contrast_enhance (bool): Enhance contrast. Default: False
- binarization_method (str): Binarization method (e.g., "otsu"). Default: "otsu"
- invert_colors (bool): Invert colors (e.g., white text on black). Default: False
Results & Types¶
ExtractionResult¶
Result object returned by all extraction functions.
Type Definition:
class ExtractionResult:
    content: str
    mime_type: str
    metadata: Metadata
    tables: list[ExtractedTable]
    detected_languages: list[str] | None
    chunks: list[Chunk] | None
    images: list[ExtractedImage] | None
    pages: list[PageContent] | None
    elements: list[Element] | None
    djot_content: DjotContent | None
    output_format: str | None
    result_format: str | None

    def get_page_count(self) -> int: ...
    def get_chunk_count(self) -> int: ...
    def get_detected_language(self) -> str | None: ...
    def get_metadata_field(self, field_name: str) -> Any | None: ...
Fields:
- content (str): Extracted text content
- mime_type (str): MIME type of the processed document
- metadata (Metadata): Document metadata (format-specific fields)
- tables (list[ExtractedTable]): List of extracted tables
- detected_languages (list[str] | None): List of detected language codes (ISO 639-1) if language detection is enabled
- chunks (list[Chunk] | None): Text chunks when chunking is enabled via ChunkingConfig. Each chunk has content (str), metadata (ChunkMetadata), and optionally embedding (list[float] | None).
- images (list[ExtractedImage] | None): Extracted images when image extraction is enabled
- pages (list[PageContent] | None): Per-page extracted content when page extraction is enabled via PageConfig(extract_pages=True)
- elements (list[Element] | None): Semantic elements when result_format="element_based"
- djot_content (DjotContent | None): Structured djot content when output_format="djot"
- output_format (str | None): Requested output format ("plain", "markdown", "djot", "html")
- result_format (str | None): Result layout ("unified" or "element_based")
Methods:
- get_page_count() → int: Number of pages (from metadata when available)
- get_chunk_count() → int: Number of chunks (0 if chunking disabled)
- get_detected_language() → str | None: Primary detected language code
- get_metadata_field(field_name: str) → Any | None: Get a metadata field by name
Example:
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
print(f"Content: {result.content}")
print(f"MIME type: {result.mime_type}")
print(f"Page count: {result.metadata.get('page_count')}")
print(f"Tables: {len(result.tables)}")
if result.detected_languages:
    print(f"Languages: {', '.join(result.detected_languages)}")
pages¶
Type: list[PageContent] | None
Per-page extracted content when page extraction is enabled via PageConfig(extract_pages=True).
Each page contains:
- Page number (1-indexed)
- Text content for that page
- Tables on that page
- Images on that page
Example:
from kreuzberg import extract_file_sync, ExtractionConfig, PageConfig
config = ExtractionConfig(
    pages=PageConfig(extract_pages=True)
)
result = extract_file_sync("document.pdf", config=config)
if result.pages:
    for page in result.pages:
        print(f"Page {page.page_number}:")
        print(f" Content: {len(page.content)} chars")
        print(f" Tables: {len(page.tables)}")
        print(f" Images: {len(page.images)}")
Accessing Per-Page Content¶
When page extraction is enabled, access individual pages and iterate over them:
from kreuzberg import extract_file_sync, ExtractionConfig, PageConfig
config = ExtractionConfig(
    pages=PageConfig(
        extract_pages=True,
        insert_page_markers=True,
        marker_format="\n\n--- Page {page_num} ---\n\n"
    )
)
result = extract_file_sync("document.pdf", config=config)
# Access combined content with page markers
print("Combined content with markers:")
print(result.content[:500])
print()
# Access per-page content
if result.pages:
    for page in result.pages:
        print(f"Page {page.page_number}:")
        print(f" {page.content[:100]}...")
        if page.tables:
            print(f" Found {len(page.tables)} table(s)")
        if page.images:
            print(f" Found {len(page.images)} image(s)")
Metadata¶
Strongly-typed metadata dictionary. Fields vary by document format.
Common Fields:
- language (str): Document language (ISO 639-1 code)
- created_at (str): Creation date (ISO 8601 format)
- modified_at (str): Modification date (ISO 8601 format)
- created_by (str): Creator application
- modified_by (str): Last modifier application
- subject (str): Document subject
- format_type (str): Format discriminator ("pdf", "excel", "email", "pptx", "archive", "image", "xml", "text", "html", "ocr")
PDF-Specific Fields (when format_type == "pdf"):
- title (str): PDF title
- authors (list[str]): PDF author(s)
- page_count (int): Number of pages
- created_at (str): Creation date (ISO 8601)
- modified_at (str): Modification date (ISO 8601)
- created_by (str): Creator application
- producer (str): Producer application
- keywords (str): PDF keywords
- subject (str): PDF subject
Excel-Specific Fields (when format_type == "excel"):
- sheet_count (int): Number of sheets
- sheet_names (list[str]): List of sheet names
Email-Specific Fields (when format_type == "email"):
- from_email (str): Sender email address
- from_name (str): Sender name
- to_emails (list[str]): Recipient email addresses
- cc_emails (list[str]): CC email addresses
- bcc_emails (list[str]): BCC email addresses
- message_id (str): Email message ID
- attachments (list[str]): List of attachment filenames
Example:
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
metadata = result.metadata
if metadata.get("format_type") == "pdf":
    print(f"Title: {metadata.get('title')}")
    print(f"Authors: {metadata.get('authors')}")
    print(f"Pages: {metadata.get('page_count')}")
See the Types Reference for complete metadata field documentation.
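Since the available fields depend on format_type, code that handles mixed inputs usually branches on it. The sketch below is a hedged illustration that works on a plain dict with the field names listed above; `summarize_metadata` is a hypothetical helper, and the sample dicts are made up rather than real extraction output.

```python
from typing import Any


def summarize_metadata(metadata: dict[str, Any]) -> str:
    """Build a one-line summary from the format-specific fields."""
    format_type = metadata.get("format_type")
    if format_type == "pdf":
        authors = ", ".join(metadata.get("authors") or []) or "unknown"
        title = metadata.get("title", "untitled")
        pages = metadata.get("page_count", "?")
        return f"PDF: {title} by {authors}, {pages} page(s)"
    if format_type == "excel":
        names = ", ".join(metadata.get("sheet_names") or [])
        return f"Excel: {metadata.get('sheet_count', '?')} sheet(s): {names}"
    if format_type == "email":
        recipients = ", ".join(metadata.get("to_emails") or [])
        return f"Email from {metadata.get('from_email', 'unknown')} to {recipients}"
    return f"Other format: {format_type}"


# Made-up sample values for illustration only.
pdf_summary = summarize_metadata(
    {"format_type": "pdf", "title": "Report", "authors": ["A", "B"], "page_count": 3}
)
```

Using .get() with fallbacks throughout keeps the helper tolerant of documents where a field is absent.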
ExtractedTable¶
Extracted table structure. The API type is ExtractedTable (same shape as below).
Type Definition:
Fields:
- cells (list[list[str]]): 2D array of table cells (rows x columns)
- markdown (str): Table rendered as markdown
- page_number (int): Page number where table was found
Example:
from kreuzberg import extract_file_sync
result = extract_file_sync("invoice.pdf")
for table in result.tables:
    print(f"Table on page {table.page_number}:")
    print(table.markdown)
    print()
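Because cells is a plain 2D list of strings, a table can be handed to standard-library tools without further conversion. A minimal sketch, assuming the first row of cells holds the column headers (which may not be true for every table); `cells_to_csv` and `cells_to_records` are hypothetical helpers, and the sample data is made up.

```python
import csv
import io


def cells_to_csv(cells: list[list[str]]) -> str:
    """Serialize a 2D list of cell strings as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(cells)
    return buffer.getvalue()


def cells_to_records(cells: list[list[str]]) -> list[dict[str, str]]:
    """Treat the first row as headers and return one dict per data row."""
    if not cells:
        return []
    header, *rows = cells
    return [dict(zip(header, row)) for row in rows]


# Made-up table content; in practice this would be table.cells.
cells = [["Item", "Price"], ["Widget", "9.99"], ["Gadget, large", "19.50"]]
csv_text = cells_to_csv(cells)
records = cells_to_records(cells)
```

The csv module takes care of quoting cells that contain commas, which joining the strings by hand would get wrong.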
ChunkMetadata¶
Metadata for a single text chunk.
Type Definition:
class ChunkMetadata(TypedDict, total=False):
    byte_start: int
    byte_end: int
    chunk_index: int
    total_chunks: int
    token_count: int | None
    first_page: int
    last_page: int
Fields:
- byte_start (int): UTF-8 byte offset in content (inclusive)
- byte_end (int): UTF-8 byte offset in content (exclusive)
- chunk_index (int): Zero-based index of this chunk in the document
- total_chunks (int): Total number of chunks for the document
- token_count (int | None): Estimated token count (if configured)
- first_page (int): First page this chunk appears on (1-indexed, only when page boundaries available)
- last_page (int): Last page this chunk appears on (1-indexed, only when page boundaries available)
Page tracking: When PageStructure.boundaries is available and chunking is enabled, first_page and last_page are automatically calculated based on byte offsets.
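byte_start and byte_end are UTF-8 byte offsets, not character indices, so slicing the Python string directly gives wrong results as soon as the text contains non-ASCII characters. A minimal sketch of the conversion, using made-up offsets rather than real extraction output; `slice_by_bytes` is a hypothetical helper.

```python
def slice_by_bytes(content: str, byte_start: int, byte_end: int) -> str:
    """Return the text covered by a [byte_start, byte_end) UTF-8 byte range."""
    encoded = content.encode("utf-8")
    return encoded[byte_start:byte_end].decode("utf-8")


content = "Größe: 10 cm"
# "ö" and "ß" each take two bytes in UTF-8, so "Größe" spans bytes 0..7.
word = slice_by_bytes(content, 0, 7)
naive = content[0:7]  # character slicing overshoots by two characters
```

Here `word` is "Größe", while the naive character slice also picks up the colon and the space that follow it.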
Example:
from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig, PageConfig
config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=500, max_overlap=50),
    pages=PageConfig(extract_pages=True)
)
result = extract_file_sync("document.pdf", config=config)
if result.chunks:
    for chunk in result.chunks:
        meta = chunk.metadata
        page_info = ""
        if meta.get('first_page'):
            if meta['first_page'] == meta.get('last_page'):
                page_info = f" (page {meta['first_page']})"
            else:
                page_info = f" (pages {meta['first_page']}-{meta.get('last_page')})"
        print(f"Chunk [{meta['byte_start']}:{meta['byte_end']}]: {len(chunk.content)} chars{page_info}")
Extensibility¶
Custom Post-Processors¶
Create custom post-processors to add processing logic to the extraction pipeline.
Protocol:
from kreuzberg import ExtractionResult

# Shape of the protocol that kreuzberg exports as PostProcessorProtocol
class PostProcessorProtocol:
    def name(self) -> str:
        """Return unique processor name"""
        ...

    def process(self, result: ExtractionResult) -> ExtractionResult:
        """Process extraction result and return modified result"""
        ...

    def processing_stage(self) -> str:
        """Return processing stage: 'early', 'middle', or 'late'"""
        ...
Optional lifecycle methods: initialize() (called when registered), shutdown() (called when unregistered).
Example:
from kreuzberg import (
    PostProcessorProtocol,
    ExtractionResult,
    extract_file_sync,
    register_post_processor
)

class CustomProcessor:
    def name(self) -> str:
        return "custom_processor"

    def process(self, result: ExtractionResult) -> ExtractionResult:
        # Add custom field to metadata
        result.metadata["custom_field"] = "custom_value"
        return result

    def processing_stage(self) -> str:
        return "middle"

# Register the processor
register_post_processor(CustomProcessor())
# Now all extractions will use this processor
result = extract_file_sync("document.pdf")
print(result.metadata["custom_field"])  # "custom_value"
Managing Processors:
from kreuzberg import (
    register_post_processor,
    unregister_post_processor,
    clear_post_processors
)
# Register
register_post_processor(CustomProcessor())
# Unregister by name
unregister_post_processor("custom_processor")
# Clear all processors
clear_post_processors()
Custom Validators¶
Create custom validators to validate extraction results.
ValidatorProtocol: Implement:
- name() -> str
- validate(result: ExtractionResult) -> None (raise to fail)
- Optional: priority() -> int (default 50, higher runs first)
- Optional: should_validate(result: ExtractionResult) -> bool (default True)
- Optional lifecycle: initialize(), shutdown()
Functions:
from kreuzberg import register_validator, unregister_validator, clear_validators
# Register a validator
register_validator(validator)
# Unregister by name
unregister_validator("validator_name")
# Clear all validators
clear_validators()
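A validator only needs the methods listed under ValidatorProtocol, so a plain class is enough. The sketch below is an illustrative example, not a documented recipe: `MinLengthValidator` and its threshold are hypothetical, ValueError stands in for whichever exception type is appropriate, and a SimpleNamespace replaces a real ExtractionResult so the logic can be tried without running an extraction.

```python
from types import SimpleNamespace


class MinLengthValidator:
    """Reject results whose content is shorter than a minimum length."""

    def __init__(self, min_chars: int = 10) -> None:
        self.min_chars = min_chars

    def name(self) -> str:
        return "min_length_validator"

    def priority(self) -> int:
        # Optional; higher values run first (the documented default is 50).
        return 100

    def validate(self, result) -> None:
        # Raising signals failure; returning None signals success.
        if len(result.content) < self.min_chars:
            raise ValueError(
                f"Content too short: {len(result.content)} < {self.min_chars}"
            )


validator = MinLengthValidator(min_chars=5)
# Stand-in for an ExtractionResult; only the `content` attribute is used.
fake_result = SimpleNamespace(content="hello world")
validator.validate(fake_result)  # passes silently
# With Kreuzberg installed: register_validator(validator)
```

Keeping the check free of Kreuzberg imports makes the validator easy to unit test before it is registered.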
Error Handling¶
All errors inherit from KreuzbergError. See Error Handling Reference for complete documentation.
Exception Hierarchy:
- KreuzbergError: Base exception for all extraction errors
  - ValidationError: Invalid configuration or input
  - ParsingError: Document parsing failure
  - OCRError: OCR processing failure
  - MissingDependencyError: Missing optional dependency
  - CacheError: Cache read/write failure
  - ImageProcessingError: Image processing failure
  - PluginError: Plugin (post-processor, validator, OCR backend) failure
Example:
from kreuzberg import (
    extract_file_sync,
    KreuzbergError,
    ValidationError,
    ParsingError,
    MissingDependencyError
)
try:
    result = extract_file_sync("document.pdf")
except ValidationError as e:
    print(f"Invalid input: {e}")
except ParsingError as e:
    print(f"Failed to parse document: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
    print(f"Install with: {e.install_command}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
Error inspection:
- get_last_error_code() → int | None
- get_error_details() → dict (message, error_code, error_type, source_file, source_line, is_panic, etc.)
- classify_error(message: str) → int
- error_code_name(code: int) → str
See Error Handling Reference for detailed error documentation and best practices.
Utilities¶
- detect_mime_type(data: bytes | bytearray) → str: Detect MIME type from file bytes (e.g. for extract_bytes_sync).
- detect_mime_type_from_path(path: str | Path) → str: Detect MIME type from file path (reads file).
- get_extensions_for_mime(mime_type: str) → list[str]: Return file extensions associated with a MIME type.