C API Reference v5.0.0-rc.3

Functions

kreuzberg_extract_bytes()

Extract content from a byte array.

This is the main entry point for in-memory extraction. It performs the following steps:

  1. Validate MIME type
  2. Handle legacy format conversion if needed
  3. Select appropriate extractor from registry
  4. Extract content
  5. Run post-processing pipeline

Returns:

An ExtractionResult containing the extracted content and metadata.

Errors:

Returns KreuzbergError.Validation if MIME type is invalid. Returns KreuzbergError.UnsupportedFormat if MIME type is not supported.

Signature:

KreuzbergExtractionResult* kreuzberg_extract_bytes(const uint8_t* content, const char* mime_type, KreuzbergExtractionConfig config);

Parameters:

Name Type Required Description
content const uint8_t* Yes The byte array to extract
mime_type const char* Yes MIME type of the content
config KreuzbergExtractionConfig Yes Extraction configuration

Returns: KreuzbergExtractionResult*

Errors: Returns NULL on error.


kreuzberg_extract_file()

Extract content from a file.

This is the main entry point for file-based extraction. It performs the following steps:

  1. Check cache for existing result (if caching enabled)
  2. Detect or validate MIME type
  3. Select appropriate extractor from registry
  4. Extract content
  5. Run post-processing pipeline
  6. Store result in cache (if caching enabled)
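
The cache-related steps above (1 and 6) follow a check-then-store pattern around the extraction itself. The sketch below shows that pattern in isolation. ResultCache, extract_file_cached, and fake_extract are hypothetical names and not part of the Kreuzberg API, and keying the cache by path alone is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 8
#define MAX_TEXT 256

// Hypothetical fixed-size cache keyed by path only (an assumption).
typedef struct {
    bool used;
    char path[MAX_TEXT];
    char content[MAX_TEXT];
} CacheSlot;

typedef struct {
    CacheSlot slots[CACHE_SLOTS];
    size_t next; // next slot to overwrite (round-robin)
} ResultCache;

// Stand-in for steps 2-5; a real extractor would parse the file.
typedef void (*ExtractFn)(const char *path, char *out, size_t out_len);

static int fake_extract_calls = 0;

void fake_extract(const char *path, char *out, size_t out_len) {
    fake_extract_calls++;
    snprintf(out, out_len, "content of %s", path);
}

void extract_file_cached(ResultCache *cache, bool use_cache, const char *path,
                         ExtractFn extract, char *out, size_t out_len) {
    assert(cache != NULL && path != NULL && extract != NULL);
    assert(out != NULL && out_len > 0);
    // Paths that do not fit in a slot bypass the cache in this sketch.
    bool cacheable = use_cache && strlen(path) < MAX_TEXT;

    if (cacheable) { // step 1: check cache for an existing result
        for (size_t i = 0; i < CACHE_SLOTS; i++) {
            if (cache->slots[i].used && strcmp(cache->slots[i].path, path) == 0) {
                snprintf(out, out_len, "%s", cache->slots[i].content);
                return;
            }
        }
    }

    extract(path, out, out_len); // steps 2-5: detect, select, extract, post-process

    if (cacheable) { // step 6: store the result (truncated to the slot size)
        CacheSlot *slot = &cache->slots[cache->next];
        cache->next = (cache->next + 1) % CACHE_SLOTS;
        slot->used = true;
        snprintf(slot->path, sizeof slot->path, "%s", path);
        snprintf(slot->content, sizeof slot->content, "%s", out);
    }
}
```

With caching enabled, a second call for the same path is served from the cache and the extractor is not invoked again. With caching disabled, every call runs the extractor.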

Returns:

An ExtractionResult containing the extracted content and metadata.

Errors:

Returns KreuzbergError.Io if the file doesn't exist (NotFound) or for other file I/O errors. Returns KreuzbergError.UnsupportedFormat if MIME type is not supported.

Signature:

KreuzbergExtractionResult* kreuzberg_extract_file(const char* path, const char* mime_type, KreuzbergExtractionConfig config);

Parameters:

Name Type Required Description
path const char* Yes Path to the file to extract
mime_type const char* No Optional MIME type override. If NULL, the MIME type is auto-detected.
config KreuzbergExtractionConfig Yes Extraction configuration

Returns: KreuzbergExtractionResult*

Errors: Returns NULL on error.


kreuzberg_extract_file_sync()

Synchronous wrapper for extract_file.

This is a convenience function that blocks the current thread until extraction completes. For async code, use extract_file directly.

Always uses the global Tokio runtime, which is 100x+ faster than creating a new runtime per call and avoids nested-runtime issues.

This function is only available with the tokio-runtime feature. For WASM targets, use a truly synchronous extraction approach instead.

Signature:

KreuzbergExtractionResult* kreuzberg_extract_file_sync(const char* path, const char* mime_type, KreuzbergExtractionConfig config);

Parameters:

Name Type Required Description
path const char* Yes Path to the file
mime_type const char* No Optional MIME type override, as for kreuzberg_extract_file()
config KreuzbergExtractionConfig Yes The configuration options

Returns: KreuzbergExtractionResult*

Errors: Returns NULL on error.


kreuzberg_extract_bytes_sync()

Synchronous wrapper for extract_bytes.

Uses the global Tokio runtime for 100x+ performance improvement over creating a new runtime per call.

With the tokio-runtime feature, this blocks the current thread using the global Tokio runtime. Without it (WASM), this calls a truly synchronous implementation.

Signature:

KreuzbergExtractionResult* kreuzberg_extract_bytes_sync(const uint8_t* content, const char* mime_type, KreuzbergExtractionConfig config);

Parameters:

Name Type Required Description
content const uint8_t* Yes The content to process
mime_type const char* Yes The mime type
config KreuzbergExtractionConfig Yes The configuration options

Returns: KreuzbergExtractionResult*

Errors: Returns NULL on error.


kreuzberg_batch_extract_files_sync()

Synchronous wrapper for batch_extract_files.

Uses the global Tokio runtime for optimal performance. Only available with tokio-runtime (WASM has no filesystem).

Signature:

KreuzbergExtractionResult* kreuzberg_batch_extract_files_sync(KreuzbergBatchFileItem* items, KreuzbergExtractionConfig config);

Parameters:

Name Type Required Description
items KreuzbergBatchFileItem* Yes The items
config KreuzbergExtractionConfig Yes The configuration options

Returns: KreuzbergExtractionResult*

Errors: Returns NULL on error.


kreuzberg_batch_extract_bytes_sync()

Synchronous wrapper for batch_extract_bytes.

Uses the global Tokio runtime for optimal performance. With the tokio-runtime feature, this blocks the current thread using the global Tokio runtime. Without it (WASM), this calls a truly synchronous implementation that iterates through items and calls extract_bytes_sync().

Signature:

KreuzbergExtractionResult* kreuzberg_batch_extract_bytes_sync(KreuzbergBatchBytesItem* items, KreuzbergExtractionConfig config);

Parameters:

Name Type Required Description
items KreuzbergBatchBytesItem* Yes The items
config KreuzbergExtractionConfig Yes The configuration options

Returns: KreuzbergExtractionResult*

Errors: Returns NULL on error.


kreuzberg_batch_extract_files()

Extract content from multiple files concurrently.

This function processes multiple files in parallel, automatically managing concurrency to prevent resource exhaustion. The concurrency limit can be configured via ExtractionConfig.max_concurrent_extractions or defaults to (num_cpus * 1.5).ceil().
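
As a rough illustration of the default described above, ceil(num_cpus * 1.5) can be computed in integer arithmetic. The helper names below are hypothetical, and modelling an unset max_concurrent_extractions as a NULL pointer is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: ceil(num_cpus * 1.5) without floating point.
// For non-negative n, ceil(3n / 2) equals (3n + 1) / 2 with integer division.
size_t default_concurrency_limit(size_t num_cpus) {
    assert(num_cpus > 0);
    return (3 * num_cpus + 1) / 2;
}

// Hypothetical helper: an explicit limit wins; NULL means "not configured".
size_t effective_concurrency_limit(const size_t *configured, size_t num_cpus) {
    return configured != NULL ? *configured : default_concurrency_limit(num_cpus);
}
```

Under these assumptions, an 8-core machine with no configured limit runs at most 12 extractions at a time, and a 3-core machine runs at most 5.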

Each file can optionally specify a FileExtractionConfig that overrides specific fields from the batch-level config. Pass NULL for a file to use the batch defaults. Batch-level settings like max_concurrent_extractions and use_cache are always taken from the batch-level config.

  • items - Array of BatchFileItem structs, each containing a path and optional per-file configuration overrides.
  • config - Batch-level extraction configuration (provides defaults and batch settings)

Returns:

A vector of ExtractionResult in the same order as the input items.

Errors:

Individual file errors are captured in the result metadata. System errors (IO, RuntimeError equivalents) will bubble up and fail the entire batch.

Simple usage with no per-file overrides:

Per-file configuration overrides:

Signature:

KreuzbergExtractionResult* kreuzberg_batch_extract_files(KreuzbergBatchFileItem* items, KreuzbergExtractionConfig config);

Parameters:

Name Type Required Description
items KreuzbergBatchFileItem* Yes Array of BatchFileItem structs, each containing a path and optional per-file configuration overrides
config KreuzbergExtractionConfig Yes Batch-level extraction configuration (provides defaults and batch settings)

Returns: KreuzbergExtractionResult*

Errors: Returns NULL on error.


kreuzberg_batch_extract_bytes()

Extract content from multiple byte arrays concurrently.

This function processes multiple byte arrays in parallel, automatically managing concurrency to prevent resource exhaustion. The concurrency limit can be configured via ExtractionConfig.max_concurrent_extractions or defaults to (num_cpus * 1.5).ceil().

Each item can optionally specify a FileExtractionConfig that overrides specific fields from the batch-level config. Pass NULL as the config to use the batch-level defaults for that item.

  • items - Array of BatchBytesItem structs, each containing content bytes, MIME type, and optional per-item configuration overrides.
  • config - Batch-level extraction configuration

Returns:

A vector of ExtractionResult in the same order as the input items.

Simple usage with no per-item overrides:

Per-item configuration overrides:

Signature:

KreuzbergExtractionResult* kreuzberg_batch_extract_bytes(KreuzbergBatchBytesItem* items, KreuzbergExtractionConfig config);

Parameters:

Name Type Required Description
items KreuzbergBatchBytesItem* Yes Array of BatchBytesItem structs, each containing content bytes, MIME type, and optional per-item configuration overrides
config KreuzbergExtractionConfig Yes Batch-level extraction configuration

Returns: KreuzbergExtractionResult*

Errors: Returns NULL on error.


kreuzberg_detect_mime_type_from_bytes()

Detect MIME type from raw file bytes.

Uses magic byte signatures to detect file type from content. Falls back to infer crate for comprehensive detection.

For ZIP-based files, inspects contents to distinguish Office Open XML formats (DOCX, XLSX, PPTX) from plain ZIP archives.
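
Magic-byte detection of the kind described above can be sketched as a prefix comparison against a small signature table. The names below are hypothetical and the table is far from complete. The sketch takes an explicit length parameter, which is an assumption, and it stops at application/zip without the ZIP content inspection mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical signature table entry: a byte prefix and the MIME type it implies.
typedef struct {
    const uint8_t *magic;
    size_t len;
    const char *mime_type;
} MagicSignature;

static const uint8_t PDF_MAGIC[] = {'%', 'P', 'D', 'F', '-'};
static const uint8_t PNG_MAGIC[] = {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
static const uint8_t ZIP_MAGIC[] = {'P', 'K', 0x03, 0x04};

static const MagicSignature SIGNATURES[] = {
    {PDF_MAGIC, sizeof PDF_MAGIC, "application/pdf"},
    {PNG_MAGIC, sizeof PNG_MAGIC, "image/png"},
    {ZIP_MAGIC, sizeof ZIP_MAGIC, "application/zip"},
};

// Returns the MIME type of the first matching signature, or NULL if none matches.
const char *sniff_mime_type(const uint8_t *content, size_t len) {
    assert(content != NULL || len == 0);
    for (size_t i = 0; i < sizeof SIGNATURES / sizeof SIGNATURES[0]; i++) {
        const MagicSignature *sig = &SIGNATURES[i];
        if (len >= sig->len && memcmp(content, sig->magic, sig->len) == 0) {
            return sig->mime_type;
        }
    }
    return NULL;
}
```

A DOCX, XLSX, or PPTX file would match the ZIP signature here, which is why a real detector has to look inside the archive before settling on a type.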

Returns:

The detected MIME type string.

Errors:

Returns KreuzbergError.UnsupportedFormat if MIME type cannot be determined.

Signature:

const char* kreuzberg_detect_mime_type_from_bytes(const uint8_t* content);

Parameters:

Name Type Required Description
content const uint8_t* Yes Raw file bytes

Returns: const char*

Errors: Returns NULL on error.


kreuzberg_get_extensions_for_mime()

Get file extensions for a given MIME type.

Returns all known file extensions that map to the specified MIME type.

Returns:

A vector of file extensions (without leading dot) for the MIME type.

Signature:

const char** kreuzberg_get_extensions_for_mime(const char* mime_type);

Parameters:

Name Type Required Description
mime_type const char* Yes The MIME type to look up

Returns: const char**

Errors: Returns NULL on error.


kreuzberg_clear_embedding_backends()

Clear all embedding backends from the global registry.

Calls shutdown() on every registered backend, then empties the registry.

Errors:

  • Any error returned by a backend's shutdown() method. The first error encountered stops processing of remaining backends.

Signature:

void kreuzberg_clear_embedding_backends();

Returns: void

Errors: This function has no return value, so failures are not reported through it.


kreuzberg_list_embedding_backends()

List the names of all registered embedding backends.

Used by kreuzberg-cli and the api/mcp endpoints; excluded from the language bindings via alef.toml [exclude].functions.

Signature:

const char** kreuzberg_list_embedding_backends();

Returns: const char**

Errors: Returns NULL on error.


kreuzberg_list_document_extractors()

List names of all registered document extractors.

Signature:

const char** kreuzberg_list_document_extractors();

Returns: const char**

Errors: Returns NULL on error.


kreuzberg_clear_document_extractors()

Clear all document extractors from the global registry.

Calls shutdown() on every registered extractor, then empties the registry.

Errors:

  • Any error returned by an extractor's shutdown() method. The first error encountered stops processing of remaining extractors.

Signature:

void kreuzberg_clear_document_extractors();

Returns: void

Errors: This function has no return value, so failures are not reported through it.


kreuzberg_list_ocr_backends()

List all registered OCR backends.

Returns the names of all OCR backends currently registered in the global registry.

Returns:

A vector of OCR backend names.

Signature:

const char** kreuzberg_list_ocr_backends();

Returns: const char**

Errors: Returns NULL on error.


kreuzberg_clear_ocr_backends()

Clear all OCR backends from the global registry.

Removes all OCR backends and calls their shutdown() methods.

Returns:

  • Success if all backends were cleared successfully
  • An error if any shutdown method failed

Signature:

void kreuzberg_clear_ocr_backends();

Returns: void

Errors: This function has no return value, so failures are not reported through it.


kreuzberg_list_post_processors()

List all registered post-processor names.

Returns a vector of all post-processor names currently registered in the global registry.

Returns:

  • On success, an array of post-processor names
  • An error if the registry lock is poisoned

Signature:

const char** kreuzberg_list_post_processors();

Returns: const char**

Errors: Returns NULL on error.


kreuzberg_clear_post_processors()

Remove all registered post-processors.

Signature:

void kreuzberg_clear_post_processors();

Returns: void

Errors: This function has no return value, so failures are not reported through it.


kreuzberg_list_renderers()

List names of all registered renderers.

Errors:

Returns an error if the registry lock is poisoned.

Signature:

const char** kreuzberg_list_renderers();

Returns: const char**

Errors: Returns NULL on error.


kreuzberg_clear_renderers()

Clear all renderers from the global registry.

Removes every renderer, including the built-in defaults (markdown, html, djot, plain). After calling this no renderers are registered; re-register as needed.

Errors:

Returns an error if the registry lock is poisoned.

Signature:

void kreuzberg_clear_renderers();

Returns: void

Errors: This function has no return value, so failures are not reported through it.


kreuzberg_list_validators()

List names of all registered validators.

Signature:

const char** kreuzberg_list_validators();

Returns: const char**

Errors: Returns NULL on error.


kreuzberg_clear_validators()

Remove all registered validators.

Signature:

void kreuzberg_clear_validators();

Returns: void

Errors: This function has no return value, so failures are not reported through it.


kreuzberg_embed_texts_async()

Generate embeddings asynchronously for a list of text strings.

This is the async counterpart to embed_texts. It offloads the blocking ONNX inference work to a dedicated blocking thread pool via Tokio's spawn_blocking, keeping the async executor free.

Returns one embedding vector per input text in the same order.

Errors:

  • KreuzbergError.MissingDependency if ONNX Runtime is not installed
  • KreuzbergError.Embedding if the preset name is unknown, model download fails, or the blocking inference task panics

Signature:

float** kreuzberg_embed_texts_async(const char** texts, KreuzbergEmbeddingConfig config);

Parameters:

Name Type Required Description
texts const char** Yes Array of strings to embed (owned, sent to the blocking thread)
config KreuzbergEmbeddingConfig Yes Embedding configuration specifying model, batch size, and normalization

Returns: float**

Errors: Returns NULL on error.


kreuzberg_render_pdf_page_to_png()

Render a single PDF page to PNG bytes.

Returns raw PNG-encoded bytes for the specified page at the given DPI. Uses pdf_oxide with tiny-skia for pure-Rust rendering.

Errors:

Returns KreuzbergError.Parsing if the PDF cannot be opened, authenticated, or rendered, or if page_index is out of range.

Signature:

const uint8_t* kreuzberg_render_pdf_page_to_png(const uint8_t* pdf_bytes, uintptr_t page_index, int32_t dpi, const char* password);

Parameters:

Name Type Required Description
pdf_bytes const uint8_t* Yes Raw PDF file bytes
page_index uintptr_t Yes Zero-based page index
dpi int32_t No Resolution in dots per inch (default: 150)
password const char* No Optional password for encrypted PDFs

Returns: const uint8_t*

Errors: Returns NULL on error.


kreuzberg_detect_mime_type()

Detect the MIME type of a file at the given path.

Uses the file extension and optionally the file content to determine the MIME type. Set check_exists to true to verify the file exists before detection.

Signature:

const char* kreuzberg_detect_mime_type(const char* path, bool check_exists);

Parameters:

Name Type Required Description
path const char* Yes Path to the file
check_exists bool Yes Whether to verify that the file exists before detection

Returns: const char*

Errors: Returns NULL on error.


kreuzberg_embed_texts()

Embed a list of texts using the configured embedding model.

Returns a 2D vector where each inner vector is the embedding for the corresponding text.

Signature:

float** kreuzberg_embed_texts(const char** texts, KreuzbergEmbeddingConfig config);

Parameters:

Name Type Required Description
texts const char** Yes The texts
config KreuzbergEmbeddingConfig Yes The configuration options

Returns: float**

Errors: Returns NULL on error.


kreuzberg_get_embedding_preset()

Get an embedding preset by name.

Returns NULL if no preset with the given name exists. Returns an owned clone so the value is safe to pass across FFI boundaries.

Signature:

KreuzbergEmbeddingPreset* kreuzberg_get_embedding_preset(const char* name);

Parameters:

Name Type Required Description
name const char* Yes The name

Returns: KreuzbergEmbeddingPreset*


kreuzberg_list_embedding_presets()

List the names of all available embedding presets.

Returns owned Strings so the values are safe to pass across FFI boundaries.

Signature:

const char** kreuzberg_list_embedding_presets();

Returns: const char**


Types

KreuzbergAccelerationConfig

Hardware acceleration configuration for ONNX Runtime models.

Controls which execution provider (CPU, CoreML, CUDA, TensorRT) is used for inference in layout detection and embedding generation.

Field Type Default Description
provider KreuzbergExecutionProviderType KREUZBERG_KREUZBERG_AUTO Execution provider to use for ONNX inference.
device_id uint32_t GPU device ID (for CUDA/TensorRT). Ignored for CPU/CoreML/Auto.

KreuzbergArchiveEntry

A single file extracted from an archive.

When archives (ZIP, TAR, 7Z, GZIP) are extracted with recursive extraction enabled, each processable file produces its own full ExtractionResult.

Field Type Default Description
path const char* Archive-relative file path (e.g. "folder/document.pdf").
mime_type const char* Detected MIME type of the file.
result KreuzbergExtractionResult Full extraction result for this file.

KreuzbergArchiveMetadata

Archive (ZIP/TAR/7Z) metadata.

Extracted from compressed archive files containing file lists and size information.

Field Type Default Description
format const char* Archive format ("ZIP", "TAR", "7Z", etc.)
file_count uint32_t Total number of files in the archive
file_list const char** NULL List of file paths within the archive
total_size uint64_t Total uncompressed size in bytes
compressed_size uint64_t* NULL Compressed size in bytes (if available)

KreuzbergBBox

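Bounding box in original image coordinates: (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.

Field Type Default Description
x1 float X-coordinate of the top-left corner
y1 float Y-coordinate of the top-left corner
x2 float X-coordinate of the bottom-right corner
y2 float Y-coordinate of the bottom-right corner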

KreuzbergBatchBytesItem

Batch item for byte array extraction.

Used with batch_extract_bytes and batch_extract_bytes_sync to represent a single item in a batch extraction job.

Field Type Default Description
content const uint8_t* The content bytes to extract from
mime_type const char* MIME type of the content (e.g., "application/pdf", "text/html")
config KreuzbergFileExtractionConfig* NULL Per-item configuration overrides (NULL uses batch-level defaults)

KreuzbergBatchFileItem

Batch item for file extraction.

Used with batch_extract_files and batch_extract_files_sync to represent a single file in a batch extraction job.

Field Type Default Description
path const char* Path to the file to extract from
config KreuzbergFileExtractionConfig* NULL Per-file configuration overrides (NULL uses batch-level defaults)

KreuzbergBibtexMetadata

BibTeX bibliography metadata.

Field Type Default Description
entry_count uintptr_t Number of entries in the bibliography.
citation_keys const char** NULL Citation keys
authors const char** NULL Authors
year_range KreuzbergYearRange* NULL Year range
entry_types void** NULL Entry types

KreuzbergBoundingBox

Bounding box coordinates for element positioning.

Field Type Default Description
x0 double Left x-coordinate
y0 double Bottom y-coordinate
x1 double Right x-coordinate
y1 double Top y-coordinate

KreuzbergChunk

A text chunk with optional embedding and metadata.

Chunks are created when chunking is enabled in ExtractionConfig. Each chunk contains the text content, optional embedding vector (if embedding generation is configured), and metadata about its position in the document.

Field Type Default Description
content const char* The text content of this chunk.
chunk_type KreuzbergChunkType /* serde(default) */ Semantic structural classification of this chunk. Assigned by the heuristic classifier based on content patterns and heading context. Defaults to ChunkType.Unknown when no rule matches.
embedding float** NULL Optional embedding vector for this chunk. Only populated when EmbeddingConfig is provided in chunking configuration. The dimensionality depends on the chosen embedding model.
metadata KreuzbergChunkMetadata Metadata about this chunk's position and properties.

KreuzbergChunkMetadata

Metadata about a chunk's position in the original document.

Field Type Default Description
byte_start uintptr_t Byte offset where this chunk starts in the original text (UTF-8 valid boundary).
byte_end uintptr_t Byte offset where this chunk ends in the original text (UTF-8 valid boundary).
token_count uintptr_t* NULL Number of tokens in this chunk (if available). This is calculated by the embedding model's tokenizer if embeddings are enabled.
chunk_index uintptr_t Zero-based index of this chunk in the document.
total_chunks uintptr_t Total number of chunks in the document.
first_page uint32_t* NULL First page number this chunk spans (1-indexed). Only populated when page tracking is enabled in extraction configuration.
last_page uint32_t* NULL Last page number this chunk spans (1-indexed, equal to first_page for single-page chunks). Only populated when page tracking is enabled in extraction configuration.
heading_context KreuzbergHeadingContext* /* serde(default) */ Heading context when using Markdown chunker. Contains the heading hierarchy this chunk falls under. Only populated when ChunkerType.Markdown is used.
image_indices uint32_t* /* serde(default) */ Indices into ExtractionResult.images for images on pages covered by this chunk. Contains zero-based indices into the top-level images collection for every image whose page_number falls within [first_page, last_page]. Empty when image extraction is disabled or the chunk spans no pages with images.
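
The image_indices rule above amounts to a page-range filter over the top-level image list. The sketch below shows that filter on a plain array of page numbers. The function name and the flattened input are hypothetical and are not the library's data layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: image_pages[i] is the 1-indexed page number of image i.
// Writes up to out_cap matching zero-based indices to out and returns how many
// images fall within [first_page, last_page] in total.
size_t collect_image_indices(const uint32_t *image_pages, size_t image_count,
                             uint32_t first_page, uint32_t last_page,
                             uint32_t *out, size_t out_cap) {
    assert(image_pages != NULL || image_count == 0);
    assert(first_page <= last_page);
    size_t n = 0;
    for (size_t i = 0; i < image_count; i++) {
        if (image_pages[i] >= first_page && image_pages[i] <= last_page) {
            if (out != NULL && n < out_cap) {
                out[n] = (uint32_t)i;
            }
            n++;
        }
    }
    return n;
}
```

For images on pages 1, 2, 2, and 5, a chunk spanning pages 2 to 3 gets indices 1 and 2, and a chunk spanning pages 6 to 9 gets an empty list.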

KreuzbergChunkingConfig

Chunking configuration.

Configures text chunking for document content, including chunk size, overlap, trimming behavior, and optional embeddings.

Start from the default constructor (kreuzberg_default(), listed under Methods below) when constructing this struct, so that future field additions do not break existing code.

Field Type Default Description
max_characters uintptr_t 1000 Maximum size per chunk (in units determined by sizing). When sizing is Characters (default), this is the max character count. When using token-based sizing, this is the max token count. Default: 1000
overlap uintptr_t 200 Overlap between chunks (in units determined by sizing). Default: 200
trim bool true Whether to trim whitespace from chunk boundaries. Default: true
chunker_type KreuzbergChunkerType KREUZBERG_KREUZBERG_TEXT Type of chunker to use (Text or Markdown). Default: Text
embedding KreuzbergEmbeddingConfig* NULL Optional embedding configuration for chunk embeddings.
preset const char** NULL Use a preset configuration (overrides individual settings if provided).
sizing KreuzbergChunkSizing KREUZBERG_KREUZBERG_CHARACTERS How to measure chunk size. Default: Characters (Unicode character count). Enable chunking-tiktoken or chunking-tokenizers features for token-based sizing.
prepend_heading_context bool false When true and chunker_type is Markdown, prepend the heading hierarchy path (e.g. "# Title > ## Section\n\n") to each chunk's content string. This is useful for RAG pipelines where each chunk needs self-contained context about its position in the document structure. Default: false
topic_threshold float* NULL Optional cosine similarity threshold for semantic topic boundary detection. Only used when chunker_type is Semantic and an EmbeddingConfig is provided. You almost never need to set this. When omitted, defaults to 0.75 which works well for most documents. Lower values detect more topic boundaries (more, smaller chunks); higher values detect fewer. Range: 0.0..=1.0.
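
To make the interaction of max_characters and overlap concrete, the sketch below plans fixed-size windows in which each chunk starts max_characters - overlap units after the previous one. It is a simplification under stated assumptions: the names are hypothetical, and the real chunkers may also respect word, sentence, or heading boundaries and apply trimming, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical half-open range [start, end) into the text.
typedef struct {
    size_t start;
    size_t end;
} ChunkSpan;

// Writes up to out_cap spans to out and returns the number of spans required.
size_t plan_chunks(size_t text_len, size_t max_characters, size_t overlap,
                   ChunkSpan *out, size_t out_cap) {
    // The overlap must be smaller than the chunk size or the window never advances.
    assert(max_characters > 0 && overlap < max_characters);
    size_t step = max_characters - overlap;
    size_t count = 0;
    for (size_t start = 0; start < text_len; start += step) {
        size_t end = start + max_characters;
        if (end > text_len) {
            end = text_len;
        }
        if (out != NULL && count < out_cap) {
            out[count].start = start;
            out[count].end = end;
        }
        count++;
        if (end == text_len) {
            break; // the last window already reaches the end of the text
        }
    }
    return count;
}
```

With the defaults (1000 and 200), a 2500-unit text yields three spans: [0, 1000), [800, 1800), and [1600, 2500).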

Methods

kreuzberg_default()

Signature:

KreuzbergChunkingConfig kreuzberg_default();

KreuzbergCitationMetadata

Citation file metadata (RIS, PubMed, EndNote).

Field Type Default Description
citation_count uintptr_t Number of citations
format const char** NULL Format
authors const char** NULL Authors
year_range KreuzbergYearRange* NULL Year range
dois const char** NULL DOIs
keywords const char** NULL Keywords

KreuzbergContentFilterConfig

Cross-extractor content filtering configuration.

Controls whether "furniture" content (headers, footers, page numbers, watermarks, repeating text) is included in or stripped from extraction results. Applies across all extractors (PDF, DOCX, RTF, ODT, HTML, etc.) with format-specific implementation.

When NULL on ExtractionConfig, each extractor uses its current default behavior unchanged.

Field Type Default Description
include_headers bool false Include running headers in extraction output. - PDF: Disables top-margin furniture stripping and prevents the layout model from treating PageHeader-classified regions as furniture. - DOCX: Includes document headers in text output. - RTF/ODT: Headers already included; this is a no-op when true. - HTML/EPUB: Keeps <header> element content. Default: false (headers are stripped or excluded).
include_footers bool false Include running footers in extraction output. - PDF: Disables bottom-margin furniture stripping and prevents the layout model from treating PageFooter-classified regions as furniture. - DOCX: Includes document footers in text output. - RTF/ODT: Footers already included; this is a no-op when true. - HTML/EPUB: Keeps <footer> element content. Default: false (footers are stripped or excluded).
strip_repeating_text bool true Enable the heuristic cross-page repeating text detector. When true (default), text that repeats verbatim across a supermajority of pages is classified as furniture and stripped. Disable this if brand names or repeated headings are being incorrectly removed by the heuristic. Note: when a layout-detection model is active, the model may independently classify page-header / page-footer regions as furniture on a per-page basis. To preserve those regions, set include_headers = true, include_footers = true, or both, in addition to disabling this flag. Primarily affects PDF extraction. Default: true.
include_watermarks bool false Include watermark text in extraction output. - PDF: Keeps watermark artifacts and arXiv identifiers. - Other formats: No effect currently. Default: false (watermarks are stripped).

Methods

kreuzberg_default()

Signature:

KreuzbergContentFilterConfig kreuzberg_default();

KreuzbergContributorRole

JATS contributor with role.

Field Type Default Description
name const char* The name
role const char** NULL Role

KreuzbergCoreProperties

Dublin Core metadata from docProps/core.xml.

Contains standard metadata fields defined by the Dublin Core standard and Office-specific extensions.

Field Type Default Description
title const char** NULL Document title
subject const char** NULL Document subject/topic
creator const char** NULL Document creator/author
keywords const char** NULL Keywords or tags
description const char** NULL Document description/abstract
last_modified_by const char** NULL User who last modified the document
revision const char** NULL Revision number
created const char** NULL Creation timestamp (ISO 8601)
modified const char** NULL Last modification timestamp (ISO 8601)
category const char** NULL Document category
content_status const char** NULL Content status (Draft, Final, etc.)
language const char** NULL Document language
identifier const char** NULL Unique identifier
version const char** NULL Document version
last_printed const char** NULL Last print timestamp (ISO 8601)

KreuzbergCsvMetadata

CSV/TSV file metadata.

Field Type Default Description
row_count uint32_t Number of rows
column_count uint32_t Number of columns
delimiter const char** NULL Delimiter
has_header bool Whether the file has a header row
column_types const char*** NULL Column types

KreuzbergDbfFieldInfo

dBASE field information.

Field Type Default Description
name const char* The name
field_type const char* Field type

KreuzbergDbfMetadata

dBASE (DBF) file metadata.

Field Type Default Description
record_count uintptr_t Number of records
field_count uintptr_t Number of fields
fields KreuzbergDbfFieldInfo* NULL Fields

KreuzbergDetectResponse

MIME type detection response.

Field Type Default Description
mime_type const char* Detected MIME type
filename const char** NULL Original filename (if provided)

KreuzbergDetectionResult

Page-level detection result containing all detections and page metadata.

Field Type Default Description
page_width uint32_t Page width
page_height uint32_t Page height
detections KreuzbergLayoutDetection* Detections

KreuzbergDjotContent

Comprehensive Djot document structure with semantic preservation.

This type captures the full richness of Djot markup, including:

  • Block-level structures (headings, lists, blockquotes, code blocks, etc.)
  • Inline formatting (emphasis, strong, highlight, subscript, superscript, etc.)
  • Attributes (classes, IDs, key-value pairs)
  • Links, images, footnotes
  • Math expressions (inline and display)
  • Tables with full structure

Available when the djot feature is enabled.

Field Type Default Description
plain_text const char* Plain text representation for backwards compatibility
blocks KreuzbergFormattedBlock* Structured block-level content
metadata KreuzbergMetadata Metadata from YAML frontmatter
tables KreuzbergTable* Extracted tables as structured data
images KreuzbergDjotImage* Extracted images with metadata
links KreuzbergDjotLink* Extracted links with URLs
footnotes KreuzbergFootnote* Footnote definitions
attributes const char** /* serde(default) */ Attributes mapped by element identifier (if present)

KreuzbergDjotImage

Image element in Djot.

Field Type Default Description
src const char* Image source URL or path
alt const char* Alternative text
title const char** NULL Optional title
attributes const char** NULL Element attributes

KreuzbergDjotLink

Link element in Djot.

Field Type Default Description
url const char* Link URL
text const char* Link text content
title const char** NULL Optional title
attributes const char** NULL Element attributes

KreuzbergDocumentExtractor

Trait for document extractor plugins.

Implement this trait to add support for new document formats or to override built-in extraction behavior with custom logic.

Return Type

Extractors return InternalDocument, a flat intermediate representation. The pipeline converts this into the public ExtractionResult via the derivation step.

Priority System

When multiple extractors support the same MIME type, the registry selects the extractor with the highest priority value. Use this to:

  • Override built-in extractors (priority > 50)
  • Provide fallback extractors (priority < 50)
  • Implement specialized extractors for specific use cases

Default priority is 50.
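
The selection rule above can be sketched as a linear scan that keeps the highest-priority match. ExtractorEntry and select_extractor are hypothetical names used for illustration, not the registry's actual types. Ties keep the earliest entry here, which is an assumption because the text does not say how the registry breaks ties.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical registry entry; not a Kreuzberg type.
typedef struct {
    const char *name;
    const char *mime_type;
    int priority; // built-in extractors use the default of 50
} ExtractorEntry;

// Returns the matching entry with the highest priority, or NULL if none matches.
const ExtractorEntry *select_extractor(const ExtractorEntry *entries, size_t count,
                                       const char *mime_type) {
    assert(mime_type != NULL);
    assert(entries != NULL || count == 0);
    const ExtractorEntry *best = NULL;
    for (size_t i = 0; i < count; i++) {
        if (strcmp(entries[i].mime_type, mime_type) != 0) {
            continue;
        }
        if (best == NULL || entries[i].priority > best->priority) {
            best = &entries[i];
        }
    }
    return best;
}
```

Registering a custom PDF extractor with priority 60 next to a built-in one at 50 makes the custom extractor win, while one registered at 10 only takes effect if nothing higher is present.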

Thread Safety

Extractors must be thread-safe (Send + Sync) to support concurrent extraction.

Methods

kreuzberg_extract_bytes()

Extract content from a byte array.

This is the core extraction method that processes in-memory document data.

Returns:

An InternalDocument containing the extracted elements, metadata, and tables. The pipeline will convert this into the public ExtractionResult.

Errors:

  • KreuzbergError.Parsing - Document parsing failed
  • KreuzbergError.Validation - Invalid document structure
  • KreuzbergError.Io - I/O errors (these always bubble up)
  • KreuzbergError.MissingDependency - Required dependency not available

Signature:

KreuzbergInternalDocument kreuzberg_extract_bytes(const uint8_t* content, const char* mime_type, KreuzbergExtractionConfig config);

kreuzberg_extract_file()

Extract content from a file.

Default implementation reads the file and calls extract_bytes. Override for custom file handling, streaming, or memory optimizations.

Returns:

An InternalDocument containing the extracted elements, metadata, and tables.

Errors:

Same as extract_bytes, plus file I/O errors.

Signature:

KreuzbergInternalDocument kreuzberg_extract_file(const char* path, const char* mime_type, KreuzbergExtractionConfig config);

kreuzberg_supported_mime_types()

Get the list of MIME types supported by this extractor.

Can include exact MIME types and prefix patterns:

  • Exact: "application/pdf", "text/plain"
  • Prefix: "image/*" (matches any image type)
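
The two pattern kinds above can be matched with a small helper. This is a minimal sketch: mime_pattern_matches is a hypothetical name, and the comparison is case-sensitive, which is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical matcher: a pattern with a wildcard subtype matches every MIME
// type that has the same top-level type; any other pattern must match exactly.
bool mime_pattern_matches(const char *pattern, const char *mime_type) {
    assert(pattern != NULL && mime_type != NULL);
    size_t len = strlen(pattern);
    if (len >= 2 && strcmp(pattern + len - 2, "/*") == 0) {
        // Compare up to and including the slash, so a longer top-level type
        // that merely starts with the same letters is rejected.
        return strncmp(pattern, mime_type, len - 1) == 0;
    }
    return strcmp(pattern, mime_type) == 0;
}
```

Including the slash in the prefix comparison is the important detail: without it, an image wildcard would also accept a made-up type such as imagex/png.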

Returns:

A slice of MIME type strings.

Signature:

const char** kreuzberg_supported_mime_types();

kreuzberg_priority()

Get the priority of this extractor.

Higher priority extractors are preferred when multiple extractors support the same MIME type.

Priority Guidelines

  • 0-25: Fallback/low-quality extractors
  • 26-49: Alternative extractors
  • 50: Default priority (built-in extractors)
  • 51-75: Premium/enhanced extractors
  • 76-100: Specialized/high-priority extractors

Returns:

Priority value (default: 50)

Signature:

int32_t kreuzberg_priority();
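
To illustrate how priority decides between extractors that support the same MIME type, here is a hedged sketch of the selection step. ExtractorCandidate and select_extractor are hypothetical and exist only for this example; the real registry is not shown in this reference. Keeping the earliest candidate on a tie is an assumption, because the text above does not say how ties are broken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical descriptor used only for this sketch.
typedef struct {
    const char *name;
    int32_t priority;   // value the extractor reports as its priority
    int supports_mime;  // nonzero if the extractor matched the MIME type
} ExtractorCandidate;

// Returns the index of the highest-priority candidate that supports the
// MIME type, or -1 if none does. On a tie the earliest candidate wins
// (assumption).
ptrdiff_t select_extractor(const ExtractorCandidate *candidates, size_t count) {
    assert(candidates != NULL || count == 0);
    ptrdiff_t best = -1;
    for (size_t i = 0; i < count; i++) {
        if (!candidates[i].supports_mime) {
            continue;
        }
        if (best < 0 || candidates[i].priority > candidates[best].priority) {
            best = (ptrdiff_t)i;
        }
    }
    return best;
}
```

Under these assumptions, a custom extractor registered at priority 60 is chosen over a built-in one at the default priority of 50, while a fallback at priority 10 is used only when nothing else matches.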

kreuzberg_can_handle()

Optional: Check if this extractor can handle a specific file.

Allows for more sophisticated detection beyond MIME types. Defaults to true (rely on MIME type matching).

Returns:

true if the extractor can handle this file, false otherwise.

Signature:

bool kreuzberg_can_handle(const char* path, const char* mime_type);

kreuzberg_as_sync_extractor()

Attempt to get a reference to this extractor as a SyncExtractor.

Returns NULL if the extractor doesn't support synchronous extraction. This is used for WASM and other sync-only environments.

Signature:

KreuzbergSyncExtractor* kreuzberg_as_sync_extractor();

KreuzbergDocumentNode

A single node in the document tree.

Each node has a deterministic id, typed content, optional parent/children for tree structure, and metadata like page number, bounding box, and content layer.

Field Type Default Description
id const char* Deterministic identifier (hash of content + position).
content KreuzbergNodeContent Node content — tagged enum, type-specific data only.
parent uint32_t* NULL Parent node index (NULL = root-level node).
children uint32_t* /* serde(default) */ Child node indices in reading order.
content_layer KreuzbergContentLayer /* serde(default) */ Content layer classification.
page uint32_t* NULL Page number where this node starts (1-indexed).
page_end uint32_t* NULL Page number where this node ends (for multi-page tables/sections).
bbox KreuzbergBoundingBox* NULL Bounding box in document coordinates.
annotations KreuzbergTextAnnotation* /* serde(default) */ Inline annotations (formatting, links) on this node's text content. Only meaningful for text-carrying nodes; empty for containers.
attributes void** NULL Format-specific key-value attributes. Extensible bag for miscellaneous data without a dedicated typed field: CSS classes, LaTeX environment names, Excel cell formulas, slide layout names, etc.

KreuzbergDocumentRelationship

A resolved relationship between two nodes in the document tree.

Field Type Default Description
source uint32_t Source node index (the referencing node).
target uint32_t Target node index (the referenced node).
kind KreuzbergRelationshipKind Semantic kind of the relationship.

KreuzbergDocumentStructure

Top-level structured document representation.

A flat array of nodes with index-based parent/child references forming a tree. Root-level nodes have parent set to NULL. Use body_roots() and furniture_roots() to iterate over top-level content by layer.

Validation

Call validate() after construction to verify all node indices are in bounds and parent-child relationships are bidirectionally consistent.

Field Type Default Description
nodes KreuzbergDocumentNode* NULL All nodes in document/reading order.
source_format const char** NULL Origin format identifier (e.g. "docx", "pptx", "html", "pdf"). Allows renderers to apply format-aware heuristics when converting the document tree to output formats.
relationships KreuzbergDocumentRelationship* NULL Resolved relationships between nodes (footnote refs, citations, anchor links, etc.). Populated during derivation from the internal document representation. Empty when no relationships are detected.
node_types const char** NULL Sorted, deduplicated list of node type names present in this document. Each value is the snake_case node_type tag of the corresponding NodeContent variant (e.g. "paragraph", "heading", "table", …). Computed from nodes via DocumentStructure.finalize_node_types. Empty until that method is called (internal construction paths call it at the end of derivation).
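
The two checks named under Validation (indices in bounds, parent-child links consistent in both directions) can be sketched over a simplified node type. SketchNode and sketch_validate are hypothetical stand-ins, not the library's validate(); the explicit child_count field is an assumption, since the tables above do not show how array lengths are exposed through the C API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Simplified stand-in for KreuzbergDocumentNode, for illustration only.
typedef struct {
    const uint32_t *parent;    // NULL means root-level node
    const uint32_t *children;  // child indices in reading order
    size_t child_count;        // assumption: length carried alongside
} SketchNode;

// Returns true when every index is in bounds and each parent/child link is
// recorded on both sides.
bool sketch_validate(const SketchNode *nodes, size_t node_count) {
    assert(nodes != NULL || node_count == 0);
    for (size_t i = 0; i < node_count; i++) {
        const SketchNode *n = &nodes[i];
        if (n->parent != NULL) {
            if (*n->parent >= node_count) {
                return false;
            }
            // The parent must list this node among its children.
            const SketchNode *p = &nodes[*n->parent];
            bool listed = false;
            for (size_t c = 0; c < p->child_count; c++) {
                if (p->children[c] == i) {
                    listed = true;
                    break;
                }
            }
            if (!listed) {
                return false;
            }
        }
        for (size_t c = 0; c < n->child_count; c++) {
            uint32_t child = n->children[c];
            if (child >= node_count) {
                return false;
            }
            // Each child must point back at this node.
            const uint32_t *back = nodes[child].parent;
            if (back == NULL || *back != i) {
                return false;
            }
        }
    }
    return true;
}
```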

Methods

kreuzberg_finalize_node_types()

Compute and populate the node_types field from the current nodes.

Call this after all nodes have been added to the structure. Internal construction paths (builder, derivation) call this automatically.

Signature:

void kreuzberg_finalize_node_types();
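
The node_types field is described as a sorted, deduplicated list. As a rough illustration of that post-condition (not the library's implementation), the sketch below sorts an array of type names in place and drops adjacent duplicates. compare_names and sort_and_dedupe are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// qsort comparator for an array of C strings.
int compare_names(const void *a, const void *b) {
    const char *const *lhs = a;
    const char *const *rhs = b;
    return strcmp(*lhs, *rhs);
}

// Sorts `names` in place and removes duplicates. Returns the number of
// unique entries left at the front of the array. The strings themselves are
// neither copied nor freed.
size_t sort_and_dedupe(const char **names, size_t count) {
    assert(names != NULL || count == 0);
    if (count == 0) {
        return 0;
    }
    qsort(names, count, sizeof *names, compare_names);
    size_t unique = 1;
    for (size_t i = 1; i < count; i++) {
        if (strcmp(names[i], names[unique - 1]) != 0) {
            names[unique++] = names[i];
        }
    }
    return unique;
}
```

Given the names "table", "paragraph", "heading", "paragraph", "table", the first three entries after the call are "heading", "paragraph", "table".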

kreuzberg_is_empty()

Check if the document structure is empty.

Signature:

bool kreuzberg_is_empty();

kreuzberg_default()

Signature:

KreuzbergDocumentStructure kreuzberg_default();

KreuzbergDocxAppProperties

Application properties from docProps/app.xml for DOCX.

Contains Word-specific document statistics and metadata.

Field Type Default Description
application const char** NULL Application name (e.g., "Microsoft Office Word")
app_version const char** NULL Application version
template const char** NULL Template filename
total_time int32_t* NULL Total editing time in minutes
pages int32_t* NULL Number of pages
words int32_t* NULL Number of words
characters int32_t* NULL Number of characters (excluding spaces)
characters_with_spaces int32_t* NULL Number of characters (including spaces)
lines int32_t* NULL Number of lines
paragraphs int32_t* NULL Number of paragraphs
company const char** NULL Company name
doc_security int32_t* NULL Document security level
scale_crop bool* NULL Scale crop flag
links_up_to_date bool* NULL Links up to date flag
shared_doc bool* NULL Shared document flag
hyperlinks_changed bool* NULL Hyperlinks changed flag

KreuzbergDocxMetadata

Word document metadata.

Extracted from DOCX files using shared Office Open XML metadata extraction. Integrates with office_metadata module for core/app/custom properties.

Field Type Default Description
core_properties KreuzbergCoreProperties* NULL Core properties from docProps/core.xml (Dublin Core metadata). Contains title, creator, subject, keywords, dates, etc. Shared format across DOCX/PPTX/XLSX documents.
app_properties KreuzbergDocxAppProperties* NULL Application properties from docProps/app.xml (Word-specific statistics). Contains word count, page count, paragraph count, editing time, etc. DOCX-specific variant of Office application properties.
custom_properties void** NULL Custom properties from docProps/custom.xml (user-defined properties). Contains key-value pairs defined by users or applications. Values can be strings, numbers, booleans, or dates.

KreuzbergElement

Semantic element extracted from document.

Represents a logical unit of content with semantic classification, unique identifier, and metadata for tracking origin and position.

Field Type Default Description
element_id const char* Unique element identifier
element_type KreuzbergElementType Semantic type of this element
text const char* Text content of the element
metadata KreuzbergElementMetadata Metadata about the element

KreuzbergElementMetadata

Metadata for a semantic element.

Field Type Default Description
page_number uint32_t* NULL Page number (1-indexed)
filename const char** NULL Source filename or document name
coordinates KreuzbergBoundingBox* NULL Bounding box coordinates if available
element_index uintptr_t* NULL Position index in the element sequence
additional void* Additional custom metadata

KreuzbergEmailAttachment

Email attachment representation.

Contains metadata and optionally the content of an email attachment.

Field Type Default Description
name const char** NULL Attachment name (from Content-Disposition header)
filename const char** NULL Filename of the attachment
mime_type const char** NULL MIME type of the attachment
size uintptr_t* NULL Size in bytes
is_image bool Whether this attachment is an image
data const uint8_t** NULL Attachment data (if extracted). Uses bytes.Bytes for cheap cloning of large buffers.

KreuzbergEmailConfig

Configuration for email extraction.

Field Type Default Description
msg_fallback_codepage uint32_t* NULL Windows codepage number to use when an MSG file contains no codepage property. Defaults to NULL, which falls back to windows-1252. If an unrecognized or invalid codepage number is supplied (including 0), extraction silently falls back to windows-1252, the same as when the MSG file itself contains an unrecognized codepage. No error or warning is emitted, so verify the output when supplying unusual values. Common values: 1250 (Central European: Polish, Czech, Hungarian, etc.); 1251 (Cyrillic: Russian, Ukrainian, Bulgarian, etc.); 1252 (Western European, default); 1253 (Greek); 1254 (Turkish); 1255 (Hebrew); 1256 (Arabic); 932 (Japanese, Shift-JIS); 936 (Simplified Chinese, GBK).
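
The silent fallback described for msg_fallback_codepage can be pictured as follows. This is a sketch under stated assumptions: resolve_msg_codepage and is_known_codepage are hypothetical, and the list of known codepages contains only the common values named in the table, whereas the set the library actually recognizes is probably larger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FALLBACK_CODEPAGE 1252u

// Only the common values listed in the table; treat this set as an
// assumption, not as the library's full list.
static const uint32_t KNOWN_CODEPAGES[] = {
    1250, 1251, 1252, 1253, 1254, 1255, 1256, 932, 936
};

bool is_known_codepage(uint32_t codepage) {
    size_t count = sizeof KNOWN_CODEPAGES / sizeof KNOWN_CODEPAGES[0];
    for (size_t i = 0; i < count; i++) {
        if (KNOWN_CODEPAGES[i] == codepage) {
            return true;
        }
    }
    return false;
}

// NULL or an unrecognized value (including 0) resolves to windows-1252
// without reporting an error, mirroring the documented behavior.
uint32_t resolve_msg_codepage(const uint32_t *msg_fallback_codepage) {
    if (msg_fallback_codepage == NULL) {
        return FALLBACK_CODEPAGE;
    }
    return is_known_codepage(*msg_fallback_codepage)
               ? *msg_fallback_codepage
               : FALLBACK_CODEPAGE;
}
```

Because the fallback is silent, a caller who wants to catch typos in a configured codepage has to compare the configured value with the resolved one.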

KreuzbergEmailExtractionResult

Email extraction result.

Complete representation of an extracted email message (.eml or .msg) including headers, body content, and attachments.

Field Type Default Description
subject const char** NULL Email subject line
from_email const char** NULL Sender email address
to_emails const char** Primary recipient email addresses
cc_emails const char** CC recipient email addresses
bcc_emails const char** BCC recipient email addresses
date const char** NULL Email date/timestamp
message_id const char** NULL Message-ID header value
plain_text const char** NULL Plain text version of the email body
html_content const char** NULL HTML version of the email body
content const char* Cleaned/processed text content. Aliased as cleaned_text for back-compat.
attachments KreuzbergEmailAttachment* List of email attachments
metadata void* Additional email headers and metadata

KreuzbergEmailMetadata

Email metadata extracted from .eml and .msg files.

Includes sender/recipient information, message ID, and attachment list.

Field Type Default Description
from_email const char** NULL Sender's email address
from_name const char** NULL Sender's display name
to_emails const char** NULL Primary recipients
cc_emails const char** NULL CC recipients
bcc_emails const char** NULL BCC recipients
message_id const char** NULL Message-ID header value
attachments const char** NULL List of attachment filenames

KreuzbergEmbeddedFile

Embedded file descriptor extracted from the PDF name tree.

Field Type Default Description
name const char* The filename as stored in the PDF name tree.
data const uint8_t* Raw file bytes from the embedded stream.
mime_type const char** NULL MIME type if specified in the filespec, otherwise NULL.

KreuzbergEmbeddingBackend

Trait for in-process embedding backend plugins.

Async to match the convention used by OcrBackend, DocumentExtractor, and PostProcessor. Host-language bridges (PyO3, napi-rs, Rustler, extendr, magnus, ext-php-rs, C FFI, etc.) wrap their synchronous host callables in spawn_blocking or the equivalent to satisfy the async signature.

Thread safety

Backends must be Send + Sync + 'static. They are stored in Arc<dyn EmbeddingBackend> and called concurrently from kreuzberg's chunking pipeline. If the backend's underlying model isn't thread-safe, the backend itself must serialize access internally (e.g. via Mutex<Inner>).

Contract

  • embed(texts) MUST return exactly texts.len() vectors, each of length self.dimensions(). The dispatcher in embed_texts validates this before returning to downstream consumers; a non-conforming backend surfaces as a KreuzbergError.Validation, not a panic.

  • embed may be called from any thread. Its future must be Send (enforced by async_trait when #[async_trait] is used on non-WASM targets).

  • dimensions() is called exactly once at registration, immediately after initialize() succeeds. The returned value is cached by the registry and used for all subsequent shape validation. Lazy-loading implementations can defer model loading into initialize() and report the real dimension afterwards. Later mutations of the backend's reported dimension are not observed by kreuzberg — implementations that need to change dimension must unregister and re-register.

  • shutdown() (inherited from Plugin) may be invoked concurrently with an in-flight embed() call. Implementations must tolerate this — e.g. by letting in-flight calls finish using resources held via the Arc<dyn EmbeddingBackend> reference, and only releasing shared state that isn't needed by embed.
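
The shape requirement in the first contract item (one vector per input text, each of length dimensions()) amounts to a check like the one below. SketchEmbedOutput and embed_output_conforms are hypothetical; the real C signature returns float** and does not show how lengths are reported, so the explicit length fields are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical result shape used only for this sketch.
typedef struct {
    const float *const *vectors;  // one vector per input text
    const size_t *vector_lens;    // length of each vector (assumption)
    size_t vector_count;
} SketchEmbedOutput;

// Returns true when the output has exactly `text_count` vectors and each
// vector has `dimensions` elements.
bool embed_output_conforms(const SketchEmbedOutput *out,
                           size_t text_count,
                           size_t dimensions) {
    assert(out != NULL);
    if (dimensions == 0) {
        return false;  // the dimension must be greater than zero
    }
    if (out->vector_count != text_count) {
        return false;
    }
    for (size_t i = 0; i < out->vector_count; i++) {
        if (out->vectors[i] == NULL || out->vector_lens[i] != dimensions) {
            return false;
        }
    }
    return true;
}
```

A backend that fails a check of this kind is reported through KreuzbergError.Validation rather than a panic, as stated above.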

Runtime

The synchronous embed_texts entry uses tokio::task::block_in_place to await the trait's async embed, which requires a multi-thread tokio runtime. Callers running inside a current_thread runtime (e.g. #[tokio::test] without flavor = "multi_thread", or tokio::runtime::Builder::new_current_thread()) must use embed_texts_async instead, which awaits directly without block_in_place.

Methods

kreuzberg_dimensions()

Embedding vector dimension. Must be > 0 and must match the length of every vector returned by embed.

Signature:

uintptr_t kreuzberg_dimensions();

kreuzberg_embed()

Embed a batch of texts, returning one vector per input in order.

Errors:

Implementations should return KreuzbergError.Plugin for backend-specific failures. The dispatcher layers its own validation (length, per-vector dimension) on top.

Signature:

float** kreuzberg_embed(const char** texts);

KreuzbergEmbeddingConfig

Embedding configuration for text chunks.

Configures embedding generation using ONNX models via the vendored embedding engine. Requires the embeddings feature to be enabled.

Field Type Default Description
model KreuzbergEmbeddingModelType KREUZBERG_KREUZBERG_PRESET The embedding model to use (defaults to "balanced" preset if not specified)
normalize bool true Whether to normalize embedding vectors (recommended for cosine similarity)
batch_size uintptr_t 32 Batch size for embedding generation
show_download_progress bool false Show model download progress
cache_dir const char** NULL Custom cache directory for model files. Defaults to ~/.cache/kreuzberg/embeddings/ if not specified. Allows full customization of model download location.
acceleration KreuzbergAccelerationConfig* NULL Hardware acceleration for the embedding ONNX model. When set, controls which execution provider (CPU, CUDA, CoreML, TensorRT) is used for inference. Defaults to NULL (auto-select per platform).
max_embed_duration_secs uint64_t* NULL Maximum wall-clock duration (in seconds) for a single embed() call when using EmbeddingModelType.Plugin. Applies only to the in-process plugin path, where it protects against hung host-language backends (e.g. a Python callback deadlocked on the GIL or a model stuck on CUDA OOM retries). On timeout, the dispatcher returns KreuzbergError.Plugin instead of blocking forever. NULL disables the timeout. The default (60 seconds) is conservative for common in-process inference; increase for large batches on slow hardware.

Methods

kreuzberg_default()

Signature:

KreuzbergEmbeddingConfig kreuzberg_default();

KreuzbergEmbeddingPreset

Preset configurations for common RAG use cases.

Each preset combines chunk size, overlap, and embedding model to provide an optimized configuration for specific scenarios.

All string fields are owned String for FFI compatibility — instances are safe to clone and pass across language boundaries.

Field Type Default Description
name const char* The name
chunk_size uintptr_t Chunk size
overlap uintptr_t Overlap
model_repo const char* HuggingFace repository name for the model.
pooling const char* Pooling strategy: "cls" or "mean".
model_file const char* Path to the ONNX model file within the repo.
dimensions uintptr_t Dimensions
description const char* Human-readable description

KreuzbergEpubMetadata

EPUB metadata (Dublin Core extensions).

Field Type Default Description
coverage const char** NULL Coverage
dc_format const char** NULL Dc format
relation const char** NULL Relation
source const char** NULL Source
dc_type const char** NULL Dc type
cover_image const char** NULL Cover image

KreuzbergErrorMetadata

Error metadata (for batch operations).

Field Type Default Description
error_type const char* Error type
message const char* Message

KreuzbergExcelMetadata

Excel/spreadsheet format metadata.

Identifies the document as a spreadsheet source via the FormatMetadata.Excel discriminant. Sheet count and sheet names are stored inside this struct.

Field Type Default Description
sheet_count uint32_t* NULL Number of sheets in the workbook.
sheet_names const char*** NULL Names of all sheets in the workbook.

KreuzbergExcelSheet

Single Excel worksheet.

Represents one sheet from an Excel workbook with its content converted to Markdown format and dimensional statistics.

Field Type Default Description
name const char* Sheet name as it appears in Excel
markdown const char* Sheet content converted to Markdown tables
row_count uintptr_t Number of rows
col_count uintptr_t Number of columns
cell_count uintptr_t Total number of non-empty cells
table_cells const char**** NULL Pre-extracted table cells (2D vector of cell values). Populated during markdown generation to avoid re-parsing markdown. NULL for empty sheets.

KreuzbergExcelWorkbook

Excel workbook representation.

Contains all sheets from an Excel file (.xlsx, .xls, etc.) with extracted content and metadata.

Field Type Default Description
sheets KreuzbergExcelSheet* All sheets in the workbook
metadata void* Workbook-level metadata (author, creation date, etc.)

KreuzbergExtractedImage

Extracted image from a document.

Contains raw image data, metadata, and optional nested OCR results. Raw bytes allow cross-language compatibility - users can convert to PIL.Image (Python), Sharp (Node.js), or other formats as needed.

Field Type Default Description
data const uint8_t* Raw image data (PNG, JPEG, WebP, etc. bytes). Uses bytes.Bytes for cheap cloning of large buffers.
format const char* Image format (e.g., "jpeg", "png", "webp"). Uses Cow<'static, str> to avoid allocation for static literals.
image_index uint32_t Zero-indexed position of this image in the document/page
page_number uint32_t* NULL Page/slide number where image was found (1-indexed)
width uint32_t* NULL Image width in pixels
height uint32_t* NULL Image height in pixels
colorspace const char** NULL Colorspace information (e.g., "RGB", "CMYK", "Gray")
bits_per_component uint32_t* NULL Bits per color component (e.g., 8, 16)
is_mask bool /* serde(default) */ Whether this image is a mask image
description const char** NULL Optional description of the image
ocr_result KreuzbergExtractionResult* NULL Nested OCR extraction result (if image was OCRed). When OCR is performed on this image, the result is embedded here rather than in a separate collection, making the relationship explicit.
bounding_box KreuzbergBoundingBox* /* serde(default) */ Bounding box of the image on the page (PDF coordinates: x0=left, y0=bottom, x1=right, y1=top). Only populated for PDF-extracted images when position data is available from the PDF extractor.
source_path const char** /* serde(default) */ Original source path of the image within the document archive (e.g., "media/image1.png" in DOCX). Used for rendering image references when the binary data is not extracted.
image_kind KreuzbergImageKind* /* serde(default) */ Heuristic classification of what this image likely depicts. NULL if classification was disabled or inconclusive.
kind_confidence float* /* serde(default) */ Confidence score for image_kind, in the range 0.0 to 1.0.
cluster_id uint32_t* /* serde(default) */ Identifier shared across images that form a single logical figure (e.g. all raster tiles of one technical drawing). NULL for singletons.
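
The bounding_box description uses PDF coordinates, where the origin is at the bottom-left and y grows upward. The sketch below shows how width and height follow from that convention, and how a box could be flipped to a top-left origin for raster image APIs. SketchBBox, its field names, and the use of double are assumptions based on the description above; the page height must be supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical stand-in for KreuzbergBoundingBox.
typedef struct {
    double x0;  // left
    double y0;  // bottom (PDF coordinates)
    double x1;  // right
    double y1;  // top (PDF coordinates)
} SketchBBox;

bool bbox_is_valid(const SketchBBox *b) {
    assert(b != NULL);
    return b->x1 >= b->x0 && b->y1 >= b->y0;
}

double bbox_width(const SketchBBox *b) {
    assert(b != NULL);
    return b->x1 - b->x0;
}

double bbox_height(const SketchBBox *b) {
    assert(b != NULL);
    return b->y1 - b->y0;
}

// Converts to a top-left origin where y grows downward. In the result, y0 is
// the distance of the top edge from the top of the page and y1 is the
// distance of the bottom edge.
SketchBBox bbox_to_top_left(const SketchBBox *b, double page_height) {
    assert(b != NULL);
    SketchBBox out = { b->x0, page_height - b->y1, b->x1, page_height - b->y0 };
    return out;
}
```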

KreuzbergExtractedImageMetadata

Image metadata extracted from an image file.

Field Type Default Description
width uint32_t Image width in pixels
height uint32_t Image height in pixels
format const char* Image format (e.g., "PNG", "JPEG")
exif_data void* EXIF data if available

KreuzbergExtractionConfig

Main extraction configuration.

This struct contains all configuration options for the extraction process. It can be loaded from TOML, YAML, or JSON files, or created programmatically.

Field Type Default Description
use_cache bool true Enable caching of extraction results
enable_quality_processing bool true Enable quality post-processing
ocr KreuzbergOcrConfig* NULL OCR configuration (NULL = OCR disabled)
force_ocr bool false Force OCR even for searchable PDFs
force_ocr_pages uint32_t** NULL Force OCR on specific pages only (1-indexed page numbers, must be >= 1). When set, only the listed pages are OCR'd regardless of text layer quality. Unlisted pages use native text extraction. Ignored when force_ocr is true. Only applies to PDF documents. Duplicates are automatically deduplicated. An ocr config is recommended for backend/language selection; defaults are used if absent.
disable_ocr bool false Disable OCR entirely, even for images. When true, OCR is skipped for all document types. Images return metadata only (dimensions, format, EXIF) without text extraction. PDFs use only native text extraction without OCR fallback. Cannot be true simultaneously with force_ocr. Added in v4.7.0.
chunking KreuzbergChunkingConfig* NULL Text chunking configuration (NULL = chunking disabled)
content_filter KreuzbergContentFilterConfig* NULL Content filtering configuration (NULL = use extractor defaults). Controls whether document "furniture" (headers, footers, watermarks, repeating text) is included in or stripped from extraction results. See ContentFilterConfig for per-field documentation.
images KreuzbergImageExtractionConfig* NULL Image extraction configuration (NULL = no image extraction)
pdf_options KreuzbergPdfConfig* NULL PDF-specific options (NULL = use defaults)
token_reduction KreuzbergTokenReductionOptions* NULL Token reduction configuration (NULL = no token reduction)
language_detection KreuzbergLanguageDetectionConfig* NULL Language detection configuration (NULL = no language detection)
pages KreuzbergPageConfig* NULL Page extraction configuration (NULL = no page tracking)
keywords KreuzbergKeywordConfig* NULL Keyword extraction configuration (NULL = no keyword extraction)
postprocessor KreuzbergPostProcessorConfig* NULL Post-processor configuration (NULL = use defaults)
html_options const char** NULL HTML to Markdown conversion options (NULL = use defaults). Configure how HTML documents are converted to Markdown, including heading styles, list formatting, code block styles, and preprocessing options.
html_output KreuzbergHtmlOutputConfig* NULL Styled HTML output configuration. When set alongside output_format = OutputFormat.Html, the extraction pipeline uses StyledHtmlRenderer which emits stable kb-* CSS class hooks on every structural element and optionally embeds theme CSS or user-supplied CSS in a <style> block. When NULL, the existing plain comrak-based HTML renderer is used.
extraction_timeout_secs uint64_t* NULL Default per-file timeout in seconds for batch extraction. When set, each file in a batch will be canceled after this duration unless overridden by FileExtractionConfig.timeout_secs. NULL means no timeout (unbounded extraction time).
max_concurrent_extractions uintptr_t* NULL Maximum concurrent extractions in batch operations. Limits parallelism to prevent resource exhaustion when processing large batches. When NULL, defaults to (num_cpus × 1.5).ceil().
result_format KreuzbergResultFormat KREUZBERG_KREUZBERG_UNIFIED Result structure format. Controls whether results are returned in unified format (default) with all content in the content field, or element-based format with semantic elements (for Unstructured-compatible output).
security_limits KreuzbergSecurityLimits* NULL Security limits for archive extraction. Controls maximum archive size, compression ratio, file count, and other security thresholds to prevent decompression bomb attacks. Also caps nesting depth, iteration count, entity / token length, total content size, and table cell count for every extraction path that ingests user-controlled bytes. When NULL, default limits are used.
output_format KreuzbergOutputFormat KREUZBERG_KREUZBERG_PLAIN Content text format (default: Plain). Controls the format of the extracted content: Plain (raw extracted text, default); Markdown (Markdown formatted output); Djot (Djot markup format, requires the djot feature); Html (HTML formatted output). When set to a structured format, extraction results will include formatted output. The formatted_content field may be populated when format conversion is applied.
layout KreuzbergLayoutDetectionConfig* NULL Layout detection configuration (NULL = layout detection disabled). When set, PDF pages and images are analyzed for document structure (headings, code, formulas, tables, figures, etc.) using RT-DETR models via ONNX Runtime. For PDFs, layout hints override paragraph classification in the markdown pipeline. For images, per-region OCR is performed with markdown formatting based on detected layout classes. Requires the layout-detection feature to run inference; the field is present whenever the layout-types feature is active (which includes layout-detection as well as the no-ORT target groups).
use_layout_for_markdown bool false Run layout detection on the non-OCR PDF markdown path. When true and layout is non-NULL, layout regions inform heading, table, list, and figure detection in the structure pipeline that would otherwise rely on font-clustering heuristics alone. Significantly improves SF1 (structural F1) at the cost of inference latency (~150-300ms/page CPU, ~20-50ms/page GPU). Default: false. Requires the layout-detection feature.
include_document_structure bool false Enable structured document tree output. When true, populates the document field on ExtractionResult with a hierarchical DocumentStructure containing heading-driven section nesting, table grids, content layer classification, and inline annotations. Independent of result_format — can be combined with Unified or ElementBased.
acceleration KreuzbergAccelerationConfig* NULL Hardware acceleration configuration for ONNX Runtime models. Controls execution provider selection for layout detection and embedding models. When NULL, uses platform defaults (CoreML on macOS, CUDA on Linux, CPU on Windows).
cache_namespace const char** NULL Cache namespace for tenant isolation. When set, cache entries are stored under {cache_dir}/{namespace}/. Must be alphanumeric, hyphens, or underscores only (max 64 chars). Different namespaces have isolated cache spaces on the same filesystem.
cache_ttl_secs uint64_t* NULL Per-request cache TTL in seconds. Overrides the global max_age_days for this specific extraction. When 0, caching is completely skipped (no read or write). When NULL, the global TTL applies.
email KreuzbergEmailConfig* NULL Email extraction configuration (NULL = use defaults). Currently supports configuring the fallback codepage for MSG files that do not specify one. See EmailConfig for details.
concurrency const char** NULL Concurrency limits for constrained environments (NULL = use defaults). Controls Rayon thread pool size, ONNX Runtime intra-op threads, and (when max_concurrent_extractions is unset) the batch concurrency semaphore. See ConcurrencyConfig for details.
max_archive_depth uintptr_t Maximum recursion depth for archive extraction (default: 3). Set to 0 to disable recursive extraction (legacy behavior).
tree_sitter KreuzbergTreeSitterConfig* NULL Tree-sitter language pack configuration (NULL = tree-sitter disabled). When set, enables code file extraction using tree-sitter parsers. Controls grammar download behavior and code analysis options.
structured_extraction KreuzbergStructuredExtractionConfig* NULL Structured extraction via LLM (NULL = disabled). When set, the extracted document content is sent to an LLM with the provided JSON schema. The structured response is stored in ExtractionResult.structured_output.
cancel_token const char** NULL Cancellation token for this extraction (NULL = no external cancellation). Pass a CancellationToken clone here and call CancellationToken.cancel from another thread / task to abort the extraction in progress. The extractor checks the token at safe checkpoints (before lock acquisition, between pages, between batch items) and returns KreuzbergError.Cancelled when set. The field is excluded from serialization because CancellationToken is a runtime handle, not a configuration value.
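
Several rows in the table above state constraints: force_ocr and disable_ocr cannot both be true, force_ocr_pages entries must be >= 1, and cache_namespace is limited to alphanumerics, hyphens, and underscores with at most 64 characters. A caller could check these before extraction with something like the sketch below. SketchConfig, ConfigCheck, and check_config are hypothetical; the struct is a reduced view with only the fields being checked, and the separate page count is an assumption about how the array length would be carried.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Reduced, hypothetical view of KreuzbergExtractionConfig.
typedef struct {
    bool force_ocr;
    bool disable_ocr;
    const uint32_t *force_ocr_pages;  // NULL when unset
    size_t force_ocr_page_count;      // assumption: length passed separately
    const char *cache_namespace;      // NULL when unset
} SketchConfig;

typedef enum {
    CONFIG_OK,
    CONFIG_OCR_CONFLICT,
    CONFIG_BAD_PAGE,
    CONFIG_BAD_NAMESPACE
} ConfigCheck;

ConfigCheck check_config(const SketchConfig *cfg) {
    assert(cfg != NULL);
    // disable_ocr cannot be true at the same time as force_ocr.
    if (cfg->force_ocr && cfg->disable_ocr) {
        return CONFIG_OCR_CONFLICT;
    }
    // Page numbers are 1-indexed, so 0 is never valid.
    if (cfg->force_ocr_pages != NULL) {
        for (size_t i = 0; i < cfg->force_ocr_page_count; i++) {
            if (cfg->force_ocr_pages[i] < 1) {
                return CONFIG_BAD_PAGE;
            }
        }
    }
    // Namespace: alphanumerics, hyphens, underscores; at most 64 characters.
    if (cfg->cache_namespace != NULL) {
        size_t len = strlen(cfg->cache_namespace);
        if (len > 64) {
            return CONFIG_BAD_NAMESPACE;
        }
        for (size_t i = 0; i < len; i++) {
            unsigned char ch = (unsigned char)cfg->cache_namespace[i];
            if (!isalnum(ch) && ch != '-' && ch != '_') {
                return CONFIG_BAD_NAMESPACE;
            }
        }
    }
    return CONFIG_OK;
}
```

The sketch uses isalnum, which follows the current C locale; whether the library accepts non-ASCII alphanumerics is not stated in this reference.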

Methods

kreuzberg_default()

Signature:

KreuzbergExtractionConfig kreuzberg_default();
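
Some fields in the default configuration are NULL, and their effective value is derived at run time. The sketch below illustrates two of those rules as stated in the table: the batch concurrency default of (num_cpus × 1.5).ceil(), and the batch timeout that a per-file timeout can override. All three function names are hypothetical. The sketch ignores the concurrency field, and it assumes that an unset per-file timeout inherits the batch default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// ceil(num_cpus * 1.5) in integer arithmetic: (3n + 1) / 2.
uintptr_t default_max_concurrent(uintptr_t num_cpus) {
    assert(num_cpus > 0);
    return (3 * num_cpus + 1) / 2;
}

// NULL means "not set", so the documented default applies.
uintptr_t effective_max_concurrent(const uintptr_t *configured,
                                   uintptr_t num_cpus) {
    return configured != NULL ? *configured : default_max_concurrent(num_cpus);
}

// The per-file timeout, when present, overrides the batch default. A NULL
// result means no timeout (unbounded extraction time).
const uint64_t *effective_timeout_secs(const uint64_t *file_override,
                                       const uint64_t *batch_default) {
    return file_override != NULL ? file_override : batch_default;
}
```

On a 4-core machine this yields 6 concurrent extractions when max_concurrent_extractions is left at NULL.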

kreuzberg_needs_image_processing()

Check if image processing is needed by examining OCR and image extraction settings.

Returns true if either OCR is enabled or image extraction is configured, indicating that image decompression and processing should occur. Returns false if both are disabled, allowing optimization to skip unnecessary image decompression for text-only extraction workflows.

Optimization Impact

For text-only extractions (no OCR, no image extraction), skipping image decompression can reduce CPU usage by 5-10% by avoiding image I/O and processing whose results would not be used.

Signature:

bool kreuzberg_needs_image_processing();
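
The rule described above reduces to a check on two optional fields. The following sketch uses a hypothetical two-field view of the configuration; SketchImageSettings and sketch_needs_image_processing are not part of the API. It follows only what the description states, so any interaction with force_ocr or disable_ocr is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical reduced view of the configuration: only the two optional
// sub-configurations that the description mentions.
typedef struct {
    const void *ocr;     // NULL = OCR disabled
    const void *images;  // NULL = no image extraction
} SketchImageSettings;

// True when OCR is enabled or image extraction is configured, meaning image
// decompression cannot be skipped.
bool sketch_needs_image_processing(const SketchImageSettings *cfg) {
    assert(cfg != NULL);
    return cfg->ocr != NULL || cfg->images != NULL;
}
```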

KreuzbergExtractionResult

General extraction result used by the core extraction API.

This is the main result type returned by all extraction functions.

Field Type Default Description
content const char* The extracted text content
mime_type const char* The detected MIME type
metadata KreuzbergMetadata Document metadata
extraction_method KreuzbergExtractionMethod* NULL Extraction strategy used to produce the returned text. Populated when the extractor can reliably distinguish native text extraction, OCR-only extraction, or mixed native/OCR output.
tables KreuzbergTable* NULL Tables extracted from the document
detected_languages const char*** NULL Detected languages
chunks KreuzbergChunk** NULL Text chunks when chunking is enabled. When chunking configuration is provided, the content is split into overlapping chunks for efficient processing. Each chunk contains the text, optional embeddings (if enabled), and metadata about its position.
images KreuzbergExtractedImage** NULL Extracted images from the document. When image extraction is enabled via ImageExtractionConfig, this field contains all images found in the document with their raw data and metadata. Each image may optionally contain a nested ocr_result if OCR was performed.
pages KreuzbergPageContent** NULL Per-page content when page extraction is enabled. When page extraction is configured, the document is split into per-page content with tables and images mapped to their respective pages.
elements KreuzbergElement** NULL Semantic elements when element-based result format is enabled. When result_format is set to ElementBased, this field contains semantic elements with type classification, unique identifiers, and metadata for Unstructured-compatible element-based processing.
djot_content KreuzbergDjotContent* NULL Rich Djot content structure (when extracting Djot documents). When extracting Djot documents with structured extraction enabled, this field contains the full semantic structure, including block-level elements with nesting; inline formatting with attributes; links, images, and footnotes; math expressions; and complete attribute information. The content field still contains plain text for backward compatibility. Always NULL for non-Djot documents.
ocr_elements KreuzbergOcrElement** NULL OCR elements with full spatial and confidence metadata. When OCR is performed with element extraction enabled, this field contains the structured representation of detected text, including bounding geometry (rectangles or quadrilaterals); confidence scores (detection and recognition); rotation information; and hierarchical relationships (Tesseract only). This field preserves all metadata that would otherwise be lost when converting to plain text or markdown output formats. Only populated when OcrElementConfig.include_elements is true.
document KreuzbergDocumentStructure* NULL Structured document tree (when document structure extraction is enabled). When include_document_structure is true in ExtractionConfig, this field contains the full hierarchical representation of the document, including heading-driven section nesting; table grids with cell-level metadata; content layer classification (body, header, footer, footnote); inline text annotations (formatting, links); and bounding boxes and page numbers. Independent of result_format; can be combined with Unified or ElementBased.
extracted_keywords KreuzbergKeyword** NULL Extracted keywords when keyword extraction is enabled. When keyword extraction (RAKE or YAKE) is configured, this field contains the extracted keywords with scores, algorithm info, and position data. Previously stored in metadata.additional["keywords"].
quality_score double* NULL Document quality score from quality analysis. A value between 0.0 and 1.0 indicating the overall text quality. Previously stored in metadata.additional["quality_score"].
processing_warnings KreuzbergProcessingWarning* NULL Non-fatal warnings collected during processing pipeline stages. Captures errors from optional pipeline features (embedding, chunking, language detection, output formatting) that don't prevent extraction but may indicate degraded results. Previously stored as individual keys in metadata.additional.
annotations KreuzbergPdfAnnotation** NULL PDF annotations extracted from the document. When annotation extraction is enabled via PdfConfig.extract_annotations, this field contains text notes, highlights, links, stamps, and other annotations found in PDF documents.
children KreuzbergArchiveEntry** NULL Nested extraction results from archive contents. When extracting archives, each processable file inside produces its own full extraction result. Set to NULL for non-archive formats. Use max_archive_depth in config to control recursion depth.
uris KreuzbergUri** NULL URIs/links discovered during document extraction. Contains hyperlinks, image references, citations, email addresses, and other URI-like references found in the document. Always extracted when present in the source document.
structured_output void** NULL Structured extraction output from LLM-based JSON schema extraction. When structured_extraction is configured in ExtractionConfig, the extracted document content is sent to a VLM with the provided JSON schema. The response is parsed and stored here as a JSON value matching the schema.
code_intelligence void** NULL Code intelligence results from tree-sitter analysis. Populated when extracting source code files with the tree-sitter feature. Contains metrics, structural analysis, imports/exports, comments, docstrings, symbols, diagnostics, and optionally chunked code segments. Stored as an opaque JSON value so that all language bindings (Go, Java, C#, …) can deserialize it as a raw JSON object rather than a typed struct. The underlying type is tree_sitter_language_pack.ProcessResult.
llm_usage KreuzbergLlmUsage** NULL LLM token usage and cost data for all LLM calls made during this extraction. Contains one entry per LLM call. Multiple entries are produced when VLM OCR, structured extraction, or LLM embeddings run during the same extraction. NULL when no LLM was used.
formatted_content const char** NULL Pre-rendered content in the requested output format. Populated during derive_extraction_result before tree derivation consumes element data. apply_output_format swaps this into content at the end of the pipeline, after post-processors have operated on plain text.
ocr_internal_document const char** NULL Structured hOCR document for the OCR+layout pipeline. When tesseract produces hOCR output, the parsed InternalDocument carries paragraph structure with bounding boxes and confidence scores. The layout classification step enriches these elements before final rendering.

Methods

kreuzberg_from_ocr()

Convert from an OCR result.

Signature:

KreuzbergExtractionResult kreuzberg_from_ocr(KreuzbergOcrExtractionResult ocr);

KreuzbergFictionBookMetadata

FictionBook (FB2) metadata.

Field Type Default Description
genres const char** NULL Genres
sequences const char** NULL Sequences
annotation const char** NULL Annotation

KreuzbergFileExtractionConfig

Per-file extraction configuration overrides for batch processing.

All fields are optional (Option<T>); NULL means "use the batch-level default." This type is used with batch_extract_files and batch_extract_bytes to allow heterogeneous extraction settings within a single batch.

Excluded Fields

The following ExtractionConfig fields are batch-level only and cannot be overridden per file:

  • max_concurrent_extractions — controls batch parallelism
  • use_cache — global caching policy
  • acceleration — shared ONNX execution provider
  • security_limits — global archive security policy

Field Type Default Description
enable_quality_processing bool* NULL Override quality post-processing for this file.
ocr KreuzbergOcrConfig* NULL Override OCR configuration for this file (NULL = use the batch default).
force_ocr bool* NULL Override force OCR for this file.
force_ocr_pages uint32_t** NULL Override force OCR pages for this file (1-indexed page numbers).
disable_ocr bool* NULL Override disable OCR for this file.
chunking KreuzbergChunkingConfig* NULL Override chunking configuration for this file.
content_filter KreuzbergContentFilterConfig* NULL Override content filtering configuration for this file.
images KreuzbergImageExtractionConfig* NULL Override image extraction configuration for this file.
pdf_options KreuzbergPdfConfig* NULL Override PDF options for this file.
token_reduction KreuzbergTokenReductionOptions* NULL Override token reduction for this file.
language_detection KreuzbergLanguageDetectionConfig* NULL Override language detection for this file.
pages KreuzbergPageConfig* NULL Override page extraction for this file.
keywords KreuzbergKeywordConfig* NULL Override keyword extraction for this file.
postprocessor KreuzbergPostProcessorConfig* NULL Override post-processor for this file.
html_options const char** NULL Override HTML conversion options for this file.
result_format KreuzbergResultFormat* NULL Override result format for this file.
output_format KreuzbergOutputFormat* NULL Override output content format for this file.
include_document_structure bool* NULL Override document structure output for this file.
layout KreuzbergLayoutDetectionConfig* NULL Override layout detection for this file.
timeout_secs uint64_t* NULL Override per-file extraction timeout in seconds. When set, the extraction for this file will be canceled after the specified duration. A timed-out file produces an error result without affecting other files in the batch.
tree_sitter KreuzbergTreeSitterConfig* NULL Override tree-sitter configuration for this file.
structured_extraction KreuzbergStructuredExtractionConfig* NULL Override structured extraction configuration for this file. When set, enables LLM-based structured extraction with a JSON schema for this specific file. The extracted content is sent to a VLM/LLM and the response is parsed according to the provided schema.
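
The NULL-means-inherit rule can be sketched as follows. This is a minimal illustration, not the library's implementation: BatchDefaults and FileOverrides are hypothetical stand-in structs that mirror only two of the fields listed above, and effective_config is a made-up helper name.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real Kreuzberg structs. */
typedef struct {
    bool force_ocr;
    bool include_document_structure;
} BatchDefaults;

typedef struct {
    const bool *force_ocr;                  /* NULL = use the batch-level default */
    const bool *include_document_structure; /* NULL = use the batch-level default */
} FileOverrides;

/* Resolve one optional bool: a non-NULL override wins over the batch default. */
bool resolve_bool(const bool *override_value, bool batch_default) {
    return override_value != NULL ? *override_value : batch_default;
}

/* Build the effective settings for one file in the batch. */
BatchDefaults effective_config(const BatchDefaults *batch, const FileOverrides *file) {
    BatchDefaults out;
    out.force_ocr = resolve_bool(file->force_ocr, batch->force_ocr);
    out.include_document_structure =
        resolve_bool(file->include_document_structure, batch->include_document_structure);
    return out;
}
```

A file that sets only force_ocr keeps the batch value for every other field, which is what makes heterogeneous batches cheap to describe.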

KreuzbergFootnote

Footnote in Djot.

Field Type Default Description
label const char* Footnote label
content KreuzbergFormattedBlock* Footnote content blocks

KreuzbergFormattedBlock

Block-level element in a Djot document.

Represents structural elements like headings, paragraphs, lists, code blocks, etc.

Field Type Default Description
block_type KreuzbergBlockType Type of block element
level uintptr_t* NULL Heading level (1-6) for headings, or nesting level for lists
inline_content KreuzbergInlineElement* Inline content within the block
attributes const char** NULL Element attributes (classes, IDs, key-value pairs)
language const char** NULL Language identifier for code blocks
code const char** NULL Raw code content for code blocks
children KreuzbergFormattedBlock* /* serde(default) */ Nested blocks for containers (blockquotes, list items, divs)

KreuzbergGridCell

Individual grid cell with position and span metadata.

Field Type Default Description
content const char* Cell text content.
row uint32_t Zero-indexed row position.
col uint32_t Zero-indexed column position.
row_span uint32_t /* serde(default) */ Number of rows this cell spans.
col_span uint32_t /* serde(default) */ Number of columns this cell spans.
is_header bool /* serde(default) */ Whether this is a header cell.
bbox KreuzbergBoundingBox* NULL Bounding box for this cell (if available).
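
As a rough sketch of how the position and span fields combine, the helper below computes the smallest grid that contains every cell. GridCell here is a hypothetical local mirror of the fields above (bbox omitted), and treating a span of 0 as 1 is an assumption of this sketch, not documented behavior.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the documented fields (bbox omitted). */
typedef struct {
    const char *content;
    uint32_t row;      /* zero-indexed */
    uint32_t col;      /* zero-indexed */
    uint32_t row_span;
    uint32_t col_span;
    bool is_header;
} GridCell;

/* Assumption: a span of 0 (e.g. a zero-initialized struct) still occupies one slot. */
uint32_t effective_span(uint32_t span) {
    return span == 0 ? 1 : span;
}

/* Compute the smallest rows x cols grid that contains every cell. */
void grid_extent(const GridCell *cells, size_t n, uint32_t *rows, uint32_t *cols) {
    *rows = 0;
    *cols = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t row_end = cells[i].row + effective_span(cells[i].row_span);
        uint32_t col_end = cells[i].col + effective_span(cells[i].col_span);
        if (row_end > *rows) *rows = row_end;
        if (col_end > *cols) *cols = col_end;
    }
}
```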

KreuzbergHeaderMetadata

Header/heading element metadata.

Field Type Default Description
level uint8_t Header level: 1 (h1) through 6 (h6)
text const char* Normalized text content of the header
id const char** NULL HTML id attribute if present
depth uint32_t Document tree depth at the header element
html_offset uint32_t Byte offset in original HTML document

KreuzbergHeadingContext

Heading context for a chunk within a Markdown document.

Contains the heading hierarchy from document root to this chunk's section.

Field Type Default Description
headings KreuzbergHeadingLevel* The heading hierarchy from document root to this chunk's section. Index 0 is the outermost (h1), last element is the most specific.

KreuzbergHeadingLevel

A single heading in the hierarchy.

Field Type Default Description
level uint8_t Heading depth (1 = h1, 2 = h2, etc.)
text const char* The text content of the heading.
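
One plausible use of the heading hierarchy is a breadcrumb for each chunk. The sketch below joins the heading texts from outermost to most specific; HeadingLevel is a hypothetical local mirror of the two fields above, and heading_breadcrumb is an illustrative helper, not part of the API.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirror of the documented fields. */
typedef struct {
    uint8_t level;
    const char *text;
} HeadingLevel;

/* Join heading texts as "A > B > C" into out (capacity cap, NUL-terminated).
   Returns the number of characters stored; output is truncated if cap is too small. */
size_t heading_breadcrumb(const HeadingLevel *headings, size_t n, char *out, size_t cap) {
    size_t used = 0;
    if (cap == 0) return 0;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(out + used, cap - used, "%s%s",
                         i > 0 ? " > " : "", headings[i].text);
        if (w < 0) break;                  /* encoding error: keep what we have */
        if ((size_t)w >= cap - used) {     /* truncated: buffer is now full */
            used = cap - 1;
            break;
        }
        used += (size_t)w;
    }
    return used;
}
```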

KreuzbergHierarchicalBlock

A text block with hierarchy level assignment.

Represents a block of text with semantic heading information extracted from font size clustering and hierarchical analysis.

Field Type Default Description
text const char* The text content of this block
font_size float The font size of the text in this block
level const char* The hierarchy level of this block (H1-H6 or Body). Levels correspond to HTML heading tags: "h1" (top-level heading), "h2" (secondary heading), "h3" (tertiary heading), "h4" (quaternary heading), "h5" (quinary heading), "h6" (senary heading), and "body" (body text, no heading level).
bbox float** NULL Bounding box information for the block. Contains coordinates as (left, top, right, bottom) in PDF units.
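
Because level is a string, callers that sort or compare blocks may want a numeric rank. A minimal sketch, assuming the lowercase values listed above are the only ones emitted; hierarchy_level_rank is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Map a level string to 1..6 for "h1".."h6", 0 for "body", -1 for anything else. */
int hierarchy_level_rank(const char *level) {
    if (level == NULL) return -1;
    if (strcmp(level, "body") == 0) return 0;
    if (level[0] == 'h' && level[1] >= '1' && level[1] <= '6' && level[2] == '\0') {
        return level[1] - '0';
    }
    return -1;
}
```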

KreuzbergHierarchyConfig

Hierarchy extraction configuration for PDF text structure analysis.

Enables extraction of document hierarchy levels (H1-H6) based on font size clustering and semantic analysis. When enabled, hierarchical blocks are included in page content.

Field Type Default Description
enabled bool true Enable hierarchy extraction
k_clusters uintptr_t 3 Number of font size clusters to use for hierarchy levels (1-7) Default: 6, which provides H1-H6 heading levels with body text. Larger values create more fine-grained hierarchy levels.
include_bbox bool true Include bounding box information in hierarchy blocks
ocr_coverage_threshold float* NULL OCR coverage threshold for smart OCR triggering (0.0-1.0) Determines when OCR should be triggered based on text block coverage. OCR is triggered when text blocks cover less than this fraction of the page. Default: 0.5 (trigger OCR if less than 50% of page has text)
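
The coverage rule for ocr_coverage_threshold can be read as a single comparison. The sketch below assumes that a NULL threshold falls back to the 0.5 value named in the field description; should_trigger_ocr and the fallback constant are illustrative names, not library symbols.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: NULL falls back to 0.5, the default the field description states. */
#define FALLBACK_OCR_COVERAGE_THRESHOLD 0.5f

/* Return true when text blocks cover less than the threshold fraction of the page. */
bool should_trigger_ocr(float text_coverage, const float *ocr_coverage_threshold) {
    float threshold = ocr_coverage_threshold != NULL
        ? *ocr_coverage_threshold
        : FALLBACK_OCR_COVERAGE_THRESHOLD;
    return text_coverage < threshold;
}
```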

Methods

kreuzberg_default()

Signature:

KreuzbergHierarchyConfig kreuzberg_default();

KreuzbergHtmlMetadata

HTML metadata extracted from HTML documents.

Includes document-level metadata, Open Graph data, Twitter Card metadata, and extracted structural elements (headers, links, images, structured data).

Field Type Default Description
title const char** NULL Document title from <title> tag
description const char** NULL Document description from <meta name="description"> tag
keywords const char** NULL Document keywords from <meta name="keywords"> tag, split on commas
author const char** NULL Document author from <meta name="author"> tag
canonical_url const char** NULL Canonical URL from <link rel="canonical"> tag
base_href const char** NULL Base URL from <base href=""> tag for resolving relative URLs
language const char** NULL Document language from lang attribute
text_direction KreuzbergTextDirection* NULL Document text direction from dir attribute
open_graph void* NULL Open Graph metadata (og:* properties) for social media Keys like "title", "description", "image", "url", etc.
twitter_card void* NULL Twitter Card metadata (twitter:* properties) Keys like "card", "site", "creator", "title", "description", "image", etc.
meta_tags void* NULL Additional meta tags not covered by specific fields Keys are meta name/property attributes, values are content
headers KreuzbergHeaderMetadata* NULL Extracted header elements with hierarchy
links KreuzbergLinkMetadata* NULL Extracted hyperlinks with type classification
images KreuzbergImageMetadataType* NULL Extracted images with source and dimensions
structured_data KreuzbergStructuredData* NULL Extracted structured data blocks

KreuzbergHtmlOutputConfig

Configuration for styled HTML output.

When set on ExtractionConfig.html_output alongside output_format = OutputFormat.Html, the pipeline builds a StyledHtmlRenderer instead of the plain comrak-based renderer.

Field Type Default Description
css const char** NULL Inline CSS string injected into the output after the theme stylesheet. Concatenated after css_file content when both are set.
css_file const char** NULL Path to a CSS file loaded once at renderer construction time. Concatenated before css when both are set.
theme KreuzbergHtmlTheme KREUZBERG_KREUZBERG_UNSTYLED Built-in colour/typography theme. Default: HtmlTheme.Unstyled.
class_prefix const char* CSS class prefix applied to every emitted class name. Default: "kb-". Change this if your host application already uses classes that start with kb-.
embed_css bool true When true (default), write the resolved CSS into a <style> block immediately after the opening <div class="{prefix}doc">. Set to false to emit only the structural markup and wire up your own stylesheet targeting the kb-* class names.
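
The concatenation order described for css and css_file (theme stylesheet first, then the css_file content, then the inline css string) can be sketched as below. resolve_css is a hypothetical helper that takes already-loaded strings; it does not read files and is not how the renderer is necessarily implemented.

```c
#include <stddef.h>
#include <stdio.h>

/* Concatenate theme CSS, css_file content, then inline css, in that order.
   Any input may be NULL. Returns 0 on success, -1 if out is too small. */
int resolve_css(const char *theme_css, const char *css_file_content,
                const char *css, char *out, size_t cap) {
    int written = snprintf(out, cap, "%s%s%s",
                           theme_css ? theme_css : "",
                           css_file_content ? css_file_content : "",
                           css ? css : "");
    return (written >= 0 && (size_t)written < cap) ? 0 : -1;
}
```

Placing the inline string last means its rules win over both the theme and the file when selectors have equal specificity.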

Methods

kreuzberg_default()

Signature:

KreuzbergHtmlOutputConfig kreuzberg_default();

KreuzbergImageExtractionConfig

Image extraction configuration.

Field Type Default Description
extract_images bool true Extract images from documents
target_dpi int32_t 300 Target DPI for image normalization
max_image_dimension int32_t 4096 Maximum dimension for images (width or height)
inject_placeholders bool true Whether to inject image reference placeholders into markdown output. When true (default), image references like ![Image 1](embedded:p1_i0) are appended to the markdown. Set to false to extract images as data without polluting the markdown output.
auto_adjust_dpi bool true Automatically adjust DPI based on image content
min_dpi int32_t 72 Minimum DPI threshold
max_dpi int32_t 600 Maximum DPI threshold
max_images_per_page uint32_t* NULL Maximum number of image objects to extract per PDF page. Some PDFs (e.g. technical diagrams stored as thousands of raster fragments) can trigger extremely long or indefinite extraction times when every image object on a dense page is decoded individually via the PDF extractor. Setting this limit causes kreuzberg to stop collecting individual images once the count per page reaches the cap and emit a warning instead. NULL (default) means no limit — all images are extracted.
classify bool true When true (default), extracted images are classified by kind and grouped into clusters where they appear to belong to one figure.
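
The effect of max_images_per_page on a single page can be sketched as a simple cap, with NULL meaning no limit. images_to_collect is a hypothetical helper for illustration; the warning that the library emits when the cap is reached is not modeled here.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of image objects to keep for one page; a NULL cap means no limit. */
uint32_t images_to_collect(uint32_t images_on_page, const uint32_t *max_images_per_page) {
    if (max_images_per_page == NULL) return images_on_page;
    return images_on_page < *max_images_per_page ? images_on_page : *max_images_per_page;
}
```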

Methods

kreuzberg_default()

Signature:

KreuzbergImageExtractionConfig kreuzberg_default();

KreuzbergImageMetadata

Image metadata extracted from image files.

Includes dimensions, format, and EXIF data.

Field Type Default Description
width uint32_t Image width in pixels
height uint32_t Image height in pixels
format const char* Image format (e.g., "PNG", "JPEG", "TIFF")
exif void* NULL EXIF metadata tags

KreuzbergImageMetadataType

Image element metadata.

Field Type Default Description
src const char* Image source (URL, data URI, or SVG content)
alt const char** NULL Alternative text from alt attribute
title const char** NULL Title attribute
dimensions uint32_t** NULL Image dimensions as (width, height) if available
image_type KreuzbergImageType Image type classification
attributes const char*** Additional attributes as key-value pairs

KreuzbergImagePreprocessingConfig

Image preprocessing configuration for OCR.

These settings control how images are preprocessed before OCR to improve text recognition quality. Different preprocessing strategies work better for different document types.

Field Type Default Description
target_dpi int32_t 300 Target DPI for the image (300 is standard, 600 for small text).
auto_rotate bool true Auto-detect and correct image rotation.
deskew bool true Correct skew (tilted images).
denoise bool false Remove noise from the image.
contrast_enhance bool false Enhance contrast for better text visibility.
binarization_method const char* "otsu" Binarization method: "otsu", "sauvola", "adaptive".
invert_colors bool false Invert colors (white text on black → black on white).

Methods

kreuzberg_default()

Signature:

KreuzbergImagePreprocessingConfig kreuzberg_default();

KreuzbergImagePreprocessingMetadata

Image preprocessing metadata.

Tracks the transformations applied to an image during OCR preprocessing, including DPI normalization, resizing, and resampling.

Field Type Default Description
original_dimensions uintptr_t* Original image dimensions (width, height) in pixels
original_dpi double* Original image DPI (horizontal, vertical)
target_dpi int32_t Target DPI from configuration
scale_factor double Scaling factor applied to the image
auto_adjusted bool Whether DPI was auto-adjusted based on content
final_dpi int32_t Final DPI after processing
new_dimensions uintptr_t** NULL New dimensions after resizing (if resized)
resample_method const char* Resampling algorithm used ("LANCZOS3", "CATMULLROM", etc.)
dimension_clamped bool Whether dimensions were clamped to max_image_dimension
calculated_dpi int32_t* NULL Calculated optimal DPI (if auto_adjust_dpi enabled)
skipped_resize bool Whether resize was skipped (dimensions already optimal)
resize_error const char** NULL Error message if resize failed

KreuzbergInlineElement

Inline element within a block.

Represents text with formatting, links, images, etc.

Field Type Default Description
element_type KreuzbergInlineType Type of inline element
content const char* Text content
attributes const char** NULL Element attributes
metadata void** NULL Additional metadata (e.g., href for links, src/alt for images)

KreuzbergJatsMetadata

JATS (Journal Article Tag Suite) metadata.

Field Type Default Description
copyright const char** NULL Copyright
license const char** NULL License
history_dates void* NULL History dates
contributor_roles KreuzbergContributorRole* NULL Contributor roles

KreuzbergKeyword

Extracted keyword with metadata.

Field Type Default Description
text const char* The keyword text.
score float Relevance score (higher is better, algorithm-specific range).
algorithm KreuzbergKeywordAlgorithm Algorithm that extracted this keyword.
positions uintptr_t** NULL Optional positions where keyword appears in text (character offsets).

KreuzbergKeywordConfig

Keyword extraction configuration.

Field Type Default Description
algorithm KreuzbergKeywordAlgorithm KREUZBERG_KREUZBERG_YAKE Algorithm to use for extraction.
max_keywords uintptr_t 10 Maximum number of keywords to extract (default: 10).
min_score float 0 Minimum score threshold (0.0-1.0, default: 0.0). Keywords with scores below this threshold are filtered out. Note: Score ranges differ between algorithms.
ngram_range uintptr_t* NULL N-gram range for keyword extraction as (min, max): (1, 1) = unigrams only; (1, 2) = unigrams and bigrams; (1, 3) = unigrams, bigrams, and trigrams (default).
language const char** NULL Language code for stopword filtering (e.g., "en", "de", "fr"). If NULL, no stopword filtering is applied.
yake_params KreuzbergYakeParams* NULL YAKE-specific tuning parameters.
rake_params KreuzbergRakeParams* NULL RAKE-specific tuning parameters.
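
How min_score and max_keywords interact can be sketched as a filter followed by a cut-off. The Keyword struct below is a hypothetical two-field subset of KreuzbergKeyword, and the sketch assumes the input is already ordered by relevance; the library's actual ordering and tie handling are not documented in this section.

```c
#include <stddef.h>

/* Hypothetical subset of the keyword fields, for illustration only. */
typedef struct {
    const char *text;
    float score;
} Keyword;

/* Keep keywords with score >= min_score, compacting in place, and stop once
   max_keywords have been kept. Returns the number of keywords kept. */
size_t filter_keywords(Keyword *keywords, size_t n, float min_score, size_t max_keywords) {
    size_t kept = 0;
    for (size_t i = 0; i < n && kept < max_keywords; i++) {
        if (keywords[i].score >= min_score) {
            keywords[kept++] = keywords[i];
        }
    }
    return kept;
}
```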

Methods

kreuzberg_default()

Signature:

KreuzbergKeywordConfig kreuzberg_default();

KreuzbergLanguageDetectionConfig

Language detection configuration.

Field Type Default Description
enabled bool true Enable language detection
min_confidence double 0.8 Minimum confidence threshold (0.0-1.0)
detect_multiple bool false Detect multiple languages in the document

Methods

kreuzberg_default()

Signature:

KreuzbergLanguageDetectionConfig kreuzberg_default();

KreuzbergLayoutDetection

A single layout detection result.

Field Type Default Description
class_name KreuzbergLayoutClass Detected layout class
confidence float Detection confidence score
bbox KreuzbergBBox Bounding box of the detected region

KreuzbergLayoutDetectionConfig

Layout detection configuration.

Controls layout detection behavior in the extraction pipeline. When set on ExtractionConfig, layout detection is enabled for PDF extraction.

Field Type Default Description
confidence_threshold float* NULL Confidence threshold override (NULL = use the model default).
apply_heuristics bool true Whether to apply postprocessing heuristics (default: true).
table_model KreuzbergTableModel KREUZBERG_KREUZBERG_TATR Table structure recognition model. Controls which model is used for table cell detection within layout-detected table regions. Defaults to TableModel.Tatr.
acceleration KreuzbergAccelerationConfig* NULL Hardware acceleration for ONNX models (layout detection + table structure). When set, controls which execution provider (CPU, CUDA, CoreML, TensorRT) is used for inference. Defaults to NULL (auto-select per platform).

Methods

kreuzberg_default()

Signature:

KreuzbergLayoutDetectionConfig kreuzberg_default();

KreuzbergLayoutRegion

A detected layout region on a page.

When layout detection is enabled, each page may have layout regions identifying different content types (text, pictures, tables, etc.) with confidence scores and spatial positions.

Field Type Default Description
class_name const char* Layout class name (e.g. "picture", "table", "text", "section_header").
confidence double Confidence score from the layout detection model (0.0 to 1.0).
bounding_box KreuzbergBoundingBox Bounding box in document coordinate space.
area_fraction double Fraction of the page area covered by this region (0.0 to 1.0).

KreuzbergLinkMetadata

Link element metadata.

Field Type Default Description
href const char* The href URL value
text const char* Link text content (normalized)
title const char** NULL Optional title attribute
link_type KreuzbergLinkType Link type classification
rel const char** Rel attribute values
attributes const char*** Additional attributes as key-value pairs

KreuzbergLlmConfig

Configuration for an LLM provider/model via liter-llm.

Each feature (VLM OCR, VLM embeddings, structured extraction) carries its own LlmConfig, allowing different providers per feature.

Field Type Default Description
model const char* Provider/model string using liter-llm routing format. Examples: "openai/gpt-4o", "anthropic/claude-sonnet-4-20250514", "groq/llama-3.1-70b-versatile".
api_key const char** NULL API key for the provider. When NULL, liter-llm falls back to the provider's standard environment variable (e.g., OPENAI_API_KEY).
base_url const char** NULL Custom base URL override for the provider endpoint.
timeout_secs uint64_t* NULL Request timeout in seconds (default: 60).
max_retries uint32_t* NULL Maximum retry attempts (default: 3).
temperature double* NULL Sampling temperature for generation tasks.
max_tokens uint64_t* NULL Maximum tokens to generate.
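
The model field packs a provider and a model name into one string. The sketch below splits it at the first '/', which matches the examples above but is an assumption about the routing format rather than liter-llm's actual parser; split_model_route is a hypothetical helper.

```c
#include <stddef.h>
#include <string.h>

/* Copy the provider prefix (text before the first '/') into provider_out.
   Returns a pointer to the model part inside route, or NULL if route has
   no '/' or the provider does not fit in provider_out. */
const char *split_model_route(const char *route, char *provider_out, size_t cap) {
    const char *slash = strchr(route, '/');
    if (slash == NULL) return NULL;
    size_t len = (size_t)(slash - route);
    if (len + 1 > cap) return NULL;
    memcpy(provider_out, route, len);
    provider_out[len] = '\0';
    return slash + 1;
}
```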

KreuzbergLlmUsage

Token usage and cost data for a single LLM call made during extraction.

Populated when VLM OCR, structured extraction, or LLM-based embeddings are used. Multiple entries may be present when multiple LLM calls occur within one extraction (e.g. VLM OCR + structured extraction).

Field Type Default Description
model const char* The LLM model identifier (e.g. "openai/gpt-4o", "anthropic/claude-sonnet-4-20250514").
source const char* The pipeline stage that triggered this LLM call (e.g. "vlm_ocr", "structured_extraction", "embeddings").
input_tokens uint64_t* NULL Number of input/prompt tokens consumed.
output_tokens uint64_t* NULL Number of output/completion tokens generated.
total_tokens uint64_t* NULL Total tokens (input + output).
estimated_cost double* NULL Estimated cost in USD based on the provider's published pricing.
finish_reason const char** NULL Why the model stopped generating (e.g. "stop", "length", "content_filter").
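
Since one extraction can produce several usage entries with nullable counters, totals have to skip the NULL fields. A minimal sketch, using a hypothetical LlmUsage struct that mirrors only four of the fields above; UsageTotals and sum_llm_usage are illustrative names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the documented fields. */
typedef struct {
    const char *model;
    const char *source;
    const uint64_t *total_tokens;  /* NULL when not reported */
    const double *estimated_cost;  /* NULL when not reported */
} LlmUsage;

typedef struct {
    uint64_t tokens;
    double cost_usd;
    size_t calls;
} UsageTotals;

/* Sum tokens and cost across all entries, treating NULL fields as "unknown"
   and leaving them out of the totals. */
UsageTotals sum_llm_usage(const LlmUsage *entries, size_t n) {
    UsageTotals totals = {0, 0.0, 0};
    for (size_t i = 0; i < n; i++) {
        if (entries[i].total_tokens != NULL) totals.tokens += *entries[i].total_tokens;
        if (entries[i].estimated_cost != NULL) totals.cost_usd += *entries[i].estimated_cost;
        totals.calls++;
    }
    return totals;
}
```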

KreuzbergMetadata

Extraction result metadata.

Contains common fields applicable to all formats, format-specific metadata via a discriminated union, and additional custom fields from postprocessors.

Field Type Default Description
title const char** NULL Document title
subject const char** NULL Document subject or description
authors const char*** NULL Primary author(s) - always Vec for consistency
keywords const char*** NULL Keywords/tags - always Vec for consistency
language const char** NULL Primary language (ISO 639 code)
created_at const char** NULL Creation timestamp (ISO 8601 format)
modified_at const char** NULL Last modification timestamp (ISO 8601 format)
created_by const char** NULL User who created the document
modified_by const char** NULL User who last modified the document
pages KreuzbergPageStructure* NULL Page/slide/sheet structure with boundaries
format KreuzbergFormatMetadata* NULL Format-specific metadata (discriminated union) Contains detailed metadata specific to the document format. Serialized as a nested "format" object with a format_type discriminator field.
image_preprocessing KreuzbergImagePreprocessingMetadata* NULL Image preprocessing metadata (when OCR preprocessing was applied)
json_schema void** NULL JSON schema (for structured data extraction)
error KreuzbergErrorMetadata* NULL Error metadata (for batch operations)
extraction_duration_ms uint64_t* NULL Extraction duration in milliseconds (for benchmarking). This field is populated by batch extraction to provide per-file timing information. It's NULL for single-file extraction (which uses external timing).
category const char** NULL Document category (from frontmatter or classification).
tags const char*** NULL Document tags (from frontmatter).
document_version const char** NULL Document version string (from frontmatter).
abstract_text const char** NULL Abstract or summary text (from frontmatter).
output_format const char** NULL Output format identifier (e.g., "markdown", "html", "text"). Set by the output format pipeline stage when format conversion is applied. Previously stored in metadata.additional["output_format"].
ocr_used bool Whether OCR was used during extraction. Set to true whenever the extraction pipeline ran an OCR backend (Tesseract, PaddleOCR, VLM, etc.) and used that output as the primary or fallback text. false means native text extraction was used exclusively.
additional void* NULL Additional custom fields from postprocessors. Serialized as a nested "additional" object (not flattened at root level). Uses Cow<'static, str> keys so static string keys avoid allocation.

Methods

kreuzberg_is_empty()

Returns true when no metadata fields, format-specific metadata, or additional postprocessor fields are populated.

Signature:

bool kreuzberg_is_empty();

KreuzbergModelPaths

Combined paths to all models needed for OCR (backward compatibility).

Field Type Default Description
det_model const char* Path to the detection model directory.
cls_model const char* Path to the classification model directory.
rec_model const char* Path to the recognition model directory.
dict_file const char* Path to the character dictionary file.

KreuzbergOcrBackend

Trait for OCR backend plugins.

Implement this trait to add custom OCR capabilities. OCR backends can be:

  • Native Rust implementations (like Tesseract)
  • FFI bridges to Python libraries (like EasyOCR, PaddleOCR)
  • Cloud-based OCR services (Google Vision, AWS Textract, etc.)

Thread Safety

OCR backends must be thread-safe (Send + Sync) to support concurrent processing.

Methods

kreuzberg_process_image()

Process an image and extract text via OCR.

Returns:

An ExtractionResult containing the extracted text and metadata.

Errors:

  • KreuzbergError.Ocr - OCR processing failed
  • KreuzbergError.Validation - Invalid image format or configuration
  • KreuzbergError.Io - I/O errors (these always bubble up)

Reading backend_options

Backends that support runtime tuning can read config.backend_options and deserialize only the keys they care about. Unknown keys are silently ignored, so multiple backends can coexist in a pipeline without key conflicts.

Signature:

KreuzbergExtractionResult kreuzberg_process_image(const uint8_t* image_bytes, KreuzbergOcrConfig config);

kreuzberg_process_image_file()

Process a file and extract text via OCR.

Default implementation reads the file and calls process_image. Override for custom file handling or optimizations.

Errors:

Same as process_image, plus file I/O errors.

Signature:

KreuzbergExtractionResult kreuzberg_process_image_file(const char* path, KreuzbergOcrConfig config);

kreuzberg_supports_language()

Check if this backend supports a given language code.

Returns:

true if the language is supported, false otherwise.

Signature:

bool kreuzberg_supports_language(const char* lang);

kreuzberg_backend_type()

Get the backend type identifier.

Returns:

The backend type enum value.

Signature:

KreuzbergOcrBackendType kreuzberg_backend_type();

kreuzberg_supported_languages()

Optional: Get a list of all supported languages.

Defaults to empty list. Override to provide comprehensive language support info.

Signature:

const char** kreuzberg_supported_languages();

kreuzberg_supports_table_detection()

Optional: Check if the backend supports table detection.

Defaults to false. Override if your backend can detect and extract tables.

Signature:

bool kreuzberg_supports_table_detection();

kreuzberg_supports_document_processing()

Check if the backend supports direct document-level processing (e.g. for PDFs).

Defaults to false. Override if the backend has optimized document processing.

Signature:

bool kreuzberg_supports_document_processing();

kreuzberg_process_document()

Process a document file directly via OCR.

Only called if supports_document_processing returns true.

Signature:

KreuzbergExtractionResult kreuzberg_process_document(const char* path, KreuzbergOcrConfig config);
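
The relationship between the optional capability check and the document-level entry point can be sketched with a table of function pointers. Everything here is hypothetical: OcrBackendVtable, run_ocr, and the demo callbacks are stand-ins, the callbacks return plain int codes instead of extraction results, and the real plugin registration API is not shown in this section.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vtable for illustration; not the real plugin interface. */
typedef struct {
    int (*process_image_file)(const char *path);
    bool (*supports_document_processing)(void); /* NULL = default (false) */
    int (*process_document)(const char *path);
} OcrBackendVtable;

/* Use process_document only when the backend opts in; otherwise fall back
   to per-file image processing. */
int run_ocr(const OcrBackendVtable *backend, const char *path) {
    bool document_capable = backend->supports_document_processing != NULL
        && backend->supports_document_processing();
    if (document_capable && backend->process_document != NULL) {
        return backend->process_document(path);
    }
    return backend->process_image_file(path);
}

/* Demo callbacks: distinct return codes make the chosen path visible. */
int demo_image_file(const char *path) { (void)path; return 1; }
int demo_document(const char *path) { (void)path; return 2; }
bool demo_supports_documents(void) { return true; }
```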

KreuzbergOcrCacheStats

Field Type Default Description
total_files uintptr_t Total number of files in the cache
total_size_mb double Total cache size in megabytes

KreuzbergOcrConfidence

Confidence scores for an OCR element.

Separates detection confidence (how confident that text exists at this location) from recognition confidence (how confident about the actual text content).

Field Type Default Description
detection double* NULL Detection confidence: how confident the OCR engine is that text exists here. PaddleOCR provides this as box_score; Tesseract doesn't have a direct equivalent. Range: 0.0 to 1.0 (NULL if not available).
recognition double Recognition confidence: how confident about the text content. Range: 0.0 to 1.0.

KreuzbergOcrConfig

OCR configuration.

Field Type Default Description
enabled bool true Whether OCR is enabled. Setting enabled: false is a shorthand for disable_ocr: true on the parent ExtractionConfig. Images return metadata only; PDFs use native text extraction without OCR fallback. Defaults to true. When false, all other OCR settings are ignored.
backend const char* OCR backend: "tesseract", "easyocr", "paddleocr", or "vlm" (the latter requires vlm_config)
language const char* Language code (e.g., "eng", "deu")
tesseract_config KreuzbergTesseractConfig* NULL Tesseract-specific configuration (optional)
output_format KreuzbergOutputFormat* NULL Output format for OCR results (optional, for format conversion)
paddle_ocr_config void** NULL PaddleOCR-specific configuration (optional, JSON passthrough)
backend_options void** NULL Arbitrary per-call options passed through to the backend unchanged. Custom OCR backends and built-in backends that support runtime tuning can read this value and deserialize the keys they care about. Keys unknown to the backend are silently ignored. This is the recommended extension point for per-call parameters that are not covered by the typed fields above (e.g. mode switching, preprocessing flags, inference batch size). Scope: when pipeline is NULL, this value is propagated to the primary stage of the auto-constructed pipeline. When pipeline is explicitly set, this field has no effect; the caller must set OcrPipelineStage.backend_options directly on the relevant stage(s) instead. Example (JSON): { "mode": "fast", "enable_layout": true, "timeout_ms": 5000 }
element_config KreuzbergOcrElementConfig* NULL OCR element extraction configuration
quality_thresholds KreuzbergOcrQualityThresholds* NULL Quality thresholds for the native-text-to-OCR fallback decision. When NULL, uses compiled defaults (matching previous hardcoded behavior).
pipeline KreuzbergOcrPipelineConfig* NULL Multi-backend OCR pipeline configuration. When set, enables weighted fallback across multiple OCR backends based on output quality. When NULL, only the single backend field is used.
auto_rotate bool false Enable automatic page rotation based on orientation detection. When enabled, uses Tesseract's DetectOrientationScript() to detect page orientation (0/90/180/270 degrees) before OCR. If the page is rotated with high confidence, the image is corrected before recognition. This is critical for handling rotated scanned documents.
vlm_config KreuzbergLlmConfig* NULL VLM (Vision Language Model) OCR configuration. Required when backend is "vlm". Uses liter-llm to send page images to a vision model for text extraction.
vlm_prompt const char** NULL Custom Jinja2 prompt template for VLM OCR. When NULL, uses the default template. Available variables: {{ language }}, the document language code (e.g., "eng", "deu").
acceleration KreuzbergAccelerationConfig* NULL Hardware acceleration for ONNX Runtime models (e.g. PaddleOCR, layout detection). Not user-configurable via config files — injected at runtime from ExtractionConfig.acceleration before each process_image call.
tessdata_bytes void** NULL Caller-supplied Tesseract traineddata bytes per language code. Primary use case is the WASM build, which has no filesystem and cannot download tessdata at runtime. Native builds typically rely on TessdataManager and ignore this field. When present, the WASM Tesseract backend prefers these bytes over its compile-time-bundled English data. Skipped by serde to keep config files small — supply via the typed API at runtime.

Methods

kreuzberg_default()

Signature:

KreuzbergOcrConfig kreuzberg_default();

KreuzbergOcrElement

A unified OCR element representing detected text with full metadata.

This is the primary type for structured OCR output, preserving all information from both Tesseract and PaddleOCR backends.

Field Type Default Description
text const char* The recognized text content.
geometry KreuzbergOcrBoundingGeometry KREUZBERG_KREUZBERG_RECTANGLE Bounding geometry (rectangle or quadrilateral).
confidence KreuzbergOcrConfidence Confidence scores for detection and recognition.
level KreuzbergOcrElementLevel KREUZBERG_KREUZBERG_LINE Hierarchical level (word, line, block, page).
rotation KreuzbergOcrRotation* NULL Rotation information (if detected).
page_number uint32_t Page number (1-indexed).
parent_id const char** NULL Parent element ID for hierarchical relationships. Only used for Tesseract output which has word -> line -> block hierarchy.
backend_metadata void* NULL Backend-specific metadata that doesn't fit the unified schema.

KreuzbergOcrElementConfig

Configuration for OCR element extraction.

Controls how OCR elements are extracted and filtered.

Field Type Default Description
include_elements bool Whether to include OCR elements in the extraction result. When true, the ocr_elements field in ExtractionResult will be populated.
min_level KreuzbergOcrElementLevel KREUZBERG_KREUZBERG_LINE Minimum hierarchical level to include. Elements below this level (e.g., words when min_level is Line) will be excluded.
min_confidence double Minimum recognition confidence threshold (0.0-1.0). Elements with confidence below this threshold will be filtered out.
build_hierarchy bool Whether to build hierarchical relationships between elements. When true, parent_id fields will be populated based on spatial containment. Only meaningful for Tesseract output.
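
The two filters described above (min_level and min_confidence) can be sketched as a single in-place pass. This is an illustration only: the LevelSketch enum, its numeric ordering (word < line < block < page), and the ElementSketch struct are assumptions made for the example, not the library's definitions.

```c
#include <stddef.h>

/* Assumed ordering for this sketch: lower value = finer-grained level. */
typedef enum { LEVEL_WORD = 0, LEVEL_LINE = 1, LEVEL_BLOCK = 2, LEVEL_PAGE = 3 } LevelSketch;

/* Hypothetical, reduced stand-in for an OCR element. */
typedef struct {
    LevelSketch level;
    double recognition; /* recognition confidence, 0.0 to 1.0 */
} ElementSketch;

/* Compact `elems` in place, keeping entries at or above `min_level` whose
   recognition confidence is at least `min_confidence`. Returns the new count. */
size_t filter_elements(ElementSketch *elems, size_t n,
                       LevelSketch min_level, double min_confidence) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (elems[i].level >= min_level && elems[i].recognition >= min_confidence) {
            elems[kept++] = elems[i];
        }
    }
    return kept;
}
```

With min_level set to the line level, word-level elements are dropped regardless of how confident they are.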

KreuzbergOcrExtractionResult

OCR extraction result.

Result of performing OCR on an image or scanned document, including recognized text and detected tables.

Field Type Default Description
content const char* Recognized text content
mime_type const char* Original MIME type of the processed image
metadata void* OCR processing metadata (confidence scores, language, etc.)
tables KreuzbergOcrTable* Tables detected and extracted via OCR
ocr_elements KreuzbergOcrElement** /* serde(default) */ Structured OCR elements with bounding boxes and confidence scores. Available when TSV output is requested or table detection is enabled.
internal_document const char** NULL Structured document produced from hOCR parsing. Carries paragraph structure, bounding boxes, and confidence scores that the flattened content string discards.

KreuzbergOcrMetadata

OCR processing metadata.

Captures information about OCR processing configuration and results.

Field Type Default Description
language const char* OCR language code(s) used
psm int32_t Tesseract Page Segmentation Mode (PSM)
output_format const char* Output format (e.g., "text", "hocr")
table_count uint32_t Number of tables detected
table_rows uint32_t* NULL Table rows
table_cols uint32_t* NULL Table cols

KreuzbergOcrPipelineConfig

Multi-backend OCR pipeline with quality-based fallback.

Backends are tried in priority order (highest first). After each backend produces output, quality is evaluated. If it meets quality_thresholds.pipeline_min_quality, the result is accepted. Otherwise the next backend is tried.

Field Type Default Description
stages KreuzbergOcrPipelineStage* Ordered list of backends to try. Sorted by priority (descending) at runtime.
quality_thresholds KreuzbergOcrQualityThresholds /* serde(default) */ Quality thresholds for deciding whether to accept a result or try the next backend.
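
The fallback order described above can be sketched as follows. This is not the library's implementation: the PipelineStageSketch type is hypothetical, the quality field stands in for the score a real backend's output would receive, and returning NULL when no stage passes is an assumption, since the behavior in that case is not documented here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stage record; `quality` replaces actually running the backend. */
typedef struct {
    const char *backend;
    unsigned priority;
    double quality;
} PipelineStageSketch;

/* qsort comparator: higher priority sorts first. */
int by_priority_desc(const void *a, const void *b) {
    const PipelineStageSketch *x = a, *y = b;
    if (x->priority == y->priority) return 0;
    return (x->priority > y->priority) ? -1 : 1;
}

/* Sort stages by priority (descending) and return the name of the first
   backend whose quality meets `min_quality`, or NULL if none does. */
const char *run_pipeline(PipelineStageSketch *stages, size_t n, double min_quality) {
    qsort(stages, n, sizeof *stages, by_priority_desc);
    for (size_t i = 0; i < n; i++) {
        if (stages[i].quality >= min_quality) return stages[i].backend;
    }
    return NULL;
}
```

Note that the highest-priority stage is always tried first, even if a lower-priority stage would have scored better.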

KreuzbergOcrPipelineStage

A single backend stage in the OCR pipeline.

Field Type Default Description
backend const char* Backend name: "tesseract", "paddleocr", "easyocr", or a custom registered name.
priority uint32_t /* serde(default) */ Priority weight (higher = tried first). Stages are sorted by priority descending.
language const char** /* serde(default) */ Language override for this stage (NULL = use parent OcrConfig.language).
tesseract_config KreuzbergTesseractConfig* /* serde(default) */ Tesseract-specific config override for this stage.
paddle_ocr_config void** /* serde(default) */ PaddleOCR-specific config for this stage.
vlm_config KreuzbergLlmConfig* /* serde(default) */ VLM config override for this pipeline stage.
backend_options void** /* serde(default) */ Arbitrary per-call options passed through to the backend unchanged. Backends that support runtime tuning (mode switching, preprocessing flags, inference parameters, etc.) read this value and deserialize the keys they care about. Keys unknown to the backend are silently ignored, so options from different backends can coexist in the same config without conflict. Example (custom backend, JSON): { "mode": "fast", "enable_layout": true }

KreuzbergOcrQualityThresholds

Quality thresholds for OCR fallback decisions and pipeline quality gating.

All fields default to the values that match the previous hardcoded behavior, so OcrQualityThresholds.default() preserves existing semantics exactly.

Field Type Default Description
min_total_non_whitespace uintptr_t 64 Minimum total non-whitespace characters to consider text substantive.
min_non_whitespace_per_page double 32 Minimum non-whitespace characters per page on average.
min_meaningful_word_len uintptr_t 4 Minimum character count for a word to be "meaningful".
min_meaningful_words uintptr_t 3 Minimum count of meaningful words before text is accepted.
min_alnum_ratio double 0.3 Minimum alphanumeric ratio (non-whitespace chars that are alphanumeric).
min_garbage_chars uintptr_t 5 Minimum number of Unicode replacement characters (U+FFFD) required to trigger OCR fallback.
max_fragmented_word_ratio double 0.6 Maximum fraction of short (1-2 char) words before text is considered fragmented.
critical_fragmented_word_ratio double 0.8 Critical fragmentation threshold — triggers OCR regardless of meaningful words. Normal English text has ~20-30% short words. 80%+ is definitive garbage.
min_avg_word_length double 2 Minimum average word length. Below this with enough words indicates garbled extraction.
min_words_for_avg_length_check uintptr_t 50 Minimum word count before average word length check applies.
min_consecutive_repeat_ratio double 0.08 Minimum consecutive word repetition ratio to detect column scrambling.
min_words_for_repeat_check uintptr_t 50 Minimum word count before consecutive repetition check is applied.
substantive_min_chars uintptr_t 100 Minimum character count for "substantive markdown" OCR skip gate.
non_text_min_chars uintptr_t 20 Minimum character count for "non-text content" OCR skip gate.
alnum_ws_ratio_threshold double 0.4 Alphanumeric+whitespace ratio threshold for skip decisions.
pipeline_min_quality double 0.5 Minimum quality score (0.0-1.0) for a pipeline stage result to be accepted. If the result from a backend scores below this, try the next backend.
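
To make two of these thresholds concrete, the sketch below computes an alphanumeric ratio and a short-word ratio and compares them against min_alnum_ratio and max_fragmented_word_ratio. It is a simplified illustration under stated assumptions: it counts ASCII bytes rather than Unicode characters, it ignores every other threshold in the table, and the function names are hypothetical. The library's actual decision logic is not shown in this reference.

```c
#include <ctype.h>
#include <stddef.h>

/* Fraction of non-whitespace bytes that are ASCII alphanumeric (0.0 for empty input). */
double alnum_ratio(const char *s) {
    size_t non_ws = 0, alnum = 0;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (isspace(c)) continue;
        non_ws++;
        if (isalnum(c)) alnum++;
    }
    return non_ws ? (double)alnum / (double)non_ws : 0.0;
}

/* Fraction of whitespace-separated words that are 1 or 2 bytes long. */
double short_word_ratio(const char *s) {
    size_t words = 0, short_words = 0, len = 0;
    for (;; s++) {
        unsigned char c = (unsigned char)*s;
        if (c == '\0' || isspace(c)) {
            if (len > 0) {
                words++;
                if (len <= 2) short_words++;
            }
            len = 0;
            if (c == '\0') break;
        } else {
            len++;
        }
    }
    return words ? (double)short_words / (double)words : 0.0;
}

/* Returns 1 when the text fails either of the two checks, 0 otherwise. */
int looks_garbled(const char *s, double min_alnum_ratio, double max_fragmented_word_ratio) {
    return alnum_ratio(s) < min_alnum_ratio
        || short_word_ratio(s) > max_fragmented_word_ratio;
}
```

Called with the defaults from the table (0.3 and 0.6), ordinary prose passes both checks, while a run of one- and two-letter fragments or of punctuation does not.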

Methods

kreuzberg_default()

Signature:

KreuzbergOcrQualityThresholds kreuzberg_default();

KreuzbergOcrRotation

Rotation information for an OCR element.

Field Type Default Description
angle_degrees double Rotation angle in degrees (0, 90, 180, 270 for PaddleOCR).
confidence double* NULL Confidence score for the rotation detection.

KreuzbergOcrTable

Table detected via OCR.

Represents a table structure recognized during OCR processing.

Field Type Default Description
cells const char*** Table cells as a 2D vector (rows × columns)
markdown const char* Markdown representation of the table
page_number uint32_t Page number where the table was found (1-indexed)
bounding_box KreuzbergOcrTableBoundingBox* /* serde(default) */ Bounding box of the table in pixel coordinates (from OCR word positions).

KreuzbergOcrTableBoundingBox

Bounding box for an OCR-detected table in pixel coordinates.

Field Type Default Description
left uint32_t Left x-coordinate (pixels)
top uint32_t Top y-coordinate (pixels)
right uint32_t Right x-coordinate (pixels)
bottom uint32_t Bottom y-coordinate (pixels)

KreuzbergOrientationResult

Document orientation detection result.

Field Type Default Description
degrees uint32_t Detected orientation in degrees (0, 90, 180, or 270).
confidence float Confidence score (0.0-1.0).

KreuzbergPaddleOcrConfig

Configuration for PaddleOCR backend.

Configures PaddleOCR text detection and recognition with multi-language support. Uses a builder pattern for convenient configuration.

Field Type Default Description
language const char* Language code (e.g., "en", "ch", "jpn", "kor", "deu", "fra")
cache_dir const char** NULL Optional custom cache directory for model files
use_angle_cls bool Enable angle classification for rotated text (default: false). Can misfire on short text regions, rotating crops incorrectly before recognition.
enable_table_detection bool Enable table structure detection (default: false)
det_db_thresh float Binarization threshold for DB (Differentiable Binarization) text detection (default: 0.3). Range: 0.0-1.0; higher values require more confident detections.
det_db_box_thresh float Box threshold for text bounding box refinement (default: 0.5). Range: 0.0-1.0.
det_db_unclip_ratio float Unclip ratio for expanding text bounding boxes (default: 1.6). Controls the expansion of detected text regions.
det_limit_side_len uint32_t Maximum side length for the detection image (default: 960). Larger images may be resized to this limit for faster inference.
rec_batch_num uint32_t Batch size for recognition inference (default: 6), i.e. the number of text regions processed simultaneously.
padding uint32_t Padding in pixels added around the image before detection (default: 10). Large values can include surrounding content like table gridlines.
drop_score float Minimum recognition confidence score for text lines (default: 0.5). Text regions with recognition confidence below this threshold are discarded. Matches PaddleOCR Python's drop_score parameter. Range: 0.0-1.0.
model_tier const char* Model tier controlling the detection/recognition model size and accuracy trade-off. "mobile" (default): lightweight models (~4.5MB detection, ~16.5MB recognition) with fast download and inference. "server": large, high-accuracy models (~88MB detection, ~84MB recognition), best for GPU or complex documents.

Methods

kreuzberg_with_cache_dir()

Sets a custom cache directory for model files.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_cache_dir(const char* path);

kreuzberg_with_table_detection()

Enables or disables table structure detection.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_table_detection(bool enable);

kreuzberg_with_angle_cls()

Enables or disables angle classification for rotated text.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_angle_cls(bool enable);

kreuzberg_with_det_db_thresh()

Sets the binarization threshold for DB (Differentiable Binarization) text detection.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_det_db_thresh(float threshold);

kreuzberg_with_det_db_box_thresh()

Sets the box threshold for text bounding box refinement.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_det_db_box_thresh(float threshold);

kreuzberg_with_det_db_unclip_ratio()

Sets the unclip ratio for expanding text bounding boxes.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_det_db_unclip_ratio(float ratio);

kreuzberg_with_det_limit_side_len()

Sets the maximum side length for detection images.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_det_limit_side_len(uint32_t length);

kreuzberg_with_rec_batch_num()

Sets the batch size for recognition inference.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_rec_batch_num(uint32_t batch_size);

kreuzberg_with_drop_score()

Sets the minimum recognition confidence threshold.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_drop_score(float score);

kreuzberg_with_padding()

Sets padding in pixels added around images before detection.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_padding(uint32_t padding);

kreuzberg_with_model_tier()

Sets the model tier controlling detection/recognition model size.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_with_model_tier(const char* tier);

kreuzberg_default()

Creates a default configuration with English language support.

Signature:

KreuzbergPaddleOcrConfig kreuzberg_default();

KreuzbergPageBoundary

Byte offset boundary for a page.

Tracks where a specific page's content starts and ends in the main content string, enabling mapping from byte positions to page numbers. Offsets are guaranteed to be at valid UTF-8 character boundaries when using standard String methods (push_str, push, etc.).

Field Type Default Description
byte_start uintptr_t Byte offset where this page starts in the content string (UTF-8 valid boundary, inclusive)
byte_end uintptr_t Byte offset where this page ends in the content string (UTF-8 valid boundary, exclusive)
page_number uint32_t Page number (1-indexed)
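
Mapping a byte position back to a page number can be sketched with a binary search over the boundaries. This example assumes the boundaries are sorted by byte_start and do not overlap; the PageBoundarySketch type and page_for_offset helper are hypothetical, and the return value 0 for "no page" is a convention of this sketch (page numbers are 1-indexed).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the three documented fields. */
typedef struct {
    size_t byte_start;    /* inclusive */
    size_t byte_end;      /* exclusive */
    uint32_t page_number; /* 1-indexed */
} PageBoundarySketch;

/* Binary search over boundaries sorted by byte_start.
   Returns the page number containing `offset`, or 0 when no page contains it. */
uint32_t page_for_offset(const PageBoundarySketch *b, size_t n, size_t offset) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (offset < b[mid].byte_start) {
            hi = mid;
        } else if (offset >= b[mid].byte_end) {
            lo = mid + 1;
        } else {
            return b[mid].page_number;
        }
    }
    return 0;
}
```

Because byte_end is exclusive, an offset equal to one page's byte_end belongs to the following page, if there is one.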

KreuzbergPageConfig

Page extraction and tracking configuration.

Controls how pages are extracted, tracked, and represented in the extraction results. When NULL, page tracking is disabled.

Page range tracking in chunk metadata (first_page/last_page) is automatically enabled when page boundaries are available and chunking is configured.

Field Type Default Description
extract_pages bool false Extract pages as separate array (ExtractionResult.pages)
insert_page_markers bool false Insert page markers in main content string
marker_format const char* "\n\n\n\n" Page marker format (use the {page_num} placeholder). Default: "\n\n\n\n"

Methods

kreuzberg_default()

Signature:

KreuzbergPageConfig kreuzberg_default();
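
For the marker_format field of KreuzbergPageConfig, the {page_num} placeholder substitution can be sketched as below. This only illustrates the idea of expanding the placeholder; the render_marker function and the example marker text are hypothetical, and the library's own formatting code is not shown in this reference.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Expand every "{page_num}" in `fmt` into `out` (capacity `cap`, including the
   terminating NUL). Returns 0 on success, -1 if the result does not fit. */
int render_marker(const char *fmt, uint32_t page, char *out, size_t cap) {
    static const char token[] = "{page_num}";
    const size_t token_len = sizeof token - 1;
    size_t used = 0;

    if (cap == 0) return -1;
    while (*fmt) {
        if (strncmp(fmt, token, token_len) == 0) {
            /* Write the page number; snprintf reports the length it needed. */
            int w = snprintf(out + used, cap - used, "%u", (unsigned)page);
            if (w < 0 || (size_t)w >= cap - used) return -1;
            used += (size_t)w;
            fmt += token_len;
        } else {
            /* Copy one literal byte, leaving room for the final NUL. */
            if (used + 1 >= cap) return -1;
            out[used++] = *fmt++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

For instance, a hypothetical format of "\n\n[page {page_num}]\n\n" rendered for page 7 yields "\n\n[page 7]\n\n".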

KreuzbergPageContent

Content for a single page/slide.

When page extraction is enabled, documents are split into per-page content with associated tables and images mapped to each page.

Performance

Uses Arc-wrapped tables and images for memory efficiency:

  • Vec<Arc<Table>> enables zero-copy sharing of table data
  • Vec<Arc<ExtractedImage>> enables zero-copy sharing of image data
  • Maintains exact JSON compatibility via custom Serialize/Deserialize

This reduces memory overhead for documents with shared tables/images by avoiding redundant copies during serialization.

Field Type Default Description
page_number uint32_t Page number (1-indexed)
content const char* Text content for this page
tables KreuzbergTable* /* serde(default) */ Tables found on this page (uses Arc for memory efficiency). Serializes as Vec for JSON compatibility while maintaining Arc semantics in-memory for zero-copy sharing.
image_indices uint32_t* /* serde(default) */ Indices into ExtractionResult.images for images found on this page. Each value is a zero-based index into the top-level images collection. Only populated when extract_images = true in the extraction config.
hierarchy KreuzbergPageHierarchy* NULL Hierarchy information for the page (when hierarchy extraction is enabled). Contains text hierarchy levels (H1-H6) extracted from the page content.
is_blank bool* NULL Whether this page is blank (no meaningful text content). Determined during extraction based on text content analysis. A page is blank if it has fewer than 3 non-whitespace characters and contains no tables or images.
layout_regions KreuzbergLayoutRegion** NULL Layout detection regions for this page (when layout detection is enabled). Contains detected layout regions with class, confidence, bounding box, and area fraction. Only populated when layout detection is configured.
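
The blank-page rule stated for is_blank (fewer than 3 non-whitespace characters and no tables or images) can be sketched as a small predicate. The function name is hypothetical, and counting bytes rather than Unicode characters is a simplifying assumption of this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 when the page has fewer than 3 non-whitespace bytes and
   carries no tables or images, 0 otherwise. */
int page_is_blank(const char *content, size_t table_count, size_t image_count) {
    size_t non_ws = 0;
    for (; *content; content++) {
        /* Stop early once the third non-whitespace byte is seen. */
        if (!isspace((unsigned char)*content) && ++non_ws >= 3) return 0;
    }
    return table_count == 0 && image_count == 0;
}
```

A page with empty text but one detected table is therefore not blank under this rule.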

KreuzbergPageHierarchy

Page hierarchy structure containing heading levels and block information.

Used when PDF text hierarchy extraction is enabled. Contains hierarchical blocks with heading levels (H1-H6) for semantic document structure.

Field Type Default Description
block_count uint32_t Number of hierarchy blocks on this page
blocks KreuzbergHierarchicalBlock* /* serde(default) */ Hierarchical blocks with heading levels

KreuzbergPageInfo

Metadata for an individual page, slide, or sheet.

Captures per-page information including dimensions, content counts, and visibility state (for presentations).

Field Type Default Description
number uint32_t Page number (1-indexed)
title const char** NULL Page title (usually for presentations)
dimensions double** NULL Dimensions in points (PDF) or pixels (images): (width, height)
image_count uint32_t* NULL Number of images on this page
table_count uint32_t* NULL Number of tables on this page
hidden bool* NULL Whether this page is hidden (e.g., in presentations)
is_blank bool* NULL Whether this page is blank (no meaningful text, no images, no tables). A page is considered blank if it has fewer than 3 non-whitespace characters and contains no tables or images. This is useful for filtering out empty pages in scanned documents or PDFs with blank separator pages.
has_vector_graphics bool /* serde(default) */ Whether this page contains non-trivial vector graphics (paths, shapes, curves). Indicates the presence of vector-drawn content such as charts, diagrams, or geometric shapes (e.g., from Adobe InDesign, LaTeX TikZ). These are invisible to ExtractionResult.images since they are not embedded as raster XObjects. Set to true when path count exceeds a heuristic threshold, signaling that downstream consumers may want to rasterize the page to capture this content. Only populated for PDFs; NULL for other document types.

KreuzbergPageStructure

Unified page structure for documents.

Supports different page types (PDF pages, PPTX slides, Excel sheets) with byte offset boundaries for chunk-to-page mapping.

Field Type Default Description
total_count uint32_t Total number of pages/slides/sheets
unit_type KreuzbergPageUnitType Type of paginated unit
boundaries KreuzbergPageBoundary** NULL Byte offset boundaries for each page. Maps byte ranges in the extracted content to page numbers. Used for chunk page range calculation.
pages KreuzbergPageInfo** NULL Detailed per-page metadata (optional, only when needed)

KreuzbergPdfAnnotation

A PDF annotation extracted from a document page.

Field Type Default Description
annotation_type KreuzbergPdfAnnotationType The type of annotation.
content const char** NULL Text content of the annotation (e.g., comment text, link URL).
page_number uint32_t Page number where the annotation appears (1-indexed).
bounding_box KreuzbergBoundingBox* NULL Bounding box of the annotation on the page.

KreuzbergPdfConfig

PDF-specific configuration.

Field Type Default Description
extract_images bool false Extract images from PDF
extract_tables bool true Extract tables from PDF. When true (default), runs pdf_oxide's native grid detector and, if it finds nothing, falls back to the heuristic text-layer reconstruction in pdf.oxide.table.extract_tables_heuristic. Set to false to skip both passes — tables will then be empty in the result.
passwords const char*** NULL List of passwords to try when opening encrypted PDFs
extract_metadata bool true Extract PDF metadata
hierarchy KreuzbergHierarchyConfig* NULL Hierarchy extraction configuration (NULL = hierarchy extraction disabled)
extract_annotations bool false Extract PDF annotations (text notes, highlights, links, stamps). Default: false
top_margin_fraction float* NULL Top margin fraction (0.0–1.0) of page height to exclude headers/running heads. Default: 0.06 (6%)
bottom_margin_fraction float* NULL Bottom margin fraction (0.0–1.0) of page height to exclude footers/page numbers. Default: 0.05 (5%)
allow_single_column_tables bool false Allow single-column pseudo tables in extraction results. By default, tables with fewer than 2 columns (layout-guided) or 3 columns (heuristic) are rejected. When true, the minimum column count is relaxed to 1, allowing single-column structured data (glossaries, itemized lists) to be emitted as tables. Other quality filters (density, sparsity, prose detection) still apply.
ocr_inline_images bool false Perform OCR on inline images extracted from PDF pages and attach the recognized text to each ExtractedImage.ocr_result. Requires Tesseract to be available; if ExtractionConfig.ocr is NULL the extractor falls back to TesseractConfig.default(). Per-image failures degrade gracefully (the image is returned without OCR text rather than failing the whole extraction). Default: false.
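
As a worked example of the margin fractions, the sketch below converts top_margin_fraction and bottom_margin_fraction into band heights for a given page height, falling back to the documented defaults (0.06 and 0.05) when a pointer is NULL. The margin_bands function and the two macros are hypothetical names for this illustration. Each band is measured inward from its own page edge, so no assumption is made about the coordinate origin.

```c
#include <stddef.h>

/* Defaults as documented for the two margin fields. */
#define DEFAULT_TOP_MARGIN_FRACTION    0.06f
#define DEFAULT_BOTTOM_MARGIN_FRACTION 0.05f

/* Compute the height of the excluded top and bottom bands, in the same unit
   as `page_height`. A NULL fraction pointer selects the documented default. */
void margin_bands(double page_height,
                  const float *top_fraction, const float *bottom_fraction,
                  double *top_band, double *bottom_band) {
    float top = top_fraction ? *top_fraction : DEFAULT_TOP_MARGIN_FRACTION;
    float bottom = bottom_fraction ? *bottom_fraction : DEFAULT_BOTTOM_MARGIN_FRACTION;
    *top_band = page_height * top;
    *bottom_band = page_height * bottom;
}
```

For a US Letter page (792 points high), the defaults give a top band of roughly 47.5 points and a bottom band of roughly 39.6 points.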

Methods

kreuzberg_default()

Signature:

KreuzbergPdfConfig kreuzberg_default();

KreuzbergPdfMetadata

PDF-specific metadata.

Contains metadata fields specific to PDF documents that are not in the common Metadata structure. Common fields like title, authors, keywords, and dates are at the Metadata level.

Field Type Default Description
pdf_version const char** NULL PDF version (e.g., "1.7", "2.0")
producer const char** NULL PDF producer (application that created the PDF)
is_encrypted bool* NULL Whether the PDF is encrypted/password-protected
width int64_t* NULL First page width in points (1/72 inch)
height int64_t* NULL First page height in points (1/72 inch)
page_count uint32_t* NULL Total number of pages in the PDF document

KreuzbergPlugin

Base trait that all plugins must implement.

This trait provides common functionality for plugin lifecycle management, identification, and metadata.

Thread Safety

All plugins must be Send + Sync to support concurrent usage across threads.

Methods

kreuzberg_name()

Returns the unique name/identifier for this plugin.

The name should be:

  • Unique across all plugins
  • Lowercase with hyphens (e.g., "my-custom-plugin")
  • URL-safe characters only

Signature:

const char* kreuzberg_name();
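
The naming guidance above can be checked with a small validator. This is a sketch under stated assumptions: it accepts only lowercase ASCII letters, digits, and hyphens, which is one reasonable reading of "lowercase with hyphens" and "URL-safe"; allowing digits is an assumption. The is_valid_plugin_name helper is hypothetical, and uniqueness across plugins cannot be checked this way.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when `name` is non-empty and contains only a-z, 0-9, or '-'. */
bool is_valid_plugin_name(const char *name) {
    if (name == NULL || *name == '\0') return false;
    for (const char *p = name; *p; p++) {
        bool ok = (*p >= 'a' && *p <= 'z')
               || (*p >= '0' && *p <= '9')
               || *p == '-';
        if (!ok) return false;
    }
    return true;
}
```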

kreuzberg_version()

Returns the semantic version of this plugin.

Should follow semver format: MAJOR.MINOR.PATCH

Defaults to the kreuzberg crate version.

Signature:

const char* kreuzberg_version();

kreuzberg_initialize()

Initialize the plugin.

Called once when the plugin is registered. Use this to:

  • Load configuration
  • Initialize resources (connections, caches, etc.)
  • Validate dependencies

Thread Safety

This method takes &self instead of &mut self to work with Arc<dyn Plugin>. Plugins needing mutable state during initialization should use interior mutability patterns (Mutex, RwLock, OnceCell, etc.).

Errors:

Should return an error if initialization fails. The plugin will not be registered if this method returns an error.

Defaults to a no-op for stateless plugins.

Signature:

void kreuzberg_initialize();

kreuzberg_shutdown()

Shutdown the plugin.

Called when the plugin is being unregistered or the application is shutting down. Use this to:

  • Close connections
  • Flush caches
  • Release resources

Thread Safety

This method takes &self instead of &mut self to work with Arc<dyn Plugin>. Plugins needing mutable state during shutdown should use interior mutability patterns (Mutex, RwLock, etc.).

Errors:

Errors during shutdown are logged but don't prevent the shutdown process.

Defaults to a no-op for stateless plugins.

Signature:

void kreuzberg_shutdown();

kreuzberg_description()

Optional plugin description for debugging and logging.

Defaults to empty string if not overridden.

Signature:

const char* kreuzberg_description();

kreuzberg_author()

Optional plugin author information.

Defaults to empty string if not overridden.

Signature:

const char* kreuzberg_author();

KreuzbergPostProcessor

Trait for post-processor plugins.

Post-processors transform or enrich extraction results after the initial extraction is complete. They can:

  • Clean and normalize text
  • Add metadata (language, keywords, entities)
  • Split content into chunks
  • Score quality
  • Apply custom transformations

Processing Order

Post-processors are executed in stage order:

  1. Early - Language detection, entity extraction
  2. Middle - Keyword extraction, token reduction
  3. Late - Custom hooks, final validation

Within each stage, processors are executed in registration order.

Error Handling

Post-processor errors are non-fatal by default - they're captured in metadata and execution continues. To make errors fatal, return an error from process().

Thread Safety

Post-processors must be thread-safe (Send + Sync).

Methods

kreuzberg_process()

Process an extraction result.

Transform or enrich the extraction result. Can modify:

  • content - The extracted text
  • metadata - Add or update metadata fields
  • tables - Modify or enhance table data

Returns:

Ok(()) if processing succeeded, Err(...) for fatal failures.

Errors:

Return errors for fatal processing failures. Non-fatal errors should be captured in metadata directly on the result.

Performance

This signature avoids unnecessary cloning of large extraction results by taking a mutable reference instead of ownership. Processors modify the result in place.

Example - Language Detection

Example - Text Cleaning

async fn process(&self, result: &mut ExtractionResult, config: &ExtractionConfig)
    -> Result<()> {
    // Remove excessive whitespace
    result.content = result
        .content
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ");

    Ok(())
}

Signature:

void kreuzberg_process(KreuzbergExtractionResult result, KreuzbergExtractionConfig config);

kreuzberg_processing_stage()

Get the processing stage for this post-processor.

Determines when this processor runs in the pipeline.

Returns:

The ProcessingStage (Early, Middle, or Late).

Signature:

KreuzbergProcessingStage kreuzberg_processing_stage();

kreuzberg_should_process()

Optional: Check if this processor should run for a given result.

Allows conditional processing based on MIME type, metadata, or content. Defaults to true (always run).

Returns:

true if the processor should run, false to skip.

Signature:

bool kreuzberg_should_process(KreuzbergExtractionResult result, KreuzbergExtractionConfig config);

kreuzberg_estimated_duration_ms()

Optional: Estimate processing time in milliseconds.

Used for logging and debugging. Defaults to 0 (unknown).

Returns:

Estimated processing time in milliseconds.

Signature:

uint64_t kreuzberg_estimated_duration_ms(KreuzbergExtractionResult result);

kreuzberg_priority()

Execution priority within the processing stage.

Higher values run first within the same ProcessingStage. Defaults to 50. Use 0-49 for fallback processors, 50 for normal processors, and 51-255 for high-priority processors that should run early in their stage.

Signature:

int32_t kreuzberg_priority();
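
The ordering rules for post-processors (stage order first, then higher priority first within a stage) can be sketched with a qsort comparator. Using registration order as the final tie-break between equal priorities is an assumption of this sketch, as are the ProcStageSketch and ProcessorSketch types; the library's scheduler is not shown in this reference.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed numeric ordering: earlier stages have smaller values. */
typedef enum { STAGE_EARLY = 0, STAGE_MIDDLE = 1, STAGE_LATE = 2 } ProcStageSketch;

/* Hypothetical record describing one registered post-processor. */
typedef struct {
    const char *name;
    ProcStageSketch stage;
    int priority;         /* higher runs first within a stage */
    size_t registered_at; /* registration sequence number, used as tie-break */
} ProcessorSketch;

/* qsort comparator: stage ascending, priority descending, registration ascending. */
int by_execution_order(const void *a, const void *b) {
    const ProcessorSketch *x = a, *y = b;
    if (x->stage != y->stage) return (x->stage < y->stage) ? -1 : 1;
    if (x->priority != y->priority) return (x->priority > y->priority) ? -1 : 1;
    if (x->registered_at != y->registered_at)
        return (x->registered_at < y->registered_at) ? -1 : 1;
    return 0;
}

/* Sort processors into the order in which they would run. */
void sort_processors(ProcessorSketch *p, size_t n) {
    qsort(p, n, sizeof *p, by_execution_order);
}
```

Including the registration index in the comparator makes the result deterministic even though qsort itself is not guaranteed to be stable.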

KreuzbergPostProcessorConfig

Post-processor configuration.

Field Type Default Description
enabled bool true Enable post-processors
enabled_processors const char*** NULL Whitelist of processor names to run (NULL = all enabled)
disabled_processors const char*** NULL Blacklist of processor names to skip (NULL = none disabled)
enabled_set const char*** NULL Pre-computed AHashSet for O(1) enabled processor lookup
disabled_set const char*** NULL Pre-computed AHashSet for O(1) disabled processor lookup
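
How the enabled flag, whitelist, and blacklist might combine can be sketched as a single predicate. Several details here are assumptions for the sake of the example: the lists are modeled as NULL-terminated arrays of strings, the blacklist is applied after the whitelist when both are set, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* `list` is a NULL-terminated array of names (layout assumed for this sketch). */
bool list_contains(const char *const *list, const char *name) {
    for (; *list != NULL; list++) {
        if (strcmp(*list, name) == 0) return true;
    }
    return false;
}

/* Decide whether the processor called `name` should run.
   A NULL whitelist means "all enabled"; a NULL blacklist means "none disabled". */
bool should_run(bool enabled,
                const char *const *enabled_processors,
                const char *const *disabled_processors,
                const char *name) {
    if (!enabled) return false;
    if (enabled_processors != NULL && !list_contains(enabled_processors, name)) return false;
    if (disabled_processors != NULL && list_contains(disabled_processors, name)) return false;
    return true;
}
```

The linear scan is fine for short lists; the enabled_set and disabled_set fields suggest the library precomputes hash sets for the same lookup.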

Methods

kreuzberg_default()

Signature:

KreuzbergPostProcessorConfig kreuzberg_default();

KreuzbergPptxAppProperties

Application properties from docProps/app.xml for PPTX.

Contains PowerPoint-specific document metadata.

Field Type Default Description
application const char** NULL Application name (e.g., "Microsoft Office PowerPoint")
app_version const char** NULL Application version
total_time int32_t* NULL Total editing time in minutes
company const char** NULL Company name
doc_security int32_t* NULL Document security level
scale_crop bool* NULL Scale crop flag
links_up_to_date bool* NULL Links up to date flag
shared_doc bool* NULL Shared document flag
hyperlinks_changed bool* NULL Hyperlinks changed flag
slides int32_t* NULL Number of slides
notes int32_t* NULL Number of notes
hidden_slides int32_t* NULL Number of hidden slides
multimedia_clips int32_t* NULL Number of multimedia clips
presentation_format const char** NULL Presentation format (e.g., "Widescreen", "Standard")
slide_titles const char** NULL Slide titles

KreuzbergPptxExtractionResult

PowerPoint (PPTX) extraction result.

Contains extracted slide content, metadata, and embedded images/tables.

Field Type Default Description
content const char* Extracted text content from all slides
metadata KreuzbergPptxMetadata Presentation metadata
slide_count uintptr_t Total number of slides
image_count uintptr_t Total number of embedded images
table_count uintptr_t Total number of tables
images KreuzbergExtractedImage* Extracted images from the presentation
page_structure KreuzbergPageStructure* NULL Slide structure with boundaries (when page tracking is enabled)
page_contents KreuzbergPageContent** NULL Per-slide content (when page tracking is enabled)
document KreuzbergDocumentStructure* NULL Structured document representation
hyperlinks const char** /* serde(default) */ Hyperlinks discovered in slides as (url, optional_label) pairs.
office_metadata void* /* serde(default) */ Office metadata extracted from docProps/core.xml and docProps/app.xml. Contains keys like "title", "author", "created_by", "subject", "keywords", "modified_by", "created_at", "modified_at", etc.

KreuzbergPptxMetadata

PowerPoint presentation metadata.

Extracted from PPTX files containing slide counts and presentation details.

Field Type Default Description
slide_count uint32_t Total number of slides in the presentation
slide_names const char** NULL Names of slides (if available)
image_count uint32_t* NULL Number of embedded images
table_count uint32_t* NULL Number of tables

KreuzbergProcessingWarning

A non-fatal warning from a processing pipeline stage.

Captures errors from optional features that don't prevent extraction but may indicate degraded results.

Field Type Default Description
source const char* The pipeline stage or feature that produced this warning (e.g., "embedding", "chunking", "language_detection", "output_format").
message const char* Human-readable description of what went wrong.

KreuzbergPstMetadata

Outlook PST archive metadata.

Field Type Default Description
message_count uintptr_t Number of messages

KreuzbergRakeParams

RAKE-specific parameters.

Field Type Default Description
min_word_length uintptr_t 1 Minimum word length to consider (default: 1).
max_words_per_phrase uintptr_t 3 Maximum words in a keyword phrase (default: 3).

Methods

kreuzberg_default()

Signature:

KreuzbergRakeParams kreuzberg_default();

KreuzbergRecognizedTable

Pre-computed table markdown for a table detection region.

Produced by the TATR-based table structure recognizer and surfaced as part of layout-aware OCR results. The struct lives here (under layout-types, pure-Rust) so that consumers who do not enable layout-detection (ORT) can still reference the type in their own code.

Field Type Default Description
detection_bbox KreuzbergBBox Detection bbox that this table corresponds to (for matching).
cells const char*** Table cells as a 2D vector (rows × columns).
markdown const char* Rendered markdown table.

KreuzbergRenderer

Trait for document renderers that convert InternalDocument to output strings.

Renderers are typically stateless converters that transform the internal document representation into a specific output format (Markdown, HTML, Djot, plain text, etc.). They participate in the standard Plugin lifecycle so custom renderers can be registered from any supported binding language.

The format name is exposed via Plugin.name. For stateless renderers, the Plugin lifecycle methods (version, initialize, shutdown) all have no-op defaults and need not be overridden.

Thread Safety

Renderers must be Send + Sync (inherited from Plugin).

Methods

kreuzberg_render()

Render an InternalDocument to the output format.

Returns:

The rendered output as a string.

Errors:

Returns an error if rendering fails.

Signature:

const char* kreuzberg_render(KreuzbergInternalDocument doc);

KreuzbergSecurityLimits

Configuration for security limits across extractors.

All limits are intentionally conservative to prevent DoS attacks while still supporting legitimate documents.

Field Type Default Description
max_archive_size uintptr_t 524288000 Maximum uncompressed size for archives (500 MB)
max_compression_ratio uintptr_t 100 Maximum compression ratio before flagging as potential bomb (100:1)
max_files_in_archive uintptr_t 10000 Maximum number of files in archive (10,000)
max_nesting_depth uintptr_t 1024 Maximum nesting depth for structures (1024)
max_entity_length uintptr_t 1048576 Maximum length of any single XML entity / attribute / token (1 MiB). This is a per-token cap, NOT a total cap — billion-laughs class attacks where a single entity expands to hundreds of MB are caught here, while normal long text content (a paragraph, a CDATA block) is caught by max_content_size instead.
max_content_size uintptr_t 104857600 Maximum string growth per document (100 MB)
max_iterations uintptr_t 10000000 Maximum iterations per operation
max_xml_depth uintptr_t 1024 Maximum XML depth (1024 levels)
max_table_cells uintptr_t 100000 Maximum cells per table (100,000)

Methods

kreuzberg_default()

Signature:

KreuzbergSecurityLimits kreuzberg_default();
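
As a rough illustration of how limits like these could be applied, the sketch below mirrors three of the fields above in a local struct and checks an archive's sizes against them. `SketchLimits` and `sketch_archive_within_limits` are hypothetical names that are not part of the Kreuzberg API, and the library's actual enforcement logic may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local mirror of three KreuzbergSecurityLimits fields. */
typedef struct {
    uintptr_t max_archive_size;      /* bytes, uncompressed */
    uintptr_t max_compression_ratio; /* 100 means 100:1 */
    uintptr_t max_files_in_archive;
} SketchLimits;

/* Returns true when the archive stays within all three limits. */
bool sketch_archive_within_limits(const SketchLimits *limits,
                                  uintptr_t compressed_bytes,
                                  uintptr_t uncompressed_bytes,
                                  uintptr_t file_count) {
    if (uncompressed_bytes > limits->max_archive_size) return false;
    if (file_count > limits->max_files_in_archive) return false;

    /* Zero compressed bytes producing output is treated as suspicious. */
    if (compressed_bytes == 0) return uncompressed_bytes == 0;

    /* Compare quotient and remainder instead of multiplying, so the
     * check cannot overflow. Exactly ratio:1 is still accepted. */
    uintptr_t q = uncompressed_bytes / compressed_bytes;
    uintptr_t r = uncompressed_bytes % compressed_bytes;
    if (q > limits->max_compression_ratio) return false;
    if (q == limits->max_compression_ratio && r > 0) return false;
    return true;
}
```

With the documented defaults (500 MB, 100:1, 10,000 files), a 1,000-byte archive expanding to 100,000 bytes passes this check, while one expanding to 100,001 bytes does not.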

KreuzbergServerConfig

API server configuration.

This struct holds all configuration options for the Kreuzberg API server, including host/port settings, CORS configuration, and upload limits.

Defaults

  • host: "127.0.0.1" (localhost only)
  • port: 8000
  • cors_origins: empty vector (allows all origins)
  • max_request_body_bytes: 104_857_600 (100 MB)
  • max_multipart_field_bytes: 104_857_600 (100 MB)

Field Type Default Description
host const char* Server host address (e.g., "127.0.0.1", "0.0.0.0")
port uint16_t Server port number
cors_origins const char** NULL CORS allowed origins. An empty vector means the server accepts requests from any origin. If populated with specific origins (e.g., "https://example.com"), only those origins are allowed.
max_request_body_bytes uintptr_t Maximum size of request body in bytes (default: 100 MB)
max_multipart_field_bytes uintptr_t Maximum size of multipart fields in bytes (default: 100 MB)

Methods

kreuzberg_default()

Signature:

KreuzbergServerConfig kreuzberg_default();

kreuzberg_listen_addr()

Get the server listen address (host:port).

Signature:

const char* kreuzberg_listen_addr();

kreuzberg_cors_allows_all()

Check if CORS allows all origins.

Returns true if the cors_origins vector is empty, meaning all origins are allowed. Returns false if specific origins are configured.

Signature:

bool kreuzberg_cors_allows_all();

kreuzberg_is_origin_allowed()

Check if a given origin is allowed by CORS configuration.

Returns true if:

  • CORS allows all origins (empty origins list), or
  • The given origin is in the allowed origins list

Signature:

bool kreuzberg_is_origin_allowed(const char* origin);
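
The rule above can be restated as a small standalone function. This is only a sketch of the documented semantics, not the library's code: `sketch_is_origin_allowed` is a hypothetical name, it takes the origin list explicitly, and it assumes exact, case-sensitive string comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* origins: array of allowed origin strings; count: number of entries.
 * Assumption: origins are compared byte-for-byte. */
bool sketch_is_origin_allowed(const char *const *origins, size_t count,
                              const char *origin) {
    if (count == 0) return true; /* empty list: every origin is allowed */
    for (size_t i = 0; i < count; i++) {
        if (strcmp(origins[i], origin) == 0) return true;
    }
    return false;
}
```

An empty list accepts every caller, so a deployment that wants to restrict origins has to list at least one explicitly.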

kreuzberg_max_request_body_mb()

Get maximum request body size in megabytes (rounded up).

Signature:

uintptr_t kreuzberg_max_request_body_mb();

kreuzberg_max_multipart_field_mb()

Get maximum multipart field size in megabytes (rounded up).

Signature:

uintptr_t kreuzberg_max_multipart_field_mb();
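
The "rounded up" behaviour of the two `*_mb()` getters can be pictured as a ceiling division. The helper below is a hypothetical sketch; it assumes 1 MB means 1,048,576 bytes, which is consistent with 104_857_600 bytes being documented as 100 MB.

```c
#include <stdint.h>

/* Assumption: 1 MB = 1,048,576 bytes. */
#define SKETCH_BYTES_PER_MB ((uintptr_t)1048576)

/* Ceiling division without overflow: the number of whole megabytes,
 * plus one if any bytes are left over. */
uintptr_t sketch_bytes_to_mb_rounded_up(uintptr_t bytes) {
    return bytes / SKETCH_BYTES_PER_MB + (bytes % SKETCH_BYTES_PER_MB != 0);
}
```

Under this assumption the default limit of 104,857,600 bytes maps to 100, and a limit one byte larger maps to 101.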

KreuzbergStructuredData

Structured data (Schema.org, microdata, RDFa) block.

Field Type Default Description
data_type KreuzbergStructuredDataType Type of structured data
raw_json const char* Raw JSON string representation
schema_type const char** NULL Schema type if detectable (e.g., "Article", "Event", "Product")

KreuzbergStructuredDataResult

Field Type Default Description
content const char* The extracted text content
format const char* Format
metadata void* Document metadata
text_fields const char** Text fields

KreuzbergStructuredExtractionConfig

Configuration for LLM-based structured data extraction.

Sends extracted document content to an LLM with a JSON schema, returning structured data that conforms to the schema.

Field Type Default Description
schema void* JSON Schema defining the desired output structure.
schema_name const char* /* serde(default) */ Schema name passed to the LLM's structured output mode.
schema_description const char** /* serde(default) */ Optional schema description for the LLM.
strict bool /* serde(default) */ Enable strict mode — output must exactly match the schema.
prompt const char** /* serde(default) */ Custom Jinja2 extraction prompt template. When NULL, a default template is used. Available template variables: - {{ content }} — The extracted document text. - {{ schema }} — The JSON schema as a formatted string. - {{ schema_name }} — The schema name. - {{ schema_description }} — The schema description (may be empty).
llm KreuzbergLlmConfig LLM configuration for the extraction.

KreuzbergSupportedFormat

A supported document format entry.

Represents a file extension and its corresponding MIME type that Kreuzberg can process.

Field Type Default Description
extension const char* File extension (without leading dot), e.g., "pdf", "docx"
mime_type const char* MIME type string, e.g., "application/pdf"

KreuzbergTable

Extracted table structure.

Represents a table detected and extracted from a document (PDF, image, etc.). Tables are converted to both structured cell data and Markdown format.

Field Type Default Description
cells const char*** NULL Table cells as a 2D vector (rows × columns)
markdown const char* Markdown representation of the table
page_number uint32_t Page number where the table was found (1-indexed)
bounding_box KreuzbergBoundingBox* NULL Bounding box of the table on the page (PDF coordinates: x0=left, y0=bottom, x1=right, y1=top). Only populated for PDF-extracted tables when position data is available.
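
Because the bounding box uses PDF coordinates (origin at the bottom-left, y growing upward), consumers that draw in a top-left-origin system need to flip the vertical axis. The sketch below shows that arithmetic with a hypothetical `SketchBBox` struct; the actual field types of KreuzbergBoundingBox are not shown in this section and are assumed to be floating point here.

```c
/* Hypothetical stand-in for a PDF-space bounding box:
 * x0 = left, y0 = bottom, x1 = right, y1 = top. */
typedef struct {
    double x0, y0, x1, y1;
} SketchBBox;

double sketch_bbox_width(const SketchBBox *b) { return b->x1 - b->x0; }

double sketch_bbox_height(const SketchBBox *b) { return b->y1 - b->y0; }

/* Distance from the top edge of the page to the top edge of the box,
 * for renderers whose origin is the top-left corner. */
double sketch_bbox_top_from_page_top(const SketchBBox *b, double page_height) {
    return page_height - b->y1;
}
```

For example, on a 792-point-tall page, a box with y1 = 700 starts 92 points below the top edge.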

KreuzbergTableCell

Individual table cell with content and optional styling.

Future extension point for rich table support with cell-level metadata.

Field Type Default Description
content const char* Cell content as text
row_span uint32_t Row span (number of rows this cell spans)
col_span uint32_t Column span (number of columns this cell spans)
is_header bool Whether this is a header cell

KreuzbergTableGrid

Structured table grid with cell-level metadata.

Stores row/column dimensions and a flat list of cells with position info.

Field Type Default Description
rows uint32_t Number of rows in the table.
cols uint32_t Number of columns in the table.
cells KreuzbergGridCell* NULL All cells in row-major order.
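
Row-major order means that all cells of row 0 come first, then row 1, and so on. The helper below sketches the usual index arithmetic under the assumption that the grid is dense, with exactly one entry per (row, column) position. That assumption may not hold when cells span several rows or columns, in which case the position info stored on each cell should be used instead. `sketch_grid_index` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the flat index of (row, col) in a dense row-major grid,
 * or SIZE_MAX when the position is out of range. */
size_t sketch_grid_index(uint32_t rows, uint32_t cols,
                         uint32_t row, uint32_t col) {
    if (row >= rows || col >= cols) return SIZE_MAX;
    return (size_t)row * cols + col;
}
```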

KreuzbergTesseractConfig

Tesseract OCR configuration.

Provides fine-grained control over Tesseract OCR engine parameters. Most users can use the defaults, but these settings allow optimization for specific document types (invoices, handwriting, etc.).

Field Type Default Description
language const char* "eng" Language code (e.g., "eng", "deu", "fra")
psm int32_t 3 Page Segmentation Mode (0-13). Common values: - 3: Fully automatic page segmentation (native default) - 6: Assume a single uniform block of text (WASM default — avoids layout-analysis hang) - 11: Sparse text with no particular order
output_format const char* "markdown" Output format ("text" or "markdown")
oem int32_t 3 OCR Engine Mode (0-3). - 0: Legacy engine only - 1: Neural nets (LSTM) only (usually best) - 2: Legacy + LSTM - 3: Default (based on what's available)
min_confidence double 0 Minimum confidence threshold (0.0-100.0). Words with confidence below this threshold may be rejected or flagged.
preprocessing KreuzbergImagePreprocessingConfig* NULL Image preprocessing configuration. Controls how images are preprocessed before OCR. Can significantly improve quality for scanned documents or low-quality images.
enable_table_detection bool true Enable automatic table detection and reconstruction
table_min_confidence double 0 Minimum confidence threshold for table detection (0.0-1.0)
table_column_threshold int32_t 50 Column threshold for table detection (pixels)
table_row_threshold_ratio double 0.5 Row threshold ratio for table detection (0.0-1.0)
use_cache bool true Enable OCR result caching
classify_use_pre_adapted_templates bool true Use pre-adapted templates for character classification
language_model_ngram_on bool false Enable N-gram language model
tessedit_dont_blkrej_good_wds bool true Don't reject good words during block-level processing
tessedit_dont_rowrej_good_wds bool true Don't reject good words during row-level processing
tessedit_enable_dict_correction bool true Enable dictionary correction
tessedit_char_whitelist const char* "" Whitelist of allowed characters (empty = all allowed)
tessedit_char_blacklist const char* "" Blacklist of forbidden characters (empty = none forbidden)
tessedit_use_primary_params_model bool true Use primary language params model
textord_space_size_is_variable bool true Variable-width space detection
thresholding_method bool false Use adaptive thresholding method

Methods

kreuzberg_default()

Signature:

KreuzbergTesseractConfig kreuzberg_default();
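
Several fields above have documented value ranges (psm 0-13, oem 0-3, min_confidence 0.0-100.0, table_min_confidence 0.0-1.0). A caller that builds a configuration from user input may want to check those ranges first. The sketch below does this on a hypothetical local struct that copies only the four range-limited fields; whether the library itself rejects out-of-range values is not stated in this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of KreuzbergTesseractConfig with documented ranges. */
typedef struct {
    int32_t psm;                 /* Page Segmentation Mode, 0-13 */
    int32_t oem;                 /* OCR Engine Mode, 0-3 */
    double min_confidence;       /* 0.0-100.0 */
    double table_min_confidence; /* 0.0-1.0 */
} SketchTesseractRanges;

/* Returns true when every field lies inside its documented range. */
bool sketch_tesseract_ranges_ok(const SketchTesseractRanges *c) {
    return c->psm >= 0 && c->psm <= 13
        && c->oem >= 0 && c->oem <= 3
        && c->min_confidence >= 0.0 && c->min_confidence <= 100.0
        && c->table_min_confidence >= 0.0 && c->table_min_confidence <= 1.0;
}
```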

KreuzbergTextAnnotation

Inline text annotation — byte-range based formatting and links.

Annotations reference byte offsets into the node's text content, enabling precise identification of formatted regions.

Field Type Default Description
start uint32_t Start byte offset in the node's text content (inclusive).
end uint32_t End byte offset in the node's text content (exclusive).
kind KreuzbergAnnotationKind Annotation type.
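
Since `start` is inclusive and `end` is exclusive, the annotated text is the byte range `[start, end)` of the node's text. The following sketch copies that range into a new string. `sketch_annotated_slice` is a hypothetical helper, and it assumes the node text is available as a NUL-terminated string.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copies text[start, end) into a newly allocated NUL-terminated string.
 * Returns NULL if the range is invalid or allocation fails.
 * The caller owns the result and must free() it. */
char *sketch_annotated_slice(const char *text, uint32_t start, uint32_t end) {
    size_t len = strlen(text);
    if (start > end || end > len) return NULL;

    size_t n = (size_t)end - start;
    char *out = malloc(n + 1);
    if (out == NULL) return NULL;

    memcpy(out, text + start, n);
    out[n] = '\0';
    return out;
}
```

The offsets are bytes, not characters, so in UTF-8 text a single accented letter can occupy more than one position in the range.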

KreuzbergTextExtractionResult

Plain text and Markdown extraction result.

Contains the extracted text along with statistics and, for Markdown files, structural elements like headers and links.

Field Type Default Description
content const char* Extracted text content
line_count uintptr_t Number of lines
word_count uintptr_t Number of words
character_count uintptr_t Number of characters
headers const char*** NULL Markdown headers (text only, Markdown files only)
links const char**** NULL Markdown links as (text, URL) tuples (Markdown files only)
code_blocks const char**** NULL Code blocks as (language, code) tuples (Markdown files only)

KreuzbergTextMetadata

Text/Markdown metadata.

Extracted from plain text and Markdown files. Includes word counts and, for Markdown, structural elements like headers and links.

Field Type Default Description
line_count uint32_t Number of lines in the document
word_count uint32_t Number of words
character_count uint32_t Number of characters
headers const char*** NULL Markdown headers (headings text only, for Markdown files)
links const char**** NULL Markdown links as (text, url) tuples (for Markdown files)
code_blocks const char**** NULL Code blocks as (language, code) tuples (for Markdown files)

KreuzbergTokenReductionConfig

Field Type Default Description
level KreuzbergReductionLevel KREUZBERG_MODERATE Reduction level
language_hint const char** NULL Language hint
preserve_markdown bool false Preserve markdown
preserve_code bool true Preserve code
semantic_threshold float 0.3 Semantic threshold
enable_parallel bool true Enable parallel processing
use_simd bool true Use SIMD
custom_stopwords void** NULL Custom stopwords
preserve_patterns const char** NULL Preserve patterns
target_reduction float* NULL Target reduction
enable_semantic_clustering bool false Enable semantic clustering

Methods

kreuzberg_default()

Signature:

KreuzbergTokenReductionConfig kreuzberg_default();

KreuzbergTokenReductionOptions

Token reduction configuration.

Field Type Default Description
mode const char* Reduction mode: "off", "light", "moderate", "aggressive", "maximum"
preserve_important_words bool true Preserve important words (capitalized, technical terms)

Methods

kreuzberg_default()

Signature:

KreuzbergTokenReductionOptions kreuzberg_default();

KreuzbergTreeSitterConfig

Configuration for tree-sitter language pack integration.

Controls grammar download behavior and code analysis options.

Example (TOML)

[tree_sitter]
languages = ["python", "rust"]
groups = ["web"]

[tree_sitter.process]
structure = true
comments = true
docstrings = true

Field Type Default Description
enabled bool true Enable code intelligence processing (default: true). When false, tree-sitter analysis is completely skipped even if the config section is present.
cache_dir const char** NULL Custom cache directory for downloaded grammars. When NULL, uses the default: ~/.cache/tree-sitter-language-pack/v{version}/libs/.
languages const char*** NULL Languages to pre-download on init (e.g., ["python", "rust"]).
groups const char*** NULL Language groups to pre-download (e.g., ["web", "systems", "scripting"]).
process KreuzbergTreeSitterProcessConfig Processing options for code analysis.

Methods

kreuzberg_default()

Signature:

KreuzbergTreeSitterConfig kreuzberg_default();

KreuzbergTreeSitterProcessConfig

Processing options for tree-sitter code analysis.

Controls which analysis features are enabled when extracting code files.

Field Type Default Description
structure bool true Extract structural items (functions, classes, structs, etc.). Default: true.
imports bool true Extract import statements. Default: true.
exports bool true Extract export statements. Default: true.
comments bool false Extract comments. Default: false.
docstrings bool false Extract docstrings. Default: false.
symbols bool false Extract symbol definitions. Default: false.
diagnostics bool false Include parse diagnostics. Default: false.
chunk_max_size uintptr_t* NULL Maximum chunk size in bytes. NULL disables chunking.
content_mode KreuzbergCodeContentMode KREUZBERG_CHUNKS Content rendering mode for code extraction.

Methods

kreuzberg_default()

Signature:

KreuzbergTreeSitterProcessConfig kreuzberg_default();
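
Optional values such as `chunk_max_size` are exposed as pointers, with NULL meaning "not set". The sketch below shows one way to read such a field safely. `SketchProcessConfig` is a hypothetical one-field stand-in for the real struct, and both helper functions are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the optional chunk size. */
typedef struct {
    const uintptr_t *chunk_max_size; /* NULL disables chunking */
} SketchProcessConfig;

/* Chunking is enabled only when a size has been provided. */
bool sketch_chunking_enabled(const SketchProcessConfig *c) {
    return c->chunk_max_size != NULL;
}

/* Reads the optional value, falling back when it is not set. */
uintptr_t sketch_chunk_limit_or(const SketchProcessConfig *c,
                                uintptr_t fallback) {
    return c->chunk_max_size != NULL ? *c->chunk_max_size : fallback;
}
```

The same NULL check applies to the other pointer-typed fields in this reference whose default is listed as NULL.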

KreuzbergUri

A URI extracted from a document.

Represents any link, reference, or resource pointer found during extraction. The kind field classifies the URI semantically, while label carries optional human-readable display text.

Field Type Default Description
url const char* The URL or path string.
label const char** NULL Optional display text / label for the link.
page uint32_t* NULL Optional page number where the URI was found (1-indexed).
kind KreuzbergUriKind Semantic classification of the URI.

KreuzbergValidator

Trait for validator plugins.

Validators check extraction results for quality, completeness, or correctness. Unlike post-processors, validator errors fail fast - if a validator returns an error, the extraction fails immediately.

Use Cases

  • Quality Gates: Ensure extracted content meets minimum quality standards
  • Compliance: Verify content meets regulatory requirements
  • Content Filtering: Reject documents containing unwanted content
  • Format Validation: Verify extracted content structure
  • Security Checks: Scan for malicious content

Error Handling

Validator errors are fatal - they cause the extraction to fail and bubble up to the caller. Use validators for hard requirements that must be met.

For non-fatal checks, use post-processors instead.

Thread Safety

Validators must be thread-safe (Send + Sync).

Methods

kreuzberg_validate()

Validate an extraction result.

Check the extraction result and return Ok(()) if valid, or an error if validation fails.

Returns:

  • Ok(()) if validation passes
  • Err(...) if validation fails (extraction will fail)

Errors:

  • KreuzbergError.Validation - Validation failed
  • Any other error type appropriate for the failure

Example - Content Length Validation

async fn validate(&self, result: &ExtractionResult, config: &ExtractionConfig)
    -> Result<()> {
    // Count Unicode characters so the limits match the "characters" wording
    // in the error messages; len() would count bytes instead.
    let length = result.content.chars().count();

    if length < self.min {
        return Err(KreuzbergError::validation(format!(
            "Content too short: {} < {} characters",
            length, self.min
        )));
    }

    if length > self.max {
        return Err(KreuzbergError::validation(format!(
            "Content too long: {} > {} characters",
            length, self.max
        )));
    }

    Ok(())
}

Example - Quality Score Validation

async fn validate(&self, result: &ExtractionResult, config: &ExtractionConfig)
    -> Result<()> {
    // Check if quality_score exists in metadata
    let score = result.metadata
        .additional
        .get("quality_score")
        .and_then(|v| v.as_f64())
        .unwrap_or(0.0);

    if score < self.min_score {
        return Err(KreuzbergError::validation(format!(
            "Quality score too low: {} < {}",
            score, self.min_score
        )));
    }

    Ok(())
}

Example - Security Validation

async fn validate(&self, result: &ExtractionResult, config: &ExtractionConfig)
    -> Result<()> {
    // Check for blocked patterns
    for pattern in &self.blocked_patterns {
        if result.content.contains(pattern) {
            return Err(KreuzbergError::validation(format!(
                "Content contains blocked pattern: {}",
                pattern
            )));
        }
    }

    Ok(())
}

Signature:

void kreuzberg_validate(KreuzbergExtractionResult result, KreuzbergExtractionConfig config);

kreuzberg_should_validate()

Optional: Check if this validator should run for a given result.

Allows conditional validation based on MIME type, metadata, or content. Defaults to true (always run).

Returns:

true if the validator should run, false to skip.

Signature:

bool kreuzberg_should_validate(KreuzbergExtractionResult result, KreuzbergExtractionConfig config);

kreuzberg_priority()

Optional: Get the validation priority.

Higher priority validators run first. Useful for ordering validation checks (e.g., run cheap validations before expensive ones).

Default priority is 50.

Returns:

Priority value (higher = runs earlier).

Signature:

int32_t kreuzberg_priority();
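
The ordering rule (higher priority runs earlier) can be sketched as a descending sort over the registered validators. The `SketchValidator` struct and `sketch_order_validators` function are hypothetical and only illustrate the ordering; how Kreuzberg stores and schedules validators internally is not described here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record pairing a validator name with its priority. */
typedef struct {
    const char *name;
    int32_t priority; /* documented default is 50 */
} SketchValidator;

/* qsort comparator: larger priority sorts first. */
static int sketch_by_priority_desc(const void *a, const void *b) {
    int32_t pa = ((const SketchValidator *)a)->priority;
    int32_t pb = ((const SketchValidator *)b)->priority;
    return (pa < pb) - (pa > pb);
}

/* Reorders validators so that the highest priority comes first.
 * qsort is not stable, so the order of equal priorities is unspecified. */
void sketch_order_validators(SketchValidator *validators, size_t count) {
    qsort(validators, count, sizeof *validators, sketch_by_priority_desc);
}
```

Giving a cheap length check a priority of 90 and an expensive scan a priority of 10 would make the cheap check run first and let it fail fast.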

KreuzbergXlsxAppProperties

Application properties from docProps/app.xml for XLSX.

Contains Excel-specific document metadata.

Field Type Default Description
application const char** NULL Application name (e.g., "Microsoft Excel")
app_version const char** NULL Application version
doc_security int32_t* NULL Document security level
scale_crop bool* NULL Scale crop flag
links_up_to_date bool* NULL Links up to date flag
shared_doc bool* NULL Shared document flag
hyperlinks_changed bool* NULL Hyperlinks changed flag
company const char** NULL Company name
worksheet_names const char** NULL Worksheet names

KreuzbergXmlExtractionResult

XML extraction result.

Contains extracted text content from XML files along with structural statistics about the XML document.

Field Type Default Description
content const char* Extracted text content (XML structure filtered out)
element_count uintptr_t Total number of XML elements processed
unique_elements const char** List of unique element names found (sorted)

KreuzbergXmlMetadata

XML metadata extracted during XML parsing.

Provides statistics about XML document structure.

Field Type Default Description
element_count uint32_t Total number of XML elements processed
unique_elements const char** NULL List of unique element tag names (sorted)

KreuzbergYakeParams

YAKE-specific parameters.

Field Type Default Description
window_size uintptr_t 2 Window size for co-occurrence analysis (default: 2). Controls the context window for computing co-occurrence statistics.

Methods

kreuzberg_default()

Signature:

KreuzbergYakeParams kreuzberg_default();

KreuzbergYearRange

Year range for bibliographic metadata.

Field Type Default Description
min uint32_t* NULL Minimum year
max uint32_t* NULL Maximum year
years uint32_t* /* serde(default) */ Years

Enums

KreuzbergExecutionProviderType

ONNX Runtime execution provider type.

Determines which hardware backend is used for model inference. Auto (default) selects the best available provider per platform.

Value Description
KREUZBERG_AUTO Auto-select: CoreML on macOS, CUDA on Linux, CPU elsewhere.
KREUZBERG_CPU CPU execution provider (always available).
KREUZBERG_CORE_ML Apple CoreML (macOS/iOS Neural Engine + GPU).
KREUZBERG_CUDA NVIDIA CUDA GPU acceleration.
KREUZBERG_TENSOR_RT NVIDIA TensorRT (optimized CUDA inference).

KreuzbergOutputFormat

Output format for extraction results.

Controls the format of the content field in ExtractionResult. When set to Markdown, Djot, or Html, the output uses that format. Plain returns the raw extracted text. Structured returns JSON with full OCR element data including bounding boxes and confidence scores.

Value Description
KREUZBERG_PLAIN Plain text content only (default)
KREUZBERG_MARKDOWN Markdown format
KREUZBERG_DJOT Djot markup format
KREUZBERG_HTML HTML format
KREUZBERG_JSON JSON tree format with heading-driven sections.
KREUZBERG_STRUCTURED Structured JSON format with full OCR element metadata.
KREUZBERG_CUSTOM Custom renderer registered via the RendererRegistry. The string is the renderer name (e.g., "docx", "latex"). — Fields: 0: const char*

KreuzbergHtmlTheme

Built-in HTML theme selection.

Value Description
KREUZBERG_DEFAULT Sensible defaults: system font stack, neutral colours, readable line measure. CSS custom properties (--kb-*) are all defined so user CSS can override individual values.
KREUZBERG_GIT_HUB GitHub Markdown-inspired palette and spacing.
KREUZBERG_DARK Dark background, light text.
KREUZBERG_LIGHT Minimal light theme with generous whitespace.
KREUZBERG_UNSTYLED No built-in stylesheet emitted. CSS custom properties are still defined on :root so user stylesheets can reference var(--kb-*) tokens.

KreuzbergTableModel

Which table structure recognition model to use.

Controls the model used for table cell detection within layout-detected table regions. Wire format is snake_case in all serializers (JSON, TOML, YAML).

Value Description
KREUZBERG_TATR TATR (Table Transformer) -- default, 30MB, DETR-based row/column detection.
KREUZBERG_SLANET_WIRED SLANeXT wired variant -- 365MB, optimized for bordered tables.
KREUZBERG_SLANET_WIRELESS SLANeXT wireless variant -- 365MB, optimized for borderless tables.
KREUZBERG_SLANET_PLUS SLANet-plus -- 7.78MB, lightweight general-purpose.
KREUZBERG_SLANET_AUTO Classifier-routed SLANeXT: auto-select wired/wireless per table. Uses PP-LCNet classifier (6.78MB) + both SLANeXT variants (730MB total).
KREUZBERG_DISABLED Disable table structure model inference entirely; use heuristic path only.

KreuzbergChunkerType

Type of text chunker to use.

Variants

  • Text - Generic text splitter, splits on whitespace and punctuation
  • Markdown - Markdown-aware splitter, preserves formatting and structure
  • Yaml - YAML-aware splitter, creates one chunk per top-level key
  • Semantic - Topic-aware chunker. With an EmbeddingConfig, splits at embedding-based topic shifts tuned by topic_threshold (default 0.75, lower = more splits). Without an embedding, falls back to a structural-boundary heuristic (ALL-CAPS headers, numbered sections, blank-line paragraphs) and merges groups into chunks capped at max_characters (default 1000). topic_threshold has no effect in the fallback path. For best results, pair with an embedding model.
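
Value Description
KREUZBERG_TEXT Generic text splitter; splits on whitespace and punctuation
KREUZBERG_MARKDOWN Markdown-aware splitter; preserves formatting and structure
KREUZBERG_YAML YAML-aware splitter; creates one chunk per top-level key
KREUZBERG_SEMANTIC Topic-aware chunker; see Variants above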

KreuzbergChunkSizing

How chunk size is measured.

Defaults to Characters (Unicode character count). When using token-based sizing, chunks are sized by token count according to the specified tokenizer.

Token-based sizing uses HuggingFace tokenizers loaded at runtime. Any tokenizer available on HuggingFace Hub can be used, including OpenAI-compatible tokenizers (e.g., Xenova/gpt-4o, Xenova/cl100k_base).

Value Description
KREUZBERG_CHARACTERS Size measured in Unicode characters (default).
KREUZBERG_TOKENIZER Size measured in tokens from a HuggingFace tokenizer. — Fields: model: const char*, cache_dir: const char*

KreuzbergEmbeddingModelType

Embedding model types supported by Kreuzberg.

Value Description
KREUZBERG_PRESET Use a preset model configuration (recommended) — Fields: name: const char*
KREUZBERG_CUSTOM Use a custom ONNX model from HuggingFace — Fields: model_id: const char*, dimensions: uintptr_t
KREUZBERG_LLM Provider-hosted embedding model via liter-llm. Uses the model specified in the nested LlmConfig (e.g., "openai/text-embedding-3-small"). — Fields: llm: KreuzbergLlmConfig
KREUZBERG_PLUGIN In-process embedding backend registered via the plugin system. The caller registers an EmbeddingBackend once (e.g. a wrapper around an already-loaded llama-cpp-python, sentence-transformers, or tuned ONNX model), then references it by name in config. Kreuzberg calls back into the registered backend during chunking and standalone embed requests, with no HuggingFace download, no ONNX Runtime requirement, and no HTTP sidecar. When this variant is selected, only the following EmbeddingConfig fields apply: normalize (post-call L2 normalization) and max_embed_duration_secs (dispatcher timeout). Model-loading fields (batch_size, cache_dir, show_download_progress, acceleration) are ignored because the host owns the model lifecycle. Semantic chunking falls back to ChunkingConfig.max_characters when this variant is used, since there is no preset from which to look up a chunk-size ceiling; size your context window via max_characters directly. See register_embedding_backend. — Fields: name: const char*

KreuzbergCodeContentMode

Content rendering mode for code extraction.

Controls how extracted code content is represented in the content field of ExtractionResult.

Value Description
KREUZBERG_CHUNKS Use TSLP semantic chunks as content (default).
KREUZBERG_RAW Use raw source code as content.
KREUZBERG_STRUCTURE Emit function/class headings + docstrings (no code bodies).

KreuzbergListType

Type of list detection.

Value Description
KREUZBERG_BULLET Bullet points (-, *, •, etc.)
KREUZBERG_NUMBERED Numbered lists (1., 2., etc.)
KREUZBERG_LETTERED Lettered lists (a., b., A., B., etc.)
KREUZBERG_INDENTED Indented items

KreuzbergFracType

Value Description
KREUZBERG_BAR Bar
KREUZBERG_NO_BAR No bar
KREUZBERG_LINEAR Linear
KREUZBERG_SKEWED Skewed

KreuzbergOcrBackendType

OCR backend types.

Value Description
KREUZBERG_TESSERACT Tesseract OCR (native Rust binding)
KREUZBERG_EASY_OCR EasyOCR (Python-based, via FFI)
KREUZBERG_PADDLE_OCR PaddleOCR (Python-based, via FFI)
KREUZBERG_CUSTOM Custom/third-party OCR backend

KreuzbergProcessingStage

Processing stages for post-processors.

Post-processors are executed in stage order (Early → Middle → Late). Use stages to control the order of post-processing operations.

Value Description
KREUZBERG_EARLY Early stage - foundational processing. Use for: - Language detection - Character encoding normalization - Entity extraction (NER) - Text quality scoring
KREUZBERG_MIDDLE Middle stage - content transformation. Use for: - Keyword extraction - Token reduction - Text summarization - Semantic analysis
KREUZBERG_LATE Late stage - final enrichment. Use for: - Custom user hooks - Analytics/logging - Final validation - Output formatting

KreuzbergReductionLevel

Value Description
KREUZBERG_OFF Off
KREUZBERG_LIGHT Light
KREUZBERG_MODERATE Moderate
KREUZBERG_AGGRESSIVE Aggressive
KREUZBERG_MAXIMUM Maximum

KreuzbergPdfAnnotationType

Type of PDF annotation.

Value Description
KREUZBERG_TEXT Sticky note / text annotation
KREUZBERG_HIGHLIGHT Highlighted text region
KREUZBERG_LINK Hyperlink annotation
KREUZBERG_STAMP Rubber stamp annotation
KREUZBERG_UNDERLINE Underline text markup
KREUZBERG_STRIKE_OUT Strikeout text markup
KREUZBERG_OTHER Any other annotation type

KreuzbergBlockType

Types of block-level elements in Djot.

Value Description
KREUZBERG_PARAGRAPH Paragraph element
KREUZBERG_HEADING Heading element
KREUZBERG_BLOCKQUOTE Blockquote element
KREUZBERG_CODE_BLOCK Code block
KREUZBERG_LIST_ITEM List item
KREUZBERG_ORDERED_LIST Ordered list
KREUZBERG_BULLET_LIST Bullet list
KREUZBERG_TASK_LIST Task list
KREUZBERG_DEFINITION_LIST Definition list
KREUZBERG_DEFINITION_TERM Definition term
KREUZBERG_DEFINITION_DESCRIPTION Definition description
KREUZBERG_DIV Div
KREUZBERG_SECTION Section element
KREUZBERG_THEMATIC_BREAK Thematic break
KREUZBERG_RAW_BLOCK Raw block
KREUZBERG_MATH_DISPLAY Math display

KreuzbergInlineType

Types of inline elements in Djot.

Value Description
KREUZBERG_TEXT Plain text
KREUZBERG_STRONG Strong
KREUZBERG_EMPHASIS Emphasis
KREUZBERG_HIGHLIGHT Highlight
KREUZBERG_SUBSCRIPT Subscript
KREUZBERG_SUPERSCRIPT Superscript
KREUZBERG_INSERT Insert
KREUZBERG_DELETE Delete
KREUZBERG_CODE Code
KREUZBERG_LINK Link
KREUZBERG_IMAGE Image element
KREUZBERG_SPAN Span
KREUZBERG_MATH Math
KREUZBERG_RAW_INLINE Raw inline
KREUZBERG_FOOTNOTE_REF Footnote ref
KREUZBERG_SYMBOL Symbol

KreuzbergRelationshipKind

Semantic kind of a relationship between document elements.

Value Description
KREUZBERG_FOOTNOTE_REFERENCE Footnote marker -> footnote definition.
KREUZBERG_CITATION_REFERENCE Citation marker -> bibliography entry.
KREUZBERG_INTERNAL_LINK Internal anchor link (#id) -> target heading/element.
KREUZBERG_CAPTION Caption paragraph -> figure/table it describes.
KREUZBERG_LABEL Label -> labeled element (HTML <label for>, LaTeX \label{}).
KREUZBERG_TOC_ENTRY TOC entry -> target section.
KREUZBERG_CROSS_REFERENCE Cross-reference (LaTeX \ref{}, DOCX cross-reference field).

KreuzbergContentLayer

Content layer classification for document nodes.

Replaces separate body/furniture arrays with per-node granularity.

Value Description
KREUZBERG_BODY Main document body content.
KREUZBERG_HEADER Page/section header (running header).
KREUZBERG_FOOTER Page/section footer (running footer).
KREUZBERG_FOOTNOTE Footnote content.

KreuzbergNodeContent

Tagged enum for node content. Each variant carries only type-specific data.

Uses #[serde(tag = "node_type")] to avoid "type" keyword collision in Go/Java/TypeScript bindings.

Value Description
KREUZBERG_TITLE Document title. — Fields: text: const char*
KREUZBERG_HEADING Section heading with level (1-6). — Fields: level: uint8_t, text: const char*
KREUZBERG_PARAGRAPH Body text paragraph. — Fields: text: const char*
KREUZBERG_LIST List container — children are ListItem nodes. — Fields: ordered: bool
KREUZBERG_LIST_ITEM Individual list item. — Fields: text: const char*
KREUZBERG_TABLE Table with structured cell grid. — Fields: grid: KreuzbergTableGrid
KREUZBERG_IMAGE Image reference. — Fields: description: const char*, image_index: uint32_t, src: const char*
KREUZBERG_CODE Code block. — Fields: text: const char*, language: const char*
KREUZBERG_QUOTE Block quote — container, children carry the quoted content.
KREUZBERG_FORMULA Mathematical formula / equation. — Fields: text: const char*
KREUZBERG_FOOTNOTE Footnote reference content. — Fields: text: const char*
KREUZBERG_GROUP Logical grouping container (section, key-value area). heading_level + heading_text capture the section heading directly rather than relying on a first-child positional convention. — Fields: label: const char*, heading_level: uint8_t, heading_text: const char*
KREUZBERG_PAGE_BREAK Page break marker.
KREUZBERG_SLIDE Presentation slide container — children are the slide's content nodes. — Fields: number: uint32_t, title: const char*
KREUZBERG_DEFINITION_LIST Definition list container — children are DefinitionItem nodes.
KREUZBERG_DEFINITION_ITEM Individual definition list entry with term and definition. — Fields: term: const char*, definition: const char*
KREUZBERG_CITATION Citation or bibliographic reference. — Fields: key: const char*, text: const char*
KREUZBERG_ADMONITION Admonition / callout container (note, warning, tip, etc.). Children carry the admonition body content. — Fields: kind: const char*, title: const char*
KREUZBERG_RAW_BLOCK Raw block preserved verbatim from the source format. Used for content that cannot be mapped to a semantic node type (e.g. JSX in MDX, raw LaTeX in markdown, embedded HTML). — Fields: format: const char*, content: const char*
KREUZBERG_METADATA_BLOCK Structured metadata block (email headers, YAML frontmatter, etc.). — Fields: entries: const char***
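
In C, a tagged enum of this kind is commonly modelled as a discriminant plus a union, where only the member selected by the discriminant may be read. The sketch below illustrates the pattern for three of the variants listed above. The type names, field layout, and the rendering function are assumptions made for this example and are not taken from the Kreuzberg header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical discriminant mirroring three NodeContent variants. */
typedef enum {
    EX_NODE_HEADING,
    EX_NODE_PARAGRAPH,
    EX_NODE_CODE
} ExNodeKind;

/* Hypothetical tagged union: only the member named by `kind` is valid. */
typedef struct {
    ExNodeKind kind;
    union {
        struct { uint8_t level; const char *text; } heading;
        struct { const char *text; } paragraph;
        struct { const char *text; const char *language; } code;
    } as;
} ExNode;

/* Render a node as one line of text into `out`.
 * Returns the snprintf result, or -1 for an unknown kind. */
int ex_render_node(const ExNode *node, char *out, size_t cap) {
    assert(node != NULL && out != NULL);
    switch (node->kind) {
    case EX_NODE_HEADING: {
        char hashes[7] = {0};
        uint8_t level = node->as.heading.level;
        if (level < 1) level = 1;   /* clamp to the documented 1-6 range */
        if (level > 6) level = 6;
        memset(hashes, '#', level);
        return snprintf(out, cap, "%s %s", hashes, node->as.heading.text);
    }
    case EX_NODE_PARAGRAPH:
        return snprintf(out, cap, "%s", node->as.paragraph.text);
    case EX_NODE_CODE: {
        const char *lang = node->as.code.language;
        return snprintf(out, cap, "[%s] %s", lang ? lang : "text",
                        node->as.code.text);
    }
    }
    return -1;
}
```

Switching on the discriminant before touching the union keeps each branch limited to the fields its variant actually carries.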

KreuzbergAnnotationKind

Types of inline text annotations.

Value Description
KREUZBERG_BOLD Bold
KREUZBERG_ITALIC Italic
KREUZBERG_UNDERLINE Underline
KREUZBERG_STRIKETHROUGH Strikethrough
KREUZBERG_CODE Code
KREUZBERG_SUBSCRIPT Subscript
KREUZBERG_SUPERSCRIPT Superscript
KREUZBERG_LINK Link — Fields: url: const char*, title: const char*
KREUZBERG_HIGHLIGHT Highlighted text (PDF highlights, HTML <mark>).
KREUZBERG_COLOR Text color (CSS-compatible value, e.g. "#ff0000", "red"). — Fields: value: const char*
KREUZBERG_FONT_SIZE Font size with units (e.g. "12pt", "1.2em", "16px"). — Fields: value: const char*
KREUZBERG_CUSTOM Extensible annotation for format-specific styling. — Fields: name: const char*, value: const char*

KreuzbergExtractionMethod

How the extracted text was produced.

Value Description
KREUZBERG_NATIVE Native
KREUZBERG_OCR OCR
KREUZBERG_MIXED Mixed

KreuzbergChunkType

Semantic structural classification of a text chunk.

Assigned by the heuristic classifier in chunking.classifier. Defaults to Unknown when no rule matches. Designed to be extended in future versions without breaking changes.

Value Description
KREUZBERG_HEADING Section heading or document title.
KREUZBERG_PARTY_LIST Party list: names, addresses, and signatories.
KREUZBERG_DEFINITIONS Definition clause ("X means…", "X shall mean…").
KREUZBERG_OPERATIVE_CLAUSE Operative clause containing legal/contractual action verbs.
KREUZBERG_SIGNATURE_BLOCK Signature block with signatures, names, and dates.
KREUZBERG_SCHEDULE Schedule, annex, appendix, or exhibit section.
KREUZBERG_TABLE_LIKE Table-like content with aligned columns or repeated patterns.
KREUZBERG_FORMULA Mathematical formula or equation.
KREUZBERG_CODE_BLOCK Code block or preformatted content.
KREUZBERG_IMAGE Embedded or referenced image content.
KREUZBERG_ORG_CHART Organizational chart or hierarchy diagram.
KREUZBERG_DIAGRAM Diagram, figure, or visual illustration.
KREUZBERG_UNKNOWN Unclassified or mixed content.

KreuzbergImageKind

Heuristic classification of what an image likely depicts.

Value Description
KREUZBERG_PHOTOGRAPH Photographic image (natural scene, photograph)
KREUZBERG_DIAGRAM Technical or schematic diagram
KREUZBERG_CHART Chart, graph, or plot
KREUZBERG_DRAWING Freehand or technical drawing
KREUZBERG_TEXT_BLOCK Text-heavy image (scanned text, document)
KREUZBERG_DECORATION Decorative element or border
KREUZBERG_LOGO Logo or brand mark
KREUZBERG_ICON Small icon
KREUZBERG_TILE_FRAGMENT Fragment of a larger tiled image (tile of a technical drawing)
KREUZBERG_MASK Mask or transparency map
KREUZBERG_UNKNOWN Could not classify with reasonable confidence

KreuzbergResultFormat

Result-shape selection for extraction results.

Distinct from OutputFormat (which controls rendering — Plain, Markdown, HTML, etc.). ResultFormat controls the shape of the result: a unified content blob vs. an element-based decomposition.

Value Description
KREUZBERG_UNIFIED Unified format with all content in the content field
KREUZBERG_ELEMENT_BASED Element-based format with semantic element extraction

KreuzbergElementType

Semantic element type classification.

Categorizes text content into semantic units for downstream processing. Supports the element types commonly found in Unstructured documents.

Value Description
KREUZBERG_TITLE Document title
KREUZBERG_NARRATIVE_TEXT Main narrative text body
KREUZBERG_HEADING Section heading
KREUZBERG_LIST_ITEM List item (bullet, numbered, etc.)
KREUZBERG_TABLE Table element
KREUZBERG_IMAGE Image element
KREUZBERG_PAGE_BREAK Page break marker
KREUZBERG_CODE_BLOCK Code block
KREUZBERG_BLOCK_QUOTE Block quote
KREUZBERG_FOOTER Footer text
KREUZBERG_HEADER Header text

KreuzbergFormatMetadata

Format-specific metadata (discriminated union).

Only one format type can exist per extraction result. This provides type-safe, clean metadata without nested optionals.

Value Description
KREUZBERG_PDF PDF — Fields: 0: KreuzbergPdfMetadata
KREUZBERG_DOCX DOCX — Fields: 0: KreuzbergDocxMetadata
KREUZBERG_EXCEL Excel — Fields: 0: KreuzbergExcelMetadata
KREUZBERG_EMAIL Email — Fields: 0: KreuzbergEmailMetadata
KREUZBERG_PPTX PPTX — Fields: 0: KreuzbergPptxMetadata
KREUZBERG_ARCHIVE Archive — Fields: 0: KreuzbergArchiveMetadata
KREUZBERG_IMAGE Image — Fields: 0: KreuzbergImageMetadata
KREUZBERG_XML XML — Fields: 0: KreuzbergXmlMetadata
KREUZBERG_TEXT Plain text — Fields: 0: KreuzbergTextMetadata
KREUZBERG_HTML HTML — Fields: 0: KreuzbergHtmlMetadata
KREUZBERG_OCR OCR — Fields: 0: KreuzbergOcrMetadata
KREUZBERG_CSV CSV — Fields: 0: KreuzbergCsvMetadata
KREUZBERG_BIBTEX BibTeX — Fields: 0: KreuzbergBibtexMetadata
KREUZBERG_CITATION Citation — Fields: 0: KreuzbergCitationMetadata
KREUZBERG_FICTION_BOOK FictionBook (FB2) — Fields: 0: KreuzbergFictionBookMetadata
KREUZBERG_DBF DBF — Fields: 0: KreuzbergDbfMetadata
KREUZBERG_JATS JATS — Fields: 0: KreuzbergJatsMetadata
KREUZBERG_EPUB EPUB — Fields: 0: KreuzbergEpubMetadata
KREUZBERG_PST PST — Fields: 0: KreuzbergPstMetadata
KREUZBERG_CODE Code — Fields: 0: const char*

KreuzbergTextDirection

Text direction enumeration for HTML documents.

Value Description
KREUZBERG_LEFT_TO_RIGHT Left-to-right text direction
KREUZBERG_RIGHT_TO_LEFT Right-to-left text direction
KREUZBERG_AUTO Automatic text direction detection

KreuzbergLinkType

Link type classification.

Value Description
KREUZBERG_ANCHOR Anchor link (#section)
KREUZBERG_INTERNAL Internal link (same domain)
KREUZBERG_EXTERNAL External link (different domain)
KREUZBERG_EMAIL Email link (mailto:)
KREUZBERG_PHONE Phone link (tel:)
KREUZBERG_OTHER Other link type

KreuzbergImageType

Image type classification.

Value Description
KREUZBERG_DATA_URI Data URI image
KREUZBERG_INLINE_SVG Inline SVG
KREUZBERG_EXTERNAL External image URL
KREUZBERG_RELATIVE Relative path image

KreuzbergStructuredDataType

Structured data type classification.

Value Description
KREUZBERG_JSON_LD JSON-LD structured data
KREUZBERG_MICRODATA Microdata
KREUZBERG_RDFA RDFa

KreuzbergOcrBoundingGeometry

Bounding geometry for an OCR element.

Supports both axis-aligned rectangles (from Tesseract) and 4-point quadrilaterals (from PaddleOCR and rotated text detection).

Value Description
KREUZBERG_RECTANGLE Axis-aligned bounding box (typical for Tesseract output). — Fields: left: uint32_t, top: uint32_t, width: uint32_t, height: uint32_t
KREUZBERG_QUADRILATERAL 4-point quadrilateral for rotated/skewed text (PaddleOCR). Points are in clockwise order starting from top-left: [top_left, top_right, bottom_right, bottom_left] — Fields: points: const char*

KreuzbergOcrElementLevel

Hierarchical level of an OCR element.

Maps to Tesseract's page segmentation hierarchy and provides equivalent semantics for PaddleOCR.

Value Description
KREUZBERG_WORD Individual word
KREUZBERG_LINE Line of text (default for PaddleOCR)
KREUZBERG_BLOCK Paragraph or text block
KREUZBERG_PAGE Page-level element

KreuzbergPageUnitType

Type of paginated unit in a document.

Distinguishes between different types of "pages" (PDF pages, presentation slides, spreadsheet sheets).

Value Description
KREUZBERG_PAGE Standard document pages (PDF, DOCX, images)
KREUZBERG_SLIDE Presentation slides (PPTX, ODP)
KREUZBERG_SHEET Spreadsheet sheets (XLSX, ODS)

KreuzbergUriKind

Semantic classification of an extracted URI.

Value Description
KREUZBERG_HYPERLINK A clickable hyperlink (web URL, file link).
KREUZBERG_IMAGE An image or media resource reference.
KREUZBERG_ANCHOR An internal anchor or cross-reference target.
KREUZBERG_CITATION A citation or bibliographic reference (DOI, academic ref).
KREUZBERG_REFERENCE A general reference (e.g. \ref{} in LaTeX, :ref: in RST).
KREUZBERG_EMAIL An email address (mailto: link or bare email).

KreuzbergKeywordAlgorithm

Keyword algorithm selection.

Value Description
KREUZBERG_YAKE YAKE (Yet Another Keyword Extractor) - statistical approach
KREUZBERG_RAKE RAKE (Rapid Automatic Keyword Extraction) - co-occurrence based

KreuzbergPsmMode

Page segmentation mode for Tesseract OCR.

Value Description
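KREUZBERG_OSD_ONLY Orientation and script detection (OSD) only
KREUZBERG_AUTO_OSD Automatic page segmentation with OSD
KREUZBERG_AUTO_ONLY Automatic page segmentation, but no OSD or OCR
KREUZBERG_AUTO Fully automatic page segmentation, but no OSD
KREUZBERG_SINGLE_COLUMN Single column of text of variable sizes
KREUZBERG_SINGLE_BLOCK_VERTICAL Single uniform block of vertically aligned text
KREUZBERG_SINGLE_BLOCK Single uniform block of text
KREUZBERG_SINGLE_LINE Single text line
KREUZBERG_SINGLE_WORD Single word
KREUZBERG_CIRCLE_WORD Single word in a circle
KREUZBERG_SINGLE_CHAR Single character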
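
These modes correspond by name to Tesseract's page segmentation modes 0 through 10. The sketch below maps a hypothetical mirror enum to the numeric value accepted by Tesseract's `--psm` option. `ExPsmMode` and both functions are illustrative names, and the mapping uses an explicit switch because the numeric discriminants of the real Kreuzberg enum are not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical mirror of the modes listed above. */
typedef enum {
    EX_PSM_OSD_ONLY,
    EX_PSM_AUTO_OSD,
    EX_PSM_AUTO_ONLY,
    EX_PSM_AUTO,
    EX_PSM_SINGLE_COLUMN,
    EX_PSM_SINGLE_BLOCK_VERTICAL,
    EX_PSM_SINGLE_BLOCK,
    EX_PSM_SINGLE_LINE,
    EX_PSM_SINGLE_WORD,
    EX_PSM_CIRCLE_WORD,
    EX_PSM_SINGLE_CHAR
} ExPsmMode;

/* Map a mode to Tesseract's numeric PSM value, or -1 if unknown. */
int ex_psm_to_tesseract(ExPsmMode mode) {
    switch (mode) {
    case EX_PSM_OSD_ONLY:              return 0;
    case EX_PSM_AUTO_OSD:              return 1;
    case EX_PSM_AUTO_ONLY:             return 2;
    case EX_PSM_AUTO:                  return 3;
    case EX_PSM_SINGLE_COLUMN:         return 4;
    case EX_PSM_SINGLE_BLOCK_VERTICAL: return 5;
    case EX_PSM_SINGLE_BLOCK:          return 6;
    case EX_PSM_SINGLE_LINE:           return 7;
    case EX_PSM_SINGLE_WORD:           return 8;
    case EX_PSM_CIRCLE_WORD:           return 9;
    case EX_PSM_SINGLE_CHAR:           return 10;
    }
    return -1;
}

/* Format the matching Tesseract command-line flag, e.g. "--psm 7".
 * Returns the snprintf result, or -1 for an unknown mode. */
int ex_psm_flag(ExPsmMode mode, char *out, size_t cap) {
    assert(out != NULL);
    int value = ex_psm_to_tesseract(mode);
    if (value < 0) return -1;
    return snprintf(out, cap, "--psm %d", value);
}
```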

KreuzbergPaddleLanguage

Supported languages in PaddleOCR.

Maps user-friendly language codes to paddle-ocr-rs language identifiers.

Value Description
KREUZBERG_ENGLISH English
KREUZBERG_CHINESE Simplified Chinese
KREUZBERG_JAPANESE Japanese
KREUZBERG_KOREAN Korean
KREUZBERG_GERMAN German
KREUZBERG_FRENCH French
KREUZBERG_LATIN Latin script (covers most European languages)
KREUZBERG_CYRILLIC Cyrillic (Russian and related)
KREUZBERG_TRADITIONAL_CHINESE Traditional Chinese
KREUZBERG_THAI Thai
KREUZBERG_GREEK Greek
KREUZBERG_EAST_SLAVIC East Slavic (Russian, Ukrainian, Belarusian)
KREUZBERG_ARABIC Arabic (Arabic, Persian, Urdu)
KREUZBERG_DEVANAGARI Devanagari (Hindi, Marathi, Sanskrit, Nepali)
KREUZBERG_TAMIL Tamil
KREUZBERG_TELUGU Telugu

KreuzbergLayoutClass

The 17 canonical document layout classes.

All model backends (RT-DETR, YOLO, etc.) map their native class IDs to this shared set. Models with fewer classes (DocLayNet: 11, PubLayNet: 5) map to the closest equivalent.

Wire format is snake_case in all serializers (JSON, TOML, YAML).

Value Description
KREUZBERG_CAPTION Caption element
KREUZBERG_FOOTNOTE Footnote element
KREUZBERG_FORMULA Formula
KREUZBERG_LIST_ITEM List item
KREUZBERG_PAGE_FOOTER Page footer
KREUZBERG_PAGE_HEADER Page header
KREUZBERG_PICTURE Picture
KREUZBERG_SECTION_HEADER Section header
KREUZBERG_TABLE Table element
KREUZBERG_TEXT Text
KREUZBERG_TITLE Title element
KREUZBERG_DOCUMENT_INDEX Document index
KREUZBERG_CODE Code
KREUZBERG_CHECKBOX_SELECTED Checkbox selected
KREUZBERG_CHECKBOX_UNSELECTED Checkbox unselected
KREUZBERG_FORM Form
KREUZBERG_KEY_VALUE_REGION Key-value region
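
Because the wire format is snake_case, the serialized name of a class can plausibly be derived from its C constant by dropping the prefix and lowercasing the rest. The sketch below does exactly that. It assumes the wire names follow this direct lowercase form (for example `key_value_region`), which is an inference from the note above rather than something this page states per value, and `ex_wire_name` is a hypothetical helper.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Derive a snake_case name from a constant such as
 * "KREUZBERG_PAGE_FOOTER" by stripping the prefix and lowercasing.
 * Returns 0 on success, -1 if the prefix is missing, the remainder is
 * empty, or `out` is too small. */
int ex_wire_name(const char *constant, char *out, size_t cap) {
    static const char prefix[] = "KREUZBERG_";
    const size_t prefix_len = sizeof prefix - 1;

    assert(constant != NULL && out != NULL);
    if (strncmp(constant, prefix, prefix_len) != 0) return -1;

    const char *src = constant + prefix_len;
    size_t n = strlen(src);
    if (n == 0 || n + 1 > cap) return -1;

    for (size_t i = 0; i < n; i++) {
        /* Underscores and digits pass through tolower unchanged. */
        out[i] = (char)tolower((unsigned char)src[i]);
    }
    out[n] = '\0';
    return 0;
}
```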

Errors

KreuzbergKreuzbergError

Main error type for all Kreuzberg operations.

All errors in Kreuzberg use this enum, which preserves error chains and provides context for debugging.

Variants

  • Io - File system and I/O errors (always bubble up)
  • Parsing - Document parsing errors (corrupt files, unsupported features)
  • Ocr - OCR processing errors
  • Validation - Input validation errors (invalid paths, config, parameters)
  • Cache - Cache operation errors (non-fatal, can be ignored)
  • ImageProcessing - Image manipulation errors
  • Serialization - JSON/MessagePack serialization errors
  • MissingDependency - Missing optional dependencies (tesseract, etc.)
  • Plugin - Plugin-specific errors
  • LockPoisoned - Mutex/RwLock poisoning (should not happen in normal operation)
  • UnsupportedFormat - Unsupported MIME type or file format
  • Other - Catch-all for uncommon errors

Variant Description
KREUZBERG_IO IO error:
KREUZBERG_PARSING Parsing error:
KREUZBERG_OCR OCR error:
KREUZBERG_VALIDATION Validation error:
KREUZBERG_CACHE Cache error:
KREUZBERG_IMAGE_PROCESSING Image processing error:
KREUZBERG_SERIALIZATION Serialization error:
KREUZBERG_MISSING_DEPENDENCY Missing dependency:
KREUZBERG_PLUGIN Plugin error in '{plugin_name}':
KREUZBERG_LOCK_POISONED Lock poisoned:
KREUZBERG_UNSUPPORTED_FORMAT Unsupported format:
KREUZBERG_EMBEDDING Embedding error:
KREUZBERG_TIMEOUT Extraction timed out after {elapsed_ms}ms (limit: {limit_ms}ms)
KREUZBERG_CANCELLED Extraction cancelled
KREUZBERG_SECURITY Security violation:
KREUZBERG_OTHER {0}
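
The variant notes above suggest a simple handling policy: cache errors are non-fatal and can be ignored, while I/O errors should always propagate. The sketch below encodes that policy. `ExErrorKind`, `ExAction`, and `ex_action_for` are hypothetical names, and retrying on timeout is an assumption of this example rather than documented behavior.

```c
#include <assert.h>

/* Hypothetical mirror of the error variants in the table above. */
typedef enum {
    EX_ERR_IO,
    EX_ERR_PARSING,
    EX_ERR_OCR,
    EX_ERR_VALIDATION,
    EX_ERR_CACHE,
    EX_ERR_IMAGE_PROCESSING,
    EX_ERR_SERIALIZATION,
    EX_ERR_MISSING_DEPENDENCY,
    EX_ERR_PLUGIN,
    EX_ERR_LOCK_POISONED,
    EX_ERR_UNSUPPORTED_FORMAT,
    EX_ERR_EMBEDDING,
    EX_ERR_TIMEOUT,
    EX_ERR_CANCELLED,
    EX_ERR_SECURITY,
    EX_ERR_OTHER
} ExErrorKind;

typedef enum {
    EX_ACTION_IGNORE,    /* log and continue */
    EX_ACTION_RETRY,     /* try the extraction again */
    EX_ACTION_PROPAGATE  /* report the failure to the caller */
} ExAction;

/* Decide how a caller might react to each error kind. */
ExAction ex_action_for(ExErrorKind kind) {
    assert((int)kind <= (int)EX_ERR_OTHER);
    switch (kind) {
    case EX_ERR_CACHE:
        return EX_ACTION_IGNORE;     /* documented as non-fatal */
    case EX_ERR_TIMEOUT:
        return EX_ACTION_RETRY;      /* assumption of this sketch */
    default:
        return EX_ACTION_PROPAGATE;  /* includes Io, which always bubbles up */
    }
}
```

Keeping the policy in one function means a caller that receives NULL from an extraction call has a single place to decide whether to continue, retry, or stop.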
