Elixir API Reference v4.0.0¶
Complete reference for the Kreuzberg Elixir API.
Installation¶
Add to your mix.exs:
Or install from Git:
Core Functions¶
Kreuzberg.extract/3¶
Extract content from binary document data.
Signature:
@spec extract(
binary(),
String.t(),
ExtractionConfig.t() | map() | keyword() | nil
) :: {:ok, ExtractionResult.t()} | {:error, String.t()}
Parameters:
input(binary): Binary document data to extract frommime_type(String): MIME type of the document (for example, "application/pdf", "text/plain")config(ExtractionConfig | map | keyword | nil): Extraction configuration. Uses defaults if nil
Returns:
{:ok, ExtractionResult.t()}: Successfully extracted content with metadata{:error, reason}: Extraction failed with error message
Example - Basic usage:
{:ok, pdf_binary} = File.read("document.pdf")
{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf")
IO.puts(result.content)
IO.puts("Pages: #{result.metadata["page_count"]}")
Example - With configuration struct:
config = %Kreuzberg.ExtractionConfig{
ocr: %{"enabled" => true, "language" => "eng"}
}
{:ok, result} = Kreuzberg.extract(pdf_binary, "application/pdf", config)
Example - With keyword list configuration:
{:ok, result} = Kreuzberg.extract(
pdf_binary,
"application/pdf",
ocr: %{"enabled" => true, "language" => "eng"}
)
Kreuzberg.extract!/3¶
Extract content from binary data, raising on error.
Signature:
@spec extract!(
binary(),
String.t(),
ExtractionConfig.t() | map() | keyword() | nil
) :: ExtractionResult.t()
Parameters:
Same as extract/3.
Returns:
ExtractionResult.t(): Successfully extracted content
Raises:
Kreuzberg.Error: If extraction fails
Example:
result = Kreuzberg.extract!(pdf_binary, "application/pdf")
IO.puts(result.content)
Kreuzberg.extract_file/3¶
Extract content from a file at the given path.
Signature:
@spec extract_file(
String.t() | Path.t(),
String.t() | nil,
ExtractionConfig.t() | map() | keyword() | nil
) :: {:ok, ExtractionResult.t()} | {:error, String.t()}
Parameters:
path(String | Path): File path to extractmime_type(String | nil): Optional MIME type hint. If nil, MIME type is auto-detectedconfig(ExtractionConfig | map | keyword | nil): Extraction configuration
Returns:
{:ok, ExtractionResult.t()}: Successfully extracted content{:error, reason}: Extraction failed with error message
Example - Basic usage:
Example - With explicit MIME type:
Example - With configuration:
config = %Kreuzberg.ExtractionConfig{
ocr: %{"language" => "eng"}
}
{:ok, result} = Kreuzberg.extract_file("scanned.pdf", nil, config)
Kreuzberg.extract_file!/3¶
Extract content from a file, raising on error.
Signature:
@spec extract_file!(
String.t() | Path.t(),
String.t() | nil,
ExtractionConfig.t() | map() | keyword() | nil
) :: ExtractionResult.t()
Parameters:
Same as extract_file/3.
Returns:
ExtractionResult.t(): Successfully extracted content
Raises:
Kreuzberg.Error: If extraction fails
Example:
result = Kreuzberg.extract_file!("document.pdf")
IO.puts("Content: #{String.length(result.content)} characters")
Batch Operations¶
Kreuzberg.batch_extract_files/3¶
Extract content from multiple files in a batch operation.
Signature:
@spec batch_extract_files(
[String.t() | Path.t()],
String.t() | nil,
ExtractionConfig.t() | map() | keyword() | nil
) :: {:ok, [ExtractionResult.t()]} | {:error, String.t()}
Parameters:
paths(list): List of file paths to extractmime_type(String | nil): MIME type for all files (optional, defaults to nil for auto-detection)config(ExtractionConfig | map | keyword | nil): Extraction configuration applied to all files
Returns:
{:ok, results}: List of ExtractionResult structs{:error, reason}: Error message if batch extraction fails
Example:
paths = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
{:ok, results} = Kreuzberg.batch_extract_files(paths)
Enum.each(results, fn result ->
IO.puts("Content: #{String.length(result.content)} characters")
end)
Example - With configuration:
config = %Kreuzberg.ExtractionConfig{
images: %{"enabled" => true}
}
{:ok, results} = Kreuzberg.batch_extract_files(paths, "application/pdf", config)
Kreuzberg.batch_extract_files!/3¶
Extract content from multiple files, raising on error.
Signature:
@spec batch_extract_files!(
[String.t() | Path.t()],
String.t() | nil,
ExtractionConfig.t() | map() | keyword() | nil
) :: [ExtractionResult.t()]
Parameters:
Same as batch_extract_files/3.
Returns:
- List of ExtractionResult structs
Raises:
Kreuzberg.Error: If batch extraction fails
Kreuzberg.batch_extract_bytes/3¶
Extract content from multiple binary inputs in a batch operation.
Signature:
@spec batch_extract_bytes(
[binary()],
String.t() | [String.t()],
ExtractionConfig.t() | map() | keyword() | nil
) :: {:ok, [ExtractionResult.t()]} | {:error, String.t()}
Parameters:
data_list(list): List of binary data inputsmime_types(String | list): List of MIME types (one per input) or single MIME type for allconfig(ExtractionConfig | map | keyword | nil): Extraction configuration applied to all items
Returns:
{:ok, results}: List of ExtractionResult structs{:error, reason}: Error message if batch extraction fails
Example - Single MIME type:
data_list = [pdf_binary1, pdf_binary2, pdf_binary3]
{:ok, results} = Kreuzberg.batch_extract_bytes(data_list, "application/pdf")
Example - Multiple MIME types:
data_list = [pdf_binary, docx_binary, txt_binary]
mime_types = [
"application/pdf",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"text/plain"
]
{:ok, results} = Kreuzberg.batch_extract_bytes(data_list, mime_types)
Kreuzberg.batch_extract_bytes!/3¶
Extract content from multiple binary inputs, raising on error.
Signature:
@spec batch_extract_bytes!(
[binary()],
String.t() | [String.t()],
ExtractionConfig.t() | map() | keyword() | nil
) :: [ExtractionResult.t()]
Parameters:
Same as batch_extract_bytes/3.
Returns:
- List of ExtractionResult structs
Raises:
Kreuzberg.Error: If batch extraction fails
Async Operations¶
Kreuzberg.extract_async/3¶
Extract content from binary data asynchronously using Tasks.
Signature:
@spec extract_async(
binary(),
String.t(),
ExtractionConfig.t() | map() | keyword() | nil
) :: Task.t({:ok, ExtractionResult.t()} | {:error, String.t()})
Parameters:
input(binary): Binary data to extract frommime_type(String): MIME type of the dataconfig(ExtractionConfig | map | keyword | nil): Optional extraction configuration
Returns:
- A Task that will resolve to
{:ok, ExtractionResult.t()}or{:error, String.t()}
Example:
task = Kreuzberg.extract_async(pdf_binary, "application/pdf")
{:ok, result} = Task.await(task)
IO.puts(result.content)
Example - Multiple concurrent extractions:
tasks = [
Kreuzberg.extract_async(pdf1, "application/pdf"),
Kreuzberg.extract_async(pdf2, "application/pdf"),
Kreuzberg.extract_async(pdf3, "application/pdf")
]
results = Task.await_many(tasks)
Kreuzberg.extract_file_async/3¶
Extract content from a file asynchronously using Tasks.
Signature:
@spec extract_file_async(
String.t() | Path.t(),
String.t() | nil,
ExtractionConfig.t() | map() | keyword() | nil
) :: Task.t({:ok, ExtractionResult.t()} | {:error, String.t()})
Parameters:
path(String | Path): File path as String or Path.t()mime_type(String | nil): Optional MIME type (defaults to nil for auto-detection)config(ExtractionConfig | map | keyword | nil): Optional extraction configuration
Returns:
- A Task that will resolve to
{:ok, ExtractionResult.t()}or{:error, String.t()}
Example:
Example - Process multiple files concurrently:
tasks = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
|> Enum.map(&Kreuzberg.extract_file_async/1)
results = Task.await_many(tasks)
Kreuzberg.batch_extract_files_async/3¶
Batch extract content from multiple files asynchronously.
Signature:
@spec batch_extract_files_async(
[String.t() | Path.t()],
String.t() | nil,
ExtractionConfig.t() | map() | keyword() | nil
) :: Task.t({:ok, [ExtractionResult.t()]} | {:error, String.t()})
Parameters:
paths(list): List of file pathsmime_type(String | nil): Optional MIME type for all filesconfig(ExtractionConfig | map | keyword | nil): Optional extraction configuration
Returns:
- A Task that will resolve to
{:ok, [ExtractionResult.t()]}or{:error, String.t()}
Example:
paths = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
task = Kreuzberg.batch_extract_files_async(paths)
{:ok, results} = Task.await(task)
Kreuzberg.batch_extract_bytes_async/3¶
Batch extract content from multiple binary inputs asynchronously.
Signature:
@spec batch_extract_bytes_async(
[binary()],
String.t() | [String.t()],
ExtractionConfig.t() | map() | keyword() | nil
) :: Task.t({:ok, [ExtractionResult.t()]} | {:error, String.t()})
Parameters:
data_list(list): List of binary data inputsmime_types(String | list): Single MIME type string or list of MIME typesconfig(ExtractionConfig | map | keyword | nil): Optional extraction configuration
Returns:
- A Task that will resolve to
{:ok, [ExtractionResult.t()]}or{:error, String.t()}
Example:
data_list = [pdf1_binary, pdf2_binary, pdf3_binary]
task = Kreuzberg.batch_extract_bytes_async(data_list, "application/pdf")
{:ok, results} = Task.await(task)
Cache Management¶
Kreuzberg.cache_stats/0¶
Retrieve statistics about the extraction cache.
Signature:
Returns:
{:ok, stats}: Map with cache statistics"total_files"- Number of cached extraction results"total_size_mb"- Total size of cache in megabytes"available_space_mb"- Available disk space in megabytes"oldest_file_age_days"- Age of oldest cached file in days"newest_file_age_days"- Age of newest cached file in days{:error, reason}: Error message if retrieval fails
Example:
{:ok, stats} = Kreuzberg.cache_stats()
IO.puts("Cache entries: #{stats["total_files"]}")
IO.puts("Cache size: #{stats["total_size_mb"]} MB")
Kreuzberg.cache_stats!/0¶
Retrieve cache statistics, raising on error.
Signature:
Returns:
- Map with cache statistics
Raises:
Kreuzberg.Error: If cache statistics retrieval fails
Example:
stats = Kreuzberg.cache_stats!()
IO.puts("Total cached files: #{stats["total_files"]}")
Kreuzberg.clear_cache/0¶
Clear the extraction cache, removing all cached results.
Signature:
Returns:
:ok: Cache cleared successfully{:error, reason}: Error message if clearing fails
Example:
:ok = Kreuzberg.clear_cache()
{:ok, stats} = Kreuzberg.cache_stats()
IO.puts("Files after clear: #{stats["total_files"]}")
Kreuzberg.clear_cache!/0¶
Clear the cache, raising on error.
Signature:
Raises:
Kreuzberg.Error: If cache clearing fails
Example:
Utility Functions¶
Kreuzberg.detect_mime_type/1¶
Detect the MIME type of binary data using content inspection.
Signature:
Parameters:
data(binary): Binary data to analyze
Returns:
{:ok, mime_type}: Detected MIME type as a string{:error, reason}: Error if detection fails
Example:
{:ok, pdf_binary} = File.read("document.pdf")
{:ok, mime_type} = Kreuzberg.detect_mime_type(pdf_binary)
IO.puts("Detected MIME type: #{mime_type}")
# => "application/pdf"
Kreuzberg.detect_mime_type_from_path/1¶
Detect the MIME type of a file using its path and extension.
Signature:
@spec detect_mime_type_from_path(String.t() | Path.t()) ::
{:ok, String.t()} | {:error, String.t()}
Parameters:
path(String | Path): File path to analyze
Returns:
{:ok, mime_type}: Detected MIME type as a string{:error, reason}: Error if detection fails
Example:
{:ok, mime_type} = Kreuzberg.detect_mime_type_from_path("document.pdf")
IO.puts("MIME type: #{mime_type}")
# => "application/pdf"
Kreuzberg.validate_mime_type/1¶
Validate that a MIME type string is supported by Kreuzberg.
Signature:
Parameters:
mime_type(String): MIME type string to validate
Returns:
{:ok, mime_type}: Returns the MIME type if valid{:error, reason}: Error if MIME type is not supported
Example:
{:ok, _} = Kreuzberg.validate_mime_type("application/pdf")
{:error, reason} = Kreuzberg.validate_mime_type("application/invalid")
Kreuzberg.get_extensions_for_mime/1¶
Get all file extensions associated with a given MIME type.
Signature:
Parameters:
mime_type(String): MIME type string
Returns:
{:ok, extensions}: List of file extensions (without dot){:error, reason}: Error if MIME type is not found
Example:
{:ok, exts} = Kreuzberg.get_extensions_for_mime("application/pdf")
IO.inspect(exts)
# => ["pdf"]
{:ok, exts} = Kreuzberg.get_extensions_for_mime("image/jpeg")
IO.inspect(exts)
# => ["jpg", "jpeg"]
Kreuzberg.list_embedding_presets/0¶
List all available embedding model presets.
Signature:
Returns:
{:ok, presets}: List of preset names as strings{:error, reason}: Error if retrieval fails
Example:
{:ok, presets} = Kreuzberg.list_embedding_presets()
IO.inspect(presets)
# => ["balanced", "fast", "quality", "multilingual"]
Kreuzberg.get_embedding_preset/1¶
Get detailed information about a specific embedding preset.
Signature:
Parameters:
preset_name(String): Name of the embedding preset
Returns:
{:ok, preset_info}: Map containing preset details"name"- Preset name"chunk_size"- Chunk size in tokens"overlap"- Chunk overlap in tokens"dimensions"- Embedding vector dimension"description"- Human-readable description{:error, reason}: Error if preset not found
Example:
{:ok, preset} = Kreuzberg.get_embedding_preset("fast")
IO.puts("Preset: #{preset["name"]}")
IO.puts("Dimensions: #{preset["dimensions"]}")
Kreuzberg.classify_error/1¶
Classify an error message into a semantic error category.
Signature:
Parameters:
error_message(String): Error message string to classify
Returns:
- Error category atom:
:io_error- File I/O related errors:invalid_format- File format errors:invalid_config- Configuration or parameter errors:ocr_error- OCR engine or processing errors:extraction_error- General extraction failures:unknown_error- Errors that don't match other categories
Example:
atom = Kreuzberg.classify_error("File not found: /path/to/file.pdf")
IO.inspect(atom)
# => :io_error
atom = Kreuzberg.classify_error("Invalid PDF format")
IO.inspect(atom)
# => :invalid_format
Configuration¶
Deprecated API
The force_ocr parameter has been deprecated in favor of the new ocr configuration object.
Old pattern (no longer supported):
New pattern:
The new approach provides more granular control over OCR behavior through the OcrConfig struct.
ExtractionConfig¶
Main configuration struct for extraction operations.
Type:
@type t :: %Kreuzberg.ExtractionConfig{
chunking: map() | nil,
concurrency: map() | nil,
enable_quality_processing: boolean(),
force_ocr: boolean(),
html_options: map() | nil,
images: map() | nil,
include_document_structure: boolean(),
keywords: map() | nil,
language_detection: map() | nil,
layout: map() | nil,
max_concurrent_extractions: non_neg_integer() | nil,
ocr: map() | nil,
output_format: String.t(),
pages: map() | nil,
pdf_options: map() | nil,
postprocessor: map() | nil,
result_format: String.t(),
security_limits: map() | nil,
token_reduction: map() | nil,
use_cache: boolean(),
}
Fields:
chunking(map | nil): Text chunking configuration.concurrency(map | nil): Concurrency configuration for extraction parallelization.enable_quality_processing(boolean): Enable quality post-processing (default: true).force_ocr(boolean): Force OCR even for searchable PDFs (default: false).html_options(map | nil): HTML to Markdown conversion options.images(map | nil): Image extraction configuration.include_document_structure(boolean): Include hierarchical document structure in results (default: false).keywords(map | nil): Keyword extraction configuration.language_detection(map | nil): Language detection settings.layout(map | nil): Layout detection configuration.max_concurrent_extractions(non_neg_integer | nil): Maximum concurrent extractions in batch operations.ocr(map | nil): OCR configuration.output_format(String): Content text format —"plain","markdown","djot","html"(default:"plain").pages(map | nil): Page-level extraction configuration.pdf_options(map | nil): PDF-specific options.postprocessor(map | nil): Post-processor configuration.result_format(String): Result structure format —"unified","element_based"(default:"unified").security_limits(map | nil): Security limits for extraction.token_reduction(map | nil): Token reduction settings.use_cache(boolean): Enable result caching (default: true).
Example - Basic configuration:
config = %Kreuzberg.ExtractionConfig{
use_cache: true,
ocr: %Kreuzberg.OcrConfig{backend: "tesseract"}
}
{:ok, result} = Kreuzberg.extract_file("document.pdf", nil, config)
Example - OCR configuration:
config = %Kreuzberg.ExtractionConfig{
ocr: %{
"enabled" => true,
"language" => "eng",
"backend" => "tesseract"
}
}
{:ok, result} = Kreuzberg.extract_file("scanned.pdf", nil, config)
PaddleOCR-specific fields: v4.5.0
When using PaddleOCR, the ocr map supports:
"model_tier"(String): Model tier: "mobile" (lightweight, ~21MB total, fast) or "server" (high accuracy, ~172MB, best with GPU). Default: "mobile""padding"(Integer): Padding in pixels (0-100) added around the image before detection. Default: 10
config = %Kreuzberg.ExtractionConfig{
ocr: %{
"backend" => "paddle-ocr",
"language" => "en",
"model_tier" => "server",
"padding" => 10
}
}
Example - Chunking configuration:
config = %Kreuzberg.ExtractionConfig{
chunking: %{
"enabled" => true,
"chunk_size" => 1024,
"chunk_overlap" => 200,
"chunking_strategy" => "semantic"
}
}
{:ok, result} = Kreuzberg.extract_file("document.pdf", nil, config)
Example - Page extraction:
config = %Kreuzberg.ExtractionConfig{
pages: %{
"extract_pages" => true,
"insert_page_markers" => true,
"marker_format" => "\n\n--- Page {page_num} ---\n\n"
}
}
{:ok, result} = Kreuzberg.extract_file("document.pdf", nil, config)
if result.pages do
Enum.each(result.pages, fn page ->
IO.puts("Page #{page["page_number"]}: #{String.length(page["content"])} chars")
end)
end
Example - Image extraction:
config = %Kreuzberg.ExtractionConfig{
images: %{
"enabled" => true,
"min_width" => 100,
"min_height" => 100,
"format" => "png"
}
}
{:ok, result} = Kreuzberg.extract_file("document.pdf", nil, config)
if result.images do
IO.puts("Extracted #{length(result.images)} images")
end
Example - PDF options with concurrency:
config = %Kreuzberg.ExtractionConfig{
pdf_options: %{
"extract_images" => true,
"extract_annotations" => true,
"allow_single_column_tables" => true
},
concurrency: %{
"max_threads" => 4
}
}
{:ok, result} = Kreuzberg.extract_file("document.pdf", nil, config)
PDF Options Fields:
When configuring pdf_options map:
"allow_single_column_tables"(Boolean): v4.5.0 Allow extraction of single-column tables. Default: false"extract_annotations"(Boolean): Extract PDF annotations. Default: false"extract_images"(Boolean): Extract images from PDF. Default: false"extract_metadata"(Boolean): Extract PDF metadata. Default: true"passwords"(List): Passwords to try for encrypted PDFs. Default: nil
Concurrency Configuration: v4.5.0
When configuring concurrency map:
"max_threads"(Integer): Maximum number of threads for parallel extraction. Default: nil (system default)
LayoutDetectionConfig v4.5.0¶
Configure layout detection for document structure analysis.
Fields:
| Field | Type | Default | Description |
|---|---|---|---|
"confidence_threshold" |
Float|nil | nil | Minimum confidence score (0.0-1.0) for layout detection results. If nil, no filtering applied |
"apply_heuristics" |
Boolean | true | Apply post-processing heuristics to refine layout results |
"table_model" |
String|nil | nil | Table structure recognition model: "tatr" (default), "slanet_wired", "slanet_wireless", "slanet_plus", "slanet_auto" |
Example:
config = %Kreuzberg.ExtractionConfig{
layout: %{
"confidence_threshold" => 0.5,
"apply_heuristics" => true
}
}
{:ok, result} = Kreuzberg.extract_file("document.pdf", nil, config)
if result.document do
IO.puts("Document structure detected")
IO.puts("Sections: #{length(result.document["sections"])}")
end
Results & Types¶
ExtractionResult¶
Result struct returned by all extraction functions.
Type:
@type t :: %Kreuzberg.ExtractionResult{
annotations: [Kreuzberg.DocumentTextAnnotation.t()] | nil,
chunks: [Kreuzberg.Chunk.t()] | nil,
content: String.t(),
detected_languages: [String.t()] | nil,
djot_content: Kreuzberg.DjotContent.t() | nil,
document: Kreuzberg.DocumentStructure.t() | nil,
elements: [Kreuzberg.Element.t()] | nil,
extracted_keywords: [Kreuzberg.Keyword.t()] | nil,
images: [Kreuzberg.Image.t()] | nil,
metadata: Kreuzberg.Metadata.t(),
mime_type: String.t(),
ocr_elements: [Kreuzberg.OcrElement.t()] | nil,
pages: [Kreuzberg.Page.t()] | nil,
processing_warnings: [Kreuzberg.ProcessingWarning.t()],
quality_score: float() | nil,
tables: [Kreuzberg.Table.t()],
}
Fields:
annotations(list | nil): PDF annotations (text notes, highlights, links, stamps) asKreuzberg.DocumentTextAnnotationstructs.chunks(list | nil): List of text chunks with embeddings (Kreuzberg.Chunk) when chunking is enabled.content(String): Extracted text content.detected_languages(list | nil): List of detected language codes (ISO 639-1) if language detection is enabled.djot_content(Kreuzberg.DjotContent | nil): Rich Djot content structure.document(Kreuzberg.DocumentStructure | nil): Hierarchical document structure wheninclude_document_structureis true.elements(list | nil): Semantic elements (Kreuzberg.Element) whenresult_formatis"element_based".extracted_keywords(list | nil): Extracted keywords with scores (Kreuzberg.Keyword).images(list | nil): List of extracted images (Kreuzberg.Image).metadata(Kreuzberg.Metadata): Document metadata.mime_type(String): MIME type of the processed document.ocr_elements(list | nil): OCR elements (Kreuzberg.OcrElement) with bounding geometry and confidence.pages(list | nil): Per-page extracted content (Kreuzberg.Page).processing_warnings(list): Non-fatal warnings (Kreuzberg.ProcessingWarning) from processing pipeline stages.quality_score(float | nil): Document quality score between 0.0 and 1.0.tables(list): List of extracted tables (Kreuzberg.Table).
Example - Basic result access:
{:ok, result} = Kreuzberg.extract_file("document.pdf")
IO.puts("Content: #{result.content}")
IO.puts("MIME type: #{result.mime_type}")
IO.puts("Page count: #{result.metadata["page_count"]}")
IO.puts("Tables: #{length(result.tables)}")
if result.detected_languages do
IO.puts("Languages: #{Enum.join(result.detected_languages, ", ")}")
end
Example - Processing tables:
{:ok, result} = Kreuzberg.extract_file("invoice.pdf")
Enum.each(result.tables, fn table ->
IO.puts("Table on page #{table["page_number"]}:")
IO.puts(table["markdown"])
IO.puts("")
end)
Example - Page iteration:
config = %Kreuzberg.ExtractionConfig{
pages: %{"extract_pages" => true}
}
{:ok, result} = Kreuzberg.extract_file("document.pdf", nil, config)
if result.pages do
Enum.each(result.pages, fn page ->
IO.puts("Page #{page["page_number"]}:")
IO.puts(" Content: #{String.length(page["content"])} chars")
IO.puts(" Tables: #{length(page["tables"])}")
IO.puts(" Images: #{length(page["images"])}")
end)
end
Metadata¶
Document metadata dictionary. Fields vary by document format.
Common Fields:
title(String | nil): Document title.subject(String | nil): Document subject or description.authors(list(String) | nil): List of author names.keywords(list(String) | nil): List of keywords.language(String | nil): Primary language (ISO 639-1 code).created_at(String | nil): Creation date (ISO 8601).modified_at(String | nil): Last modification date (ISO 8601).created_by(String | nil): Application that created the document.modified_by(String | nil): Application that last modified the document.pages(Kreuzberg.PageStructure | nil): Page structure information.format(map | nil): Format-specific metadata (flattened fields).image_preprocessing(Kreuzberg.ImagePreprocessingMetadata | nil): Image preprocessing metadata.json_schema(map | nil): JSON schema if applicable.error(Kreuzberg.ErrorMetadata | nil): Error metadata if extraction partially failed.category(String | nil): Document category classification.tags(list(String) | nil): List of document tags.document_version(String | nil): Version of the document.abstract_text(String | nil): Abstract or summary of the document.output_format(String | nil): Output format used for extraction.extraction_duration_ms(non_neg_integer | nil): Time taken for extraction in milliseconds.additional(map): Additional metadata fields.
Example:
{:ok, result} = Kreuzberg.extract_file("document.pdf")
metadata = result.metadata
if metadata["format_type"] == "pdf" do
IO.puts("Title: #{metadata["title"]}")
IO.puts("Author: #{metadata["author"]}")
IO.puts("Pages: #{metadata["page_count"]}")
end
Table¶
Extracted table structure.
Fields:
cells(list(list(String))): 2D array of table cells (rows x columns).markdown(String): Table rendered as markdown.page_number(non_neg_integer): Page number where table was found (0-indexed).bounding_box(Kreuzberg.BoundingBox | nil): Bounding box coordinates if available.
Example:
{:ok, result} = Kreuzberg.extract_file("invoice.pdf")
Enum.each(result.tables, fn table ->
IO.puts("Table on page #{table.page_number}:")
IO.puts(table.markdown)
# Access raw cell data
Enum.each(table.cells, fn row ->
IO.inspect(row)
end)
end)
Plugin System¶
Kreuzberg.extract_with_plugins/4¶
Extract content with plugin processing support.
Signature:
@spec extract_with_plugins(
binary(),
String.t(),
ExtractionConfig.t() | map() | keyword() | nil,
keyword()
) :: {:ok, ExtractionResult.t()} | {:error, String.t()}
Parameters:
input(binary): Binary document data to extract frommime_type(String): MIME type of the documentconfig(ExtractionConfig | map | keyword | nil): Extraction configuration (optional)plugin_opts(keyword): Plugin options (optional)::validators- List of validator modules to run before extraction:post_processors- Map of stage atoms to lists of post-processor modules:early- Applied first to extraction result:middle- Applied after early processors:late- Applied last before final validators
:final_validators- List of validator modules to run after post-processing
Returns:
{:ok, ExtractionResult.t()}: Successfully extracted and processed content{:error, reason}: Extraction or processing failed
Example - With validators and post-processors:
{:ok, result} = Kreuzberg.extract_with_plugins(
pdf_binary,
"application/pdf",
nil,
validators: [MyApp.InputValidator],
post_processors: %{
early: [MyApp.EarlyProcessor],
middle: [MyApp.MiddleProcessor],
late: [MyApp.FinalProcessor]
},
final_validators: [MyApp.ResultValidator]
)
Example - With only post-processors:
{:ok, result} = Kreuzberg.extract_with_plugins(
pdf_binary,
"application/pdf",
%{use_cache: true},
post_processors: %{
early: [MyApp.Processor1, MyApp.Processor2]
}
)
Kreuzberg.Plugin.register_post_processor/2¶
Register a custom post-processor plugin.
Signature:
Parameters:
name(atom): Unique identifier for the post-processormodule(module): Module implementing the post-processor interface
Module Interface:
The post-processor module should implement:
process(data)- Applies custom processing to extraction result data
Example:
defmodule MyApp.TextNormalizer do
def process(result) do
normalized_content = result.content
|> String.trim()
|> String.downcase()
{:ok, %{result | content: normalized_content}}
end
end
:ok = Kreuzberg.Plugin.register_post_processor(:normalizer, MyApp.TextNormalizer)
Kreuzberg.Plugin.register_validator/1¶
Register a custom validator plugin.
Signature:
Parameters:
module(module): Module implementing the validator interface
Module Interface:
The validator module should implement:
validate(data)- Validates data and returns:okor{:error, reason}
Example:
defmodule MyApp.StrictValidator do
def validate(data) do
if data && data != "" do
:ok
else
{:error, "Data is empty"}
end
end
end
:ok = Kreuzberg.Plugin.register_validator(MyApp.StrictValidator)
Kreuzberg.Plugin.register_ocr_backend/1¶
Register a custom OCR backend plugin.
Signature:
Parameters:
module(module): Module implementing the OCR backend interface
Module Interface:
The OCR backend module should implement:
recognize(image_data, language)- Performs OCR on image datasupported_languages()- Returns list of supported language codes
Example:
defmodule MyApp.CustomOCRBackend do
def recognize(image_data, language) do
# Custom OCR logic
{:ok, "Extracted text"}
end
def supported_languages do
["en", "de", "fr"]
end
end
:ok = Kreuzberg.Plugin.register_ocr_backend(MyApp.CustomOCRBackend)
Validation¶
Kreuzberg.validate_chunking_params/1¶
Validate chunking configuration parameters.
Signature:
Parameters:
params(map): Map with keys:"max_chars"or:max_chars- Maximum characters per chunk (required)"max_overlap"or:max_overlap- Overlap between chunks (required)
Returns:
:ok: Parameters are valid{:error, reason}: Parameters are invalid
Example:
:ok = Kreuzberg.validate_chunking_params(%{
"max_chars" => 1000,
"max_overlap" => 200
})
Kreuzberg.validate_language_code/1¶
Validate an ISO 639 language code.
Signature:
Parameters:
code(String): Language code string (for example, "en", "eng", "de", "deu")
Returns:
:ok: Language code is valid{:error, reason}: Language code is invalid
Example:
:ok = Kreuzberg.validate_language_code("en")
:ok = Kreuzberg.validate_language_code("eng")
{:error, _} = Kreuzberg.validate_language_code("invalid")
Kreuzberg.validate_dpi/1¶
Validate a DPI (dots per inch) value.
Signature:
Parameters:
dpi(integer): Positive integer representing DPI
Returns:
:ok: DPI value is valid{:error, reason}: DPI value is invalid
Example:
Kreuzberg.validate_confidence/1¶
Validate a confidence threshold value.
Signature:
Parameters:
confidence(float): Confidence threshold between 0.0 and 1.0
Returns:
:ok: Confidence value is valid{:error, reason}: Confidence value is invalid
Example:
:ok = Kreuzberg.validate_confidence(0.5)
{:error, _} = Kreuzberg.validate_confidence(1.5)
Kreuzberg.validate_ocr_backend/1¶
Validate an OCR backend name.
Signature:
Parameters:
backend(String): OCR backend name
Valid Backends:
- "tesseract"
- "easyocr"
- "paddleocr"
Example:
:ok = Kreuzberg.validate_ocr_backend("tesseract")
{:error, _} = Kreuzberg.validate_ocr_backend("invalid_backend")
PDF Rendering¶
Added in v4.6.2
Kreuzberg.render_pdf_page/3¶
Render a single page of a PDF as a PNG image.
Signature:
@spec render_pdf_page(String.t(), non_neg_integer(), keyword()) :: {:ok, binary()} | {:error, term()}
def render_pdf_page(path, page_index, opts \\ [])
Parameters:
path(String.t()): Path to the PDF filepage_index(non_neg_integer()): Zero-based page index to render
Options:
:dpi(integer): Resolution for rendering (default 150)
Returns:
{:ok, binary()}: PNG-encoded binary for the requested page{:error, reason}: Error tuple if file cannot be read, rendered, or page index is out of bounds
Example:
{:ok, png} = Kreuzberg.render_pdf_page("document.pdf", 0)
File.write!("first_page.png", png)
Error Handling¶
Kreuzberg.Error¶
Exception struct for Kreuzberg extraction errors.
Type:
@type t :: %Kreuzberg.Error{
message: String.t() | nil,
reason: atom() | nil,
context: map() | nil
}
Error Reasons:
:invalid_format- File format errors:invalid_config- Configuration or parameter errors:ocr_error- OCR engine or processing errors:extraction_error- General extraction failures:io_error- File I/O related errors:nif_error- NIF-related errors:unknown_error- Errors that don't match other categories
Example - Basic error handling:
try do
result = Kreuzberg.extract_file!("document.pdf")
IO.puts(result.content)
rescue
e in Kreuzberg.Error ->
IO.puts("Extraction failed: #{e.message}")
IO.puts("Error type: #{e.reason}")
end
Example - Pattern matching:
case Kreuzberg.extract_file("document.pdf") do
{:ok, result} ->
IO.puts(result.content)
{:error, reason} ->
error_type = Kreuzberg.classify_error(reason)
case error_type do
:io_error ->
IO.puts("File not found or cannot be read")
:invalid_format ->
IO.puts("Unsupported or corrupted file format")
:ocr_error ->
IO.puts("OCR processing failed")
_ ->
IO.puts("Extraction failed: #{reason}")
end
end
Advanced Usage¶
Concurrent Processing with Tasks¶
Process multiple documents concurrently using Elixir's Task module:
files = ["doc1.pdf", "doc2.pdf", "doc3.pdf", "doc4.pdf"]
# Create tasks for concurrent processing
tasks = Enum.map(files, fn file ->
Task.async(fn ->
case Kreuzberg.extract_file(file) do
{:ok, result} -> {file, :ok, result}
{:error, reason} -> {file, :error, reason}
end
end)
end)
# Wait for all tasks to complete
results = Task.await_many(tasks, :timer.minutes(5))
# Process results
Enum.each(results, fn
{file, :ok, result} ->
IO.puts("#{file}: #{String.length(result.content)} characters")
{file, :error, reason} ->
IO.puts("#{file}: Failed - #{reason}")
end)
Streaming Large Documents¶
For very large documents, consider processing in chunks:
config = %Kreuzberg.ExtractionConfig{
chunking: %{
"enabled" => true,
"chunk_size" => 1024,
"chunk_overlap" => 200
}
}
{:ok, result} = Kreuzberg.extract_file("large_document.pdf", nil, config)
# Process chunks individually
if result.chunks do
result.chunks
|> Stream.each(fn chunk ->
# Process each chunk
content = chunk["content"]
metadata = chunk["metadata"]
IO.puts("Chunk: #{String.length(content)} chars")
IO.puts("Byte range: #{metadata["byte_start"]}-#{metadata["byte_end"]}")
end)
|> Stream.run()
end
Custom Post-Processing Pipeline¶
Build a custom processing pipeline using plugins:
defmodule MyApp.Pipeline do
# Early processor - clean HTML
defmodule HTMLCleaner do
def process(result) do
cleaned_content = result.content
|> String.replace(~r/<[^>]+>/, "")
|> String.trim()
{:ok, %{result | content: cleaned_content}}
end
end
# Middle processor - normalize whitespace
defmodule WhitespaceNormalizer do
def process(result) do
normalized_content = result.content
|> String.replace(~r/\s+/, " ")
|> String.trim()
{:ok, %{result | content: normalized_content}}
end
end
# Late processor - add metadata
defmodule MetadataEnricher do
def process(result) do
enriched_metadata = Map.merge(result.metadata, %{
"processed_at" => DateTime.utc_now() |> DateTime.to_iso8601(),
"word_count" => result.content |> String.split() |> length()
})
{:ok, %{result | metadata: enriched_metadata}}
end
end
def extract_with_pipeline(path) do
Kreuzberg.extract_with_plugins(
File.read!(path),
"application/pdf",
nil,
post_processors: %{
early: [HTMLCleaner],
middle: [WhitespaceNormalizer],
late: [MetadataEnricher]
}
)
end
end
# Use the pipeline
{:ok, result} = MyApp.Pipeline.extract_with_pipeline("document.pdf")
IO.puts("Word count: #{result.metadata["word_count"]}")
Type Reference¶
BoundingBox¶
Coordinates for element positioning.
Fields:
x0(float): Left x-coordinate.y0(float): Bottom y-coordinate.x1(float): Right x-coordinate.y1(float): Top y-coordinate.
Chunk¶
A text fragment with embedding.
Fields:
content(String): Text content.embedding(list(float) | nil): Vector embedding.metadata(Kreuzberg.ChunkMetadata): Positioning metadata.
ChunkMetadata¶
Positioning and context for a chunk.
Fields:
byte_start(non_neg_integer): Start byte offset.byte_end(non_neg_integer): End byte offset.chunk_index(non_neg_integer): Position index.first_page(non_neg_integer | nil): First page number.heading_context(map | nil): Hierarchical heading context.last_page(non_neg_integer | nil): Last page number.token_count(non_neg_integer | nil): Number of tokens.total_chunks(non_neg_integer): Total chunk count.
DjotContent¶
Structured Djot document.
Fields:
attributes(list): Element attributes.blocks(list(Kreuzberg.DjotFormattedBlock)): Block-level elements.footnotes(list(Kreuzberg.DjotFootnote)): Footnote definitions.images(list(Kreuzberg.DjotImage)): Extracted images.links(list(Kreuzberg.DjotLink)): Extracted links.metadata(Kreuzberg.Metadata): Document metadata.plain_text(String): Plain text fallback.tables(list(Kreuzberg.Table)): Extracted tables.
DocumentNode¶
A node in the hierarchical document tree.
Fields:
annotations(list(Kreuzberg.DocumentTextAnnotation)): Inline annotations.bbox(Kreuzberg.BoundingBox | nil): Node bounding box.children(list(non_neg_integer)): Indices of child nodes.content(map): Type-specific content data.content_layer(String | nil): Layer (body, header, etc.).id(String): Unique node identifier.node_type(String): Semantic type description.page_number(non_neg_integer | nil): Starting page.page_number_end(non_neg_integer | nil): Ending page.parent(non_neg_integer | nil): Index of parent node.
DocumentStructure¶
Hierarchical representation of the document.
Fields:
nodes(list(Kreuzberg.DocumentNode)): Flat list of nodes forming a tree.
Element¶
Semantic element (Unstructured compatibility).
Fields:
element_id(String): Deterministic unique ID.element_type(atom): Semantic type (for example,:narrative_text).metadata(Kreuzberg.ElementMetadata): Positioning and source metadata.text(String): Text content.
Image¶
Extracted image with metadata.
Fields:
bits_per_component(non_neg_integer | nil): Bit depth.bounding_box(map | nil): Box coordinates.colorspace(String | nil): Color space (for example, "RGB").data(binary): Raw image data.description(String | nil): Alt text or description.format(String): Format ("png", "jpeg", etc.).height(non_neg_integer | nil): Pixel height.image_index(non_neg_integer): Position in document.is_mask(boolean): Whether it's a mask image.ocr_result(Kreuzberg.ExtractionResult | nil): Recursive OCR results.page_number(non_neg_integer | nil): Page where found.width(non_neg_integer | nil): Pixel width.
Keyword¶
Extracted keyword with relevance score.
Fields:
score(float): Relevance score (0.0 to 1.0).text(String): Keyword text.
OcrBoundingGeometry¶
Geometry for OCR elements.
Fields:
height(float): Height in pixels.left(float): X-coordinate.top(float): Y-coordinate.type(String): Geometry type ("rect", etc.).width(float): Width in pixels.
OcrElement¶
Detailed OCR element.
Fields:
backend_metadata(map | nil): Backend-specific data.confidence(Kreuzberg.OcrConfidence | nil): Confidence scores.geometry(Kreuzberg.OcrBoundingGeometry | nil): Relative positioning.level(String | nil): Hierarchy level ("word", "line", etc.).page_number(non_neg_integer | nil): Page where found.parent_id(String | nil): ID of parent element.rotation(map | nil): Angle and orientation.text(String): Recognized text.
Page¶
Single page from a document.
Fields:
content(String): Page text.hierarchy(map | nil): Structural hierarchy.images(list(Kreuzberg.Image)): Page images.is_blank(boolean | nil): Blank page detection.page_number(non_neg_integer): 0-indexed page number.tables(list(Kreuzberg.Table)): Page tables.
ProcessingWarning¶
Warning from the extraction pipeline.
Fields:
message(String): Warning description.source(String): Component that issued the warning.
Complete Type Specifications¶
# Core types
@type extraction_result :: {:ok, ExtractionResult.t()} | {:error, String.t()}
@type batch_result :: {:ok, [ExtractionResult.t()]} | {:error, String.t()}
# Config & Options
@type config_option :: ExtractionConfig.t() | map() | keyword() | nil
# Async Task types
@type async_result :: Task.t(extraction_result())
@type async_batch_result :: Task.t(batch_result())
# Plugin types
@type ocr_backend :: module()
@type post_processor :: module()
@type validator :: module()
# Status & Stats
@type cache_stats :: %{
required(String.t()) => non_neg_integer() | float()
}
# Errors
@type error_reason ::
:extraction_error |
:invalid_config |
:invalid_format |
:io_error |
:nif_error |
:ocr_error |
:unknown_error
System Requirements¶
Elixir: 1.12 or higher
Erlang/OTP: 24 or higher
Native Dependencies:
- Tesseract OCR (for OCR support):
brew install tesseract(macOS) orapt-get install tesseract-ocr(Ubuntu)
Platforms:
- Linux (x64, arm64)
- MacOS (x64, arm64)
- Windows (x64)
Thread Safety¶
All Kreuzberg functions are process-safe and can be called from multiple Elixir processes concurrently. The underlying NIF implementation uses a thread pool for parallel processing.
Example - Concurrent extraction:
files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
tasks = Enum.map(files, fn file ->
Task.async(fn ->
Kreuzberg.extract_file(file)
end)
end)
results = Task.await_many(tasks)
However, for better performance with multiple files, use the batch API:
# More efficient approach
files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
{:ok, results} = Kreuzberg.batch_extract_files(files)
LLM Integration¶
Kreuzberg integrates with LLMs via the liter-llm crate for structured extraction and VLM-based OCR. The Elixir binding passes LLM configuration as map options through the Rustler NIF layer. See the LLM Integration Guide for full details.
Structured Extraction¶
Pass structured_extraction config to extract structured data from documents using an LLM:
config = %{
structured_extraction: %{
schema: %{
"type" => "object",
"properties" => %{
"title" => %{"type" => "string"},
"authors" => %{"type" => "array", "items" => %{"type" => "string"}},
"date" => %{"type" => "string"}
},
"required" => ["title", "authors", "date"],
"additionalProperties" => false
},
llm: %{model: "openai/gpt-4o-mini"},
strict: true
}
}
{:ok, result} = Kreuzberg.extract_file("paper.pdf", config)
case result.structured_output do
nil -> IO.puts("No structured output")
output -> IO.puts(output)
end
VLM OCR¶
Use a vision-language model as an OCR backend:
config = %{
force_ocr: true,
ocr: %{
backend: "vlm",
vlm_config: %{model: "openai/gpt-4o-mini"}
}
}
{:ok, result} = Kreuzberg.extract_file("scan.pdf", config)
For configuration details including API keys, model selection, and provider setup, see the LLM Integration Guide.
Version Information¶
Check the Kreuzberg version:
Additional Resources¶
License¶
Kreuzberg Elixir package is released under the same license as the main Kreuzberg project.