Rust API Reference¶
Complete reference for the Kreuzberg Rust API.
Installation¶
Add to your Cargo.toml:
With specific features:
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf", "ocr", "chunking", "api"] }
Available features:
pdf- PDF extraction support (enabled by default)ocr- OCR support with Tesseractchunking- Text chunking algorithmslanguage-detection- Language detectionkeywords-yake- YAKE keyword extractionkeywords-rake- RAKE keyword extractionapi- HTTP API server supportmcp- Model Context Protocol server support
Core Functions¶
extract_file_sync()¶
Extract content from a file (synchronous, blocking).
Signature:
pub fn extract_file_sync(
file_path: impl AsRef<Path>,
mime_type: Option<&str>,
config: &ExtractionConfig
) -> Result<ExtractionResult>
Parameters:
file_path(impl AsRef): Path to the file to extract mime_type(Option<&str>): Optional MIME type hint. If None, MIME type is auto-detectedconfig(&ExtractionConfig): Extraction configuration reference
Returns:
Result<ExtractionResult>: Result containing extraction result or error
Errors:
KreuzbergError::Io- File system errors (file not found, permission denied, etc.)KreuzbergError::Validation- Invalid configuration or file pathKreuzbergError::Parsing- Document parsing failureKreuzbergError::Ocr- OCR processing failureKreuzbergError::MissingDependency- Required system dependency not found
Examples:
use kreuzberg::{extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
// Extract a document synchronously with default configuration
let config = ExtractionConfig::default();
let result = extract_file_sync("document.pdf", None, &config)?;
println!("Content: {}", result.content);
println!("Pages: {}", result.metadata.page_count.unwrap_or(0));
Ok(())
}
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
// Configure OCR for scanned documents
let config = ExtractionConfig {
ocr: Some(OcrConfig::default()),
force_ocr: false,
..Default::default()
};
let result = extract_file_sync("scanned.pdf", None, &config)?;
println!("Extracted: {}", result.content);
Ok(())
}
extract_file()¶
Extract content from a file (asynchronous).
Signature:
pub async fn extract_file(
file_path: impl AsRef<Path>,
mime_type: Option<&str>,
config: &ExtractionConfig
) -> Result<ExtractionResult>
Parameters:
Same as extract_file_sync().
Returns:
Result<ExtractionResult>: Result containing extraction result or error
Examples:
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
// Extract a document asynchronously
let config = ExtractionConfig::default();
let result = extract_file("document.pdf", None, &config).await?;
println!("Content: {}", result.content);
Ok(())
}
extract_bytes_sync()¶
Extract content from bytes (synchronous, blocking).
Signature:
pub fn extract_bytes_sync(
data: &[u8],
mime_type: &str,
config: &ExtractionConfig
) -> Result<ExtractionResult>
Parameters:
data(&[u8]): File content as byte slicemime_type(&str): MIME type of the data (required for format detection)config(&ExtractionConfig): Extraction configuration reference
Returns:
Result<ExtractionResult>: Result containing extraction result or error
Examples:
use kreuzberg::{extract_bytes_sync, ExtractionConfig};
use std::fs;
fn main() -> kreuzberg::Result<()> {
// Extract from in-memory byte array
let data = fs::read("document.pdf")?;
let config = ExtractionConfig::default();
let result = extract_bytes_sync(&data, "application/pdf", &config)?;
println!("Content: {}", result.content);
Ok(())
}
extract_bytes()¶
Extract content from bytes (asynchronous).
Signature:
pub async fn extract_bytes(
data: &[u8],
mime_type: &str,
config: &ExtractionConfig
) -> Result<ExtractionResult>
Parameters:
Same as extract_bytes_sync().
Returns:
Result<ExtractionResult>: Result containing extraction result or error
batch_extract_file_sync()¶
Extract content from multiple files in parallel (synchronous, blocking).
Signature:
pub fn batch_extract_file_sync(
paths: &[impl AsRef<Path>],
mime_types: Option<&[&str]>,
config: &ExtractionConfig
) -> Result<Vec<ExtractionResult>>
Parameters:
paths(&[impl AsRef]): Slice of file paths to extract mime_types(Option<&[&str]>): Optional MIME type hints (must match paths length if provided)config(&ExtractionConfig): Extraction configuration applied to all files
Returns:
Result<Vec<ExtractionResult>>: Result containing vector of extraction results
Examples:
use kreuzberg::{batch_extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
// Process multiple files in parallel for better performance
let paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
let config = ExtractionConfig::default();
let results = batch_extract_file_sync(&paths, None, &config)?;
// Display results for each file
for (i, result) in results.iter().enumerate() {
println!("{}: {} characters", paths[i], result.content.len());
}
Ok(())
}
batch_extract_file()¶
Extract content from multiple files in parallel (asynchronous).
Signature:
pub async fn batch_extract_file(
paths: &[impl AsRef<Path>],
mime_types: Option<&[&str]>,
config: &ExtractionConfig
) -> Result<Vec<ExtractionResult>>
Parameters:
Same as batch_extract_file_sync().
Returns:
Result<Vec<ExtractionResult>>: Result containing vector of extraction results
Examples:
use kreuzberg::{batch_extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
// Process multiple files asynchronously in parallel
let files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
let config = ExtractionConfig::default();
let results = batch_extract_file(&files, None, &config).await?;
// Print extracted content from each file
for result in results {
println!("{}", result.content);
}
Ok(())
}
batch_extract_bytes_sync()¶
Extract content from multiple byte arrays in parallel (synchronous, blocking).
Signature:
pub fn batch_extract_bytes_sync(
data_list: &[&[u8]],
mime_types: &[&str],
config: &ExtractionConfig
) -> Result<Vec<ExtractionResult>>
Parameters:
data_list(&[&[u8]]): Slice of file contents as byte slicesmime_types(&[&str]): Slice of MIME types (must match data_list length)config(&ExtractionConfig): Extraction configuration applied to all items
Returns:
Result<Vec<ExtractionResult>>: Result containing vector of extraction results
batch_extract_bytes()¶
Extract content from multiple byte arrays in parallel (asynchronous).
Signature:
pub async fn batch_extract_bytes(
data_list: &[&[u8]],
mime_types: &[&str],
config: &ExtractionConfig
) -> Result<Vec<ExtractionResult>>
Parameters:
Same as batch_extract_bytes_sync().
Returns:
Result<Vec<ExtractionResult>>: Result containing vector of extraction results
Configuration¶
ExtractionConfig¶
Main configuration struct for extraction operations.
Definition:
#[derive(Debug, Clone, Default)]
pub struct ExtractionConfig {
pub ocr: Option<OcrConfig>,
pub force_ocr: bool,
pub pdf_options: Option<PdfConfig>,
pub chunking: Option<ChunkingConfig>,
pub language_detection: Option<LanguageDetectionConfig>,
pub token_reduction: Option<TokenReductionConfig>,
pub image_extraction: Option<ImageExtractionConfig>,
pub post_processor: Option<PostProcessorConfig>,
}
Fields:
ocr(Option): OCR configuration. Default: None (no OCR) force_ocr(bool): Force OCR even for text-based PDFs. Default: falsepdf_options(Option): PDF-specific configuration. Default: None chunking(Option): Text chunking configuration. Default: None language_detection(Option): Language detection configuration. Default: None token_reduction(Option): Token reduction configuration. Default: None image_extraction(Option): Image extraction from documents. Default: None post_processor(Option): Post-processing configuration. Default: None
Example:
use kreuzberg::{ExtractionConfig, OcrConfig, PdfConfig};
// Configure extraction with OCR and PDF-specific options
let config = ExtractionConfig {
ocr: Some(OcrConfig::default()),
force_ocr: false,
pdf_options: Some(PdfConfig {
passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
extract_images: true,
image_dpi: 300,
}),
..Default::default()
};
OcrConfig¶
OCR processing configuration.
Definition:
#[derive(Debug, Clone)]
pub struct OcrConfig {
pub backend: String,
pub language: String,
pub tesseract_config: Option<TesseractConfig>,
}
Fields:
backend(String): OCR backend to use. Options: "tesseract". Default: "tesseract"language(String): Language code for OCR (ISO 639-3). Default: "eng"tesseract_config(Option): Tesseract-specific configuration. Default: None
Example:
use kreuzberg::OcrConfig;
// Configure OCR backend and language settings
let ocr_config = OcrConfig {
backend: "tesseract".to_string(),
language: "eng".to_string(),
tesseract_config: None,
};
TesseractConfig¶
Tesseract OCR backend configuration.
Definition:
#[derive(Debug, Clone)]
pub struct TesseractConfig {
pub psm: i32,
pub oem: i32,
pub enable_table_detection: bool,
pub tessedit_char_whitelist: Option<String>,
pub tessedit_char_blacklist: Option<String>,
}
Fields:
psm(i32): Page segmentation mode (0-13). Default: 3 (auto)oem(i32): OCR engine mode (0-3). Default: 3 (LSTM only)enable_table_detection(bool): Enable table detection and extraction. Default: falsetessedit_char_whitelist(Option): Character whitelist. Default: None tessedit_char_blacklist(Option): Character blacklist. Default: None
Example:
use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig};
// Configure Tesseract with custom settings for numeric extraction
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: "eng".to_string(),
tesseract_config: Some(TesseractConfig {
psm: 6, // Assume uniform block of text
oem: 3, // LSTM neural net mode
enable_table_detection: true,
tessedit_char_whitelist: Some("0123456789".to_string()), // Only recognize digits
tessedit_char_blacklist: None,
}),
}),
..Default::default()
};
PdfConfig¶
PDF-specific configuration.
Definition:
#[derive(Debug, Clone, Default)]
pub struct PdfConfig {
pub passwords: Option<Vec<String>>,
pub extract_images: bool,
pub image_dpi: u32,
}
Fields:
passwords(Option>): List of passwords to try for encrypted PDFs. Default: None extract_images(bool): Extract images from PDF. Default: falseimage_dpi(u32): DPI for image extraction. Default: 300
Example:
use kreuzberg::PdfConfig;
// Configure PDF extraction with passwords and image settings
let pdf_config = PdfConfig {
passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
extract_images: true,
image_dpi: 300,
};
ChunkingConfig¶
Text chunking configuration for splitting long documents.
Definition:
#[derive(Debug, Clone)]
pub struct ChunkingConfig {
pub chunk_size: usize,
pub chunk_overlap: usize,
pub chunking_strategy: String,
}
Fields:
chunk_size(usize): Maximum chunk size in tokens. Default: 512chunk_overlap(usize): Overlap between chunks in tokens. Default: 50chunking_strategy(String): Chunking strategy. Options: "fixed", "semantic". Default: "fixed"
LanguageDetectionConfig¶
Language detection configuration.
Definition:
#[derive(Debug, Clone)]
pub struct LanguageDetectionConfig {
pub enabled: bool,
pub confidence_threshold: f64,
}
Fields:
enabled(bool): Enable language detection. Default: trueconfidence_threshold(f64): Minimum confidence threshold (0.0-1.0). Default: 0.5
Results & Types¶
ExtractionResult¶
Result struct returned by all extraction functions.
Definition:
#[derive(Debug, Clone)]
pub struct ExtractionResult {
pub content: String,
pub mime_type: String,
pub metadata: Metadata,
pub tables: Vec<Table>,
pub detected_languages: Option<Vec<String>>,
}
Fields:
content(String): Extracted text contentmime_type(String): MIME type of the processed documentmetadata(Metadata): Document metadata (format-specific fields)tables(Vec): Vector of extracted tables
detected_languages(Option>): Vector of detected language codes (ISO 639-1) if language detection is enabled pages(Option>): Per-page extracted content when page extraction is enabled via PageConfig.extract_pages = trueExample:
result_access.rsuse kreuzberg::{extract_file_sync, ExtractionConfig}; fn main() -> kreuzberg::Result<()> { let config = ExtractionConfig::default(); let result = extract_file_sync("document.pdf", None, &config)?; // Access extraction result fields println!("Content: {}", result.content); println!("MIME type: {}", result.mime_type); println!("Tables: {}", result.tables.len()); // Display detected languages if available if let Some(langs) = result.detected_languages { println!("Languages: {}", langs.join(", ")); } Ok(()) }pages¶
Type:
Option<Vec<PageContent>>Per-page extracted content when page extraction is enabled via
PageConfig.extract_pages = true.Each page contains: - Page number (1-indexed) - Text content for that page - Tables on that page - Images on that page
Example:
page_extraction.rsuse kreuzberg::{extract_file_sync, ExtractionConfig, PageConfig}; fn main() -> kreuzberg::Result<()> { let config = ExtractionConfig { pages: Some(PageConfig { extract_pages: true, ..Default::default() }), ..Default::default() }; let result = extract_file_sync("document.pdf", None, &config)?; if let Some(pages) = result.pages { for page in pages { println!("Page {}:", page.page_number); println!(" Content: {} chars", page.content.len()); println!(" Tables: {}", page.tables.len()); println!(" Images: {}", page.images.len()); } } Ok(()) }
Accessing Per-Page Content¶
When page extraction is enabled, access individual pages and iterate over them:
iterate_pages.rsuse kreuzberg::{extract_file_sync, ExtractionConfig, PageConfig}; fn main() -> kreuzberg::Result<()> { let config = ExtractionConfig { pages: Some(PageConfig { extract_pages: true, insert_page_markers: true, marker_format: "\n\n--- Page {page_num} ---\n\n".to_string(), }), ..Default::default() }; let result = extract_file_sync("document.pdf", None, &config)?; // Access combined content with page markers println!("Combined content with markers:"); println!("{}", &result.content[..result.content.len().min(500)]); println!(); // Access per-page content if let Some(pages) = result.pages { for page in pages { println!("Page {}:", page.page_number); println!(" {}", &page.content[..page.content.len().min(100)]); if !page.tables.is_empty() { println!(" Found {} table(s)", page.tables.len()); } if !page.images.is_empty() { println!(" Found {} image(s)", page.images.len()); } } } Ok(()) }
Metadata¶
Document metadata with format-specific fields.
Definition:
Rust#[derive(Debug, Clone, Default)] pub struct Metadata { // Common fields pub title: Option<String>, pub subject: Option<String>, pub authors: Option<Vec<String>>, pub keywords: Option<Vec<String>>, pub language: Option<String>, pub created_at: Option<String>, pub modified_at: Option<String>, pub created_by: Option<String>, pub modified_by: Option<String>, pub pages: Option<PageStructure>, pub format: Option<FormatMetadata>, pub image_preprocessing: Option<ImagePreprocessingMetadata>, pub json_schema: Option<serde_json::Value>, pub error: Option<ErrorMetadata>, pub additional: HashMap<String, serde_json::Value>, }Example:
metadata_access.rslet result = extract_file_sync("document.pdf", None, &config)?; let metadata = &result.metadata; // Access PDF-specific metadata fields if metadata.format_type.as_deref() == Some("pdf") { if let Some(title) = &metadata.title { println!("Title: {}", title); } if let Some(pages) = metadata.page_count { println!("Pages: {}", pages); } }See the Types Reference for complete metadata field documentation.
Table¶
Extracted table structure.
Definition:
Rust#[derive(Debug, Clone)] pub struct Table { pub cells: Vec<Vec<String>>, pub markdown: String, pub page_number: usize, }Fields:
cells(Vec>): 2D vector of table cells (rows x columns) markdown(String): Table rendered as markdownpage_number(usize): Page number where table was found
Example:
table_processing.rslet result = extract_file_sync("invoice.pdf", None, &config)?; // Process all extracted tables for table in &result.tables { println!("Table on page {}:", table.page_number); println!("{}", table.markdown); println!(); }
ChunkMetadata¶
Metadata for a single text chunk.
Definition:
Rustpub struct ChunkMetadata { pub byte_start: usize, pub byte_end: usize, pub char_count: usize, pub token_count: Option<usize>, pub first_page: Option<usize>, pub last_page: Option<usize>, }Fields:
byte_start(usize): UTF-8 byte offset in content (inclusive)byte_end(usize): UTF-8 byte offset in content (exclusive)char_count(usize): Number of characters in chunktoken_count(Option): Estimated token count (if configured) first_page(Option): First page this chunk appears on (1-indexed, only when page boundaries available) last_page(Option): Last page this chunk appears on (1-indexed, only when page boundaries available)
Page tracking: When
PageStructure.boundariesis available and chunking is enabled,first_pageandlast_pageare automatically calculated based on byte offsets.Example:
chunk_metadata.rsuse kreuzberg::{extract_file_sync, ExtractionConfig, ChunkingConfig, PageConfig}; fn main() -> kreuzberg::Result<()> { let config = ExtractionConfig { chunking: Some(ChunkingConfig { chunk_size: 500, chunk_overlap: 50, ..Default::default() }), pages: Some(PageConfig { extract_pages: true, ..Default::default() }), ..Default::default() }; let result = extract_file_sync("document.pdf", None, &config)?; if let Some(chunks) = result.chunks { for chunk in chunks { let meta = &chunk.metadata; let page_info = match (meta.first_page, meta.last_page) { (Some(first), Some(last)) if first == last => { format!(" (page {})", first) } (Some(first), Some(last)) => { format!(" (pages {}-{})", first, last) } _ => String::new(), }; println!( "Chunk [{}:{}]: {} chars{}", meta.byte_start, meta.byte_end, meta.char_count, page_info ); } } Ok(()) }
Error Handling¶
KreuzbergError¶
All errors are returned as
KreuzbergErrorenum.Definition:
error_handling.rs#[derive(Debug, thiserror::Error)] pub enum KreuzbergError { #[error("IO error: {0}")] Io(#[from] std::io::Error), #[error("Validation error: {0}")] Validation(String), #[error("Parsing error: {0}")] Parsing(String), #[error("OCR error: {0}")] Ocr(String), #[error("Missing dependency: {0}")] MissingDependency(String), // ... additional variants }Error Handling:
error_handling.rsuse kreuzberg::{extract_file_sync, ExtractionConfig, KreuzbergError}; fn process_file(path: &str) -> kreuzberg::Result<String> { let config = ExtractionConfig::default(); // Pattern match on specific error types for custom handling match extract_file_sync(path, None, &config) { Ok(result) => Ok(result.content), Err(KreuzbergError::Io(e)) => { eprintln!("File system error: {}", e); Err(KreuzbergError::Io(e)) } Err(KreuzbergError::Validation(msg)) => { eprintln!("Invalid input: {}", msg); Err(KreuzbergError::Validation(msg)) } Err(KreuzbergError::Parsing(msg)) => { eprintln!("Failed to parse document: {}", msg); Err(KreuzbergError::Parsing(msg)) } Err(e) => Err(e), } }Using the
?operator:simple_error_handling.rsfn main() -> kreuzberg::Result<()> { // Use ? operator for simple error propagation let config = ExtractionConfig::default(); let result = extract_file_sync("document.pdf", None, &config)?; println!("{}", result.content); Ok(()) }See Error Handling Reference for detailed error documentation.
Plugin System¶
Document Extractors¶
Register custom document extractors for new file formats.
Trait:
Rust#[async_trait] pub trait DocumentExtractor: Send + Sync { fn name(&self) -> &str; fn mime_types(&self) -> &[&str]; fn priority(&self) -> i32; async fn extract( &self, data: &[u8], mime_type: &str, config: &ExtractionConfig ) -> Result<ExtractionResult>; }Registration:
plugin_registration.rsuse kreuzberg::plugins::registry::get_document_extractor_registry; use std::sync::Arc; // Register a custom document extractor for new file formats let registry = get_document_extractor_registry(); registry.register("custom", Arc::new(MyCustomExtractor))?;
MIME Type Detection¶
detect_mime_type()¶
Detect MIME type from file path.
Signature:
Example:
mime_detection.rsuse kreuzberg::detect_mime_type; // Detect MIME type from file path let mime_type = detect_mime_type("document.pdf")?; println!("MIME type: {}", mime_type); // "application/pdf"
validate_mime_type()¶
Validate if a MIME type is supported.
Signature:
Example:
mime_validation.rsuse kreuzberg::validate_mime_type; // Check if a MIME type is supported by Kreuzberg if validate_mime_type("application/pdf") { println!("PDF is supported"); }
Complete Documentation¶
For complete Rust API documentation with all types, traits, and functions:
Or visit docs.rs/kreuzberg