Java API Reference¶
Complete reference for the Kreuzberg Java bindings, which are built on the Foreign Function & Memory API (FFM, Project Panama) and require Java 25 or later.
Installation¶
Add the dependency to your Maven pom.xml:
<dependency>
<groupId>dev.kreuzberg</groupId>
<artifactId>kreuzberg</artifactId>
<version>4.0.0</version>
</dependency>
Or with Gradle:
implementation("dev.kreuzberg:kreuzberg:4.0.0")
Requirements:
- Java 25 or later
- libkreuzberg_ffi native library (auto-loaded)
- Optional: Tesseract or EasyOCR/PaddleOCR for OCR functionality
View package on Maven Central.
Core Functions¶
extractFile()¶
Extract content from a file (synchronous).
Signature:
public static ExtractionResult extractFile(String path) throws IOException, KreuzbergException
public static ExtractionResult extractFile(Path path) throws IOException, KreuzbergException
public static ExtractionResult extractFile(Path path, ExtractionConfig config) throws IOException, KreuzbergException
Parameters:
- path (String | Path): Path to the file to extract
- config (ExtractionConfig): Optional extraction configuration. Defaults are used if null
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Throws:
- IOException: If the file is not found or not readable
- KreuzbergException: Base exception for all extraction errors (subclasses: ParsingException, OcrException, MissingDependencyException)
Example - Basic usage:
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import java.io.IOException;
try {
// Extract content from a PDF file
ExtractionResult result = Kreuzberg.extractFile("document.pdf");
System.out.println(result.getContent());
System.out.println("MIME Type: " + result.getMimeType());
} catch (IOException e) {
System.err.println("File error: " + e.getMessage());
} catch (KreuzbergException e) {
System.err.println("Extraction failed: " + e.getMessage());
}
Example - With OCR:
import dev.kreuzberg.*;
import dev.kreuzberg.config.*;
import java.nio.file.Path;
// Configure OCR for scanned documents
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.language("eng")
.build())
.build();
// The overload that accepts a config takes a Path (see the signatures above)
ExtractionResult result = Kreuzberg.extractFile(Path.of("scanned.pdf"), config);
System.out.println(result.getContent());
Example - With multiple options:
// Configure extraction with multiple options for comprehensive processing
ExtractionConfig config = ExtractionConfig.builder()
.useCache(true)
.forceOcr(false)
.enableQualityProcessing(true)
.ocr(OcrConfig.builder()
.backend("tesseract")
.language("eng+fra")
.build())
.pdfOptions(PdfConfig.builder()
.extractImages(true)
.extractMetadata(true)
.build())
.chunking(ChunkingConfig.builder()
.maxChars(1000)
.maxOverlap(200)
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile(Path.of("document.pdf"), config);
extractBytes()¶
Extract content from a byte array (synchronous).
Signature:
public static ExtractionResult extractBytes(byte[] data, String mimeType, ExtractionConfig config)
throws KreuzbergException
Parameters:
- data (byte[]): File content as bytes (must not be empty)
- mimeType (String): MIME type of the data (required for format detection)
- config (ExtractionConfig): Optional extraction configuration
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Throws:
KreuzbergException: If extraction or validation fails
Example - Basic usage:
import dev.kreuzberg.Kreuzberg;
import java.nio.file.Files;
import java.nio.file.Path;
// Extract from in-memory byte array
byte[] pdfBytes = Files.readAllBytes(Path.of("document.pdf")); // throws IOException
ExtractionResult result = Kreuzberg.extractBytes(pdfBytes, "application/pdf", null);
System.out.println(result.getContent());
Example - With configuration:
// Extract from bytes with quality processing enabled
ExtractionConfig config = ExtractionConfig.builder()
.enableQualityProcessing(true)
.build();
byte[] docxBytes = Files.readAllBytes(Path.of("document.docx")); // needs java.nio.file.Files and Path; throws IOException
ExtractionResult result = Kreuzberg.extractBytes(docxBytes, "application/vnd.openxmlformats-officedocument.wordprocessingml.document", config);
batchExtractFiles()¶
Extract content from multiple files in parallel (synchronous).
Signature:
public static List<ExtractionResult> batchExtractFiles(List<String> paths, ExtractionConfig config)
throws KreuzbergException
Parameters:
- paths (List<String>): List of file paths to extract
- config (ExtractionConfig): Optional extraction configuration applied to all files
Returns:
List<ExtractionResult>: List of extraction results (one per file)
Throws:
KreuzbergException: If batch extraction fails
Example:
import dev.kreuzberg.Kreuzberg;
import java.util.List;
// Process multiple files in parallel for better performance
List<String> filePaths = List.of(
"doc1.pdf",
"doc2.docx",
"doc3.xlsx"
);
List<ExtractionResult> results = Kreuzberg.batchExtractFiles(filePaths, null);
// Display extraction results for each file
for (int i = 0; i < filePaths.size(); i++) {
System.out.println(filePaths.get(i) + ": " + results.get(i).getContent().length() + " characters");
}
batchExtractBytes()¶
Extract content from multiple byte arrays in parallel (synchronous).
Signature:
public static List<ExtractionResult> batchExtractBytes(List<BytesWithMime> items, ExtractionConfig config)
throws KreuzbergException
Parameters:
- items (List<BytesWithMime>): List of byte data with MIME types
- config (ExtractionConfig): Optional extraction configuration
Returns:
List<ExtractionResult>: List of extraction results
Throws:
KreuzbergException: If batch extraction fails
Example:
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.BytesWithMime;
import java.util.List;
// Process multiple in-memory documents in parallel
List<BytesWithMime> items = List.of(
new BytesWithMime(pdfBytes, "application/pdf"),
new BytesWithMime(docxBytes, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
);
List<ExtractionResult> results = Kreuzberg.batchExtractBytes(items, null);
extractFileAsync()¶
Extract content from a file (asynchronous).
Signature:
public static CompletableFuture<ExtractionResult> extractFileAsync(Path path, ExtractionConfig config)
Parameters:
- path (Path): File path to extract
- config (ExtractionConfig): Optional extraction configuration
Returns:
CompletableFuture<ExtractionResult>: Future that completes with the extraction result
Example:
// Asynchronous extraction with error handling
Kreuzberg.extractFileAsync(Path.of("document.pdf"), null)
.thenAccept(result -> System.out.println(result.getContent()))
.exceptionally(e -> {
System.err.println("Error: " + e.getMessage());
return null;
});
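A failure inside an asynchronous stage usually reaches `exceptionally` wrapped in a `CompletionException`, so `e.getMessage()` may show the wrapper rather than the original error. The sketch below shows one way to unwrap it. `fakeExtractAsync` is a hypothetical stand-in for an async extraction call and is not part of Kreuzberg; only the `CompletableFuture` handling is the point.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class AsyncErrorSketch {
    // Hypothetical stand-in for an asynchronous extraction call; not part of Kreuzberg.
    static CompletableFuture<String> fakeExtractAsync(String path) {
        return CompletableFuture.supplyAsync(() -> {
            if (path.isEmpty()) {
                throw new IllegalArgumentException("path must not be empty");
            }
            return "content of " + path;
        });
    }

    // Strip the CompletionException wrapper, if present, to reach the original failure.
    static Throwable unwrap(Throwable e) {
        return (e instanceof CompletionException && e.getCause() != null) ? e.getCause() : e;
    }

    // Turn both outcomes into a string so success and failure are easy to observe.
    static String describe(String path) {
        return fakeExtractAsync(path)
            .thenApply(content -> "ok: " + content)
            .exceptionally(e -> "failed: " + unwrap(e).getMessage())
            .join();
    }

    public static void main(String[] args) {
        System.out.println(describe("document.pdf"));
        System.out.println(describe(""));
    }
}
```

With the real API, the same `unwrap` helper could be applied inside the `exceptionally` callback before inspecting the exception type.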
extractBytesAsync()¶
Extract content from bytes (asynchronous).
Signature:
public static CompletableFuture<ExtractionResult> extractBytesAsync(
byte[] data,
String mimeType,
ExtractionConfig config
)
Returns:
CompletableFuture<ExtractionResult>: Future that completes with the extraction result
batchExtractFilesAsync()¶
Extract multiple files in parallel (asynchronous).
Signature:
public static CompletableFuture<List<ExtractionResult>> batchExtractFilesAsync(
List<String> paths,
ExtractionConfig config
)
Returns:
CompletableFuture<List<ExtractionResult>>: Future that completes with extraction results
batchExtractBytesAsync()¶
Extract multiple byte arrays in parallel (asynchronous).
Signature:
public static CompletableFuture<List<ExtractionResult>> batchExtractBytesAsync(
List<BytesWithMime> items,
ExtractionConfig config
)
Returns:
CompletableFuture<List<ExtractionResult>>: Future that completes with extraction results
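The batch methods apply a single configuration to every input. When each input needs its own settings, one option is to start separate asynchronous calls and combine them with `CompletableFuture.allOf`. The following is a minimal sketch of that pattern; `fakeExtractAsync` is a hypothetical placeholder for a per-item async call, not a Kreuzberg method.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class BatchAsyncSketch {
    // Hypothetical stand-in for a single asynchronous extraction; not part of Kreuzberg.
    static CompletableFuture<String> fakeExtractAsync(String name) {
        return CompletableFuture.supplyAsync(() -> "text from " + name);
    }

    // Start one future per input, then collect results in input order once all have finished.
    static CompletableFuture<List<String>> extractAll(List<String> names) {
        List<CompletableFuture<String>> futures = names.stream()
            .map(BatchAsyncSketch::fakeExtractAsync)
            .collect(Collectors.toList());
        return CompletableFuture
            .allOf(futures.toArray(new CompletableFuture<?>[0]))
            .thenApply(ignored -> futures.stream()
                .map(CompletableFuture::join) // safe: allOf guarantees completion
                .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        System.out.println(extractAll(List.of("a.pdf", "b.docx")).join());
    }
}
```

If any individual future fails, the combined future fails as well, so an `exceptionally` or `handle` stage is still needed downstream.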
Configuration¶
ExtractionConfig¶
Main extraction configuration using builder pattern.
Builder Methods:
// Build extraction configuration with all available options
ExtractionConfig config = ExtractionConfig.builder()
.useCache(true) // Enable caching (default: true)
.enableQualityProcessing(false) // Enable quality processing (default: false)
.forceOcr(false) // Force OCR on all pages (default: false)
.ocr(OcrConfig) // OCR configuration
.chunking(ChunkingConfig) // Text chunking configuration
.languageDetection(LanguageDetectionConfig) // Language detection settings
.pdfOptions(PdfConfig) // PDF-specific options
.imageExtraction(ImageExtractionConfig) // Image extraction settings
.imagePreprocessing(ImagePreprocessingConfig) // Image preprocessing
.postprocessor(PostProcessorConfig) // Post-processor settings
.tokenReduction(TokenReductionConfig) // Token reduction configuration
.htmlOptions(HtmlOptions) // HTML conversion options
.keywords(KeywordConfig) // Keyword extraction settings
.maxConcurrentExtractions(4) // Max concurrent extractions
.build();
Static Methods:
// Load configuration from file (TOML, YAML, or JSON)
ExtractionConfig config = ExtractionConfig.fromFile("kreuzberg.toml");
// Automatically discover configuration file in current/parent directories
ExtractionConfig discovered = ExtractionConfig.discover(); // Returns null if not found
OcrConfig¶
OCR configuration for text extraction from images.
Builder Methods:
// Configure OCR backend and language settings
OcrConfig ocr = OcrConfig.builder()
.backend("tesseract") // "tesseract", "easyocr", "paddleocr", etc.
.language("eng") // Language code(s); join multiple with "+" (e.g., "eng+fra")
.tesseractConfig(config) // Tesseract-specific configuration
.build();
Example - Multi-language OCR:
// Configure OCR to support multiple languages simultaneously
OcrConfig ocr = OcrConfig.builder()
.backend("tesseract")
.language("eng+fra+deu") // English, French, and German
.build();
OcrBackend Interface¶
Custom OCR backend implementation for cloud-based or specialized OCR.
Interface:
public interface OcrBackend {
/**
* Process image and extract text.
*
* @param imageData raw image bytes
* @param configJson OCR configuration as JSON
* @return extracted text, or null if processing fails
*/
String processImage(byte[] imageData, String configJson) throws Exception;
/**
* Languages supported by this backend.
*
* @return list of language codes (empty for all languages)
*/
List<String> supportedLanguages();
}
Example - Custom OCR Backend:
// Implement custom OCR backend for cloud-based or specialized OCR services
class CustomOcrBackend implements OcrBackend {
@Override
public String processImage(byte[] imageData, String configJson) throws Exception {
// Call custom OCR service (e.g., Google Cloud Vision, AWS Textract)
return callCustomOcrService(imageData);
}
@Override
public List<String> supportedLanguages() {
return List.of("eng", "fra", "deu");
}
}
// Register the custom backend with Kreuzberg
OcrBackend backend = new CustomOcrBackend();
Kreuzberg.registerOcrBackend("custom-ocr", backend);
ChunkingConfig¶
Configuration for splitting extracted text into chunks.
Builder Methods:
// Configure text chunking for RAG and embedding workflows
ChunkingConfig chunking = ChunkingConfig.builder()
.maxChars(1000) // Maximum characters per chunk
.maxOverlap(200) // Character overlap between chunks
.preset("large") // Preset: "small", "medium", "large"
.enabled(true) // Enable chunking (default: true)
.embedding(embeddingMap) // Embedding configuration
.build();
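To make the relationship between `maxChars` and `maxOverlap` concrete, here is a simplified sliding-window splitter. It is an illustration only and is not Kreuzberg's chunking algorithm, which this reference does not describe; the class and method names are hypothetical, and lengths are counted in Java `char` units.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkingSketch {
    // Split text into windows of at most maxChars, each sharing maxOverlap chars with the previous one.
    static List<String> chunk(String text, int maxChars, int maxOverlap) {
        if (maxChars <= 0 || maxOverlap < 0 || maxOverlap >= maxChars) {
            throw new IllegalArgumentException("need 0 <= maxOverlap < maxChars");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxChars - maxOverlap; // how far each window advances
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("abcdefghij", 4, 1)); // [abcd, defg, ghij]
    }
}
```

A larger overlap repeats more context between neighbouring chunks at the cost of producing more chunks for the same text.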
LanguageDetectionConfig¶
Configuration for automatic language detection.
Builder Methods:
// Configure automatic language detection with confidence threshold
LanguageDetectionConfig langDetect = LanguageDetectionConfig.builder()
.enabled(true) // Enable language detection
.minConfidence(0.8) // Minimum confidence threshold (0.0-1.0)
.build();
PdfConfig¶
PDF-specific extraction options.
Builder Methods:
// Configure PDF-specific extraction options
PdfConfig pdf = PdfConfig.builder()
.extractImages(true) // Extract images from PDF
.extractMetadata(true) // Extract PDF metadata
.renderImages(false) // Render pages as images for processing
.build();
ImageExtractionConfig¶
Configuration for image extraction from documents.
Builder Methods:
// Configure image extraction settings
ImageExtractionConfig images = ImageExtractionConfig.builder()
.extractImages(true) // Enable image extraction
.targetDpi(150) // Target DPI for extraction
.maxImageDimension(4096) // Maximum image dimension in pixels
.build();
ImagePreprocessingConfig¶
Configuration for preprocessing images before OCR.
Builder Methods:
// Configure image preprocessing to improve OCR accuracy
ImagePreprocessingConfig preproc = ImagePreprocessingConfig.builder()
.targetDpi(300) // Target DPI for OCR
.denoise(true) // Apply denoising
.deskew(true) // Deskew images
.contrastEnhance(true) // Enhance contrast
.build();
TokenReductionConfig¶
Configuration for token reduction (reducing extracted text size).
Builder Methods:
// Configure token reduction to minimize extracted text size
TokenReductionConfig tokenReduce = TokenReductionConfig.builder()
.mode("moderate") // Mode: "none", "light", "moderate", "aggressive"
.preserveImportantWords(true) // Preserve important words
.build();
PostProcessorConfig¶
Configuration for post-processing.
Builder Methods:
// Configure post-processing for extraction results
PostProcessorConfig postproc = PostProcessorConfig.builder()
.enabled(true) // Enable post-processing
.build();
HtmlOptions¶
Configuration for HTML to Markdown conversion.
Builder Methods:
// Configure HTML to Markdown conversion options
HtmlOptions html = HtmlOptions.builder()
.headingStyle("atx") // "atx", "underlined", "atx_closed"
.codeBlockStyle("backticks") // "indented", "backticks", "tildes"
.build();
KeywordConfig¶
Configuration for keyword extraction.
Builder Methods:
// Configure automatic keyword extraction from content
KeywordConfig keywords = KeywordConfig.builder()
.enabled(true)
.maxKeywords(10)
.minKeywordLength(3)
.build();
Results & Types¶
ExtractionResult¶
Result of a document extraction operation.
Accessors:
// Access extracted content and metadata
String content = result.getContent(); // Extracted text content
String mimeType = result.getMimeType(); // Detected MIME type
Map<String, Object> metadata = result.getMetadata(); // Document metadata
List<Table> tables = result.getTables(); // Extracted tables
List<String> languages = result.getDetectedLanguages(); // Detected languages
List<Chunk> chunks = result.getChunks(); // Text chunks
List<ExtractedImage> images = result.getImages(); // Extracted images
List<PageContent> pages = result.getPages(); // Per-page content (if enabled)
boolean success = result.isSuccess(); // Extraction success flag
// Access common metadata fields
Optional<String> language = result.getLanguage(); // Primary language
Optional<String> date = result.getDate(); // Document date
Optional<String> subject = result.getSubject(); // Document subject
Example - Accessing results:
ExtractionResult result = Kreuzberg.extractFile("document.pdf");
// Display basic extraction statistics
System.out.println("Content length: " + result.getContent().length());
System.out.println("MIME: " + result.getMimeType());
System.out.println("Tables: " + result.getTables().size());
System.out.println("Languages: " + result.getDetectedLanguages());
// Extract specific metadata fields
Object pageCount = result.getMetadata().get("page_count");
Object author = result.getMetadata().get("author");
// Process chunks for RAG workflows
for (Chunk chunk : result.getChunks()) {
System.out.println("Chunk " + chunk.getIndex() + ": " + chunk.getContent());
}
pages¶
Type: List<PageContent>
Per-page extracted content when page extraction is enabled via PageConfig.extractPages = true.
Each page contains:
- Page number (1-indexed)
- Text content for that page
- Tables on that page
- Images on that page
Example:
import dev.kreuzberg.*;
import java.nio.file.Path;
var config = ExtractionConfig.builder()
.pages(PageConfig.builder()
.extractPages(true)
.build())
.build();
var result = Kreuzberg.extractFile(Path.of("document.pdf"), config);
if (result.getPages() != null) {
for (var page : result.getPages()) {
System.out.println("Page " + page.getPageNumber() + ":");
System.out.println(" Content: " + page.getContent().length() + " chars");
System.out.println(" Tables: " + page.getTables().size());
System.out.println(" Images: " + page.getImages().size());
}
}
Accessing Per-Page Content¶
When page extraction is enabled, access individual pages and iterate over them:
import dev.kreuzberg.*;
import java.nio.file.Path;
var config = ExtractionConfig.builder()
.pages(PageConfig.builder()
.extractPages(true)
.insertPageMarkers(true)
.markerFormat("\n\n--- Page {page_num} ---\n\n")
.build())
.build();
var result = Kreuzberg.extractFile(Path.of("document.pdf"), config);
// Access combined content with page markers
System.out.println("Combined content with markers:");
// Guard against documents shorter than the 500-character preview
String combined = result.getContent();
System.out.println(combined.substring(0, Math.min(500, combined.length())));
System.out.println();
// Access per-page content
if (result.getPages() != null) {
for (var page : result.getPages()) {
System.out.println("Page " + page.getPageNumber() + ":");
String preview = page.getContent().substring(0, Math.min(100, page.getContent().length()));
System.out.println(" " + preview + "...");
if (!page.getTables().isEmpty()) {
System.out.println(" Found " + page.getTables().size() + " table(s)");
}
if (!page.getImages().isEmpty()) {
System.out.println(" Found " + page.getImages().size() + " image(s)");
}
}
}
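The `{page_num}` placeholder in `markerFormat` can be pictured as a simple string substitution. The sketch below assumes a marker is emitted before each page and that pages are numbered from 1; how Kreuzberg actually places markers is not specified in this reference, and `PageMarkerSketch.join` is a hypothetical helper.

```java
import java.util.List;

class PageMarkerSketch {
    // Prefix each page with the marker format, replacing {page_num} with the 1-indexed page number.
    static String join(List<String> pages, String markerFormat) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pages.size(); i++) {
            out.append(markerFormat.replace("{page_num}", Integer.toString(i + 1)));
            out.append(pages.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("First page text", "Second page text"),
            "\n\n--- Page {page_num} ---\n\n"));
    }
}
```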
Table¶
Represents a table extracted from a document.
Accessors:
// Access table data in various formats
List<List<String>> cells = table.getCells(); // 2D list of cell values
String markdown = table.getMarkdown(); // Markdown representation
int pageNumber = table.getPageNumber(); // Page number (1-indexed)
// Helper methods for table navigation
int rows = table.getRowCount(); // Number of rows
int cols = table.getColumnCount(); // Number of columns
String cell = table.getCell(row, col); // Get cell value
List<String> row = table.getRow(rowIndex); // Get row
Example:
List<Table> tables = result.getTables();
// Process all extracted tables
for (Table table : tables) {
System.out.println("Table on page " + table.getPageNumber() + ":");
System.out.println("Size: " + table.getRowCount() + " x " + table.getColumnCount());
System.out.println(table.getMarkdown());
// Iterate through all cells in the table
for (int r = 0; r < table.getRowCount(); r++) {
for (int c = 0; c < table.getColumnCount(); c++) {
System.out.print(table.getCell(r, c) + " | ");
}
System.out.println();
}
}
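Because `getCells()` returns a plain 2D list of strings, exporting a table to another format needs no library support. As a hedged illustration, the following sketch converts such a list to CSV with RFC 4180 style quoting. `TableCsvSketch` is a hypothetical helper, and it assumes no cell is null.

```java
import java.util.List;
import java.util.stream.Collectors;

class TableCsvSketch {
    // Quote a field when it contains a comma, a double quote, or a line break.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
            || field.contains("\n") || field.contains("\r");
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    // Join cells with commas and rows with newlines.
    static String toCsv(List<List<String>> cells) {
        return cells.stream()
            .map(row -> row.stream().map(TableCsvSketch::escape).collect(Collectors.joining(",")))
            .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<List<String>> cells = List.of(
            List.of("Name", "Note"),
            List.of("Ada", "likes \"tea\", coffee"));
        System.out.println(toCsv(cells));
    }
}
```

With a real result, `table.getCells()` would be passed in place of the hand-built list.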
Chunk¶
Represents a chunk of extracted text (for RAG/embeddings).
Accessors:
// Access chunk data for RAG and embedding workflows
String content = chunk.getContent(); // Chunk text
int index = chunk.getIndex(); // Chunk index
Optional<Map<String, Object>> metadata = chunk.getMetadata(); // Chunk metadata
Example:
// Configure chunking for RAG workflow
ExtractionConfig config = ExtractionConfig.builder()
.chunking(ChunkingConfig.builder()
.maxChars(1000)
.maxOverlap(200)
.build())
.build();
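ExtractionResult result = Kreuzberg.extractFile(Path.of("document.pdf"), config);
// Process each chunk (e.g., for embedding generation)
for (Chunk chunk : result.getChunks()) {
// Guard against chunks shorter than the 50-character preview
String text = chunk.getContent();
String preview = text.substring(0, Math.min(50, text.length()));
System.out.println("Chunk " + chunk.getIndex() + ": " + preview + "...");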
}
ChunkMetadata¶
Metadata for a single text chunk.
Accessors:
// Access chunk metadata for page tracking and boundaries
int byteStart = metadata.getByteStart(); // UTF-8 byte offset (inclusive)
int byteEnd = metadata.getByteEnd(); // UTF-8 byte offset (exclusive)
int charCount = metadata.getCharCount(); // Number of characters
Optional<Integer> tokenCount = metadata.getTokenCount(); // Estimated token count
Optional<Integer> firstPage = metadata.getFirstPage(); // First page (1-indexed)
Optional<Integer> lastPage = metadata.getLastPage(); // Last page (1-indexed)
Fields:
- byteStart (int): UTF-8 byte offset in content (inclusive)
- byteEnd (int): UTF-8 byte offset in content (exclusive)
- charCount (int): Number of characters in chunk
- tokenCount (Optional<Integer>): Estimated token count (if configured)
- firstPage (Optional<Integer>): First page this chunk appears on (1-indexed, only when page boundaries available)
- lastPage (Optional<Integer>): Last page this chunk appears on (1-indexed, only when page boundaries available)
Page tracking: When PageStructure.boundaries is available and chunking is enabled, firstPage and lastPage are automatically calculated based on byte offsets.
Example:
import dev.kreuzberg.*;
import java.nio.file.Path;
var config = ExtractionConfig.builder()
.chunking(ChunkingConfig.builder()
.maxChars(500)
.maxOverlap(50)
.build())
.pages(PageConfig.builder()
.extractPages(true)
.build())
.build();
var result = Kreuzberg.extractFile(Path.of("document.pdf"), config);
if (result.getChunks() != null) {
for (var chunk : result.getChunks()) {
var meta = chunk.getMetadata();
String pageInfo = "";
if (meta.getFirstPage().isPresent()) {
int first = meta.getFirstPage().get();
int last = meta.getLastPage().orElse(first);
if (first == last) {
pageInfo = " (page " + first + ")";
} else {
pageInfo = " (pages " + first + "-" + last + ")";
}
}
System.out.printf(
"Chunk [%d:%d]: %d chars%s%n",
meta.getByteStart(),
meta.getByteEnd(),
meta.getCharCount(),
pageInfo
);
}
}
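The page tracking described above derives `firstPage` and `lastPage` from byte offsets. A minimal sketch of how such a mapping could work follows. The `Boundary` and `PageRange` records are hypothetical, the boundaries are assumed to be sorted by page, and the real calculation inside Kreuzberg may differ.

```java
import java.util.List;
import java.util.Optional;

class PageRangeSketch {
    // Hypothetical page boundary: the half-open UTF-8 byte range [byteStart, byteEnd) of one page.
    record Boundary(int byteStart, int byteEnd, int pageNumber) {}

    // Hypothetical result: first and last page touched by a chunk (1-indexed).
    record PageRange(int firstPage, int lastPage) {}

    // Find every page whose byte range overlaps the chunk's half-open range [chunkStart, chunkEnd).
    static Optional<PageRange> pagesFor(int chunkStart, int chunkEnd, List<Boundary> boundaries) {
        int first = -1;
        int last = -1;
        for (Boundary b : boundaries) {
            boolean overlaps = chunkStart < b.byteEnd() && b.byteStart() < chunkEnd;
            if (overlaps) {
                if (first == -1) {
                    first = b.pageNumber();
                }
                last = b.pageNumber();
            }
        }
        return first == -1 ? Optional.empty() : Optional.of(new PageRange(first, last));
    }

    // Three consecutive pages used by the demo below.
    static List<Boundary> sampleBoundaries() {
        return List.of(new Boundary(0, 100, 1), new Boundary(100, 250, 2), new Boundary(250, 400, 3));
    }

    public static void main(String[] args) {
        System.out.println(pagesFor(90, 120, sampleBoundaries()));  // chunk spans pages 1 and 2
        System.out.println(pagesFor(100, 250, sampleBoundaries())); // chunk lies entirely on page 2
    }
}
```

Because both ranges are half-open, a chunk that ends exactly where a page begins is not counted as touching that page.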
ExtractedImage¶
Represents an image extracted from a document.
Accessors:
// Access extracted image data and metadata
byte[] data = image.getData(); // Image binary data
String format = image.getFormat(); // Image format (png, jpg, etc.)
String mimeType = image.getMimeType(); // MIME type
int pageNumber = image.getPageNumber(); // Page number
Optional<String> caption = image.getCaption(); // Image caption
Extensibility¶
Custom Post-Processors¶
Post-processors enrich extraction results by transforming content or adding metadata.
Interface:
@FunctionalInterface
public interface PostProcessor {
/**
* Process and enrich an extraction result.
*
* @param result the extraction result
* @return the processed result
*/
ExtractionResult process(ExtractionResult result) throws KreuzbergException;
/**
* Compose with another processor.
*
* @param after the next processor
* @return composed processor
*/
default PostProcessor andThen(PostProcessor after) {
return result -> after.process(this.process(result));
}
/**
* Execution stage.
*
* @return the processing stage (EARLY, MIDDLE, LATE)
*/
default ProcessingStage processingStage() {
return ProcessingStage.MIDDLE;
}
/**
* Execution priority within stage (higher = earlier).
*
* @return priority value
*/
default int priority() {
return 0;
}
}
Example - Word count processor:
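import dev.kreuzberg.*;
import java.util.HashMap;
import java.util.Map;
// Create a post-processor that adds word count to metadata
PostProcessor wordCount = result -> {
// Trim first so empty or whitespace-only content counts as zero words
String text = result.getContent().trim();
long count = text.isEmpty() ? 0 : text.split("\\s+").length;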
Map<String, Object> metadata = new HashMap<>(result.getMetadata());
metadata.put("word_count", count);
return new ExtractionResult(
result.getContent(),
result.getMimeType(),
metadata,
result.getTables(),
result.getDetectedLanguages(),
result.getChunks(),
result.getImages(),
result.isSuccess()
);
};
// Register the processor with priority 50 in MIDDLE stage
Kreuzberg.registerPostProcessor("word-count", wordCount, 50, ProcessingStage.MIDDLE);
// Extract file and access the word count metadata
ExtractionResult result = Kreuzberg.extractFile("document.pdf");
System.out.println("Word count: " + result.getMetadata().get("word_count"));
Example - Uppercase transformer:
// Create a post-processor that transforms content to uppercase
PostProcessor uppercase = result -> {
return new ExtractionResult(
result.getContent().toUpperCase(),
result.getMimeType(),
result.getMetadata(),
result.getTables(),
result.getDetectedLanguages(),
result.getChunks(),
result.getImages(),
result.isSuccess()
);
};
// Register the uppercase transformer
Kreuzberg.registerPostProcessor("uppercase", uppercase);
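The interface documents two ordering rules: processors run by stage (EARLY, MIDDLE, LATE), and within a stage a higher priority runs earlier. The sketch below models that ordering with stand-in types. `Stage` and `Entry` are hypothetical and only mirror the names above; how the real registry breaks ties between equal priorities is not stated here, so the sketch simply keeps registration order.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class ProcessorOrderSketch {
    // Stand-in for the ProcessingStage values; declaration order doubles as execution order.
    enum Stage { EARLY, MIDDLE, LATE }

    // Hypothetical registry entry holding only what is needed for ordering.
    record Entry(String name, Stage stage, int priority) {}

    // Sort by stage first, then by descending priority (higher runs earlier). The sort is stable.
    static List<String> executionOrder(List<Entry> entries) {
        Comparator<Entry> byStage = Comparator.comparing(Entry::stage);
        Comparator<Entry> byPriorityDesc = (a, b) -> Integer.compare(b.priority(), a.priority());
        return entries.stream()
            .sorted(byStage.thenComparing(byPriorityDesc))
            .map(Entry::name)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(executionOrder(List.of(
            new Entry("late-a", Stage.LATE, 0),
            new Entry("mid-low", Stage.MIDDLE, 10),
            new Entry("early", Stage.EARLY, 0),
            new Entry("mid-high", Stage.MIDDLE, 50))));
    }
}
```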
Custom Validators¶
Validators check extraction results for quality or completeness.
Interface:
@FunctionalInterface
public interface Validator {
/**
* Validate an extraction result.
*
* @param result the extraction result
* @throws ValidationException if validation fails
*/
void validate(ExtractionResult result) throws ValidationException;
/**
* Compose with another validator.
*
* @param after the next validator
* @return composed validator
*/
default Validator andThen(Validator after) {
return result -> {
this.validate(result);
after.validate(result);
};
}
/**
* Execution priority (higher = earlier).
*
* @return priority value
*/
default int priority() {
return 0;
}
}
Example - Minimum content length validator:
// Create a validator that ensures minimum content length
Validator minLength = result -> {
if (result.getContent().length() < 100) {
throw new ValidationException(
"Content too short: " + result.getContent().length() + " < 100"
);
}
};
// Register the validator
Kreuzberg.registerValidator("min-length", minLength);
Example - Quality score validator:
// Create a validator that checks extraction quality score
Validator qualityValidator = result -> {
double score = result.getMetadata().containsKey("quality_score")
? ((Number) result.getMetadata().get("quality_score")).doubleValue()
: 0.0;
if (score < 0.5) {
throw new ValidationException(
String.format("Quality score too low: %.2f < 0.50", score)
);
}
};
// Register the quality validator
Kreuzberg.registerValidator("quality", qualityValidator);
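The `andThen` default method makes it possible to build one validator from several. The sketch below reproduces that composition with stand-in types so it runs without the library: `Check` mirrors the shape of `Validator` but validates a plain string, and `CheckFailed` stands in for `ValidationException`. Both names are hypothetical.

```java
class ValidatorChainSketch {
    // Stand-in for ValidationException so the sketch compiles without the library.
    static class CheckFailed extends Exception {
        CheckFailed(String message) {
            super(message);
        }
    }

    // Stand-in for the Validator interface, validating plain text instead of an ExtractionResult.
    @FunctionalInterface
    interface Check {
        void validate(String content) throws CheckFailed;

        // Run this check first; the next check only runs if this one passes.
        default Check andThen(Check after) {
            return content -> {
                this.validate(content);
                after.validate(content);
            };
        }
    }

    static final Check NOT_BLANK = content -> {
        if (content.isBlank()) {
            throw new CheckFailed("content is blank");
        }
    };

    static final Check MIN_LENGTH = content -> {
        if (content.length() < 10) {
            throw new CheckFailed("content too short: " + content.length() + " < 10");
        }
    };

    // Run the composed chain and report the first failure, or "ok" if every check passes.
    static String run(String content) {
        try {
            NOT_BLANK.andThen(MIN_LENGTH).validate(content);
            return "ok";
        } catch (CheckFailed e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run("   "));
        System.out.println(run("short"));
        System.out.println(run("long enough text"));
    }
}
```

The chain stops at the first failing check, so cheaper checks are best placed first.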
Plugin Management¶
Register, list, and unregister plugins.
Post-Processor Management:
// Register post-processor with default settings
Kreuzberg.registerPostProcessor("processor-name", processor);
// Register with custom priority and execution stage
Kreuzberg.registerPostProcessor("processor-name", processor, 100, ProcessingStage.EARLY);
// Unregister a specific processor
Kreuzberg.unregisterPostProcessor("processor-name");
// List all registered post-processors
List<String> processors = Kreuzberg.listPostProcessors();
// Remove all post-processors
Kreuzberg.clearPostProcessors();
Validator Management:
// Register validator with default priority
Kreuzberg.registerValidator("validator-name", validator);
// Register with custom priority (higher = earlier execution)
Kreuzberg.registerValidator("validator-name", validator, 100);
// Unregister a specific validator
Kreuzberg.unregisterValidator("validator-name");
// List all registered validators
List<String> validators = Kreuzberg.listValidators();
// Remove all validators
Kreuzberg.clearValidators();
OCR Backend Management:
// Register custom OCR backend
Kreuzberg.registerOcrBackend("backend-name", backend);
// Register with supported language filtering
Kreuzberg.registerOcrBackend("backend-name", backend, List.of("eng", "fra", "deu"));
// Unregister a specific OCR backend
Kreuzberg.unregisterOCRBackend("backend-name");
// List all registered OCR backends
List<String> backends = Kreuzberg.listOCRBackends();
// Remove all custom OCR backends
Kreuzberg.clearOCRBackends();
MIME Type Detection¶
detectMimeType()¶
Detect MIME type from file or bytes.
Signatures:
public static String detectMimeType(String path) throws KreuzbergException
public static String detectMimeType(String path, boolean checkExists) throws KreuzbergException
public static String detectMimeType(byte[] data) throws KreuzbergException
public static String detectMimeTypeFromPath(String path) throws KreuzbergException
Example:
// Detect MIME type from file path
String mimeFromFile = Kreuzberg.detectMimeType("document.pdf");
// Detect from path without checking file existence
String mimeFromPath = Kreuzberg.detectMimeType("document.pdf", false);
// Detect from raw byte array
byte[] data = Files.readAllBytes(Path.of("document.pdf")); // needs java.nio.file.Files and Path
String mimeFromBytes = Kreuzberg.detectMimeType(data);
validateMimeType()¶
Validate and normalize a MIME type string.
Signature:
Example:
// Validate and normalize a MIME type string
String validated = Kreuzberg.validateMimeType("application/pdf");
System.out.println(validated); // "application/pdf"
getExtensionsForMime()¶
Get file extensions for a given MIME type.
Signature:
Example:
// Get file extensions for PDF files
List<String> extensions = Kreuzberg.getExtensionsForMime("application/pdf");
System.out.println(extensions); // ["pdf"]
// Get file extensions for JPEG images (multiple extensions possible)
List<String> jpegExtensions = Kreuzberg.getExtensionsForMime("image/jpeg");
System.out.println(jpegExtensions); // ["jpg", "jpeg"]
Embeddings & Presets¶
getEmbeddingPreset()¶
Get embedding preset configuration by name.
Signature:
public static Optional<EmbeddingPreset> getEmbeddingPreset(String name) throws KreuzbergException
Example:
// Retrieve an embedding preset configuration by name
Optional<EmbeddingPreset> preset = Kreuzberg.getEmbeddingPreset("default");
if (preset.isPresent()) {
EmbeddingPreset p = preset.get();
System.out.println("Model: " + p.getModel());
System.out.println("Dimensions: " + p.getDimensions());
}
listEmbeddingPresets()¶
List all available embedding presets.
Signature:
Example:
// List all available embedding presets
List<String> presets = Kreuzberg.listEmbeddingPresets();
for (String preset : presets) {
System.out.println("Available: " + preset);
}
Error Handling¶
Exception Hierarchy¶
Kreuzberg uses a checked exception model for error handling.
Exception
├── IOException (from java.io)
├── KreuzbergException
│ ├── ParsingException
│ ├── OcrException
│ ├── MissingDependencyException
│ ├── ValidationException
│ ├── PluginException
│ ├── CacheException
│ └── ImageProcessingException
Specific Exceptions¶
KreuzbergException - Base exception for all Kreuzberg errors.
// Handle general Kreuzberg exceptions
try {
ExtractionResult result = Kreuzberg.extractFile("document.pdf");
} catch (KreuzbergException e) {
System.err.println("Extraction failed: " + e.getMessage());
if (e.getCause() != null) {
e.getCause().printStackTrace();
}
}
ParsingException - Document parsing failure.
// Handle document parsing errors (e.g., corrupted files)
try {
ExtractionResult result = Kreuzberg.extractFile("corrupted.pdf");
} catch (ParsingException e) {
System.err.println("Failed to parse document: " + e.getMessage());
}
OcrException - OCR processing failure.
// Handle OCR-specific errors
try {
ExtractionConfig config = ExtractionConfig.builder()
.forceOcr(true)
.build();
ExtractionResult result = Kreuzberg.extractFile(Path.of("image.png"), config);
} catch (OcrException e) {
System.err.println("OCR failed: " + e.getMessage());
}
MissingDependencyException - Required system dependency not found.
// Handle missing system dependencies (e.g., Tesseract not installed)
try {
ExtractionResult result = Kreuzberg.extractFile("document.pdf");
} catch (MissingDependencyException e) {
System.err.println("Missing dependency: " + e.getMessage());
System.err.println("Install Tesseract or configure alternative OCR backend");
}
ValidationException - Configuration or validation failure.
// Handle validation errors from custom validators
try {
validator.validate(result);
} catch (ValidationException e) {
System.err.println("Validation failed: " + e.getMessage());
}
Comprehensive Error Handling¶
// Comprehensive error handling for all exception types
try {
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder().backend("tesseract").language("eng").build())
.build();
ExtractionResult result = Kreuzberg.extractFile(Path.of("document.pdf"), config);
System.out.println("Success: " + result.getContent().length() + " characters");
} catch (ParsingException e) {
System.err.println("Document format not supported or corrupted");
e.printStackTrace();
} catch (OcrException e) {
System.err.println("OCR processing failed");
e.printStackTrace();
} catch (MissingDependencyException e) {
System.err.println("Missing required dependency");
System.err.println("Message: " + e.getMessage());
} catch (ValidationException e) {
System.err.println("Configuration validation failed");
} catch (IOException e) {
System.err.println("File not found or not readable: " + e.getMessage());
} catch (KreuzbergException e) {
System.err.println("Extraction failed: " + e.getMessage());
} finally {
// Clean up resources if needed
}
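The order of the `catch` clauses above matters: Java requires subclass handlers to appear before their superclass, and the first matching handler wins. The following sketch demonstrates this with stand-in exception classes that mirror part of the hierarchy; they are not the Kreuzberg classes.

```java
class CatchOrderSketch {
    // Stand-ins mirroring part of the hierarchy above; these are not the Kreuzberg classes.
    static class BaseError extends Exception {
        BaseError(String message) {
            super(message);
        }
    }

    static class ParseError extends BaseError {
        ParseError(String message) {
            super(message);
        }
    }

    static class OcrError extends BaseError {
        OcrError(String message) {
            super(message);
        }
    }

    // Most specific handlers come first; the base-class handler acts as the fallback.
    static String classify(Exception failure) {
        try {
            throw failure;
        } catch (ParseError e) {
            return "parsing";
        } catch (OcrError e) {
            return "ocr";
        } catch (BaseError e) {
            return "general";
        } catch (Exception e) {
            return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(new ParseError("corrupted file")));
        System.out.println(classify(new BaseError("unknown failure")));
        System.out.println(classify(new java.io.IOException("missing file")));
    }
}
```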
Utility Methods¶
getVersion()¶
Get the Kreuzberg library version.
Signature:
Example:
// Get the Kreuzberg library version
String version = Kreuzberg.getVersion();
System.out.println("Kreuzberg version: " + version);
Advanced Usage¶
Configuration Discovery¶
Automatically discover configuration from kreuzberg.toml, kreuzberg.yaml, or kreuzberg.json in the current or parent directories.
// Automatically discover configuration file in directory tree
ExtractionConfig config = ExtractionConfig.discover();
if (config != null) {
System.out.println("Configuration discovered!");
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
} else {
System.out.println("No configuration file found, using defaults");
}
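To make the discovery behaviour concrete, here is a minimal JDK-only sketch of an upward directory search. It is not the library's implementation: the `ConfigLocator` class, the `findConfig` method, and the order in which the three file names are tried are all assumptions made for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

final class ConfigLocator {
    // File names come from the description above; the order in which they are tried is an assumption.
    private static final List<String> CANDIDATES =
            List.of("kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json");

    /** Walks from startDir up to the filesystem root and returns the first config file found. */
    static Optional<Path> findConfig(Path startDir) {
        Path dir = startDir.toAbsolutePath().normalize();
        while (dir != null) {
            for (String name : CANDIDATES) {
                Path candidate = dir.resolve(name);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            }
            dir = dir.getParent(); // becomes null once the root has been checked
        }
        return Optional.empty();
    }
}
```

A result from such a helper could be passed to `ExtractionConfig.fromFile(...)`, which is useful when the application needs to log which file was picked up.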
Configuration from File¶
Load configuration from a file explicitly.
// Load configuration from a specific file
ExtractionConfig config = ExtractionConfig.fromFile("kreuzberg.toml");
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
Complex Configuration Example¶
import dev.kreuzberg.*;
import dev.kreuzberg.config.*;
import java.nio.file.Path;
public class ComplexExample {
public static void main(String[] args) throws Exception {
// Build comprehensive extraction configuration with all options
ExtractionConfig config = ExtractionConfig.builder()
.useCache(true)
.enableQualityProcessing(true)
.forceOcr(false)
// Configure OCR for multi-language support
.ocr(OcrConfig.builder()
.backend("tesseract")
.language("eng+fra")
.build())
// Configure PDF extraction options
.pdfOptions(PdfConfig.builder()
.extractImages(true)
.extractMetadata(true)
.build())
// Configure image preprocessing for better OCR results
.imagePreprocessing(ImagePreprocessingConfig.builder()
.targetDpi(300)
.denoise(true)
.deskew(true)
.contrastEnhance(true)
.build())
// Configure chunking for RAG workflows
.chunking(ChunkingConfig.builder()
.maxChars(1000)
.maxOverlap(200)
.enabled(true)
.build())
// Configure automatic language detection
.languageDetection(LanguageDetectionConfig.builder()
.enabled(true)
.minConfidence(0.8)
.build())
.build();
// Register custom post-processor for content transformation
PostProcessor uppercaser = result -> new ExtractionResult(
result.getContent().toUpperCase(),
result.getMimeType(),
result.getMetadata(),
result.getTables(),
result.getDetectedLanguages(),
result.getChunks(),
result.getImages(),
result.isSuccess()
);
// Register custom validator for quality checks
Validator minLength = result -> {
if (result.getContent().length() < 100) {
throw new ValidationException("Content too short");
}
};
Kreuzberg.registerPostProcessor("uppercase", uppercaser, 100, ProcessingStage.EARLY);
Kreuzberg.registerValidator("min-length", minLength);
// Extract document with all configurations applied
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
// Display extraction results
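// Print at most the first 100 characters so short documents do not cause an exception
String content = result.getContent();
System.out.println("Content: " + content.substring(0, Math.min(100, content.length())));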
System.out.println("Tables: " + result.getTables().size());
System.out.println("Images: " + result.getImages().size());
System.out.println("Chunks: " + result.getChunks().size());
System.out.println("Language: " + result.getLanguage());
System.out.println("MIME: " + result.getMimeType());
}
}
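The chunking options above (`maxChars(1000)`, `maxOverlap(200)`) describe overlapping windows over the extracted text. The following sketch shows one simple way such windows can be computed with plain character offsets. It is an illustration only: `OverlapChunker` is a hypothetical helper, and the library's own chunker may split on different boundaries.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapChunker {
    /**
     * Splits text into windows of at most maxChars characters. Each window starts
     * (maxChars - overlap) characters after the previous one, so neighbouring
     * windows share `overlap` characters.
     */
    static List<String> chunk(String text, int maxChars, int overlap) {
        if (maxChars <= 0 || overlap < 0 || overlap >= maxChars) {
            throw new IllegalArgumentException("require 0 <= overlap < maxChars");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxChars - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }
}
```

With `maxChars = 1000` and `overlap = 200`, each window starts 800 characters after the previous one, so a 2,500-character text yields three chunks.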
Batch Processing with Error Handling¶
import dev.kreuzberg.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
public class BatchProcessor {
public static void main(String[] args) throws Exception {
// Find all PDF files in the documents directory.
// Files.list() holds an open directory handle, so close the stream when done.
List<Path> files;
try (Stream<Path> entries = Files.list(Path.of("documents/"))) {
files = entries
.filter(p -> p.toString().endsWith(".pdf"))
.toList();
}
// Convert Path objects to String paths
List<String> filePaths = new ArrayList<>();
for (Path file : files) {
filePaths.add(file.toString());
}
// Configure extraction with caching enabled
ExtractionConfig config = ExtractionConfig.builder()
.useCache(true)
.build();
try {
// Process all files in parallel
List<ExtractionResult> results = Kreuzberg.batchExtractFiles(filePaths, config);
// Check results for each file
for (int i = 0; i < filePaths.size(); i++) {
ExtractionResult result = results.get(i);
Path file = files.get(i);
if (result.isSuccess()) {
System.out.println(file + ": " + result.getContent().length() + " chars");
} else {
System.err.println(file + ": extraction failed");
}
}
} catch (KreuzbergException e) {
System.err.println("Batch extraction failed: " + e.getMessage());
}
}
}
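For very large directories it can help to hand the library a bounded number of files at a time rather than the whole list. The sketch below splits a list into fixed-size batches using only the JDK. `Batches.partition` is a hypothetical helper, not part of Kreuzberg, and a suitable batch size depends on document size and available memory.

```java
import java.util.ArrayList;
import java.util.List;

final class Batches {
    /** Splits items into consecutive sublists of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list
            batches.add(List.copyOf(items.subList(start, end)));
        }
        return batches;
    }
}
```

Each batch can then be passed to `Kreuzberg.batchExtractFiles(batch, config)` in a loop, so only one batch of results is held in memory at a time.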
Performance Tips¶
- Reuse configurations - Create one ExtractionConfig and use it for multiple extractions
- Batch processing - Use batchExtractFiles() for multiple files instead of individual calls
- Cache enabled - Keep caching enabled for repeated document processing
- OCR selective - Only enable OCR on pages that need it (forceOcr = false)
- Image preprocessing - Enable image preprocessing for better OCR accuracy
- Async operations - Use async methods for non-blocking extraction in concurrent scenarios
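As an illustration of the non-blocking tip, the sketch below runs a blocking per-file task on a bounded thread pool with `CompletableFuture`. It does not use the library's own async methods: `AsyncRunner` is a hypothetical helper, and the `task` parameter stands in for whichever extraction call the application makes.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class AsyncRunner {
    /**
     * Runs a blocking per-file task on a fixed-size pool and returns the results
     * in input order. The task is a stand-in for an extraction call.
     */
    static <R> List<R> runAll(List<Path> files, Function<Path, R> task, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every task first so they can run concurrently
            List<CompletableFuture<R>> futures = files.stream()
                    .map(file -> CompletableFuture.supplyAsync(() -> task.apply(file), pool))
                    .toList();
            // join() waits for each task and rethrows failures as CompletionException
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            pool.shutdown();
        }
    }
}
```

Bounding the pool size keeps the number of documents being processed at once predictable, which matters when each extraction is memory-heavy.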
Supported File Formats¶
- Documents: PDF, DOCX, DOC, XLSX, XLS, PPTX, PPT, ODT, ODP, ODS
- Images: PNG, JPG, JPEG, GIF, BMP, WebP, TIFF
- Web: HTML, MHTML
- Text: TXT, CSV
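A batch job can skip unsupported files before calling the library by checking the file extension against this list. The sketch below is a hypothetical pre-filter: `FormatFilter` is not part of Kreuzberg, and mapping each format name to a single lower-case extension is an assumption (for example, `.tif` files are not covered).

```java
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;

final class FormatFilter {
    // Lower-case extensions derived from the format list above (assumed one-to-one mapping).
    private static final Set<String> SUPPORTED = Set.of(
            "pdf", "docx", "doc", "xlsx", "xls", "pptx", "ppt", "odt", "odp", "ods",
            "png", "jpg", "jpeg", "gif", "bmp", "webp", "tiff",
            "html", "mhtml", "txt", "csv");

    /** Returns true if the file name ends in one of the listed extensions, ignoring case. */
    static boolean isSupported(Path file) {
        Path fileName = file.getFileName();
        if (fileName == null) {
            return false; // e.g. a filesystem root has no file name
        }
        String name = fileName.toString();
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return false; // no extension present
        }
        return SUPPORTED.contains(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

An extension check is only a cheap first pass; the MIME type reported by `result.getMimeType()` remains the authoritative answer after extraction.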
Java FFM API Details¶
The Kreuzberg Java bindings use Java's Foreign Function & Memory (FFM) API for direct FFI without JNI overhead.
Memory Management:
// FFM API uses Arena for automatic memory management
try (Arena arena = Arena.ofConfined()) {
// FFI operations use arena for memory management
ExtractionResult result = Kreuzberg.extractFile("document.pdf");
} // Arena automatically cleaned up when try block exits
Arena Types:
- Arena.ofConfined() - Thread-confined arena (recommended)
- Arena.ofShared() - Shared arena for multi-threaded access
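The difference between the two arena types can be observed with the JDK alone. The following sketch assumes Java 22 or later, where the FFM API is final, and does not touch Kreuzberg; `ArenaDemo` is a hypothetical class used only for illustration.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class ArenaDemo {
    /** Writes an int into native memory owned by a confined arena and reads it back. */
    static int roundTrip(int value) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(ValueLayout.JAVA_INT);
            segment.set(ValueLayout.JAVA_INT, 0, value);
            return segment.get(ValueLayout.JAVA_INT, 0);
        } // the native memory is released when the arena closes
    }

    /** Reports whether memory allocated from the arena may be accessed by a different thread. */
    static boolean accessibleFromOtherThread(Arena arena) {
        MemorySegment segment = arena.allocate(ValueLayout.JAVA_INT);
        Thread other = new Thread(() -> { }); // never started, used only as an identity
        return segment.isAccessibleBy(other);
    }
}
```

Calling `accessibleFromOtherThread` with `Arena.ofConfined()` returns false, while `Arena.ofShared()` returns true. A confined arena is the safer default because the JDK rejects access from any thread other than the one that created it.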
Troubleshooting¶
"Failed to load native library" - Ensure libkreuzberg_ffi is in system library path.
# Linux/Unix
export LD_LIBRARY_PATH=/path/to/libkreuzberg_ffi:$LD_LIBRARY_PATH
# macOS
export DYLD_LIBRARY_PATH=/path/to/libkreuzberg_ffi:$DYLD_LIBRARY_PATH
REM Windows (cmd.exe)
set PATH=C:\path\to\libkreuzberg_ffi;%PATH%
"Tesseract not found" - Install Tesseract OCR:
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Windows
# Download from https://github.com/UB-Mannheim/tesseract/wiki
"OutOfMemoryError with large files" - Use streaming or batch processing with smaller batches.