R API Reference
Complete reference for the Kreuzberg R API.
Installation
Install from the R-universe repository:
Or install from source using remotes:
System Requirements:
- R >= 4.2
- Rust toolchain (cargo, rustc >= 1.91) for building from source
- Supported platforms: Linux (x64, arm64), macOS (Apple Silicon)
Core Functions
batch_extract_bytes()
Extract content from multiple raw byte arrays (asynchronous via Tokio runtime).
Signature:
Parameters:
Same as batch_extract_bytes_sync().
Returns:
- List of kreuzberg_result objects
batch_extract_bytes_sync()
Extract content from multiple raw byte arrays (synchronous).
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| data_list | list of raw | List of binary data (raw vectors) |
| mime_types | character | MIME types corresponding to each byte array |
| config | list, NULL | Extraction configuration |
Returns:
- List of kreuzberg_result objects
Example:
library(kreuzberg)
pdf_data <- readBin("invoice.pdf", what = "raw", n = file.size("invoice.pdf"))
docx_data <- readBin("report.docx", what = "raw", n = file.size("report.docx"))
data_list <- list(pdf_data, docx_data)
mime_types <- c("application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
results <- batch_extract_bytes_sync(data_list, mime_types)
for (i in seq_along(results)) {
  cat(sprintf("Document %d: %d characters\n", i, nchar(results[[i]]$content)))
}
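Because data_list and mime_types are matched by position, it can help to build both from the same vector of paths. The following is a minimal sketch using only base R; read_raw(), mime_by_ext, and build_batch_input() are hypothetical names introduced for illustration and are not part of Kreuzberg.

```r
# Hypothetical helper (not part of kreuzberg): read a whole file as a raw vector.
read_raw <- function(path) {
  readBin(path, what = "raw", n = file.size(path))
}

# Hypothetical lookup from file extension to MIME type; extend as needed.
mime_by_ext <- c(
  pdf  = "application/pdf",
  docx = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)

# Build the two position-matched arguments from a vector of paths.
build_batch_input <- function(paths) {
  exts <- tolower(tools::file_ext(paths))
  unknown <- setdiff(exts, names(mime_by_ext))
  if (length(unknown) > 0) {
    stop("No MIME type known for extension(s): ", paste(unknown, collapse = ", "))
  }
  list(
    data_list  = lapply(paths, read_raw),
    mime_types = unname(mime_by_ext[exts])
  )
}

# Usage (requires kreuzberg and the files to exist):
# input <- build_batch_input(c("invoice.pdf", "report.docx"))
# results <- kreuzberg::batch_extract_bytes_sync(input$data_list, input$mime_types)
```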
batch_extract_files()
Extract content from multiple files in parallel (asynchronous via Tokio runtime).
Signature:
Parameters:
Same as batch_extract_files_sync().
Returns:
- List of kreuzberg_result objects
batch_extract_files_sync()
Extract content from multiple files in parallel (synchronous).
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| paths | character | Vector of file paths to extract |
| config | list, NULL | Extraction configuration applied to all files |
Returns:
- List of kreuzberg_result objects
Example:
library(kreuzberg)
paths <- c("doc1.pdf", "doc2.docx", "doc3.xlsx")
results <- batch_extract_files_sync(paths)
for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", paths[i], nchar(results[[i]]$content)))
}
extract_bytes()
Extract content from raw bytes (asynchronous via Tokio runtime).
Signature:
Parameters:
Same as extract_bytes_sync().
Returns:
kreuzberg_result: Extraction result object
extract_bytes_sync()
Extract content from raw bytes (synchronous).
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| data | raw | Binary data to extract (raw vector) |
| mime_type | character | MIME type of the data (required for format detection) |
| config | list, NULL | Extraction configuration |
Returns:
kreuzberg_result: Extraction result object
Example:
library(kreuzberg)
data <- readBin("document.pdf", what = "raw", n = file.size("document.pdf"))
result <- extract_bytes_sync(data, "application/pdf")
cat(result$content)
extract_file()
Extract content from a file (asynchronous via Tokio runtime).
Note: R does not have native async/await. This function internally uses a blocking Tokio runtime. For background processing, run in a separate R process or use a thread pool.
Signature:
Parameters:
Same as extract_file_sync().
Returns:
kreuzberg_result: Extraction result object
Example:
library(kreuzberg)
# Equivalent to extract_file_sync in R
result <- extract_file("document.pdf")
cat(result$content)
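Because the call blocks, one way to keep the calling session free is to fork a child process with the parallel package that ships with R. This is a minimal sketch, assuming a Unix-like platform (mcparallel() is not available on Windows) and that forking is acceptable for your workload; run_in_background() and collect_result() are hypothetical helper names.

```r
library(parallel)

# Hypothetical helper: evaluate fun(...) in a forked child process.
run_in_background <- function(fun, ...) {
  mcparallel(fun(...))
}

# Hypothetical helper: block until the child finishes and return its value.
collect_result <- function(job) {
  mccollect(job)[[1]]
}

# Usage (requires kreuzberg):
# job <- run_in_background(kreuzberg::extract_file, "document.pdf")
# ... do other work in the parent session ...
# result <- collect_result(job)
```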
extract_file_sync()
Extract content from a file (synchronous).
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| path | character | Path to the file to extract |
| mime_type | character, NULL | Optional MIME type hint. If NULL, MIME type is auto-detected |
| config | list, NULL | Extraction configuration. Uses defaults if NULL |
Returns:
kreuzberg_result: Extraction result object (S3 class inheriting from list)
Raises:
- ValidationError: Input validation failed
- ParsingError: Document parsing failed
- FileNotFoundError: File does not exist
- UnsupportedFormatError: Document format not supported
- ExtractionError: General extraction failure
Example - Basic usage:
library(kreuzberg)
result <- extract_file_sync("document.pdf")
cat("Content:\n", result$content, "\n")
cat("Pages:", page_count(result), "\n")
Example - With configuration:
library(kreuzberg)
config <- extraction_config(
  ocr = ocr_config(backend = "tesseract", language = "eng")
)
result <- extract_file_sync("scanned.pdf", config = config)
Example - With explicit MIME type:
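An explicit MIME type can be useful when auto-detection has little to go on, for example a file saved without an extension. The following is a minimal sketch; the mime_type argument name comes from the parameter table above, while mime_hint_for() and extract_with_hint() are hypothetical helpers introduced for illustration.

```r
# Hypothetical helper: return the fallback MIME type only for paths without
# a file extension; otherwise return NULL so the type is auto-detected.
mime_hint_for <- function(path, fallback) {
  if (nzchar(tools::file_ext(path))) NULL else fallback
}

# Hypothetical wrapper around extract_file_sync() (requires kreuzberg when called).
extract_with_hint <- function(path, fallback = "application/pdf") {
  kreuzberg::extract_file_sync(path, mime_type = mime_hint_for(path, fallback))
}

# Usage:
# result <- extract_with_hint("invoice")   # no extension, so the hint is passed
# cat(result$content)
```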
Configuration
chunking_config()
Create text chunking configuration.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| max_characters | integer | Maximum characters per chunk. Default: 1000 |
| overlap | integer | Overlap between chunks. Default: 200 |
| ... | | Additional chunking options |
Returns:
- Named list with chunking configuration
Example:
discover()
Search for kreuzberg.toml configuration file in current and parent directories.
Signature:
Returns:
- Named list with configuration if found, NULL otherwise
Example:
config <- discover()
if (!is.null(config)) {
  result <- extract_file_sync("document.pdf", config = config)
}
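A discovered configuration can also be combined with per-call overrides before use. This is a minimal sketch using base R's modifyList(), assuming the value returned by discover() behaves like an ordinary named list; merge_config() is a hypothetical helper.

```r
# Hypothetical helper: overlay overrides on a base configuration.
# A NULL base (no kreuzberg.toml found) is treated as an empty list.
merge_config <- function(base, overrides) {
  if (is.null(base)) base <- list()
  utils::modifyList(base, overrides)
}

# Usage (requires kreuzberg):
# config <- merge_config(kreuzberg::discover(), list(force_ocr = TRUE))
# result <- kreuzberg::extract_file_sync("document.pdf", config = config)
```

Note that modifyList() removes any element whose override is NULL, so an override cannot be used to set an option to NULL explicitly.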
extraction_config()
Create an extraction configuration object.
Signature:
extraction_config(
  chunking = NULL,
  enable_quality_processing = NULL,
  force_ocr = FALSE,
  html_options = NULL,
  images = NULL,
  include_document_structure = NULL,
  keywords = NULL,
  language_detection = NULL,
  layout = NULL,
  max_concurrent_extractions = NULL,
  ocr = NULL,
  output_format = NULL,
  pages = NULL,
  pdf_options = NULL,
  postprocessor = NULL,
  result_format = NULL,
  security_limits = NULL,
  token_reduction = NULL,
  use_cache = NULL,
  ...
) -> list
Parameters:
| Parameter | Type | Description |
|---|---|---|
| chunking | list, NULL | Text chunking options (see chunking_config()) |
| enable_quality_processing | logical, NULL | Enable quality processing enhancements |
| force_ocr | logical | Force OCR on all documents regardless of document type |
| html_options | list, NULL | HTML-specific options |
| images | list, NULL | Image extraction options |
| include_document_structure | logical, NULL | Include hierarchical document structure in results |
| keywords | list, NULL | Keyword extraction options |
| language_detection | list, NULL | Language detection options |
| layout | list, NULL | Layout detection options |
| max_concurrent_extractions | integer, NULL | Maximum concurrent extractions for batch operations |
| ocr | list, NULL | OCR configuration (see ocr_config()) |
| output_format | character, NULL | Output format for extracted content ('plain', 'markdown', 'djot', 'html') |
| pages | list, NULL | Page extraction options |
| pdf_options | list, NULL | PDF-specific options |
| postprocessor | character, NULL | Post-processor name |
| result_format | character, NULL | Result format ('unified', 'element_based') |
| security_limits | list, NULL | Security limit options |
| token_reduction | list, NULL | Token reduction options |
| use_cache | logical, NULL | Enable extraction result caching |
| ... | Other options | Additional configuration parameters |
Returns:
- Named list with configuration options
Example:
config <- extraction_config(
  ocr = ocr_config(backend = "tesseract", language = "eng"),
  chunking = chunking_config(max_characters = 1000L, overlap = 200L),
  use_cache = TRUE
)
result <- extract_file_sync("document.pdf", config = config)
from_file()
Load configuration from a TOML, YAML, or JSON file.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| path | character | Path to configuration file (TOML, YAML, or JSON) |
Returns:
- Named list with configuration
Example:
layout_detection_config()
Create a layout detection configuration.
Signature:
layout_detection_config(confidence_threshold = NULL, apply_heuristics = TRUE, table_model = NULL, ...) -> list
Parameters:
| Parameter | Type | Description |
|---|---|---|
| apply_heuristics | logical | Whether to apply heuristic post-processing to refine layout regions. Default: TRUE |
| confidence_threshold | numeric, NULL | Minimum confidence threshold for detected regions (0.0-1.0). Default: NULL |
| table_model | character, NULL | Table structure recognition model: "tatr" (default), "slanet_wired", "slanet_wireless", "slanet_plus", "slanet_auto". Default: NULL |
| ... | | Additional layout detection options |
Returns:
- Named list with layout detection configuration
Example:
ocr_config()
Create OCR configuration.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| backend | character | OCR backend ("tesseract" or "paddle-ocr"). Default: "tesseract" |
| dpi | integer, NULL | DPI for OCR processing |
| language | character | Language code (ISO 639-3). Default: "eng" |
| model_tier | character, NULL | PaddleOCR model tier: "mobile" (lightweight, ~21MB total, fast) or "server" (high accuracy, ~172MB, best with GPU). Default: "mobile". Added in v4.5.0 |
| padding | integer, NULL | Padding in pixels (0-100) added around the image before PaddleOCR detection. Default: 10. Added in v4.5.0 |
| ... | | Additional OCR options |
Returns:
- Named list with OCR configuration
Example:
Results & Types
kreuzberg_result
Result object returned by all extraction functions. Inherits from list with named fields.
Fields:
| Field | Type | Description |
|---|---|---|
| annotations | list, NULL | PDF annotations (links, highlights, notes) |
| chunks | list, NULL | Text chunks (if chunking enabled) |
| content | character | Extracted text content |
| detected_language | character, NULL | Detected language code (ISO 639-1) |
| djot_content | list, NULL | Structured Djot content |
| document | list, NULL | Hierarchical document structure |
| elements | list, NULL | Document semantic elements |
| extracted_keywords | list, NULL | Extracted keywords with scores |
| images | list, NULL | Extracted images |
| metadata | list | Document metadata |
| mime_type | character | MIME type of the processed document |
| ocr_elements | list, NULL | OCR elements with positioning and confidence |
| pages | list, NULL | Per-page extracted content (if page extraction enabled) |
| processing_warnings | list, NULL | Non-fatal processing warnings |
| quality_score | numeric, NULL | Quality score (0.0-1.0) |
| tables | list, NULL | Array of extracted tables |
Example:
result <- extract_file_sync("document.pdf")
cat("Content:", result$content, "\n")
cat("MIME type:", result$mime_type, "\n")
cat("Pages:", page_count(result), "\n")
cat("Tables:", length(result$tables), "\n")
cat("Language:", detected_language(result), "\n")
S3 Methods for kreuzberg_result
chunk_count()
Get the number of text chunks.
Example:
result <- extract_file_sync("document.pdf", config = extraction_config(chunking = chunking_config()))
chunks <- chunk_count(result)
content()
Extract the text content.
Example:
detected_language()
Get the detected language code.
Example:
result <- extract_file_sync("document.pdf")
lang <- detected_language(result)
if (!is.null(lang)) {
  cat("Language:", lang, "\n")
}
format()
Format the result as a string.
metadata_field()
Extract a specific metadata field by name.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| x | kreuzberg_result | Result object |
| name | character | Field name |
Returns:
- Field value or NULL if not present
Example:
result <- extract_file_sync("document.pdf")
title <- metadata_field(result, "title")
author <- metadata_field(result, "author")
mime_type()
Get the MIME type of the document.
Example:
page_count()
Get the number of pages in the document.
Example:
print()
Print a brief summary of the result.
Example:
summary()
Summarize the extraction result.
Example:
Metadata Hash
Document metadata with format-specific fields.
Common Fields:
| Field | Type | Description |
|---|---|---|
| authors | character | Document authors |
| created_at | character | Creation date (ISO 8601) |
| created_by | character | Creator/application name |
| custom | list | Additional custom metadata from postprocessors |
| date | character | Document date (ISO 8601 format) |
| format_type | character | Format discriminator ("pdf", "excel", "email", etc.) |
| keywords | character | Document keywords |
| language | character | Document language (ISO 639-1 code) |
| modified_at | character | Modification date (ISO 8601) |
| page_count | integer | Number of pages |
| producer | character | Producer/generator |
| subject | character | Document subject |
| title | character | Document title |
Example:
result <- extract_file_sync("document.pdf")
metadata <- result$metadata
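# identical() stays FALSE (instead of erroring) when format_type is missing
if (identical(metadata$format_type, "pdf")) {
  cat("Title:", metadata$title, "\n")
  cat("Authors:", paste(metadata$authors, collapse = ", "), "\n")
  cat("Pages:", metadata$page_count, "\n")
}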
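Format-specific fields may be absent, and a missing list element is NULL in R. The following is a minimal sketch of a defaulting accessor on a plain list, assuming metadata behaves like an ordinary named list; field_or() is a hypothetical helper and the sample values are made up for illustration.

```r
# Hypothetical helper: return metadata[[name]], or a default when it is NULL.
field_or <- function(metadata, name, default = NA_character_) {
  value <- metadata[[name]]
  if (is.null(value)) default else value
}

# Stand-in for result$metadata so the sketch runs without a document.
metadata <- list(format_type = "pdf", title = "Quarterly Report", page_count = 12L)

cat("Title:", field_or(metadata, "title"), "\n")
cat("Subject:", field_or(metadata, "subject", "(none)"), "\n")
```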
PDF Rendering
Added in v4.6.2
render_pdf_page()
Render a single page of a PDF as a PNG image.
Signature:
Parameters:
- path (character): Path to the PDF file
- page_index (integer): Zero-based page index to render
- dpi (integer): Resolution for rendering (default 150L)
Returns:
raw vector: PNG-encoded raw vector for the requested page
Example:
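R indexes from one while page_index is zero-based, so a small conversion helps avoid off-by-one mistakes. This is a minimal sketch, assuming the argument names listed above (path, page_index, dpi); to_page_index(), is_png(), and render_page_to_file() are hypothetical helpers introduced for illustration.

```r
# Hypothetical helper: convert a one-based page number to a zero-based index.
to_page_index <- function(page) {
  as.integer(page) - 1L
}

# Hypothetical helper: check for the 8-byte PNG file signature.
is_png <- function(bytes) {
  png_magic <- as.raw(c(0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a))
  is.raw(bytes) && length(bytes) >= 8L && identical(bytes[1:8], png_magic)
}

# Hypothetical wrapper (requires kreuzberg when called): render one page
# and write the PNG bytes to a file.
render_page_to_file <- function(path, page, out, dpi = 150L) {
  png <- kreuzberg::render_pdf_page(path, page_index = to_page_index(page), dpi = dpi)
  stopifnot(is_png(png))
  writeBin(png, out)
  invisible(out)
}

# Usage:
# render_page_to_file("document.pdf", page = 1, out = "page-1.png")
```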
Error Handling
Errors are raised as typed conditions with class hierarchy:
- kreuzberg_error (base)
  - ValidationError
  - ParsingError
  - FileNotFoundError
  - UnsupportedFormatError
  - ExtractionError
Example - Basic error handling:
library(kreuzberg)
tryCatch(
  result <- extract_file_sync("document.pdf"),
  FileNotFoundError = function(e) {
    cat("File not found:", conditionMessage(e), "\n")
  },
  ValidationError = function(e) {
    cat("Validation error:", conditionMessage(e), "\n")
  },
  kreuzberg_error = function(e) {
    cat("Extraction error:", conditionMessage(e), "\n")
  }
)
Example - Specific error handling:
tryCatch(
  {
    result <- extract_file_sync("scanned.pdf", config = extraction_config(
      ocr = ocr_config(backend = "unsupported-backend")
    ))
  },
  ValidationError = function(e) {
    cat("Invalid configuration:", conditionMessage(e), "\n")
  },
  error = function(e) {
    cat("Unexpected error:", conditionMessage(e), "\n")
  }
)
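Handler order matters because tryCatch() uses the first handler whose class matches the condition. The following is a minimal sketch that simulates the hierarchy with hand-built conditions so it runs without any documents; the exact class vector of real Kreuzberg conditions is an assumption here, and make_kreuzberg_error() and classify() are hypothetical helpers.

```r
# Hypothetical constructor: a condition whose class vector mimics the
# hierarchy above (specific class first, then the kreuzberg_error base).
make_kreuzberg_error <- function(subclass, message) {
  structure(
    list(message = message, call = NULL),
    class = c(subclass, "kreuzberg_error", "error", "condition")
  )
}

# Hypothetical helper: report which handler caught the condition.
classify <- function(expr) {
  tryCatch(
    {
      expr
      "ok"
    },
    FileNotFoundError = function(e) "missing file",
    kreuzberg_error = function(e) "other kreuzberg error",
    error = function(e) "non-kreuzberg error"
  )
}

classify(stop(make_kreuzberg_error("FileNotFoundError", "no such file")))
classify(stop(make_kreuzberg_error("ParsingError", "could not parse")))
classify(stop("boom"))
```

Listing the specific class before kreuzberg_error, and kreuzberg_error before the generic error handler, keeps the broader handlers from shadowing the narrower ones.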
Cache Management
cache_stats()
Get cache statistics.
Signature:
Returns:
- Named list with:
  - total_entries (integer): Number of cached entries
  - total_size_bytes (integer): Total cache size in bytes
Example:
library(kreuzberg)
stats <- cache_stats()
cat("Cache entries:", stats$total_entries, "\n")
cat("Cache size:", stats$total_size_bytes, "bytes\n")
clear_cache()
Clear the extraction cache.
Signature:
Example:
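cache_stats() and clear_cache() combine naturally into a size-based cleanup. This is a minimal sketch; clear_cache_if_over() is a hypothetical helper, and its stats and clear arguments exist only so the logic can be exercised without touching the real cache.

```r
# Hypothetical helper: clear the cache once it exceeds limit_bytes.
# The defaults are evaluated lazily, so kreuzberg is only needed when
# they are not overridden.
clear_cache_if_over <- function(limit_bytes,
                                stats = kreuzberg::cache_stats(),
                                clear = kreuzberg::clear_cache) {
  over <- stats$total_size_bytes > limit_bytes
  if (over) clear()
  invisible(over)
}

# Usage (requires kreuzberg): keep the cache under 500 MiB.
# clear_cache_if_over(500 * 1024^2)
```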
Validation
validate_language_code()
Validate language code.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| code | character | Language code (ISO 639-3 or 639-1) |
Returns:
- Logical: TRUE if valid, FALSE otherwise
Example:
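The following is a minimal sketch of filtering a set of candidate codes, assuming validate_language_code() accepts one code at a time and returns a single logical; keep_valid() is a hypothetical helper whose validator argument can be swapped out.

```r
# Hypothetical helper: keep only the codes accepted by the validator.
keep_valid <- function(codes, validator = kreuzberg::validate_language_code) {
  ok <- vapply(codes, function(code) isTRUE(validator(code)), logical(1))
  unname(codes[ok])
}

# Usage (requires kreuzberg):
# keep_valid(c("eng", "deu", "not-a-language"))
```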
validate_mime_type()
Validate MIME type.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| mime_type | character | MIME type to validate |
Returns:
- Logical: TRUE if valid, FALSE otherwise
Example:
validate_ocr_backend_name()
Validate OCR backend name.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| backend | character | Backend name to validate |
Returns:
- Logical: TRUE if valid, FALSE otherwise
Example:
library(kreuzberg)
is_valid <- validate_ocr_backend_name("tesseract")
if (!is_valid) {
  cat("Invalid OCR backend\n")
}
validate_output_format()
Validate output format.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| format | character | Output format name |
Returns:
- Logical: TRUE if valid, FALSE otherwise
Metadata Detection
detect_mime_type()
Detect MIME type from raw bytes.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| data | raw | Binary data |
Returns:
- Character: Detected MIME type
Example:
library(kreuzberg)
data <- readBin("document", what = "raw", n = file.size("document"))
mime_type <- detect_mime_type(data)
cat("Detected MIME type:", mime_type, "\n")
detect_mime_type_from_path()
Detect MIME type from file path.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| path | character | File path |
Returns:
- Character: Detected MIME type
Example:
library(kreuzberg)
mime_type <- detect_mime_type_from_path("document.pdf")
cat("MIME type:", mime_type, "\n")
get_extensions_for_mime()
Get file extensions for a MIME type.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| mime_type | character | MIME type |
Returns:
- Character vector: File extensions for the MIME type
Example:
library(kreuzberg)
extensions <- get_extensions_for_mime("application/pdf")
cat("PDF extensions:", paste(extensions, collapse = ", "), "\n")
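Detection from bytes and the extension lookup can be combined to flag files whose extension disagrees with their content. This is a minimal sketch; extension_matches_content() is a hypothetical helper, and whether get_extensions_for_mime() returns extensions with or without a leading dot is an assumption that the code handles by stripping any dot.

```r
# Hypothetical helper: TRUE when the file's extension is one of the extensions
# listed for the MIME type detected from its bytes. The detect and
# extensions_for arguments default to the kreuzberg functions but can be
# replaced, for example in tests.
extension_matches_content <- function(path,
                                      detect = kreuzberg::detect_mime_type,
                                      extensions_for = kreuzberg::get_extensions_for_mime) {
  data <- readBin(path, what = "raw", n = file.size(path))
  expected <- sub("^\\.", "", tolower(extensions_for(detect(data))))
  tolower(tools::file_ext(path)) %in% expected
}

# Usage (requires kreuzberg):
# extension_matches_content("document.pdf")
```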
Plugins
OCR Backends
clear_ocr_backends()
Clear all registered OCR backends.
Signature:
list_ocr_backends()
List all registered OCR backends.
Signature:
Returns:
- Character vector: Names of registered backends
Example:
library(kreuzberg)
backends <- list_ocr_backends()
cat("Available OCR backends:", paste(backends, collapse = ", "), "\n")
register_ocr_backend()
Register a custom OCR backend.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| name | character | Backend name |
| callback | function | Backend implementation function |
unregister_ocr_backend()
Unregister an OCR backend.
Signature:
Post-Processors
clear_post_processors()
Clear all registered post-processors.
Signature:
list_post_processors()
List all registered post-processors.
Signature:
Returns:
- Character vector: Names of registered post-processors
register_post_processor()
Register a custom post-processor.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| name | character | Processor name |
| callback | function | Processor implementation function |
unregister_post_processor()
Unregister a post-processor.
Signature:
Validators
clear_validators()
Clear all registered validators.
Signature:
list_validators()
List all registered validators.
Signature:
Returns:
- Character vector: Names of registered validators
register_validator()
Register a custom validator.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| name | character | Validator name |
| callback | function | Validator implementation function |
unregister_validator()
Unregister a validator.
Signature:
Document Extractors
clear_document_extractors()
Clear all document extractors.
Signature:
list_document_extractors()
List all available document extractors.
Signature:
Returns:
- Character vector: Names of available document extractors
unregister_document_extractor()
Unregister a document extractor.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| name | character | Extractor name |
Thread Safety
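All Kreuzberg functions are thread-safe and can be called concurrently from parallel workers, for example via R's parallel package or the future framework. Note that mclapply() and plan(multisession) run their workers in separate processes rather than threads.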
Example - Using parallel package:
library(kreuzberg)
library(parallel)
files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")
# Use parallel processing
results <- mclapply(files, function(file) {
  extract_file_sync(file)
}, mc.cores = 3)
for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}
Example - Using future package:
library(kreuzberg)
library(future)
plan(multisession)
files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")
# Process files asynchronously
futures <- lapply(files, function(file) {
  future({
    extract_file_sync(file)
  })
})
# Collect results
results <- lapply(futures, value)
for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}
However, for better performance, use the batch API instead:
library(kreuzberg)
files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")
# Better approach: use built-in batch processing
results <- batch_extract_files_sync(files)
for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}
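The examples above pair each result with its input path by position. Under that assumption, results can be collected into a data frame for reporting. This is a minimal sketch with stand-in results so it runs without documents; summarise_results() is a hypothetical helper.

```r
# Hypothetical helper: one row per document with its extracted character count.
summarise_results <- function(paths, results) {
  stopifnot(length(paths) == length(results))
  data.frame(
    path = paths,
    characters = vapply(results, function(r) nchar(r$content), integer(1)),
    stringsAsFactors = FALSE
  )
}

# Stand-in for batch_extract_files_sync(files); real results carry more fields.
files <- c("doc1.pdf", "doc2.pdf")
results <- list(list(content = "hello"), list(content = "hello world"))

summary_df <- summarise_results(files, results)
print(summary_df)
```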
LLM Integration
Kreuzberg integrates with LLMs via the liter-llm crate for structured extraction and VLM-based OCR. The R binding passes LLM configuration as list options through the extendr FFI layer. See the LLM Integration Guide for full details.
Structured Extraction
Pass structured_extraction config to extract structured data from documents using an LLM:
library(kreuzberg)
config <- list(
  structured_extraction = list(
    schema = list(
      type = "object",
      properties = list(
        title = list(type = "string"),
        authors = list(type = "array", items = list(type = "string")),
        date = list(type = "string")
      ),
      required = c("title", "authors", "date"),
      additionalProperties = FALSE
    ),
    llm = list(model = "openai/gpt-4o-mini"),
    strict = TRUE
  )
)
result <- extract_file_sync("paper.pdf", config = config)
if (!is.null(result$structured_output)) {
  data <- jsonlite::fromJSON(result$structured_output)
  cat("Title:", data$title, "\n")
}
VLM OCR
Use a vision-language model as an OCR backend:
config <- list(
  force_ocr = TRUE,
  ocr = list(
    backend = "vlm",
    vlm_config = list(model = "openai/gpt-4o-mini")
  )
)
result <- extract_file_sync("scan.pdf", config = config)
For configuration details including API keys, model selection, and provider setup, see the LLM Integration Guide.