R API Reference¶
Complete reference for the Kreuzberg R API.
Installation¶
Install from the R-universe repository:
Or install from source using remotes:
System Requirements:
- R >= 4.2
- Rust toolchain (cargo, rustc >= 1.91) for building from source
- Supported platforms: Linux (x64, arm64), macOS (Apple Silicon)
Core Functions¶
extract_file_sync()¶
Extract content from a file (synchronous).
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| path | character | Path to the file to extract |
| mime_type | character, NULL | Optional MIME type hint. If NULL, MIME type is auto-detected |
| config | list, NULL | Extraction configuration. Uses defaults if NULL |
Returns:
kreuzberg_result: Extraction result object (S3 class inheriting from list)
Raises:
- ValidationError: Input validation failed
- ParsingError: Document parsing failed
- FileNotFoundError: File does not exist
- UnsupportedFormatError: Document format not supported
- ExtractionError: General extraction failure
Example - Basic usage:
library(kreuzberg)
result <- extract_file_sync("document.pdf")
cat("Content:\n", result$content, "\n")
cat("Pages:", page_count(result), "\n")
Example - With configuration:
library(kreuzberg)
config <- extraction_config(
ocr = ocr_config(backend = "tesseract", language = "eng")
)
result <- extract_file_sync("scanned.pdf", config = config)
Example - With explicit MIME type:
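A minimal sketch of supplying the hint, assuming `mime_type` can be passed by name as listed in the parameter table above. The helper name `extract_as_pdf` and the file name `upload_0001` are hypothetical and exist only for illustration.

```r
# Hypothetical helper: skip auto-detection and treat the input as a PDF,
# e.g. for uploads stored without a file extension.
# `extract` defaults to the documented extract_file_sync(); it is a parameter
# only so the helper can be exercised without the package installed.
extract_as_pdf <- function(path, extract = kreuzberg::extract_file_sync) {
  extract(path, mime_type = "application/pdf")
}

# Usage (needs the kreuzberg package and an existing file):
# result <- extract_as_pdf("upload_0001")
# cat(result$content)
```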
extract_file()¶
Extract content from a file (asynchronous via Tokio runtime).
Note: R does not have native async/await. This function internally uses a blocking Tokio runtime. For background processing, run in a separate R process or use a thread pool.
Signature:
Parameters:
Same as extract_file_sync().
Returns:
kreuzberg_result: Extraction result object
Example:
library(kreuzberg)
# Equivalent to extract_file_sync in R
result <- extract_file("document.pdf")
cat(result$content)
extract_bytes_sync()¶
Extract content from raw bytes (synchronous).
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| data | raw | Binary data to extract (raw vector) |
| mime_type | character | MIME type of the data (required for format detection) |
| config | list, NULL | Extraction configuration |
Returns:
kreuzberg_result: Extraction result object
Example:
library(kreuzberg)
data <- readBin("document.pdf", what = "raw", n = file.size("document.pdf"))
result <- extract_bytes_sync(data, "application/pdf")
cat(result$content)
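The `readBin()` call above is the usual base R way to load a whole file as a raw vector. The sketch below wraps it in a small helper; `read_raw_file` is a hypothetical name and not part of Kreuzberg.

```r
# Hypothetical helper: read an entire file into a raw vector (base R only).
read_raw_file <- function(path) {
  stopifnot(file.exists(path))
  readBin(path, what = "raw", n = file.size(path))
}

# Demonstration on a temporary file so the snippet runs anywhere.
tmp <- tempfile(fileext = ".bin")
writeBin(as.raw(c(0x25, 0x50, 0x44, 0x46)), tmp)  # the four bytes "%PDF"
bytes <- read_raw_file(tmp)
rawToChar(bytes)  # "%PDF"
unlink(tmp)
```

The resulting vector can be passed as the `data` argument of `extract_bytes_sync()` together with the matching MIME type.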
extract_bytes()¶
Extract content from raw bytes (asynchronous via Tokio runtime).
Signature:
Parameters:
Same as extract_bytes_sync().
Returns:
kreuzberg_result: Extraction result object
batch_extract_files_sync()¶
Extract content from multiple files in parallel (synchronous).
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| paths | character | Vector of file paths to extract |
| config | list, NULL | Extraction configuration applied to all files |
Returns:
- List of kreuzberg_result objects
Example:
library(kreuzberg)
paths <- c("doc1.pdf", "doc2.docx", "doc3.xlsx")
results <- batch_extract_files_sync(paths)
for (i in seq_along(results)) {
cat(sprintf("%s: %d characters\n", paths[i], nchar(results[[i]]$content)))
}
batch_extract_files()¶
Extract content from multiple files in parallel (asynchronous via Tokio runtime).
Signature:
Parameters:
Same as batch_extract_files_sync().
Returns:
- List of kreuzberg_result objects
batch_extract_bytes_sync()¶
Extract content from multiple raw byte arrays (synchronous).
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| data_list | list of raw | List of binary data (raw vectors) |
| mime_types | character | MIME types corresponding to each byte array |
| config | list, NULL | Extraction configuration |
Returns:
- List of kreuzberg_result objects
Example:
library(kreuzberg)
pdf_data <- readBin("invoice.pdf", what = "raw", n = file.size("invoice.pdf"))
docx_data <- readBin("report.docx", what = "raw", n = file.size("report.docx"))
data_list <- list(pdf_data, docx_data)
mime_types <- c("application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
results <- batch_extract_bytes_sync(data_list, mime_types)
for (i in seq_along(results)) {
cat(sprintf("Document %d: %d characters\n", i, nchar(results[[i]]$content)))
}
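Because `mime_types` has to line up with `data_list` element by element, it can help to check the inputs before handing them over. The wrapper below is a sketch under that assumption; `batch_extract_checked` is a hypothetical name, and the `batch_extract` parameter exists only so the check can be tried without the package installed.

```r
# Hypothetical wrapper: fail early when inputs and MIME types do not line up.
batch_extract_checked <- function(data_list, mime_types,
                                  batch_extract = kreuzberg::batch_extract_bytes_sync) {
  stopifnot(
    is.list(data_list),
    all(vapply(data_list, is.raw, logical(1))),  # every element is a raw vector
    is.character(mime_types),
    length(data_list) == length(mime_types)      # one MIME type per input
  )
  batch_extract(data_list, mime_types)
}

# Usage (needs the kreuzberg package):
# results <- batch_extract_checked(data_list, mime_types)
```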
batch_extract_bytes()¶
Extract content from multiple raw byte arrays (asynchronous via Tokio runtime).
Signature:
Parameters:
Same as batch_extract_bytes_sync().
Returns:
- List of kreuzberg_result objects
Configuration¶
extraction_config()¶
Create an extraction configuration object.
Signature:
extraction_config(
force_ocr = FALSE,
ocr = NULL,
chunking = NULL,
output_format = NULL,
result_format = NULL,
use_cache = NULL,
include_document_structure = NULL,
enable_quality_processing = NULL,
language_detection = NULL,
keywords = NULL,
token_reduction = NULL,
images = NULL,
pages = NULL,
pdf_options = NULL,
html_options = NULL,
postprocessor = NULL,
security_limits = NULL,
max_concurrent_extractions = NULL,
...
) -> list
Parameters:
| Parameter | Type | Description |
|---|---|---|
| ocr | list, NULL | OCR configuration (see ocr_config()) |
| chunking | list, NULL | Text chunking options (see chunking_config()) |
| use_cache | logical, NULL | Enable result caching |
| language_detection | list, NULL | Language detection options |
| images | list, NULL | Image extraction options |
| pages | list, NULL | Page extraction options |
| pdf_options | list, NULL | PDF-specific options |
| html_options | list, NULL | HTML-specific options |
| postprocessor | character, NULL | Post-processor name |
| security_limits | list, NULL | Security limit options |
| ... | Other options | Additional configuration parameters |
Returns:
- Named list with configuration options
Example:
config <- extraction_config(
ocr = ocr_config(backend = "tesseract", language = "eng"),
chunking = chunking_config(max_characters = 1000L, overlap = 200L),
use_cache = TRUE
)
result <- extract_file_sync("document.pdf", config = config)
ocr_config()¶
Create OCR configuration.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| backend | character | OCR backend ("tesseract" or "paddle-ocr"). Default: "tesseract" |
| language | character | Language code (ISO 639-3). Default: "eng" |
| dpi | integer, NULL | DPI for OCR processing |
| ... | ... | Additional OCR options |
Returns:
- Named list with OCR configuration
Example:
chunking_config()¶
Create text chunking configuration.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| max_characters | integer | Maximum characters per chunk. Default: 1000 |
| overlap | integer | Overlap between chunks. Default: 200 |
| ... | ... | Additional chunking options |
Returns:
- Named list with chunking configuration
Example:
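The configuration call itself mirrors the one shown under extraction_config(): `chunking_config(max_characters = 1000L, overlap = 200L)`. To illustrate how the two parameters interact, the sketch below implements a naive fixed-window chunker in base R. It is not Kreuzberg's chunking algorithm, which may split on different boundaries; `naive_chunks` is a hypothetical name used only here.

```r
# Naive illustration: each chunk holds at most `max_characters` characters and
# repeats the last `overlap` characters of the previous chunk.
naive_chunks <- function(text, max_characters = 1000L, overlap = 200L) {
  stopifnot(overlap < max_characters)
  n <- nchar(text)
  if (n == 0L) return(character(0))
  step <- max_characters - overlap          # distance between chunk starts
  starts <- seq.int(1L, n, by = step)
  # Drop a start when the previous chunk already reached the end of the text.
  keep <- starts == 1L | (starts - step + max_characters - 1L) < n
  starts <- starts[keep]
  substring(text, starts, pmin(starts + max_characters - 1L, n))
}

text <- paste(rep(letters, length.out = 2500), collapse = "")
chunks <- naive_chunks(text)
nchar(chunks)  # 1000 1000 900
```

With the defaults, consecutive chunks start 800 characters apart, so the first 200 characters of each chunk repeat the end of the one before it.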
discover()¶
Search for a kreuzberg.toml configuration file in the current directory and its parent directories.
Signature:
Returns:
- Named list with configuration if found, NULL otherwise
Example:
config <- discover()
if (!is.null(config)) {
result <- extract_file_sync("document.pdf", config = config)
}
from_file()¶
Load configuration from a TOML, YAML, or JSON file.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| path | character | Path to configuration file (TOML, YAML, or JSON) |
Returns:
- Named list with configuration
Example:
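A minimal sketch that combines `from_file()` with `discover()`: an explicitly given path wins, otherwise the search result is used. The helper name `load_config` is hypothetical, and the two `*_fn` parameters exist only so the logic can be tried without the package installed.

```r
# Hypothetical helper: prefer an explicit configuration file, else fall back
# to whatever discover() finds (NULL when no kreuzberg.toml exists).
load_config <- function(path = NULL,
                        discover_fn = kreuzberg::discover,
                        from_file_fn = kreuzberg::from_file) {
  if (!is.null(path)) {
    return(from_file_fn(path))
  }
  discover_fn()
}

# Usage (needs the kreuzberg package):
# config <- load_config("kreuzberg.toml")
# result <- kreuzberg::extract_file_sync("document.pdf", config = config)
```

A `NULL` result can be passed straight through, since the extraction functions use defaults when `config` is `NULL`.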
Results & Types¶
kreuzberg_result¶
Result object returned by all extraction functions. Inherits from list with named fields.
Fields:
| Field | Type | Description |
|---|---|---|
| content | character | Extracted text content |
| mime_type | character | MIME type of the processed document |
| pages | list, NULL | Per-page extracted content (if page extraction enabled) |
| tables | list, NULL | Array of extracted tables |
| chunks | list, NULL | Text chunks (if chunking enabled) |
| images | list, NULL | Extracted images |
| elements | list, NULL | Document elements |
| keywords | character, NULL | Extracted keywords |
| quality_score | numeric, NULL | Quality score (0.0-1.0) |
| detected_language | character, NULL | Detected language code (ISO 639-1) |
| metadata | list | Document metadata |
Example:
result <- extract_file_sync("document.pdf")
cat("Content:", result$content, "\n")
cat("MIME type:", result$mime_type, "\n")
cat("Pages:", page_count(result), "\n")
cat("Tables:", length(result$tables), "\n")
cat("Language:", detected_language(result), "\n")
S3 Methods for kreuzberg_result¶
print()¶
Print a brief summary of the result.
Example:
summary()¶
Summarize the extraction result.
Example:
format()¶
Format the result as a string.
content()¶
Extract the text content.
Example:
mime_type()¶
Get the MIME type of the document.
Example:
page_count()¶
Get the number of pages in the document.
Example:
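A combined sketch of the accessors documented above (`content()`, `mime_type()`, `page_count()`). It assumes that `content()` returns a single character string; `result_overview` is a hypothetical helper name.

```r
# Hypothetical helper: gather accessor values into one named list.
result_overview <- function(result) {
  list(
    mime_type  = kreuzberg::mime_type(result),
    pages      = kreuzberg::page_count(result),
    characters = nchar(kreuzberg::content(result))  # assumes a single string
  )
}

# Usage (needs the kreuzberg package and an existing file):
# result <- kreuzberg::extract_file_sync("document.pdf")
# str(result_overview(result))
```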
chunk_count()¶
Get the number of text chunks.
Example:
result <- extract_file_sync("document.pdf", config = extraction_config(chunking = chunking_config()))
chunks <- chunk_count(result)
detected_language()¶
Get the detected language code.
Example:
result <- extract_file_sync("document.pdf")
lang <- detected_language(result)
if (!is.null(lang)) {
cat("Language:", lang, "\n")
}
metadata_field()¶
Extract a specific metadata field by name.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| x | kreuzberg_result | Result object |
| name | character | Field name |
Returns:
- Field value or NULL if not present
Example:
result <- extract_file_sync("document.pdf")
title <- metadata_field(result, "title")
author <- metadata_field(result, "author")
Metadata Hash¶
Document metadata with format-specific fields.
Common Fields:
| Field | Type | Description |
|---|---|---|
| language | character | Document language (ISO 639-1 code) |
| date | character | Document date (ISO 8601 format) |
| subject | character | Document subject |
| format_type | character | Format discriminator ("pdf", "excel", "email", etc.) |
PDF-Specific Fields (when format_type == "pdf"):
| Field | Type | Description |
|---|---|---|
| title | character | PDF title |
| author | character | PDF author |
| page_count | integer | Number of pages |
| creation_date | character | Creation date (ISO 8601) |
| modification_date | character | Modification date (ISO 8601) |
| creator | character | Creator application |
| producer | character | Producer application |
| keywords | character | PDF keywords |
Example:
result <- extract_file_sync("document.pdf")
metadata <- result$metadata
# identical() also handles a missing format_type (NULL) without raising an error
if (identical(metadata$format_type, "pdf")) {
  cat("Title:", metadata$title, "\n")
  cat("Author:", metadata$author, "\n")
  cat("Pages:", metadata$page_count, "\n")
}
Error Handling¶
Errors are raised as typed conditions with the following class hierarchy:
- kreuzberg_error (base)
  - ValidationError
  - ParsingError
  - FileNotFoundError
  - UnsupportedFormatError
  - ExtractionError
Example - Basic error handling:
library(kreuzberg)
tryCatch(
result <- extract_file_sync("document.pdf"),
FileNotFoundError = function(e) {
cat("File not found:", conditionMessage(e), "\n")
},
ValidationError = function(e) {
cat("Validation error:", conditionMessage(e), "\n")
},
kreuzberg_error = function(e) {
cat("Extraction error:", conditionMessage(e), "\n")
}
)
Example - Specific error handling:
tryCatch(
{
result <- extract_file_sync("scanned.pdf", config = extraction_config(
ocr = ocr_config(backend = "unsupported-backend")
))
},
ValidationError = function(e) {
cat("Invalid configuration:", conditionMessage(e), "\n")
},
error = function(e) {
cat("Unexpected error:", conditionMessage(e), "\n")
}
)
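Handler order matters in the examples above: `tryCatch()` checks its handlers in the order they are listed, so the specific classes must come before `kreuzberg_error`. The base R sketch below shows this with simulated conditions. The class vector `c(<specific>, "kreuzberg_error", "error", "condition")` is an assumption about how the package builds its conditions, and `simulate_error` and `handle` are hypothetical helpers.

```r
# Build a condition object that mimics the documented hierarchy (assumed layout).
simulate_error <- function(class, message) {
  structure(
    list(message = message, call = NULL),
    class = c(class, "kreuzberg_error", "error", "condition")
  )
}

# Specific handler first, generic fallback second.
handle <- function(expr) {
  tryCatch(
    expr,
    FileNotFoundError = function(e) "file-not-found handler",
    kreuzberg_error   = function(e) "generic kreuzberg handler"
  )
}

handle(stop(simulate_error("FileNotFoundError", "missing.pdf")))
# "file-not-found handler"
handle(stop(simulate_error("ParsingError", "bad document")))
# "generic kreuzberg handler"
```

If the `kreuzberg_error` handler were listed first, it would also catch the `FileNotFoundError` condition, because that condition inherits from the base class.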
Cache Management¶
clear_cache()¶
Clear the extraction cache.
Signature:
Example:
cache_stats()¶
Get cache statistics.
Signature:
Returns:
- Named list with:
  - total_entries (integer): Number of cached entries
  - total_size_bytes (integer): Total cache size in bytes
Example:
library(kreuzberg)
stats <- cache_stats()
cat("Cache entries:", stats$total_entries, "\n")
cat("Cache size:", stats$total_size_bytes, "bytes\n")
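The two cache functions can be combined into a simple size guard. This is a sketch only: `clear_cache_if_large` is a hypothetical helper, the threshold is arbitrary, and the `*_fn` parameters exist so the logic can be tried without the package installed.

```r
# Hypothetical helper: clear the cache once it grows past `max_bytes`.
clear_cache_if_large <- function(max_bytes,
                                 stats_fn = kreuzberg::cache_stats,
                                 clear_fn = kreuzberg::clear_cache) {
  stats <- stats_fn()
  too_large <- stats$total_size_bytes > max_bytes
  if (too_large) clear_fn()
  invisible(too_large)  # TRUE when the cache was cleared
}

# Usage (needs the kreuzberg package); 500 MB is an arbitrary example limit:
# clear_cache_if_large(500 * 1024^2)
```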
Validation¶
validate_ocr_backend_name()¶
Validate OCR backend name.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| backend | character | Backend name to validate |
Returns:
- Logical: TRUE if valid, FALSE otherwise
Example:
library(kreuzberg)
is_valid <- validate_ocr_backend_name("tesseract")
if (!is_valid) {
cat("Invalid OCR backend\n")
}
validate_language_code()¶
Validate language code.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| code | character | Language code (ISO 639-3 or 639-1) |
Returns:
- Logical: TRUE if valid, FALSE otherwise
Example:
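A sketch that checks both the backend name and the language code before an OCR configuration is built. `check_ocr_inputs` is a hypothetical helper, and the two validator parameters default to the documented functions so the logic can also be tried with stand-ins.

```r
# Hypothetical helper: stop with a readable message on invalid OCR settings.
check_ocr_inputs <- function(backend, language,
                             valid_backend = kreuzberg::validate_ocr_backend_name,
                             valid_language = kreuzberg::validate_language_code) {
  if (!isTRUE(valid_backend(backend))) {
    stop("Unknown OCR backend: ", backend)
  }
  if (!isTRUE(valid_language(language))) {
    stop("Invalid language code: ", language)
  }
  invisible(TRUE)
}

# Usage (needs the kreuzberg package):
# check_ocr_inputs("tesseract", "eng")
# ocr <- kreuzberg::ocr_config(backend = "tesseract", language = "eng")
```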
validate_output_format()¶
Validate output format.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| format | character | Output format name |
Returns:
- Logical: TRUE if valid, FALSE otherwise
Metadata Detection¶
detect_mime_type()¶
Detect MIME type from raw bytes.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| data | raw | Binary data |
Returns:
- Character: Detected MIME type
Example:
library(kreuzberg)
data <- readBin("document", what = "raw", n = file.size("document"))
mime_type <- detect_mime_type(data)
cat("Detected MIME type:", mime_type, "\n")
detect_mime_type_from_path()¶
Detect MIME type from file path.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| path | character | File path |
Returns:
- Character: Detected MIME type
Example:
library(kreuzberg)
mime_type <- detect_mime_type_from_path("document.pdf")
cat("MIME type:", mime_type, "\n")
get_extensions_for_mime()¶
Get file extensions for a MIME type.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| mime_type | character | MIME type |
Returns:
- Character vector: File extensions for the MIME type
Example:
library(kreuzberg)
extensions <- get_extensions_for_mime("application/pdf")
cat("PDF extensions:", paste(extensions, collapse = ", "), "\n")
validate_mime_type()¶
Validate MIME type.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| mime_type | character | MIME type to validate |
Returns:
- Logical: TRUE if valid, FALSE otherwise
Example:
Plugins¶
OCR Backends¶
register_ocr_backend()¶
Register a custom OCR backend.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| name | character | Backend name |
| callback | function | Backend implementation function |
unregister_ocr_backend()¶
Unregister an OCR backend.
Signature:
list_ocr_backends()¶
List all registered OCR backends.
Signature:
Returns:
- Character vector: Names of registered backends
Example:
library(kreuzberg)
backends <- list_ocr_backends()
cat("Available OCR backends:", paste(backends, collapse = ", "), "\n")
clear_ocr_backends()¶
Clear all registered OCR backends.
Signature:
Post-Processors¶
register_post_processor()¶
Register a custom post-processor.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| name | character | Processor name |
| callback | function | Processor implementation function |
unregister_post_processor()¶
Unregister a post-processor.
Signature:
list_post_processors()¶
List all registered post-processors.
Signature:
Returns:
- Character vector: Names of registered post-processors
clear_post_processors()¶
Clear all registered post-processors.
Signature:
Validators¶
register_validator()¶
Register a custom validator.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| name | character | Validator name |
| callback | function | Validator implementation function |
unregister_validator()¶
Unregister a validator.
Signature:
list_validators()¶
List all registered validators.
Signature:
Returns:
- Character vector: Names of registered validators
clear_validators()¶
Clear all registered validators.
Signature:
Document Extractors¶
list_document_extractors()¶
List all available document extractors.
Signature:
Returns:
- Character vector: Names of available document extractors
unregister_document_extractor()¶
Unregister a document extractor.
Signature:
Parameters:
| Parameter | Type | Description |
|---|---|---|
| name | character | Extractor name |
clear_document_extractors()¶
Clear all document extractors.
Signature:
Thread Safety¶
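All Kreuzberg functions are thread-safe and can be called concurrently, for example from workers started through R's parallel package or the future framework. Note that mclapply() and plan(multisession) run their workers as separate R processes rather than threads.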
Example - Using parallel package:
library(kreuzberg)
library(parallel)
files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")
# Use parallel processing
results <- mclapply(files, function(file) {
extract_file_sync(file)
}, mc.cores = 3)
for (i in seq_along(results)) {
cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}
Example - Using future package:
library(kreuzberg)
library(future)
plan(multisession)
files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")
# Process files asynchronously
futures <- lapply(files, function(file) {
future({
extract_file_sync(file)
})
})
# Collect results
results <- lapply(futures, value)
for (i in seq_along(results)) {
cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}
However, for better performance, use the batch API instead:
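A minimal sketch using `batch_extract_files_sync()` as documented above. The helper name `extract_all` is hypothetical; it assumes results come back in input order, as the indexing in the batch example implies, and the `batch_extract` parameter exists only so the helper can be tried without the package installed.

```r
# Hypothetical helper: one batch call instead of one worker per file.
extract_all <- function(files,
                        batch_extract = kreuzberg::batch_extract_files_sync) {
  results <- batch_extract(files)
  # Name each result after its input path (assumes input order is preserved).
  stats::setNames(results, files)
}

# Usage (needs the kreuzberg package and existing files):
# results <- extract_all(c("doc1.pdf", "doc2.pdf", "doc3.pdf"))
# for (file in names(results)) {
#   cat(sprintf("%s: %d characters\n", file, nchar(results[[file]]$content)))
# }
```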