R API Reference

Complete reference for the Kreuzberg R API.

Installation

Install from the R-universe repository:

R

install.packages("kreuzberg", repos = "https://kreuzberg-dev.r-universe.dev")

Or install from source using remotes:

R

remotes::install_github("kreuzberg-dev/kreuzberg", subdir = "packages/r")

System Requirements:

R >= 4.2
Rust toolchain (cargo, rustc >= 1.91) for building from source
Supported platforms: Linux (x64, arm64), macOS (Apple Silicon)

Core Functions

extract_file_sync()

Extract content from a file (synchronous).

Signature:

R

extract_file_sync(path, mime_type = NULL, config = NULL) -> kreuzberg_result

Parameters:

Parameter	Type	Description
`path`	character	Path to the file to extract
`mime_type`	character, NULL	Optional MIME type hint. If NULL, MIME type is auto-detected
`config`	list, NULL	Extraction configuration. Uses defaults if NULL

Returns:

kreuzberg_result: Extraction result object (S3 class inheriting from list)

Raises:

ValidationError: Input validation failed
ParsingError: Document parsing failed
FileNotFoundError: File does not exist
UnsupportedFormatError: Document format not supported
ExtractionError: General extraction failure

Example - Basic usage:

R

library(kreuzberg)

result <- extract_file_sync("document.pdf")
cat("Content:\n", result$content, "\n")
cat("Pages:", page_count(result), "\n")

Example - With configuration:

R

library(kreuzberg)

config <- extraction_config(
  ocr = ocr_config(backend = "tesseract", language = "eng")
)
result <- extract_file_sync("scanned.pdf", config = config)

Example - With explicit MIME type:

R

library(kreuzberg)

result <- extract_file_sync("document.pdf", mime_type = "application/pdf")

extract_file()

Extract content from a file (asynchronous via Tokio runtime).

Note: R has no native async/await. This function runs the extraction on a Tokio runtime and blocks until it completes, so from R it behaves like extract_file_sync(). For background processing, run the call in a separate R process or in a pool of worker processes.

Signature:

R

extract_file(path, mime_type = NULL, config = NULL) -> kreuzberg_result

Parameters:

Same as extract_file_sync().

Returns:

kreuzberg_result: Extraction result object

Example:

R

library(kreuzberg)

# Equivalent to extract_file_sync in R
result <- extract_file("document.pdf")
cat(result$content)
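
Because the call blocks, keeping the main session responsive means handing the work to another process. The following is a minimal sketch using the base parallel package. It assumes a Unix-like system (forking is not available on Windows), and slow_task() is a hypothetical stand-in for extract_file() so that the snippet runs without the package or a real document.

```r
library(parallel)

# Hypothetical stand-in for extract_file(); it only simulates slow work.
slow_task <- function(path) {
  Sys.sleep(0.2)
  sprintf("processed %s", path)
}

# Start the task in a forked child process; this call returns immediately.
job <- mcparallel(slow_task("document.pdf"))

# The main session is free to do other work here.
message("extraction running in process ", job$pid)

# Block until the child finishes and fetch its return value.
result <- mccollect(job)[[1]]
print(result)
```

In real use, the body of slow_task() would be replaced by the extraction call; whether the native library behaves well in a forked child is something to verify for your setup.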

extract_bytes_sync()

Extract content from raw bytes (synchronous).

Signature:

R

extract_bytes_sync(data, mime_type, config = NULL) -> kreuzberg_result

Parameters:

Parameter	Type	Description
`data`	raw	Binary data to extract (raw vector)
`mime_type`	character	MIME type of the data (required for format detection)
`config`	list, NULL	Extraction configuration

Returns:

kreuzberg_result: Extraction result object

Example:

R

library(kreuzberg)

data <- readBin("document.pdf", what = "raw", n = file.size("document.pdf"))
result <- extract_bytes_sync(data, "application/pdf")
cat(result$content)

extract_bytes()

Extract content from raw bytes (asynchronous via Tokio runtime).

Signature:

R

extract_bytes(data, mime_type, config = NULL) -> kreuzberg_result

Parameters:

Same as extract_bytes_sync().

Returns:

kreuzberg_result: Extraction result object

batch_extract_files_sync()

Extract content from multiple files in parallel (synchronous).

Signature:

R

batch_extract_files_sync(paths, config = NULL) -> list of kreuzberg_result

Parameters:

Parameter	Type	Description
`paths`	character	Vector of file paths to extract
`config`	list, NULL	Extraction configuration applied to all files

Returns:

List of kreuzberg_result objects

Example:

R

library(kreuzberg)

paths <- c("doc1.pdf", "doc2.docx", "doc3.xlsx")
results <- batch_extract_files_sync(paths)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", paths[i], nchar(results[[i]]$content)))
}

batch_extract_files()

Extract content from multiple files in parallel (asynchronous via Tokio runtime).

Signature:

R

batch_extract_files(paths, config = NULL) -> list of kreuzberg_result

Parameters:

Same as batch_extract_files_sync().

Returns:

List of kreuzberg_result objects

batch_extract_bytes_sync()

Extract content from multiple raw byte arrays (synchronous).

Signature:

R

batch_extract_bytes_sync(data_list, mime_types, config = NULL) -> list of kreuzberg_result

Parameters:

Parameter	Type	Description
`data_list`	list of raw	List of binary data (raw vectors)
`mime_types`	character	MIME types corresponding to each byte array
`config`	list, NULL	Extraction configuration

Returns:

List of kreuzberg_result objects

Example:

R

library(kreuzberg)

pdf_data <- readBin("invoice.pdf", what = "raw", n = file.size("invoice.pdf"))
docx_data <- readBin("report.docx", what = "raw", n = file.size("report.docx"))

data_list <- list(pdf_data, docx_data)
mime_types <- c("application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")

results <- batch_extract_bytes_sync(data_list, mime_types)

for (i in seq_along(results)) {
  cat(sprintf("Document %d: %d characters\n", i, nchar(results[[i]]$content)))
}
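
Since each entry in mime_types describes the element of data_list at the same position, a length mismatch is worth catching before the call. The helper below is a hypothetical pre-flight check written in base R; it is not exported by the package, and the package may perform its own validation.

```r
# Hypothetical pre-flight check; not part of the kreuzberg package.
check_batch_inputs <- function(data_list, mime_types) {
  stopifnot(
    is.list(data_list),
    all(vapply(data_list, is.raw, logical(1))),  # every element is a raw vector
    is.character(mime_types),
    length(data_list) == length(mime_types)      # one MIME type per payload
  )
  invisible(TRUE)
}

# Small in-memory payloads stand in for real documents.
data_list <- list(charToRaw("%PDF-1.7"), charToRaw("plain text"))

ok <- check_batch_inputs(data_list, c("application/pdf", "text/plain"))

# A single MIME type for two payloads is rejected with an error.
mismatch <- tryCatch(
  check_batch_inputs(data_list, "application/pdf"),
  error = function(e) conditionMessage(e)
)
print(mismatch)
```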

batch_extract_bytes()

Extract content from multiple raw byte arrays (asynchronous via Tokio runtime).

Signature:

R

batch_extract_bytes(data_list, mime_types, config = NULL) -> list of kreuzberg_result

Parameters:

Same as batch_extract_bytes_sync().

Returns:

List of kreuzberg_result objects

Configuration

extraction_config()

Create an extraction configuration object.

Signature:

R

extraction_config(
  force_ocr = FALSE,
  ocr = NULL,
  chunking = NULL,
  output_format = NULL,
  result_format = NULL,
  use_cache = NULL,
  include_document_structure = NULL,
  enable_quality_processing = NULL,
  language_detection = NULL,
  keywords = NULL,
  token_reduction = NULL,
  images = NULL,
  pages = NULL,
  pdf_options = NULL,
  html_options = NULL,
  postprocessor = NULL,
  security_limits = NULL,
  max_concurrent_extractions = NULL,
  ...
) -> list

Parameters:

Parameter	Type	Description
`ocr`	list, NULL	OCR configuration (see `ocr_config()`)
`chunking`	list, NULL	Text chunking options (see `chunking_config()`)
`use_cache`	logical, NULL	Enable result caching
`language_detection`	list, NULL	Language detection options
`images`	list, NULL	Image extraction options
`pages`	list, NULL	Page extraction options
`pdf_options`	list, NULL	PDF-specific options
`html_options`	list, NULL	HTML-specific options
`postprocessor`	character, NULL	Post-processor name
`security_limits`	list, NULL	Security limit options
Other options		Additional configuration parameters

Returns:

Named list with configuration options

Example:

R

config <- extraction_config(
  ocr = ocr_config(backend = "tesseract", language = "eng"),
  chunking = chunking_config(max_characters = 1000L, overlap = 200L),
  use_cache = TRUE
)

result <- extract_file_sync("document.pdf", config = config)

ocr_config()

Create OCR configuration.

Signature:

R

ocr_config(backend = "tesseract", language = "eng", dpi = NULL, ...) -> list

Parameters:

Parameter	Type	Description
`backend`	character	OCR backend ("tesseract" or "paddle-ocr"). Default: "tesseract"
`language`	character	Language code (ISO 639-3). Default: "eng"
`dpi`	integer, NULL	DPI for OCR processing
...		Additional OCR options

Returns:

Named list with OCR configuration

Example:

R

config <- extraction_config(
  ocr = ocr_config(backend = "paddle-ocr", language = "eng")
)

chunking_config()

Create text chunking configuration.

Signature:

R

chunking_config(max_characters = 1000L, overlap = 200L, ...) -> list

Parameters:

Parameter	Type	Description
`max_characters`	integer	Maximum characters per chunk. Default: 1000
`overlap`	integer	Overlap between chunks. Default: 200
...		Additional chunking options

Returns:

Named list with chunking configuration

Example:

R

config <- extraction_config(
  chunking = chunking_config(max_characters = 2000L, overlap = 500L)
)
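
To make the interaction between max_characters and overlap concrete, here is a naive fixed-width splitter in base R. It is only an illustration of the two parameters: naive_chunks() is a hypothetical helper, and the library's own chunker may split differently (for example, at word or sentence boundaries).

```r
# Hypothetical helper: fixed-width chunks where consecutive chunks share
# `overlap` characters. Not the algorithm used by the package.
naive_chunks <- function(text, max_characters = 1000L, overlap = 200L) {
  stopifnot(overlap < max_characters)
  n <- nchar(text)
  if (n == 0L) {
    return(character(0))
  }
  step <- max_characters - overlap            # distance between chunk starts
  starts <- seq(1L, max(1L, n - overlap), by = step)
  substring(text, starts, pmin(starts + max_characters - 1L, n))
}

text <- paste(rep("abcdefghij", 5), collapse = "")  # 50 characters
chunks <- naive_chunks(text, max_characters = 20L, overlap = 5L)

print(nchar(chunks))
# The last 5 characters of one chunk are the first 5 of the next.
print(substring(chunks[1], 16, 20) == substring(chunks[2], 1, 5))
```

A larger overlap gives more shared context between neighbouring chunks at the cost of producing more chunks for the same text.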

discover()

Search for a kreuzberg.toml configuration file in the current directory and its parent directories.

Signature:

R

discover() -> list or NULL

Returns:

Named list with configuration if found, NULL otherwise

Example:

R

config <- discover()
if (!is.null(config)) {
  result <- extract_file_sync("document.pdf", config = config)
}
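
The lookup itself happens inside the package, but the kind of upward directory walk described above can be sketched in base R. find_config_upwards() is a hypothetical helper used only for illustration; it returns the path of the nearest match rather than a parsed configuration.

```r
# Hypothetical helper: walk from `start` towards the filesystem root and
# return the path of the first `filename` found, or NULL if there is none.
find_config_upwards <- function(filename = "kreuzberg.toml", start = getwd()) {
  dir <- normalizePath(start, mustWork = TRUE)
  repeat {
    candidate <- file.path(dir, filename)
    if (file.exists(candidate)) {
      return(candidate)
    }
    parent <- dirname(dir)
    if (identical(parent, dir)) {
      return(NULL)  # reached the root without finding the file
    }
    dir <- parent
  }
}

# Demonstration in a throwaway directory tree with an empty config file.
root <- file.path(tempdir(), "kreuzberg-discover-demo")
nested <- file.path(root, "a", "b")
dir.create(nested, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "kreuzberg.toml"))

found <- find_config_upwards(start = nested)
print(found)
```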

from_file()

Load configuration from a TOML, YAML, or JSON file.

Signature:

R

from_file(path) -> list

Parameters:

Parameter	Type	Description
`path`	character	Path to configuration file (TOML, YAML, or JSON)

Returns:

Named list with configuration

Example:

R

config <- from_file("kreuzberg.toml")
result <- extract_file_sync("document.pdf", config = config)

Results & Types

kreuzberg_result

Result object returned by all extraction functions. Inherits from list with named fields.

Fields:

Field	Type	Description
`content`	character	Extracted text content
`mime_type`	character	MIME type of the processed document
`pages`	list, NULL	Per-page extracted content (if page extraction enabled)
`tables`	list, NULL	List of extracted tables
`chunks`	list, NULL	Text chunks (if chunking enabled)
`images`	list, NULL	Extracted images
`elements`	list, NULL	Document elements
`keywords`	character, NULL	Extracted keywords
`quality_score`	numeric, NULL	Quality score (0.0-1.0)
`detected_language`	character, NULL	Detected language code (ISO 639-1)
`metadata`	list	Document metadata

Example:

R

result <- extract_file_sync("document.pdf")

cat("Content:", result$content, "\n")
cat("MIME type:", result$mime_type, "\n")
cat("Pages:", page_count(result), "\n")
cat("Tables:", length(result$tables), "\n")
cat("Language:", detected_language(result), "\n")
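
Because the result is an S3 object built on a list, ordinary list tools apply to it. The sketch below builds a mock object by hand to show this without running an extraction. The class vector and the field values are assumptions made for illustration; real results come from the extraction functions.

```r
# Mock object for illustration only; field values are invented and the
# class vector c("kreuzberg_result", "list") is an assumption.
mock_result <- structure(
  list(
    content = "Hello, world.",
    mime_type = "text/plain",
    tables = NULL,
    metadata = list(format_type = "text")
  ),
  class = c("kreuzberg_result", "list")
)

# Both class checks succeed, so list-style access is available.
print(inherits(mock_result, "kreuzberg_result"))
print(is.list(mock_result))

# Fields can be read with $ or [[ ]], and listed with names().
print(mock_result$content)
print(mock_result[["mime_type"]])
print(names(mock_result))

# Optional fields may be NULL, so test before using them.
has_tables <- !is.null(mock_result$tables)
print(has_tables)
```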

S3 Methods for kreuzberg_result

print()

Print a brief summary of the result.

R

print(x)

Example:

R

result <- extract_file_sync("document.pdf")
print(result)  # Displays summary

summary()

Summarize the extraction result.

R

summary(object)

Example:

R

result <- extract_file_sync("document.pdf")
summary(result)

format()

Format the result as a string.

R

format(x)

content()

Extract the text content.

R

content(x) -> character

Example:

R

result <- extract_file_sync("document.pdf")
text <- content(result)

mime_type()

Get the MIME type of the document.

R

mime_type(x) -> character

Example:

R

result <- extract_file_sync("document.pdf")
type <- mime_type(result)

page_count()

Get the number of pages in the document.

R

page_count(x) -> integer

Example:

R

result <- extract_file_sync("document.pdf")
pages <- page_count(result)

chunk_count()

Get the number of text chunks.

R

chunk_count(x) -> integer

Example:

R

result <- extract_file_sync("document.pdf", config = extraction_config(chunking = chunking_config()))
chunks <- chunk_count(result)

detected_language()

Get the detected language code.

R

detected_language(x) -> character or NULL

Example:

R

result <- extract_file_sync("document.pdf")
lang <- detected_language(result)
if (!is.null(lang)) {
  cat("Language:", lang, "\n")
}

metadata_field()

Extract a specific metadata field by name.

R

metadata_field(x, name) -> value or NULL

Parameters:

Parameter	Type	Description
`x`	kreuzberg_result	Result object
`name`	character	Field name

Returns:

Field value or NULL if not present

Example:

R

result <- extract_file_sync("document.pdf")
title <- metadata_field(result, "title")
author <- metadata_field(result, "author")

Metadata List

Document metadata with format-specific fields.

Common Fields:

Field	Type	Description
`language`	character	Document language (ISO 639-1 code)
`date`	character	Document date (ISO 8601 format)
`subject`	character	Document subject
`format_type`	character	Format discriminator ("pdf", "excel", "email", etc.)

PDF-Specific Fields (when format_type == "pdf"):

Field	Type	Description
`title`	character	PDF title
`author`	character	PDF author
`page_count`	integer	Number of pages
`creation_date`	character	Creation date (ISO 8601)
`modification_date`	character	Modification date (ISO 8601)
`creator`	character	Creator application
`producer`	character	Producer application
`keywords`	character	PDF keywords

Example:

R

result <- extract_file_sync("document.pdf")
metadata <- result$metadata

if (identical(metadata$format_type, "pdf")) {  # identical() is safe when the field is NULL
  cat("Title:", metadata$title, "\n")
  cat("Author:", metadata$author, "\n")
  cat("Pages:", metadata$page_count, "\n")
}
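
When several formats are handled, dispatching on format_type with switch() keeps the branches tidy. The sketch below uses hand-written mock metadata lists so it runs without the package; describe_metadata() is a hypothetical helper, and only the field names listed above are taken from this reference.

```r
# Hypothetical helper: summarise a metadata list based on its format_type.
describe_metadata <- function(metadata) {
  format_type <- metadata$format_type
  if (is.null(format_type)) {
    return("unknown format")
  }
  switch(
    format_type,
    pdf = sprintf("PDF, %d page(s)", metadata$page_count),
    sprintf("%s document", format_type)  # fallback for other formats
  )
}

# Mock metadata; the values are invented for illustration.
mock_pdf <- list(format_type = "pdf", title = "Report", page_count = 3L)
mock_email <- list(format_type = "email")

pdf_label <- describe_metadata(mock_pdf)
email_label <- describe_metadata(mock_email)
empty_label <- describe_metadata(list())

print(c(pdf_label, email_label, empty_label))
```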

Error Handling

Errors are raised as typed conditions with the following class hierarchy:

- kreuzberg_error (base)
- ValidationError
- ParsingError
- FileNotFoundError
- UnsupportedFormatError
- ExtractionError

Example - Basic error handling:

R

library(kreuzberg)

tryCatch(
  result <- extract_file_sync("document.pdf"),
  FileNotFoundError = function(e) {
    cat("File not found:", conditionMessage(e), "\n")
  },
  ValidationError = function(e) {
    cat("Validation error:", conditionMessage(e), "\n")
  },
  kreuzberg_error = function(e) {
    cat("Extraction error:", conditionMessage(e), "\n")
  }
)

Example - Specific error handling:

R

tryCatch(
  {
    result <- extract_file_sync("scanned.pdf", config = extraction_config(
      ocr = ocr_config(backend = "unsupported-backend")
    ))
  },
  ValidationError = function(e) {
    cat("Invalid configuration:", conditionMessage(e), "\n")
  },
  error = function(e) {
    cat("Unexpected error:", conditionMessage(e), "\n")
  }
)
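
To see how handler order interacts with the class hierarchy, the dispatch can be reproduced with hand-made conditions in base R. make_error() below is a hypothetical constructor that mimics the documented class names. It assumes each specific error also carries the kreuzberg_error and error classes, which the examples above suggest but this sketch does not verify.

```r
# Hypothetical constructor; real conditions are created by the package.
# Assumed class vector: specific class, then kreuzberg_error, then error.
make_error <- function(message, subclass) {
  structure(
    list(message = message, call = NULL),
    class = c(subclass, "kreuzberg_error", "error", "condition")
  )
}

# Handlers are tried in the order given, so list the most specific first.
classify <- function(expr) {
  tryCatch(
    expr,
    FileNotFoundError = function(e) "missing file",
    kreuzberg_error = function(e) "other kreuzberg error",
    error = function(e) "unrelated error"
  )
}

r1 <- classify(stop(make_error("no such file", "FileNotFoundError")))
r2 <- classify(stop(make_error("bad input", "ValidationError")))
r3 <- classify(stop("something else"))

print(c(r1, r2, r3))
```

Placing the kreuzberg_error handler before FileNotFoundError would make it catch every package error, so the specific handler would never run.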

Cache Management

clear_cache()

Clear the extraction cache.

Signature:

R

clear_cache() -> invisible(NULL)

Example:

R

library(kreuzberg)

clear_cache()

cache_stats()

Get cache statistics.

Signature:

R

cache_stats() -> list

Returns:

Named list with:
total_entries (integer): Number of cached entries
total_size_bytes (integer): Total cache size in bytes

Example:

R

library(kreuzberg)

stats <- cache_stats()
cat("Cache entries:", stats$total_entries, "\n")
cat("Cache size:", stats$total_size_bytes, "bytes\n")
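
Raw byte counts are hard to read at a glance, so a small formatting step can help when logging cache usage. The sketch below works on a mock list with the two documented fields; the numbers are invented, and real values would come from cache_stats().

```r
# Mock of the documented list shape; the values are invented.
stats <- list(total_entries = 12L, total_size_bytes = 3145728)

# Convert bytes to mebibytes (1 MiB = 1024^2 bytes) for display.
size_label <- sprintf("%.1f MiB", stats$total_size_bytes / 1024^2)

cat(sprintf("Cache: %d entries, %s\n", stats$total_entries, size_label))
```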

Validation

validate_ocr_backend_name()

Validate OCR backend name.

Signature:

R

validate_ocr_backend_name(backend) -> logical

Parameters:

Parameter	Type	Description
`backend`	character	Backend name to validate

Returns:

Logical: TRUE if valid, FALSE otherwise

Example:

R

library(kreuzberg)

is_valid <- validate_ocr_backend_name("tesseract")
if (!is_valid) {
  cat("Invalid OCR backend\n")
}

validate_language_code()

Validate language code.

Signature:

R

validate_language_code(code) -> logical

Parameters:

Parameter	Type	Description
`code`	character	Language code (ISO 639-3 or 639-1)

Returns:

Logical: TRUE if valid, FALSE otherwise

Example:

R

library(kreuzberg)

is_valid <- validate_language_code("eng")

validate_output_format()

Validate output format.

Signature:

R

validate_output_format(format) -> logical

Parameters:

Parameter	Type	Description
`format`	character	Output format name

Returns:

Logical: TRUE if valid, FALSE otherwise

Metadata Detection

detect_mime_type()

Detect MIME type from raw bytes.

Signature:

R

detect_mime_type(data) -> character

Parameters:

Parameter	Type	Description
`data`	raw	Binary data

Returns:

Character: Detected MIME type

Example:

R

library(kreuzberg)

data <- readBin("document", what = "raw", n = file.size("document"))
mime_type <- detect_mime_type(data)
cat("Detected MIME type:", mime_type, "\n")

detect_mime_type_from_path()

Detect MIME type from file path.

Signature:

R

detect_mime_type_from_path(path) -> character

Parameters:

Parameter	Type	Description
`path`	character	File path

Returns:

Character: Detected MIME type

Example:

R

library(kreuzberg)

mime_type <- detect_mime_type_from_path("document.pdf")
cat("MIME type:", mime_type, "\n")

get_extensions_for_mime()

Get file extensions for a MIME type.

Signature:

R

get_extensions_for_mime(mime_type) -> character

Parameters:

Parameter	Type	Description
`mime_type`	character	MIME type

Returns:

Character vector: File extensions for the MIME type

Example:

R

library(kreuzberg)

extensions <- get_extensions_for_mime("application/pdf")
cat("PDF extensions:", paste(extensions, collapse = ", "), "\n")

validate_mime_type()

Validate MIME type.

Signature:

R

validate_mime_type(mime_type) -> logical

Parameters:

Parameter	Type	Description
`mime_type`	character	MIME type to validate

Returns:

Logical: TRUE if valid, FALSE otherwise

Example:

R

library(kreuzberg)

is_valid <- validate_mime_type("application/pdf")

Plugins

OCR Backends

register_ocr_backend()

Register a custom OCR backend.

Signature:

R

register_ocr_backend(name, callback) -> invisible(NULL)

Parameters:

Parameter	Type	Description
`name`	character	Backend name
`callback`	function	Backend implementation function

unregister_ocr_backend()

Unregister an OCR backend.

Signature:

R

unregister_ocr_backend(name) -> invisible(NULL)

list_ocr_backends()

List all registered OCR backends.

Signature:

R

list_ocr_backends() -> character

Returns:

Character vector: Names of registered backends

Example:

R

library(kreuzberg)

backends <- list_ocr_backends()
cat("Available OCR backends:", paste(backends, collapse = ", "), "\n")

clear_ocr_backends()

Clear all registered OCR backends.

Signature:

R

clear_ocr_backends() -> invisible(NULL)

Post-Processors

register_post_processor()

Register a custom post-processor.

Signature:

R

register_post_processor(name, callback) -> invisible(NULL)

Parameters:

Parameter	Type	Description
`name`	character	Processor name
`callback`	function	Processor implementation function

unregister_post_processor()

Unregister a post-processor.

Signature:

R

unregister_post_processor(name) -> invisible(NULL)

list_post_processors()

List all registered post-processors.

Signature:

R

list_post_processors() -> character

Returns:

Character vector: Names of registered post-processors

clear_post_processors()

Clear all registered post-processors.

Signature:

R

clear_post_processors() -> invisible(NULL)

Validators

register_validator()

Register a custom validator.

Signature:

R

register_validator(name, callback) -> invisible(NULL)

Parameters:

Parameter	Type	Description
`name`	character	Validator name
`callback`	function	Validator implementation function

unregister_validator()

Unregister a validator.

Signature:

R

unregister_validator(name) -> invisible(NULL)

list_validators()

List all registered validators.

Signature:

R

list_validators() -> character

Returns:

Character vector: Names of registered validators

clear_validators()

Clear all registered validators.

Signature:

R

clear_validators() -> invisible(NULL)

Document Extractors

list_document_extractors()

List all available document extractors.

Signature:

R

list_document_extractors() -> character

Returns:

Character vector: Names of available document extractors

unregister_document_extractor()

Unregister a document extractor.

Signature:

R

unregister_document_extractor(name) -> invisible(NULL)

Parameters:

Parameter	Type	Description
`name`	character	Extractor name

clear_document_extractors()

Clear all document extractors.

Signature:

R

clear_document_extractors() -> invisible(NULL)

Thread Safety

All Kreuzberg functions are thread-safe and can be called concurrently, for example from worker processes managed by R's parallel package or the future framework.

Example - Using parallel package:

R

library(kreuzberg)
library(parallel)

files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")

# Process files in parallel (mclapply relies on forking, so Unix-like systems only)
results <- mclapply(files, function(file) {
  extract_file_sync(file)
}, mc.cores = 3)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}

Example - Using future package:

R

library(kreuzberg)
library(future)

plan(multisession)

files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")

# Process files asynchronously
futures <- lapply(files, function(file) {
  future({
    extract_file_sync(file)
  })
})

# Collect results
results <- lapply(futures, value)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}

However, for better performance, use the batch API instead:

R

library(kreuzberg)

files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")

# Better approach: use built-in batch processing
results <- batch_extract_files_sync(files)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}