R API Reference

Complete reference for the Kreuzberg R API.

Installation

Install from the R-universe repository:

R
install.packages("kreuzberg", repos = "https://kreuzberg-dev.r-universe.dev")

Or install from source using remotes:

R
remotes::install_github("kreuzberg-dev/kreuzberg", subdir = "packages/r")

System Requirements:

  • R >= 4.2
  • Rust toolchain (cargo, rustc >= 1.91) for building from source
  • Supported platforms: Linux (x64, arm64), macOS (Apple Silicon)
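
Before building from source, it can help to confirm the requirements above. The following is a minimal sketch in base R; the helper name check_build_requirements() is hypothetical and not part of Kreuzberg, and it only checks that a cargo binary is on the PATH, not the rustc version.

```r
# Hypothetical helper (not part of kreuzberg): checks the documented
# prerequisites for a source build using base R only.
check_build_requirements <- function(min_r = "4.2") {
  list(
    r_ok = getRversion() >= min_r,                       # R >= 4.2 is required
    cargo_found = unname(nzchar(Sys.which("cargo")))     # TRUE if cargo is on the PATH
  )
}

status <- check_build_requirements()
if (!status$r_ok) message("R is older than 4.2; upgrade before installing.")
if (!status$cargo_found) message("cargo not found; a source build needs the Rust toolchain.")
```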

Core Functions

extract_file_sync()

Extract content from a file (synchronous).

Signature:

R
extract_file_sync(path, mime_type = NULL, config = NULL) -> kreuzberg_result

Parameters:

Parameter | Type | Description
path | character | Path to the file to extract
mime_type | character, NULL | Optional MIME type hint. If NULL, the MIME type is auto-detected
config | list, NULL | Extraction configuration. Uses defaults if NULL

Returns:

  • kreuzberg_result: Extraction result object (S3 class inheriting from list)

Raises:

  • ValidationError: Input validation failed
  • ParsingError: Document parsing failed
  • FileNotFoundError: File does not exist
  • UnsupportedFormatError: Document format not supported
  • ExtractionError: General extraction failure

Example - Basic usage:

R
library(kreuzberg)

result <- extract_file_sync("document.pdf")
cat("Content:\n", result$content, "\n")
cat("Pages:", page_count(result), "\n")

Example - With configuration:

R
library(kreuzberg)

config <- extraction_config(
  ocr = ocr_config(backend = "tesseract", language = "eng")
)
result <- extract_file_sync("scanned.pdf", config = config)

Example - With explicit MIME type:

R
library(kreuzberg)

result <- extract_file_sync("document.pdf", mime_type = "application/pdf")

extract_file()

Extract content from a file (asynchronous via Tokio runtime).

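Note: R does not have native async/await. This function runs the extraction on an internal Tokio runtime and blocks the calling R session until the result is ready, so in practice it behaves like extract_file_sync(). For background processing, run the call in a separate R process (for example via the parallel or future packages).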

Signature:

R
extract_file(path, mime_type = NULL, config = NULL) -> kreuzberg_result

Parameters:

Same as extract_file_sync().

Returns:

  • kreuzberg_result: Extraction result object

Example:

R
library(kreuzberg)

# Equivalent to extract_file_sync in R
result <- extract_file("document.pdf")
cat(result$content)
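
The note in this section suggests a separate R process for background work. One way to sketch that is with mcparallel() and mccollect() from R's parallel package, which fork a child process on Linux and macOS. The wrapper names start_background() and collect_background() are hypothetical, and a stand-in function is used so the snippet runs without Kreuzberg installed.

```r
library(parallel)

# Hypothetical wrappers (not part of kreuzberg). mcparallel() forks the
# current R session, so this works on Linux and macOS but not on Windows.
start_background <- function(fn, ...) {
  mcparallel(fn(...))
}

collect_background <- function(job) {
  # mccollect() returns a list with one element per job
  mccollect(job)[[1]]
}

# Demonstration with a stand-in function; with kreuzberg installed this could be:
# job <- start_background(kreuzberg::extract_file, "document.pdf")
job <- start_background(function(x) x * 2, 21)
# ... the main session is free to do other work here ...
answer <- collect_background(job)  # 42
```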

extract_bytes_sync()

Extract content from raw bytes (synchronous).

Signature:

R
extract_bytes_sync(data, mime_type, config = NULL) -> kreuzberg_result

Parameters:

Parameter | Type | Description
data | raw | Binary data to extract (raw vector)
mime_type | character | MIME type of the data (required for format detection)
config | list, NULL | Extraction configuration

Returns:

  • kreuzberg_result: Extraction result object

Example:

R
library(kreuzberg)

data <- readBin("document.pdf", what = "raw", n = file.size("document.pdf"))
result <- extract_bytes_sync(data, "application/pdf")
cat(result$content)

extract_bytes()

Extract content from raw bytes (asynchronous via Tokio runtime).

Signature:

R
extract_bytes(data, mime_type, config = NULL) -> kreuzberg_result

Parameters:

Same as extract_bytes_sync().

Returns:

  • kreuzberg_result: Extraction result object
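
No example is given for extract_bytes() in this section; the sketch below mirrors the extract_bytes_sync() example. The helper read_raw() is a hypothetical convenience, "document.pdf" is a placeholder, and the extraction call is guarded so the snippet also runs where Kreuzberg or the sample file is absent.

```r
# Hypothetical helper: read an entire file into a raw vector.
read_raw <- function(path) {
  readBin(path, what = "raw", n = file.size(path))
}

path <- "document.pdf"  # placeholder file name
if (requireNamespace("kreuzberg", quietly = TRUE) && file.exists(path)) {
  result <- kreuzberg::extract_bytes(read_raw(path), "application/pdf")
  cat(result$content)
}
```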

batch_extract_files_sync()

Extract content from multiple files in parallel (synchronous).

Signature:

R
batch_extract_files_sync(paths, config = NULL) -> list of kreuzberg_result

Parameters:

Parameter | Type | Description
paths | character | Vector of file paths to extract
config | list, NULL | Extraction configuration applied to all files

Returns:

  • List of kreuzberg_result objects

Example:

R
library(kreuzberg)

paths <- c("doc1.pdf", "doc2.docx", "doc3.xlsx")
results <- batch_extract_files_sync(paths)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", paths[i], nchar(results[[i]]$content)))
}
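
For larger batches, a table is often easier to inspect than printed lines. The following sketch assumes, as the loop above does, that results[[i]] corresponds to paths[i]; summarise_batch() is a hypothetical helper, and mock objects stand in for real results.

```r
# Hypothetical helper: tabulate a batch, assuming results[[i]] belongs to paths[i].
summarise_batch <- function(paths, results) {
  stopifnot(length(paths) == length(results))
  data.frame(
    path = paths,
    mime_type = vapply(results, function(r) r$mime_type, character(1)),
    characters = vapply(results, function(r) nchar(r$content), integer(1)),
    stringsAsFactors = FALSE
  )
}

# Mock results standing in for batch_extract_files_sync() output
mock <- list(
  list(content = "hello", mime_type = "application/pdf"),
  list(content = "hi", mime_type = "text/plain")
)
summary_table <- summarise_batch(c("doc1.pdf", "notes.txt"), mock)
print(summary_table)
```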

batch_extract_files()

Extract content from multiple files in parallel (asynchronous via Tokio runtime).

Signature:

R
batch_extract_files(paths, config = NULL) -> list of kreuzberg_result

Parameters:

Same as batch_extract_files_sync().

Returns:

  • List of kreuzberg_result objects
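
This section does not say how a batch call behaves when one of the paths is missing, so a cautious caller may want to filter the list first. A minimal sketch, with partition_paths() as a hypothetical helper and placeholder file names:

```r
# Hypothetical helper: split paths into existing and missing files.
partition_paths <- function(paths) {
  found <- file.exists(paths)
  list(existing = paths[found], missing = paths[!found])
}

paths <- c("doc1.pdf", "doc2.docx", "doc3.xlsx")  # placeholder file names
parts <- partition_paths(paths)
if (length(parts$missing) > 0) {
  message("Skipping missing files: ", paste(parts$missing, collapse = ", "))
}
if (requireNamespace("kreuzberg", quietly = TRUE) && length(parts$existing) > 0) {
  results <- kreuzberg::batch_extract_files(parts$existing)
}
```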

batch_extract_bytes_sync()

Extract content from multiple raw byte arrays (synchronous).

Signature:

R
batch_extract_bytes_sync(data_list, mime_types, config = NULL) -> list of kreuzberg_result

Parameters:

Parameter | Type | Description
data_list | list of raw | List of binary data (raw vectors)
mime_types | character | MIME types corresponding to each byte array
config | list, NULL | Extraction configuration

Returns:

  • List of kreuzberg_result objects

Example:

R
library(kreuzberg)

pdf_data <- readBin("invoice.pdf", what = "raw", n = file.size("invoice.pdf"))
docx_data <- readBin("report.docx", what = "raw", n = file.size("report.docx"))

data_list <- list(pdf_data, docx_data)
mime_types <- c("application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")

results <- batch_extract_bytes_sync(data_list, mime_types)

for (i in seq_along(results)) {
  cat(sprintf("Document %d: %d characters\n", i, nchar(results[[i]]$content)))
}

batch_extract_bytes()

Extract content from multiple raw byte arrays (asynchronous via Tokio runtime).

Signature:

R
batch_extract_bytes(data_list, mime_types, config = NULL) -> list of kreuzberg_result

Parameters:

Same as batch_extract_bytes_sync().

Returns:

  • List of kreuzberg_result objects
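
The two parallel inputs, data_list and mime_types, can be built from file paths in one step. The sketch below is one possible approach: build_batch_inputs() is a hypothetical helper whose detect argument defaults to the documented detect_mime_type_from_path() and can be swapped for a stub.

```r
# Hypothetical helper: read each file and look up its MIME type.
# `detect` defaults to the documented detect_mime_type_from_path(); the
# default is only evaluated when no other function is supplied.
build_batch_inputs <- function(paths, detect = kreuzberg::detect_mime_type_from_path) {
  data_list <- lapply(paths, function(p) readBin(p, what = "raw", n = file.size(p)))
  mime_types <- vapply(paths, detect, character(1), USE.NAMES = FALSE)
  list(data_list = data_list, mime_types = mime_types)
}

# With kreuzberg installed, the inputs could be passed on like this:
# inputs <- build_batch_inputs(c("invoice.pdf", "report.docx"))
# results <- kreuzberg::batch_extract_bytes(inputs$data_list, inputs$mime_types)
```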

Configuration

extraction_config()

Create an extraction configuration object.

Signature:

R
extraction_config(
  force_ocr = FALSE,
  ocr = NULL,
  chunking = NULL,
  output_format = NULL,
  result_format = NULL,
  use_cache = NULL,
  include_document_structure = NULL,
  enable_quality_processing = NULL,
  language_detection = NULL,
  keywords = NULL,
  token_reduction = NULL,
  images = NULL,
  pages = NULL,
  pdf_options = NULL,
  html_options = NULL,
  postprocessor = NULL,
  security_limits = NULL,
  max_concurrent_extractions = NULL,
  ...
) -> list

Parameters:

Parameter | Type | Description
ocr | list, NULL | OCR configuration (see ocr_config())
chunking | list, NULL | Text chunking options (see chunking_config())
use_cache | logical, NULL | Enable result caching
language_detection | list, NULL | Language detection options
images | list, NULL | Image extraction options
pages | list, NULL | Page extraction options
pdf_options | list, NULL | PDF-specific options
html_options | list, NULL | HTML-specific options
postprocessor | character, NULL | Post-processor name
security_limits | list, NULL | Security limit options
Other options | - | Additional configuration parameters

Returns:

  • Named list with configuration options

Example:

R
config <- extraction_config(
  ocr = ocr_config(backend = "tesseract", language = "eng"),
  chunking = chunking_config(max_characters = 1000L, overlap = 200L),
  use_cache = TRUE
)

result <- extract_file_sync("document.pdf", config = config)

ocr_config()

Create OCR configuration.

Signature:

R
ocr_config(backend = "tesseract", language = "eng", dpi = NULL, ...) -> list

Parameters:

Parameter | Type | Description
backend | character | OCR backend ("tesseract" or "paddle-ocr"). Default: "tesseract"
language | character | Language code (ISO 639-3). Default: "eng"
dpi | integer, NULL | DPI for OCR processing
... | - | Additional OCR options

Returns:

  • Named list with OCR configuration

Example:

R
config <- extraction_config(
  ocr = ocr_config(backend = "paddle-ocr", language = "eng")
)

chunking_config()

Create text chunking configuration.

Signature:

R
chunking_config(max_characters = 1000L, overlap = 200L, ...) -> list

Parameters:

Parameter | Type | Description
max_characters | integer | Maximum characters per chunk. Default: 1000
overlap | integer | Overlap between chunks. Default: 200
... | - | Additional chunking options

Returns:

  • Named list with chunking configuration

Example:

R
config <- extraction_config(
  chunking = chunking_config(max_characters = 2000L, overlap = 500L)
)

discover()

Search for a kreuzberg.toml configuration file in the current directory and its parent directories.

Signature:

R
discover() -> list or NULL

Returns:

  • Named list with configuration if found, NULL otherwise

Example:

R
config <- discover()
if (!is.null(config)) {
  result <- extract_file_sync("document.pdf", config = config)
}
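
Because discover() returns NULL when no file is found, callers may want a fallback rather than skipping extraction. The sketch below shows one possible pattern; resolve_config() is a hypothetical helper, and its two function arguments default to the documented discover() and extraction_config().

```r
# Hypothetical helper: prefer a discovered kreuzberg.toml, else fall back.
# The function arguments exist so the lookup can be swapped out in tests.
resolve_config <- function(discover_fn = kreuzberg::discover,
                           fallback_fn = kreuzberg::extraction_config) {
  config <- discover_fn()
  if (is.null(config)) fallback_fn() else config
}

# With kreuzberg installed:
# config <- resolve_config()
# result <- kreuzberg::extract_file_sync("document.pdf", config = config)
```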

from_file()

Load configuration from a TOML, YAML, or JSON file.

Signature:

R
from_file(path) -> list

Parameters:

Parameter | Type | Description
path | character | Path to configuration file (TOML, YAML, or JSON)

Returns:

  • Named list with configuration

Example:

R
config <- from_file("kreuzberg.toml")
result <- extract_file_sync("document.pdf", config = config)

Results & Types

kreuzberg_result

Result object returned by all extraction functions. Inherits from list with named fields.

Fields:

Field | Type | Description
content | character | Extracted text content
mime_type | character | MIME type of the processed document
pages | list, NULL | Per-page extracted content (if page extraction is enabled)
tables | list, NULL | List of extracted tables
chunks | list, NULL | Text chunks (if chunking is enabled)
images | list, NULL | Extracted images
elements | list, NULL | Document elements
keywords | character, NULL | Extracted keywords
quality_score | numeric, NULL | Quality score (0.0-1.0)
detected_language | character, NULL | Detected language code (ISO 639-1)
metadata | list | Document metadata

Example:

R
result <- extract_file_sync("document.pdf")

cat("Content:", result$content, "\n")
cat("MIME type:", result$mime_type, "\n")
cat("Pages:", page_count(result), "\n")
cat("Tables:", length(result$tables), "\n")
cat("Language:", detected_language(result), "\n")
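
Several fields are NULL unless the matching feature is enabled. Since length(NULL) is 0 in R, optional fields can be counted without explicit NULL checks. The sketch below uses only the field names documented in this section; result_overview() is a hypothetical helper, and the mock object (with plain strings as placeholder chunks) stands in for a real result.

```r
# Hypothetical helper: count optional parts of a result without failing on NULL.
# Only field names documented for kreuzberg_result are used.
result_overview <- function(result) {
  c(
    characters = nchar(result$content),
    tables = length(result$tables),
    chunks = length(result$chunks),
    images = length(result$images)
  )
}

# Mock object standing in for a real extraction result
mock_result <- list(
  content = "Hello world",
  mime_type = "text/plain",
  tables = NULL,
  chunks = list("Hello", "world"),  # placeholder chunks; real chunk structure may differ
  images = NULL
)
overview <- result_overview(mock_result)
print(overview)
```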

S3 Methods for kreuzberg_result

print()

Print a brief summary of the result.

R
print(x)

Example:

R
result <- extract_file_sync("document.pdf")
print(result)  # Displays summary

summary()

Summarize the extraction result.

R
summary(object)

Example:

R
result <- extract_file_sync("document.pdf")
summary(result)

format()

Format the result as a string.

R
format(x)

content()

Extract the text content.

R
content(x) -> character

Example:

R
result <- extract_file_sync("document.pdf")
text <- content(result)

mime_type()

Get the MIME type of the document.

R
mime_type(x) -> character

Example:

R
result <- extract_file_sync("document.pdf")
type <- mime_type(result)

page_count()

Get the number of pages in the document.

R
page_count(x) -> integer

Example:

R
result <- extract_file_sync("document.pdf")
pages <- page_count(result)

chunk_count()

Get the number of text chunks.

R
chunk_count(x) -> integer

Example:

R
result <- extract_file_sync("document.pdf", config = extraction_config(chunking = chunking_config()))
chunks <- chunk_count(result)

detected_language()

Get the detected language code.

R
detected_language(x) -> character or NULL

Example:

R
result <- extract_file_sync("document.pdf")
lang <- detected_language(result)
if (!is.null(lang)) {
  cat("Language:", lang, "\n")
}

metadata_field()

Extract a specific metadata field by name.

R
metadata_field(x, name) -> value or NULL

Parameters:

Parameter | Type | Description
x | kreuzberg_result | Result object
name | character | Field name

Returns:

  • Field value or NULL if not present

Example:

R
result <- extract_file_sync("document.pdf")
title <- metadata_field(result, "title")
author <- metadata_field(result, "author")

Metadata

Document metadata with format-specific fields.

Common Fields:

Field | Type | Description
language | character | Document language (ISO 639-1 code)
date | character | Document date (ISO 8601 format)
subject | character | Document subject
format_type | character | Format discriminator ("pdf", "excel", "email", etc.)

PDF-Specific Fields (when format_type == "pdf"):

Field | Type | Description
title | character | PDF title
author | character | PDF author
page_count | integer | Number of pages
creation_date | character | Creation date (ISO 8601)
modification_date | character | Modification date (ISO 8601)
creator | character | Creator application
producer | character | Producer application
keywords | character | PDF keywords

Example:

R
result <- extract_file_sync("document.pdf")
metadata <- result$metadata

# identical() returns FALSE instead of raising an error when format_type is NULL
if (identical(metadata$format_type, "pdf")) {
  cat("Title:", metadata$title, "\n")
  cat("Author:", metadata$author, "\n")
  cat("Pages:", metadata$page_count, "\n")
}

Error Handling

Errors are raised as typed conditions with the following class hierarchy:

  • kreuzberg_error (base)
      • ValidationError
      • ParsingError
      • FileNotFoundError
      • UnsupportedFormatError
      • ExtractionError

Example - Basic error handling:

R
library(kreuzberg)

tryCatch(
  result <- extract_file_sync("document.pdf"),
  FileNotFoundError = function(e) {
    cat("File not found:", conditionMessage(e), "\n")
  },
  ValidationError = function(e) {
    cat("Validation error:", conditionMessage(e), "\n")
  },
  kreuzberg_error = function(e) {
    cat("Extraction error:", conditionMessage(e), "\n")
  }
)

Example - Specific error handling:

R
tryCatch(
  {
    result <- extract_file_sync("scanned.pdf", config = extraction_config(
      ocr = ocr_config(backend = "unsupported-backend")
    ))
  },
  ValidationError = function(e) {
    cat("Invalid configuration:", conditionMessage(e), "\n")
  },
  error = function(e) {
    cat("Unexpected error:", conditionMessage(e), "\n")
  }
)
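
Handler order matters in both examples: tryCatch() uses the first handler whose class matches, so specific classes must come before kreuzberg_error and error. The sketch below illustrates this with simulated conditions in base R. The class vector built by simulate_error() is an assumption modelled on the hierarchy in this section, not taken from the package source.

```r
# Simulated condition for illustration only; the class vector is an
# assumption modelled on the documented hierarchy.
simulate_error <- function(subclass, message) {
  structure(
    list(message = message, call = NULL),
    class = c(subclass, "kreuzberg_error", "error", "condition")
  )
}

# Handlers are listed from most specific to most general.
classify <- function(expr) {
  tryCatch(
    expr,
    FileNotFoundError = function(e) "missing file",
    kreuzberg_error = function(e) "other kreuzberg error",
    error = function(e) "unrelated error"
  )
}

classify(stop(simulate_error("FileNotFoundError", "no such file")))  # "missing file"
classify(stop(simulate_error("ParsingError", "bad document")))       # "other kreuzberg error"
classify(stop("something else"))                                     # "unrelated error"
```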

Cache Management

clear_cache()

Clear the extraction cache.

Signature:

R
clear_cache() -> invisible(NULL)

Example:

R
library(kreuzberg)

clear_cache()

cache_stats()

Get cache statistics.

Signature:

R
cache_stats() -> list

Returns:

  • Named list with:
      • total_entries (integer): Number of cached entries
      • total_size_bytes (integer): Total cache size in bytes

Example:

R
library(kreuzberg)

stats <- cache_stats()
cat("Cache entries:", stats$total_entries, "\n")
cat("Cache size:", stats$total_size_bytes, "bytes\n")
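
The two cache functions can be combined to keep the cache below a size limit. The following is a sketch of one possible policy, not a built-in feature: format_bytes() and trim_cache() are hypothetical helpers, and trim_cache() takes the documented cache_stats() and clear_cache() as replaceable defaults.

```r
# Hypothetical helper: render a byte count in a readable unit.
format_bytes <- function(bytes) {
  bytes <- as.numeric(bytes)
  units <- c("B", "KiB", "MiB", "GiB")
  idx <- 1
  while (bytes >= 1024 && idx < length(units)) {
    bytes <- bytes / 1024
    idx <- idx + 1
  }
  sprintf("%.1f %s", bytes, units[idx])
}

# Hypothetical helper: clear the cache once it exceeds `limit_bytes`.
# Returns TRUE if the cache was cleared, FALSE otherwise.
trim_cache <- function(limit_bytes,
                       stats_fn = kreuzberg::cache_stats,
                       clear_fn = kreuzberg::clear_cache) {
  size <- stats_fn()$total_size_bytes
  if (size > limit_bytes) {
    clear_fn()
    return(TRUE)
  }
  FALSE
}

format_bytes(1536)  # "1.5 KiB"
# With kreuzberg installed: trim_cache(limit_bytes = 500 * 1024^2)
```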

Validation

validate_ocr_backend_name()

Validate OCR backend name.

Signature:

R
validate_ocr_backend_name(backend) -> logical

Parameters:

Parameter | Type | Description
backend | character | Backend name to validate

Returns:

  • Logical: TRUE if valid, FALSE otherwise

Example:

R
library(kreuzberg)

is_valid <- validate_ocr_backend_name("tesseract")
if (!is_valid) {
  cat("Invalid OCR backend\n")
}

validate_language_code()

Validate language code.

Signature:

R
validate_language_code(code) -> logical

Parameters:

Parameter | Type | Description
code | character | Language code (ISO 639-3 or 639-1)

Returns:

  • Logical: TRUE if valid, FALSE otherwise

Example:

R
library(kreuzberg)

is_valid <- validate_language_code("eng")

validate_output_format()

Validate output format.

Signature:

R
validate_output_format(format) -> logical

Parameters:

Parameter | Type | Description
format | character | Output format name

Returns:

  • Logical: TRUE if valid, FALSE otherwise

Metadata Detection

detect_mime_type()

Detect MIME type from raw bytes.

Signature:

R
detect_mime_type(data) -> character

Parameters:

Parameter | Type | Description
data | raw | Binary data

Returns:

  • Character: Detected MIME type

Example:

R
library(kreuzberg)

data <- readBin("document", what = "raw", n = file.size("document"))
mime_type <- detect_mime_type(data)
cat("Detected MIME type:", mime_type, "\n")

detect_mime_type_from_path()

Detect MIME type from file path.

Signature:

R
detect_mime_type_from_path(path) -> character

Parameters:

Parameter | Type | Description
path | character | File path

Returns:

  • Character: Detected MIME type

Example:

R
library(kreuzberg)

mime_type <- detect_mime_type_from_path("document.pdf")
cat("MIME type:", mime_type, "\n")

get_extensions_for_mime()

Get file extensions for a MIME type.

Signature:

R
get_extensions_for_mime(mime_type) -> character

Parameters:

Parameter | Type | Description
mime_type | character | MIME type

Returns:

  • Character vector: File extensions for the MIME type

Example:

R
library(kreuzberg)

extensions <- get_extensions_for_mime("application/pdf")
cat("PDF extensions:", paste(extensions, collapse = ", "), "\n")
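
One possible use of this lookup is to flag files whose extension does not match an expected MIME type. In the sketch below, has_expected_extension() is a hypothetical helper. Whether the returned extensions carry a leading dot is not stated in this section, so the helper strips one if present.

```r
# Hypothetical helper: does the file extension match the given MIME type?
# Leading dots are stripped because the exact format of the returned
# extensions ("pdf" vs ".pdf") is an assumption here.
has_expected_extension <- function(path, mime_type,
                                   extensions_for = kreuzberg::get_extensions_for_mime) {
  expected <- tolower(sub("^\\.", "", extensions_for(mime_type)))
  tolower(tools::file_ext(path)) %in% expected
}

# With kreuzberg installed:
# has_expected_extension("report.pdf", "application/pdf")
```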

validate_mime_type()

Validate MIME type.

Signature:

R
validate_mime_type(mime_type) -> logical

Parameters:

Parameter | Type | Description
mime_type | character | MIME type to validate

Returns:

  • Logical: TRUE if valid, FALSE otherwise

Example:

R
library(kreuzberg)

is_valid <- validate_mime_type("application/pdf")

Plugins

OCR Backends

register_ocr_backend()

Register a custom OCR backend.

Signature:

R
register_ocr_backend(name, callback) -> invisible(NULL)

Parameters:

Parameter | Type | Description
name | character | Backend name
callback | function | Backend implementation function

unregister_ocr_backend()

Unregister an OCR backend.

Signature:

R
unregister_ocr_backend(name) -> invisible(NULL)

list_ocr_backends()

List all registered OCR backends.

Signature:

R
list_ocr_backends() -> character

Returns:

  • Character vector: Names of registered backends

Example:

R
library(kreuzberg)

backends <- list_ocr_backends()
cat("Available OCR backends:", paste(backends, collapse = ", "), "\n")

clear_ocr_backends()

Clear all registered OCR backends.

Signature:

R
clear_ocr_backends() -> invisible(NULL)
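
Registration and unregistration can be paired so that a custom backend exists only for as long as it is needed. The sketch below shows that lifecycle. with_ocr_backend() is a hypothetical helper, and the callback is treated as opaque because its expected arguments and return value are not described in this section.

```r
# Hypothetical helper: register a backend for the duration of `code`,
# then unregister it even if `code` fails. The callback is passed through
# untouched; its contract is not documented in this section.
with_ocr_backend <- function(name, callback, code,
                             register = kreuzberg::register_ocr_backend,
                             unregister = kreuzberg::unregister_ocr_backend) {
  register(name, callback)
  on.exit(unregister(name), add = TRUE)
  force(code)  # `code` is evaluated lazily, so it runs after registration
}

# With kreuzberg installed (my_callback is a placeholder for a real implementation):
# with_ocr_backend("my-backend", my_callback, {
#   print(kreuzberg::list_ocr_backends())
# })
```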

Post-Processors

register_post_processor()

Register a custom post-processor.

Signature:

R
register_post_processor(name, callback) -> invisible(NULL)

Parameters:

Parameter | Type | Description
name | character | Processor name
callback | function | Processor implementation function

unregister_post_processor()

Unregister a post-processor.

Signature:

R
unregister_post_processor(name) -> invisible(NULL)

list_post_processors()

List all registered post-processors.

Signature:

R
list_post_processors() -> character

Returns:

  • Character vector: Names of registered post-processors

clear_post_processors()

Clear all registered post-processors.

Signature:

R
clear_post_processors() -> invisible(NULL)

Validators

register_validator()

Register a custom validator.

Signature:

R
register_validator(name, callback) -> invisible(NULL)

Parameters:

Parameter | Type | Description
name | character | Validator name
callback | function | Validator implementation function

unregister_validator()

Unregister a validator.

Signature:

R
unregister_validator(name) -> invisible(NULL)

list_validators()

List all registered validators.

Signature:

R
list_validators() -> character

Returns:

  • Character vector: Names of registered validators

clear_validators()

Clear all registered validators.

Signature:

R
clear_validators() -> invisible(NULL)

Document Extractors

list_document_extractors()

List all available document extractors.

Signature:

R
list_document_extractors() -> character

Returns:

  • Character vector: Names of available document extractors

unregister_document_extractor()

Unregister a document extractor.

Signature:

R
unregister_document_extractor(name) -> invisible(NULL)

Parameters:

Parameter | Type | Description
name | character | Extractor name

clear_document_extractors()

Clear all document extractors.

Signature:

R
clear_document_extractors() -> invisible(NULL)

Thread Safety

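All Kreuzberg functions are thread-safe and can be called concurrently from multiple workers, for example via R's parallel package or the future framework. Note that the R interpreter itself is single-threaded, so both examples below run their workers as separate processes (forked by mclapply(), launched by plan(multisession)) rather than as threads within one R session.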

Example - Using parallel package:

R
library(kreuzberg)
library(parallel)

files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")

# Use parallel processing
results <- mclapply(files, function(file) {
  extract_file_sync(file)
}, mc.cores = 3)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}

Example - Using future package:

R
library(kreuzberg)
library(future)

plan(multisession)

files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")

# Process files asynchronously
futures <- lapply(files, function(file) {
  future({
    extract_file_sync(file)
  })
})

# Collect results
results <- lapply(futures, value)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}

For better performance, prefer the batch API, which extracts the files in parallel within a single call:

R
library(kreuzberg)

files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")

# Better approach: use built-in batch processing
results <- batch_extract_files_sync(files)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}
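
For very long file lists, it may be preferable to feed the batch API in fixed-size groups so that only one group of results is produced at a time. Whether this is needed depends on the workload; the sketch below is one way to do the grouping in base R, with split_into_batches() as a hypothetical helper and placeholder file names.

```r
# Hypothetical helper: split a vector of paths into batches of at most `size`.
split_into_batches <- function(paths, size) {
  stopifnot(size >= 1)
  unname(split(paths, ceiling(seq_along(paths) / size)))
}

files <- sprintf("doc%d.pdf", 1:5)  # placeholder file names
batches <- split_into_batches(files, 2)
# batches: list(c("doc1.pdf", "doc2.pdf"), c("doc3.pdf", "doc4.pdf"), "doc5.pdf")

# With kreuzberg installed, each batch could then be processed in turn:
# results <- unlist(lapply(batches, kreuzberg::batch_extract_files_sync), recursive = FALSE)
```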