R API Reference

Complete reference for the Kreuzberg R API.

Installation

Install from the R-universe repository:

install.packages("kreuzberg", repos = "https://kreuzberg-dev.r-universe.dev")

Or install from source using remotes:

remotes::install_github("kreuzberg-dev/kreuzberg-lts", subdir = "packages/r")

System Requirements:

R >= 4.2
Rust toolchain (cargo, rustc >= 1.91) for building from source
Supported platforms: Linux (x64, arm64), macOS (Apple Silicon)

Core Functions

batch_extract_bytes()

Extract content from multiple raw byte arrays (asynchronous via Tokio runtime).

Signature:

batch_extract_bytes(data_list, mime_types, config = NULL) -> list of kreuzberg_result

Parameters:

Same as batch_extract_bytes_sync().

Returns:

List of kreuzberg_result objects

batch_extract_bytes_sync()

Extract content from multiple raw byte arrays (synchronous).

Signature:

batch_extract_bytes_sync(data_list, mime_types, config = NULL) -> list of kreuzberg_result

Parameters:

Parameter	Type	Description
`data_list`	list of raw	List of binary data (raw vectors)
`mime_types`	character	MIME types corresponding to each byte array
`config`	list, NULL	Extraction configuration

Returns:

List of kreuzberg_result objects

Example:

library(kreuzberg)

pdf_data <- readBin("invoice.pdf", what = "raw", n = file.size("invoice.pdf"))
docx_data <- readBin("report.docx", what = "raw", n = file.size("report.docx"))

data_list <- list(pdf_data, docx_data)
mime_types <- c("application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")

results <- batch_extract_bytes_sync(data_list, mime_types)

for (i in seq_along(results)) {
  cat(sprintf("Document %d: %d characters\n", i, nchar(results[[i]]$content)))
}
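Because data_list and mime_types are matched by position, it can help to build both from a single set of inputs. The sketch below uses only base R; read_raw_files() is a hypothetical helper, not part of kreuzberg, and the file it reads is a temporary stand-in.

```r
# Hypothetical helper (not part of kreuzberg): read each file into a raw
# vector and keep the MIME types aligned with the data by position.
read_raw_files <- function(paths, mime_types) {
  stopifnot(length(paths) == length(mime_types))
  absent <- paths[!file.exists(paths)]
  if (length(absent) > 0) {
    stop("Missing input files: ", paste(absent, collapse = ", "))
  }
  data_list <- lapply(paths, function(p) {
    readBin(p, what = "raw", n = file.size(p))
  })
  list(data_list = data_list, mime_types = mime_types)
}

# Stand-in input so the sketch runs without real documents.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("hello"), tmp)

inputs <- read_raw_files(tmp, "text/plain")
length(inputs$data_list[[1]])  # number of bytes read
```

The two elements can then be passed on as batch_extract_bytes_sync(inputs$data_list, inputs$mime_types).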

batch_extract_files()

Extract content from multiple files in parallel (asynchronous via Tokio runtime).

Signature:

batch_extract_files(paths, config = NULL) -> list of kreuzberg_result

Parameters:

Same as batch_extract_files_sync().

Returns:

List of kreuzberg_result objects

batch_extract_files_sync()

Extract content from multiple files in parallel (synchronous).

Signature:

batch_extract_files_sync(paths, config = NULL) -> list of kreuzberg_result

Parameters:

Parameter	Type	Description
`paths`	character	Vector of file paths to extract
`config`	list, NULL	Extraction configuration applied to all files

Returns:

List of kreuzberg_result objects

Example:

library(kreuzberg)

paths <- c("doc1.pdf", "doc2.docx", "doc3.xlsx")
results <- batch_extract_files_sync(paths)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", paths[i], nchar(results[[i]]$content)))
}
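For larger batches, a tabular overview is often easier to scan than a loop of cat() calls. The following is a minimal sketch in base R; summarise_results() is a hypothetical helper, and the results are stand-in lists that only mimic the documented content and mime_type fields.

```r
# Hypothetical helper (not part of kreuzberg): one row per extracted document.
summarise_results <- function(paths, results) {
  stopifnot(length(paths) == length(results))
  data.frame(
    path = paths,
    mime_type = vapply(results, function(r) r$mime_type, character(1)),
    n_chars = vapply(results, function(r) nchar(r$content), integer(1)),
    stringsAsFactors = FALSE
  )
}

# Stand-in results shaped like the documented kreuzberg_result fields.
fake_results <- list(
  list(content = "Invoice 42", mime_type = "application/pdf"),
  list(content = "Quarterly report", mime_type = "text/plain")
)

summary_df <- summarise_results(c("doc1.pdf", "doc2.txt"), fake_results)
print(summary_df)
```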

extract_bytes()

Extract content from raw bytes (asynchronous via Tokio runtime).

Signature:

extract_bytes(data, mime_type, config = NULL) -> kreuzberg_result

Parameters:

Same as extract_bytes_sync().

Returns:

kreuzberg_result: Extraction result object

extract_bytes_sync()

Extract content from raw bytes (synchronous).

Signature:

extract_bytes_sync(data, mime_type, config = NULL) -> kreuzberg_result

Parameters:

Parameter	Type	Description
`data`	raw	Binary data to extract (raw vector)
`mime_type`	character	MIME type of the data (required for format detection)
`config`	list, NULL	Extraction configuration

Returns:

kreuzberg_result: Extraction result object

Example:

library(kreuzberg)

data <- readBin("document.pdf", what = "raw", n = file.size("document.pdf"))
result <- extract_bytes_sync(data, "application/pdf")
cat(result$content)

extract_file()

Extract content from a file (asynchronous via Tokio runtime).

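Note: R has no native async/await. This function blocks on an internal Tokio runtime until extraction completes, so from R it behaves like extract_file_sync(). For background processing, run the extraction in a separate R process, for example through the parallel or future packages.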

Signature:

extract_file(path, mime_type = NULL, config = NULL) -> kreuzberg_result

Parameters:

Same as extract_file_sync().

Returns:

kreuzberg_result: Extraction result object

Example:

library(kreuzberg)

# Equivalent to extract_file_sync in R
result <- extract_file("document.pdf")
cat(result$content)
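Since the call blocks, background processing has to come from a separate process. One possible approach on Linux and macOS is a forked child from the base parallel package. In this sketch slow_task() is a stand-in for the blocking extraction call, so nothing here depends on kreuzberg, and how the library behaves inside a forked child is not verified here.

```r
library(parallel)

# Stand-in for a blocking call such as extract_file("document.pdf").
slow_task <- function(path) {
  Sys.sleep(0.2)
  paste("extracted:", path)
}

# Start the task in a forked child process (Unix-alikes only).
job <- mcparallel(slow_task("document.pdf"))

# The parent session stays free to do other work here.
message("Extraction running in the background...")

# Block until the child finishes and fetch its return value.
background_result <- mccollect(job)[[1]]
print(background_result)
```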

extract_file_sync()

Extract content from a file (synchronous).

Signature:

extract_file_sync(path, mime_type = NULL, config = NULL) -> kreuzberg_result

Parameters:

Parameter	Type	Description
`path`	character	Path to the file to extract
`mime_type`	character, NULL	Optional MIME type hint. If NULL, MIME type is auto-detected
`config`	list, NULL	Extraction configuration. Uses defaults if NULL

Returns:

kreuzberg_result: Extraction result object (S3 class inheriting from list)

Raises:

ValidationError: Input validation failed
ParsingError: Document parsing failed
FileNotFoundError: File does not exist
UnsupportedFormatError: Document format not supported
ExtractionError: General extraction failure

Example - Basic usage:

library(kreuzberg)

result <- extract_file_sync("document.pdf")
cat("Content:\n", result$content, "\n")
cat("Pages:", page_count(result), "\n")

Example - With configuration:

library(kreuzberg)

config <- extraction_config(
  ocr = ocr_config(backend = "tesseract", language = "eng")
)
result <- extract_file_sync("scanned.pdf", config = config)

Example - With explicit MIME type:

library(kreuzberg)

result <- extract_file_sync("document.pdf", mime_type = "application/pdf")

Configuration

chunking_config()

Create text chunking configuration.

Signature:

chunking_config(max_characters = 1000L, overlap = 200L, ...) -> list

Parameters:

Parameter	Type	Description
`max_characters`	integer	Maximum characters per chunk. Default: 1000
`overlap`	integer	Overlap between chunks. Default: 200
`...`		Additional chunking options

Returns:

Named list with chunking configuration

Example:

config <- extraction_config(
  chunking = chunking_config(max_characters = 2000L, overlap = 500L)
)
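To see how max_characters and overlap interact, consider a plain sliding window: each chunk starts max_characters - overlap characters after the previous one. The sketch below illustrates that arithmetic in base R only. chunk_text() is a hypothetical function, and Kreuzberg's own chunker may choose different boundaries.

```r
# Illustrative sliding-window chunker (an assumption, not kreuzberg's
# implementation): consecutive chunks share `overlap` characters.
chunk_text <- function(text, max_characters = 1000L, overlap = 200L) {
  stopifnot(overlap >= 0L, overlap < max_characters)
  n <- nchar(text)
  if (n == 0L) return(character(0))
  step <- max_characters - overlap
  # The last window starts no later than n - overlap, so it always
  # contributes at least one character not seen in the previous chunk.
  starts <- seq(1L, max(1L, n - overlap), by = step)
  substring(text, starts, pmin(starts + max_characters - 1L, n))
}

chunks <- chunk_text("abcdefghij", max_characters = 4L, overlap = 2L)
print(chunks)
# "abcd" "cdef" "efgh" "ghij"
```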

discover()

Search for kreuzberg.toml configuration file in current and parent directories.

Signature:

discover() -> list or NULL

Returns:

Named list with configuration if found, NULL otherwise

Example:

config <- discover()
if (!is.null(config)) {
  result <- extract_file_sync("document.pdf", config = config)
}
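The search through the current and parent directories can be pictured as a walk towards the filesystem root. The sketch below shows that idea in base R. find_config_upwards() is a hypothetical function that only locates a file; it is not how discover() is implemented and it does not parse the configuration.

```r
# Hypothetical illustration of an upward search for a config file.
find_config_upwards <- function(start = getwd(), name = "kreuzberg.toml") {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    candidate <- file.path(dir, name)
    if (file.exists(candidate)) return(candidate)
    parent <- dirname(dir)
    # dirname() of the root is the root itself: nothing left to search.
    if (identical(parent, dir)) return(NULL)
    dir <- parent
  }
}

# Temporary directory tree so the sketch runs anywhere.
root <- tempfile("kreuzberg-demo-")
nested <- file.path(root, "a", "b")
dir.create(nested, recursive = TRUE)
file.create(file.path(root, "kreuzberg.toml"))

found <- find_config_upwards(nested)
print(found)
```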

extraction_config()

Create an extraction configuration object.

Signature:

extraction_config(
  chunking = NULL,
  enable_quality_processing = NULL,
  force_ocr = FALSE,
  html_options = NULL,
  images = NULL,
  include_document_structure = NULL,
  keywords = NULL,
  language_detection = NULL,
  layout = NULL,
  max_concurrent_extractions = NULL,
  ocr = NULL,
  output_format = NULL,
  pages = NULL,
  pdf_options = NULL,
  postprocessor = NULL,
  result_format = NULL,
  security_limits = NULL,
  token_reduction = NULL,
  use_cache = NULL,
  ...
) -> list

Parameters:

Parameter	Type	Description
`chunking`	list, NULL	Text chunking options (see `chunking_config()`)
`enable_quality_processing`	logical, NULL	Enable quality processing enhancements
`force_ocr`	logical	Force OCR on all documents regardless of document type
`html_options`	list, NULL	HTML-specific options
`images`	list, NULL	Image extraction options
`include_document_structure`	logical, NULL	Include hierarchical document structure in results
`keywords`	list, NULL	Keyword extraction options
`language_detection`	list, NULL	Language detection options
`layout`	list, NULL	Layout detection options
`max_concurrent_extractions`	integer, NULL	Maximum concurrent extractions for batch operations
`ocr`	list, NULL	OCR configuration (see `ocr_config()`)
`output_format`	character, NULL	Output format for extracted content ("plain", "markdown", "djot", "html")
`pages`	list, NULL	Page extraction options
`pdf_options`	list, NULL	PDF-specific options
`postprocessor`	character, NULL	Post-processor name
`result_format`	character, NULL	Result format ("unified", "element_based")
`security_limits`	list, NULL	Security limit options
`token_reduction`	list, NULL	Token reduction options
`use_cache`	logical, NULL	Enable extraction result caching
`...`		Additional configuration parameters

Returns:

Named list with configuration options

Example:

config <- extraction_config(
  ocr = ocr_config(backend = "tesseract", language = "eng"),
  chunking = chunking_config(max_characters = 1000L, overlap = 200L),
  use_cache = TRUE
)

result <- extract_file_sync("document.pdf", config = config)

from_file()

Load configuration from a TOML, YAML, or JSON file.

Signature:

from_file(path) -> list

Parameters:

Parameter	Type	Description
`path`	character	Path to configuration file (TOML, YAML, or JSON)

Returns:

Named list with configuration

Example:

config <- from_file("kreuzberg.toml")
result <- extract_file_sync("document.pdf", config = config)

layout_detection_config()

Create a layout detection configuration.

Signature:

layout_detection_config(confidence_threshold = NULL, apply_heuristics = TRUE, table_model = NULL, ...) -> list

Parameters:

Parameter	Type	Description
`apply_heuristics`	logical	Whether to apply heuristic post-processing to refine layout regions. Default: TRUE
`confidence_threshold`	numeric, NULL	Minimum confidence threshold for detected regions (0.0-1.0). Default: NULL
`table_model`	character, NULL	Table structure recognition model: "tatr", "slanet_wired", "slanet_wireless", "slanet_plus", or "slanet_auto". Default: NULL, which selects "tatr"
`...`		Additional layout detection options

Returns:

Named list with layout detection configuration

Example:

config <- extraction_config(
  layout = layout_detection_config(apply_heuristics = TRUE)
)

ocr_config()

Create OCR configuration.

Signature:

ocr_config(backend = "tesseract", language = "eng", dpi = NULL, ...) -> list

Parameters:

Parameter	Type	Description
`backend`	character	OCR backend ("tesseract" or "paddle-ocr"). Default: "tesseract"
`dpi`	integer, NULL	DPI for OCR processing
`language`	character	Language code (ISO 639-3). Default: "eng"
`model_tier`	character, NULL	PaddleOCR model tier (since v4.5.0): "mobile" (lightweight, ~21 MB total, fast) or "server" (high accuracy, ~172 MB, best with GPU). Default: "mobile"
`padding`	integer, NULL	Padding in pixels (0-100) added around the image before PaddleOCR detection (since v4.5.0). Default: 10
`...`		Additional OCR options

Returns:

Named list with OCR configuration

Example:

config <- extraction_config(
  ocr = ocr_config(backend = "paddle-ocr", language = "eng")
)

Results & Types

kreuzberg_result

Result object returned by all extraction functions. Inherits from list with named fields.

Fields:

Field	Type	Description
`annotations`	list, NULL	PDF annotations (links, highlights, notes)
`chunks`	list, NULL	Text chunks (if chunking enabled)
`content`	character	Extracted text content
`detected_language`	character, NULL	Detected language code (ISO 639-1)
`djot_content`	list, NULL	Structured Djot content
`document`	list, NULL	Hierarchical document structure
`elements`	list, NULL	Document semantic elements
`extracted_keywords`	list, NULL	Extracted keywords with scores
`images`	list, NULL	Extracted images
`metadata`	list	Document metadata
`mime_type`	character	MIME type of the processed document
`ocr_elements`	list, NULL	OCR elements with positioning and confidence
`pages`	list, NULL	Per-page extracted content (if page extraction enabled)
`processing_warnings`	list, NULL	Non-fatal processing warnings
`quality_score`	numeric, NULL	Quality score (0.0-1.0)
`tables`	list, NULL	List of extracted tables

Example:

result <- extract_file_sync("document.pdf")

cat("Content:", result$content, "\n")
cat("MIME type:", result$mime_type, "\n")
cat("Pages:", page_count(result), "\n")
cat("Tables:", length(result$tables), "\n")
cat("Language:", detected_language(result), "\n")

S3 Methods for kreuzberg_result

chunk_count()

Get the number of text chunks.

chunk_count(x) -> integer

Example:

result <- extract_file_sync("document.pdf", config = extraction_config(chunking = chunking_config()))
chunks <- chunk_count(result)

content()

Extract the text content.

content(x) -> character

Example:

result <- extract_file_sync("document.pdf")
text <- content(result)

detected_language()

Get the detected language code.

detected_language(x) -> character or NULL

Example:

result <- extract_file_sync("document.pdf")
lang <- detected_language(result)
if (!is.null(lang)) {
  cat("Language:", lang, "\n")
}

format()

Format the result as a string.

format(x)

metadata_field()

Extract a specific metadata field by name.

metadata_field(x, name) -> value or NULL

Parameters:

Parameter	Type	Description
`x`	kreuzberg_result	Result object
`name`	character	Field name

Returns:

Field value or NULL if not present

Example:

result <- extract_file_sync("document.pdf")
title <- metadata_field(result, "title")
author <- metadata_field(result, "author")

mime_type()

Get the MIME type of the document.

mime_type(x) -> character

Example:

result <- extract_file_sync("document.pdf")
type <- mime_type(result)

page_count()

Get the number of pages in the document.

page_count(x) -> integer

Example:

result <- extract_file_sync("document.pdf")
pages <- page_count(result)

print()

Print a brief summary of the result.

print(x)

Example:

result <- extract_file_sync("document.pdf")
print(result)  # Displays summary

summary()

Summarize the extraction result.

summary(object)

Example:

result <- extract_file_sync("document.pdf")
summary(result)

Metadata List

Document metadata with format-specific fields.

Common Fields:

Field	Type	Description
`authors`	character	Document authors
`created_at`	character	Creation date (ISO 8601)
`created_by`	character	Creator/application name
`custom`	list	Additional custom metadata from postprocessors
`date`	character	Document date (ISO 8601 format)
`format_type`	character	Format discriminator ("pdf", "excel", "email", etc.)
`keywords`	character	Document keywords
`language`	character	Document language (ISO 639-1 code)
`modified_at`	character	Modification date (ISO 8601)
`page_count`	integer	Number of pages
`producer`	character	Producer/generator
`subject`	character	Document subject
`title`	character	Document title

Example:

result <- extract_file_sync("document.pdf")
metadata <- result$metadata

# identical() yields FALSE instead of an error when format_type is missing
if (identical(metadata$format_type, "pdf")) {
  cat("Title:", metadata$title, "\n")
  cat("Author:", metadata$authors, "\n")
  cat("Pages:", metadata$page_count, "\n")
}
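Metadata fields are format-specific, so any of them may be absent, and `$` returns NULL for a missing list element. A small fallback helper keeps downstream code from tripping over that. The sketch uses base R only; or_default() is a hypothetical helper and the metadata list is a stand-in for result$metadata.

```r
# Hypothetical helper: use `default` when `value` is NULL or empty.
or_default <- function(value, default) {
  if (is.null(value) || length(value) == 0) default else value
}

# Stand-in metadata list; real values come from result$metadata.
metadata <- list(title = "Annual Report", authors = c("A. Author", "B. Writer"))

title <- or_default(metadata$title, "(untitled)")
authors <- paste(or_default(metadata$authors, "(unknown)"), collapse = ", ")
pages <- or_default(metadata$page_count, NA_integer_)

cat("Title:", title, "\n")
cat("Authors:", authors, "\n")
cat("Pages:", pages, "\n")
```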

PDF Rendering

render_pdf_page()

Render a single page of a PDF as a PNG image.

Signature:

render_pdf_page(path, page_index, dpi = 150L)

Parameters:

Parameter	Type	Description
`path`	character	Path to the PDF file
`page_index`	integer	Zero-based page index to render
`dpi`	integer	Rendering resolution in DPI. Default: 150

Returns:

raw vector: PNG-encoded raw vector for the requested page

Example:

png <- render_pdf_page("document.pdf", 0L)
writeBin(png, "first_page.png")
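Because page_index is zero-based while R indexes from 1, rendering several pages invites off-by-one mistakes. The sketch below builds the mapping in base R. page_render_plan() and the output file names are hypothetical, and the render call itself is shown only as a comment.

```r
# Hypothetical helper: map 1-based page numbers to the zero-based
# page_index values and pick an output file name for each page.
page_render_plan <- function(n_pages, prefix = "page") {
  stopifnot(n_pages >= 1)
  page_number <- seq_len(n_pages)
  data.frame(
    page_number = page_number,
    page_index = page_number - 1L,
    file = sprintf("%s_%03d.png", prefix, page_number),
    stringsAsFactors = FALSE
  )
}

render_plan <- page_render_plan(3L)
print(render_plan)

# With kreuzberg loaded, each row could drive one render call, e.g.:
# for (i in seq_len(nrow(render_plan))) {
#   png <- render_pdf_page("document.pdf", render_plan$page_index[i])
#   writeBin(png, render_plan$file[i])
# }
```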

Error Handling

Errors are raised as typed conditions with class hierarchy:

kreuzberg_error (base)
- ValidationError
- ParsingError
- FileNotFoundError
- UnsupportedFormatError
- ExtractionError

Example - Basic error handling:

library(kreuzberg)

tryCatch(
  result <- extract_file_sync("document.pdf"),
  FileNotFoundError = function(e) {
    cat("File not found:", conditionMessage(e), "\n")
  },
  ValidationError = function(e) {
    cat("Validation error:", conditionMessage(e), "\n")
  },
  kreuzberg_error = function(e) {
    cat("Extraction error:", conditionMessage(e), "\n")
  }
)

Example - Specific error handling:

tryCatch(
  {
    result <- extract_file_sync("scanned.pdf", config = extraction_config(
      ocr = ocr_config(backend = "unsupported-backend")
    ))
  },
  ValidationError = function(e) {
    cat("Invalid configuration:", conditionMessage(e), "\n")
  },
  error = function(e) {
    cat("Unexpected error:", conditionMessage(e), "\n")
  }
)
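Handler order matters: tryCatch() runs the first handler whose class matches the condition, so specific classes should come before kreuzberg_error and error. The sketch below demonstrates this with hand-built conditions in base R. make_error() is hypothetical, and the exact class vector Kreuzberg attaches to its conditions is an assumption.

```r
# Build a condition whose class vector mirrors the documented hierarchy.
# The exact class vector kreuzberg attaches is an assumption here.
make_error <- function(message, class) {
  structure(
    class = c(class, "kreuzberg_error", "error", "condition"),
    list(message = message, call = NULL)
  )
}

# Handlers are tried in the order listed: most specific first.
handle <- function(expr) {
  tryCatch(
    expr,
    FileNotFoundError = function(e) paste("missing file:", conditionMessage(e)),
    kreuzberg_error = function(e) paste("kreuzberg failure:", conditionMessage(e)),
    error = function(e) paste("other error:", conditionMessage(e))
  )
}

r1 <- handle(stop(make_error("no such file", "FileNotFoundError")))
r2 <- handle(stop(make_error("bad document", "ParsingError")))
r3 <- handle(stop("plain R error"))
print(c(r1, r2, r3))
```

ParsingError has no dedicated handler in this sketch, so it falls through to the kreuzberg_error handler, while the plain R error reaches only the final error handler.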

Cache Management

cache_stats()

Get cache statistics.

Signature:

cache_stats() -> list

Returns:

Named list with:
- total_entries (integer): Number of cached entries
- total_size_bytes (integer): Total cache size in bytes

Example:

library(kreuzberg)

stats <- cache_stats()
cat("Cache entries:", stats$total_entries, "\n")
cat("Cache size:", stats$total_size_bytes, "bytes\n")

clear_cache()

Clear the extraction cache.

Signature:

clear_cache() -> invisible(NULL)

Example:

library(kreuzberg)

clear_cache()

Validation

validate_language_code()

Validate language code.

Signature:

validate_language_code(code) -> logical

Parameters:

Parameter	Type	Description
`code`	character	Language code (ISO 639-3 or 639-1)

Returns:

Logical: TRUE if valid, FALSE otherwise

Example:

library(kreuzberg)

is_valid <- validate_language_code("eng")

validate_mime_type()

Validate MIME type.

Signature:

validate_mime_type(mime_type) -> logical

Parameters:

Parameter	Type	Description
`mime_type`	character	MIME type to validate

Returns:

Logical: TRUE if valid, FALSE otherwise

Example:

library(kreuzberg)

is_valid <- validate_mime_type("application/pdf")

validate_ocr_backend_name()

Validate OCR backend name.

Signature:

validate_ocr_backend_name(backend) -> logical

Parameters:

Parameter	Type	Description
`backend`	character	Backend name to validate

Returns:

Logical: TRUE if valid, FALSE otherwise

Example:

library(kreuzberg)

is_valid <- validate_ocr_backend_name("tesseract")
if (!is_valid) {
  cat("Invalid OCR backend\n")
}

validate_output_format()

Validate output format.

Signature:

validate_output_format(format) -> logical

Parameters:

Parameter	Type	Description
`format`	character	Output format name

Returns:

Logical: TRUE if valid, FALSE otherwise

Metadata Detection

detect_mime_type()

Detect MIME type from raw bytes.

Signature:

detect_mime_type(data) -> character

Parameters:

Parameter	Type	Description
`data`	raw	Binary data

Returns:

Character: Detected MIME type

Example:

library(kreuzberg)

data <- readBin("document", what = "raw", n = file.size("document"))
mime_type <- detect_mime_type(data)
cat("Detected MIME type:", mime_type, "\n")

detect_mime_type_from_path()

Detect MIME type from file path.

Signature:

detect_mime_type_from_path(path) -> character

Parameters:

Parameter	Type	Description
`path`	character	File path

Returns:

Character: Detected MIME type

Example:

library(kreuzberg)

mime_type <- detect_mime_type_from_path("document.pdf")
cat("MIME type:", mime_type, "\n")

get_extensions_for_mime()

Get file extensions for a MIME type.

Signature:

get_extensions_for_mime(mime_type) -> character

Parameters:

Parameter	Type	Description
`mime_type`	character	MIME type

Returns:

Character vector: File extensions for the MIME type

Example:

library(kreuzberg)

extensions <- get_extensions_for_mime("application/pdf")
cat("PDF extensions:", paste(extensions, collapse = ", "), "\n")

Plugins

OCR Backends

clear_ocr_backends()

Clear all registered OCR backends.

Signature:

clear_ocr_backends() -> invisible(NULL)

list_ocr_backends()

List all registered OCR backends.

Signature:

list_ocr_backends() -> character

Returns:

Character vector: Names of registered backends

Example:

library(kreuzberg)

backends <- list_ocr_backends()
cat("Available OCR backends:", paste(backends, collapse = ", "), "\n")

register_ocr_backend()

Register an OCR backend under the given name.

Signature:

register_ocr_backend(name, callback) -> invisible(NULL)

Parameters:

Parameter	Type	Description
`name`	character	Backend name
`callback`	function	Backend implementation function

unregister_ocr_backend()

Unregister an OCR backend.

Signature:

unregister_ocr_backend(name) -> invisible(NULL)

Post-Processors

clear_post_processors()

Clear all registered post-processors.

Signature:

clear_post_processors() -> invisible(NULL)

list_post_processors()

List all registered post-processors.

Signature:

list_post_processors() -> character

Returns:

Character vector: Names of registered post-processors

register_post_processor()

Register a post-processor under the given name.

Signature:

register_post_processor(name, callback) -> invisible(NULL)

Parameters:

Parameter	Type	Description
`name`	character	Processor name
`callback`	function	Processor implementation function

unregister_post_processor()

Unregister a post-processor.

Signature:

unregister_post_processor(name) -> invisible(NULL)

Validators

clear_validators()

Clear all registered validators.

Signature:

clear_validators() -> invisible(NULL)

list_validators()

List all registered validators.

Signature:

list_validators() -> character

Returns:

Character vector: Names of registered validators

register_validator()

Register a validator under the given name.

Signature:

register_validator(name, callback) -> invisible(NULL)

Parameters:

Parameter	Type	Description
`name`	character	Validator name
`callback`	function	Validator implementation function

unregister_validator()

Unregister a validator.

Signature:

unregister_validator(name) -> invisible(NULL)

Document Extractors

clear_document_extractors()

Clear all document extractors.

Signature:

clear_document_extractors() -> invisible(NULL)

list_document_extractors()

List all available document extractors.

Signature:

list_document_extractors() -> character

Returns:

Character vector: Names of available document extractors

unregister_document_extractor()

Unregister a document extractor.

Signature:

unregister_document_extractor(name) -> invisible(NULL)

Parameters:

Parameter	Type	Description
`name`	character	Extractor name

Thread Safety

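All Kreuzberg functions are thread-safe. R code itself runs single-threaded, so concurrent extraction from R uses multiple worker processes, for example forked workers from R’s parallel package or background sessions from the future framework.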

Example - Using parallel package:

library(kreuzberg)
library(parallel)

files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")

# Use parallel processing
results <- mclapply(files, function(file) {
  extract_file_sync(file)
}, mc.cores = 3)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}

Example - Using future package:

library(kreuzberg)
library(future)

plan(multisession)

files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")

# Process files asynchronously
futures <- lapply(files, function(file) {
  future({
    extract_file_sync(file)
  })
})

# Collect results
results <- lapply(futures, value)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}

However, for better performance, use the batch API instead:

library(kreuzberg)

files <- c("doc1.pdf", "doc2.pdf", "doc3.pdf")

# Better approach: use built-in batch processing
results <- batch_extract_files_sync(files)

for (i in seq_along(results)) {
  cat(sprintf("%s: %d characters\n", files[i], nchar(results[[i]]$content)))
}
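When a file list is too long to hold every result in memory at once, one option is to feed the batch API in slices. The sketch below shows the splitting step in base R. split_into_batches() is a hypothetical helper, the batch size is arbitrary, and the Kreuzberg call appears only as a comment.

```r
# Hypothetical helper: split a vector of paths into batches of at most
# batch_size elements, preserving order.
split_into_batches <- function(paths, batch_size) {
  stopifnot(batch_size >= 1)
  unname(split(paths, ceiling(seq_along(paths) / batch_size)))
}

files <- sprintf("doc%d.pdf", 1:5)
batches <- split_into_batches(files, batch_size = 2)
str(batches)

# With kreuzberg loaded, each batch could be processed in turn, e.g.:
# for (batch in batches) {
#   results <- batch_extract_files_sync(batch)
#   # ... write out or summarise `results` before the next batch ...
# }
```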

LLM Integration

Kreuzberg integrates with LLMs via the liter-llm crate for structured extraction and VLM-based OCR. The R binding passes LLM configuration as list options through the extendr FFI layer. See the LLM Integration Guide for full details.

Structured Extraction

Pass structured_extraction config to extract structured data from documents using an LLM:

library(kreuzberg)

config <- list(
  structured_extraction = list(
    schema = list(
      type = "object",
      properties = list(
        title = list(type = "string"),
        authors = list(type = "array", items = list(type = "string")),
        date = list(type = "string")
      ),
      required = c("title", "authors", "date"),
      additionalProperties = FALSE
    ),
    llm = list(model = "openai/gpt-4o-mini"),
    strict = TRUE
  )
)

result <- extract_file_sync("paper.pdf", config = config)

if (!is.null(result$structured_output)) {
  data <- jsonlite::fromJSON(result$structured_output)
  cat("Title:", data$title, "\n")
}

VLM OCR

Use a vision-language model as an OCR backend:

config <- list(
  force_ocr = TRUE,
  ocr = list(
    backend = "vlm",
    vlm_config = list(model = "openai/gpt-4o-mini")
  )
)

result <- extract_file_sync("scan.pdf", config = config)

For configuration details including API keys, model selection, and provider setup, see the LLM Integration Guide.