C API Reference¶
Complete reference for the Kreuzberg C FFI library. The kreuzberg-ffi crate exposes a C-compatible API (extern "C") for text extraction from documents, with thread-local error storage and explicit memory management.
All functions are synchronous. Error state is stored per-thread, so multiple threads can call extraction concurrently without interference.
Requirements¶
- C11 or later compiler (GCC, Clang, MSVC)
- libkreuzberg_ffi static or shared library (built from the kreuzberg-ffi crate)
- kreuzberg.h header file (auto-generated by cbindgen)
- Tesseract OCR (optional, for image and scanned document extraction)
Installation¶
Build from source¶
# Build the FFI crate (produces static library)
cargo build --release -p kreuzberg-ffi
# The static library is at target/release/libkreuzberg_ffi.a (Unix)
# or target/release/kreuzberg_ffi.lib (Windows)
Compile and link¶
# Linux
cc -I/path/to/kreuzberg-ffi \
-o myapp myapp.c \
-L/path/to/target/release \
-lkreuzberg_ffi -lpthread -ldl -lm
# macOS
cc -I/path/to/kreuzberg-ffi \
-o myapp myapp.c \
-L/path/to/target/release \
-lkreuzberg_ffi -lpthread \
-framework CoreFoundation -framework Security
# Windows (MSVC)
cl /I path\to\kreuzberg-ffi myapp.c /link /LIBPATH:path\to\target\release kreuzberg_ffi.lib
Static linking¶
For static linking, define KREUZBERG_STATIC before including the header:
Quick Start¶
Basic file extraction¶
#include <stdio.h>
#include <string.h> /* strlen */
#include "kreuzberg.h"
int main(void) {
    CExtractionResult *result = kreuzberg_extract_file_sync("document.pdf");
    if (result == NULL) {
        const char *error = kreuzberg_last_error();
        fprintf(stderr, "Extraction failed: %s\n", error);
        return 1;
    }
    if (result->success) {
        printf("MIME type: %s\n", result->mime_type);
        printf("Content length: %zu\n", strlen(result->content));
        printf("Content: %.200s...\n", result->content);
    }
    kreuzberg_free_result(result);
    return 0;
}
Extraction with error handling¶
#include <stdio.h>
#include <string.h> /* strlen */
#include "kreuzberg.h"
int main(void) {
    const char *config = "{\"force_ocr\": true, \"ocr\": {\"language\": \"eng\"}}";
    CExtractionResult *result = kreuzberg_extract_file_sync_with_config(
        "scanned.pdf", config
    );
    if (result == NULL) {
        CErrorDetails details = kreuzberg_get_error_details();
        fprintf(stderr, "Error [%s]: %s (code=%u)\n",
                details.error_type, details.message, details.error_code);
        if (details.source_file != NULL) {
            fprintf(stderr, " at %s:%u in %s\n",
                    details.source_file, details.source_line,
                    details.source_function);
        }
        /* Free every non-NULL string field of the details struct */
        kreuzberg_free_string(details.message);
        kreuzberg_free_string(details.error_type);
        if (details.source_file) kreuzberg_free_string(details.source_file);
        if (details.source_function) kreuzberg_free_string(details.source_function);
        if (details.context_info) kreuzberg_free_string(details.context_info);
        return 1;
    }
    printf("Extracted %zu characters\n", strlen(result->content));
    kreuzberg_free_result(result);
    return 0;
}
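Writing configuration JSON as an escaped C string literal is easy to get wrong. The following is a minimal sketch of assembling the same configuration with snprintf. The helper name build_ocr_config is hypothetical (it is not part of kreuzberg.h), and the sketch assumes the language value is a plain code such as "eng" that needs no JSON escaping.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of kreuzberg.h): writes a config JSON string
 * using the "force_ocr" and "ocr.language" keys from the example above.
 * Assumes `language` contains no characters that need JSON escaping.
 * Returns 0 on success, -1 if the buffer is too small or formatting fails. */
int build_ocr_config(char *buf, size_t buf_size,
                     bool force_ocr, const char *language) {
    int written = snprintf(buf, buf_size,
                           "{\"force_ocr\": %s, \"ocr\": {\"language\": \"%s\"}}",
                           force_ocr ? "true" : "false", language);
    if (written < 0 || (size_t)written >= buf_size) {
        return -1; /* formatting error or truncated output */
    }
    return 0;
}
```

On success, the buffer can be passed as the config_json argument of kreuzberg_extract_file_sync_with_config.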
Core Extraction Functions¶
kreuzberg_batch_extract_bytes_sync¶
Batch extract text from multiple in-memory documents.
Signature:
CBatchResult *kreuzberg_batch_extract_bytes_sync(
const CBytesWithMime *items,
uintptr_t count,
const char *config_json
);
Parameters:
- items (const CBytesWithMime*): Array of byte-buffer/MIME-type pairs
- count (uintptr_t): Number of items
- config_json (const char*): JSON configuration, or NULL
Returns:
- CBatchResult*: Batch result; free with kreuzberg_free_batch_result
- NULL on error
kreuzberg_batch_extract_files_sync¶
Batch extract text from multiple files in a single call.
Signature:
CBatchResult *kreuzberg_batch_extract_files_sync(
const char *const *file_paths,
uintptr_t count,
const char *config_json
);
Parameters:
- file_paths (const char *const *): Array of null-terminated file path strings
- count (uintptr_t): Number of file paths in the array
- config_json (const char*): JSON configuration applied to all files, or NULL
Returns:
- CBatchResult*: Batch result containing an array of individual results; free with kreuzberg_free_batch_result
- NULL on error (for example, invalid arguments)
kreuzberg_extract_batch_parallel¶
Parallel variant of streaming batch extraction using a thread pool.
Signature:
int kreuzberg_extract_batch_parallel(
const char *const *files,
uintptr_t count,
const char *config_json,
ResultCallback result_callback,
void *user_data,
Option_ErrorCallback error_callback,
uintptr_t max_parallel
);
Parameters:
- files (const char *const *): Array of file paths
- count (uintptr_t): Number of files
- config_json (const char*): JSON config or NULL
- result_callback (ResultCallback): Called for each successful extraction
- user_data (void*): Opaque pointer passed through to callbacks
- error_callback (Option_ErrorCallback): Called on per-file failures
- max_parallel (uintptr_t): Maximum concurrent extractions (0 = number of CPUs)
Returns:
- 0 on success, -1 on error
kreuzberg_extract_batch_streaming¶
Stream-process multiple files with a callback for each result, avoiding memory accumulation.
Signature:
int kreuzberg_extract_batch_streaming(
const char *const *files,
uintptr_t count,
const char *config_json,
ResultCallback result_callback,
void *user_data,
Option_ErrorCallback error_callback
);
Parameters:
- files (const char *const *): Array of file paths
- count (uintptr_t): Number of files
- config_json (const char*): JSON config or NULL
- result_callback (ResultCallback): Called for each successful extraction
- user_data (void*): Opaque pointer passed through to callbacks
- error_callback (Option_ErrorCallback): Called on per-file failures (optional)
Returns:
- 0 on success, -1 on error
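The callback-plus-user_data pattern used by the streaming functions can be sketched without the library. Everything below is a hypothetical stand-in: the real ResultCallback and Option_ErrorCallback signatures are defined in kreuzberg.h and are not reproduced in this reference, so the Demo* typedefs and the demo_stream driver are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the callback types; the real signatures
 * live in kreuzberg.h and may differ. */
typedef void (*DemoResultCallback)(const char *content, size_t file_index, void *user_data);
typedef void (*DemoErrorCallback)(const char *message, size_t file_index, void *user_data);

/* Caller-owned state threaded through the callbacks via user_data. */
typedef struct {
    size_t succeeded;
    size_t failed;
    size_t total_chars;
} BatchTotals;

void on_result(const char *content, size_t file_index, void *user_data) {
    (void)file_index;
    BatchTotals *totals = user_data;
    totals->succeeded += 1;
    totals->total_chars += strlen(content); /* content is only borrowed here */
}

void on_error(const char *message, size_t file_index, void *user_data) {
    (void)message;
    (void)file_index;
    BatchTotals *totals = user_data;
    totals->failed += 1;
}

/* Stand-in driver: treats a NULL entry as a failed file and every other
 * entry as its own "extracted content". Returns 0, or -1 on bad arguments. */
int demo_stream(const char *const *files, size_t count,
                DemoResultCallback result_cb, void *user_data,
                DemoErrorCallback error_cb) {
    if (files == NULL || result_cb == NULL) {
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        if (files[i] == NULL) {
            if (error_cb != NULL) {
                error_cb("missing path", i, user_data);
            }
            continue;
        }
        result_cb(files[i], i, user_data);
    }
    return 0;
}
```

Because each result is handed to the callback and then discarded, the caller keeps only the running totals in BatchTotals rather than every result, which is the point of the streaming variant.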
kreuzberg_extract_bytes_sync¶
Extract text from an in-memory byte buffer with a specified MIME type.
Signature:
CExtractionResult *kreuzberg_extract_bytes_sync(
const uint8_t *data,
uintptr_t data_len,
const char *mime_type
);
Parameters:
- data (const uint8_t*): Pointer to the document bytes
- data_len (uintptr_t): Length of the byte buffer
- mime_type (const char*): Null-terminated MIME type string (for example, "application/pdf")
Returns:
- CExtractionResult*: Populated result on success; caller must free with kreuzberg_free_result
- NULL on error; check kreuzberg_last_error() or kreuzberg_get_error_details() for details
kreuzberg_extract_bytes_sync_with_config¶
Extract text from bytes with a JSON configuration string.
Signature:
CExtractionResult *kreuzberg_extract_bytes_sync_with_config(
const uint8_t *data,
uintptr_t data_len,
const char *mime_type,
const char *config_json
);
Parameters:
- data (const uint8_t*): Document bytes
- data_len (uintptr_t): Length of data
- mime_type (const char*): MIME type string
- config_json (const char*): JSON configuration, or NULL for defaults
Returns:
- CExtractionResult*: Result on success; free with kreuzberg_free_result
- NULL on error
kreuzberg_extract_file_sync¶
Extract text and metadata from a file.
Signature:
CExtractionResult *kreuzberg_extract_file_sync(const char *file_path);
Parameters:
- file_path (const char*): Null-terminated path to the document file
Returns:
- CExtractionResult*: Populated result on success; caller must free with kreuzberg_free_result
- NULL on error; check kreuzberg_last_error() or kreuzberg_get_error_details() for details
kreuzberg_extract_file_sync_with_config¶
Extract text and metadata from a file with custom JSON configuration.
Signature:
CExtractionResult *kreuzberg_extract_file_sync_with_config(
const char *file_path,
const char *config_json
);
Parameters:
- file_path (const char*): Null-terminated path to the document
- config_json (const char*): Null-terminated JSON configuration string, or NULL for defaults
Returns:
- CExtractionResult*: Result on success; free with kreuzberg_free_result
- NULL on error
Configuration¶
Config Builder¶
Construct an ExtractionConfig programmatically using the builder pattern.
Signature:
ConfigBuilder *kreuzberg_config_builder_new(void);
int32_t kreuzberg_config_builder_set_chunking(ConfigBuilder *builder, const char *chunking_json);
int32_t kreuzberg_config_builder_set_image_extraction(ConfigBuilder *builder, const char *image_json);
int32_t kreuzberg_config_builder_set_include_document_structure(ConfigBuilder *builder, int32_t include);
int32_t kreuzberg_config_builder_set_language_detection(ConfigBuilder *builder, const char *ld_json);
int32_t kreuzberg_config_builder_set_layout(ConfigBuilder *builder, const char *layout_json);
int32_t kreuzberg_config_builder_set_ocr(ConfigBuilder *builder, const char *ocr_json);
int32_t kreuzberg_config_builder_set_pdf(ConfigBuilder *builder, const char *pdf_json);
int32_t kreuzberg_config_builder_set_post_processor(ConfigBuilder *builder, const char *pp_json);
int32_t kreuzberg_config_builder_set_use_cache(ConfigBuilder *builder, int32_t use_cache);
ExtractionConfig *kreuzberg_config_builder_build(ConfigBuilder *builder);
void kreuzberg_config_builder_free(ConfigBuilder *builder);
Returns: Setter functions return 0 on success, -1 on error.
Example:
ConfigBuilder *builder = kreuzberg_config_builder_new();
kreuzberg_config_builder_set_use_cache(builder, 1);
kreuzberg_config_builder_set_output_format(builder, "Markdown");
kreuzberg_config_builder_set_ocr(builder,
"{\"backend\": \"tesseract\", \"languages\": [\"eng\"]}");
kreuzberg_config_builder_set_chunking(builder,
"{\"max_chars\": 1000, \"max_overlap\": 200}");
ExtractionConfig *config = kreuzberg_config_builder_build(builder);
/* builder is consumed -- do NOT call kreuzberg_config_builder_free */
/* Use config... */
kreuzberg_config_free(config);
Important: After calling kreuzberg_config_builder_build, the builder is consumed. Do not call kreuzberg_config_builder_free on a consumed builder. To discard a builder without building, call kreuzberg_config_builder_free instead.
JSON Configuration¶
Parse, serialize, validate, and discover configuration from JSON.
Signature:
ExtractionConfig *kreuzberg_config_from_json(const char *json_config);
char *kreuzberg_config_get_field(const ExtractionConfig *config, const char *field_name);
int32_t kreuzberg_config_is_valid(const char *json_config);
int32_t kreuzberg_config_merge(ExtractionConfig *base, const ExtractionConfig *override_config);
char *kreuzberg_config_to_json(const ExtractionConfig *config);
Parameters/Returns:
- kreuzberg_config_from_json: Parses JSON into ExtractionConfig*; free with kreuzberg_config_free. Returns NULL on error.
- kreuzberg_config_get_field: Returns a specific field as JSON; free with kreuzberg_free_string.
- kreuzberg_config_is_valid: Returns 1 if valid, 0 if invalid.
- kreuzberg_config_merge: Merges override into base (in-place). Returns 1 on success, 0 on error.
- kreuzberg_config_to_json: Serializes config to JSON string; free with kreuzberg_free_string.
Embeddings¶
Retrieve information about available embedding presets and models.
Signature:
char *kreuzberg_get_embedding_preset(const char *preset_name);
char *kreuzberg_list_embedding_presets(void);
- kreuzberg_get_embedding_preset: Returns JSON for a specific preset; free with kreuzberg_free_string.
- kreuzberg_list_embedding_presets: Returns JSON array of all preset names; free with kreuzberg_free_string.
File-based Configuration¶
char *kreuzberg_config_discover(void);
ExtractionConfig *kreuzberg_config_from_file(const char *path);
char *kreuzberg_load_extraction_config_from_file(const char *file_path);
- kreuzberg_config_discover: Searches parent directories for a config file. Returns JSON string or NULL; free with kreuzberg_free_string.
- kreuzberg_config_from_file: Loads config from a file. Returns ExtractionConfig*; free with kreuzberg_config_free.
- kreuzberg_load_extraction_config_from_file: Loads config and returns raw JSON string; free with kreuzberg_free_string.
PDF Rendering¶
Added in v4.6.2
kreuzberg_render_pdf_page¶
Render a single page of a PDF as a PNG image.
Signature:
Parameters:
- file_path (const char*): Path to the PDF file (UTF-8 encoded)
- page_index (size_t): Zero-based page index to render
- dpi (int): Resolution for rendering (for example, 150)
Returns:
- CRenderPageResult*: Pointer to a single page render result, or NULL on error. Free with kreuzberg_free_render_page_result.
kreuzberg_free_render_page_result¶
Free a single page result returned by kreuzberg_render_pdf_page.
Error Handling¶
kreuzberg_last_error¶
Get the last error message for the current thread.
Signature:
const char *kreuzberg_last_error(void);
Returns:
- Pointer to a null-terminated error message string
- The returned pointer should NOT be freed
kreuzberg_last_error_code¶
Get the numeric error code for the last error.
Signature:
Returns:
- Numeric error code (0-7), or -1 if no error occurred
kreuzberg_last_panic_context¶
Get the panic context if the last error was a caught panic.
Signature:
Returns:
- Static string with panic details, or NULL. Do NOT free.
kreuzberg_get_error_details¶
Retrieve structured error information from thread-local storage.
Signature:
CErrorDetails kreuzberg_get_error_details(void);
Returns:
- CErrorDetails struct with all fields populated. Non-NULL string fields must be freed with kreuzberg_free_string.
Example:
CExtractionResult *result = kreuzberg_extract_file_sync("bad_file.xyz");
if (result == NULL) {
CErrorDetails details = kreuzberg_get_error_details();
fprintf(stderr, "Error: %s (code=%u, type=%s)\n",
details.message, details.error_code, details.error_type);
if (details.source_file != NULL) {
fprintf(stderr, " at %s:%u in %s\n",
details.source_file, details.source_line, details.source_function);
}
if (details.is_panic) {
fprintf(stderr, " [PANIC]\n");
}
/* Free all non-NULL string fields */
kreuzberg_free_string(details.message);
kreuzberg_free_string(details.error_type);
if (details.source_file) kreuzberg_free_string(details.source_file);
if (details.source_function) kreuzberg_free_string(details.source_function);
if (details.context_info) kreuzberg_free_string(details.context_info);
}
For language bindings that have trouble returning structs by value, use the heap-allocated variant:
CErrorDetails *kreuzberg_get_error_details_ptr(void);
void kreuzberg_free_error_details(CErrorDetails *details);
Error Codes¶
| Code | Name | Description |
|---|---|---|
| 0 | validation | Input validation error |
| 1 | parsing | Document parsing error |
| 2 | ocr | OCR processing error |
| 3 | missing_dependency | Required library not found |
| 4 | io | File I/O error |
| 5 | plugin | Plugin registration or execution |
| 6 | unsupported_format | MIME type not supported |
| 7 | internal | Internal/unexpected error |
Error Code Functions¶
uint32_t kreuzberg_error_code_count(void); /* returns 8 */
uint32_t kreuzberg_error_code_internal(void); /* returns 7 */
uint32_t kreuzberg_error_code_io(void); /* returns 4 */
uint32_t kreuzberg_error_code_missing_dependency(void); /* returns 3 */
uint32_t kreuzberg_error_code_ocr(void); /* returns 2 */
uint32_t kreuzberg_error_code_parsing(void); /* returns 1 */
uint32_t kreuzberg_error_code_plugin(void); /* returns 5 */
uint32_t kreuzberg_error_code_unsupported_format(void); /* returns 6 */
uint32_t kreuzberg_error_code_validation(void); /* returns 0 */
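The accessor functions above return the codes at run time, so their results cannot be used as case labels in a switch statement. One possible workaround, sketched here, is a local enum that mirrors the documented values. The APP_ERR_* names and app_error_label are hypothetical and not part of kreuzberg.h. Only the numeric values 0-7 come from the table.

```c
#include <stdint.h>

/* Local mirror of the error code table above. The APP_ERR_* names are
 * hypothetical; the numeric values are the documented codes 0-7. */
typedef enum {
    APP_ERR_VALIDATION = 0,
    APP_ERR_PARSING = 1,
    APP_ERR_OCR = 2,
    APP_ERR_MISSING_DEPENDENCY = 3,
    APP_ERR_IO = 4,
    APP_ERR_PLUGIN = 5,
    APP_ERR_UNSUPPORTED_FORMAT = 6,
    APP_ERR_INTERNAL = 7
} AppErrorCode;

/* Hypothetical helper: maps a numeric code to a short log label.
 * Enum constants are integer constant expressions, so they are valid
 * case labels, unlike the run-time accessor functions. */
const char *app_error_label(uint32_t code) {
    switch (code) {
    case APP_ERR_VALIDATION:         return "validation";
    case APP_ERR_PARSING:            return "parsing";
    case APP_ERR_OCR:                return "ocr";
    case APP_ERR_MISSING_DEPENDENCY: return "missing_dependency";
    case APP_ERR_IO:                 return "io";
    case APP_ERR_PLUGIN:             return "plugin";
    case APP_ERR_UNSUPPORTED_FORMAT: return "unsupported_format";
    case APP_ERR_INTERNAL:           return "internal";
    default:                         return "unknown";
    }
}
```

A program that uses such a mirror could check it once at startup against the accessors (for example, comparing APP_ERR_IO with kreuzberg_error_code_io()) so that a mismatch with a newer library version is caught early.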
Error Introspection¶
uint32_t kreuzberg_classify_error(const char *error_message);
const char *kreuzberg_error_code_description(uint32_t code);
const char *kreuzberg_error_code_name(uint32_t code);
- kreuzberg_error_code_name: Returns a static string like "validation", "ocr". Do NOT free.
- kreuzberg_error_code_description: Returns a static description. Do NOT free.
- kreuzberg_classify_error: Classifies an arbitrary error message string into one of the error codes (0-7).
Example:
uint32_t code = kreuzberg_classify_error("Failed to open file: permission denied");
if (code == kreuzberg_error_code_io()) {
printf("This is an I/O error: %s\n", kreuzberg_error_code_description(code));
}
Memory Management¶
Correct memory management is critical when using the C API. Every allocation has a specific free function. Mixing allocators (for example, calling free() on a Kreuzberg-allocated string) causes undefined behavior.
Rules¶
| Allocated by | Free with |
|---|---|
| kreuzberg_extract_* (returns CExtractionResult*) | kreuzberg_free_result |
| kreuzberg_batch_extract_* (returns CBatchResult*) | kreuzberg_free_batch_result |
| Functions returning char* | kreuzberg_free_string |
| kreuzberg_config_from_json / kreuzberg_config_from_file / kreuzberg_config_builder_build | kreuzberg_config_free |
| kreuzberg_config_builder_new | kreuzberg_config_builder_free (only if NOT built) |
| kreuzberg_get_error_details_ptr | kreuzberg_free_error_details |
| kreuzberg_result_pool_new | kreuzberg_result_pool_free |
Free Functions¶
void kreuzberg_config_builder_free(ConfigBuilder *builder);
void kreuzberg_config_free(ExtractionConfig *config);
void kreuzberg_free_batch_result(CBatchResult *batch_result);
void kreuzberg_free_error_details(CErrorDetails *details);
void kreuzberg_free_result(CExtractionResult *result);
void kreuzberg_free_string(char *s);
All free functions accept NULL (no-op).
kreuzberg_clone_string¶
Duplicate a null-terminated string using the Kreuzberg allocator.
Signature:
char *kreuzberg_clone_string(const char *s);
Parameters:
- s (const char*): Null-terminated UTF-8 string to clone
Returns:
- char*: Cloned string; free with kreuzberg_free_string
- NULL on error
Example:
CExtractionResult *result = kreuzberg_extract_file_sync("doc.pdf");
if (result != NULL && result->success) {
/* Clone content before freeing result */
char *saved_content = kreuzberg_clone_string(result->content);
kreuzberg_free_result(result);
/* Use saved_content... */
printf("%s\n", saved_content);
kreuzberg_free_string(saved_content);
}
MIME Type Utilities¶
kreuzberg_detect_mime_type¶
Detect MIME type from a file path, optionally checking that the file exists.
Signature:
char *kreuzberg_detect_mime_type(const char *file_path, bool check_exists);
Parameters:
- file_path (const char*): Path to file
- check_exists (bool): If true, verifies the file exists before detection
Returns:
- char*: Detected MIME type string; free with kreuzberg_free_string
- NULL on error
kreuzberg_detect_mime_type_from_bytes¶
Detect MIME type from raw byte content.
Signature:
Example:
uint8_t data[512];
/* ... read data ... */
char *mime = kreuzberg_detect_mime_type_from_bytes(data, 512);
if (mime != NULL) {
printf("Detected: %s\n", mime);
kreuzberg_free_string(mime);
}
kreuzberg_detect_mime_type_from_path¶
Detect MIME type by reading both the file extension and file content.
Signature:
kreuzberg_get_extensions_for_mime¶
Get file extensions for a given MIME type.
Signature:
Returns:
- char*: JSON array of extensions (for example, ["pdf"]); free with kreuzberg_free_string
- NULL on error
Example:
char *exts = kreuzberg_get_extensions_for_mime("application/pdf");
if (exts != NULL) {
    printf("Extensions: %s\n", exts); /* ["pdf"] */
    kreuzberg_free_string(exts);
}
kreuzberg_validate_mime_type¶
Validate that a MIME type is supported by Kreuzberg.
Signature:
Returns:
- char*: Validated MIME type if supported; free with kreuzberg_free_string
- NULL if unsupported or on error
Plugin System¶
Document Extractors¶
Register custom document extractors to handle new or proprietary formats.
bool kreuzberg_register_document_extractor(
const char *name,
DocumentExtractorCallback callback,
const char *mime_types,
int32_t priority
);
bool kreuzberg_unregister_document_extractor(const char *name);
char *kreuzberg_list_document_extractors(void);
bool kreuzberg_clear_document_extractors(void);
Callback signature:
typedef char *(*DocumentExtractorCallback)(
const uint8_t *content,
uintptr_t content_len,
const char *mime_type,
const char *config_json
);
The callback must return a null-terminated JSON string containing the extraction result, or NULL on error. The returned string must be freeable by kreuzberg_free_string.
Example:
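char *my_extractor(const uint8_t *content, uintptr_t content_len,
                   const char *mime_type, const char *config_json) {
    /* Process content, return JSON ExtractionResult.
       kreuzberg_clone_string allocates with the Kreuzberg allocator,
       so the result can be released by kreuzberg_free_string. */
    return kreuzberg_clone_string(
        "{\"content\":\"extracted text\",\"mime_type\":\"text/plain\",\"metadata\":{}}");
}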
bool ok = kreuzberg_register_document_extractor(
"my-extractor", my_extractor,
"application/x-custom,text/x-custom", 100
);
if (!ok) {
fprintf(stderr, "Registration failed: %s\n", kreuzberg_last_error());
}
OCR Backends¶
Register custom OCR backends for image text recognition.
bool kreuzberg_register_ocr_backend(const char *name, OcrBackendCallback callback);
bool kreuzberg_register_ocr_backend_with_languages(
const char *name,
OcrBackendCallback callback,
const char *languages_json
);
bool kreuzberg_unregister_ocr_backend(const char *name);
char *kreuzberg_list_ocr_backends(void);
bool kreuzberg_clear_ocr_backends(void);
char *kreuzberg_get_ocr_languages(const char *backend);
int32_t kreuzberg_is_language_supported(const char *backend, const char *language);
char *kreuzberg_list_ocr_backends_with_languages(void);
Callback signature:
typedef char *(*OcrBackendCallback)(
const uint8_t *image_bytes,
uintptr_t image_length,
const char *config_json
);
Example:
char *my_ocr(const uint8_t *image_bytes, size_t image_length,
const char *config_json) {
/* Run custom OCR on image data */
return strdup("Recognized text from image");
}
kreuzberg_register_ocr_backend("my-ocr", my_ocr);
/* Check language support */
char *langs = kreuzberg_get_ocr_languages("tesseract");
if (langs != NULL) {
printf("Tesseract languages: %s\n", langs);
kreuzberg_free_string(langs);
}
Post-Processors¶
Register custom post-processing steps to modify extraction results.
bool kreuzberg_register_post_processor(
const char *name,
PostProcessorCallback callback,
int32_t priority
);
bool kreuzberg_register_post_processor_with_stage(
const char *name,
PostProcessorCallback callback,
int32_t priority,
const char *stage /* "early", "middle", or "late" */
);
bool kreuzberg_unregister_post_processor(const char *name);
bool kreuzberg_clear_post_processors(void);
char *kreuzberg_list_post_processors(void);
Callback signature:
The callback receives the current result as JSON, modifies it, and returns a new JSON string.
Validators¶
Register custom validation logic that runs after extraction.
bool kreuzberg_register_validator(
const char *name,
ValidatorCallback callback,
int32_t priority
);
bool kreuzberg_unregister_validator(const char *name);
bool kreuzberg_clear_validators(void);
char *kreuzberg_list_validators(void);
Callback signature:
Return NULL if validation passes. Return an error message string if validation fails.
Example:
char *check_not_empty(const char *result_json) {
/* Simple check: reject if content is empty */
if (strstr(result_json, "\"content\":\"\"") != NULL) {
return strdup("Validation failed: extracted content is empty");
}
return NULL; /* passes */
}
kreuzberg_register_validator("not-empty", check_not_empty, 100);
Result Pool¶
Pre-allocate memory for extraction results to reduce allocation overhead in batch scenarios.
ResultPool *kreuzberg_result_pool_new(uintptr_t capacity);
void kreuzberg_result_pool_reset(ResultPool *pool);
void kreuzberg_result_pool_free(ResultPool *pool);
CResultPoolStats kreuzberg_result_pool_stats(const ResultPool *pool);
Example:
ResultPool *pool = kreuzberg_result_pool_new(100);
if (pool == NULL) {
fprintf(stderr, "Pool creation failed: %s\n", kreuzberg_last_error());
return;
}
/* Process batches, resetting between them */
kreuzberg_result_pool_reset(pool);
/* Extract directly into the pool */
kreuzberg_extract_file_into_pool(pool, "doc.pdf", NULL);
/* Check pool efficiency */
CResultPoolStats stats = kreuzberg_result_pool_stats(pool);
printf("Pool: %zu/%zu results, %zu allocs, %zu bytes\n",
stats.current_count, stats.capacity,
stats.total_allocations, stats.estimated_memory_bytes);
kreuzberg_result_pool_free(pool);
Pool Functions¶
int32_t kreuzberg_extract_file_into_pool(ResultPool *pool, const char *file_path, const char *config_json);
CExtractionResultView *kreuzberg_extract_file_into_pool_view(ResultPool *pool, const char *file_path, const char *config_json);
void kreuzberg_result_pool_free(ResultPool *pool);
ResultPool *kreuzberg_result_pool_new(uintptr_t capacity);
void kreuzberg_result_pool_reset(ResultPool *pool);
CResultPoolStats kreuzberg_result_pool_stats(ResultPool *pool);
Utility Functions¶
kreuzberg_version¶
Get the library version string.
Signature:
Returns:
- Pointer to a static null-terminated string (for example, "4.3.8"). Do NOT free this pointer.
Example:
Version Macros¶
The header also defines compile-time version macros:
#define KREUZBERG_VERSION_MAJOR 4
#define KREUZBERG_VERSION_MINOR 3
#define KREUZBERG_VERSION_PATCH 8
#define KREUZBERG_VERSION "4.3.8"
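These macros allow version checks at compile time. The sketch below shows one way to use them. The fallback #define lines are stand-ins copied from the listing above so the snippet compiles on its own (with the real header, include kreuzberg.h instead), and header_version_at_least is a hypothetical helper, not a library function.

```c
#include <stdbool.h>

/* Stand-in definitions copied from the listing above so the sketch compiles
 * on its own; with the real header, include kreuzberg.h instead. */
#ifndef KREUZBERG_VERSION_MAJOR
#define KREUZBERG_VERSION_MAJOR 4
#define KREUZBERG_VERSION_MINOR 3
#define KREUZBERG_VERSION_PATCH 8
#endif

/* Compile-time guard: refuse to build against an older major version. */
#if KREUZBERG_VERSION_MAJOR < 4
#error "This code expects Kreuzberg 4.x or later"
#endif

/* Hypothetical helper: true if the header version is at least
 * major.minor.patch, comparing the components in order. */
bool header_version_at_least(int major, int minor, int patch) {
    if (KREUZBERG_VERSION_MAJOR != major) {
        return KREUZBERG_VERSION_MAJOR > major;
    }
    if (KREUZBERG_VERSION_MINOR != minor) {
        return KREUZBERG_VERSION_MINOR > minor;
    }
    return KREUZBERG_VERSION_PATCH >= patch;
}
```

The macros describe the header used at compile time, while kreuzberg_version() reports the library linked at run time. Comparing the two is a cheap way to detect a mismatched header and library.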
Data Types¶
CExtractionResult¶
The primary result structure returned by extraction functions.
typedef struct CExtractionResult {
char *annotations_json; /* JSON PDF annotations (may be NULL) */
char *chunks_json; /* JSON array of text chunks (may be NULL) */
char *content; /* Extracted text (UTF-8, null-terminated) */
char *date; /* Document date (may be NULL) */
char *detected_languages_json; /* JSON array of detected languages (may be NULL) */
char *djot_content_json; /* JSON Djot content (may be NULL) */
char *document_json; /* JSON document structure (may be NULL) */
char *elements_json; /* JSON semantic elements (may be NULL) */
char *extracted_keywords_json; /* JSON keywords (may be NULL) */
char *images_json; /* JSON array of extracted images (may be NULL) */
char *language; /* Document language (may be NULL) */
char *metadata_json; /* JSON object with metadata (may be NULL) */
char *mime_type; /* Detected MIME type */
char *ocr_elements_json; /* JSON OCR elements (may be NULL) */
char *page_structure_json; /* JSON page structure (may be NULL) */
char *pages_json; /* JSON per-page content (may be NULL) */
char *processing_warnings_json; /* JSON warnings array (may be NULL) */
char *quality_score_json; /* JSON quality score (may be NULL) */
char *subject; /* Document subject (may be NULL) */
char *tables_json; /* JSON array of tables (may be NULL) */
bool success; /* true if extraction succeeded */
uint8_t _padding1[7]; /* Internal padding */
} CExtractionResult;
All char* fields are null-terminated UTF-8 strings. Fields marked "may be NULL" are optional and depend on the document type and configuration. Free the entire struct with kreuzberg_free_result (which frees all string fields automatically).
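Because optional fields may be NULL, and passing NULL to a printf %s conversion is undefined behavior, it helps to route them through a small guard before printing. The helpers below are a hypothetical sketch (text_or and format_summary are not part of kreuzberg.h). They take plain strings, so they can be used with fields such as language and subject.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: substitutes a fallback for a NULL optional field,
 * since passing NULL to printf's %s is undefined behavior. */
const char *text_or(const char *value, const char *fallback) {
    return value != NULL ? value : fallback;
}

/* Formats two optional fields into buf; returns snprintf's result
 * (the length that would have been written, or a negative value on error). */
int format_summary(char *buf, size_t buf_size,
                   const char *language, const char *subject) {
    return snprintf(buf, buf_size, "language=%s subject=%s",
                    text_or(language, "(none)"), text_or(subject, "(none)"));
}
```

With a real result, a call might look like format_summary(buf, sizeof buf, result->language, result->subject).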
CBatchResult¶
Container for batch extraction results.
typedef struct CBatchResult {
uintptr_t count; /* Number of results */
CExtractionResult **results; /* Array of result pointers */
bool success; /* true if batch operation succeeded */
uint8_t _padding2[7]; /* Internal padding */
} CBatchResult;
Free with kreuzberg_free_batch_result (frees all individual results and the array).
CBytesWithMime¶
Input structure for byte-based batch extraction.
typedef struct CBytesWithMime {
const uint8_t *data; /* Pointer to document bytes */
uintptr_t data_len; /* Length in bytes */
const char *mime_type; /* MIME type as null-terminated string */
} CBytesWithMime;
The caller retains ownership of data and mime_type pointers.
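As an illustration of that ownership rule, the sketch below fills an item array from caller-owned buffers without copying them. The struct is a stand-in copy of the definition above so the snippet compiles on its own, and fill_items is a hypothetical helper rather than a library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in copy of the struct above so the sketch compiles on its own;
 * with the real header, include kreuzberg.h instead. */
typedef struct CBytesWithMime {
    const uint8_t *data;
    uintptr_t data_len;
    const char *mime_type;
} CBytesWithMime;

/* Hypothetical helper: fills `items` from parallel arrays without copying.
 * The buffers and MIME strings stay owned by the caller and must outlive
 * any call that reads `items`. Returns the number of items written. */
size_t fill_items(CBytesWithMime *items, size_t capacity,
                  const uint8_t *const *buffers, const size_t *lengths,
                  const char *const *mime_types, size_t count) {
    size_t n = count < capacity ? count : capacity;
    for (size_t i = 0; i < n; i++) {
        items[i].data = buffers[i];
        items[i].data_len = lengths[i];
        items[i].mime_type = mime_types[i];
    }
    return n;
}
```

The filled array and its length are what kreuzberg_batch_extract_bytes_sync expects as items and count. The underlying buffers should be released only after that call has returned.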
CErrorDetails¶
Structured error information.
typedef struct CErrorDetails {
char *context_info; /* Additional context (may be NULL; free if non-NULL) */
uint32_t error_code; /* Numeric error code (0-7) */
char *error_type; /* Human-readable type name (free with kreuzberg_free_string) */
char *message; /* Error message (free with kreuzberg_free_string) */
char *source_file; /* Source file (may be NULL; free if non-NULL) */
char *source_function; /* Source function (may be NULL; free if non-NULL) */
uint32_t source_line; /* Line number (0 if unknown) */
int32_t is_panic; /* 1 if from a panic, 0 otherwise */
} CErrorDetails;
CExtractionResultView¶
Zero-copy view into an ExtractionResult. Provides direct pointers to string data without allocation. All pointers are UTF-8 byte slices (not null-terminated).
typedef struct CExtractionResultView {
uintptr_t chunk_count;
uintptr_t content_len;
const uint8_t *content_ptr;
uintptr_t date_len;
const uint8_t *date_ptr;
uintptr_t detected_language_count;
uintptr_t image_count;
uintptr_t language_len;
const uint8_t *language_ptr;
uintptr_t mime_type_len;
const uint8_t *mime_type_ptr;
uintptr_t page_count;
uintptr_t subject_len;
const uint8_t *subject_ptr;
uintptr_t table_count;
uintptr_t title_len;
const uint8_t *title_ptr;
} CExtractionResultView;
View Functions¶
int32_t kreuzberg_get_result_view(const CExtractionResult *result, CExtractionResultView *out_view);
int32_t kreuzberg_view_get_content(const CExtractionResultView *view, const uint8_t **out_ptr, uintptr_t *out_len);
int32_t kreuzberg_view_get_mime_type(const CExtractionResultView *view, const uint8_t **out_ptr, uintptr_t *out_len);
Views are used in streaming callbacks. They are valid only during the callback invocation. Copy any data you need to keep.
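Since view pointers are not null-terminated and expire when the callback returns, keeping the data means copying the (pointer, length) pair into storage the caller owns. A minimal sketch follows. copy_slice is a hypothetical helper, not part of kreuzberg.h.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies a (ptr, len) UTF-8 slice into a new
 * null-terminated buffer allocated with malloc. The caller releases it with
 * free(), not kreuzberg_free_string, because Kreuzberg did not allocate it.
 * Returns NULL if ptr is NULL, len is too large, or allocation fails. */
char *copy_slice(const uint8_t *ptr, size_t len) {
    if (ptr == NULL || len == SIZE_MAX) {
        return NULL;
    }
    char *copy = malloc(len + 1);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, ptr, len);
    copy[len] = '\0'; /* add the terminator the view does not carry */
    return copy;
}
```

Inside a callback this might be called as copy_slice(view->content_ptr, view->content_len). For printing only, no copy is needed: printf("%.*s", (int)len, (const char *)ptr) writes exactly len bytes, provided len fits in an int.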
CMetadataField¶
Returned by kreuzberg_result_get_metadata_field.
typedef struct CMetadataField {
const char *name; /* Field name (do NOT free) */
char *json_value; /* JSON value string (free with kreuzberg_free_string if non-NULL) */
int32_t is_null; /* 1 if field does not exist, 0 if it does */
} CMetadataField;
Standard Metadata Fields¶
The following fields are standard across all document types and can be queried via kreuzberg_result_get_metadata_field:
| Field | Description | Type (JSON) |
|---|---|---|
| author | Document author | string |
| characters | Character count | number |
| created_at | Creation timestamp | string (ISO 8601) |
| creator | Document creator/application | string |
| description | Document description | string |
| keywords | Document keywords | array<string> |
| language | Primary language | string (ISO 639-1) |
| modified_at | Modification timestamp | string (ISO 8601) |
| pages | Page count info | object |
| producer | PDF producer | string |
| subject | Document subject | string |
| title | Document title | string |
| version | Document version | string |
Result Retrieval¶
Functions for retrieving specific data from CExtractionResult.
kreuzberg_result_get_chunk_count¶
Get the number of text chunks in the result.
Signature:
uintptr_t kreuzberg_result_get_chunk_count(const CExtractionResult *result);
kreuzberg_result_get_detected_language¶
Get a detected language at a specific index.
Signature:
kreuzberg_result_get_metadata_field¶
Get a specific metadata field by name.
Signature:
CMetadataField kreuzberg_result_get_metadata_field(const CExtractionResult *result, const char *name);
kreuzberg_result_get_page_count¶
Get the number of pages in the result.
Signature:
CResultPoolStats¶
Statistics for result pool tracking.
typedef struct CResultPoolStats {
uintptr_t capacity; /* Maximum capacity */
uintptr_t current_count; /* Current results in pool */
uintptr_t estimated_memory_bytes; /* Estimated memory usage */
uintptr_t growth_events; /* Number of capacity growths */
uintptr_t total_allocations; /* Total allocations made */
} CResultPoolStats;
CStringInternStats¶
Statistics for string interning efficiency.
typedef struct CStringInternStats {
uintptr_t cache_hits; /* Cache hits */
uintptr_t cache_misses; /* Cache misses */
uintptr_t estimated_memory_saved; /* Memory saved by deduplication */
uintptr_t total_memory_bytes; /* Total memory used */
uintptr_t total_requests; /* Total intern requests */
uintptr_t unique_count; /* Unique strings interned */
} CStringInternStats;
Intern Functions¶
void kreuzberg_free_interned_string(char *s);
char *kreuzberg_intern_string(const char *s);
void kreuzberg_string_intern_reset(void);
CStringInternStats kreuzberg_string_intern_stats(void);
Format Conversion Utilities¶
Utilities for parsing and serializing configuration enums.
kreuzberg_code_block_style_to_string¶
kreuzberg_heading_style_to_string¶
kreuzberg_highlight_style_to_string¶
kreuzberg_list_indent_type_to_string¶
kreuzberg_newline_style_to_string¶
kreuzberg_parse_code_block_style¶
kreuzberg_parse_heading_style¶
kreuzberg_parse_highlight_style¶
kreuzberg_parse_list_indent_type¶
kreuzberg_parse_newline_style¶
kreuzberg_parse_preprocessing_preset¶
kreuzberg_parse_whitespace_mode¶
kreuzberg_preprocessing_preset_to_string¶
kreuzberg_whitespace_mode_to_string¶
Thread Safety¶
All extraction functions are thread-safe. Error state is stored in thread-local storage, so each thread maintains its own independent error context. Multiple threads can call extraction functions concurrently without interference.
Key points:
- kreuzberg_last_error() and kreuzberg_get_error_details() return per-thread state
- CExtractionResultView structs are NOT thread-safe (used in streaming callbacks)
- ResultPool uses internal mutex synchronization; safe for concurrent access but may serialize operations
- For maximum parallel throughput, use separate pools per thread
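The idea of per-thread error state can be illustrated with standard C11 thread-local storage. This is a conceptual sketch only: the demo_* names are hypothetical, and nothing here claims to match how Kreuzberg implements its error storage internally.

```c
#include <stdio.h>

/* One error buffer per thread. _Thread_local is standard C11; this mirrors
 * the idea of per-thread error state, not Kreuzberg's actual implementation. */
static _Thread_local char demo_last_error[256];

/* Hypothetical setter: records a message for the calling thread only.
 * Messages longer than the buffer are truncated by snprintf. */
void demo_set_error(const char *message) {
    snprintf(demo_last_error, sizeof demo_last_error, "%s", message);
}

/* Hypothetical getter: returns the calling thread's message, or NULL if
 * no error has been recorded on this thread. */
const char *demo_last_error_message(void) {
    return demo_last_error[0] != '\0' ? demo_last_error : NULL;
}

/* Hypothetical reset: clears the calling thread's message. */
void demo_clear_error(void) {
    demo_last_error[0] = '\0';
}
```

Each thread gets its own zero-initialized copy of the buffer, so a message recorded on one thread is invisible to every other thread. That is why an error should be read on the same thread that made the failing call.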
LLM Integration¶
Kreuzberg integrates with LLMs via the liter-llm crate for structured extraction and VLM-based OCR. The C FFI layer accepts LLM configuration as JSON strings serialized into the config parameter. See the LLM Integration Guide for full details.
Structured Extraction¶
Pass a JSON config string containing structured_extraction to the extraction function:
#include "kreuzberg.h"
#include <stdio.h>
int main(void) {
const char *config_json =
"{"
" \"structured_extraction\": {"
" \"schema\": {"
" \"type\": \"object\","
" \"properties\": {"
" \"title\": {\"type\": \"string\"},"
" \"authors\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}},"
" \"date\": {\"type\": \"string\"}"
" },"
" \"required\": [\"title\", \"authors\", \"date\"],"
" \"additionalProperties\": false"
" },"
" \"llm\": {\"model\": \"openai/gpt-4o-mini\"},"
" \"strict\": true"
" }"
"}";
CExtractionResult *result = kreuzberg_extract_file_sync_with_config("paper.pdf", config_json);
if (result) {
const char *structured = kreuzberg_result_structured_output(result);
if (structured) {
printf("Structured output: %s\n", structured);
}
kreuzberg_free_result(result);
}
return 0;
}
VLM OCR¶
Pass OCR config with backend: "vlm" and a vlm_config in the JSON config string:
const char *config_json =
"{"
" \"force_ocr\": true,"
" \"ocr\": {"
" \"backend\": \"vlm\","
" \"vlm_config\": {\"model\": \"openai/gpt-4o-mini\"}"
" }"
"}";
CExtractionResult *result = kreuzberg_extract_file_sync_with_config("scan.pdf", config_json);
For configuration details including API keys, model selection, and provider setup, see the LLM Integration Guide.
Related Resources¶
- Header file: crates/kreuzberg-ffi/kreuzberg.h (auto-generated by cbindgen)
- FFI crate: crates/kreuzberg-ffi/ (Rust implementation)
- Rust core: crates/kreuzberg/ (extraction engine)
- Go bindings: packages/go/v4/ (Go wrapper over this C API)