TypeScript API Reference¶
Complete reference for the Kreuzberg native TypeScript/Node.js API (@kreuzberg/node).
WASM Alternative
This reference covers native bindings (@kreuzberg/node) for Node.js, Bun, and Deno.
For browser and edge environments, see the WASM API Reference (@kreuzberg/wasm).
Installation¶
Or with other package managers:
Core Functions¶
extractFileSync()¶
Extract content from a file (synchronous).
Signature:
function extractFileSync(
filePath: string,
mimeType: string | null = null,
config: ExtractionConfig | null = null
): ExtractionResult
Parameters:
filePath(string): Path to the file to extractmimeType(string | null): Optional MIME type hint. If null, MIME type is auto-detected from file extension and contentconfig(ExtractionConfig | null): Extraction configuration. Uses defaults if null
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Throws:
Error: Base error for all extraction failures (validation, parsing, OCR, etc.)
Example - Basic usage:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf');
console.log(result.content);
console.log(`Pages: ${result.metadata.pageCount}`);
Example - With OCR:
import { extractFileSync } from '@kreuzberg/node';
const config = {
ocr: {
backend: 'tesseract',
language: 'eng'
}
};
const result = extractFileSync('scanned.pdf', null, config);
Example - With explicit MIME type:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf', 'application/pdf');
extractFile()¶
Extract content from a file (asynchronous).
Signature:
async function extractFile(
filePath: string,
mimeType: string | null = null,
config: ExtractionConfig | null = null
): Promise<ExtractionResult>
Parameters:
Same as extractFileSync().
Returns:
Promise<ExtractionResult>: Promise resolving to extraction result
Examples:
import { extractFile } from '@kreuzberg/node';
async function main() {
const result = await extractFile('document.pdf');
console.log(result.content);
}
main();
extractBytesSync()¶
Extract content from bytes (synchronous).
Signature:
function extractBytesSync(
data: Uint8Array,
mimeType: string,
config: ExtractionConfig | null = null
): ExtractionResult
Parameters:
data(Uint8Array): File content as Uint8ArraymimeType(string): MIME type of the data (required for format detection)config(ExtractionConfig | null): Extraction configuration. Uses defaults if null
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Examples:
import { extractBytesSync } from '@kreuzberg/node';
import { readFileSync } from 'fs';
const data = readFileSync('document.pdf');
const result = extractBytesSync(data, 'application/pdf');
console.log(result.content);
extractBytes()¶
Extract content from bytes (asynchronous).
Signature:
async function extractBytes(
data: Uint8Array,
mimeType: string,
config: ExtractionConfig | null = null
): Promise<ExtractionResult>
Parameters:
Same as extractBytesSync().
Returns:
Promise<ExtractionResult>: Promise resolving to extraction result
batchExtractFilesSync()¶
Extract content from multiple files in parallel (synchronous).
Signature:
function batchExtractFilesSync(
paths: string[],
config: ExtractionConfig | null = null
): ExtractionResult[]
Parameters:
paths(string[]): Array of file paths to extractconfig(ExtractionConfig | null): Extraction configuration applied to all files
Returns:
ExtractionResult[]: Array of extraction results (one per file)
Examples:
import { batchExtractFilesSync } from '@kreuzberg/node';
const paths = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
const results = batchExtractFilesSync(paths);
results.forEach((result, i) => {
console.log(`${paths[i]}: ${result.content.length} characters`);
});
batchExtractFiles()¶
Extract content from multiple files in parallel (asynchronous).
Signature:
async function batchExtractFiles(
paths: string[],
config: ExtractionConfig | null = null
): Promise<ExtractionResult[]>
Parameters:
Same as batchExtractFilesSync().
Returns:
Promise<ExtractionResult[]>: Promise resolving to array of extraction results
Examples:
import { batchExtractFiles } from '@kreuzberg/node';
const files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
const results = await batchExtractFiles(files);
for (const result of results) {
console.log(result.content);
}
batchExtractBytesSync()¶
Extract content from multiple byte arrays in parallel (synchronous).
Signature:
function batchExtractBytesSync(
dataList: Uint8Array[],
mimeTypes: string[],
config: ExtractionConfig | null = null
): ExtractionResult[]
Parameters:
dataList(Uint8Array[]): Array of file contents as Uint8ArraymimeTypes(string[]): Array of MIME types (one per data item, same length as dataList)config(ExtractionConfig | null): Extraction configuration applied to all items
Returns:
ExtractionResult[]: Array of extraction results (one per data item)
batchExtractBytes()¶
Extract content from multiple byte arrays in parallel (asynchronous).
Signature:
async function batchExtractBytes(
dataList: Uint8Array[],
mimeTypes: string[],
config: ExtractionConfig | null = null
): Promise<ExtractionResult[]>
Parameters:
Same as batchExtractBytesSync().
Returns:
Promise<ExtractionResult[]>: Promise resolving to array of extraction results
Configuration¶
ExtractionConfig¶
Main configuration interface for extraction operations.
Type Definition:
interface ExtractionConfig {
ocr?: OcrConfig | null;
forceOcr?: boolean;
pdfOptions?: PdfConfig | null;
chunking?: ChunkingConfig | null;
languageDetection?: LanguageDetectionConfig | null;
tokenReduction?: TokenReductionConfig | null;
imageExtraction?: ImageExtractionConfig | null;
postProcessor?: PostProcessorConfig | null;
}
Fields:
ocr(OcrConfig | null): OCR configuration. Default: null (no OCR)forceOcr(boolean): Force OCR even for text-based PDFs. Default: falsepdfOptions(PdfConfig | null): PDF-specific configuration. Default: nullchunking(ChunkingConfig | null): Text chunking configuration. Default: nulllanguageDetection(LanguageDetectionConfig | null): Language detection configuration. Default: nulltokenReduction(TokenReductionConfig | null): Token reduction configuration. Default: nullimageExtraction(ImageExtractionConfig | null): Image extraction from documents. Default: nullpostProcessor(PostProcessorConfig | null): Post-processing configuration. Default: null
Example:
import { extractFileSync, ExtractionConfig } from '@kreuzberg/node';
const config: ExtractionConfig = {
ocr: {
backend: 'tesseract',
language: 'eng'
},
forceOcr: false,
pdfOptions: {
passwords: ['password1', 'password2'],
extractImages: true
}
};
const result = extractFileSync('document.pdf', null, config);
ExtractionConfig Static Methods¶
The ExtractionConfig object provides static methods for loading configuration from files and discovering configuration files in the filesystem.
ExtractionConfig.fromFile()¶
Load extraction configuration from a file.
Signature:
Parameters:
filePath(string): Path to the configuration file (absolute or relative). Supports.toml,.yaml,.jsonformats
Returns:
ExtractionConfig: Configuration object loaded from the file
Throws:
Error: If file does not exist, is not readable, or contains invalid configuration
Example:
import { ExtractionConfig, extractFileSync } from '@kreuzberg/node';
const config = ExtractionConfig.fromFile('kreuzberg.toml');
const result = extractFileSync('document.pdf', null, config);
console.log(result.content);
ExtractionConfig.discover()¶
Discover and load configuration from current or parent directories.
Signature:
Returns:
ExtractionConfig | null: Configuration object if found, ornullif no configuration file exists
Description:
Searches for a kreuzberg.toml, kreuzberg.yaml, or kreuzberg.json file starting from the current working directory and traversing up the directory tree. Returns the first configuration file found.
Example:
import { ExtractionConfig, extractFile } from '@kreuzberg/node';
const config = ExtractionConfig.discover();
if (config) {
const result = await extractFile('document.pdf', null, config);
console.log('Extracted using discovered config');
console.log(result.content);
} else {
const result = await extractFile('document.pdf', null, null);
console.log('No config file found, using defaults');
console.log(result.content);
}
OcrConfig¶
OCR processing configuration.
Type Definition:
interface OcrConfig {
backend: string;
language: string;
tesseractConfig?: TesseractConfig | null;
}
Fields:
backend(string): OCR backend to use. Options: "tesseract", "guten-ocr". Default: "tesseract"language(string): Language code for OCR (ISO 639-3). Default: "eng"tesseractConfig(TesseractConfig | null): Tesseract-specific configuration. Default: null
Example:
TesseractConfig¶
Tesseract OCR backend configuration.
Type Definition:
interface TesseractConfig {
psm?: number;
oem?: number;
enableTableDetection?: boolean;
tesseditCharWhitelist?: string | null;
tesseditCharBlacklist?: string | null;
}
Fields:
psm(number): Page segmentation mode (0-13). Default: 3 (auto)oem(number): OCR engine mode (0-3). Default: 3 (LSTM only)enableTableDetection(boolean): Enable table detection and extraction. Default: falsetesseditCharWhitelist(string | null): Character whitelist (e.g., "0123456789" for digits only). Default: nulltesseditCharBlacklist(string | null): Character blacklist. Default: null
Example:
const config: ExtractionConfig = {
ocr: {
backend: 'tesseract',
language: 'eng',
tesseractConfig: {
psm: 6,
enableTableDetection: true,
tesseditCharWhitelist: '0123456789'
}
}
};
PdfConfig¶
PDF-specific configuration.
Type Definition:
interface PdfConfig {
passwords?: string[] | null;
extractImages?: boolean;
imageDpi?: number;
}
Fields:
passwords(string[] | null): List of passwords to try for encrypted PDFs. Default: nullextractImages(boolean): Extract images from PDF. Default: falseimageDpi(number): DPI for image extraction. Default: 300
Example:
const pdfConfig: PdfConfig = {
passwords: ['password1', 'password2'],
extractImages: true,
imageDpi: 300
};
ChunkingConfig¶
Text chunking configuration for splitting long documents.
Type Definition:
interface ChunkingConfig {
chunkSize?: number;
chunkOverlap?: number;
chunkingStrategy?: string;
}
Fields:
chunkSize(number): Maximum chunk size in tokens. Default: 512chunkOverlap(number): Overlap between chunks in tokens. Default: 50chunkingStrategy(string): Chunking strategy. Options: "fixed", "semantic". Default: "fixed"
LanguageDetectionConfig¶
Language detection configuration.
Type Definition:
Fields:
enabled(boolean): Enable language detection. Default: trueconfidenceThreshold(number): Minimum confidence threshold (0.0-1.0). Default: 0.5
ImageExtractionConfig¶
Image extraction configuration.
Type Definition:
interface ImageExtractionConfig {
enabled?: boolean;
minWidth?: number;
minHeight?: number;
}
Fields:
enabled(boolean): Enable image extraction from documents. Default: falseminWidth(number): Minimum image width in pixels. Default: 100minHeight(number): Minimum image height in pixels. Default: 100
TokenReductionConfig¶
Token reduction configuration for compressing extracted text.
Type Definition:
Fields:
enabled(boolean): Enable token reduction. Default: falsestrategy(string): Reduction strategy. Options: "whitespace", "stemming". Default: "whitespace"
PostProcessorConfig¶
Post-processing configuration.
Type Definition:
Fields:
enabled(boolean): Enable post-processing. Default: trueprocessors(string[] | null): List of processor names to enable. Default: null (all registered processors)
Results & Types¶
ExtractionResult¶
Result object returned by all extraction functions.
Type Definition:
interface ExtractionResult {
content: string;
mimeType: string;
metadata: Metadata;
tables: Table[];
detectedLanguages: string[] | null;
}
Fields:
content(string): Extracted text contentmimeType(string): MIME type of the processed documentmetadata(Metadata): Document metadata (format-specific fields)tables(Table[]): Array of extracted tablesdetectedLanguages(string[] | null): Array of detected language codes (ISO 639-1) if language detection is enabledpages(PageContent[] | undefined): Per-page extracted content when page extraction is enabled viaPageConfig.extractPages = true
Example:
const result = extractFileSync('document.pdf');
console.log(`Content: ${result.content}`);
console.log(`MIME type: ${result.mimeType}`);
console.log(`Page count: ${result.metadata.pageCount}`);
console.log(`Tables: ${result.tables.length}`);
if (result.detectedLanguages) {
console.log(`Languages: ${result.detectedLanguages.join(', ')}`);
}
pages¶
Type: PageContent[] | undefined
Per-page extracted content when page extraction is enabled via PageConfig.extractPages = true.
Each page contains: - Page number (1-indexed) - Text content for that page - Tables on that page - Images on that page
Example:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf', null, {
pages: {
extractPages: true
}
});
if (result.pages) {
for (const page of result.pages) {
console.log(`Page ${page.pageNumber}:`);
console.log(` Content: ${page.content.length} chars`);
console.log(` Tables: ${page.tables.length}`);
console.log(` Images: ${page.images.length}`);
}
}
Accessing Per-Page Content¶
When page extraction is enabled, access individual pages and iterate over them:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf', null, {
pages: {
extractPages: true,
insertPageMarkers: true,
markerFormat: '\n\n--- Page {page_num} ---\n\n'
}
});
// Access combined content with page markers
console.log('Combined content with markers:');
console.log(result.content.substring(0, 500));
console.log();
// Access per-page content
if (result.pages) {
for (const page of result.pages) {
console.log(`Page ${page.pageNumber}:`);
console.log(` ${page.content.substring(0, 100)}...`);
if (page.tables.length > 0) {
console.log(` Found ${page.tables.length} table(s)`);
}
if (page.images.length > 0) {
console.log(` Found ${page.images.length} image(s)`);
}
}
}
Metadata¶
Document metadata with format-specific fields.
Type Definition:
interface Metadata {
// Common fields
language?: string;
date?: string;
subject?: string;
formatType?: string;
// PDF-specific fields
title?: string;
author?: string;
pageCount?: number;
creationDate?: string;
modificationDate?: string;
creator?: string;
producer?: string;
keywords?: string;
// Excel-specific fields
sheetCount?: number;
sheetNames?: string[];
// Email-specific fields
fromEmail?: string;
fromName?: string;
toEmails?: string[];
ccEmails?: string[];
bccEmails?: string[];
messageId?: string;
attachments?: string[];
// Additional fields...
[key: string]: any;
}
Common Fields:
language(string): Document language (ISO 639-1 code)date(string): Document date (ISO 8601 format)subject(string): Document subjectformatType(string): Format discriminator ("pdf", "excel", "email", etc.)
PDF-Specific Fields (when formatType === "pdf"):
title(string): PDF titleauthor(string): PDF authorpageCount(number): Number of pagescreationDate(string): Creation date (ISO 8601)modificationDate(string): Modification date (ISO 8601)creator(string): Creator applicationproducer(string): Producer applicationkeywords(string): PDF keywords
Example:
const result = extractFileSync('document.pdf');
const metadata = result.metadata;
if (metadata.formatType === 'pdf') {
console.log(`Title: ${metadata.title}`);
console.log(`Author: ${metadata.author}`);
console.log(`Pages: ${metadata.pageCount}`);
}
See the Types Reference for complete metadata field documentation.
Table¶
Extracted table structure.
Type Definition:
Fields:
cells(string[][]): 2D array of table cells (rows x columns)markdown(string): Table rendered as markdownpageNumber(number): Page number where table was found
Example:
const result = extractFileSync('invoice.pdf');
for (const table of result.tables) {
console.log(`Table on page ${table.pageNumber}:`);
console.log(table.markdown);
console.log();
}
ChunkMetadata¶
Metadata for a single text chunk.
Type Definition:
interface ChunkMetadata {
byteStart: number;
byteEnd: number;
charCount: number;
tokenCount?: number;
firstPage?: number;
lastPage?: number;
}
Fields:
byteStart(number): UTF-8 byte offset in content (inclusive)byteEnd(number): UTF-8 byte offset in content (exclusive)charCount(number): Number of characters in chunktokenCount(number | undefined): Estimated token count (if configured)firstPage(number | undefined): First page this chunk appears on (1-indexed, only when page boundaries available)lastPage(number | undefined): Last page this chunk appears on (1-indexed, only when page boundaries available)
Page tracking: When PageStructure.boundaries is available and chunking is enabled, firstPage and lastPage are automatically calculated based on byte offsets.
Example:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf', null, {
chunking: { chunkSize: 500, chunkOverlap: 50 },
pages: { extractPages: true }
});
if (result.chunks) {
for (const chunk of result.chunks) {
const meta = chunk.metadata;
let pageInfo = '';
if (meta.firstPage !== undefined) {
if (meta.firstPage === meta.lastPage) {
pageInfo = ` (page ${meta.firstPage})`;
} else {
pageInfo = ` (pages ${meta.firstPage}-${meta.lastPage})`;
}
}
console.log(
`Chunk [${meta.byteStart}:${meta.byteEnd}]: ${meta.charCount} chars${pageInfo}`
);
}
}
Extensibility¶
Custom Post-Processors¶
Create custom post-processors to add processing logic to the extraction pipeline.
Protocol:
interface PostProcessorProtocol {
name(): string;
process(result: ExtractionResult): ExtractionResult;
processingStage(): string;
}
Example:
import { registerPostProcessor, extractFileSync } from '@kreuzberg/node';
class CustomProcessor implements PostProcessorProtocol {
name(): string {
return 'custom_processor';
}
process(result: ExtractionResult): ExtractionResult {
result.metadata.customField = 'custom_value';
return result;
}
processingStage(): string {
return 'middle';
}
}
registerPostProcessor(new CustomProcessor());
const result = extractFileSync('document.pdf');
console.log(result.metadata.customField);
Managing Processors:
import {
registerPostProcessor,
unregisterPostProcessor,
clearPostProcessors
} from '@kreuzberg/node';
registerPostProcessor(new CustomProcessor());
unregisterPostProcessor('custom_processor');
clearPostProcessors();
Custom Validators¶
Create custom validators to validate extraction results.
Protocol:
Functions:
import {
registerValidator,
unregisterValidator,
clearValidators
} from '@kreuzberg/node';
registerValidator(validator);
unregisterValidator('validator_name');
clearValidators();
Custom OCR Backends¶
Register custom OCR backends for image and PDF processing.
Example with Guten-OCR:
import { GutenOcrBackend, registerOcrBackend } from '@kreuzberg/node';
const gutenOcr = new GutenOcrBackend();
registerOcrBackend(gutenOcr);
const config = {
ocr: {
backend: 'guten-ocr',
language: 'eng'
}
};
Error Handling¶
All errors are thrown as standard JavaScript Error objects with descriptive messages.
Example:
import { extractFileSync } from '@kreuzberg/node';
try {
const result = extractFileSync('document.pdf');
console.log(result.content);
} catch (error) {
console.error(`Extraction failed: ${error.message}`);
if (error.message.includes('file not found')) {
console.error('File does not exist');
} else if (error.message.includes('parsing')) {
console.error('Failed to parse document');
} else if (error.message.includes('OCR')) {
console.error('OCR processing failed');
}
}
See Error Handling Reference for detailed error documentation.
Type Exports¶
All types are exported for use in your TypeScript code:
import type {
ExtractionConfig,
ExtractionResult,
OcrConfig,
TesseractConfig,
PdfConfig,
ChunkingConfig,
LanguageDetectionConfig,
ImageExtractionConfig,
TokenReductionConfig,
PostProcessorConfig,
Table,
Metadata,
PostProcessorProtocol,
ValidatorProtocol,
OcrBackendProtocol
} from '@kreuzberg/node';
Performance Recommendations¶
Batch Processing¶
For processing multiple documents, always use batch APIs:
// Good - Uses batch API
const batchResults = await batchExtractFiles(['doc1.pdf', 'doc2.pdf', 'doc3.pdf']);
// Bad - Multiple individual calls
const individualResults = [];
for (const file of files) {
individualResults.push(await extractFile(file));
}
Benefits of batch APIs:
- Parallel processing in Rust
- Better memory management
- Optimal resource utilization
Sync vs Async¶
- Use async functions (
extractFile,batchExtractFiles) for I/O-bound operations - Use sync functions (
extractFileSync,batchExtractFilesSync) for simple scripts or CLI tools
System Requirements¶
Node.js: 16.x or higher
Native Dependencies:
- Tesseract OCR (for OCR support):
brew install tesseract(macOS) orapt-get install tesseract-ocr(Ubuntu) - LibreOffice (for legacy Office formats):
brew install libreoffice(macOS) orapt-get install libreoffice(Ubuntu)
Platforms:
- Linux (x64, arm64)
- macOS (x64, arm64)
- Windows (x64)