TypeScript API Reference v4.0.0¶
Complete reference for the Kreuzberg native TypeScript/Node.js API (@kreuzberg/node).
WASM Alternative
This reference covers native bindings (@kreuzberg/node) for Node.js, Bun, and Deno.
For browser and edge environments, see the WASM API Reference (@kreuzberg/wasm).
Installation¶
npm install @kreuzberg/node
Or with other package managers: pnpm add @kreuzberg/node, yarn add @kreuzberg/node, or bun add @kreuzberg/node.
Core Functions¶
batchExtractBytes()¶
Extract content from multiple byte arrays in parallel (asynchronous).
Signature:
async function batchExtractBytes(
dataList: Uint8Array[],
mimeTypes: string[],
config: ExtractionConfig | null = null
): Promise<ExtractionResult[]>
Parameters:
Same as batchExtractBytesSync().
Returns:
Promise<ExtractionResult[]>: Promise resolving to array of extraction results
batchExtractBytesSync()¶
Extract content from multiple byte arrays in parallel (synchronous).
Signature:
function batchExtractBytesSync(
dataList: Uint8Array[],
mimeTypes: string[],
config: ExtractionConfig | null = null
): ExtractionResult[]
Parameters:
- `dataList` (Uint8Array[]): Array of file contents as Uint8Array
- `mimeTypes` (string[]): Array of MIME types (one per data item, same length as dataList)
- `config` (ExtractionConfig | null): Extraction configuration applied to all items
Returns:
ExtractionResult[]: Array of extraction results (one per data item)
batchExtractFiles()¶
Extract content from multiple files in parallel (asynchronous).
Signature:
async function batchExtractFiles(
paths: string[],
config: ExtractionConfig | null = null
): Promise<ExtractionResult[]>
Parameters:
Same as batchExtractFilesSync().
Returns:
Promise<ExtractionResult[]>: Promise resolving to array of extraction results
Examples:
import { batchExtractFiles } from '@kreuzberg/node';
const files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
const results = await batchExtractFiles(files);
for (const result of results) {
console.log(result.content);
}
batchExtractFilesSync()¶
Extract content from multiple files in parallel (synchronous).
Signature:
function batchExtractFilesSync(
paths: string[],
config: ExtractionConfig | null = null
): ExtractionResult[]
Parameters:
- `paths` (string[]): Array of file paths to extract
- `config` (ExtractionConfig | null): Extraction configuration applied to all files
Returns:
ExtractionResult[]: Array of extraction results (one per file)
Examples:
import { batchExtractFilesSync } from '@kreuzberg/node';
const paths = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
const results = batchExtractFilesSync(paths);
results.forEach((result, i) => {
console.log(`${paths[i]}: ${result.content.length} characters`);
});
batchExtractFilesWithConfigs() v4.5.0¶
Extract content from multiple files in parallel, with per-file configuration overrides (asynchronous).
Signature:
async function batchExtractFilesWithConfigs(
paths: string[],
fileConfigs: (FileExtractionConfig | null)[],
config: ExtractionConfig | null = null
): Promise<ExtractionResult[]>
Parameters:
- `paths` (string[]): Array of file paths
- `fileConfigs` ((FileExtractionConfig | null)[]): Array of per-file configs (null = use batch defaults). Must match paths length.
- `config` (ExtractionConfig | null): Batch-level extraction configuration
Returns:
Promise<ExtractionResult[]>: Promise resolving to array of extraction results
batchExtractFilesWithConfigsSync() v4.5.0¶
Synchronous variant of batchExtractFilesWithConfigs().
Signature:
function batchExtractFilesWithConfigsSync(
paths: string[],
fileConfigs: (FileExtractionConfig | null)[],
config: ExtractionConfig | null = null
): ExtractionResult[]
Example:
import { batchExtractFilesWithConfigsSync } from '@kreuzberg/node';
const results = batchExtractFilesWithConfigsSync(
['report.pdf', 'scanned.pdf', 'page.html'],
[
null, // use batch defaults
{ forceOcr: true, ocr: { backend: 'tesseract', language: 'deu' } },
{ outputFormat: 'markdown' },
],
);
batchExtractBytesWithConfigs() v4.5.0¶
Extract content from multiple byte arrays in parallel, with per-file configuration overrides (asynchronous).
Signature:
async function batchExtractBytesWithConfigs(
dataList: Uint8Array[],
mimeTypes: string[],
fileConfigs: (FileExtractionConfig | null)[],
config: ExtractionConfig | null = null
): Promise<ExtractionResult[]>
batchExtractBytesWithConfigsSync() v4.5.0¶
Synchronous variant of batchExtractBytesWithConfigs().
Signature:
function batchExtractBytesWithConfigsSync(
dataList: Uint8Array[],
mimeTypes: string[],
fileConfigs: (FileExtractionConfig | null)[],
config: ExtractionConfig | null = null
): ExtractionResult[]
extractBytes()¶
Extract content from bytes (asynchronous).
Signature:
async function extractBytes(
data: Uint8Array,
mimeType: string,
config: ExtractionConfig | null = null
): Promise<ExtractionResult>
Parameters:
Same as extractBytesSync().
Returns:
Promise<ExtractionResult>: Promise resolving to extraction result
extractBytesSync()¶
Extract content from bytes (synchronous).
Signature:
function extractBytesSync(
data: Uint8Array,
mimeType: string,
config: ExtractionConfig | null = null
): ExtractionResult
Parameters:
- `data` (Uint8Array): File content as Uint8Array
- `mimeType` (string): MIME type of the data (required for format detection)
- `config` (ExtractionConfig | null): Extraction configuration. Uses defaults if null
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Examples:
import { extractBytesSync } from '@kreuzberg/node';
import { readFileSync } from 'fs';
const data = readFileSync('document.pdf');
const result = extractBytesSync(data, 'application/pdf');
console.log(result.content);
extractFile()¶
Extract content from a file (asynchronous).
Signature:
async function extractFile(
filePath: string,
mimeType: string | null = null,
config: ExtractionConfig | null = null
): Promise<ExtractionResult>
Parameters:
Same as extractFileSync().
Returns:
Promise<ExtractionResult>: Promise resolving to extraction result
Examples:
import { extractFile } from '@kreuzberg/node';
async function main() {
const result = await extractFile('document.pdf');
console.log(result.content);
}
main();
extractFileSync()¶
Extract content from a file (synchronous).
Signature:
function extractFileSync(
filePath: string,
mimeType: string | null = null,
config: ExtractionConfig | null = null
): ExtractionResult
Parameters:
- `filePath` (string): Path to the file to extract
- `mimeType` (string | null): Optional MIME type hint. If null, the MIME type is auto-detected from the file extension and content
- `config` (ExtractionConfig | null): Extraction configuration. Uses defaults if null
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Throws:
Error: Base error for all extraction failures (validation, parsing, OCR, etc.)
Example - Basic usage:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf');
console.log(result.content);
console.log(`Pages: ${result.metadata.pageCount}`);
Example - With OCR:
import { extractFileSync } from '@kreuzberg/node';
const config = {
ocr: {
backend: 'tesseract',
language: 'eng'
}
};
const result = extractFileSync('scanned.pdf', null, config);
Example - With explicit MIME type:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf', 'application/pdf');
Configuration¶
ExtractionConfig¶
Main configuration interface for extraction operations.
Type Definition:
interface ExtractionConfig {
chunking?: ChunkingConfig | null;
concurrency?: ConcurrencyConfig | null; // v4.5.0
enableQualityProcessing?: boolean;
forceOcr?: boolean;
htmlOptions?: HtmlConversionOptions | null;
imageExtraction?: ImageExtractionConfig | null;
includeDocumentStructure?: boolean;
keywords?: KeywordConfig | null;
languageDetection?: LanguageDetectionConfig | null;
maxConcurrentExtractions?: number;
ocr?: OcrConfig | null;
outputFormat?: "plain" | "markdown" | "djot" | "html";
pages?: PageConfig | null;
pdfOptions?: PdfConfig | null;
postProcessor?: PostProcessorConfig | null;
resultFormat?: "unified" | "element_based";
securityLimits?: Record<string, number> | null;
tokenReduction?: TokenReductionConfig | null;
useCache?: boolean;
}
Fields:
- `useCache` (boolean): Enable caching for identical inputs. Default: true
- `enableQualityProcessing` (boolean): Enable built-in filters to improve extraction reliability. Default: false
- `ocr` (OcrConfig | null): OCR configuration. Default: null (no OCR)
- `forceOcr` (boolean): Force OCR even for text-based PDFs. Default: false
- `pdfOptions` (PdfConfig | null): PDF-specific configuration. Default: null
- `chunking` (ChunkingConfig | null): Text chunking configuration. Default: null
- `concurrency` (ConcurrencyConfig | null): Concurrency configuration. Default: null
- `imageExtraction` (ImageExtractionConfig | null): Image extraction from documents. Default: null
- `languageDetection` (LanguageDetectionConfig | null): Language detection configuration. Default: null
- `tokenReduction` (TokenReductionConfig | null): Token reduction configuration. Default: null
- `postProcessor` (PostProcessorConfig | null): Post-processing configuration. Default: null
- `htmlOptions` (HtmlConversionOptions | null): HTML to Markdown conversion options. Default: null
- `keywords` (KeywordConfig | null): Keyword extraction configuration. Default: null
- `pages` (PageConfig | null): Page tracking and continuous extraction options. Default: null
- `securityLimits` (Record<string, number> | null): Safety limits (archive recursion, XML depth, etc.)
- `maxConcurrentExtractions` (number): Maximum parallel tasks for batching. Default: 4
- `outputFormat` ("plain" | "markdown" | "djot" | "html"): Generated text format. Default: "plain"
- `resultFormat` ("unified" | "element_based"): Shape of extraction results. Default: "unified"
- `includeDocumentStructure` (boolean): Construct hierarchical document tree. Default: false
Example:
import { extractFileSync, ExtractionConfig } from '@kreuzberg/node';
const config: ExtractionConfig = {
ocr: {
backend: 'tesseract',
language: 'eng'
},
forceOcr: false,
pdfOptions: {
passwords: ['password1', 'password2'],
extractImages: true
}
};
const result = extractFileSync('document.pdf', null, config);
ExtractionConfig Static Methods¶
The ExtractionConfig object provides static methods for loading configuration from files and discovering configuration files in the filesystem.
ExtractionConfig.fromFile()¶
Load extraction configuration from a file.
Signature:
static fromFile(filePath: string): ExtractionConfig
Parameters:
- `filePath` (string): Path to the configuration file (absolute or relative). Supports .toml, .yaml, and .json formats
Returns:
ExtractionConfig: Configuration object loaded from the file
Throws:
Error: If file does not exist, is not readable, or contains invalid configuration
Example:
import { ExtractionConfig, extractFileSync } from '@kreuzberg/node';
const config = ExtractionConfig.fromFile('kreuzberg.toml');
const result = extractFileSync('document.pdf', null, config);
console.log(result.content);
ExtractionConfig.discover()¶
Discover and load configuration from current or parent directories.
Signature:
static discover(): ExtractionConfig | null
Returns:
ExtractionConfig | null: Configuration object if found, or null if no configuration file exists
Description:
Searches for a kreuzberg.toml, kreuzberg.yaml, or kreuzberg.json file starting from the current working directory and traversing up the directory tree. Returns the first configuration file found.
Example:
import { ExtractionConfig, extractFile } from '@kreuzberg/node';
const config = ExtractionConfig.discover();
if (config) {
const result = await extractFile('document.pdf', null, config);
console.log('Extracted using discovered config');
console.log(result.content);
} else {
const result = await extractFile('document.pdf', null, null);
console.log('No config file found, using defaults');
console.log(result.content);
}
FileExtractionConfig v4.5.0¶
Per-file extraction configuration overrides for batch operations. All fields are optional — undefined or omitted means "use the batch-level default."
Type Definition:
interface FileExtractionConfig {
enableQualityProcessing?: boolean;
ocr?: OcrConfig | null;
forceOcr?: boolean;
chunking?: ChunkingConfig | null;
imageExtraction?: ImageExtractionConfig | null;
pdfOptions?: PdfConfig | null;
tokenReduction?: TokenReductionConfig | null;
languageDetection?: LanguageDetectionConfig | null;
pages?: PageConfig | null;
keywords?: KeywordConfig | null;
postProcessor?: PostProcessorConfig | null;
htmlOptions?: HtmlConversionOptions | null;
resultFormat?: "unified" | "element_based";
outputFormat?: "plain" | "markdown" | "djot" | "html";
includeDocumentStructure?: boolean;
}
Example:
import type { FileExtractionConfig } from '@kreuzberg/node';
const perFileConfig: FileExtractionConfig = {
forceOcr: true,
ocr: { backend: 'tesseract', language: 'deu' },
};
See Configuration Reference for full details on merge semantics and excluded batch-level fields.
OcrConfig¶
OCR processing configuration.
Type Definition:
interface OcrConfig {
backend: string;
language: string;
tesseractConfig?: TesseractConfig | null;
modelTier?: string | null; // v4.5.0, PaddleOCR only
padding?: number | null; // v4.5.0, PaddleOCR only
}
Fields:
- `backend` (string): OCR backend to use. Options: "tesseract", "paddle-ocr". Default: "tesseract"
- `language` (string): Language code for OCR (ISO 639-3). Default: "eng"
- `tesseractConfig` (TesseractConfig | null): Tesseract-specific configuration. Default: null
- `modelTier` (string | null) v4.5.0: PaddleOCR model tier: "mobile" (lightweight, ~21MB total, fast) or "server" (high accuracy, ~172MB, best with GPU). Default: "mobile"
- `padding` (number | null) v4.5.0: Padding in pixels (0-100) added around the image before PaddleOCR detection. Default: 10
Example:
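A minimal sketch of both backends (modelTier is the v4.5.0 PaddleOCR field described above; language codes follow the existing examples in this reference):

```typescript
import type { ExtractionConfig } from '@kreuzberg/node';

// Tesseract with German language data (ISO 639-3 code).
const tesseractConfig: ExtractionConfig = {
  ocr: { backend: 'tesseract', language: 'deu' },
};

// PaddleOCR with the lightweight mobile model tier (v4.5.0+).
const paddleConfig: ExtractionConfig = {
  ocr: { backend: 'paddle-ocr', language: 'en', modelTier: 'mobile' },
};
```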
TesseractConfig¶
Tesseract OCR backend configuration.
Type Definition:
interface TesseractConfig {
psm?: number;
oem?: number;
enableTableDetection?: boolean;
tesseditCharWhitelist?: string | null;
tesseditCharBlacklist?: string | null;
}
Fields:
- `psm` (number): Page segmentation mode (0-13). Default: 3 (auto)
- `oem` (number): OCR engine mode (0-3). Default: 3 (LSTM only)
- `enableTableDetection` (boolean): Enable table detection and extraction. Default: false
- `tesseditCharWhitelist` (string | null): Character whitelist (for example, "0123456789" for digits only). Default: null
- `tesseditCharBlacklist` (string | null): Character blacklist. Default: null
Example:
import type { ExtractionConfig } from '@kreuzberg/node';
const config: ExtractionConfig = {
ocr: {
backend: 'tesseract',
language: 'eng',
tesseractConfig: {
psm: 6,
enableTableDetection: true,
tesseditCharWhitelist: '0123456789'
}
}
};
PdfConfig¶
PDF-specific configuration.
Type Definition:
interface PdfConfig {
allowSingleColumnTables?: boolean;
passwords?: string[] | null;
extractImages?: boolean;
imageDpi?: number;
}
Fields:
- `allowSingleColumnTables` (boolean) v4.5.0: Allow extraction of single-column tables. Default: false
- `passwords` (string[] | null): List of passwords to try for encrypted PDFs. Default: null
- `extractImages` (boolean): Extract images from PDF. Default: false
- `imageDpi` (number): DPI for image extraction. Default: 300
Example:
import type { PdfConfig } from '@kreuzberg/node';
const pdfConfig: PdfConfig = {
allowSingleColumnTables: false,
passwords: ['password1', 'password2'],
extractImages: true,
imageDpi: 300
};
ConcurrencyConfig v4.5.0¶
Concurrency configuration for controlling parallel extraction.
Type Definition:
interface ConcurrencyConfig {
maxThreads?: number | null;
}
Fields:
- `maxThreads` (number | null): Maximum number of concurrent threads. Default: null (use system default)
Example:
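A sketch that caps a batch extraction at four worker threads (file names are illustrative):

```typescript
import { batchExtractFilesSync } from '@kreuzberg/node';

// Limit the native thread pool rather than relying on the system default.
const results = batchExtractFilesSync(
  ['a.pdf', 'b.pdf', 'c.pdf'],
  { concurrency: { maxThreads: 4 } },
);
```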
LayoutDetectionConfig v4.5.0¶
Configuration for ONNX-based document layout analysis.
Type Definition:
interface LayoutDetectionConfig {
preset?: string; // "fast" or "accurate" (default)
confidenceThreshold?: number; // minimum detection confidence
applyHeuristics?: boolean; // post-processing refinement (default: true)
}
Fields:
- `preset` (string): Model preset: "accurate" (RT-DETR v2, 17 classes) or "fast" (YOLO DocLayNet, 11 classes). Default: "accurate"
- `confidenceThreshold` (number | null): Minimum confidence for layout detections. Default: null (model default)
- `applyHeuristics` (boolean): Apply post-processing heuristics to refine results. Default: true
Example:
const config: ExtractionConfig = {
layout: {
preset: "accurate"
},
acceleration: {
provider: "cuda"
}
};
ChunkingConfig¶
Text chunking configuration for splitting long documents.
Type Definition:
interface ChunkingConfig {
maxChars?: number;
maxOverlap?: number;
embedding?: EmbeddingConfig | null;
preset?: string | null;
chunkerType?: string | null;
sizingType?: "characters" | "tokenizer" | null;
sizingModel?: string | null;
sizingCacheDir?: string | null;
prependHeadingContext?: boolean | null;
}
Fields:
- `maxChars` (number): Maximum characters per chunk. Default: 1000
- `maxOverlap` (number): Overlap between chunks in characters. Default: 200
- `embedding` (EmbeddingConfig | null): Embedding configuration for generating embeddings. Default: null
- `preset` (string | null): Chunking preset to use. Default: null
- `sizingType` ("characters" | "tokenizer" | null): How chunk size is measured. Use "tokenizer" to measure by token count using a HuggingFace tokenizer. Default: null (characters)
- `sizingModel` (string | null): HuggingFace model ID for tokenizer-based sizing (for example "bert-base-uncased"). Required when sizingType is "tokenizer". Default: null
- `sizingCacheDir` (string | null): Optional directory to cache downloaded tokenizer files. Default: null
- `chunkerType` (string | null): Type of chunker to use. Options: "text" (default), "markdown", "yaml". Default: null (text)
- `prependHeadingContext` (boolean | null): When true, prepends the heading hierarchy path to each chunk's content. Most useful with chunkerType: "markdown". Default: null (false)
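A sketch contrasting the two sizing modes (the tokenizer model ID is illustrative):

```typescript
import type { ExtractionConfig } from '@kreuzberg/node';

// Character-based chunks with overlap.
const byChars: ExtractionConfig = {
  chunking: { maxChars: 1000, maxOverlap: 200 },
};

// Tokenizer-based sizing with heading context, using the Markdown chunker.
const byTokens: ExtractionConfig = {
  chunking: {
    sizingType: 'tokenizer',
    sizingModel: 'bert-base-uncased',
    chunkerType: 'markdown',
    prependHeadingContext: true,
  },
};
```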
LanguageDetectionConfig¶
Language detection configuration.
Type Definition:
interface LanguageDetectionConfig {
enabled?: boolean;
confidenceThreshold?: number;
}
Fields:
- `enabled` (boolean): Enable language detection. Default: true
- `confidenceThreshold` (number): Minimum confidence threshold (0.0-1.0). Default: 0.5
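A short sketch; detected languages surface on the result's detectedLanguages field:

```typescript
import { extractFileSync } from '@kreuzberg/node';

// Require a higher confidence than the 0.5 default.
const result = extractFileSync('document.pdf', null, {
  languageDetection: { enabled: true, confidenceThreshold: 0.8 },
});
console.log(result.detectedLanguages); // e.g. ['en', 'de'] when detection succeeds
```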
ImageExtractionConfig¶
Image extraction configuration.
Type Definition:
interface ImageExtractionConfig {
enabled?: boolean;
minWidth?: number;
minHeight?: number;
}
Fields:
- `enabled` (boolean): Enable image extraction from documents. Default: false
- `minWidth` (number): Minimum image width in pixels. Default: 100
- `minHeight` (number): Minimum image height in pixels. Default: 100
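A sketch that raises the size thresholds to skip small decorative images:

```typescript
import { extractFileSync } from '@kreuzberg/node';

// Only keep images at least 200x200 pixels.
const result = extractFileSync('report.pdf', null, {
  imageExtraction: { enabled: true, minWidth: 200, minHeight: 200 },
});
console.log(result.images?.length ?? 0);
```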
TokenReductionConfig¶
Token reduction configuration for compressing extracted text.
Type Definition:
interface TokenReductionConfig {
enabled?: boolean;
strategy?: string;
}
Fields:
- `enabled` (boolean): Enable token reduction. Default: false
- `strategy` (string): Reduction strategy. Options: "whitespace", "stemming". Default: "whitespace"
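A minimal sketch using the whitespace strategy:

```typescript
import { extractFileSync } from '@kreuzberg/node';

// Collapse redundant whitespace to reduce downstream token counts.
const result = extractFileSync('document.pdf', null, {
  tokenReduction: { enabled: true, strategy: 'whitespace' },
});
```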
PostProcessorConfig¶
Post-processing configuration.
Type Definition:
interface PostProcessorConfig {
enabled?: boolean;
processors?: string[] | null;
}
Fields:
- `enabled` (boolean): Enable post-processing. Default: true
- `processors` (string[] | null): List of processor names to enable. Default: null (all registered processors)
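A sketch that runs only a named subset of registered processors (the processor name is illustrative; it matches whatever name() your registered processor returns):

```typescript
import type { ExtractionConfig } from '@kreuzberg/node';

const config: ExtractionConfig = {
  postProcessor: { enabled: true, processors: ['custom_processor'] },
};
```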
Results & Types¶
ExtractionResult¶
Result object returned by all extraction functions.
Type Definition:
interface ExtractionResult {
annotations?: PdfAnnotation[];
chunks?: Chunk[];
content: string;
detectedLanguages: string[] | null;
djotContent?: DjotContent | null;
document?: DocumentStructure | null;
elements?: Element[];
extractedKeywords?: ExtractedKeyword[];
images?: ExtractedImage[];
metadata: Metadata;
metadataJson: string;
mimeType: string;
ocrElements?: OcrElement[];
outputFormat?: string | null;
pages?: PageContent[];
processingWarnings: ProcessingWarning[];
qualityScore?: number;
resultFormat?: string | null;
tables: Table[];
}
Fields:
- `annotations` (PdfAnnotation[] | undefined): Extracted PDF annotations and highlights
- `chunks` (Chunk[] | undefined): Text chunks if chunking is enabled
- `content` (string): Extracted text content
- `detectedLanguages` (string[] | null): Array of detected language codes (ISO 639-1) if language detection is enabled
- `djotContent` (DjotContent | null): Rich structural markup
- `document` (DocumentStructure | null): Hierarchical document structure
- `elements` (Element[] | undefined): Semantic elements (headings, paragraphs, etc.)
- `extractedKeywords` (ExtractedKeyword[] | undefined): Extracted keywords (RAKE/YAKE)
- `images` (ExtractedImage[] | undefined): Extracted images if enabled
- `metadata` (Metadata): Document metadata (format-specific fields)
- `mimeType` (string): MIME type of the processed document
- `ocrElements` (OcrElement[] | undefined): Granular OCR text blocks with bounding boxes
- `pages` (PageContent[] | undefined): Per-page extracted content when page extraction is enabled via PageConfig.extractPages = true
- `processingWarnings` (ProcessingWarning[]): Non-fatal warnings encountered during extraction
- `qualityScore` (number | undefined): Document quality estimation score
- `tables` (Table[]): Array of extracted tables
Example:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf');
console.log(`Content: ${result.content}`);
console.log(`MIME type: ${result.mimeType}`);
console.log(`Page count: ${result.metadata.pageCount}`);
console.log(`Tables: ${result.tables.length}`);
if (result.detectedLanguages) {
console.log(`Languages: ${result.detectedLanguages.join(', ')}`);
}
Pages¶
Type: PageContent[] | undefined
Per-page extracted content when page extraction is enabled via PageConfig.extractPages = true.
Each page contains:
- Page number (1-indexed)
- Text content for that page
- Tables on that page
- Images on that page
Example:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf', null, {
pages: {
extractPages: true
}
});
if (result.pages) {
for (const page of result.pages) {
console.log(`Page ${page.pageNumber}:`);
console.log(` Content: ${page.content.length} chars`);
console.log(` Tables: ${page.tables.length}`);
console.log(` Images: ${page.images.length}`);
}
}
Accessing Per-Page Content¶
When page extraction is enabled, access individual pages and iterate over them:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf', null, {
pages: {
extractPages: true,
insertPageMarkers: true,
markerFormat: '\n\n--- Page {page_num} ---\n\n'
}
});
// Access combined content with page markers
console.log('Combined content with markers:');
console.log(result.content.substring(0, 500));
console.log();
// Access per-page content
if (result.pages) {
for (const page of result.pages) {
console.log(`Page ${page.pageNumber}:`);
console.log(` ${page.content.substring(0, 100)}...`);
if (page.tables.length > 0) {
console.log(` Found ${page.tables.length} table(s)`);
}
if (page.images.length > 0) {
console.log(` Found ${page.images.length} image(s)`);
}
}
}
Metadata¶
Document metadata with format-specific fields.
Type Definition:
interface Metadata {
// Standard 13 Fields
authors?: string[];
createdAt?: string;
createdBy?: string;
custom?: Record<string, any>;
date?: string;
formatType?: string;
keywords?: string[];
language?: string;
modifiedAt?: string;
modifiedBy?: string;
pageCount?: number;
producer?: string;
subject?: string;
title?: string;
// Format-specific fields
sheetCount?: number;
sheetNames?: string[];
fromEmail?: string;
fromName?: string;
toEmails?: string[];
ccEmails?: string[];
bccEmails?: string[];
messageId?: string;
attachments?: string[];
// Allow for any other fields
[key: string]: any;
}
Common Fields:
- `authors` (string[]): Document authors
- `createdAt` (string): Creation date (ISO 8601)
- `createdBy` (string): Creator application/user
- `custom` (Record<string, any>): Custom metadata fields
- `date` (string): Document date
- `formatType` (string): Format discriminator ("pdf", "docx", etc.)
- `keywords` (string[]): Document keywords
- `language` (string): Document language (ISO 639-1)
- `modifiedAt` (string): Modification date (ISO 8601)
- `modifiedBy` (string): Last modifier
- `pageCount` (number): Total number of pages
- `producer` (string): Producer application
- `subject` (string): Document subject
- `title` (string): Document title
Excel-Specific Fields (when formatType === "excel"):
- `sheetCount` (number): Number of sheets
- `sheetNames` (string[]): Names of the sheets
Example:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf');
const metadata = result.metadata;
if (metadata.formatType === 'pdf') {
console.log(`Title: ${metadata.title}`);
console.log(`Authors: ${metadata.authors?.join(', ')}`);
console.log(`Pages: ${metadata.pageCount}`);
}
See the Types Reference for complete metadata field documentation.
Table¶
Extracted table structure.
Type Definition:
interface Table {
cells: string[][];
markdown: string;
pageNumber: number;
}
Fields:
- `cells` (string[][]): 2D array of table cells (rows x columns)
- `markdown` (string): Table rendered as Markdown
- `pageNumber` (number): Page number where the table was found
Example:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('invoice.pdf');
for (const table of result.tables) {
console.log(`Table on page ${table.pageNumber}:`);
console.log(table.markdown);
console.log();
}
ChunkMetadata¶
Metadata for a single text chunk.
Type Definition:
interface ChunkMetadata {
byteStart: number;
byteEnd: number;
charCount: number;
tokenCount?: number;
firstPage?: number;
lastPage?: number;
headingContext?: HeadingContext;
}
Fields:
- `byteStart` (number): UTF-8 byte offset in content (inclusive)
- `byteEnd` (number): UTF-8 byte offset in content (exclusive)
- `charCount` (number): Number of characters in the chunk
- `tokenCount` (number | undefined): Estimated token count (if configured)
- `firstPage` (number | undefined): First page this chunk appears on (1-indexed, only when page boundaries are available)
- `lastPage` (number | undefined): Last page this chunk appears on (1-indexed, only when page boundaries are available)
- `headingContext` (HeadingContext | undefined): Heading hierarchy when using the Markdown chunker. Only populated when chunkerType is set to "markdown".
Page tracking: When PageStructure.boundaries is available and chunking is enabled, firstPage and lastPage are automatically calculated based on byte offsets.
Example:
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf', null, {
chunking: { maxChars: 500, maxOverlap: 50 },
pages: { extractPages: true }
});
if (result.chunks) {
for (const chunk of result.chunks) {
const meta = chunk.metadata;
let pageInfo = '';
if (meta.firstPage !== undefined) {
if (meta.firstPage === meta.lastPage) {
pageInfo = ` (page ${meta.firstPage})`;
} else {
pageInfo = ` (pages ${meta.firstPage}-${meta.lastPage})`;
}
}
console.log(
`Chunk [${meta.byteStart}:${meta.byteEnd}]: ${meta.charCount} chars${pageInfo}`
);
}
}
Extensibility¶
Custom Post-Processors¶
Create custom post-processors to add processing logic to the extraction pipeline.
Protocol:
interface PostProcessorProtocol {
name(): string;
process(result: ExtractionResult): ExtractionResult;
processingStage(): string;
}
Example:
import { registerPostProcessor, extractFileSync } from '@kreuzberg/node';
import type { PostProcessorProtocol, ExtractionResult } from '@kreuzberg/node';
class CustomProcessor implements PostProcessorProtocol {
name(): string {
return 'custom_processor';
}
process(result: ExtractionResult): ExtractionResult {
result.metadata.customField = 'custom_value';
return result;
}
processingStage(): string {
return 'middle';
}
}
registerPostProcessor(new CustomProcessor());
const result = extractFileSync('document.pdf');
console.log(result.metadata.customField);
Managing Processors:
import {
registerPostProcessor,
unregisterPostProcessor,
clearPostProcessors
} from '@kreuzberg/node';
registerPostProcessor(new CustomProcessor());
unregisterPostProcessor('custom_processor');
clearPostProcessors();
Custom Validators¶
Create custom validators to validate extraction results.
Protocol:
Functions:
import {
registerValidator,
unregisterValidator,
clearValidators
} from '@kreuzberg/node';
registerValidator(validator);
unregisterValidator('validator_name');
clearValidators();
Custom OCR Backends¶
Register custom OCR backends for image and PDF processing.
Example with PaddleOCR (native backend):
PaddleOCR is now built into the native Rust core. Simply set the backend to "paddle-ocr":
import { extractFileSync } from '@kreuzberg/node';
const config = {
ocr: {
backend: 'paddle-ocr',
language: 'en'
}
};
const result = extractFileSync('scanned.pdf', null, config);
console.log(result.content);
Embeddings¶
embedSync()¶
Generate embeddings for a list of texts synchronously.
Signature:
function embedSync(
texts: string[],
config: EmbeddingConfig | null = null
): number[][]
Parameters:
- `texts` (string[]): List of strings to embed
- `config` (EmbeddingConfig | null): Embedding configuration. Uses defaults if null
Returns:
number[][]: One embedding vector per input text
Example:
import { embed, embedSync } from "@kreuzberg/node";
import type { EmbeddingConfig } from "@kreuzberg/node";
const config: EmbeddingConfig = {
model: { type: "preset", name: "balanced" },
normalize: true,
};
// Synchronous
const embeddings = embedSync(["Hello, world!", "Kreuzberg is fast"], config);
console.log(embeddings.length); // 2
console.log(embeddings[0].length); // 768
// Asynchronous (preferred)
const asyncEmbeddings = await embed(["Hello, world!", "Kreuzberg is fast"], config);
console.log(asyncEmbeddings[0].length); // 768
embed()¶
Async variant of embedSync().
Signature:
async function embed(
texts: string[],
config: EmbeddingConfig | null = null
): Promise<number[][]>
Same parameters as embedSync(); the promise resolves to the same number[][] result.
PDF Rendering¶
Added in v4.6.2
renderPdfPageSync()¶
Render a single page of a PDF as a PNG image (synchronous).
Signature:
function renderPdfPageSync(
filePath: string,
pageIndex: number,
dpi?: number
): Buffer
Parameters:
- `filePath` (string): Path to the PDF file
- `pageIndex` (number): Zero-based page index to render
- `dpi` (number | undefined): Resolution for rendering. Default: 150
Returns:
Buffer: PNG-encoded Buffer for the requested page
Example:
import { renderPdfPageSync } from "@kreuzberg/node";
import { writeFileSync } from "fs";
const png = renderPdfPageSync("document.pdf", 0);
writeFileSync("first_page.png", png);
PdfPageIterator (class)¶
A memory-efficient alternative to iteratePdfPagesSync/iteratePdfPages. Renders one page at a time via the .next() method, which is useful when memory is a concern or when pages should be processed as they are rendered (for example, sending each page to a vision model for OCR).
Signature:
class PdfPageIterator {
constructor(filePath: string, dpi?: number);
next(): PdfPageResult | null;
pageCount(): number;
close(): void;
}
interface PdfPageResult {
pageIndex: number;
data: Buffer;
}
Example:
import { PdfPageIterator } from "@kreuzberg/node";
import { writeFileSync } from "fs";
const iter = new PdfPageIterator("document.pdf", 150);
let result;
while ((result = iter.next()) !== null) {
const { pageIndex, data } = result;
writeFileSync(`page_${pageIndex}.png`, data);
}
iter.close();
Error Handling¶
All errors are thrown as standard JavaScript Error objects with descriptive messages.
Example:
import { extractFileSync } from '@kreuzberg/node';
try {
const result = extractFileSync('document.pdf');
console.log(result.content);
} catch (error) {
console.error(`Extraction failed: ${error.message}`);
if (error.message.includes('file not found')) {
console.error('File does not exist');
} else if (error.message.includes('parsing')) {
console.error('Failed to parse document');
} else if (error.message.includes('OCR')) {
console.error('OCR processing failed');
}
}
See Error Handling Reference for detailed error documentation.
Type Exports¶
All types are exported for use in your TypeScript code:
import type {
ExtractionConfig,
ExtractionResult,
FileExtractionConfig,
OcrConfig,
TesseractConfig,
PdfConfig,
ChunkingConfig,
LanguageDetectionConfig,
ImageExtractionConfig,
TokenReductionConfig,
PostProcessorConfig,
Table,
Metadata,
PostProcessorProtocol,
ValidatorProtocol,
OcrBackendProtocol
} from '@kreuzberg/node';
Performance Recommendations¶
Batch Processing¶
For processing multiple documents, always use batch APIs:
// Good - Uses batch API
const files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];
const batchResults = await batchExtractFiles(files);
// Bad - Multiple individual calls
const individualResults = [];
for (const file of files) {
individualResults.push(await extractFile(file));
}
Benefits of batch APIs:
- Parallel processing in Rust
- Better memory management
- Optimal resource usage
Sync vs Async¶
- Use async functions (`extractFile`, `batchExtractFiles`) for I/O-bound operations
- Use sync functions (`extractFileSync`, `batchExtractFilesSync`) for simple scripts or CLI tools
LLM Integration¶
Kreuzberg integrates with LLMs via the liter-llm crate for structured extraction and VLM-based OCR. See the LLM Integration Guide for full details.
Structured Extraction¶
Use structuredExtraction in your config to extract structured data from documents using an LLM:
import { extractFileSync } from '@kreuzberg/node';
const config = {
structuredExtraction: {
schema: {
type: 'object',
properties: {
title: { type: 'string' },
authors: { type: 'array', items: { type: 'string' } },
date: { type: 'string' },
},
required: ['title', 'authors', 'date'],
additionalProperties: false,
},
llm: {
model: 'openai/gpt-4o-mini',
},
strict: true,
},
};
const result = extractFileSync('paper.pdf', null, config);
console.log(result.structuredOutput);
The structuredOutput field on ExtractionResult contains the JSON string conforming to the provided schema:
const result = extractFileSync('paper.pdf', null, config);
if (result.structuredOutput) {
const data = JSON.parse(result.structuredOutput);
console.log(data.title);
}
VLM OCR¶
Use a vision-language model as an OCR backend by setting backend: 'vlm' with a vlmConfig:
import { extractFileSync } from '@kreuzberg/node';
const config = {
forceOcr: true,
ocr: {
backend: 'vlm',
vlmConfig: {
model: 'openai/gpt-4o-mini',
},
},
};
const result = extractFileSync('scan.pdf', null, config);
console.log(result.content);
LLM Embeddings¶
Generate embeddings using an LLM provider instead of local ONNX models:
import { embedSync } from '@kreuzberg/node';
const vectors = embedSync(['hello world'], {
modelType: 'llm',
llm: { model: 'openai/text-embedding-3-small' },
});
For configuration details including API keys, model selection, and provider setup, see the LLM Integration Guide.
Code Intelligence¶
Kreuzberg uses tree-sitter-language-pack to parse and analyze source code files across 248 programming languages. When extracting code files, the result metadata includes structural analysis, imports, exports, symbols, diagnostics, and semantic code chunks.
Code intelligence data is available in result.metadata.format when formatType is "code".
import { extractFileSync, ExtractionConfig } from "@kreuzberg/node";
const config: ExtractionConfig = {
treeSitter: {
process: {
structure: true,
imports: true,
exports: true,
comments: true,
docstrings: true,
},
},
};
const result = extractFileSync("app.ts", null, config);
// Access code intelligence from format metadata
const fmt = result.metadata?.format;
if (fmt && fmt.formatType === "code") {
console.log(`Language: ${fmt.language}`);
console.log(`Functions/classes: ${fmt.structure.length}`);
console.log(`Imports: ${fmt.imports.length}`);
for (const item of fmt.structure) {
console.log(` ${item.kind}: ${item.name} at line ${item.span.startLine}`);
}
for (const chunk of fmt.chunks ?? []) {
console.log(`Chunk: ${chunk.content.slice(0, 50)}...`);
}
}
For configuration details, see the Code Intelligence Guide.
System Requirements¶
Node.js: 16.x or higher
Native Dependencies:
- Tesseract OCR (for OCR support):
brew install tesseract(macOS) orapt-get install tesseract-ocr(Ubuntu)
Platforms:
- Linux (x64, arm64)
- macOS (x64, arm64)
- Windows (x64)