WebAssembly API Reference¶
Complete reference for the Kreuzberg WebAssembly binding (@kreuzberg/wasm).
The WASM binding provides a browser-compatible, runtime-agnostic interface to Kreuzberg's document extraction capabilities. It works in browsers, Node.js, Deno, Bun, and Cloudflare Workers.
Installation¶
Or with other package managers:
Deno¶
Module Initialization¶
initWasm()¶
Initialize the WASM module. This must be called once before using any extraction functions.
Signature:
Throws:
Error: If WASM module fails to load or is not supported in the current environment
Example - Basic initialization:
import { initWasm } from '@kreuzberg/wasm';
async function main() {
await initWasm();
// Now you can use extraction functions
}
main().catch(console.error);
Example - With error handling:
import { initWasm, getWasmCapabilities } from '@kreuzberg/wasm';
async function initializeKreuzberg() {
const caps = getWasmCapabilities();
if (!caps.hasWasm) {
throw new Error('WebAssembly is not supported in this environment');
}
try {
await initWasm();
console.log('Kreuzberg initialized successfully');
} catch (error) {
console.error('Failed to initialize Kreuzberg:', error);
throw error;
}
}
initializeKreuzberg().catch(console.error);
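Since initialization has to happen once before any extraction call, applications with several entry points sometimes route it through a shared promise. The sketch below shows one way to do that. `createInitOnce` is a hypothetical helper, not part of @kreuzberg/wasm, and it assumes only that the init function passed in returns a promise.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm): wraps an async init
// function so that concurrent callers share a single in-flight promise.
function createInitOnce(init: () => Promise<void>): () => Promise<void> {
  let pending: Promise<void> | null = null;
  return () => {
    if (pending === null) {
      pending = init().catch((error: unknown) => {
        // Reset the cache so that a later call can retry after a failed load.
        pending = null;
        throw error;
      });
    }
    return pending;
  };
}

// Assumed usage:
//   const ensureReady = createInitOnce(initWasm);
//   await ensureReady(); // safe to call from several places
```

Resetting the cached promise on failure is a design choice; keeping the rejected promise instead would make every later call fail fast with the same error.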
isInitialized()¶
Check if the WASM module is initialized.
Signature:
Returns:
boolean: True if WASM module is initialized, false otherwise
Example:
import { isInitialized, initWasm } from '@kreuzberg/wasm';
if (!isInitialized()) {
await initWasm();
}
getVersion()¶
Get the WASM module version.
Signature:
Returns:
string: The version string of the WASM module
Throws:
Error: If WASM module is not initialized
Example:
import { initWasm, getVersion } from '@kreuzberg/wasm';
await initWasm();
const version = getVersion();
console.log(`Using Kreuzberg ${version}`);
getInitializationError()¶
Get the initialization error if module failed to load. Used for debugging initialization issues.
Signature:
Returns:
Error | null: The error that occurred during initialization, or null if no error
Core Extraction Functions¶
extractBytes()¶
Extract content from document bytes asynchronously.
Signature:
async function extractBytes(
data: Uint8Array,
mimeType: string,
config?: ExtractionConfig | null
): Promise<ExtractionResult>
Parameters:
- data (Uint8Array): The document bytes to extract from
- mimeType (string): MIME type of the document (e.g., 'application/pdf', 'image/jpeg'). Required.
- config (ExtractionConfig | null): Optional extraction configuration. Uses defaults if not provided.
Returns:
Promise<ExtractionResult>: Extraction result containing content, metadata, tables, images, chunks, and more
Throws:
Error: If WASM module is not initialized, document data is empty, MIME type is missing, or extraction fails
Example - Extract PDF:
import { initWasm, extractBytes } from '@kreuzberg/wasm';
await initWasm();
const pdfBytes = new Uint8Array(buffer);
const result = await extractBytes(pdfBytes, 'application/pdf');
console.log(result.content);
console.log(`Found ${result.tables?.length ?? 0} tables`);
Example - Extract with configuration:
import { initWasm, extractBytes } from '@kreuzberg/wasm';
import type { ExtractionConfig } from '@kreuzberg/wasm';
await initWasm();
const config: ExtractionConfig = {
ocr: {
backend: 'tesseract-wasm',
language: 'deu' // German
},
images: {
extractImages: true,
targetDpi: 200
}
};
const result = await extractBytes(pdfBytes, 'application/pdf', config);
Example - Extract from File in browser:
import { initWasm, extractBytes } from '@kreuzberg/wasm';
import { fileToUint8Array } from '@kreuzberg/wasm/adapters/wasm-adapter';
await initWasm();
const file = inputEvent.target.files[0];
const bytes = await fileToUint8Array(file);
const result = await extractBytes(bytes, file.type);
console.log(result.content);
extractFile()¶
Extract content from a file on the file system (Node.js, Deno, Bun only).
Signature:
async function extractFile(
path: string,
mimeType?: string | null,
config?: ExtractionConfig | null
): Promise<ExtractionResult>
Parameters:
- path (string): Path to the file to extract from. Required.
- mimeType (string | null): Optional MIME type. If not provided, it is auto-detected from file content and extension.
- config (ExtractionConfig | null): Optional extraction configuration
Returns:
Promise<ExtractionResult>: Extraction result
Throws:
Error: If WASM module is not initialized, file path is missing, file doesn't exist, runtime is not supported (browser), or extraction fails
Example - Extract with auto-detection:
import { extractFile } from '@kreuzberg/wasm';
const result = await extractFile('./document.pdf');
console.log(result.content);
Example - Extract with explicit MIME type:
import { extractFile } from '@kreuzberg/wasm';
const result = await extractFile('./document.docx', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
console.log(result.content);
Example - Extract with configuration:
import { extractFile } from '@kreuzberg/wasm';
const result = await extractFile('./report.xlsx', null, {
chunking: {
maxChars: 1000
}
});
extractFromFile()¶
Extract content from a File or Blob (browser-friendly wrapper).
Convenience function that combines fileToUint8Array() and extractBytes() for streamlined browser usage.
Signature:
async function extractFromFile(
file: File | Blob,
mimeType?: string | null,
config?: ExtractionConfig | null
): Promise<ExtractionResult>
Parameters:
- file (File | Blob): The File or Blob to extract from. Required.
- mimeType (string | null): Optional MIME type. If not provided, uses file.type for File objects and defaults to 'application/octet-stream' for Blob.
- config (ExtractionConfig | null): Optional extraction configuration
Returns:
Promise<ExtractionResult>: Extraction result
Throws:
Error: If WASM module is not initialized or extraction fails
Example - Simple file input:
import { initWasm, extractFromFile } from '@kreuzberg/wasm';
await initWasm();
const fileInput = document.getElementById('file') as HTMLInputElement;
fileInput.addEventListener('change', async (e) => {
const file = (e.target as HTMLInputElement).files?.[0];
if (file) {
const result = await extractFromFile(file);
console.log(result.content);
}
});
Example - With configuration:
import { extractFromFile } from '@kreuzberg/wasm';
const result = await extractFromFile(file, file.type, {
chunking: { maxChars: 1000 },
images: { extractImages: true }
});
batchExtractBytes()¶
Extract content from multiple byte arrays in parallel.
Signature:
async function batchExtractBytes(
dataList: Uint8Array[],
mimeTypes: string[],
config?: ExtractionConfig | null
): Promise<ExtractionResult[]>
Parameters:
- dataList (Uint8Array[]): Array of document bytes to extract from. Required.
- mimeTypes (string[]): Array of MIME types corresponding to each document. Must match the length of dataList. Required.
- config (ExtractionConfig | null): Optional extraction configuration applied to all documents
Returns:
Promise<ExtractionResult[]>: Array of extraction results in the same order as input
Throws:
Error: If WASM module is not initialized or any extraction fails
Example:
import { initWasm, batchExtractBytes } from '@kreuzberg/wasm';
await initWasm();
const dataList = [pdfBytes1, pdfBytes2, pdfBytes3];
const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];
const results = await batchExtractBytes(dataList, mimeTypes, {
extract_tables: true
});
for (const result of results) {
console.log(`${result.mimeType}: ${result.content.length} characters`);
}
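Because mimeTypes must match dataList in length, a pre-flight check can surface a mismatch with a clearer message than a failure deep inside the batch. The following is a minimal sketch; `assertBatchInputs` is a hypothetical helper and not part of @kreuzberg/wasm.

```typescript
// Hypothetical pre-flight check (not part of @kreuzberg/wasm): verifies that
// every document has a corresponding, non-empty MIME type.
function assertBatchInputs(dataList: Uint8Array[], mimeTypes: string[]): void {
  if (dataList.length !== mimeTypes.length) {
    throw new Error(
      `dataList has ${dataList.length} entries but mimeTypes has ${mimeTypes.length}`
    );
  }
  mimeTypes.forEach((mimeType, index) => {
    if (mimeType.trim() === '') {
      throw new Error(`Missing MIME type for document at index ${index}`);
    }
  });
}

// Assumed usage, before calling batchExtractBytes(dataList, mimeTypes):
//   assertBatchInputs(dataList, mimeTypes);
```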
batchExtractFiles()¶
Extract content from multiple browser File objects in parallel.
Signature:
async function batchExtractFiles(
files: File[],
config?: ExtractionConfig | null
): Promise<ExtractionResult[]>
Parameters:
- files (File[]): Array of File objects to extract from. Required.
- config (ExtractionConfig | null): Optional extraction configuration applied to all files
Returns:
Promise<ExtractionResult[]>: Array of extraction results in the same order as input
Throws:
Error: If WASM module is not initialized or any extraction fails
Example - Process multiple file uploads:
import { initWasm, batchExtractFiles } from '@kreuzberg/wasm';
await initWasm();
const fileInput = document.getElementById('files') as HTMLInputElement;
const files = Array.from(fileInput.files ?? []); // files is null when nothing is selected
const results = await batchExtractFiles(files, {
extract_tables: true
});
for (const result of results) {
console.log(`${result.mimeType}: ${result.content.length} characters`);
}
Synchronous Extraction Functions¶
extractBytesSync()¶
Extract content from document bytes synchronously.
Note: Synchronous extraction may block the event loop on large documents. Use async extraction (extractBytes()) for better performance in most cases.
Signature:
function extractBytesSync(
data: Uint8Array,
mimeType: string,
config?: ExtractionConfig | null
): ExtractionResult
Parameters:
- data (Uint8Array): The document bytes to extract from
- mimeType (string): MIME type of the document
- config (ExtractionConfig | null): Optional extraction configuration
Returns:
ExtractionResult: Extraction result
Throws:
Error: If WASM module is not initialized or extraction fails
Example:
import { initWasm, extractBytesSync } from '@kreuzberg/wasm';
await initWasm();
const result = extractBytesSync(pdfBytes, 'application/pdf');
console.log(result.content);
batchExtractBytesSync()¶
Extract content from multiple byte arrays synchronously.
Signature:
function batchExtractBytesSync(
dataList: Uint8Array[],
mimeTypes: string[],
config?: ExtractionConfig | null
): ExtractionResult[]
Parameters:
- dataList (Uint8Array[]): Array of document bytes
- mimeTypes (string[]): Array of MIME types
- config (ExtractionConfig | null): Optional extraction configuration
Returns:
ExtractionResult[]: Array of extraction results
Throws:
Error: If WASM module is not initialized or any extraction fails
OCR Functions¶
enableOcr()¶
Enable OCR functionality with automatic backend selection.
Automatically selects the best available OCR backend based on build configuration and runtime:
- Native WASM OCR (preferred): If built with the ocr-wasm feature, uses kreuzberg-tesseract compiled directly into the WASM binary. Works in all environments (Browser, Node.js, Deno, Bun).
- Browser fallback: Uses TesseractWasmBackend with the tesseract-wasm npm package (requires the createImageBitmap browser API).
Signature:
Throws:
Error: If WASM module is not initialized or no OCR backend is available
Requirements:
- Network access to jsDelivr CDN for training data (downloaded on first use per language)
- For native WASM OCR: WASM module built with the ocr-wasm feature
- For browser fallback: createImageBitmap API support
Example - Basic OCR (works in all environments):
import { initWasm, enableOcr, extractBytes } from '@kreuzberg/wasm';
async function main() {
await initWasm();
await enableOcr();
const imageBytes = new Uint8Array(buffer);
const result = await extractBytes(imageBytes, 'image/png', {
ocr: { backend: 'kreuzberg-tesseract', language: 'eng' }
});
console.log(result.content);
}
main().catch(console.error);
Example - Node.js OCR:
import { initWasm, enableOcr, extractFile } from '@kreuzberg/wasm';
await initWasm();
await enableOcr(); // Uses native kreuzberg-tesseract backend
const result = await extractFile('./scanned_document.png', 'image/png', {
ocr: { backend: 'kreuzberg-tesseract', language: 'eng' }
});
console.log(result.content);
Example - Multi-language OCR:
import { initWasm, enableOcr, extractBytes } from '@kreuzberg/wasm';
await initWasm();
await enableOcr();
// Extract English text
const englishResult = await extractBytes(engImageBytes, 'image/png', {
ocr: { backend: 'kreuzberg-tesseract', language: 'eng' }
});
// Extract German text - training data is cached after first use
const germanResult = await extractBytes(deImageBytes, 'image/png', {
ocr: { backend: 'kreuzberg-tesseract', language: 'deu' }
});
Supported Languages (42):
eng, deu, fra, spa, ita, por, nld, rus, jpn, kor, chi_sim, chi_tra, pol, tur, swe, dan, fin, nor, ces, slk, ron, hun, hrv, srp, bul, ukr, ell, ara, heb, hin, tha, vie, mkd, ben, tam, tel, kan, mal, mya, khm, lao, sin
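Validating a language code before starting OCR avoids a failed extraction, or a training-data download for a code that does not exist. The sketch below copies the codes from the list above into a set; `isSupportedOcrLanguage` is a hypothetical helper, and the case-sensitive match is an assumption.

```typescript
// Language codes copied from the list above.
const SUPPORTED_OCR_LANGUAGES: ReadonlySet<string> = new Set([
  'eng', 'deu', 'fra', 'spa', 'ita', 'por', 'nld', 'rus', 'jpn', 'kor',
  'chi_sim', 'chi_tra', 'pol', 'tur', 'swe', 'dan', 'fin', 'nor', 'ces', 'slk',
  'ron', 'hun', 'hrv', 'srp', 'bul', 'ukr', 'ell', 'ara', 'heb', 'hin',
  'tha', 'vie', 'mkd', 'ben', 'tam', 'tel', 'kan', 'mal', 'mya', 'khm',
  'lao', 'sin',
]);

// Hypothetical helper (not part of @kreuzberg/wasm).
// Assumes codes are matched case-sensitively, as written in the list.
function isSupportedOcrLanguage(code: string): boolean {
  return SUPPORTED_OCR_LANGUAGES.has(code);
}
```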
OCR Backend Management¶
registerOcrBackend()¶
Register a custom OCR backend.
Signature:
Parameters:
backend(OcrBackendProtocol): OCR backend implementing the OcrBackendProtocol interface. Required.
Throws:
Error: If backend validation fails
Example:
import { registerOcrBackend, TesseractWasmBackend } from '@kreuzberg/wasm';
const backend = new TesseractWasmBackend();
await backend.initialize();
registerOcrBackend(backend);
getOcrBackend()¶
Get a registered OCR backend by name.
Signature:
Parameters:
name(string): Backend name. Required.
Returns:
OcrBackendProtocol | undefined: The OCR backend or undefined if not found
Example:
import { getOcrBackend } from '@kreuzberg/wasm';
const backend = getOcrBackend('tesseract-wasm');
if (backend) {
console.log('Available languages:', backend.supportedLanguages());
}
listOcrBackends()¶
List all registered OCR backends.
Signature:
Returns:
string[]: Array of registered backend names
Example:
import { listOcrBackends } from '@kreuzberg/wasm';
const backends = listOcrBackends();
console.log('Available OCR backends:', backends);
unregisterOcrBackend()¶
Unregister an OCR backend.
Signature:
Parameters:
name(string): Backend name to unregister. Required.
Throws:
Error: If backend is not found
Example:
import { unregisterOcrBackend } from '@kreuzberg/wasm';
await unregisterOcrBackend('tesseract-wasm');
clearOcrBackends()¶
Clear all registered OCR backends and call their shutdown methods.
Signature:
Example:
import { clearOcrBackends } from '@kreuzberg/wasm';
// Clean up all backends when shutting down
await clearOcrBackends();
MIME Type Utilities¶
detectMimeFromBytes()¶
Auto-detect MIME type from file bytes.
Signature:
Parameters:
data(Uint8Array): File bytes to detect MIME type from. Required.
Returns:
string: Detected MIME type (e.g., 'application/pdf', 'image/jpeg')
Example:
import { detectMimeFromBytes } from '@kreuzberg/wasm';
const fileBytes = new Uint8Array(buffer);
const mimeType = detectMimeFromBytes(fileBytes);
console.log(`Detected MIME type: ${mimeType}`);
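Detection from bytes generally relies on the signature ("magic number") at the start of a file. The sketch below illustrates the idea for three formats. It is not Kreuzberg's implementation, and `sniffMimeType` is a hypothetical name chosen for this example.

```typescript
// Leading byte signatures: PDF starts with "%PDF", PNG with a fixed
// 8-byte marker, and JPEG with FF D8 FF.
const SIGNATURES: ReadonlyArray<{ mimeType: string; bytes: number[] }> = [
  { mimeType: 'application/pdf', bytes: [0x25, 0x50, 0x44, 0x46] },
  { mimeType: 'image/png', bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mimeType: 'image/jpeg', bytes: [0xff, 0xd8, 0xff] },
];

// Illustrative only: returns the first MIME type whose signature matches
// the start of the data, or null when no signature matches.
function sniffMimeType(data: Uint8Array): string | null {
  for (const { mimeType, bytes } of SIGNATURES) {
    if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) {
      return mimeType;
    }
  }
  return null;
}
```

Formats without a distinctive leading signature, such as plain text or CSV, cannot be identified this way, which is one reason to combine byte detection with the file extension.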
getMimeFromExtension()¶
Get MIME type from file extension.
Signature:
Parameters:
extension(string): File extension (with or without leading dot). Required.
Returns:
string | null: MIME type or null if extension is not recognized
Example:
import { getMimeFromExtension } from '@kreuzberg/wasm';
const mimeType = getMimeFromExtension('pdf'); // 'application/pdf'
const mimeType2 = getMimeFromExtension('.docx'); // 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
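In browsers, `File.type` is an empty string when the browser cannot determine the type, so an extension lookup is a common fallback. The sketch below shows one way to combine the two. `resolveMimeType` is a hypothetical helper, and the lookup is passed in as a parameter so the sketch relies only on the `string | null` return described above.

```typescript
type ExtensionLookup = (extension: string) => string | null;

// Hypothetical helper (not part of @kreuzberg/wasm): prefers the declared
// type and falls back to an extension lookup when it is empty.
function resolveMimeType(
  fileName: string,
  declaredType: string,
  lookup: ExtensionLookup
): string | null {
  if (declaredType !== '') return declaredType;
  const dot = fileName.lastIndexOf('.');
  // No extension, a dotfile such as ".gitignore", or a trailing dot.
  if (dot <= 0 || dot === fileName.length - 1) return null;
  return lookup(fileName.slice(dot + 1).toLowerCase());
}

// Assumed usage:
//   const mimeType = resolveMimeType(file.name, file.type, getMimeFromExtension);
```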
getExtensionsForMime()¶
Get file extensions for a MIME type.
Signature:
Parameters:
mimeType(string): MIME type to look up. Required.
Returns:
string[]: Array of file extensions (without leading dots)
Example:
import { getExtensionsForMime } from '@kreuzberg/wasm';
const extensions = getExtensionsForMime('application/pdf'); // ['pdf']
const extensions2 = getExtensionsForMime('image/jpeg'); // ['jpg', 'jpeg']
normalizeMimeType()¶
Normalize MIME type to canonical form.
Signature:
Parameters:
mimeType(string): MIME type to normalize. Required.
Returns:
string: Normalized MIME type
Example:
import { normalizeMimeType } from '@kreuzberg/wasm';
const normalized = normalizeMimeType('application/PDF'); // 'application/pdf'
const normalized2 = normalizeMimeType('text/plain'); // 'text/plain'
Configuration Loading¶
Deprecated API
The enable_ocr parameter has been deprecated in favor of the new ocr configuration object.
Old pattern (no longer supported):
New pattern:
The new approach provides more granular control over OCR behavior through the OCR configuration object.
loadConfigFromString()¶
Load extraction configuration from a string in YAML, JSON, or TOML format.
Signature:
function loadConfigFromString(
content: string,
format: 'yaml' | 'toml' | 'json'
): ExtractionConfig
Parameters:
- content (string): Configuration content as a string. Required.
- format ('yaml' | 'toml' | 'json'): Configuration format. Required.
Returns:
ExtractionConfig: Parsed extraction configuration
Throws:
Error: If configuration parsing fails
Example - YAML configuration:
import { loadConfigFromString, extractBytes } from '@kreuzberg/wasm';
const yamlConfig = `
extract_tables: true
ocr:
backend: tesseract
languages: [eng, deu]
`;
const config = loadConfigFromString(yamlConfig, 'yaml');
const result = await extractBytes(data, 'application/pdf', config);
Example - JSON configuration:
import { loadConfigFromString } from '@kreuzberg/wasm';
const jsonConfig = '{"extract_tables":true}';
const config = loadConfigFromString(jsonConfig, 'json');
Example - TOML configuration:
import { loadConfigFromString } from '@kreuzberg/wasm';
const tomlConfig = `
extract_tables = true
# OCR now configured via config.ocr.backend
[ocr_config]
languages = ["eng", "deu"]
`;
const config = loadConfigFromString(tomlConfig, 'toml');
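When the configuration text comes from a file, the format argument can be derived from the file name. The sketch below is one way to do that; `configFormatFromPath` is a hypothetical helper, and treating `.yml` as YAML is an assumption of this example.

```typescript
type ConfigFormat = 'yaml' | 'toml' | 'json';

// Hypothetical helper (not part of @kreuzberg/wasm): maps a file extension
// to the format string expected by loadConfigFromString().
function configFormatFromPath(path: string): ConfigFormat {
  const dot = path.lastIndexOf('.');
  const extension = dot === -1 ? '' : path.slice(dot + 1).toLowerCase();
  switch (extension) {
    case 'yaml':
    case 'yml':
      return 'yaml';
    case 'toml':
      return 'toml';
    case 'json':
      return 'json';
    default:
      throw new Error(`Unsupported configuration file extension: ${path}`);
  }
}

// Assumed usage:
//   const config = loadConfigFromString(text, configFormatFromPath('kreuzberg.toml'));
```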
Runtime Detection¶
detectRuntime()¶
Detect the current JavaScript runtime environment.
Signature:
Returns:
RuntimeType: One of 'browser', 'node', 'deno', 'bun', or 'unknown'
Example:
import { detectRuntime } from '@kreuzberg/wasm';
const runtime = detectRuntime();
switch (runtime) {
case 'browser':
console.log('Running in browser');
break;
case 'node':
console.log('Running in Node.js');
break;
case 'deno':
console.log('Running in Deno');
break;
case 'bun':
console.log('Running in Bun');
break;
}
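Runtime detection of this kind is typically done by probing well-known globals. The sketch below illustrates the general technique and is not Kreuzberg's implementation: `detectRuntimeFrom` and `RuntimeName` are hypothetical names, and the global scope is passed in as a parameter so the logic can be exercised with plain objects.

```typescript
type RuntimeName = 'browser' | 'node' | 'deno' | 'bun' | 'unknown';

// Illustrative only: classifies a global scope by the globals it exposes.
function detectRuntimeFrom(scope: Record<string, unknown>): RuntimeName {
  // Bun and Deno are checked first because both can expose a
  // Node-compatible `process` object.
  if (typeof scope.Bun !== 'undefined') return 'bun';
  if (typeof scope.Deno !== 'undefined') return 'deno';
  const proc = scope.process as { versions?: { node?: string } } | undefined;
  if (typeof proc?.versions?.node === 'string') return 'node';
  if (typeof scope.window !== 'undefined' && typeof scope.document !== 'undefined') {
    return 'browser';
  }
  return 'unknown';
}
```

The order of the checks is a design choice. Checking for Node before the browser means a DOM emulation layer running under Node is still reported as 'node'.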
getWasmCapabilities()¶
Get WebAssembly capabilities available in the current runtime.
Signature:
Returns:
WasmCapabilities: Object containing capability flags:
- runtime (RuntimeType): Detected runtime
- hasWasm (boolean): WebAssembly support
- hasWasmStreaming (boolean): Streaming WASM instantiation
- hasFileApi (boolean): File API (browser)
- hasBlob (boolean): Blob API
- hasWorkers (boolean): Web Worker support
- hasSharedArrayBuffer (boolean): SharedArrayBuffer (restricted)
- hasModuleWorkers (boolean): Module Workers
- hasBigInt (boolean): BigInt support
- runtimeVersion (string | undefined): Runtime version if available
Example:
import { getWasmCapabilities } from '@kreuzberg/wasm';
const caps = getWasmCapabilities();
console.log(`Runtime: ${caps.runtime}`);
console.log(`WASM: ${caps.hasWasm}`);
console.log(`Workers: ${caps.hasWorkers}`);
if (caps.hasSharedArrayBuffer) {
console.log('Multi-threading available');
} else {
console.log('Running in single-threaded mode');
}
isBrowser(), isNode(), isDeno(), isBun()¶
Check if code is running in a specific runtime.
Signature:
function isBrowser(): boolean
function isNode(): boolean
function isDeno(): boolean
function isBun(): boolean
Returns:
boolean: True if running in the specified runtime
Example:
import { isBrowser, isNode, extractFile, extractFromFile } from '@kreuzberg/wasm';
if (isNode()) {
// Node.js: use extractFile() for file system access
const result = await extractFile('./document.pdf');
} else if (isBrowser()) {
// Browser: use extractFromFile() or extractBytes()
const result = await extractFromFile(fileInput.files[0]);
}
hasWorkers(), hasSharedArrayBuffer()¶
Check for specific WASM capabilities.
Signature:
Returns:
boolean: True if the capability is available
Example:
import { hasWorkers, hasSharedArrayBuffer } from '@kreuzberg/wasm';
if (hasSharedArrayBuffer()) {
console.log('Multi-threading with SharedArrayBuffer enabled');
}
if (!hasWorkers()) {
console.warn('Web Workers not available - some features may be limited');
}
Type Adapter Utilities¶
fileToUint8Array()¶
Convert a File or Blob to Uint8Array.
Handles both browser File API and server-side Blob-like objects with a unified interface.
Signature:
Parameters:
file(File | Blob): The File or Blob to convert. Required.
Returns:
Promise<Uint8Array>: The byte array
Throws:
Error: If file cannot be read or exceeds size limit (512 MB)
Example:
import { fileToUint8Array, extractBytes } from '@kreuzberg/wasm';
const file = document.getElementById('input').files[0];
const bytes = await fileToUint8Array(file);
const result = await extractBytes(bytes, file.type);
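Because the conversion rejects inputs over the size limit, checking `file.size` before reading lets an application report the problem without loading the file. The sketch below is a minimal version of such a check. `assertWithinSizeLimit` is a hypothetical helper, and interpreting 512 MB as 512 × 1024 × 1024 bytes is an assumption.

```typescript
// Assumption: "512 MB" is interpreted as 512 * 1024 * 1024 bytes.
const MAX_FILE_BYTES = 512 * 1024 * 1024;

// Hypothetical helper (not part of @kreuzberg/wasm): accepts anything with a
// numeric size, which includes File and Blob objects.
function assertWithinSizeLimit(file: { size: number; name?: string }): void {
  if (file.size > MAX_FILE_BYTES) {
    const label = file.name ?? 'input';
    const sizeMb = (file.size / (1024 * 1024)).toFixed(1);
    throw new Error(`${label} is ${sizeMb} MB, which exceeds the 512 MB limit`);
  }
}
```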
configToJS()¶
Normalize ExtractionConfig for WASM processing.
Converts TypeScript configuration objects to WASM-compatible format, handling null values and nested structures.
Signature:
Parameters:
config(ExtractionConfig | null): The extraction configuration or null
Returns:
Record<string, unknown>: Normalized configuration object
Example:
import { configToJS } from '@kreuzberg/wasm/adapters/wasm-adapter';
const config = {
ocr: { backend: 'tesseract' },
chunking: { maxChars: 1000 }
};
const wasmConfig = configToJS(config);
jsToExtractionResult()¶
Parse WASM extraction result and convert to TypeScript type.
Handles conversion of WASM-returned objects to proper ExtractionResult types with full validation.
Signature:
Parameters:
jsValue(unknown): The raw WASM result value
Returns:
ExtractionResult: Properly typed extraction result
Throws:
Error: If result structure is invalid
isValidExtractionResult()¶
Validate that a value conforms to ExtractionResult structure.
Performs structural validation without full type checking.
Signature:
Parameters:
value(unknown): The value to validate
Returns:
boolean: True if value appears to be a valid ExtractionResult
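Structural validation of this kind is usually written as a TypeScript type guard. The sketch below shows the general shape and is not the library's implementation: `looksLikeExtractionResult` and `ExtractionResultLike` are hypothetical names, and checking only content, mimeType, and metadata is an assumption of this example.

```typescript
// Minimal shape used by this sketch. The real ExtractionResult has more fields.
interface ExtractionResultLike {
  content: string;
  mimeType: string;
  metadata: Record<string, unknown>;
}

// Illustrative type guard: checks that the required fields are present with
// the expected primitive types, without validating nested structures.
function looksLikeExtractionResult(value: unknown): value is ExtractionResultLike {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.content === 'string' &&
    typeof candidate.mimeType === 'string' &&
    typeof candidate.metadata === 'object' &&
    candidate.metadata !== null
  );
}
```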
Type Definitions¶
All types are exported from the @kreuzberg/wasm package and shared from @kreuzberg/core. Use these types for complete type safety when working with configuration and results.
Importing Types¶
import type {
ExtractionResult,
ExtractionConfig,
OcrConfig,
ChunkingConfig,
ImageConfig,
KeywordsConfig,
Table,
ExtractedImage,
Chunk,
Metadata,
OcrBackendProtocol,
RuntimeType,
WasmCapabilities
} from '@kreuzberg/wasm';
Types¶
All types are shared via the @kreuzberg/core package. Import them for type-safe configuration and results:
import type {
ExtractionResult,
ExtractionConfig,
OcrConfig,
ChunkingConfig,
ImageConfig,
KeywordsConfig,
Table,
ExtractedImage,
Chunk,
Metadata,
OcrBackendProtocol
} from '@kreuzberg/core';
ExtractionResult¶
The main result object returned from extraction functions.
Fields:
- content (string): Extracted text content
- mimeType (string): MIME type of the document
- metadata (Metadata): Document metadata (page count, encoding, etc.)
- tables (Table[] | null): Extracted tables (if extract_tables enabled)
- images (ExtractedImage[] | null): Extracted images (if extract_images enabled)
- chunks (Chunk[] | null): Text chunks (if enable_chunking enabled)
- detectedLanguages (string[] | null): Detected language codes (if enable_language_detection enabled)
ExtractionConfig¶
Configuration object for extraction. All fields are optional; defaults are used if not provided.
Fields:
- extract_tables (boolean): Extract tables as structured data
- extract_images (boolean): Extract embedded images
- extract_metadata (boolean): Extract document metadata
- ocr_config (OcrConfig): OCR configuration
- enable_chunking (boolean): Split text into semantic chunks
- chunking_config (ChunkingConfig): Text chunking configuration
- enable_language_detection (boolean): Detect document language
- enable_quality (boolean): Enable encoding detection and normalization
- extract_keywords (boolean): Extract important keywords
- keywords_config (KeywordsConfig): Keyword extraction settings
OcrConfig¶
Configuration for OCR extraction.
Fields:
- backend (string): OCR backend name (e.g., 'tesseract-wasm')
- language (string): Language code for OCR (e.g., 'eng', 'deu', 'fra')
- languages (string[]): Multiple languages for OCR
- dpi (number): DPI for OCR processing
- preprocessing (OcrPreprocessing): Image preprocessing settings
ChunkingConfig¶
Configuration for text chunking.
Fields:
- maxChars (number): Maximum characters per chunk (default: 1000)
- maxOverlap (number): Overlap between chunks in characters (default: 200)
- embedding (EmbeddingConfig | undefined): Optional embedding configuration
- preset (string | undefined): Chunking preset name
ImageConfig¶
Configuration for image extraction.
Fields:
- extractImages (boolean): Extract images from documents
- targetDpi (number): Target DPI for extracted images
- maxImageDimension (number): Maximum pixel dimension for images
KeywordsConfig¶
Configuration for keyword extraction.
Fields:
- maxKeywords (number): Maximum number of keywords to extract
- method (string): Keyword extraction method (e.g., 'yake')
Table¶
Extracted table structure.
Fields:
- cells (string[][]): 2D array of table cells
- markdown (string): Table in Markdown format
- pageNumber (number): Page number where table appears
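The 2D cells array is convenient to turn into keyed records for downstream processing. The sketch below does this under the assumption that the first row of the table is a header row, which the reference does not guarantee. `tableToRecords` is a hypothetical helper.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm).
// Assumption: the first row of `cells` holds the column names.
function tableToRecords(cells: string[][]): Record<string, string>[] {
  const [header, ...rows] = cells;
  if (header === undefined) return [];
  return rows.map((row) => {
    const record: Record<string, string> = {};
    header.forEach((name, column) => {
      // Short rows are padded with empty strings.
      record[name] = row[column] ?? '';
    });
    return record;
  });
}

// Assumed usage:
//   const records = tableToRecords(result.tables?.[0]?.cells ?? []);
```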
ExtractedImage¶
Image extracted from document.
Fields:
- data (Uint8Array): Image bytes
- format (string): Image format (e.g., 'png', 'jpeg')
- imageIndex (number): Index within document
- pageNumber (number | null): Page number (if applicable)
- width (number | null): Image width in pixels
- height (number | null): Image height in pixels
- colorspace (string | null): Color space (e.g., 'RGB', 'CMYK')
- bitsPerComponent (number | null): Bits per color component
- isMask (boolean): Whether this is a mask image
- description (string | null): Image description if available
Chunk¶
Text chunk from chunking operation.
Fields:
- content (string): Chunk text content
- metadata (ChunkMetadata): Metadata about the chunk
- embedding (number[] | null): Vector embedding (if available)
ChunkMetadata:
- byte_start (number): Starting byte offset (UTF-8 boundary)
- byte_end (number): Ending byte offset (UTF-8 boundary)
- chunk_index (number): Index of this chunk
- total_chunks (number): Total number of chunks
- token_count (number | null): Token count if available
- first_page (number | null): First page this chunk appears on
- last_page (number | null): Last page this chunk appears on
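Byte offsets differ from JavaScript string indices whenever the text contains non-ASCII characters, so slicing with `String.prototype.slice` can return the wrong span. The sketch below slices by UTF-8 byte range instead. `sliceByByteRange` is a hypothetical helper, and it assumes the offsets refer to the UTF-8 encoding of the full extracted content.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm).
// Assumption: byteStart and byteEnd index into the UTF-8 encoding of `content`.
function sliceByByteRange(content: string, byteStart: number, byteEnd: number): string {
  const bytes = new TextEncoder().encode(content);
  return new TextDecoder().decode(bytes.subarray(byteStart, byteEnd));
}

// "é" and "ö" each take two bytes in UTF-8, so byte offsets run ahead of
// string indices after the first accented character.
const sample = 'héllo wörld';
const byBytes = sliceByByteRange(sample, 7, 13); // 'wörld'
const byChars = sample.slice(7, 13);             // 'örld'
```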
Metadata¶
Document metadata.
Fields:
- pageCount (number | null): Number of pages (if applicable)
- encoding (string | null): Text encoding
- format (string): Document format
- author (string | null): Document author
- title (string | null): Document title
- createdAt (string | null): Creation timestamp
- modifiedAt (string | null): Last modification timestamp
- Additional format-specific fields
Platform-Specific Notes¶
Browser¶
Requirements:
- Modern browser with WebAssembly support (Chrome 91+, Firefox 90+, Safari 16.4+)
- File API for file uploads
SharedArrayBuffer for Multi-Threading:
To enable multi-threaded extraction, set these HTTP headers:
Example with Express.js:
import express from 'express';
const app = express();
app.use((req, res, next) => {
res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
next();
});
Example with Vite:
import { defineConfig } from 'vite';
export default defineConfig({
server: {
headers: {
'Cross-Origin-Opener-Policy': 'same-origin',
'Cross-Origin-Embedder-Policy': 'require-corp'
}
}
});
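Whether the headers took effect can be checked at runtime: browsers expose a `crossOriginIsolated` boolean on the global scope, and SharedArrayBuffer is only usable when it is true. The sketch below wraps that check. `canUseSharedMemory` is a hypothetical helper, and the scope is passed in as a parameter so the logic can be exercised outside a browser.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm): reports whether the
// given global scope is cross-origin isolated and exposes SharedArrayBuffer.
function canUseSharedMemory(scope: {
  crossOriginIsolated?: boolean;
  SharedArrayBuffer?: unknown;
}): boolean {
  return scope.crossOriginIsolated === true && typeof scope.SharedArrayBuffer !== 'undefined';
}

// Assumed usage in a browser:
//   if (!canUseSharedMemory(globalThis)) {
//     console.warn('COOP/COEP headers missing; running single-threaded');
//   }
```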
Node.js¶
Requirements:
- Node.js 18 or higher
- WASM support (available by default)
Example:
import { extractFile, initWasm } from '@kreuzberg/wasm';
async function main() {
await initWasm();
const result = await extractFile('./document.pdf');
console.log(result.content);
}
main().catch(console.error);
Deno¶
Requirements:
- Deno 1.0 or higher
- Read permissions for files (--allow-read)
- Network permissions for OCR training data (--allow-net)
Import:
import { extractFile, initWasm } from "npm:@kreuzberg/wasm@^4.2.7";
// Must run with: deno run --allow-read --allow-net script.ts
Example:
import { extractFile, initWasm } from "npm:@kreuzberg/wasm@^4.2.7";
async function main() {
await initWasm();
const result = await extractFile('./document.pdf');
console.log(result.content);
}
main().catch(console.error);
Bun¶
Requirements:
- Bun 1.x or higher
- WASM support (available by default)
Example:
import { extractFile, initWasm } from '@kreuzberg/wasm';
async function main() {
await initWasm();
const result = await extractFile('./document.pdf');
console.log(result.content);
}
main().catch(console.error);
Cloudflare Workers¶
Requirements:
- Cloudflare Workers runtime
- Bundle size considerations (10MB limit compressed)
HTTP Headers:
Cloudflare Workers automatically handle necessary CORS headers. For multi-threading, ensure:
export default {
  async fetch(request: Request): Promise<Response> {
    const body = 'OK'; // Placeholder response body
    const response = new Response(body);
    // Headers required for cross-origin isolation (SharedArrayBuffer)
    response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
    response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
    return response;
  }
};
Memory Constraints:
For large documents, use chunking to reduce memory usage:
import { initWasm, isInitialized, extractBytes } from '@kreuzberg/wasm';
export default {
  async fetch(request: Request): Promise<Response> {
    // Extraction functions throw if the WASM module is not initialized
    if (!isInitialized()) {
      await initWasm();
    }
    const formData = await request.formData();
    const file = formData.get('file') as File;
    const arrayBuffer = await file.arrayBuffer();
    const bytes = new Uint8Array(arrayBuffer);
    const result = await extractBytes(bytes, file.type, {
      chunking_config: { maxChars: 1000 }
    });
    return Response.json({
      text: result.content,
      metadata: result.metadata
    });
  }
};
Common Patterns¶
Pattern: Runtime-Aware File Loading¶
Automatically select the appropriate extraction function based on runtime:
import {
  extractFile,
  extractFromFile,
  isNode,
  isBrowser,
  initWasm
} from '@kreuzberg/wasm';
import type { ExtractionResult } from '@kreuzberg/wasm';
await initWasm();
async function extractAny(input: string | File): Promise<ExtractionResult> {
if (isNode() && typeof input === 'string') {
return await extractFile(input);
} else if (isBrowser() && input instanceof File) {
return await extractFromFile(input);
} else {
throw new Error('Invalid input for current runtime');
}
}
Pattern: Graceful OCR Initialization¶
Initialize OCR with fallback to text-only extraction:
import { initWasm, enableOcr, extractBytes } from '@kreuzberg/wasm';
async function extractWithOcrFallback(bytes: Uint8Array, mimeType: string) {
await initWasm();
let config = {};
try {
await enableOcr();
config = { ocr: { backend: 'kreuzberg-tesseract', language: 'eng' } };
} catch (error) {
console.warn('OCR unavailable, continuing with text extraction', error);
}
return await extractBytes(bytes, mimeType, config);
}
Pattern: Batch Processing with Progress¶
Extract multiple files with progress tracking:
import { initWasm, extractBytes } from '@kreuzberg/wasm';
async function extractWithProgress(
files: File[],
onProgress: (current: number, total: number) => void
) {
await initWasm();
const results = [];
for (let i = 0; i < files.length; i++) {
const fileBytes = await files[i].arrayBuffer();
const result = await extractBytes(
new Uint8Array(fileBytes),
files[i].type
);
results.push(result);
onProgress(i + 1, files.length);
}
return results;
}
Pattern: Configuration Management¶
Load configuration from environment or file:
import { loadConfigFromString, extractBytes } from '@kreuzberg/wasm';
async function extractWithConfig(bytes: Uint8Array, mimeType: string) {
let config = null;
// Try to load from environment variable
const configStr = process.env.KREUZBERG_CONFIG;
if (configStr) {
try {
config = loadConfigFromString(configStr, 'json');
} catch (error) {
console.warn('Failed to parse config from environment:', error);
}
}
// Default config if not loaded
if (!config) {
config = {
extract_tables: true,
extract_metadata: true
};
}
return await extractBytes(bytes, mimeType, config);
}
Supported Formats¶
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| Web | HTML, XHTML, XML, EPUB |
| Text | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TeX, FB2 |
| Email | EML, MSG |
| Archives | ZIP, TAR, 7Z |
| Other | And 30+ more formats |
Supported MIME Types¶
Common MIME types supported by Kreuzberg WASM:
Documents¶
- application/pdf - PDF documents
- application/vnd.openxmlformats-officedocument.wordprocessingml.document - DOCX (Word)
- application/msword - DOC (Word 97-2003)
- application/vnd.openxmlformats-officedocument.presentationml.presentation - PPTX (PowerPoint)
- application/vnd.ms-powerpoint - PPT (PowerPoint 97-2003)
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - XLSX (Excel)
- application/vnd.ms-excel - XLS (Excel 97-2003)
- application/vnd.oasis.opendocument.text - ODT (OpenDocument Text)
- application/vnd.oasis.opendocument.presentation - ODP (OpenDocument Presentation)
- application/vnd.oasis.opendocument.spreadsheet - ODS (OpenDocument Spreadsheet)
- text/rtf - RTF (Rich Text Format)
Images¶
- image/png - PNG
- image/jpeg - JPEG
- image/webp - WebP
- image/bmp - BMP
- image/tiff - TIFF
- image/gif - GIF
Text¶
- text/plain - Plain text
- text/markdown - Markdown
- text/html - HTML
- application/json - JSON
- text/xml - XML
- application/xml - XML (alternative)
- text/yaml - YAML
- text/csv - CSV
- text/tab-separated-values - TSV
Archives¶
- application/zip - ZIP
- application/x-tar - TAR
- application/x-7z-compressed - 7Z
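When bytes do not come from a File object, the caller has to supply the MIME type passed to extractBytes(). The following is a minimal sketch of an extension lookup built from the types listed above. The names MIME_BY_EXTENSION and mimeTypeForFilename are hypothetical helpers for illustration and are not part of @kreuzberg/wasm.

```typescript
// Hypothetical helper, not part of @kreuzberg/wasm: maps a file extension
// to one of the MIME types listed in this section.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: 'application/pdf',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  doc: 'application/msword',
  pptx: 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
  ppt: 'application/vnd.ms-powerpoint',
  xlsx: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  xls: 'application/vnd.ms-excel',
  odt: 'application/vnd.oasis.opendocument.text',
  odp: 'application/vnd.oasis.opendocument.presentation',
  ods: 'application/vnd.oasis.opendocument.spreadsheet',
  rtf: 'text/rtf',
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  webp: 'image/webp',
  bmp: 'image/bmp',
  tif: 'image/tiff',
  tiff: 'image/tiff',
  gif: 'image/gif',
  txt: 'text/plain',
  md: 'text/markdown',
  htm: 'text/html',
  html: 'text/html',
  json: 'application/json',
  xml: 'application/xml',
  yaml: 'text/yaml',
  yml: 'text/yaml',
  csv: 'text/csv',
  tsv: 'text/tab-separated-values',
  zip: 'application/zip',
  tar: 'application/x-tar',
  '7z': 'application/x-7z-compressed',
};

// Returns the MIME type for a filename, or undefined when the name has no
// extension or the extension is not in the table.
function mimeTypeForFilename(filename: string): string | undefined {
  const dot = filename.lastIndexOf('.');
  if (dot < 0 || dot === filename.length - 1) {
    return undefined;
  }
  return MIME_BY_EXTENSION[filename.slice(dot + 1).toLowerCase()];
}
```

A caller could pass the result to extractBytes() and reject the file early when the lookup returns undefined. The table covers only the types listed here, not every format Kreuzberg accepts.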
Platform Support Matrix¶
| Function | Browser | Node.js | Deno | Bun | Workers |
|---|---|---|---|---|---|
| initWasm() | Yes | Yes | Yes | Yes | Yes |
| extractBytes() | Yes | Yes | Yes | Yes | Yes |
| extractFile() | No | Yes | Yes | Yes | No |
| extractFromFile() | Yes | No | No | No | No |
| enableOcr() | Yes | Yes* | Yes* | Yes* | Yes* |
| initThreadPool() | Yes | No | No | No | No |
| batchExtractFiles() | Yes | No | No | No | No |
* OCR in non-browser environments requires the WASM module to be built with the ocr-wasm feature flag, which statically links kreuzberg-tesseract into the WASM binary. When available, native WASM OCR works in all environments without any browser-specific APIs. The browser-only TesseractWasmBackend fallback (using createImageBitmap) is used only when native WASM OCR is not available.
PDF support in Node.js/Deno: PDFium is automatically loaded from the filesystem when running in Node.js or Deno. Set the KREUZBERG_PDFIUM_PATH environment variable to customize the PDFium module location.
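Code that runs in more than one environment can branch on the matrix above before choosing a file API. The sketch below is one heuristic way to do that. detectRuntime, supportsExtractFile, and the GlobalsLike shape are hypothetical names, not part of @kreuzberg/wasm, and the checks are common conventions rather than guarantees (for example, WebSocketPair is used here as a marker for Cloudflare Workers).

```typescript
// Hypothetical helpers, not part of @kreuzberg/wasm.
type Runtime = 'browser' | 'node' | 'deno' | 'bun' | 'workers' | 'unknown';

// The subset of global properties this heuristic inspects.
interface GlobalsLike {
  Deno?: unknown;
  Bun?: unknown;
  WebSocketPair?: unknown;
  process?: { versions?: { node?: string } };
  document?: unknown;
}

// Order matters: Deno and Bun also expose a Node-compatible `process`,
// so they are checked before Node.js.
function detectRuntime(g: GlobalsLike): Runtime {
  if (g.Deno !== undefined) return 'deno';
  if (g.Bun !== undefined) return 'bun';
  if (g.WebSocketPair !== undefined) return 'workers';
  if (g.process?.versions?.node !== undefined) return 'node';
  if (g.document !== undefined) return 'browser';
  return 'unknown';
}

// Mirrors the extractFile() row of the matrix: path-based extraction is
// listed for Node.js, Deno, and Bun only.
function supportsExtractFile(runtime: Runtime): boolean {
  return runtime === 'node' || runtime === 'deno' || runtime === 'bun';
}

// Inspect the real global object of the current environment.
const currentRuntime = detectRuntime(globalThis as unknown as GlobalsLike);
console.log(`Detected runtime: ${currentRuntime}`);
```

When supportsExtractFile() returns false, extractBytes() remains available in every environment listed in the matrix.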
Troubleshooting¶
"WASM module failed to initialize"¶
Ensure your bundler is configured to handle WASM files:
Vite:
Webpack:
"Module not found: @kreuzberg/core"¶
The @kreuzberg/core package is a peer dependency. Install it:
"SharedArrayBuffer is not available"¶
This is expected in some browsers, or when the required cross-origin isolation headers are not set. Multi-threading will not be available, but extraction will continue in single-threaded mode.
To enable multi-threading, set the required HTTP headers (see Platform-Specific Notes > Browser).
Memory Issues in Cloudflare Workers¶
For large documents, process in smaller chunks:
const result = await extractBytes(pdfBytes, 'application/pdf', {
chunking_config: { maxChars: 1000 }
});
WASM Module Not Loading¶
Symptoms: "Failed to load WASM module" error on initialization
Causes:
- Network issues preventing WASM download
- Bundler misconfiguration (not handling .wasm files correctly)
- CORS restrictions blocking module fetch
- Module not included in bundle
Solutions:
1. Check browser network tab for failed requests
2. Configure bundler (see "WASM module failed to initialize" section)
3. Ensure CORS headers allow WASM requests
4. Use CDN-delivered version as fallback
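The network check in step 1 can also be done in code. A misconfigured bundler or dev server often answers the .wasm request with an HTML fallback page and status 200, and the browser's streaming compilation APIs (WebAssembly.instantiateStreaming and compileStreaming) reject any response whose Content-Type is not application/wasm. The sketch below classifies a response from its status and Content-Type. diagnoseWasmResponse is a hypothetical helper, and whether Kreuzberg loads its module through the streaming APIs is an assumption.

```typescript
// Hypothetical helper, not part of @kreuzberg/wasm.
// Returns a human-readable problem description, or null when the response
// looks like a valid WASM payload.
function diagnoseWasmResponse(status: number, contentType: string | null): string | null {
  if (status < 200 || status >= 300) {
    return `request failed with HTTP ${status}`;
  }
  // Drop parameters such as "; charset=utf-8" before comparing.
  const [rawType = ''] = (contentType ?? '').split(';');
  const mediaType = rawType.trim().toLowerCase();
  if (mediaType !== 'application/wasm') {
    return `unexpected Content-Type "${mediaType || 'none'}" (expected application/wasm)`;
  }
  return null;
}
```

In a browser this could be called as diagnoseWasmResponse(res.status, res.headers.get('content-type')) after fetching the module URL reported in the network tab.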
SharedArrayBuffer Not Available¶
Symptoms: Multi-threading features disabled, or "SharedArrayBuffer is not available" warning
Causes:
- HTTPS context not used (required for security)
- Missing Cross-Origin-Opener-Policy (COOP) headers
- Missing Cross-Origin-Embedder-Policy (COEP) headers
- Old browser version without SharedArrayBuffer support
Solutions:
1. Ensure application runs over HTTPS in production
2. Set required headers (see Platform-Specific Notes > Browser section):
   - Cross-Origin-Opener-Policy: same-origin
   - Cross-Origin-Embedder-Policy: require-corp
3. Update browser to latest version
4. Application will automatically fall back to single-threaded mode
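To find out which of these causes applies, a page can inspect its own global scope before attempting to start threads. Browsers expose a crossOriginIsolated boolean that is true only when the COOP and COEP headers took effect. The sketch below takes the scope as a parameter so it can be exercised outside a browser. checkThreadingSupport and its types are hypothetical names, not part of @kreuzberg/wasm.

```typescript
// Hypothetical helper, not part of @kreuzberg/wasm.
interface IsolationScope {
  SharedArrayBuffer?: unknown;
  crossOriginIsolated?: boolean;
}

type ThreadingStatus =
  | { available: true }
  | { available: false; reason: string };

// Reports whether the given global scope can support SharedArrayBuffer-based
// threading, and why not when it cannot.
function checkThreadingSupport(scope: IsolationScope): ThreadingStatus {
  if (typeof scope.SharedArrayBuffer !== 'function') {
    return { available: false, reason: 'SharedArrayBuffer is not defined in this context' };
  }
  if (scope.crossOriginIsolated !== true) {
    return {
      available: false,
      reason: 'context is not cross-origin isolated (check the COOP and COEP headers)',
    };
  }
  return { available: true };
}
```

In a browser, passing globalThis (or self inside a worker) gives a reason string that can be logged alongside the single-threaded fallback.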
OCR Not Available or Not Working¶
Symptoms: "No OCR backend available" error or OCR produces no output
Causes:
- WASM module not built with ocr-wasm feature (for native OCR)
- Not in browser environment and native OCR unavailable (for browser fallback)
- Training data not loading from jsDelivr CDN
- Language model not available for selected language
Solutions:
1. Enable native WASM OCR by building with the ocr-wasm feature flag. This embeds kreuzberg-tesseract into the WASM binary and works in all environments.
2. Check if OCR is available after enabling:
check_ocr.ts
import { enableOcr, listOcrBackends } from '@kreuzberg/wasm';
try {
await enableOcr();
const backends = listOcrBackends();
console.log('Available OCR backends:', backends);
// Expected: ['kreuzberg-tesseract'] (native) or ['tesseract-wasm'] (browser fallback)
} catch (error) {
console.warn('OCR not available:', error);
}
3. Check supported languages:
4. Ensure network access to jsDelivr CDN:
- First OCR call per language downloads training data from CDN
- Subsequent calls use cached data
- May fail without internet connection
5. Handle initialization errors gracefully:
ocr_graceful.ts
import { enableOcr, extractBytes } from '@kreuzberg/wasm';
let ocrEnabled = false;
try {
await enableOcr();
ocrEnabled = true;
} catch (error) {
console.warn('OCR initialization failed:', error);
}
const config = ocrEnabled
? { ocr: { backend: 'kreuzberg-tesseract', language: 'eng' } }
: {};
const result = await extractBytes(bytes, 'application/pdf', config);
WASM Module Size and Performance¶
Symptoms: Large bundle size or slow initial load
Context:
- WASM module: ~5MB uncompressed
- Gzip compressed: ~1.5-2MB
- OCR training data (per language): ~20-50MB (downloaded on demand, cached)
Optimization strategies:
1. Use code splitting to load WASM only when needed
2. Compress with gzip/brotli (bundlers do this automatically)
3. Load training data selectively (only load languages you need)
4. Use extractBytes() for in-memory processing to avoid file I/O
5. For large documents, enable chunking to reduce memory usage
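Strategy 1 is usually a dynamic import() wrapped so that the module is fetched and initialized at most once, on first use. The sketch below shows only the memoization part, kept generic so it runs without the package installed. lazyOnce is a hypothetical helper, not part of @kreuzberg/wasm.

```typescript
// Hypothetical helper, not part of @kreuzberg/wasm.
// Wraps an async loader so that concurrent and repeated callers share one
// in-flight promise. A failed load clears the cache so a later call can retry.
function lazyOnce<T>(loader: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = loader().catch((error) => {
        pending = undefined; // allow a retry after a failed load
        throw error;
      });
    }
    return pending;
  };
}

// Assumed usage with the real package (shown as a comment so this block
// stays runnable on its own):
//
// const getKreuzberg = lazyOnce(async () => {
//   const mod = await import('@kreuzberg/wasm');
//   await mod.initWasm();
//   return mod;
// });
// const { extractBytes } = await getKreuzberg();
```

Because the import() call sits inside the loader, bundlers that support code splitting can place the WASM binding in a separate chunk that is requested only when getKreuzberg() is first called.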
Multi-Threading with wasm-bindgen-rayon¶
Kreuzberg WASM uses wasm-bindgen-rayon for multi-threaded document processing, which requires SharedArrayBuffer support.
Initializing Thread Pool¶
Initialize the thread pool with available CPU cores:
import { initThreadPool, extractBytes } from '@kreuzberg/wasm';
// Initialize thread pool for multi-threaded extraction
await initThreadPool(navigator.hardwareConcurrency);
// Now extractions will use multiple threads for better performance
const result = await extractBytes(pdfBytes, 'application/pdf');
Graceful Degradation¶
If thread pool initialization fails, catch the error and continue in single-threaded mode:
import { initThreadPool, extractBytes } from '@kreuzberg/wasm';
try {
await initThreadPool(navigator.hardwareConcurrency);
console.log('Multi-threading enabled');
} catch (error) {
// Fall back to single-threaded processing
console.warn('Multi-threading unavailable:', error);
console.log('Using single-threaded extraction');
}
// Extraction will work in both cases
const result = await extractBytes(pdfBytes, 'application/pdf');
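The examples above pass navigator.hardwareConcurrency straight to initThreadPool(). That value can be undefined in some environments and large on many-core machines, where each extra worker adds memory overhead. The sketch below clamps it to a safe range. pickThreadCount is a hypothetical helper, not part of @kreuzberg/wasm, and the default cap of 8 is an arbitrary assumption rather than a library recommendation.

```typescript
// Hypothetical helper, not part of @kreuzberg/wasm.
// Converts a reported core count into a thread count between 1 and maxThreads.
function pickThreadCount(hardwareConcurrency: number | undefined, maxThreads = 8): number {
  // Treat missing, NaN, or infinite values as a single core.
  const reported = Number.isFinite(hardwareConcurrency)
    ? Math.floor(hardwareConcurrency as number)
    : 1;
  const upperBound = Math.max(Math.floor(maxThreads), 1);
  return Math.min(Math.max(reported, 1), upperBound);
}
```

A call such as initThreadPool(pickThreadCount(navigator.hardwareConcurrency)) would then always receive a positive integer.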