PHP API Reference¶
Complete reference for the Kreuzberg PHP API.
Installation¶
Composer Installation¶
PHP Extension Setup¶
The Kreuzberg PHP package requires the native extension to be installed and enabled.
1. Install the extension:
Download the appropriate extension for your platform from the releases page, or build from source.
2. Enable the extension:
Add to your php.ini:
3. Verify installation:
<?php
if (!extension_loaded('kreuzberg')) {
die("Kreuzberg extension not loaded\n");
}
echo "Kreuzberg extension loaded successfully!\n";
echo "Version: " . \Kreuzberg\Kreuzberg::version() . "\n";
Quick Start¶
Basic Usage (OOP API)¶
<?php
use Kreuzberg\Kreuzberg;
$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');
echo $result->content;
echo "Pages: {$result->metadata->pageCount}\n";
Basic Usage (Procedural API)¶
<?php
use function Kreuzberg\extract_file;
$result = extract_file('document.pdf');
echo $result->content;
echo "Pages: {$result->metadata->pageCount}\n";
Extract from Bytes¶
<?php
use Kreuzberg\Kreuzberg;
$kreuzberg = new Kreuzberg();
$data = file_get_contents('document.pdf');
$result = $kreuzberg->extractBytes($data, 'application/pdf');
echo $result->content;
Core Classes¶
Kreuzberg¶
Main API class for document extraction.
Signature:
final readonly class Kreuzberg
{
public function __construct(?ExtractionConfig $defaultConfig = null);
public function extractFile(
string $filePath,
?string $mimeType = null,
?ExtractionConfig $config = null
): ExtractionResult;
public function extractBytes(
string $data,
string $mimeType,
?ExtractionConfig $config = null
): ExtractionResult;
public function batchExtractFiles(
array $paths,
?ExtractionConfig $config = null
): array;
public function batchExtractBytes(
array $dataList,
array $mimeTypes,
?ExtractionConfig $config = null
): array;
public static function version(): string;
}
Example:
<?php
use Kreuzberg\Kreuzberg;
use Kreuzberg\Config\ExtractionConfig;
// Create with default config
$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');
// Create with custom config
$config = new ExtractionConfig(extractTables: true);
$kreuzberg = new Kreuzberg($config);
$result = $kreuzberg->extractFile('document.pdf');
// Override config per call
$overrideConfig = new ExtractionConfig(extractImages: true);
$result = $kreuzberg->extractFile('document.pdf', config: $overrideConfig);
Kreuzberg::extractFile()¶
Extract content from a file.
Signature:
public function extractFile(
string $filePath,
?string $mimeType = null,
?ExtractionConfig $config = null
): ExtractionResult
Parameters:
$filePath(string): Path to the file to extract$mimeType(string|null): Optional MIME type hint. If null, MIME type is auto-detected$config(ExtractionConfig|null): Extraction configuration. Uses constructor config if null
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Throws:
KreuzbergException: If extraction fails
Examples:
<?php
use Kreuzberg\Kreuzberg;
use Kreuzberg\Config\ExtractionConfig;
use Kreuzberg\Config\OcrConfig;
$kreuzberg = new Kreuzberg();
// Basic extraction
$result = $kreuzberg->extractFile('document.pdf');
// With MIME type hint
$result = $kreuzberg->extractFile('document.pdf', 'application/pdf');
// With configuration
$config = new ExtractionConfig(
ocr: new OcrConfig(backend: 'tesseract', language: 'eng')
);
$result = $kreuzberg->extractFile('scanned.pdf', config: $config);
Kreuzberg::extractBytes()¶
Extract content from bytes.
Signature:
public function extractBytes(
string $data,
string $mimeType,
?ExtractionConfig $config = null
): ExtractionResult
Parameters:
$data(string): File content as bytes$mimeType(string): MIME type of the data (required for format detection)$config(ExtractionConfig|null): Extraction configuration. Uses constructor config if null
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Examples:
<?php
use Kreuzberg\Kreuzberg;
$kreuzberg = new Kreuzberg();
// Extract from file data
$data = file_get_contents('document.pdf');
$result = $kreuzberg->extractBytes($data, 'application/pdf');
// Extract from uploaded file
$uploadedData = $_FILES['document']['tmp_name'];
$data = file_get_contents($uploadedData);
$result = $kreuzberg->extractBytes($data, 'application/pdf');
Kreuzberg::batchExtractFiles()¶
Extract content from multiple files in parallel.
Signature:
Parameters:
$paths(array): Array of file paths to extract $config(ExtractionConfig|null): Extraction configuration applied to all files
Returns:
array<ExtractionResult>: Array of extraction results (one per file)
Examples:
<?php
use Kreuzberg\Kreuzberg;
$kreuzberg = new Kreuzberg();
$files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
$results = $kreuzberg->batchExtractFiles($files);
foreach ($results as $i => $result) {
echo "{$files[$i]}: {$result->content}\n";
}
Kreuzberg::batchExtractBytes()¶
Extract content from multiple byte arrays in parallel.
Signature:
public function batchExtractBytes(
array $dataList,
array $mimeTypes,
?ExtractionConfig $config = null
): array
Parameters:
$dataList(array): Array of file contents as bytes $mimeTypes(array): Array of MIME types (one per data item, same length as dataList) $config(ExtractionConfig|null): Extraction configuration applied to all items
Returns:
array<ExtractionResult>: Array of extraction results (one per data item)
Examples:
<?php
use Kreuzberg\Kreuzberg;
$kreuzberg = new Kreuzberg();
$dataList = [
file_get_contents('doc1.pdf'),
file_get_contents('doc2.docx'),
];
$mimeTypes = [
'application/pdf',
'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
];
$results = $kreuzberg->batchExtractBytes($dataList, $mimeTypes);
Procedural Functions¶
extract_file()¶
Extract content from a file (procedural API).
Signature:
function extract_file(
string $filePath,
?string $mimeType = null,
?ExtractionConfig $config = null
): ExtractionResult
Parameters:
Same as Kreuzberg::extractFile().
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
Examples:
<?php
use function Kreuzberg\extract_file;
use Kreuzberg\Config\ExtractionConfig;
// Basic extraction
$result = extract_file('document.pdf');
// With configuration
$config = new ExtractionConfig(extractTables: true);
$result = extract_file('document.pdf', config: $config);
extract_bytes()¶
Extract content from bytes (procedural API).
Signature:
function extract_bytes(
string $data,
string $mimeType,
?ExtractionConfig $config = null
): ExtractionResult
Parameters:
Same as Kreuzberg::extractBytes().
Returns:
ExtractionResult: Extraction result containing content, metadata, and tables
batch_extract_files()¶
Extract content from multiple files in parallel (procedural API).
Signature:
Parameters:
Same as Kreuzberg::batchExtractFiles().
Returns:
array<ExtractionResult>: Array of extraction results (one per file)
Examples:
<?php
use function Kreuzberg\batch_extract_files;
$files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
$results = batch_extract_files($files);
foreach ($results as $result) {
echo $result->content;
}
batch_extract_bytes()¶
Extract content from multiple byte arrays in parallel (procedural API).
Signature:
function batch_extract_bytes(
array $dataList,
array $mimeTypes,
?ExtractionConfig $config = null
): array
Parameters:
Same as Kreuzberg::batchExtractBytes().
Returns:
array<ExtractionResult>: Array of extraction results (one per data item)
detect_mime_type()¶
Detect MIME type from file bytes.
Signature:
Parameters:
$data(string): File content as bytes
Returns:
string: Detected MIME type (e.g., "application/pdf", "image/png")
Examples:
<?php
use function Kreuzberg\detect_mime_type;
$data = file_get_contents('unknown.file');
$mimeType = detect_mime_type($data);
echo $mimeType; // "application/pdf"
detect_mime_type_from_path()¶
Detect MIME type from file path.
Signature:
Parameters:
$path(string): Path to the file
Returns:
string: Detected MIME type (e.g., "application/pdf", "text/plain")
Examples:
<?php
use function Kreuzberg\detect_mime_type_from_path;
$mimeType = detect_mime_type_from_path('document.pdf');
echo $mimeType; // "application/pdf"
Configuration¶
ExtractionConfig¶
Main configuration class for extraction operations.
Signature:
readonly class ExtractionConfig
{
public function __construct(
public ?OcrConfig $ocr = null,
public ?PdfConfig $pdf = null,
public ?ChunkingConfig $chunking = null,
public ?EmbeddingConfig $embedding = null,
public ?ImageExtractionConfig $imageExtraction = null,
public ?PageConfig $page = null,
public ?LanguageDetectionConfig $languageDetection = null,
public ?KeywordConfig $keyword = null,
public bool $extractImages = false,
public bool $extractTables = true,
public bool $preserveFormatting = false,
public ?string $outputFormat = null,
);
}
Fields:
$ocr(OcrConfig|null): OCR configuration. Default: null (no OCR)$pdf(PdfConfig|null): PDF-specific configuration. Default: null$chunking(ChunkingConfig|null): Text chunking configuration. Default: null$embedding(EmbeddingConfig|null): Embedding generation configuration. Default: null$imageExtraction(ImageExtractionConfig|null): Image extraction configuration. Default: null$page(PageConfig|null): Page extraction configuration. Default: null$languageDetection(LanguageDetectionConfig|null): Language detection configuration. Default: null$keyword(KeywordConfig|null): Keyword extraction configuration. Default: null$extractImages(bool): Extract images from documents. Default: false$extractTables(bool): Extract tables from documents. Default: true$preserveFormatting(bool): Preserve original formatting. Default: false$outputFormat(string|null): Output format ("markdown", "plain"). Default: null
Examples:
<?php
use Kreuzberg\Config\ExtractionConfig;
use Kreuzberg\Config\OcrConfig;
use Kreuzberg\Config\PdfConfig;
// Basic configuration
$config = new ExtractionConfig(
extractImages: true,
extractTables: true
);
// Advanced configuration
$config = new ExtractionConfig(
ocr: new OcrConfig(
backend: 'tesseract',
language: 'eng'
),
pdf: new PdfConfig(
extractImages: true,
extractMetadata: true
),
extractImages: true,
extractTables: true,
preserveFormatting: true,
outputFormat: 'markdown'
);
OcrConfig¶
OCR processing configuration.
Signature:
readonly class OcrConfig
{
public function __construct(
public string $backend = 'tesseract',
public string $language = 'eng',
public ?TesseractConfig $tesseractConfig = null,
);
}
Fields:
$backend(string): OCR backend to use. Options: "tesseract". Default: "tesseract"$language(string): Language code for OCR (ISO 639-3). Default: "eng"$tesseractConfig(TesseractConfig|null): Tesseract-specific configuration. Default: null
Examples:
<?php
use Kreuzberg\Config\OcrConfig;
// Basic OCR
$config = new OcrConfig(
backend: 'tesseract',
language: 'eng'
);
// Multilingual OCR
$config = new OcrConfig(
backend: 'tesseract',
language: 'eng+fra+deu'
);
TesseractConfig¶
Tesseract OCR backend configuration.
Signature:
readonly class TesseractConfig
{
public function __construct(
public int $psm = 3,
public int $oem = 3,
public bool $enableTableDetection = false,
public ?string $tesseditCharWhitelist = null,
public ?string $tesseditCharBlacklist = null,
);
}
Fields:
$psm(int): Page segmentation mode (0-13). Default: 3 (auto)$oem(int): OCR engine mode (0-3). Default: 3 (LSTM only)$enableTableDetection(bool): Enable table detection and extraction. Default: false$tesseditCharWhitelist(string|null): Character whitelist (e.g., "0123456789" for digits only). Default: null$tesseditCharBlacklist(string|null): Character blacklist. Default: null
Examples:
<?php
use Kreuzberg\Config\OcrConfig;
use Kreuzberg\Config\TesseractConfig;
$config = new OcrConfig(
backend: 'tesseract',
language: 'eng',
tesseractConfig: new TesseractConfig(
psm: 6,
enableTableDetection: true,
tesseditCharWhitelist: '0123456789'
)
);
PdfConfig¶
PDF-specific configuration.
Signature:
readonly class PdfConfig
{
public function __construct(
public bool $extractImages = false,
public bool $extractMetadata = true,
public bool $ocrFallback = false,
public ?int $startPage = null,
public ?int $endPage = null,
public int $imageQuality = 95,
);
}
Fields:
$extractImages(bool): Extract images from PDF. Default: false$extractMetadata(bool): Extract PDF metadata. Default: true$ocrFallback(bool): Use OCR if text extraction fails. Default: false$startPage(int|null): Start page for extraction (0-indexed). Default: null (all pages)$endPage(int|null): End page for extraction (0-indexed). Default: null (all pages)$imageQuality(int): JPEG quality for extracted images (1-100). Default: 95
Examples:
<?php
use Kreuzberg\Config\PdfConfig;
// Extract specific page range
$config = new PdfConfig(
extractImages: true,
startPage: 0,
endPage: 4
);
// OCR fallback for scanned PDFs
$config = new PdfConfig(
ocrFallback: true,
extractImages: true
);
ChunkingConfig¶
Text chunking configuration for splitting long documents.
Signature:
readonly class ChunkingConfig
{
public function __construct(
public int $maxChunkSize = 512,
public int $chunkOverlap = 50,
public bool $respectSentences = true,
public bool $respectParagraphs = false,
);
}
Fields:
$maxChunkSize(int): Maximum chunk size in characters. Default: 512$chunkOverlap(int): Overlap between chunks in characters. Default: 50$respectSentences(bool): Try to split at sentence boundaries. Default: true$respectParagraphs(bool): Try to split at paragraph boundaries. Default: false
Examples:
<?php
use Kreuzberg\Config\ChunkingConfig;
$config = new ChunkingConfig(
maxChunkSize: 1000,
chunkOverlap: 100,
respectSentences: true,
respectParagraphs: true
);
EmbeddingConfig¶
Embedding generation configuration.
Signature:
readonly class EmbeddingConfig
{
public function __construct(
public string $model = 'all-MiniLM-L6-v2',
public bool $normalize = true,
public int $batchSize = 32,
);
}
Fields:
$model(string): Embedding model name. Default: "all-MiniLM-L6-v2"$normalize(bool): Normalize embeddings. Default: true$batchSize(int): Batch size for embedding generation. Default: 32
ImageExtractionConfig¶
Image extraction configuration.
Signature:
readonly class ImageExtractionConfig
{
public function __construct(
public bool $extractImages = false,
public bool $performOcr = false,
public int $minWidth = 100,
public int $minHeight = 100,
);
}
Fields:
$extractImages(bool): Enable image extraction from documents. Default: false$performOcr(bool): Perform OCR on extracted images. Default: false$minWidth(int): Minimum image width in pixels. Default: 100$minHeight(int): Minimum image height in pixels. Default: 100
Examples:
<?php
use Kreuzberg\Config\ImageExtractionConfig;
use Kreuzberg\Config\OcrConfig;
$config = new ImageExtractionConfig(
extractImages: true,
performOcr: true,
minWidth: 200,
minHeight: 200
);
PageConfig¶
Page extraction configuration.
Signature:
readonly class PageConfig
{
public function __construct(
public bool $extractPages = false,
public bool $insertPageMarkers = false,
public string $markerFormat = '--- Page {page_number} ---',
);
}
Fields:
$extractPages(bool): Extract per-page content. Default: false$insertPageMarkers(bool): Insert page markers in content. Default: false$markerFormat(string): Page marker format string. Default: "--- Page {page_number} ---"
Examples:
<?php
use Kreuzberg\Config\PageConfig;
$config = new PageConfig(
extractPages: true,
insertPageMarkers: true,
markerFormat: '=== Page {page_number} ==='
);
LanguageDetectionConfig¶
Language detection configuration.
Signature:
readonly class LanguageDetectionConfig
{
public function __construct(
public bool $enabled = true,
public int $maxLanguages = 3,
public float $confidenceThreshold = 0.8,
);
}
Fields:
$enabled(bool): Enable language detection. Default: true$maxLanguages(int): Maximum number of languages to detect. Default: 3$confidenceThreshold(float): Minimum confidence threshold (0.0-1.0). Default: 0.8
Examples:
<?php
use Kreuzberg\Config\LanguageDetectionConfig;
$config = new LanguageDetectionConfig(
enabled: true,
maxLanguages: 5,
confidenceThreshold: 0.9
);
KeywordConfig¶
Keyword extraction configuration.
Signature:
readonly class KeywordConfig
{
public function __construct(
public bool $enabled = false,
public string $algorithm = 'rake',
public int $maxKeywords = 10,
);
}
Fields:
$enabled(bool): Enable keyword extraction. Default: false$algorithm(string): Keyword extraction algorithm. Options: "rake". Default: "rake"$maxKeywords(int): Maximum number of keywords to extract. Default: 10
Examples:
<?php
use Kreuzberg\Config\KeywordConfig;
$config = new KeywordConfig(
enabled: true,
algorithm: 'rake',
maxKeywords: 15
);
ImagePreprocessingConfig¶
Image preprocessing configuration for OCR.
Signature:
readonly class ImagePreprocessingConfig
{
public function __construct(
public int $targetDpi = 300,
public bool $autoRotate = true,
public bool $denoise = false,
);
}
Fields:
$targetDpi(int): Target DPI for image preprocessing. Default: 300$autoRotate(bool): Auto-rotate images based on orientation. Default: true$denoise(bool): Apply denoising filter. Default: false
Results & Types¶
ExtractionResult¶
Result object returned by all extraction functions.
Signature:
readonly class ExtractionResult
{
public function __construct(
public string $content,
public string $mimeType,
public Metadata $metadata,
public array $tables = [],
public ?array $detectedLanguages = null,
public ?array $chunks = null,
public ?array $images = null,
public ?array $pages = null,
);
}
Fields:
$content(string): Extracted text content$mimeType(string): MIME type of the processed document$metadata(Metadata): Document metadata (format-specific fields)$tables(array): Array of extracted tables
$detectedLanguages(array|null): Array of detected language codes (ISO 639-1) if language detection is enabled $chunks(array|null): Text chunks with embeddings and metadata $images(array|null): Extracted images (with nested OCR results) $pages(array|null): Per-page extracted content when page extraction is enabled Examples:
extraction_result_examples.php<?php use function Kreuzberg\extract_file; $result = extract_file('document.pdf'); echo "Content: {$result->content}\n"; echo "MIME type: {$result->mimeType}\n"; echo "Page count: {$result->metadata->pageCount}\n"; echo "Tables: " . count($result->tables) . "\n"; if ($result->detectedLanguages !== null) { echo "Languages: " . implode(', ', $result->detectedLanguages) . "\n"; }
Metadata¶
Document metadata with format-specific fields.
Signature:
PHPreadonly class Metadata { public function __construct( public ?string $language = null, public ?string $date = null, public ?string $subject = null, public ?string $formatType = null, public ?string $title = null, public ?array $authors = null, public ?array $keywords = null, public ?string $createdAt = null, public ?string $modifiedAt = null, public ?string $createdBy = null, public ?string $producer = null, public ?int $pageCount = null, public array $custom = [], ); }Common Fields:
$language(string|null): Document language (ISO 639-1 code)$date(string|null): Document date (ISO 8601 format)$subject(string|null): Document subject$formatType(string|null): Format discriminator ("pdf", "excel", "email", etc.)$title(string|null): Document title$authors(array|null): Document authors $keywords(array|null): Document keywords $createdAt(string|null): Creation date (ISO 8601)$modifiedAt(string|null): Modification date (ISO 8601)$createdBy(string|null): Creator/application name$producer(string|null): Producer/generator$pageCount(int|null): Number of pages$custom(array): Additional custom metadata from postprocessors
Examples:
metadata_examples.php<?php use function Kreuzberg\extract_file; $result = extract_file('document.pdf'); $metadata = $result->metadata; echo "Title: " . ($metadata->title ?? 'N/A') . "\n"; echo "Authors: " . ($metadata->authors ? implode(', ', $metadata->authors) : 'N/A') . "\n"; echo "Pages: " . ($metadata->pageCount ?? 'N/A') . "\n"; echo "Created: " . ($metadata->createdAt ?? 'N/A') . "\n"; echo "Format: " . ($metadata->formatType ?? 'N/A') . "\n";
Table¶
Extracted table structure.
Signature:
PHPreadonly class Table { public function __construct( public array $cells, public string $markdown, public int $pageNumber, ); }Fields:
$cells(array>): 2D array of table cells (rows x columns) $markdown(string): Table rendered as markdown$pageNumber(int): Page number where table was found
Examples:
table_examples.php<?php use function Kreuzberg\extract_file; $result = extract_file('invoice.pdf'); foreach ($result->tables as $i => $table) { echo "Table " . ($i + 1) . " (Page {$table->pageNumber}):\n"; echo $table->markdown . "\n\n"; // Access cells directly foreach ($table->cells as $row) { foreach ($row as $cell) { echo "$cell | "; } echo "\n"; } }
Chunk¶
Text chunk with optional embeddings and metadata.
Signature:
PHPreadonly class Chunk { public function __construct( public string $text, public ?array $embedding, public ChunkMetadata $metadata, ); }Fields:
$text(string): Chunk text content$embedding(array|null): Embedding vector (if embedding generation is enabled) $metadata(ChunkMetadata): Chunk metadata (byte offsets, page numbers, etc.)
ChunkMetadata¶
Metadata for a single text chunk.
Signature:
PHPreadonly class ChunkMetadata { public function __construct( public int $byteStart, public int $byteEnd, public int $charCount, public ?int $tokenCount = null, public ?int $firstPage = null, public ?int $lastPage = null, ); }Fields:
$byteStart(int): UTF-8 byte offset in content (inclusive)$byteEnd(int): UTF-8 byte offset in content (exclusive)$charCount(int): Number of characters in chunk$tokenCount(int|null): Estimated token count (if configured)$firstPage(int|null): First page this chunk appears on (1-indexed)$lastPage(int|null): Last page this chunk appears on (1-indexed)
ExtractedImage¶
Extracted image with optional OCR results.
Signature:
PHPreadonly class ExtractedImage { public function __construct( public string $data, public string $format, public int $width, public int $height, public int $pageNumber, public ?ExtractionResult $ocrResult = null, ); }Fields:
$data(string): Image data (base64 encoded or raw bytes)$format(string): Image format (e.g., "png", "jpeg")$width(int): Image width in pixels$height(int): Image height in pixels$pageNumber(int): Page number where image was found$ocrResult(ExtractionResult|null): OCR result if OCR was performed on the image
PageContent¶
Per-page extracted content.
Signature:
PHPreadonly class PageContent { public function __construct( public int $pageNumber, public string $content, public array $tables, public array $images, ); }Fields:
$pageNumber(int): Page number (1-indexed)$content(string): Text content for this page$tables(array): Tables on this page
$images(array): Images on this page Examples:
page_content_examples.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Config\ExtractionConfig; use Kreuzberg\Config\PageConfig; $config = new ExtractionConfig( page: new PageConfig(extractPages: true) ); $kreuzberg = new Kreuzberg($config); $result = $kreuzberg->extractFile('document.pdf'); if ($result->pages !== null) { foreach ($result->pages as $page) { echo "Page {$page->pageNumber}:\n"; echo " Content: " . strlen($page->content) . " chars\n"; echo " Tables: " . count($page->tables) . "\n"; echo " Images: " . count($page->images) . "\n"; } }
Exceptions¶
KreuzbergException¶
Base exception class for all Kreuzberg errors.
Signature:
PHPclass KreuzbergException extends Exception { public function __construct( string $message = '', int $code = 0, ?Exception $previous = null ); public static function validation(string $message): self; public static function parsing(string $message): self; public static function ocr(string $message): self; public static function missingDependency(string $message): self; public static function io(string $message): self; public static function plugin(string $message): self; public static function unsupportedFormat(string $message): self; }Exception Types:
- Validation Error (code 1): Invalid configuration or input
- Parsing Error (code 2): Document parsing failure
- OCR Error (code 3): OCR processing failure
- Missing Dependency (code 4): Required system dependency not found
- I/O Error (code 5): File system or network error
- Plugin Error (code 6): Plugin execution error
- Unsupported Format (code 7): Unsupported file format
Examples:
exception_handling.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Exceptions\KreuzbergException; try { $kreuzberg = new Kreuzberg(); $result = $kreuzberg->extractFile('document.pdf'); echo $result->content; } catch (KreuzbergException $e) { echo "Error: {$e->getMessage()}\n"; echo "Code: {$e->getCode()}\n"; // Handle specific error types switch ($e->getCode()) { case 1: echo "Validation error\n"; break; case 2: echo "Parsing error\n"; break; case 3: echo "OCR error\n"; break; case 4: echo "Missing dependency\n"; break; default: echo "Unknown error\n"; } }
Advanced Topics¶
Error Handling¶
Robust error handling strategies for production applications.
Basic Error Handling:
basic_error_handling.php<?php use function Kreuzberg\extract_file; use Kreuzberg\Exceptions\KreuzbergException; try { $result = extract_file('document.pdf'); echo $result->content; } catch (KreuzbergException $e) { error_log("Extraction failed: {$e->getMessage()}"); // Handle error gracefully }Retry with Exponential Backoff:
retry_logic.php<?php use function Kreuzberg\extract_file; use Kreuzberg\Exceptions\KreuzbergException; function extractWithRetry( string $filePath, int $maxRetries = 3, int $initialDelay = 1000 ): ?string { $attempt = 0; $delay = $initialDelay; while ($attempt < $maxRetries) { try { $result = extract_file($filePath); return $result->content; } catch (KreuzbergException $e) { $attempt++; if ($attempt >= $maxRetries) { error_log("Max retries exceeded: {$e->getMessage()}"); return null; } usleep($delay * 1000); $delay *= 2; // Exponential backoff } } return null; } $content = extractWithRetry('document.pdf');Fallback Strategies:
fallback_strategies.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Config\ExtractionConfig; use Kreuzberg\Config\OcrConfig; use Kreuzberg\Exceptions\KreuzbergException; function extractWithFallback(string $filePath): ?string { // Try normal extraction try { $kreuzberg = new Kreuzberg(); $result = $kreuzberg->extractFile($filePath); if (!empty($result->content)) { return $result->content; } } catch (KreuzbergException $e) { // Fallback to OCR try { $config = new ExtractionConfig( ocr: new OcrConfig(backend: 'tesseract', language: 'eng') ); $kreuzberg = new Kreuzberg($config); $result = $kreuzberg->extractFile($filePath); return $result->content; } catch (KreuzbergException $e) { error_log("All extraction methods failed: {$e->getMessage()}"); } } return null; }
Performance Tuning¶
Optimize extraction performance for high-throughput applications.
Batch Processing:
Always use batch APIs for processing multiple documents:
batch_performance.php<?php use Kreuzberg\Kreuzberg; $kreuzberg = new Kreuzberg(); // Good: Batch processing $files = glob('documents/*.pdf'); $results = $kreuzberg->batchExtractFiles($files); // Bad: Individual processing foreach ($files as $file) { $result = $kreuzberg->extractFile($file); }Memory Management:
For large files, process in chunks or use streaming:
memory_management.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Config\ExtractionConfig; use Kreuzberg\Config\ChunkingConfig; // Enable chunking for large documents $config = new ExtractionConfig( chunking: new ChunkingConfig( maxChunkSize: 1000, chunkOverlap: 100 ) ); $kreuzberg = new Kreuzberg($config); $result = $kreuzberg->extractFile('large_document.pdf'); // Process chunks individually if ($result->chunks !== null) { foreach ($result->chunks as $chunk) { // Process chunk processChunk($chunk->text); // Free memory unset($chunk); } }Caching:
Implement caching to avoid re-extracting unchanged documents:
caching_strategy.php<?php use Kreuzberg\Kreuzberg; function extractWithCache(string $filePath): string { $cacheKey = 'extraction_' . md5($filePath . filemtime($filePath)); // Check cache $cached = apcu_fetch($cacheKey); if ($cached !== false) { return $cached; } // Extract and cache $kreuzberg = new Kreuzberg(); $result = $kreuzberg->extractFile($filePath); apcu_store($cacheKey, $result->content, 3600); return $result->content; }
Working with Images¶
Extract and process images from documents.
Basic Image Extraction:
image_extraction.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Config\ExtractionConfig; use Kreuzberg\Config\ImageExtractionConfig; $config = new ExtractionConfig( imageExtraction: new ImageExtractionConfig( extractImages: true, minWidth: 200, minHeight: 200 ) ); $kreuzberg = new Kreuzberg($config); $result = $kreuzberg->extractFile('presentation.pptx'); if ($result->images !== null) { foreach ($result->images as $i => $image) { // Save image to disk $filename = "image_{$i}.{$image->format}"; file_put_contents($filename, base64_decode($image->data)); echo "Saved {$filename}: {$image->width}x{$image->height}\n"; } }Image OCR:
image_ocr.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Config\ExtractionConfig; use Kreuzberg\Config\ImageExtractionConfig; use Kreuzberg\Config\OcrConfig; $config = new ExtractionConfig( imageExtraction: new ImageExtractionConfig( extractImages: true, performOcr: true ), ocr: new OcrConfig( backend: 'tesseract', language: 'eng' ) ); $kreuzberg = new Kreuzberg($config); $result = $kreuzberg->extractFile('scanned_images.pdf'); if ($result->images !== null) { foreach ($result->images as $image) { if ($image->ocrResult !== null) { echo "Image on page {$image->pageNumber}:\n"; echo $image->ocrResult->content . "\n\n"; } } }
Working with Tables¶
Extract and process tables from documents.
Basic Table Extraction:
table_extraction.php<?php use function Kreuzberg\extract_file; $result = extract_file('invoice.pdf'); foreach ($result->tables as $table) { echo "Table on page {$table->pageNumber}:\n"; echo $table->markdown . "\n\n"; }Convert Tables to CSV:
tables_to_csv.php<?php use function Kreuzberg\extract_file; $result = extract_file('spreadsheet.xlsx'); foreach ($result->tables as $i => $table) { $filename = "table_{$i}.csv"; $fp = fopen($filename, 'w'); foreach ($table->cells as $row) { fputcsv($fp, $row); } fclose($fp); echo "Saved {$filename}\n"; }
Multi-Format Processing¶
Handle different file formats dynamically.
Dynamic Configuration:
multi_format.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Config\ExtractionConfig; use Kreuzberg\Config\PdfConfig; use Kreuzberg\Config\OcrConfig; use function Kreuzberg\detect_mime_type_from_path; function createConfigForMimeType(string $mimeType): ExtractionConfig { return match (true) { str_contains($mimeType, 'pdf') => new ExtractionConfig( pdf: new PdfConfig(extractImages: true), extractTables: true ), str_contains($mimeType, 'image') => new ExtractionConfig( ocr: new OcrConfig(backend: 'tesseract', language: 'eng') ), str_contains($mimeType, 'spreadsheet') => new ExtractionConfig( extractTables: true ), default => new ExtractionConfig(), }; } $filePath = 'document.pdf'; $mimeType = detect_mime_type_from_path($filePath); $config = createConfigForMimeType($mimeType); $kreuzberg = new Kreuzberg($config); $result = $kreuzberg->extractFile($filePath);
System Requirements¶
PHP: 8.1.0 or higher
Native Extension:
The Kreuzberg native extension must be compiled and installed for your PHP version and platform.
Native Dependencies:
- Tesseract OCR (for OCR support):
- macOS:
brew install tesseract - Ubuntu:
apt-get install tesseract-ocr -
Windows: Download from GitHub releases
-
LibreOffice (for legacy Office formats):
- macOS:
brew install libreoffice - Ubuntu:
apt-get install libreoffice - Windows: Download from LibreOffice website
Platforms:
- Linux (x64, arm64)
- macOS (x64, arm64)
- Windows (x64)
Thread Safety¶
All Kreuzberg functions are thread-safe and can be called from multiple threads concurrently. However, for better performance with multiple files, use the batch APIs instead of threading.
Example:
thread_safety.php<?php use Kreuzberg\Kreuzberg; // Good: Use batch API $kreuzberg = new Kreuzberg(); $files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']; $results = $kreuzberg->batchExtractFiles($files); // Avoid: Manual threading // PHP's threading support is limited, and batch APIs are more efficient
Version Information¶
Troubleshooting¶
Extension Not Loading¶
If you get an error that the Kreuzberg extension is not loaded:
Check if extension is installed:
check_extension.php<?php if (!extension_loaded('kreuzberg')) { echo "Extension not loaded!\n"; echo "Check your php.ini and ensure 'extension=kreuzberg.so' is present.\n"; exit(1); } echo "Extension loaded successfully!\n";Verify php.ini configuration:
Check extension directory:
OCR Issues¶
Tesseract not found:
check_tesseract.php<?php // Check if Tesseract is available $output = shell_exec('tesseract --version 2>&1'); if ($output === null) { echo "Tesseract not found in PATH!\n"; echo "Install: brew install tesseract (macOS) or apt install tesseract-ocr (Linux)\n"; } else { echo "Tesseract version:\n{$output}\n"; } // Check for language data $langDataPaths = [ '/usr/local/share/tessdata/', // macOS Homebrew '/usr/share/tesseract-ocr/4.00/tessdata/', // Ubuntu/Debian '/usr/share/tessdata/', // Alternative Linux path ]; foreach ($langDataPaths as $path) { if (is_dir($path)) { echo "\nLanguage data found in: {$path}\n"; $files = glob($path . '*.traineddata'); echo "Available languages: " . count($files) . "\n"; foreach ($files as $file) { $lang = basename($file, '.traineddata'); echo "- {$lang}\n"; } break; } }Poor OCR accuracy:
Try these configurations:
improve_ocr.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Config\ExtractionConfig; use Kreuzberg\Config\OcrConfig; use Kreuzberg\Config\TesseractConfig; use Kreuzberg\Config\ImagePreprocessingConfig; // Configuration 1: Increase DPI $config1 = new ExtractionConfig( ocr: new OcrConfig( backend: 'tesseract', language: 'eng', imagePreprocessing: new ImagePreprocessingConfig( targetDpi: 600 // Higher DPI for small text ) ) ); // Configuration 2: Enable denoising $config2 = new ExtractionConfig( ocr: new OcrConfig( backend: 'tesseract', language: 'eng', imagePreprocessing: new ImagePreprocessingConfig( denoise: true ) ) ); // Configuration 3: Different PSM mode $config3 = new ExtractionConfig( ocr: new OcrConfig( backend: 'tesseract', language: 'eng', tesseractConfig: new TesseractConfig( psm: 11 // Sparse text mode ) ) ); // Try each configuration $kreuzberg = new Kreuzberg(); $result = $kreuzberg->extractFile('poor_quality_scan.pdf', config: $config1);Memory Issues¶
Out of memory errors:
memory_optimization.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Config\ExtractionConfig; use Kreuzberg\Config\ChunkingConfig; // Option 1: Increase memory limit ini_set('memory_limit', '512M'); // Option 2: Use chunking $config = new ExtractionConfig( chunking: new ChunkingConfig( maxChunkSize: 1000, chunkOverlap: 100 ) ); // Option 3: Process in batches $files = glob('documents/*.pdf'); $batchSize = 10; foreach (array_chunk($files, $batchSize) as $batch) { $results = batch_extract_files($batch); // Process results... // Free memory unset($results); gc_collect_cycles(); }File Permission Errors¶
check_permissions.php<?php $file = 'document.pdf'; // Check if file exists if (!file_exists($file)) { echo "File not found: {$file}\n"; exit(1); } // Check if file is readable if (!is_readable($file)) { echo "File not readable: {$file}\n"; echo "Current permissions: " . substr(sprintf('%o', fileperms($file)), -4) . "\n"; echo "Run: chmod 644 {$file}\n"; exit(1); } echo "File is accessible!\n";
Migration Guide¶
Upgrading from 3.x to 4.x¶
Breaking Changes:
-
PHP Version Requirement
-
Old: PHP 8.1+
-
New: PHP 8.2+
-
Configuration Classes Now Readonly
// Old (PHP 8.1) $config = new ExtractionConfig(); $config->extractImages = true; // Not possible anymore // New (PHP 8.2+) $config = new ExtractionConfig(extractImages: true);- Namespace Changes
- Method Signature Changes
// Old $result = $kreuzberg->extract('file.pdf'); // New $result = $kreuzberg->extractFile('file.pdf');Migration Steps:
-
Update PHP to 8.2+:
-
Update Composer dependency:
-
Update extension:
-
Update code:
// Before use Kreuzberg\Config\Config; $config = new Config(); $config->extractImages = true; $result = $kreuzberg->extract('file.pdf', $config); // After use Kreuzberg\Config\ExtractionConfig; $config = new ExtractionConfig(extractImages: true); $result = $kreuzberg->extractFile('file.pdf', config: $config);
Best Practices¶
1. Error Handling¶
Always wrap extraction calls in try-catch blocks:
best_error_handling.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Exceptions\KreuzbergException; function extractDocument(string $path): ?string { try { $kreuzberg = new Kreuzberg(); $result = $kreuzberg->extractFile($path); return $result->content; } catch (KreuzbergException $e) { // Log the error error_log("Extraction failed for {$path}: " . $e->getMessage()); // Return null or throw return null; } }2. Configuration Reuse¶
Reuse configuration objects for better performance:
reuse_config.php<?php use Kreuzberg\Kreuzberg; use Kreuzberg\Config\ExtractionConfig; use Kreuzberg\Config\OcrConfig; // Good: Create config once $config = new ExtractionConfig( ocr: new OcrConfig(backend: 'tesseract', language: 'eng'), extractTables: true ); $kreuzberg = new Kreuzberg($config); // Reuse for multiple files $result1 = $kreuzberg->extractFile('doc1.pdf'); $result2 = $kreuzberg->extractFile('doc2.pdf'); $result3 = $kreuzberg->extractFile('doc3.pdf'); // Bad: Creating new config for each file foreach ($files as $file) { $config = new ExtractionConfig(...); // Wasteful $result = (new Kreuzberg($config))->extractFile($file); }3. Batch Processing¶
Use batch APIs for multiple files:
best_batch.php<?php use Kreuzberg\Kreuzberg; $kreuzberg = new Kreuzberg(); $files = glob('documents/*.pdf'); // Good: Batch processing $results = $kreuzberg->batchExtractFiles($files); // Bad: Individual processing in a loop foreach ($files as $file) { $result = $kreuzberg->extractFile($file); // Inefficient }4. Resource Management¶
Free resources when processing large datasets:
resource_management.php<?php use function Kreuzberg\batch_extract_files; $files = glob('documents/*.pdf'); // Process in chunks to manage memory foreach (array_chunk($files, 50) as $batch) { $results = batch_extract_files($batch); // Process results... foreach ($results as $result) { // Do something with result processResult($result); } // Free memory unset($results); gc_collect_cycles(); }5. Type Validation¶
Validate inputs before processing:
validation.php<?php use Kreuzberg\Kreuzberg; function extractSafely(string $path): ?ExtractionResult { // Validate file exists if (!file_exists($path)) { throw new InvalidArgumentException("File not found: {$path}"); } // Validate file is readable if (!is_readable($path)) { throw new RuntimeException("File not readable: {$path}"); } // Validate file size (e.g., max 100MB) $maxSize = 100 * 1024 * 1024; if (filesize($path) > $maxSize) { throw new RuntimeException("File too large: {$path}"); } // Extract $kreuzberg = new Kreuzberg(); return $kreuzberg->extractFile($path); }
Framework Integration¶
Laravel¶
Service Provider:
app/Providers/KreuzbergServiceProvider.php<?php namespace App\Providers; use Illuminate\Support\ServiceProvider; use Kreuzberg\Kreuzberg; use Kreuzberg\Config\ExtractionConfig; class KreuzbergServiceProvider extends ServiceProvider { public function register(): void { $this->app->singleton(Kreuzberg::class, function ($app) { $config = new ExtractionConfig( extractTables: true, extractImages: config('kreuzberg.extract_images', false), ); return new Kreuzberg($config); }); } }Configuration:
config/kreuzberg.php<?php return [ 'extract_images' => env('KREUZBERG_EXTRACT_IMAGES', false), 'extract_tables' => env('KREUZBERG_EXTRACT_TABLES', true), 'ocr_language' => env('KREUZBERG_OCR_LANGUAGE', 'eng'), ];Usage:
app/Http/Controllers/DocumentController.php<?php namespace App\Http\Controllers; use Illuminate\Http\Request; use Kreuzberg\Kreuzberg; class DocumentController extends Controller { public function extract(Request $request, Kreuzberg $kreuzberg) { $request->validate([ 'document' => 'required|file|mimes:pdf,docx,xlsx|max:10240', ]); $file = $request->file('document'); $result = $kreuzberg->extractFile($file->getPathname()); return response()->json([ 'content' => $result->content, 'metadata' => $result->metadata, 'tables' => $result->tables, ]); } }Symfony¶
Service Configuration:
config/services.yamlservices: Kreuzberg\Kreuzberg: arguments: $defaultConfig: '@Kreuzberg\Config\ExtractionConfig' Kreuzberg\Config\ExtractionConfig: factory: ['App\Factory\KreuzbergConfigFactory', 'create']Factory:
src/Factory/KreuzbergConfigFactory.php<?php namespace App\Factory; use Kreuzberg\Config\ExtractionConfig; class KreuzbergConfigFactory { public static function create(): ExtractionConfig { return new ExtractionConfig( extractTables: true, extractImages: false, ); } }Usage:
src/Controller/DocumentController.php<?php namespace App\Controller; use Kreuzberg\Kreuzberg; use Symfony\Bundle\FrameworkBundle\Controller\AbstractController; use Symfony\Component\HttpFoundation\Request; use Symfony\Component\HttpFoundation\JsonResponse; class DocumentController extends AbstractController { public function extract(Request $request, Kreuzberg $kreuzberg): JsonResponse { $file = $request->files->get('document'); $result = $kreuzberg->extractFile($file->getPathname()); return $this->json([ 'content' => $result->content, 'metadata' => $result->metadata, ]); } }