PHP API Reference

Complete reference for the Kreuzberg PHP API.

Installation

Composer Installation

Terminal
composer require kreuzberg/kreuzberg

PHP Extension Setup

The Kreuzberg PHP package requires the native extension to be installed and enabled.

1. Install the extension:

Download the appropriate extension for your platform from the releases page, or build from source.

2. Enable the extension:

Add to your php.ini:

php.ini
; Use the line that matches your platform.
; Linux/macOS
extension=kreuzberg.so
; Windows
extension=kreuzberg.dll

3. Verify installation:

verify.php
<?php

if (!extension_loaded('kreuzberg')) {
    die("Kreuzberg extension not loaded\n");
}

echo "Kreuzberg extension loaded successfully!\n";
echo "Version: " . \Kreuzberg\Kreuzberg::version() . "\n";

Quick Start

Basic Usage (OOP API)

basic_oop.php
<?php

use Kreuzberg\Kreuzberg;

$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');

echo $result->content;
echo "Pages: {$result->metadata->pageCount}\n";

Basic Usage (Procedural API)

basic_procedural.php
<?php

use function Kreuzberg\extract_file;

$result = extract_file('document.pdf');

echo $result->content;
echo "Pages: {$result->metadata->pageCount}\n";

Extract from Bytes

extract_bytes.php
<?php

use Kreuzberg\Kreuzberg;

$kreuzberg = new Kreuzberg();
$data = file_get_contents('document.pdf');
$result = $kreuzberg->extractBytes($data, 'application/pdf');

echo $result->content;

Core Classes

Kreuzberg

Main API class for document extraction.

Signature:

PHP
final readonly class Kreuzberg
{
    public function __construct(?ExtractionConfig $defaultConfig = null);

    public function extractFile(
        string $filePath,
        ?string $mimeType = null,
        ?ExtractionConfig $config = null
    ): ExtractionResult;

    public function extractBytes(
        string $data,
        string $mimeType,
        ?ExtractionConfig $config = null
    ): ExtractionResult;

    public function batchExtractFiles(
        array $paths,
        ?ExtractionConfig $config = null
    ): array;

    public function batchExtractBytes(
        array $dataList,
        array $mimeTypes,
        ?ExtractionConfig $config = null
    ): array;

    public static function version(): string;
}

Example:

kreuzberg_class.php
<?php

use Kreuzberg\Kreuzberg;
use Kreuzberg\Config\ExtractionConfig;

// Create with default config
$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');

// Create with custom config
$config = new ExtractionConfig(extractTables: true);
$kreuzberg = new Kreuzberg($config);
$result = $kreuzberg->extractFile('document.pdf');

// Override config per call
$overrideConfig = new ExtractionConfig(extractImages: true);
$result = $kreuzberg->extractFile('document.pdf', config: $overrideConfig);

Kreuzberg::extractFile()

Extract content from a file.

Signature:

PHP
public function extractFile(
    string $filePath,
    ?string $mimeType = null,
    ?ExtractionConfig $config = null
): ExtractionResult

Parameters:

  • $filePath (string): Path to the file to extract
  • $mimeType (string|null): Optional MIME type hint. If null, MIME type is auto-detected
  • $config (ExtractionConfig|null): Extraction configuration. Uses constructor config if null

Returns:

  • ExtractionResult: Extraction result containing content, metadata, and tables

Throws:

  • KreuzbergException: If extraction fails

Examples:

extract_file_examples.php
<?php

use Kreuzberg\Kreuzberg;
use Kreuzberg\Config\ExtractionConfig;
use Kreuzberg\Config\OcrConfig;

$kreuzberg = new Kreuzberg();

// Basic extraction
$result = $kreuzberg->extractFile('document.pdf');

// With MIME type hint
$result = $kreuzberg->extractFile('document.pdf', 'application/pdf');

// With configuration
$config = new ExtractionConfig(
    ocr: new OcrConfig(backend: 'tesseract', language: 'eng')
);
$result = $kreuzberg->extractFile('scanned.pdf', config: $config);

Kreuzberg::extractBytes()

Extract content from bytes.

Signature:

PHP
public function extractBytes(
    string $data,
    string $mimeType,
    ?ExtractionConfig $config = null
): ExtractionResult

Parameters:

  • $data (string): File content as bytes
  • $mimeType (string): MIME type of the data (required for format detection)
  • $config (ExtractionConfig|null): Extraction configuration. Uses constructor config if null

Returns:

  • ExtractionResult: Extraction result containing content, metadata, and tables

Examples:

extract_bytes_examples.php
<?php

use Kreuzberg\Kreuzberg;

$kreuzberg = new Kreuzberg();

// Extract from file data
$data = file_get_contents('document.pdf');
$result = $kreuzberg->extractBytes($data, 'application/pdf');

// Extract from an uploaded file ('tmp_name' is the path of the temporary upload)
$uploadedPath = $_FILES['document']['tmp_name'];
$data = file_get_contents($uploadedPath);
$result = $kreuzberg->extractBytes($data, 'application/pdf');

Kreuzberg::batchExtractFiles()

Extract content from multiple files in parallel.

Signature:

PHP
public function batchExtractFiles(
    array $paths,
    ?ExtractionConfig $config = null
): array

Parameters:

  • $paths (array): Array of file paths to extract
  • $config (ExtractionConfig|null): Extraction configuration applied to all files

Returns:

  • array<ExtractionResult>: Array of extraction results (one per file)

Examples:

batch_extract_files.php
<?php

use Kreuzberg\Kreuzberg;

$kreuzberg = new Kreuzberg();

$files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
$results = $kreuzberg->batchExtractFiles($files);

foreach ($results as $i => $result) {
    echo "{$files[$i]}: {$result->content}\n";
}

Kreuzberg::batchExtractBytes()

Extract content from multiple byte arrays in parallel.

Signature:

PHP
public function batchExtractBytes(
    array $dataList,
    array $mimeTypes,
    ?ExtractionConfig $config = null
): array

Parameters:

  • $dataList (array): Array of file contents as bytes
  • $mimeTypes (array): Array of MIME types (one per data item, same length as $dataList)
  • $config (ExtractionConfig|null): Extraction configuration applied to all items

Returns:

  • array<ExtractionResult>: Array of extraction results (one per data item)

Examples:

batch_extract_bytes.php
<?php

use Kreuzberg\Kreuzberg;

$kreuzberg = new Kreuzberg();

$dataList = [
    file_get_contents('doc1.pdf'),
    file_get_contents('doc2.docx'),
];

$mimeTypes = [
    'application/pdf',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
];

$results = $kreuzberg->batchExtractBytes($dataList, $mimeTypes);
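
Because $dataList and $mimeTypes are parallel arrays, a length mismatch is easy to introduce. The sketch below shows one way to check the inputs before calling batchExtractBytes(). The helper name alignBatchInputs() is hypothetical and not part of the Kreuzberg API.

```php
<?php

/**
 * Hypothetical helper (not part of Kreuzberg): checks that every data item
 * has a matching MIME type and returns both lists re-indexed from 0.
 *
 * @param array<string> $dataList
 * @param array<string> $mimeTypes
 * @return array{0: array<string>, 1: array<string>}
 */
function alignBatchInputs(array $dataList, array $mimeTypes): array
{
    if (count($dataList) !== count($mimeTypes)) {
        throw new InvalidArgumentException(sprintf(
            'Got %d data items but %d MIME types',
            count($dataList),
            count($mimeTypes)
        ));
    }

    // array_values() drops custom keys so item N lines up with MIME type N.
    return [array_values($dataList), array_values($mimeTypes)];
}
```

The returned pair can be destructured with [$dataList, $mimeTypes] = alignBatchInputs($dataList, $mimeTypes); and then passed to batchExtractBytes() as in the example above.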

Procedural Functions

extract_file()

Extract content from a file (procedural API).

Signature:

PHP
function extract_file(
    string $filePath,
    ?string $mimeType = null,
    ?ExtractionConfig $config = null
): ExtractionResult

Parameters:

Same as Kreuzberg::extractFile().

Returns:

  • ExtractionResult: Extraction result containing content, metadata, and tables

Examples:

procedural_extract_file.php
<?php

use function Kreuzberg\extract_file;
use Kreuzberg\Config\ExtractionConfig;

// Basic extraction
$result = extract_file('document.pdf');

// With configuration
$config = new ExtractionConfig(extractTables: true);
$result = extract_file('document.pdf', config: $config);

extract_bytes()

Extract content from bytes (procedural API).

Signature:

PHP
function extract_bytes(
    string $data,
    string $mimeType,
    ?ExtractionConfig $config = null
): ExtractionResult

Parameters:

Same as Kreuzberg::extractBytes().

Returns:

  • ExtractionResult: Extraction result containing content, metadata, and tables

batch_extract_files()

Extract content from multiple files in parallel (procedural API).

Signature:

PHP
function batch_extract_files(
    array $paths,
    ?ExtractionConfig $config = null
): array

Parameters:

Same as Kreuzberg::batchExtractFiles().

Returns:

  • array<ExtractionResult>: Array of extraction results (one per file)

Examples:

procedural_batch.php
<?php

use function Kreuzberg\batch_extract_files;

$files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
$results = batch_extract_files($files);

foreach ($results as $result) {
    echo $result->content;
}

batch_extract_bytes()

Extract content from multiple byte arrays in parallel (procedural API).

Signature:

PHP
function batch_extract_bytes(
    array $dataList,
    array $mimeTypes,
    ?ExtractionConfig $config = null
): array

Parameters:

Same as Kreuzberg::batchExtractBytes().

Returns:

  • array<ExtractionResult>: Array of extraction results (one per data item)

detect_mime_type()

Detect MIME type from file bytes.

Signature:

PHP
function detect_mime_type(string $data): string

Parameters:

  • $data (string): File content as bytes

Returns:

  • string: Detected MIME type (e.g., "application/pdf", "image/png")

Examples:

detect_mime.php
<?php

use function Kreuzberg\detect_mime_type;

$data = file_get_contents('unknown.file');
$mimeType = detect_mime_type($data);
echo $mimeType; // "application/pdf"

detect_mime_type_from_path()

Detect MIME type from file path.

Signature:

PHP
function detect_mime_type_from_path(string $path): string

Parameters:

  • $path (string): Path to the file

Returns:

  • string: Detected MIME type (e.g., "application/pdf", "text/plain")

Examples:

detect_mime_path.php
<?php

use function Kreuzberg\detect_mime_type_from_path;

$mimeType = detect_mime_type_from_path('document.pdf');
echo $mimeType; // "application/pdf"

Configuration

ExtractionConfig

Main configuration class for extraction operations.

Signature:

PHP
readonly class ExtractionConfig
{
    public function __construct(
        public ?OcrConfig $ocr = null,
        public ?PdfConfig $pdf = null,
        public ?ChunkingConfig $chunking = null,
        public ?EmbeddingConfig $embedding = null,
        public ?ImageExtractionConfig $imageExtraction = null,
        public ?PageConfig $page = null,
        public ?LanguageDetectionConfig $languageDetection = null,
        public ?KeywordConfig $keyword = null,
        public bool $extractImages = false,
        public bool $extractTables = true,
        public bool $preserveFormatting = false,
        public ?string $outputFormat = null,
    );
}

Fields:

  • $ocr (OcrConfig|null): OCR configuration. Default: null (no OCR)
  • $pdf (PdfConfig|null): PDF-specific configuration. Default: null
  • $chunking (ChunkingConfig|null): Text chunking configuration. Default: null
  • $embedding (EmbeddingConfig|null): Embedding generation configuration. Default: null
  • $imageExtraction (ImageExtractionConfig|null): Image extraction configuration. Default: null
  • $page (PageConfig|null): Page extraction configuration. Default: null
  • $languageDetection (LanguageDetectionConfig|null): Language detection configuration. Default: null
  • $keyword (KeywordConfig|null): Keyword extraction configuration. Default: null
  • $extractImages (bool): Extract images from documents. Default: false
  • $extractTables (bool): Extract tables from documents. Default: true
  • $preserveFormatting (bool): Preserve original formatting. Default: false
  • $outputFormat (string|null): Output format ("markdown", "plain"). Default: null

Examples:

extraction_config_examples.php
<?php

use Kreuzberg\Config\ExtractionConfig;
use Kreuzberg\Config\OcrConfig;
use Kreuzberg\Config\PdfConfig;

// Basic configuration
$config = new ExtractionConfig(
    extractImages: true,
    extractTables: true
);

// Advanced configuration
$config = new ExtractionConfig(
    ocr: new OcrConfig(
        backend: 'tesseract',
        language: 'eng'
    ),
    pdf: new PdfConfig(
        extractImages: true,
        extractMetadata: true
    ),
    extractImages: true,
    extractTables: true,
    preserveFormatting: true,
    outputFormat: 'markdown'
);

OcrConfig

OCR processing configuration.

Signature:

PHP
readonly class OcrConfig
{
    public function __construct(
        public string $backend = 'tesseract',
        public string $language = 'eng',
        public ?TesseractConfig $tesseractConfig = null,
    );
}

Fields:

  • $backend (string): OCR backend to use. Options: "tesseract". Default: "tesseract"
  • $language (string): Language code for OCR (ISO 639-3). Default: "eng"
  • $tesseractConfig (TesseractConfig|null): Tesseract-specific configuration. Default: null

Examples:

ocr_config_examples.php
<?php

use Kreuzberg\Config\OcrConfig;

// Basic OCR
$config = new OcrConfig(
    backend: 'tesseract',
    language: 'eng'
);

// Multilingual OCR
$config = new OcrConfig(
    backend: 'tesseract',
    language: 'eng+fra+deu'
);
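
When the language list comes from user input or configuration, the "eng+fra+deu" string can be assembled rather than hard-coded. This is a minimal sketch; buildOcrLanguage() is a hypothetical helper, not part of the Kreuzberg API.

```php
<?php

/**
 * Hypothetical helper: joins language codes into the "eng+fra+deu" form
 * used in the multilingual example, skipping blanks and duplicates.
 *
 * @param array<string> $codes
 */
function buildOcrLanguage(array $codes): string
{
    $clean = array_filter(
        array_map('trim', $codes),
        fn (string $code): bool => $code !== ''
    );
    $unique = array_values(array_unique($clean));

    if ($unique === []) {
        throw new InvalidArgumentException('At least one language code is required');
    }

    return implode('+', $unique);
}

echo buildOcrLanguage(['eng', 'fra', 'deu', 'eng']), "\n"; // eng+fra+deu
```

The result can be passed as the language argument of OcrConfig.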

TesseractConfig

Tesseract OCR backend configuration.

Signature:

PHP
readonly class TesseractConfig
{
    public function __construct(
        public int $psm = 3,
        public int $oem = 3,
        public bool $enableTableDetection = false,
        public ?string $tesseditCharWhitelist = null,
        public ?string $tesseditCharBlacklist = null,
    );
}

Fields:

  • $psm (int): Page segmentation mode (0-13). Default: 3 (auto)
  • $oem (int): OCR engine mode (0-3). Default: 3 (engine default, based on what is available)
  • $enableTableDetection (bool): Enable table detection and extraction. Default: false
  • $tesseditCharWhitelist (string|null): Character whitelist (e.g., "0123456789" for digits only). Default: null
  • $tesseditCharBlacklist (string|null): Character blacklist. Default: null

Examples:

tesseract_config_examples.php
<?php

use Kreuzberg\Config\OcrConfig;
use Kreuzberg\Config\TesseractConfig;

$config = new OcrConfig(
    backend: 'tesseract',
    language: 'eng',
    tesseractConfig: new TesseractConfig(
        psm: 6,
        enableTableDetection: true,
        tesseditCharWhitelist: '0123456789'
    )
);

PdfConfig

PDF-specific configuration.

Signature:

PHP
readonly class PdfConfig
{
    public function __construct(
        public bool $extractImages = false,
        public bool $extractMetadata = true,
        public bool $ocrFallback = false,
        public ?int $startPage = null,
        public ?int $endPage = null,
        public int $imageQuality = 95,
    );
}

Fields:

  • $extractImages (bool): Extract images from PDF. Default: false
  • $extractMetadata (bool): Extract PDF metadata. Default: true
  • $ocrFallback (bool): Use OCR if text extraction fails. Default: false
  • $startPage (int|null): Start page for extraction (0-indexed). Default: null (all pages)
  • $endPage (int|null): End page for extraction (0-indexed). Default: null (all pages)
  • $imageQuality (int): JPEG quality for extracted images (1-100). Default: 95

Examples:

pdf_config_examples.php
<?php

use Kreuzberg\Config\PdfConfig;

// Extract specific page range
$config = new PdfConfig(
    extractImages: true,
    startPage: 0,
    endPage: 4
);

// OCR fallback for scanned PDFs
$config = new PdfConfig(
    ocrFallback: true,
    extractImages: true
);
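
Since startPage and endPage are 0-indexed while people usually quote page numbers starting at 1, a small conversion step can prevent off-by-one mistakes. The sketch below is a hypothetical helper, not part of the Kreuzberg API. It assumes endPage is inclusive, which the field list above does not state.

```php
<?php

/**
 * Hypothetical helper: converts a 1-indexed, inclusive page range (as a
 * person would write it) into 0-indexed startPage/endPage values.
 * Assumption: endPage is inclusive; the field list does not say.
 *
 * @return array{startPage: int, endPage: int}
 */
function toZeroIndexedRange(int $firstPage, int $lastPage): array
{
    if ($firstPage < 1 || $lastPage < $firstPage) {
        throw new InvalidArgumentException("Invalid page range: {$firstPage}-{$lastPage}");
    }

    return ['startPage' => $firstPage - 1, 'endPage' => $lastPage - 1];
}

$range = toZeroIndexedRange(1, 5);
// $range is ['startPage' => 0, 'endPage' => 4], the values used in the example above.
```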

ChunkingConfig

Text chunking configuration for splitting long documents.

Signature:

PHP
readonly class ChunkingConfig
{
    public function __construct(
        public int $maxChunkSize = 512,
        public int $chunkOverlap = 50,
        public bool $respectSentences = true,
        public bool $respectParagraphs = false,
    );
}

Fields:

  • $maxChunkSize (int): Maximum chunk size in characters. Default: 512
  • $chunkOverlap (int): Overlap between chunks in characters. Default: 50
  • $respectSentences (bool): Try to split at sentence boundaries. Default: true
  • $respectParagraphs (bool): Try to split at paragraph boundaries. Default: false

Examples:

chunking_config_examples.php
<?php

use Kreuzberg\Config\ChunkingConfig;

$config = new ChunkingConfig(
    maxChunkSize: 1000,
    chunkOverlap: 100,
    respectSentences: true,
    respectParagraphs: true
);
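
To build intuition for how maxChunkSize and chunkOverlap interact, the sketch below splits a string into fixed-size windows that each repeat the tail of the previous one. It is an illustration only and not Kreuzberg's chunking algorithm: it ignores sentence and paragraph boundaries, and it counts bytes, which equal characters only for ASCII input.

```php
<?php

/**
 * Illustrative sketch only, not Kreuzberg's chunker: fixed-size windows
 * that each repeat the last $chunkOverlap bytes of the previous window.
 *
 * @return array<string>
 */
function slidingWindowChunks(string $text, int $maxChunkSize, int $chunkOverlap): array
{
    if ($maxChunkSize <= 0 || $chunkOverlap < 0 || $chunkOverlap >= $maxChunkSize) {
        throw new InvalidArgumentException('Require 0 <= chunkOverlap < maxChunkSize');
    }

    $step = $maxChunkSize - $chunkOverlap; // how far each window advances
    $length = strlen($text);
    $chunks = [];

    for ($start = 0; $start < $length; $start += $step) {
        $chunks[] = substr($text, $start, $maxChunkSize);
        if ($start + $maxChunkSize >= $length) {
            break; // this window already reached the end of the text
        }
    }

    return $chunks;
}

print_r(slidingWindowChunks('abcdefghij', 4, 1)); // abcd, defg, ghij
```

Each window advances by maxChunkSize minus chunkOverlap, so a larger overlap produces more chunks for the same text.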

EmbeddingConfig

Embedding generation configuration.

Signature:

PHP
readonly class EmbeddingConfig
{
    public function __construct(
        public string $model = 'all-MiniLM-L6-v2',
        public bool $normalize = true,
        public int $batchSize = 32,
    );
}

Fields:

  • $model (string): Embedding model name. Default: "all-MiniLM-L6-v2"
  • $normalize (bool): Normalize embeddings. Default: true
  • $batchSize (int): Batch size for embedding generation. Default: 32
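
One practical reason to normalize is that unit-length vectors can be compared with a plain dot product. The sketch below assumes that normalize refers to L2 (unit-length) normalization, which the field description does not spell out. Both functions are hypothetical helpers and not part of the Kreuzberg API.

```php
<?php

/**
 * Scales a vector to unit length (L2 normalization).
 * Assumption: this is what the `normalize` option refers to.
 *
 * @param array<float> $vector
 * @return array<float>
 */
function l2Normalize(array $vector): array
{
    $norm = sqrt(array_sum(array_map(fn (float $x): float => $x * $x, $vector)));

    // Leave an all-zero vector untouched to avoid dividing by zero.
    if ($norm === 0.0) {
        return $vector;
    }

    return array_map(fn (float $x): float => $x / $norm, $vector);
}

/**
 * Dot product; for two unit-length vectors this equals cosine similarity.
 *
 * @param array<float> $a
 * @param array<float> $b
 */
function dotProduct(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same length');
    }

    return (float) array_sum(array_map(fn (float $x, float $y): float => $x * $y, $a, $b));
}

print_r(l2Normalize([3.0, 4.0])); // 0.6, 0.8
```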

ImageExtractionConfig

Image extraction configuration.

Signature:

PHP
readonly class ImageExtractionConfig
{
    public function __construct(
        public bool $extractImages = false,
        public bool $performOcr = false,
        public int $minWidth = 100,
        public int $minHeight = 100,
    );
}

Fields:

  • $extractImages (bool): Enable image extraction from documents. Default: false
  • $performOcr (bool): Perform OCR on extracted images. Default: false
  • $minWidth (int): Minimum image width in pixels. Default: 100
  • $minHeight (int): Minimum image height in pixels. Default: 100

Examples:

image_extraction_config.php
<?php

use Kreuzberg\Config\ImageExtractionConfig;

$config = new ImageExtractionConfig(
    extractImages: true,
    performOcr: true,
    minWidth: 200,
    minHeight: 200
);

PageConfig

Page extraction configuration.

Signature:

PHP
readonly class PageConfig
{
    public function __construct(
        public bool $extractPages = false,
        public bool $insertPageMarkers = false,
        public string $markerFormat = '--- Page {page_number} ---',
    );
}

Fields:

  • $extractPages (bool): Extract per-page content. Default: false
  • $insertPageMarkers (bool): Insert page markers in content. Default: false
  • $markerFormat (string): Page marker format string. Default: "--- Page {page_number} ---"

Examples:

page_config_examples.php
<?php

use Kreuzberg\Config\PageConfig;

$config = new PageConfig(
    extractPages: true,
    insertPageMarkers: true,
    markerFormat: '=== Page {page_number} ==='
);
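
The {page_number} placeholder in markerFormat is presumably replaced with the page number when markers are inserted. The sketch below illustrates that substitution in plain PHP. It is not Kreuzberg's implementation, and renderPageMarker() is a hypothetical name.

```php
<?php

/**
 * Illustrative sketch, not Kreuzberg's implementation: fills the
 * {page_number} placeholder of a marker format string.
 */
function renderPageMarker(string $markerFormat, int $pageNumber): string
{
    return str_replace('{page_number}', (string) $pageNumber, $markerFormat);
}

echo renderPageMarker('=== Page {page_number} ===', 3), "\n"; // === Page 3 ===
```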

LanguageDetectionConfig

Language detection configuration.

Signature:

PHP
readonly class LanguageDetectionConfig
{
    public function __construct(
        public bool $enabled = true,
        public int $maxLanguages = 3,
        public float $confidenceThreshold = 0.8,
    );
}

Fields:

  • $enabled (bool): Enable language detection. Default: true
  • $maxLanguages (int): Maximum number of languages to detect. Default: 3
  • $confidenceThreshold (float): Minimum confidence threshold (0.0-1.0). Default: 0.8

Examples:

language_detection_config.php
<?php

use Kreuzberg\Config\LanguageDetectionConfig;

$config = new LanguageDetectionConfig(
    enabled: true,
    maxLanguages: 5,
    confidenceThreshold: 0.9
);
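
The sketch below illustrates how maxLanguages and confidenceThreshold could combine: candidates below the threshold are dropped, and the remainder is capped at the maximum. It is an assumption about the semantics rather than Kreuzberg's detector. The score map and selectLanguages() are hypothetical.

```php
<?php

/**
 * Illustrative sketch, not Kreuzberg's detector: keeps languages whose
 * confidence meets the threshold, best first, capped at $maxLanguages.
 *
 * @param array<string, float> $scores language code => confidence (0.0-1.0)
 * @return array<string>
 */
function selectLanguages(array $scores, int $maxLanguages, float $confidenceThreshold): array
{
    $kept = array_filter(
        $scores,
        fn (float $score): bool => $score >= $confidenceThreshold
    );

    arsort($kept); // highest confidence first, keys preserved

    return array_slice(array_keys($kept), 0, $maxLanguages);
}

print_r(selectLanguages(['fr' => 0.4, 'de' => 0.85, 'en' => 0.95], 3, 0.8)); // en, de
```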

KeywordConfig

Keyword extraction configuration.

Signature:

PHP
readonly class KeywordConfig
{
    public function __construct(
        public bool $enabled = false,
        public string $algorithm = 'rake',
        public int $maxKeywords = 10,
    );
}

Fields:

  • $enabled (bool): Enable keyword extraction. Default: false
  • $algorithm (string): Keyword extraction algorithm. Options: "rake". Default: "rake"
  • $maxKeywords (int): Maximum number of keywords to extract. Default: 10

Examples:

keyword_config_examples.php
<?php

use Kreuzberg\Config\KeywordConfig;

$config = new KeywordConfig(
    enabled: true,
    algorithm: 'rake',
    maxKeywords: 15
);

ImagePreprocessingConfig

Image preprocessing configuration for OCR.

Signature:

PHP
readonly class ImagePreprocessingConfig
{
    public function __construct(
        public int $targetDpi = 300,
        public bool $autoRotate = true,
        public bool $denoise = false,
    );
}

Fields:

  • $targetDpi (int): Target DPI for image preprocessing. Default: 300
  • $autoRotate (bool): Auto-rotate images based on orientation. Default: true
  • $denoise (bool): Apply denoising filter. Default: false

Results & Types

ExtractionResult

Result object returned by all extraction functions.

Signature:

PHP
readonly class ExtractionResult
{
    public function __construct(
        public string $content,
        public string $mimeType,
        public Metadata $metadata,
        public array $tables = [],
        public ?array $detectedLanguages = null,
        public ?array $chunks = null,
        public ?array $images = null,
        public ?array $pages = null,
    );
}

Fields:

  • $content (string): Extracted text content
  • $mimeType (string): MIME type of the processed document
  • $metadata (Metadata): Document metadata (format-specific fields)
  • $tables (array): Array of extracted tables
  • $detectedLanguages (array|null): Array of detected language codes (ISO 639-1) if language detection is enabled
  • $chunks (array|null): Text chunks with embeddings and metadata
  • $images (array|null): Extracted images (with nested OCR results)
  • $pages (array|null): Per-page extracted content when page extraction is enabled

Examples:

    extraction_result_examples.php
    <?php
    
    use function Kreuzberg\extract_file;
    
    $result = extract_file('document.pdf');
    
    echo "Content: {$result->content}\n";
    echo "MIME type: {$result->mimeType}\n";
    echo "Page count: {$result->metadata->pageCount}\n";
    echo "Tables: " . count($result->tables) . "\n";
    
    if ($result->detectedLanguages !== null) {
        echo "Languages: " . implode(', ', $result->detectedLanguages) . "\n";
    }
    

    Metadata

    Document metadata with format-specific fields.

    Signature:

    PHP
    readonly class Metadata
    {
        public function __construct(
            public ?string $language = null,
            public ?string $date = null,
            public ?string $subject = null,
            public ?string $formatType = null,
            public ?string $title = null,
            public ?array $authors = null,
            public ?array $keywords = null,
            public ?string $createdAt = null,
            public ?string $modifiedAt = null,
            public ?string $createdBy = null,
            public ?string $producer = null,
            public ?int $pageCount = null,
            public array $custom = [],
        );
    }
    

    Common Fields:

    • $language (string|null): Document language (ISO 639-1 code)
    • $date (string|null): Document date (ISO 8601 format)
    • $subject (string|null): Document subject
    • $formatType (string|null): Format discriminator ("pdf", "excel", "email", etc.)
    • $title (string|null): Document title
    • $authors (array|null): Document authors
    • $keywords (array|null): Document keywords
    • $createdAt (string|null): Creation date (ISO 8601)
    • $modifiedAt (string|null): Modification date (ISO 8601)
    • $createdBy (string|null): Creator/application name
    • $producer (string|null): Producer/generator
    • $pageCount (int|null): Number of pages
    • $custom (array): Additional custom metadata from postprocessors

    Examples:

    metadata_examples.php
    <?php
    
    use function Kreuzberg\extract_file;
    
    $result = extract_file('document.pdf');
    $metadata = $result->metadata;
    
    echo "Title: " . ($metadata->title ?? 'N/A') . "\n";
    echo "Authors: " . ($metadata->authors ? implode(', ', $metadata->authors) : 'N/A') . "\n";
    echo "Pages: " . ($metadata->pageCount ?? 'N/A') . "\n";
    echo "Created: " . ($metadata->createdAt ?? 'N/A') . "\n";
    echo "Format: " . ($metadata->formatType ?? 'N/A') . "\n";
    

    Table

    Extracted table structure.

    Signature:

    PHP
    readonly class Table
    {
        public function __construct(
            public array $cells,
            public string $markdown,
            public int $pageNumber,
        );
    }
    

    Fields:

    • $cells (array<array<string>>): 2D array of table cells (rows x columns)
    • $markdown (string): Table rendered as markdown
    • $pageNumber (int): Page number where table was found

    Examples:

    table_examples.php
    <?php
    
    use function Kreuzberg\extract_file;
    
    $result = extract_file('invoice.pdf');
    
    foreach ($result->tables as $i => $table) {
        echo "Table " . ($i + 1) . " (Page {$table->pageNumber}):\n";
        echo $table->markdown . "\n\n";
    
        // Access cells directly
        foreach ($table->cells as $row) {
            foreach ($row as $cell) {
                echo "$cell | ";
            }
            echo "\n";
        }
    }
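
When a table's first row holds column names, the 2D cells array can be turned into keyed records. This is a minimal sketch with a hypothetical helper, cellsToRecords(), that is not part of the Kreuzberg API. It assumes every row has as many cells as the header row.

```php
<?php

/**
 * Hypothetical helper: treats the first row of a 2D cell array as the
 * header and returns the remaining rows keyed by column name.
 * Assumption: every row has the same number of cells as the header.
 *
 * @param array<array<string>> $cells
 * @return array<array<string, string>>
 */
function cellsToRecords(array $cells): array
{
    if ($cells === []) {
        return [];
    }

    $header = array_shift($cells); // removes and returns the first row

    return array_map(
        fn (array $row): array => array_combine($header, $row),
        $cells
    );
}

print_r(cellsToRecords([
    ['Item', 'Qty'],
    ['Pen', '2'],
    ['Ink', '1'],
]));
```

In PHP 8, array_combine() throws a ValueError when a row and the header differ in length, so ragged tables need extra handling.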
    

    Chunk

    Text chunk with optional embeddings and metadata.

    Signature:

    PHP
    readonly class Chunk
    {
        public function __construct(
            public string $text,
            public ?array $embedding,
            public ChunkMetadata $metadata,
        );
    }
    

    Fields:

    • $text (string): Chunk text content
    • $embedding (array|null): Embedding vector (if embedding generation is enabled)
    • $metadata (ChunkMetadata): Chunk metadata (byte offsets, page numbers, etc.)

    ChunkMetadata

    Metadata for a single text chunk.

    Signature:

    PHP
    readonly class ChunkMetadata
    {
        public function __construct(
            public int $byteStart,
            public int $byteEnd,
            public int $charCount,
            public ?int $tokenCount = null,
            public ?int $firstPage = null,
            public ?int $lastPage = null,
        );
    }
    

    Fields:

    • $byteStart (int): UTF-8 byte offset in content (inclusive)
    • $byteEnd (int): UTF-8 byte offset in content (exclusive)
    • $charCount (int): Number of characters in chunk
    • $tokenCount (int|null): Estimated token count (if configured)
    • $firstPage (int|null): First page this chunk appears on (1-indexed)
    • $lastPage (int|null): Last page this chunk appears on (1-indexed)
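
Because byteStart and byteEnd are byte offsets rather than character offsets, they pair with substr(), which counts bytes, and not with mb_substr(), which counts characters. The sketch below shows the difference on a UTF-8 string. sliceByByteOffsets() is a hypothetical helper and the offsets are hand-computed for this sample text.

```php
<?php

/**
 * Returns the slice of $content covered by a pair of byte offsets
 * (start inclusive, end exclusive), the convention described above.
 */
function sliceByByteOffsets(string $content, int $byteStart, int $byteEnd): string
{
    return substr($content, $byteStart, $byteEnd - $byteStart);
}

$content = "Grüße aus Berlin"; // "ü" and "ß" take two bytes each in UTF-8

echo sliceByByteOffsets($content, 0, 7), "\n";  // Grüße
echo sliceByByteOffsets($content, 8, 11), "\n"; // aus
```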

    ExtractedImage

    Extracted image with optional OCR results.

    Signature:

    PHP
    readonly class ExtractedImage
    {
        public function __construct(
            public string $data,
            public string $format,
            public int $width,
            public int $height,
            public int $pageNumber,
            public ?ExtractionResult $ocrResult = null,
        );
    }
    

    Fields:

    • $data (string): Image data (base64 encoded or raw bytes)
    • $format (string): Image format (e.g., "png", "jpeg")
    • $width (int): Image width in pixels
    • $height (int): Image height in pixels
    • $pageNumber (int): Page number where image was found
    • $ocrResult (ExtractionResult|null): OCR result if OCR was performed on the image

    PageContent

    Per-page extracted content.

    Signature:

    PHP
    readonly class PageContent
    {
        public function __construct(
            public int $pageNumber,
            public string $content,
            public array $tables,
            public array $images,
        );
    }
    

    Fields:

    • $pageNumber (int): Page number (1-indexed)
    • $content (string): Text content for this page
    • $tables (array<Table>): Tables on this page
    • $images (array<ExtractedImage>): Images on this page

    Examples:

    page_content_examples.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Config\ExtractionConfig;
    use Kreuzberg\Config\PageConfig;
    
    $config = new ExtractionConfig(
        page: new PageConfig(extractPages: true)
    );
    
    $kreuzberg = new Kreuzberg($config);
    $result = $kreuzberg->extractFile('document.pdf');
    
    if ($result->pages !== null) {
        foreach ($result->pages as $page) {
            echo "Page {$page->pageNumber}:\n";
            echo "  Content: " . strlen($page->content) . " bytes\n";
            echo "  Tables: " . count($page->tables) . "\n";
            echo "  Images: " . count($page->images) . "\n";
        }
    }
    

    Exceptions

    KreuzbergException

    Base exception class for all Kreuzberg errors.

    Signature:

    PHP
    class KreuzbergException extends Exception
    {
        public function __construct(
            string $message = '',
            int $code = 0,
            ?Exception $previous = null
        );
    
        public static function validation(string $message): self;
        public static function parsing(string $message): self;
        public static function ocr(string $message): self;
        public static function missingDependency(string $message): self;
        public static function io(string $message): self;
        public static function plugin(string $message): self;
        public static function unsupportedFormat(string $message): self;
    }
    

    Exception Types:

    • Validation Error (code 1): Invalid configuration or input
    • Parsing Error (code 2): Document parsing failure
    • OCR Error (code 3): OCR processing failure
    • Missing Dependency (code 4): Required system dependency not found
    • I/O Error (code 5): File system or network error
    • Plugin Error (code 6): Plugin execution error
    • Unsupported Format (code 7): Unsupported file format

    Examples:

    exception_handling.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Exceptions\KreuzbergException;
    
    try {
        $kreuzberg = new Kreuzberg();
        $result = $kreuzberg->extractFile('document.pdf');
        echo $result->content;
    } catch (KreuzbergException $e) {
        echo "Error: {$e->getMessage()}\n";
        echo "Code: {$e->getCode()}\n";
    
        // Handle specific error types
        switch ($e->getCode()) {
            case 1:
                echo "Validation error\n";
                break;
            case 2:
                echo "Parsing error\n";
                break;
            case 3:
                echo "OCR error\n";
                break;
            case 4:
                echo "Missing dependency\n";
                break;
            default:
                echo "Unknown error\n";
        }
    }
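
The numeric codes from the list above can also be mapped to names in one place instead of repeating a switch at every call site. This is a sketch using a match expression. describeErrorCode() is a hypothetical helper, and the labels simply mirror the Exception Types list.

```php
<?php

/**
 * Hypothetical helper: maps the documented error codes to readable names.
 */
function describeErrorCode(int $code): string
{
    return match ($code) {
        1 => 'Validation error',
        2 => 'Parsing error',
        3 => 'OCR error',
        4 => 'Missing dependency',
        5 => 'I/O error',
        6 => 'Plugin error',
        7 => 'Unsupported format',
        default => 'Unknown error',
    };
}

echo describeErrorCode(5), "\n"; // I/O error
```

Inside a catch block it can be called as describeErrorCode($e->getCode()).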
    

    Advanced Topics

    Error Handling

    Robust error handling strategies for production applications.

    Basic Error Handling:

    basic_error_handling.php
    <?php
    
    use function Kreuzberg\extract_file;
    use Kreuzberg\Exceptions\KreuzbergException;
    
    try {
        $result = extract_file('document.pdf');
        echo $result->content;
    } catch (KreuzbergException $e) {
        error_log("Extraction failed: {$e->getMessage()}");
        // Handle error gracefully
    }
    

    Retry with Exponential Backoff:

    retry_logic.php
    <?php
    
    use function Kreuzberg\extract_file;
    use Kreuzberg\Exceptions\KreuzbergException;
    
    function extractWithRetry(
        string $filePath,
        int $maxRetries = 3,
        int $initialDelay = 1000
    ): ?string {
        $attempt = 0;
        $delay = $initialDelay;
    
        while ($attempt < $maxRetries) {
            try {
                $result = extract_file($filePath);
                return $result->content;
            } catch (KreuzbergException $e) {
                $attempt++;
                if ($attempt >= $maxRetries) {
                    error_log("Max retries exceeded: {$e->getMessage()}");
                    return null;
                }
    
                usleep($delay * 1000); // $delay is in milliseconds; usleep() expects microseconds
                $delay *= 2; // Exponential backoff
            }
        }
    
        return null;
    }
    
    $content = extractWithRetry('document.pdf');
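
To see which waits a doubling schedule like the one above produces, the delays can be computed separately from the retry loop. The sketch below does that and adds an upper bound. The $maxDelayMs cap is an addition for illustration and is not present in extractWithRetry(). backoffDelays() is a hypothetical helper.

```php
<?php

/**
 * Hypothetical helper: lists the waits (in milliseconds) between attempts
 * for a doubling backoff, capped at $maxDelayMs.
 *
 * @return array<int>
 */
function backoffDelays(int $maxRetries, int $initialDelayMs = 1000, int $maxDelayMs = 30000): array
{
    $delays = [];
    $delay = $initialDelayMs;

    // A wait follows every failed attempt except the last one.
    for ($attempt = 1; $attempt < $maxRetries; $attempt++) {
        $delays[] = min($delay, $maxDelayMs);
        $delay *= 2;
    }

    return $delays;
}

print_r(backoffDelays(3));             // 1000, 2000
print_r(backoffDelays(5, 1000, 5000)); // 1000, 2000, 4000, 5000
```

Keeping the schedule in a pure function makes it easy to test without sleeping.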
    

    Fallback Strategies:

    fallback_strategies.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Config\ExtractionConfig;
    use Kreuzberg\Config\OcrConfig;
    use Kreuzberg\Exceptions\KreuzbergException;
    
    function extractWithFallback(string $filePath): ?string
    {
        // Try normal extraction first
        try {
            $kreuzberg = new Kreuzberg();
            $result = $kreuzberg->extractFile($filePath);
            if (!empty($result->content)) {
                return $result->content;
            }
        } catch (KreuzbergException $e) {
            error_log("Default extraction failed: {$e->getMessage()}");
        }
    
        // Fall back to OCR when extraction failed or returned no content
        try {
            $config = new ExtractionConfig(
                ocr: new OcrConfig(backend: 'tesseract', language: 'eng')
            );
            $kreuzberg = new Kreuzberg($config);
            $result = $kreuzberg->extractFile($filePath);
            return $result->content;
        } catch (KreuzbergException $e) {
            error_log("All extraction methods failed: {$e->getMessage()}");
        }
    
        return null;
    }
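
    When more than two strategies are involved, nesting try/catch blocks becomes hard to follow. One possible alternative, sketched below, runs a list of strategies in order and returns the first non-empty result. The helper firstSuccessful is hypothetical, and the closures are stand-ins so the example runs without the extension; in practice each would wrap an extractFile() call with a different configuration.

```php
<?php

declare(strict_types=1);

/**
 * Runs each strategy in order and returns the first non-empty string.
 * A strategy that throws or returns an empty result is skipped.
 * Hypothetical helper, not part of the Kreuzberg API.
 *
 * @param array<string, callable(): ?string> $strategies keyed by a label used in log messages
 */
function firstSuccessful(array $strategies): ?string
{
    foreach ($strategies as $label => $strategy) {
        try {
            $content = $strategy();
            if ($content !== null && trim($content) !== '') {
                return $content;
            }
            error_log("Strategy '{$label}' returned no content");
        } catch (Throwable $e) {
            error_log("Strategy '{$label}' failed: {$e->getMessage()}");
        }
    }

    return null;
}

// Stand-in closures that simulate an empty result followed by a successful one.
$content = firstSuccessful([
    'plain' => fn (): ?string => '',
    'ocr'   => fn (): ?string => 'text recovered by OCR',
]);

echo $content . "\n";
```

    The sketch catches Throwable for brevity; narrowing it to KreuzbergException keeps unrelated programming errors visible.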
    

    Performance Tuning

    Optimize extraction performance for high-throughput applications.

    Batch Processing:

    Always use batch APIs for processing multiple documents:

    batch_performance.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    
    $kreuzberg = new Kreuzberg();
    
    // Good: Batch processing
    $files = glob('documents/*.pdf');
    $results = $kreuzberg->batchExtractFiles($files);
    
    // Bad: Individual processing
    foreach ($files as $file) {
        $result = $kreuzberg->extractFile($file);
    }
    

    Memory Management:

    For large documents, enable chunking and handle the extracted text one chunk at a time:

    memory_management.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Config\ExtractionConfig;
    use Kreuzberg\Config\ChunkingConfig;
    
    // Enable chunking for large documents
    $config = new ExtractionConfig(
        chunking: new ChunkingConfig(
            maxChunkSize: 1000,
            chunkOverlap: 100
        )
    );
    
    $kreuzberg = new Kreuzberg($config);
    $result = $kreuzberg->extractFile('large_document.pdf');
    
    // Process chunks individually
    if ($result->chunks !== null) {
        foreach ($result->chunks as $chunk) {
            // processChunk() is a placeholder for your own handler
            processChunk($chunk->text);
        }
    }
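
    To get a feel for how maxChunkSize and chunkOverlap interact, the arithmetic can be sketched as follows. This assumes plain fixed-size windows measured in characters, which is a simplification: the actual chunker may split on other boundaries and produce a different count. The function estimateChunkCount is a hypothetical helper.

```php
<?php

declare(strict_types=1);

/**
 * Estimates how many chunks a text of $length characters yields when a
 * window of $maxChunkSize advances by ($maxChunkSize - $chunkOverlap).
 * ASSUMPTION: plain fixed-size windows; a real chunker may split differently.
 */
function estimateChunkCount(int $length, int $maxChunkSize, int $chunkOverlap): int
{
    if ($maxChunkSize <= 0 || $chunkOverlap < 0 || $chunkOverlap >= $maxChunkSize) {
        throw new InvalidArgumentException('Require 0 <= chunkOverlap < maxChunkSize');
    }
    if ($length <= 0) {
        return 0;
    }
    if ($length <= $maxChunkSize) {
        return 1;
    }

    // Each chunk after the first contributes only $step new characters.
    $step = $maxChunkSize - $chunkOverlap;

    return 1 + intdiv($length - $maxChunkSize + $step - 1, $step);
}

echo estimateChunkCount(10000, 1000, 100) . "\n";
```

    Under that assumption, a 10,000-character text with the settings above gives 11 chunks, because every chunk after the first adds 900 new characters.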
    

    Caching:

    Implement caching to avoid re-extracting unchanged documents. This example stores results with APCu, so the apcu extension must be enabled:

    caching_strategy.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    
    function extractWithCache(string $filePath): string
    {
        $cacheKey = 'extraction_' . md5($filePath . filemtime($filePath));
    
        // Check cache
        $cached = apcu_fetch($cacheKey);
        if ($cached !== false) {
            return $cached;
        }
    
        // Extract and cache
        $kreuzberg = new Kreuzberg();
        $result = $kreuzberg->extractFile($filePath);
    
        apcu_store($cacheKey, $result->content, 3600);
    
        return $result->content;
    }
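
    The key above depends on the path and modification time, so a copy of the same file under another name is extracted again. A sketch of keying by content hash instead is shown below. ContentHashCache is a hypothetical class, an in-memory array stands in for APCu, and the extractor closure stands in for a real extraction call so the example runs without the extension.

```php
<?php

declare(strict_types=1);

/** Hypothetical in-memory cache keyed by the SHA-256 of the file contents. */
final class ContentHashCache
{
    /** @var array<string, string> */
    private array $entries = [];

    /** @param callable(string): string $extractor receives the file path */
    public function get(string $filePath, callable $extractor): string
    {
        $hash = hash_file('sha256', $filePath);
        if ($hash === false) {
            throw new RuntimeException("Cannot read file: {$filePath}");
        }

        // Identical bytes produce the same key, regardless of path or mtime.
        return $this->entries[$hash] ??= $extractor($filePath);
    }
}

$path = tempnam(sys_get_temp_dir(), 'kz');
file_put_contents($path, 'hello');

$calls = 0;
$extractor = function (string $p) use (&$calls): string {
    $calls++;
    return strtoupper(file_get_contents($p)); // stand-in for real extraction
};

$cache = new ContentHashCache();
$first = $cache->get($path, $extractor);
$second = $cache->get($path, $extractor);
unlink($path);

echo "{$first} / extractor calls: {$calls}\n";
```

    Hashing reads the whole file on every lookup, which is usually far cheaper than extraction but not free for very large inputs.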
    

    Working with Images

    Extract and process images from documents.

    Basic Image Extraction:

    image_extraction.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Config\ExtractionConfig;
    use Kreuzberg\Config\ImageExtractionConfig;
    
    $config = new ExtractionConfig(
        imageExtraction: new ImageExtractionConfig(
            extractImages: true,
            minWidth: 200,
            minHeight: 200
        )
    );
    
    $kreuzberg = new Kreuzberg($config);
    $result = $kreuzberg->extractFile('presentation.pptx');
    
    if ($result->images !== null) {
        foreach ($result->images as $i => $image) {
            // Save image to disk
            $filename = "image_{$i}.{$image->format}";
            file_put_contents($filename, base64_decode($image->data));
    
            echo "Saved {$filename}: {$image->width}x{$image->height}\n";
        }
    }
    

    Image OCR:

    image_ocr.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Config\ExtractionConfig;
    use Kreuzberg\Config\ImageExtractionConfig;
    use Kreuzberg\Config\OcrConfig;
    
    $config = new ExtractionConfig(
        imageExtraction: new ImageExtractionConfig(
            extractImages: true,
            performOcr: true
        ),
        ocr: new OcrConfig(
            backend: 'tesseract',
            language: 'eng'
        )
    );
    
    $kreuzberg = new Kreuzberg($config);
    $result = $kreuzberg->extractFile('scanned_images.pdf');
    
    if ($result->images !== null) {
        foreach ($result->images as $image) {
            if ($image->ocrResult !== null) {
                echo "Image on page {$image->pageNumber}:\n";
                echo $image->ocrResult->content . "\n\n";
            }
        }
    }
    

    Working with Tables

    Extract and process tables from documents.

    Basic Table Extraction:

    table_extraction.php
    <?php
    
    use function Kreuzberg\extract_file;
    
    $result = extract_file('invoice.pdf');
    
    foreach ($result->tables as $table) {
        echo "Table on page {$table->pageNumber}:\n";
        echo $table->markdown . "\n\n";
    }
    

    Convert Tables to CSV:

    tables_to_csv.php
    <?php
    
    use function Kreuzberg\extract_file;
    
    $result = extract_file('spreadsheet.xlsx');
    
    foreach ($result->tables as $i => $table) {
        $filename = "table_{$i}.csv";
        $fp = fopen($filename, 'w');
    
        foreach ($table->cells as $row) {
            fputcsv($fp, $row);
        }
    
        fclose($fp);
        echo "Saved {$filename}\n";
    }
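
    When the CSV is needed as a string, for example to return it in an HTTP response, the same fputcsv() approach can write to a temporary stream instead of a file. The sketch below uses a hypothetical helper cellsToCsv and assumes the cells are a two-dimensional array of scalar values, as in the loop above.

```php
<?php

declare(strict_types=1);

/**
 * Serializes a two-dimensional array of cell values to a CSV string.
 * Hypothetical helper, not part of the Kreuzberg API.
 *
 * @param array<int, array<int, string|int|float|null>> $rows
 */
function cellsToCsv(array $rows): string
{
    $fp = fopen('php://temp', 'r+');
    if ($fp === false) {
        throw new RuntimeException('Unable to open temporary stream');
    }

    foreach ($rows as $row) {
        // Delimiter, enclosure and escape are passed explicitly so the
        // behavior does not depend on version-specific defaults.
        fputcsv($fp, $row, ',', '"', '\\');
    }

    rewind($fp);
    $csv = stream_get_contents($fp);
    fclose($fp);

    return $csv === false ? '' : $csv;
}

echo cellsToCsv([
    ['Item', 'Price'],
    ['Widget, large', '9.99'],
]);
```

    Fields that contain the delimiter are quoted automatically, so "Widget, large" stays a single column.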
    

    Multi-Format Processing

    Handle different file formats dynamically.

    Dynamic Configuration:

    multi_format.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Config\ExtractionConfig;
    use Kreuzberg\Config\PdfConfig;
    use Kreuzberg\Config\OcrConfig;
    use function Kreuzberg\detect_mime_type_from_path;
    
    function createConfigForMimeType(string $mimeType): ExtractionConfig
    {
        return match (true) {
            str_contains($mimeType, 'pdf') => new ExtractionConfig(
                pdf: new PdfConfig(extractImages: true),
                extractTables: true
            ),
            str_contains($mimeType, 'image') => new ExtractionConfig(
                ocr: new OcrConfig(backend: 'tesseract', language: 'eng')
            ),
            str_contains($mimeType, 'spreadsheet') => new ExtractionConfig(
                extractTables: true
            ),
            default => new ExtractionConfig(),
        };
    }
    
    $filePath = 'document.pdf';
    $mimeType = detect_mime_type_from_path($filePath);
    $config = createConfigForMimeType($mimeType);
    
    $kreuzberg = new Kreuzberg($config);
    $result = $kreuzberg->extractFile($filePath);
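
    Substring checks on MIME types are convenient but easy to get subtly wrong. The sketch below mirrors the match (true) dispatch with plain labels so the branch taken by common MIME types can be inspected without the extension. The function profileForMimeType is a hypothetical stand-in for createConfigForMimeType.

```php
<?php

declare(strict_types=1);

/** Returns a label for the branch the substring checks would take. */
function profileForMimeType(string $mimeType): string
{
    return match (true) {
        str_contains($mimeType, 'pdf') => 'pdf',
        str_contains($mimeType, 'image') => 'image',
        str_contains($mimeType, 'spreadsheet') => 'spreadsheet',
        default => 'default',
    };
}

$samples = [
    'application/pdf',
    'image/png',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', // .xlsx
    'application/vnd.oasis.opendocument.spreadsheet',                    // .ods
    'application/vnd.ms-excel',                                          // legacy .xls
];

foreach ($samples as $mimeType) {
    echo str_pad(profileForMimeType($mimeType), 12) . $mimeType . "\n";
}
```

    Note that application/vnd.ms-excel does not contain "spreadsheet" and therefore falls through to the default branch; an extra arm would be needed if legacy .xls files should get the spreadsheet configuration.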
    

    System Requirements

    PHP: 8.2.0 or higher

    Native Extension:

    The Kreuzberg native extension must be compiled and installed for your PHP version and platform.

    Native Dependencies:

    • Tesseract OCR (for OCR support):
      • macOS: brew install tesseract
      • Ubuntu: apt-get install tesseract-ocr
      • Windows: Download from GitHub releases

    • LibreOffice (for legacy Office formats):
      • macOS: brew install libreoffice
      • Ubuntu: apt-get install libreoffice
      • Windows: Download from LibreOffice website

    Platforms:

    • Linux (x64, arm64)
    • macOS (x64, arm64)
    • Windows (x64)
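
    A script can compare the current machine against the platform list above before attempting to load the extension. The sketch below is a hypothetical helper, not part of the Kreuzberg API. It takes the values of PHP_OS_FAMILY and php_uname('m') as input and normalizes the different architecture spellings that operating systems report.

```php
<?php

declare(strict_types=1);

/**
 * Checks an OS family / CPU architecture pair against the platform list
 * in this section. Inputs are the values of PHP_OS_FAMILY and php_uname('m').
 */
function isSupportedPlatform(string $osFamily, string $machine): bool
{
    // Normalize the different spellings operating systems report.
    $arch = match (strtolower($machine)) {
        'x86_64', 'amd64' => 'x64',
        'arm64', 'aarch64' => 'arm64',
        default => 'other',
    };

    $supported = [
        'Linux' => ['x64', 'arm64'],
        'Darwin' => ['x64', 'arm64'], // macOS
        'Windows' => ['x64'],
    ];

    return in_array($arch, $supported[$osFamily] ?? [], true);
}

$ok = isSupportedPlatform(PHP_OS_FAMILY, php_uname('m'));
echo PHP_OS_FAMILY . ' ' . php_uname('m') . ': ' . ($ok ? 'supported' : 'not listed') . "\n";
```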

    Thread Safety

    All Kreuzberg functions are thread-safe and can be called from multiple threads concurrently. However, for better performance with multiple files, use the batch APIs instead of threading.

    Example:

    thread_safety.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    
    // Good: Use batch API
    $kreuzberg = new Kreuzberg();
    $files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];
    $results = $kreuzberg->batchExtractFiles($files);
    
    // Avoid: Manual threading
    // PHP's threading support is limited, and batch APIs are more efficient
    

    Version Information

    version.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    
    echo "Kreuzberg version: " . Kreuzberg::version() . "\n";
    

    Troubleshooting

    Extension Not Loading

    If you get an error that the Kreuzberg extension is not loaded:

    Check if extension is installed:

    check_extension.php
    <?php
    
    if (!extension_loaded('kreuzberg')) {
        echo "Extension not loaded!\n";
        echo "Check your php.ini and ensure 'extension=kreuzberg.so' (or 'extension=kreuzberg.dll' on Windows) is present.\n";
        exit(1);
    }
    
    echo "Extension loaded successfully!\n";
    

    Verify php.ini configuration:

    php --ini
    php -m | grep kreuzberg
    

    Check extension directory:

    php -i | grep extension_dir
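
    The same checks can be run from PHP, which helps when the CLI and the web server load different php.ini files. The sketch below assumes the extension file is named kreuzberg followed by the platform suffix, as in the php.ini lines from the installation section. expectedExtensionPath is a hypothetical helper.

```php
<?php

declare(strict_types=1);

/** Builds the path where PHP would look for a shared extension file. */
function expectedExtensionPath(string $extensionDir, string $name, string $suffix): string
{
    return rtrim($extensionDir, '/\\') . DIRECTORY_SEPARATOR . $name . '.' . $suffix;
}

$iniFile = php_ini_loaded_file();
$extensionDir = (string) ini_get('extension_dir');
// PHP_SHLIB_SUFFIX is "so" on Linux/macOS and "dll" on Windows.
$path = expectedExtensionPath($extensionDir, 'kreuzberg', PHP_SHLIB_SUFFIX);

echo 'Loaded php.ini: ' . ($iniFile !== false ? $iniFile : '(none)') . "\n";
echo "Expected file:  {$path}\n";
echo 'File present:   ' . (is_file($path) ? 'yes' : 'no') . "\n";
echo 'Extension live: ' . (extension_loaded('kreuzberg') ? 'yes' : 'no') . "\n";
```

    If the file is present but the extension is not loaded, the extension line is probably missing from the php.ini reported on the first line, or the binary was built for a different PHP version.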
    

    OCR Issues

    Tesseract not found:

    check_tesseract.php
    <?php
    
    // Check if Tesseract is available. The exit code is checked because a
    // missing binary still produces shell output ("command not found").
    exec('tesseract --version 2>&1', $outputLines, $exitCode);
    
    if ($exitCode !== 0) {
        echo "Tesseract not found in PATH!\n";
        echo "Install: brew install tesseract (macOS) or apt install tesseract-ocr (Linux)\n";
    } else {
        echo "Tesseract version:\n" . implode("\n", $outputLines) . "\n";
    }
    
    // Check for language data
    $langDataPaths = [
        '/usr/local/share/tessdata/',  // macOS Homebrew
        '/usr/share/tesseract-ocr/4.00/tessdata/',  // Ubuntu/Debian
        '/usr/share/tessdata/',  // Alternative Linux path
    ];
    
    foreach ($langDataPaths as $path) {
        if (is_dir($path)) {
            echo "\nLanguage data found in: {$path}\n";
            $files = glob($path . '*.traineddata');
            echo "Available languages: " . count($files) . "\n";
            foreach ($files as $file) {
                $lang = basename($file, '.traineddata');
                echo "- {$lang}\n";
            }
            break;
        }
    }
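
    Scanning fixed directories misses installations in other locations. An alternative is to ask Tesseract itself with tesseract --list-langs and parse the output. The sketch below uses a hypothetical helper, parseTesseractLanguages, and assumes the output consists of one header line followed by one language code per line; the sample string is illustrative, not captured from a real run.

```php
<?php

declare(strict_types=1);

/**
 * Parses the output of `tesseract --list-langs` into language codes.
 * ASSUMPTION: the first line is a header and each following non-empty
 * line holds one language code.
 *
 * @return list<string>
 */
function parseTesseractLanguages(string $output): array
{
    $lines = preg_split('/\R/', trim($output)) ?: [];
    array_shift($lines); // drop the header line

    $codes = array_filter(
        array_map('trim', $lines),
        fn (string $line): bool => $line !== ''
    );

    return array_values($codes);
}

// Illustrative sample; real output would come from shell_exec('tesseract --list-langs 2>&1').
$sample = "List of available languages in \"/usr/share/tessdata/\" (3):\neng\ndeu\nosd\n";
$languages = parseTesseractLanguages($sample);

echo in_array('deu', $languages, true) ? "German data installed\n" : "German data missing\n";
```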
    

    Poor OCR accuracy:

    Try these configurations:

    improve_ocr.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Config\ExtractionConfig;
    use Kreuzberg\Config\OcrConfig;
    use Kreuzberg\Config\TesseractConfig;
    use Kreuzberg\Config\ImagePreprocessingConfig;
    
    // Configuration 1: Increase DPI
    $config1 = new ExtractionConfig(
        ocr: new OcrConfig(
            backend: 'tesseract',
            language: 'eng',
            imagePreprocessing: new ImagePreprocessingConfig(
                targetDpi: 600  // Higher DPI for small text
            )
        )
    );
    
    // Configuration 2: Enable denoising
    $config2 = new ExtractionConfig(
        ocr: new OcrConfig(
            backend: 'tesseract',
            language: 'eng',
            imagePreprocessing: new ImagePreprocessingConfig(
                denoise: true
            )
        )
    );
    
    // Configuration 3: Different PSM mode
    $config3 = new ExtractionConfig(
        ocr: new OcrConfig(
            backend: 'tesseract',
            language: 'eng',
            tesseractConfig: new TesseractConfig(
                psm: 11  // Sparse text mode
            )
        )
    );
    
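    // Try each configuration and compare how much text each one recovers
    $kreuzberg = new Kreuzberg();
    foreach ([$config1, $config2, $config3] as $i => $config) {
        $result = $kreuzberg->extractFile('poor_quality_scan.pdf', config: $config);
        echo 'Configuration ' . ($i + 1) . ': ' . strlen($result->content) . " bytes of text\n";
    }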
    

    Memory Issues

    Out of memory errors:

    memory_optimization.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Config\ExtractionConfig;
    use Kreuzberg\Config\ChunkingConfig;
    use function Kreuzberg\batch_extract_files;
    
    // Option 1: Increase memory limit
    ini_set('memory_limit', '512M');
    
    // Option 2: Use chunking
    $config = new ExtractionConfig(
        chunking: new ChunkingConfig(
            maxChunkSize: 1000,
            chunkOverlap: 100
        )
    );
    
    // Option 3: Process in batches
    $files = glob('documents/*.pdf');
    $batchSize = 10;
    
    foreach (array_chunk($files, $batchSize) as $batch) {
        $results = batch_extract_files($batch);
    
        // Process results...
    
        // Free memory
        unset($results);
        gc_collect_cycles();
    }
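
    Rather than guessing a batch size, a script can compare its peak memory usage with the configured limit and shrink the batch when headroom runs low. The sketch below uses two hypothetical helpers, memoryLimitToBytes and isNearMemoryLimit. It handles the K, M and G suffixes used in php.ini and treats -1 as unlimited.

```php
<?php

declare(strict_types=1);

/**
 * Converts a php.ini size such as "512M" to bytes.
 * Returns -1 for "-1" (no limit).
 */
function memoryLimitToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '-1') {
        return -1;
    }

    $number = (int) $value; // the cast reads the leading digits only
    return match (strtoupper(substr($value, -1))) {
        'G' => $number * 1024 ** 3,
        'M' => $number * 1024 ** 2,
        'K' => $number * 1024,
        default => $number,
    };
}

/** True when peak usage has reached $ratio of the limit (for example 0.8). */
function isNearMemoryLimit(int $peakBytes, int $limitBytes, float $ratio = 0.8): bool
{
    return $limitBytes > 0 && $peakBytes >= $limitBytes * $ratio;
}

$limit = memoryLimitToBytes((string) ini_get('memory_limit'));
$peak = memory_get_peak_usage(true);

echo "Limit: {$limit} bytes, peak: {$peak} bytes\n";
echo isNearMemoryLimit($peak, $limit) ? "Consider a smaller batch size\n" : "Headroom available\n";
```

    Calling isNearMemoryLimit() after each batch in the loop above gives a simple signal for reducing $batchSize before the limit is hit.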
    

    File Permission Errors

    check_permissions.php
    <?php
    
    $file = 'document.pdf';
    
    // Check if file exists
    if (!file_exists($file)) {
        echo "File not found: {$file}\n";
        exit(1);
    }
    
    // Check if file is readable
    if (!is_readable($file)) {
        echo "File not readable: {$file}\n";
        echo "Current permissions: " . substr(sprintf('%o', fileperms($file)), -4) . "\n";
        echo "Run: chmod 644 {$file}\n";
        exit(1);
    }
    
    echo "File is accessible!\n";
    

    Migration Guide

    Upgrading from 3.x to 4.x

    Breaking Changes:

    1. PHP Version Requirement

      • Old: PHP 8.1+
      • New: PHP 8.2+

    2. Configuration Classes Now Readonly

      // Old (PHP 8.1)
      $config = new ExtractionConfig();
      $config->extractImages = true;  // Not possible anymore
      
      // New (PHP 8.2+)
      $config = new ExtractionConfig(extractImages: true);
      
    3. Namespace Changes

      // Old
      use Kreuzberg\Config\Config;
      
      // New
      use Kreuzberg\Config\ExtractionConfig;
      
    4. Method Signature Changes

      // Old
      $result = $kreuzberg->extract('file.pdf');
      
      // New
      $result = $kreuzberg->extractFile('file.pdf');
    

    Migration Steps:

    1. Update PHP to 8.2+:

      php -v  # Check current version
      

    2. Update Composer dependency:

      composer require kreuzberg/kreuzberg:^4.0
      

    3. Update extension:

      pie install kreuzberg/kreuzberg-ext:^4.0
      

    4. Update code:

      // Before
      use Kreuzberg\Config\Config;
      
      $config = new Config();
      $config->extractImages = true;
      $result = $kreuzberg->extract('file.pdf', $config);
      
      // After
      use Kreuzberg\Config\ExtractionConfig;
      
      $config = new ExtractionConfig(extractImages: true);
      $result = $kreuzberg->extractFile('file.pdf', config: $config);
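
    In a larger codebase it can help to locate the old patterns before editing. The sketch below scans source text for the three 3.x usages named in this guide. findLegacyUsages is a hypothetical helper, and the regular expressions are heuristics: the ->extract( pattern, for instance, also matches unrelated objects that happen to have an extract() method.

```php
<?php

declare(strict_types=1);

/**
 * Returns the 3.x patterns from the migration guide that appear in $source,
 * as a list of "line N: description" strings. Heuristic only.
 *
 * @return list<string>
 */
function findLegacyUsages(string $source): array
{
    $patterns = [
        '/use\s+Kreuzberg\\\\Config\\\\Config\s*;/' => 'imports Kreuzberg\Config\Config',
        '/->extract\s*\(/' => 'calls ->extract()',
        '/\$\w+->extractImages\s*=/' => 'assigns extractImages after construction',
    ];

    $findings = [];
    foreach (preg_split('/\R/', $source) ?: [] as $index => $line) {
        foreach ($patterns as $pattern => $description) {
            if (preg_match($pattern, $line) === 1) {
                $findings[] = 'line ' . ($index + 1) . ": {$description}";
            }
        }
    }

    return $findings;
}

$legacy = <<<'PHP'
use Kreuzberg\Config\Config;

$config = new Config();
$config->extractImages = true;
$result = $kreuzberg->extract('file.pdf', $config);
PHP;

print_r(findLegacyUsages($legacy));
```

    Running it over each file, for example with file_get_contents(), yields a checklist of lines to update.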
      


    Best Practices

    1. Error Handling

    Always wrap extraction calls in try-catch blocks:

    best_error_handling.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Exceptions\KreuzbergException;
    
    function extractDocument(string $path): ?string
    {
        try {
            $kreuzberg = new Kreuzberg();
            $result = $kreuzberg->extractFile($path);
            return $result->content;
        } catch (KreuzbergException $e) {
            // Log the error
            error_log("Extraction failed for {$path}: " . $e->getMessage());
    
            // Return null or throw
            return null;
        }
    }
    

    2. Configuration Reuse

    Reuse configuration objects for better performance:

    reuse_config.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Config\ExtractionConfig;
    use Kreuzberg\Config\OcrConfig;
    
    // Good: Create config once
    $config = new ExtractionConfig(
        ocr: new OcrConfig(backend: 'tesseract', language: 'eng'),
        extractTables: true
    );
    
    $kreuzberg = new Kreuzberg($config);
    
    // Reuse for multiple files
    $result1 = $kreuzberg->extractFile('doc1.pdf');
    $result2 = $kreuzberg->extractFile('doc2.pdf');
    $result3 = $kreuzberg->extractFile('doc3.pdf');
    
    // Bad: Creating a new config and a new instance for each file
    $files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];
    foreach ($files as $file) {
        $config = new ExtractionConfig(extractTables: true);  // Wasteful
        $result = (new Kreuzberg($config))->extractFile($file);
    }
    

    3. Batch Processing

    Use batch APIs for multiple files:

    best_batch.php
    <?php
    
    use Kreuzberg\Kreuzberg;
    
    $kreuzberg = new Kreuzberg();
    $files = glob('documents/*.pdf');
    
    // Good: Batch processing
    $results = $kreuzberg->batchExtractFiles($files);
    
    // Bad: Individual processing in a loop
    foreach ($files as $file) {
        $result = $kreuzberg->extractFile($file);  // Inefficient
    }
    

    4. Resource Management

    Free resources when processing large datasets:

    resource_management.php
    <?php
    
    use function Kreuzberg\batch_extract_files;
    
    $files = glob('documents/*.pdf');
    
    // Process in chunks to manage memory
    foreach (array_chunk($files, 50) as $batch) {
        $results = batch_extract_files($batch);
    
        // Process results...
        foreach ($results as $result) {
            // Do something with result
            processResult($result);
        }
    
        // Free memory
        unset($results);
        gc_collect_cycles();
    }
    

    5. Input Validation

    Validate inputs before processing:

    validation.php
    <?php
    
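    use Kreuzberg\Kreuzberg;
    // ExtractionResult must be imported as well; use the namespace in which
    // your installed version declares it.
    
    // Throws on invalid input, so the return type is not nullable
    function extractSafely(string $path): ExtractionResult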
    {
        // Validate file exists
        if (!file_exists($path)) {
            throw new InvalidArgumentException("File not found: {$path}");
        }
    
        // Validate file is readable
        if (!is_readable($path)) {
            throw new RuntimeException("File not readable: {$path}");
        }
    
        // Validate file size (e.g., max 100MB)
        $maxSize = 100 * 1024 * 1024;
        if (filesize($path) > $maxSize) {
            throw new RuntimeException("File too large: {$path}");
        }
    
        // Extract
        $kreuzberg = new Kreuzberg();
        return $kreuzberg->extractFile($path);
    }
    

    Framework Integration

    Laravel

    Service Provider:

    app/Providers/KreuzbergServiceProvider.php
    <?php
    
    namespace App\Providers;
    
    use Illuminate\Support\ServiceProvider;
    use Kreuzberg\Kreuzberg;
    use Kreuzberg\Config\ExtractionConfig;
    
    class KreuzbergServiceProvider extends ServiceProvider
    {
        public function register(): void
        {
            $this->app->singleton(Kreuzberg::class, function ($app) {
                // Read both flags from config/kreuzberg.php
                $config = new ExtractionConfig(
                    extractTables: (bool) config('kreuzberg.extract_tables', true),
                    extractImages: (bool) config('kreuzberg.extract_images', false),
                );
    
                return new Kreuzberg($config);
            });
        }
    }
    

    Configuration:

    config/kreuzberg.php
    <?php
    
    return [
        'extract_images' => env('KREUZBERG_EXTRACT_IMAGES', false),
        'extract_tables' => env('KREUZBERG_EXTRACT_TABLES', true),
        'ocr_language' => env('KREUZBERG_OCR_LANGUAGE', 'eng'),
    ];
    

    Usage:

    app/Http/Controllers/DocumentController.php
    <?php
    
    namespace App\Http\Controllers;
    
    use Illuminate\Http\Request;
    use Kreuzberg\Kreuzberg;
    
    class DocumentController extends Controller
    {
        public function extract(Request $request, Kreuzberg $kreuzberg)
        {
            $request->validate([
                'document' => 'required|file|mimes:pdf,docx,xlsx|max:10240',
            ]);
    
            $file = $request->file('document');
            $result = $kreuzberg->extractFile($file->getPathname());
    
            return response()->json([
                'content' => $result->content,
                'metadata' => $result->metadata,
                'tables' => $result->tables,
            ]);
        }
    }
    

    Symfony

    Service Configuration:

    config/services.yaml
    services:
        Kreuzberg\Kreuzberg:
            arguments:
                $defaultConfig: '@Kreuzberg\Config\ExtractionConfig'
    
        Kreuzberg\Config\ExtractionConfig:
            factory: ['App\Factory\KreuzbergConfigFactory', 'create']
    

    Factory:

    src/Factory/KreuzbergConfigFactory.php
    <?php
    
    namespace App\Factory;
    
    use Kreuzberg\Config\ExtractionConfig;
    
    class KreuzbergConfigFactory
    {
        public static function create(): ExtractionConfig
        {
            return new ExtractionConfig(
                extractTables: true,
                extractImages: false,
            );
        }
    }
    

    Usage:

    src/Controller/DocumentController.php
    <?php
    
    namespace App\Controller;
    
    use Kreuzberg\Kreuzberg;
    use Symfony\Bundle\FrameworkBundle\Controller\AbstractController;
    use Symfony\Component\HttpFoundation\Request;
    use Symfony\Component\HttpFoundation\JsonResponse;
    
    class DocumentController extends AbstractController
    {
        public function extract(Request $request, Kreuzberg $kreuzberg): JsonResponse
        {
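            $file = $request->files->get('document');
            if ($file === null) {
                // No file was uploaded under the "document" field
                return $this->json(['error' => 'No document uploaded'], 400);
            }
    
            $result = $kreuzberg->extractFile($file->getPathname());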
    
            return $this->json([
                'content' => $result->content,
                'metadata' => $result->metadata,
            ]);
        }
    }