Plugin System v5.0.0

Kreuzberg's extraction pipeline is entirely plugin-driven. Every format extractor, OCR engine, post-processor, validator, and renderer is a plugin that registers itself into a typed registry. The pipeline queries these registries at each stage to find the right handler. You extend Kreuzberg by writing your own plugin and registering it. The pipeline picks it up automatically.

This page explains the five plugin types, the registry mechanism, the plugin lifecycle, and how plugins work across language boundaries.


Overview

The plugin system has three layers: plugins, registries, and the pipeline. Plugins implement a trait. Registries store them by key (MIME type, name, or processing stage). The pipeline queries the registries during extraction.

flowchart TB
    subgraph layer1 ["You write plugins"]
        direction LR
        E["DocumentExtractor\n<i>Handles a file format</i>"]
        O["OcrBackend\n<i>Runs OCR on images</i>"]
        V["Validator\n<i>Rejects bad results</i>"]
        P["PostProcessor\n<i>Transforms results</i>"]
        R["Renderer\n<i>Formats output</i>"]
    end

    subgraph layer2 ["Registries store them"]
        direction LR
        ER["Extractor Registry\n<i>MIME type → extractor</i>"]
        OR["OCR Registry\n<i>name → backend</i>"]
        VR["Validator Registry\n<i>name → validator</i>"]
        PR["Processor Registry\n<i>stage → processors</i>"]
        RR["Renderer Registry\n<i>name → renderer</i>"]
    end

    subgraph layer3 ["Pipeline uses them"]
        direction LR
        P1["Format\nextraction"]
        P2["OCR"]
        P3["Validation"]
        P4["Post-\nprocessing"]
        P5["Rendering"]
    end

    E --> ER
    O --> OR
    V --> VR
    P --> PR
    R --> RR

    ER --> P1
    OR --> P2
    VR --> P3
    PR --> P4
    RR --> P5

    style ER fill:#bbdefb,stroke:#1565c0
    style OR fill:#c8e6c9,stroke:#2e7d32
    style VR fill:#ffccbc,stroke:#d84315
    style PR fill:#fff9c4,stroke:#f9a825
    style RR fill:#e1bee7,stroke:#7b1fa2

You register a plugin once. From that point on, the pipeline uses it wherever the MIME type, name, or stage matches. No wiring, no config files, no boilerplate.
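The three layers can be sketched in a few lines. This is an illustration of the register-then-look-up flow only: `ExtractorRegistry`, `extract`, and `plain_text_extractor` are hypothetical names, not Kreuzberg's API.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a plugin: any callable from bytes to text.
Extractor = Callable[[bytes], str]


class ExtractorRegistry:
    """Layer 2: stores plugins keyed by MIME type."""

    def __init__(self) -> None:
        self._by_mime: Dict[str, Extractor] = {}

    def register(self, mime_type: str, extractor: Extractor) -> None:
        self._by_mime[mime_type] = extractor

    def get(self, mime_type: str) -> Extractor:
        try:
            return self._by_mime[mime_type]
        except KeyError:
            raise LookupError(f"no extractor registered for {mime_type}") from None


def extract(registry: ExtractorRegistry, data: bytes, mime_type: str) -> str:
    """Layer 3: the pipeline asks the registry for a handler, then calls it."""
    return registry.get(mime_type)(data)


def plain_text_extractor(data: bytes) -> str:
    """Layer 1: a plugin, here a function that decodes UTF-8 text."""
    return data.decode("utf-8")


registry = ExtractorRegistry()
registry.register("text/plain", plain_text_extractor)
```

After the single `register` call, `extract(registry, b"hello", "text/plain")` reaches the plugin without any further wiring, while an unregistered MIME type raises `LookupError`.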


The Five Plugin Types

DocumentExtractor

A DocumentExtractor teaches Kreuzberg how to extract text from a specific file format. It declares supported MIME types and provides async methods to extract from file paths or raw bytes.

See DocumentExtractor for the trait signature.

Kreuzberg ships with built-in extractors for PDF, Excel, images (routed to OCR), XML, plain text, email, and Office formats (DOCX, PPTX).

Priority resolution. When two extractors are registered for the same MIME type, the one with the higher priority() value wins. Every built-in extractor has a priority of 0. To override the built-in PDF extractor with your own, register yours with a higher priority:

override_builtin.rs
impl DocumentExtractor for BetterPDFExtractor {
    fn priority(&self) -> i32 { 100 }
    // ...
}

Now when the pipeline encounters application/pdf, it selects BetterPDFExtractor instead of the default.
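The selection rule itself is small enough to sketch. The Python below is a conceptual model, not Kreuzberg's registry: `ExtractorEntry` and `PriorityRegistry` are hypothetical names, and the tie-breaking behavior is a choice of this sketch because the text does not specify one.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ExtractorEntry:
    """Hypothetical record for a registered extractor (not Kreuzberg's type)."""
    name: str
    priority: int = 0  # built-in extractors use 0, per the text above


class PriorityRegistry:
    """Keeps every extractor registered for a MIME type and picks one on lookup."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[ExtractorEntry]] = {}

    def register(self, mime_type: str, entry: ExtractorEntry) -> None:
        self._entries.setdefault(mime_type, []).append(entry)

    def resolve(self, mime_type: str) -> ExtractorEntry:
        candidates = self._entries.get(mime_type, [])
        if not candidates:
            raise LookupError(f"no extractor registered for {mime_type}")
        # Highest priority wins. On a tie, max() returns the earliest
        # registration; that tie-break is an assumption of this sketch.
        return max(candidates, key=lambda entry: entry.priority)


pdf_registry = PriorityRegistry()
pdf_registry.register("application/pdf", ExtractorEntry("builtin-pdf"))
pdf_registry.register("application/pdf", ExtractorEntry("better-pdf", priority=100))
```

Here `pdf_registry.resolve("application/pdf")` returns the `better-pdf` entry because 100 is greater than the built-in priority of 0.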


OcrBackend

An OcrBackend performs optical character recognition on image data. It declares supported languages and provides async methods to process image bytes or files.

See OcrBackend for the trait signature.

Three backends ship out of the box:

| Backend | Engine | Strengths |
| --- | --- | --- |
| Tesseract | Native Rust bindings | Fast, general-purpose, default backend. Good accuracy for Latin scripts. |
| PaddleOCR | ONNX Runtime | Best accuracy for CJK (Chinese, Japanese, Korean) scripts. No Python dependency. |
| EasyOCR | Python + PyTorch | Supports 80+ languages including Arabic, Hindi, and Thai. Only available through Python bindings. |

You can register your own OCR backend (for example, a cloud-based API or a custom model) using the same trait.


PostProcessor

A PostProcessor transforms extraction results after the main extraction and OCR stages are complete. Each processor declares a processing stage that determines its execution order.

See PostProcessor for the trait signature.

The three stages execute in fixed order:

| Stage | Runs | Purpose | Examples |
| --- | --- | --- | --- |
| Early | First | Clean up raw text | Strip control characters, fix encoding, normalize whitespace |
| Middle | Second | Analyze content | Extract named entities, detect language, classify document type |
| Late | Third | Final output shaping | Format output, generate summaries, redact PII |

Error handling: Post-processor errors do not fail the extraction. Errors are logged and the pipeline continues unchanged, ensuring no processor can take down extraction.
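The stage ordering and the log-and-continue behavior can be modeled as follows. This is a sketch under assumptions: `Stage`, `run_post_processors`, and the sample processors are hypothetical names, processors are reduced to text-to-text functions, and keeping registration order within a stage is a choice made here, not something the text states.

```python
import logging
from enum import IntEnum
from typing import Callable, List, Tuple

logger = logging.getLogger("postprocess_sketch")


class Stage(IntEnum):
    EARLY = 0
    MIDDLE = 1
    LATE = 2


Processor = Callable[[str], str]


def run_post_processors(text: str, processors: List[Tuple[Stage, Processor]]) -> str:
    """Run processors in Early, Middle, Late order; a failure never aborts the run."""
    # sorted() is stable, so processors within one stage keep registration order.
    for stage, processor in sorted(processors, key=lambda item: item[0]):
        try:
            text = processor(text)
        except Exception:
            # Log and carry on with the text as it was before this processor.
            logger.exception("post-processor failed in stage %s", stage.name)
    return text


def normalize_whitespace(text: str) -> str:
    return " ".join(text.split())


def always_fails(text: str) -> str:
    raise RuntimeError("simulated processor bug")


def uppercase(text: str) -> str:
    return text.upper()
```

Registration order does not matter for cross-stage ordering: passing `[(Stage.LATE, uppercase), (Stage.MIDDLE, always_fails), (Stage.EARLY, normalize_whitespace)]` still runs the Early processor first, logs the Middle failure, and applies the Late processor to the cleaned text.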


Validator

A Validator inspects extraction results and can reject them if they don't meet requirements. Unlike post-processors, validator errors stop the pipeline immediately — they're a hard gate.

See Validator for the trait signature.

Two common validator patterns:

example_validators.py
class MinimumLengthValidator:
    """Reject extractions that produce less than 100 characters."""
    def validate(self, result, config):
        if len(result.content) < 100:
            raise ValidationError("Text too short")

class QualityThresholdValidator:
    """Reject extractions with a quality score below 0.5."""
    def validate(self, result, config):
        if (result.quality_score or 0.0) < 0.5:
            raise ValidationError("Quality below threshold")

Validators run before post-processors. This means you can catch and reject bad results before any transformation work happens.
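A minimal model of that ordering, assuming validators and post-processors are plain callables over the extracted text. `run_pipeline`, `min_length`, `record_and_strip`, and the local `ValidationError` class are hypothetical stand-ins for illustration, not Kreuzberg's API.

```python
from typing import Callable, List


class ValidationError(Exception):
    """Local stand-in for the error a validator raises in this sketch."""


def run_pipeline(
    content: str,
    validators: List[Callable[[str], None]],
    post_processors: List[Callable[[str], str]],
) -> str:
    # Hard gate: the first validator that raises stops everything below.
    for validate in validators:
        validate(content)
    # Only validated content reaches the post-processors.
    for process in post_processors:
        content = process(content)
    return content


calls: List[str] = []


def min_length(content: str) -> None:
    if len(content) < 100:
        raise ValidationError("Text too short")


def record_and_strip(content: str) -> str:
    calls.append("strip")  # lets us observe whether the processor ran
    return content.strip()
```

When `min_length` rejects a short input, `ValidationError` propagates to the caller and `calls` stays empty, which shows that no transformation work was done on the rejected result.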


Renderer

A Renderer converts the internal document representation into a specific output format. It declares a name and provides a render method.

pub trait Renderer: Send + Sync {
    fn name(&self) -> &str;
    fn render(&self, document: &InternalDocument) -> Result<String>;
}

Kreuzberg ships with four built-in renderers:

| Renderer | Output | Description |
| --- | --- | --- |
| Markdown | GFM Markdown | GitHub Flavored Markdown via comrak AST bridge. Tables, headings, lists. |
| HTML | HTML5 | Full HTML5 rendering via comrak. |
| djot | Djot | Djot markup format. |
| plain | Plain text | Raw text with no markup. |

To register a custom renderer:

custom_renderer.rs
use kreuzberg::plugins::registry::get_renderer_registry;
use std::sync::Arc;

let registry = get_renderer_registry();
let mut registry = registry.write().unwrap();
registry.register(Arc::new(MyCustomRenderer))?;

Custom renderers participate in the pipeline just like built-in ones. When the user requests your renderer's name via --content-format, the RendererRegistry dispatches to your implementation.


Plugin Lifecycle

Every plugin follows the same lifecycle from creation to shutdown.

stateDiagram-v2
    [*] --> Created: new()
    Created --> Registered: registry.register()
    Registered --> Active: initialize()
    Active --> Active: called by pipeline
    Active --> [*]: shutdown()

See Plugin for the base trait signature.

Key behaviors: initialize() is called lazily the first time the plugin is used, not at registration. This avoids startup overhead for plugins that may never be invoked. shutdown() runs when the plugin is unregistered or on process exit. Both have default no-op implementations — override only if your plugin needs setup or cleanup.
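The lazy-initialization behavior can be sketched as a registry that defers `initialize()` until the first lookup. `Plugin`, `LazyRegistry`, and `RecordingPlugin` are hypothetical names. Whether `shutdown()` is skipped for a plugin that was never initialized is not stated above, so this sketch simply always calls it on unregistration.

```python
from typing import Dict, List, Set


class Plugin:
    """Base class with no-op lifecycle hooks; override only when needed."""

    def initialize(self) -> None:
        pass

    def shutdown(self) -> None:
        pass


class LazyRegistry:
    def __init__(self) -> None:
        self._plugins: Dict[str, Plugin] = {}
        self._initialized: Set[str] = set()

    def register(self, name: str, plugin: Plugin) -> None:
        # Registration only stores the plugin; no setup cost is paid yet.
        self._plugins[name] = plugin

    def get(self, name: str) -> Plugin:
        plugin = self._plugins[name]
        if name not in self._initialized:
            plugin.initialize()  # first use triggers initialization
            self._initialized.add(name)
        return plugin

    def unregister(self, name: str) -> None:
        plugin = self._plugins.pop(name)
        self._initialized.discard(name)
        plugin.shutdown()  # assumption: always called on unregistration


class RecordingPlugin(Plugin):
    """Records lifecycle calls so the order can be inspected."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def initialize(self) -> None:
        self.events.append("initialize")

    def shutdown(self) -> None:
        self.events.append("shutdown")
```

Registering a `RecordingPlugin` leaves its `events` list empty. The first `get()` appends `"initialize"`, later `get()` calls add nothing, and `unregister()` appends `"shutdown"`.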


Registering Plugins

Get the appropriate registry for your plugin type and call register(). Once registered, the pipeline automatically dispatches to your plugin based on MIME type (extractors), backend name (OCR), processing stage (post-processors), validator name, or renderer name.


Cross-Language Plugins

Plugins written in Python can integrate directly with the Rust extraction pipeline via PyO3 FFI. The bridge layer handles all type conversion automatically.

sequenceDiagram
    participant P as Python Plugin
    participant B as PyO3 Bridge
    participant R as Rust Pipeline

    P->>B: register(plugin)
    B->>R: Store as Arc<dyn DocumentExtractor>

    Note over R: During extraction...
    R->>B: extract_file(path, mime, config)
    B->>P: Call plugin.extract_file()
    P-->>B: Return result as dict
    B-->>R: Convert to ExtractionResult

Type mapping: Vec<u8> ↔ bytes, String ↔ str, Rust structs ↔ Python dataclasses. Large buffers use Python's buffer protocol to minimize copying.
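From the Python side, the two ideas look roughly like this: a plugin's dict is checked and turned into a typed record, and a `memoryview` exposes part of a large buffer without copying it. The real conversion happens inside the bridge. `ExtractionResultSketch`, its fields, and `from_plugin_dict` are hypothetical and only illustrate the shape of the mapping.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


@dataclass
class ExtractionResultSketch:
    """Hypothetical stand-in for the struct the bridge produces."""
    content: str
    mime_type: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def from_plugin_dict(raw: Dict[str, Any]) -> ExtractionResultSketch:
    """Convert a plugin's dict into the typed record, rejecting unknown keys."""
    known = {f.name for f in fields(ExtractionResultSketch)}
    unknown = set(raw) - known
    if unknown:
        raise TypeError(f"unexpected keys: {sorted(unknown)}")
    return ExtractionResultSketch(**raw)


# Buffer protocol: slicing a memoryview yields a view onto the same memory,
# so inspecting the header of a large payload does not copy the payload.
payload = bytearray(b"%PDF-1.7 example bytes")
header = memoryview(payload)[:4]
```

Calling `from_plugin_dict({"content": "hi", "mime_type": "text/plain"})` yields a record with an empty `metadata` dict, and `bytes(header)` is `b"%PDF"`.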


Thread Safety

All plugins must implement Send + Sync because the extraction pipeline invokes them concurrently from Tokio's worker thread pool. For mutable internal state, use Mutex, RwLock, or atomic types. The compiler will enforce this requirement.
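Python has no compiler check for this, so a plugin written in Python has to guard its own mutable state. The sketch below is a Python analogue of the Mutex pattern, assuming the plugin may be called from several threads at once. `CountingProcessor` and `run_concurrently` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List


class CountingProcessor:
    """A processor whose only mutable state is guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls = 0

    def process(self, text: str) -> str:
        with self._lock:  # plays the role of a Rust Mutex around the counter
            self._calls += 1
        return text.strip()  # pure work stays outside the critical section

    @property
    def calls(self) -> int:
        with self._lock:
            return self._calls


def run_concurrently(processor: CountingProcessor, inputs: List[str], workers: int = 8) -> List[str]:
    """Call the processor from a pool of worker threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(processor.process, inputs))
```

Holding the lock only around the counter update keeps contention low, since the string work itself needs no shared state.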


Plugin Discovery

Plugins can be registered in two ways:

  1. Built-in — automatically registered when Kreuzberg initializes. These are the default extractors, OCR backends, and processors.
  2. Programmatic — registered manually via the registry API at runtime.
