Plugin System v4.0.0¶
Kreuzberg's extraction pipeline is entirely plugin-driven. Every format extractor, OCR engine, post-processor, validator, and renderer is a plugin that registers itself into a typed registry. The pipeline queries these registries at each stage to find the right handler. You extend Kreuzberg by writing your own plugin and registering it. The pipeline picks it up automatically.
This page explains the five plugin types, the registry mechanism, the plugin lifecycle, and how plugins work across language boundaries.
Overview¶
The plugin system has three layers: plugins, registries, and the pipeline. Plugins implement a trait. Registries store them by key (MIME type, name, or processing stage). The pipeline queries the registries during extraction.
flowchart TB
subgraph layer1 ["You write plugins"]
direction LR
E["DocumentExtractor\n<i>Handles a file format</i>"]
O["OcrBackend\n<i>Runs OCR on images</i>"]
V["Validator\n<i>Rejects bad results</i>"]
P["PostProcessor\n<i>Transforms results</i>"]
R["Renderer\n<i>Formats output</i>"]
end
subgraph layer2 ["Registries store them"]
direction LR
ER["Extractor Registry\n<i>MIME type → extractor</i>"]
OR["OCR Registry\n<i>name → backend</i>"]
VR["Validator Registry\n<i>name → validator</i>"]
PR["Processor Registry\n<i>stage → processors</i>"]
RR["Renderer Registry\n<i>name → renderer</i>"]
end
subgraph layer3 ["Pipeline uses them"]
direction LR
P1["Format\nextraction"]
P2["OCR"]
P3["Validation"]
P4["Post-\nprocessing"]
P5["Rendering"]
end
E --> ER
O --> OR
V --> VR
P --> PR
R --> RR
ER --> P1
OR --> P2
VR --> P3
PR --> P4
RR --> P5
style ER fill:#bbdefb,stroke:#1565c0
style OR fill:#c8e6c9,stroke:#2e7d32
style VR fill:#ffccbc,stroke:#d84315
style PR fill:#fff9c4,stroke:#f9a825
style RR fill:#e1bee7,stroke:#7b1fa2
You register a plugin once. From that point on, the pipeline uses it wherever the MIME type, name, or stage matches. No wiring, no config files, no boilerplate.
The Five Plugin Types¶
DocumentExtractor¶
A DocumentExtractor teaches Kreuzberg how to extract text from a specific file format. It declares which MIME types it supports and provides two extraction methods: one for file paths, one for raw bytes.
#[async_trait]
pub trait DocumentExtractor: Plugin {
fn supported_mime_types(&self) -> Vec<&str>;
fn priority(&self) -> i32 { 0 }
async fn extract_file(
&self, path: &Path, mime_type: &str, config: &ExtractionConfig,
) -> Result<ExtractionResult>;
async fn extract_bytes(
&self, data: &[u8], mime_type: &str, config: &ExtractionConfig,
) -> Result<ExtractionResult>;
}
Kreuzberg ships with built-in extractors for PDF (via pdfium), Excel (via calamine), images (routes to OCR), XML, plain text, email, and Office formats (DOCX, PPTX).
Priority resolution. When two extractors are registered for the same MIME type, the one with the higher priority() value wins. Every built-in extractor has a priority of 0. To override the built-in PDF extractor with your own, register yours with a higher priority:
impl DocumentExtractor for BetterPDFExtractor {
fn priority(&self) -> i32 { 100 }
// ...
}
Now when the pipeline encounters application/pdf, it selects BetterPDFExtractor instead of the default.
OcrBackend¶
An OcrBackend performs optical character recognition on image data. It declares which languages it supports and provides a process_image method that takes raw image bytes and returns recognized text.
#[async_trait]
pub trait OcrBackend: Plugin {
fn supports_language(&self, language: &str) -> bool;
async fn process_image(
&self, image_data: &[u8], language: Option<&str>, config: &OcrConfig,
) -> Result<OcrResult>;
}
Three backends ship out of the box:
| Backend | Engine | Strengths |
|---|---|---|
| Tesseract | Native Rust bindings | Fast, general-purpose, default backend. Good accuracy for Latin scripts. |
| PaddleOCR | ONNX Runtime | Best accuracy for CJK (Chinese, Japanese, Korean) scripts. No Python dependency. |
| EasyOCR | Python + PyTorch | Supports 80+ languages including Arabic, Hindi, and Thai. Only available through Python bindings. |
You can register your own OCR backend (for example, a cloud-based API, a custom model) using the same trait.
PostProcessor¶
A PostProcessor transforms the extraction result after the main extraction and OCR stages are complete. Each processor declares a stage that determines its execution order relative to other processors.
#[async_trait]
pub trait PostProcessor: Plugin {
fn stage(&self) -> ProcessingStage;
fn should_process(&self, result: &ExtractionResult, config: &ExtractionConfig) -> bool {
true
}
async fn process(
&self, result: ExtractionResult, config: &ExtractionConfig,
) -> Result<ExtractionResult>;
}
The three stages execute in fixed order:
| Stage | Runs | Purpose | Examples |
|---|---|---|---|
Early |
First | Clean up raw text | Strip control characters, fix encoding, normalize whitespace |
Middle |
Second | Analyze content | Extract named entities, detect language, classify document type |
Late |
Third | Final output shaping | Format output, generate summaries, redact PII |
A design decision worth noting: post-processor errors do not fail the extraction. If a processor throws an exception, the error is logged and the pipeline continues with the result unchanged. This ensures a buggy or experimental processor can't take down your extraction pipeline.
Validator¶
A Validator inspects the extraction result and can reject it if it doesn't meet your requirements. Unlike post-processors, validator errors stop the pipeline immediately. They're a hard gate.
#[async_trait]
pub trait Validator: Plugin {
fn should_validate(&self, result: &ExtractionResult, config: &ExtractionConfig) -> bool {
true
}
async fn validate(&self, result: &ExtractionResult, config: &ExtractionConfig) -> Result<()>;
}
Two common validator patterns:
class MinimumLengthValidator:
"""Reject extractions that produce less than 100 characters."""
def validate(self, result, config):
if len(result.content) < 100:
raise ValidationError("Text too short")
class QualityThresholdValidator:
"""Reject extractions with a quality score below 0.5."""
def validate(self, result, config):
if (result.quality_score or 0.0) < 0.5:
raise ValidationError("Quality below threshold")
Validators run before post-processors. This means you can catch and reject bad results before any transformation work happens.
Renderer v4.7.0¶
A Renderer converts the internal document representation into a specific output format. It declares a name and provides a render method that takes an InternalDocument and produces formatted text.
pub trait Renderer: Plugin {
fn name(&self) -> &str;
fn render(&self, document: &InternalDocument) -> Result<String>;
}
Kreuzberg ships with four built-in renderers:
| Renderer | Output | Description |
|---|---|---|
| markdown | GFM Markdown | GitHub Flavored Markdown via comrak AST bridge. Tables, headings, lists. |
| html | HTML5 | Full HTML5 rendering via comrak. |
| djot | Djot | Djot markup format. |
| plain | Plain text | Raw text with no markup. |
To register a custom renderer:
use kreuzberg::plugins::registry::get_renderer_registry;
use std::sync::Arc;
let registry = get_renderer_registry();
let mut registry = registry.write().unwrap();
registry.register(Arc::new(MyCustomRenderer))?;
Custom renderers participate in the pipeline just like built-in ones. When the user requests your renderer's name via --content-format, the RendererRegistry dispatches to your implementation.
Plugin Lifecycle¶
Every plugin, regardless of type, follows the same lifecycle from creation to shutdown.
stateDiagram-v2
[*] --> Created: new()
Created --> Registered: registry.register()
Registered --> Active: initialize()
Active --> Active: called by pipeline
Active --> [*]: shutdown()
The base Plugin trait that all four plugin types extend:
pub trait Plugin: Send + Sync {
fn name(&self) -> &str;
fn version(&self) -> String;
fn initialize(&self) -> Result<()>;
fn shutdown(&self) -> Result<()>;
}
initialize() is called lazily the first time the plugin is used, not when it's registered. This avoids startup overhead for plugins that may never be invoked. shutdown() runs when the plugin is removed from the registry or when the process exits. Both methods have default no-op implementations, so you only override them if your plugin needs setup or cleanup logic.
Registering Plugins¶
The registration pattern is the same in every language. Get the registry, call register.
Once registered, the pipeline automatically uses your plugin for any matching MIME type (extractors), backend name (OCR), processing stage (post-processors), or validator name (validators).
Cross-Language Plugins¶
Plugins written in Python can integrate directly with the Rust extraction pipeline. The PyO3 bridge layer handles all type conversion between Python and Rust automatically.
Here is what happens when a Python plugin is registered and then invoked during extraction:
sequenceDiagram
participant P as Python Plugin
participant B as PyO3 Bridge
participant R as Rust Pipeline
P->>B: register(plugin)
B->>R: Store as Arc<dyn DocumentExtractor>
Note over R: During extraction...
R->>B: extract_file(path, mime, config)
B->>P: Call plugin.extract_file()
P-->>B: Return result as dict
B-->>R: Convert to ExtractionResult struct
The bridge handles type conversion between languages:
| Rust | Python | TypeScript |
|---|---|---|
Vec<u8> |
bytes |
Buffer |
String |
str |
string |
| Structs | Dataclasses | Plain objects |
For large data like file bytes and image buffers, the bindings are designed to minimize copying and use buffer protocols where supported. A Python plugin may receive file data as bytes or another buffer-compatible type, depending on the binding implementation and runtime behavior.
Thread Safety¶
All plugins must implement Send + Sync because the extraction pipeline invokes them concurrently from Tokio's worker thread pool.
Sendmeans the plugin value can be moved to a different thread.Syncmeans multiple threads can hold references to the plugin simultaneously.
If your plugin needs mutable internal state (counters, connection pools, caches), wrap it in Mutex, RwLock, or use atomic types. The compiler will enforce this at build time.
Plugin Discovery¶
Plugins can be registered in three ways:
- Built-in — automatically registered when Kreuzberg initializes. These are the default extractors, OCR backends, and processors.
- Programmatic — registered manually via the registry API, as shown above.
- Configuration-based — loaded from
kreuzberg.tomlat startup (planned for a future release).
What to Read Next¶
- Creating Plugins — step-by-step guide to building a custom plugin
- Extraction Pipeline — where each plugin type fits in the extraction flow
- Architecture — overall system design
- API Reference — plugin API documentation