Configuration Reference v4.0.0

This page provides complete documentation for all Kreuzberg configuration types and fields. For quick-start examples and common use cases, see the Configuration Guide.

Getting Started

New users should start with the Configuration Guide which covers:

  • Configuration discovery mechanism
  • Quick-start examples in all languages
  • Common use cases (OCR setup, chunking for RAG)
  • Configuration file formats (TOML, YAML, JSON)

This reference page is the comprehensive source for:

  • All configuration field details
  • Default values and constraints
  • Technical specifications for each config type

ServerConfig

NEW in v4.2.7: The ServerConfig controls API server and network settings.

API server configuration for the Kreuzberg HTTP server, including host/port settings, CORS configuration, and upload size limits. All settings can be overridden via environment variables.

Overview

ServerConfig is used to customize the Kreuzberg API server behavior when running kreuzberg serve or embedding a Kreuzberg API server in your application. It controls network binding, cross-origin resource sharing (CORS), and file upload size constraints.

Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| host | String | "127.0.0.1" | Server host address (for example, "127.0.0.1", "0.0.0.0") |
| port | u16 | 8000 | Server port number (1-65535) |
| cors_origins | Vec<String> | empty | CORS allowed origins. Empty list allows all origins. |
| max_request_body_bytes | usize | 104857600 | Maximum request body size in bytes (100 MB default) |
| max_multipart_field_bytes | usize | 104857600 | Maximum multipart field size in bytes (100 MB default) |

Configuration Precedence

Settings are applied in this order (highest priority first):

  1. Environment Variables - KREUZBERG_* variables override everything
  2. Configuration File - TOML, YAML, or JSON values
  3. Programmatic Defaults - Hard-coded defaults
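
The layering above can be sketched in a few lines of Python. This is an illustrative model only, not Kreuzberg's implementation: the function name `resolve_server_settings` is hypothetical, the defaults are copied from the Fields table, and the trimming of whitespace around comma-separated origins is an assumption.

```python
import os
from typing import Any, Mapping

# Hard-coded defaults, copied from the Fields table above.
DEFAULTS: dict[str, Any] = {
    "host": "127.0.0.1",
    "port": 8000,
    "cors_origins": [],
    "max_request_body_bytes": 104_857_600,
    "max_multipart_field_bytes": 104_857_600,
}


def resolve_server_settings(
    file_values: Mapping[str, Any],
    env: Mapping[str, str] = os.environ,
) -> dict[str, Any]:
    """Hypothetical helper: defaults < config file < KREUZBERG_* variables."""
    settings = dict(DEFAULTS)
    settings.update(file_values)  # file values override defaults
    for key, default in DEFAULTS.items():
        raw = env.get(f"KREUZBERG_{key.upper()}")
        if raw is None:
            continue  # no override for this field
        if isinstance(default, int):
            settings[key] = int(raw)
        elif isinstance(default, list):
            # Assumption: entries are comma-separated and whitespace is trimmed.
            settings[key] = [item.strip() for item in raw.split(",") if item.strip()]
        else:
            settings[key] = raw
    return settings


resolved = resolve_server_settings(
    {"host": "0.0.0.0", "port": 9000},   # values from a config file
    env={"KREUZBERG_PORT": "3000"},      # environment wins over the file
)
print(resolved["host"], resolved["port"])  # 0.0.0.0 3000
```

The file sets the port to 9000, but the environment variable takes precedence, while the host from the file still overrides the default.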

CORS Security Warning

The default configuration (empty cors_origins list) allows requests from any origin. This is suitable for development and internal APIs, but you should explicitly configure cors_origins for production deployments to prevent unauthorized cross-origin requests.

Recommended for production:

Production CORS Configuration
cors_origins = ["https://yourdomain.com", "https://app.yourdomain.com"]
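
To make the "empty list allows all" rule concrete, here is a minimal Python model of the check. It is a sketch, not the server's code: exact string matching is an assumption, and the function only mirrors the name of the Rust helper `is_origin_allowed` shown in the examples on this page.

```python
def is_origin_allowed(cors_origins: list[str], origin: str) -> bool:
    """Model of the documented rule: an empty list allows every origin."""
    if not cors_origins:
        return True
    # Assumption: origins are compared as exact strings.
    return origin in cors_origins


production = ["https://yourdomain.com", "https://app.yourdomain.com"]

print(is_origin_allowed([], "https://evil.com"))                # True
print(is_origin_allowed(production, "https://evil.com"))        # False
print(is_origin_allowed(production, "https://yourdomain.com"))  # True
```

The first call shows why the default is risky in production: with no origins configured, nothing is rejected.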

Configuration Examples

basic_server_config.rs
use kreuzberg::core::ServerConfig;

// Basic configuration with defaults
let config = ServerConfig::default();
assert_eq!(config.host, "127.0.0.1");
assert_eq!(config.port, 8000);

// Custom configuration
let mut config = ServerConfig::default();
config.host = "0.0.0.0".to_string();
config.port = 3000;

// Listen address helper
println!("Server listening on: {}", config.listen_addr());
cors_server_config.rs
use kreuzberg::core::ServerConfig;

// Allow specific origins only (secure)
let mut config = ServerConfig::default();
config.cors_origins = vec![
    "https://app.example.com".to_string(),
    "https://admin.example.com".to_string(),
];

// Check if origin is allowed
assert!(config.is_origin_allowed("https://app.example.com"));
assert!(!config.is_origin_allowed("https://evil.com"));

// Check if allowing all origins
assert!(!config.cors_allows_all());
size_limits_config.rs
use kreuzberg::core::ServerConfig;

// Custom size limits (200 MB)
let mut config = ServerConfig::default();
config.max_request_body_bytes = 200 * 1_048_576;  // 200 MB
config.max_multipart_field_bytes = 200 * 1_048_576;  // 200 MB

// Get sizes in MB
println!("Max request body: {} MB", config.max_request_body_mb());
println!("Max file upload: {} MB", config.max_multipart_field_mb());
load_server_config.rs
use kreuzberg::core::ServerConfig;

// Auto-detect format from extension (.toml, .yaml, .json)
let mut config = ServerConfig::from_file("server.toml")?;

// Apply environment variable overrides (needs the mutable binding above)
config.apply_env_overrides()?;

// Or use format-specific loaders
let toml_config = ServerConfig::from_toml_file("server.toml")?;
let yaml_config = ServerConfig::from_yaml_file("server.yaml")?;
let json_config = ServerConfig::from_json_file("server.json")?;

Environment Variable Overrides

All settings can be overridden via environment variables with KREUZBERG_ prefix:

Terminal
# Network settings
export KREUZBERG_HOST="0.0.0.0"
export KREUZBERG_PORT="3000"

# CORS configuration (comma-separated)
export KREUZBERG_CORS_ORIGINS="https://app1.com, https://app2.com"

# Size limits (in bytes)
export KREUZBERG_MAX_REQUEST_BODY_BYTES="209715200"      # 200 MB
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES="209715200"   # 200 MB

kreuzberg serve
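
The byte values above follow from 1 MB = 1,048,576 bytes, the same factor used in the Rust size-limit example. A small Python sketch for deriving them; the helper names `mb_to_bytes` and `size_limit_env` are hypothetical and exist only for illustration.

```python
BYTES_PER_MB = 1_048_576  # 1024 * 1024, the factor used on this page


def mb_to_bytes(megabytes: int) -> int:
    """Convert a size in MB to the byte count the settings expect."""
    return megabytes * BYTES_PER_MB


def size_limit_env(megabytes: int) -> dict[str, str]:
    """Hypothetical helper: build both size-limit variables with one value."""
    value = str(mb_to_bytes(megabytes))
    return {
        "KREUZBERG_MAX_REQUEST_BODY_BYTES": value,
        "KREUZBERG_MAX_MULTIPART_FIELD_BYTES": value,
    }


print(mb_to_bytes(100))  # 104857600 (the default)
print(mb_to_bytes(200))  # 209715200
print(size_limit_env(200))
```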

Configuration File Examples

TOML Format

server.toml
# Basic server configuration
host = "0.0.0.0"          # Listen on all interfaces
port = 8000               # API port

# CORS configuration (empty = allow all)
cors_origins = [
    "https://app.example.com",
    "https://admin.example.com"
]

# Upload size limits (default: 100 MB)
max_request_body_bytes = 104857600      # 100 MB
max_multipart_field_bytes = 104857600   # 100 MB

YAML Format

server.yaml
host: 0.0.0.0
port: 8000

cors_origins:
  - https://app.example.com
  - https://admin.example.com

max_request_body_bytes: 104857600
max_multipart_field_bytes: 104857600

JSON Format

server.json
{
  "host": "0.0.0.0",
  "port": 8000,
  "cors_origins": ["https://app.example.com", "https://admin.example.com"],
  "max_request_body_bytes": 104857600,
  "max_multipart_field_bytes": 104857600
}

Docker Integration

When deploying Kreuzberg in Docker, use environment variables to configure the server:

Dockerfile
FROM kreuzberg:latest

ENV KREUZBERG_HOST="0.0.0.0"
ENV KREUZBERG_PORT="8000"
ENV KREUZBERG_CORS_ORIGINS="https://yourdomain.com"
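# Raise both limits to 500 MB; the request body limit otherwise stays at its 100 MB default
ENV KREUZBERG_MAX_REQUEST_BODY_BYTES="524288000"
ENV KREUZBERG_MAX_MULTIPART_FIELD_BYTES="524288000"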

EXPOSE 8000

CMD ["kreuzberg", "serve"]
Terminal - Run with Docker
docker run -it \
  -e KREUZBERG_HOST="0.0.0.0" \
  -e KREUZBERG_PORT="3000" \
  -e KREUZBERG_CORS_ORIGINS="https://api.example.com" \
  -p 3000:3000 \
  kreuzberg:latest kreuzberg serve

ExtractionConfig

Main extraction configuration controlling all aspects of document processing.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| use_cache | bool | true | Enable caching of extraction results for faster re-processing |
| enable_quality_processing | bool | true | Enable quality post-processing (deduplication, mojibake fixing, etc.) |
| force_ocr | bool | false | Force OCR even for searchable PDFs with text layers |
| disable_ocr | bool | false | Disable OCR entirely; image files return empty content instead of raising errors (v4.7.0+) |
| ocr | OcrConfig? | None | OCR configuration (if None, OCR disabled) |
| pdf_options | PdfConfig? | None | PDF-specific configuration options |
| images | ImageExtractionConfig? | None | Image extraction configuration |
| chunking | ChunkingConfig? | None | Text chunking configuration for splitting into chunks |
| content_filter | ContentFilterConfig? v4.8.0 | None | Header, footer, watermark, and repeating-text filtering. See ContentFilterConfig. |
| token_reduction | TokenReductionConfig? | None | Token reduction configuration for optimizing LLM context |
| language_detection | LanguageDetectionConfig? | None | Automatic language detection configuration |
| postprocessor | PostProcessorConfig? | None | Post-processing pipeline configuration |
| pages | PageConfig? | None | Page extraction and tracking configuration |
| max_concurrent_extractions | int? | None | Maximum concurrent batch extractions (defaults to num_cpus * 2) |
| concurrency | ConcurrencyConfig? v4.5.0 | None | Concurrency configuration for threading (max_threads caps Rayon, ONNX intra-op threads, and batch semaphore) |
| result_format | OutputFormat | Unified | Result structure format: Unified (content in single field) or ElementBased (semantic elements array) |
| output_format | OutputFormat | Plain | Output format for extracted text content (Plain, Markdown, Djot, Html, Structured) |
| html_options | ConversionOptions | None | HTML to Markdown conversion options (heading styles, list formatting, code block styles). Only available with html feature. |
| html_output | HtmlOutputConfig? v4.8.1 | None | Styled HTML output configuration: theme selection, custom CSS, class prefix. When set alongside output_format = Html, activates the styled renderer with kb-* class hooks. Only available with html feature. |
| security_limits | SecurityLimits? | None (uses defaults) | Archive security thresholds: max archive size (500MB), compression ratio (100:1), file count (10K), nesting depth, content size, XML depth, table cells. Only available with archives feature. |
| layout | LayoutDetectionConfig? | None | Layout detection configuration for document structure analysis. Only available with layout-detection feature. |
| acceleration | AccelerationConfig? | None | Hardware acceleration configuration for ONNX Runtime inference (layout detection and embeddings). See AccelerationConfig. |
| include_document_structure | bool | false | Enable structured document model output. When true, the document field on ExtractionResult is populated with a tree-based representation of document content. |
| tree_sitter | TreeSitterConfig? | None | Tree-sitter code intelligence configuration. Controls code analysis features when extracting source code files. Only available with tree-sitter feature. |
| structured_extraction | StructuredExtractionConfig? | None | Structured extraction configuration for LLM-powered schema-based extraction. When set, extraction results include a structured_output field with data conforming to the provided JSON schema. Only available with llm feature. |

Result Format vs Output Format

Important distinction: These two fields control different aspects of extraction results:

  • result_format - Controls the structure of the result:
      • Unified (default): All content returned in the content field as a single string
      • ElementBased: Content returned as semantic elements in the elements array (Unstructured-compatible format)

  • output_format - Controls the text format within the content:
      • Plain (default): Raw extracted text
      • Markdown: Markdown formatted output
      • Djot: Djot markup format
      • Html: HTML formatted output
      • Structured: Structured JSON with full OCR element data (bounding boxes, confidence)

OutputFormat (result_format field)

Controls the structure of extraction results:

| Value | Description |
|-------|-------------|
| unified | All content in single content field (default) |
| element_based | Semantic elements with type classification, IDs, and metadata |

When result_format is set to ElementBased, the elements field contains an array of semantic elements with unique identifiers, element types (title, heading, narrative_text, etc.), and metadata for Unstructured-compatible processing.

OutputFormat (output_format field)

Output format for extraction content. Controls how extracted text is formatted in the result.

| Value | Description |
|-------|-------------|
| plain | Plain text content only (default) |
| markdown | Markdown formatted output |
| djot | Djot markup format |
| html | HTML formatted output |
| structured | Structured JSON with full OCR element data (bounding boxes, confidence) |

Environment Variable: KREUZBERG_OUTPUT_FORMAT - Set output format via environment (plain, markdown, djot, html, structured)

HtmlOutputConfig

Configuration for the styled HTML renderer. When set on ExtractionConfig.html_output alongside output_format = Html, the pipeline produces HTML with semantic kb-* class hooks instead of plain HTML.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| theme | HtmlTheme | Unstyled | Built-in colour/typography theme |
| css | string? | None | Inline CSS string appended after theme stylesheet |
| css_file | path? | None | CSS file loaded at render time (max 1 MiB) |
| class_prefix | string | "kb-" | CSS class prefix (alphanumeric + hyphens + underscores only) |
| embed_css | bool | true | Embed CSS in <style> block. Set false for external stylesheets |

HtmlTheme

Built-in theme selection for styled HTML output.

| Value | Description |
|-------|-------------|
| Unstyled (default) | No built-in stylesheet. CSS custom properties defined on :root for user stylesheets |
| Default | System font stack, neutral colours, readable line measure |
| GitHub | GitHub Markdown-inspired palette and spacing |
| Dark | Dark background, light text |
| Light | Minimal light theme with generous whitespace |

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    UseCache = true,
    EnableQualityProcessing = true,
    ForceOcr = false,
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    useCache := true
    enableQP := true

    result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
        UseCache:                &useCache,
        EnableQualityProcessing: &enableQP,
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .useCache(true)
    .enableQualityProcessing(true)
    .build();
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig

async def main() -> None:
    config = ExtractionConfig(
        use_cache=True,
        enable_quality_processing=True
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  use_cache: true,
  enable_quality_processing: true
)

result = Kreuzberg.extract_file_sync('document.pdf', config: config)
R
library(kreuzberg)

file_path <- "document.pdf"

config <- extraction_config(
  output_format = "markdown"
)

result <- extract_file_sync(file_path, config = config)

cat(sprintf("MIME type: %s\n", result$mime_type))
cat(sprintf("Content length: %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
Rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        use_cache: true,
        enable_quality_processing: true,
        ..Default::default()
    };

    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    useCache: true,
    enableQualityProcessing: true,
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

FileExtractionConfig v4.5.0

Per-file extraction configuration overrides for batch operations. All fields are optional — None means "use the batch-level default from ExtractionConfig."

When passed as an optional parameter to batch_extract_file / batch_extract_bytes (or their sync variants), each file in the batch can specify its own overrides that are merged with the shared batch-level ExtractionConfig.

Overridable Fields

| Field | Type | Description |
|-------|------|-------------|
| enable_quality_processing | bool? | Override quality post-processing for this file |
| ocr | OcrConfig? | Override OCR configuration |
| force_ocr | bool? | Override force OCR |
| disable_ocr | bool? | Override disable OCR (v4.7.0+) |
| chunking | ChunkingConfig? | Override text chunking |
| content_filter | ContentFilterConfig? | Override content filtering |
| images | ImageExtractionConfig? | Override image extraction |
| pdf_options | PdfConfig? | Override PDF-specific options |
| token_reduction | TokenReductionConfig? | Override token reduction |
| language_detection | LanguageDetectionConfig? | Override language detection |
| pages | PageConfig? | Override page extraction |
| keywords | KeywordConfig? | Override keyword extraction |
| postprocessor | PostProcessorConfig? | Override post-processing |
| html_options | ConversionOptions? | Override HTML conversion options |
| result_format | OutputFormat? | Override result structure format |
| output_format | OutputFormat? | Override output content format |
| include_document_structure | bool? | Override document structure output |
| layout | LayoutDetectionConfig? | Override layout detection |

Batch-Level Only Fields (Not Overridable)

These ExtractionConfig fields cannot be overridden per file:

  • max_concurrent_extractions — controls batch parallelism
  • use_cache — global caching policy
  • acceleration — shared ONNX execution provider
  • security_limits — global archive security policy

Merge Semantics

For each file in a batch, the effective configuration is computed by overlaying the per-file FileExtractionConfig onto the batch-level ExtractionConfig. A field set to None in FileExtractionConfig falls through to the batch default. A field set to Some(value) replaces the batch default entirely for that file.
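
The overlay rule can be modelled with two small dataclasses. This is a simplified sketch, not the library's merge code: `BatchConfig` and `FileOverrides` are hypothetical stand-ins that carry only a few of the real fields, and `effective_config` is an illustrative name.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class BatchConfig:
    """Hypothetical stand-in for ExtractionConfig (three fields only)."""
    use_cache: bool = True          # batch-level only, not overridable
    force_ocr: bool = False
    output_format: str = "plain"


@dataclass(frozen=True)
class FileOverrides:
    """Hypothetical stand-in for FileExtractionConfig: all fields optional."""
    force_ocr: Optional[bool] = None
    output_format: Optional[str] = None


def effective_config(batch: BatchConfig, per_file: Optional[FileOverrides]) -> BatchConfig:
    """None falls through to the batch default; a set value replaces it."""
    if per_file is None:
        return batch
    overrides = {
        f.name: getattr(per_file, f.name)
        for f in fields(per_file)
        if getattr(per_file, f.name) is not None
    }
    return replace(batch, **overrides)


batch = BatchConfig(output_format="markdown")
merged = effective_config(batch, FileOverrides(force_ocr=True))
print(merged)
# BatchConfig(use_cache=True, force_ocr=True, output_format='markdown')
```

Because `use_cache` has no counterpart in the per-file type, it can never be overridden, which mirrors the batch-level-only fields listed above.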

Example

per_file_config.rs
use kreuzberg::{
    batch_extract_file, ExtractionConfig, FileExtractionConfig, OcrConfig,
};
use std::path::PathBuf;

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let batch_config = ExtractionConfig::default();

    let paths = vec![
        PathBuf::from("report.pdf"),
        PathBuf::from("scanned.pdf"),
    ];

    let file_configs = vec![
        None, // Use batch defaults for this PDF
        Some(FileExtractionConfig { // Force OCR for this scanned document
            force_ocr: Some(true),
            ocr: Some(OcrConfig {
                backend: "tesseract".to_string(),
                language: "deu".to_string(),
                ..Default::default()
            }),
            ..Default::default()
        }),
    ];

    let results = batch_extract_file(paths, &batch_config, Some(&file_configs)).await?;
    Ok(())
}
per_file_config.py
from kreuzberg import (
    batch_extract_files_sync,
    ExtractionConfig,
    FileExtractionConfig,
    OcrConfig,
)

config = ExtractionConfig()

paths = ["report.pdf", "scanned.pdf"]
file_configs = [
    None,  # use batch defaults
    FileExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(backend="tesseract", language="deu"),
    ),
]

results = batch_extract_files_sync(paths, config, file_configs=file_configs)
per_file_config.ts
import { batchExtractFilesSync } from '@kreuzberg/node';

const results = batchExtractFilesSync(
  ['report.pdf', 'scanned.pdf'],
  undefined, // use default config
  [
    null,  // use batch defaults
    {      // per-file overrides
      forceOcr: true,
      ocr: { backend: 'tesseract', language: 'deu' },
    },
  ],
);

ContentFilterConfig v4.8.0

Controls whether headers, footers, watermarks, and repeating cross-page text are kept in or stripped from extraction output. Applies to PDF, DOCX, RTF, ODT, HTML, EPUB, and PPT extractors with format-specific behavior.

When content_filter is None on ExtractionConfig, each extractor uses its built-in defaults (the same values listed below).

Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| include_headers | bool | False | Keep running headers. PDF skips top-margin furniture stripping; DOCX includes header parts; HTML/EPUB keep <header> content. |
| include_footers | bool | False | Keep running footers. PDF skips bottom-margin furniture stripping; DOCX includes footer parts; HTML/EPUB keep <footer> content. |
| strip_repeating_text | bool | True | Detect text that repeats verbatim across most pages and remove it. Disable if brand names or repeated headings are being incorrectly stripped. Primarily PDF. |
| include_watermarks | bool | False | Keep watermark text and arXiv-style identifiers. PDF only. |

The strip_repeating_text flag also gates paragraph deduplication: when set to False, near-duplicate paragraphs are preserved as well (kreuzberg/kreuzberg#681, fixed in v4.8.1).

When a layout-detection model is active, it can independently classify regions as PageHeader or PageFooter and strip them per page. To preserve those regions in addition to disabling the cross-page heuristic, set include_headers = True and/or include_footers = True.
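
To give a feel for what cross-page repeating-text removal does, here is a deliberately simplified Python sketch. It is not Kreuzberg's heuristic: the function name, the line-level comparison, and the 0.8 page-fraction threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def strip_repeating_lines(pages: list[list[str]], min_fraction: float = 0.8) -> list[list[str]]:
    """Drop lines that appear verbatim on at least min_fraction of the pages."""
    if len(pages) < 2:
        return pages  # nothing can repeat across a single page
    page_counts: Counter[str] = Counter()
    for lines in pages:
        page_counts.update(set(lines))  # count each line once per page
    threshold = min_fraction * len(pages)
    repeated = {line for line, count in page_counts.items() if count >= threshold}
    return [[line for line in lines if line not in repeated] for lines in pages]


pages = [
    ["ACME Corp Confidential", "Introduction"],
    ["ACME Corp Confidential", "Main findings"],
    ["ACME Corp Confidential", "Conclusion"],
]
print(strip_repeating_lines(pages))
# [['Introduction'], ['Main findings'], ['Conclusion']]
```

The same mechanism explains the failure mode mentioned in the table: a brand name or heading that legitimately appears on most pages looks identical to a running header, which is when disabling strip_repeating_text helps.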

Configuration Examples

content_filter_config.py
from kreuzberg import ExtractionConfig, ContentFilterConfig

# Keep headers and footers for legal/forms work
config = ExtractionConfig(
    content_filter=ContentFilterConfig(
        include_headers=True,
        include_footers=True,
    ),
)
content_filter_config.ts
import { extractFile } from "@kreuzberg/node";

// Disable cross-page repeating-text detection
const result = await extractFile("report.pdf", null, {
  contentFilter: {
    stripRepeatingText: false,
  },
});
content_filter_config.rs
use kreuzberg::{ExtractionConfig, ContentFilterConfig};

let config = ExtractionConfig {
    content_filter: Some(ContentFilterConfig {
        include_headers: true,
        include_footers: true,
        strip_repeating_text: true,
        include_watermarks: false,
    }),
    ..Default::default()
};

Configuration File Examples

kreuzberg.toml
[content_filter]
include_headers = true
include_footers = true
strip_repeating_text = true
include_watermarks = false
kreuzberg.yaml
content_filter:
  include_headers: true
  include_footers: true
  strip_repeating_text: true
  include_watermarks: false

OcrConfig

Configuration for OCR (Optical Character Recognition) processing on images and scanned PDFs.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| backend | str | "tesseract" | OCR backend to use: "tesseract", "easyocr", "paddleocr" |
| language | str | "eng" | Language code(s) for OCR, for example, "eng", "eng+fra", "eng+deu+fra" |
| tesseract_config | TesseractConfig? | None | Tesseract-specific configuration options |
| paddle_ocr_config | PaddleOcrConfig? | None | PaddleOCR-specific configuration options |
| vlm_config | LlmConfig? | None | Vision Language Model configuration for VLM-based OCR. When set, enables using a VLM as an OCR backend. Requires the llm feature. |
| vlm_prompt | String? | None | Custom prompt for VLM-based OCR. Overrides the default OCR prompt sent to the vision model. Useful for domain-specific extraction instructions. |

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Backend = "tesseract",
        Language = "eng+fra",
        TesseractConfig = new TesseractConfig { Psm = 3 }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine(result.Content);
Go
package main

import "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"

func main() {
    language := "eng+fra"
    psm := 3

    _ = &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend:  "tesseract",
            Language: &language,
            Tesseract: &kreuzberg.TesseractConfig{
                PSM: &psm,
            },
        },
    }
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .language("eng+fra")
        .tesseractConfig(TesseractConfig.builder()
            .psm(3)
            .build())
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            backend="tesseract", language="eng+fra",
            tesseract_config=TesseractConfig(psm=3)
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    backend: 'tesseract',
    language: 'eng+fra',
    tesseract_config: Kreuzberg::Config::Tesseract.new(psm: 3)
  )
)
R
library(kreuzberg)

ocr_cfg <- ocr_config(backend = "tesseract", language = "eng", dpi = 300L)
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)

result <- extract_file_sync("document.pdf", "application/pdf", config)
cat(sprintf("Extracted content length: %d\n", nchar(result$content)))
cat(sprintf("Detected language: %s\n", result$detected_language))
Rust
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng+deu+fra".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("multilingual.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
        language: 'eng+fra',
        tesseractConfig: {
            psm: 3,
        },
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

PaddleOcrConfig v4.5.0

PaddleOCR-specific configuration for model selection and detection tuning.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| model_tier v4.5.0 | str | "mobile" | Model tier: "mobile" (lightweight, ~21MB total, fast) or "server" (high accuracy, ~172MB, best with GPU) |
| padding v4.5.0 | int | 10 | Padding in pixels (0-100) added around the image before detection |

TesseractConfig

Tesseract OCR engine configuration with fine-grained control over recognition parameters.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| language | str | "eng" | Language code(s), for example, "eng", "eng+fra" |
| psm | int | 3 | Page Segmentation Mode (0-13, see below) |
| output_format | str | "markdown" | Output format: "text", "markdown", "hocr" |
| oem | int | 3 | OCR Engine Mode (0-3, see below) |
| min_confidence | float | 0.0 | Minimum confidence threshold (0.0-100.0) |
| preprocessing | ImagePreprocessingConfig? | None | Image preprocessing configuration |
| enable_table_detection | bool | true | Enable automatic table detection and reconstruction |
| table_min_confidence | float | 0.0 | Minimum confidence for table cell recognition (0.0-1.0) |
| table_column_threshold | int | 50 | Pixel threshold for detecting table columns |
| table_row_threshold_ratio | float | 0.5 | Row threshold ratio for table detection (0.0-1.0) |
| use_cache | bool | true | Enable OCR result caching for faster re-processing |
| classify_use_pre_adapted_templates | bool | true | Use pre-adapted templates for character classification |
| language_model_ngram_on | bool | false | Enable N-gram language model for better word recognition |
| tessedit_dont_blkrej_good_wds | bool | true | Don't reject good words during block-level processing |
| tessedit_dont_rowrej_good_wds | bool | true | Don't reject good words during row-level processing |
| tessedit_enable_dict_correction | bool | true | Enable dictionary-based word correction |
| tessedit_char_whitelist | str | "" | Allowed characters (empty = all allowed) |
| tessedit_char_blacklist | str | "" | Forbidden characters (empty = none forbidden) |
| tessedit_use_primary_params_model | bool | true | Use primary language params model |
| textord_space_size_is_variable | bool | true | Enable variable-width space detection |
| thresholding_method | bool | false | Use adaptive thresholding method |

Page Segmentation Modes (PSM)

  • 0: Orientation and script detection only (no OCR)
  • 1: Automatic page segmentation with OSD (Orientation and Script Detection)
  • 2: Automatic page segmentation (no OSD, no OCR)
  • 3: Fully automatic page segmentation (default, best for most documents)
  • 4: Single column of text of variable sizes
  • 5: Single uniform block of vertically aligned text
  • 6: Single uniform block of text (best for clean documents)
  • 7: Single text line
  • 8: Single word
  • 9: Single word in a circle
  • 10: Single character
  • 11: Sparse text with no particular order (best for forms, invoices)
  • 12: Sparse text with OSD
  • 13: Raw line (bypass Tesseract's layout analysis)

OCR Engine Modes (OEM)

  • 0: Legacy Tesseract engine only (pre-2016)
  • 1: Neural nets LSTM engine only (recommended for best quality)
  • 2: Legacy + LSTM engines combined
  • 3: Default based on what's available (recommended for compatibility)

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Language = "eng+fra+deu",
        TesseractConfig = new TesseractConfig
        {
            Psm = 6,
            Oem = 1,
            MinConfidence = 0.8m,
            EnableTableDetection = true
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    psm := 6
    oem := 1
    minConf := 0.8
    lang := "eng+fra+deu"
    whitelist := "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?"

    config := &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend:  "tesseract",
            Language: &lang,
            Tesseract: &kreuzberg.TesseractConfig{
                PSM:                   &psm,
                OEM:                   &oem,
                MinConfidence:         &minConf,
                EnableTableDetection:  kreuzberg.BoolPtr(true),
                TesseditCharWhitelist: whitelist,
            },
        },
    }

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .language("eng+fra+deu")
        .tesseractConfig(TesseractConfig.builder()
            .psm(6)
            .oem(1)
            .minConfidence(0.8)
            .tesseditCharWhitelist("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?")
            .enableTableDetection(true)
            .build())
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            language="eng+fra+deu",
            tesseract_config=TesseractConfig(
                psm=6,
                oem=1,
                min_confidence=0.8,
                enable_table_detection=True,
            ),
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    language: 'eng+fra+deu',
    tesseract_config: Kreuzberg::Config::Tesseract.new(
      psm: 6,
      oem: 1,
      min_confidence: 0.8,
      tessedit_char_whitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?',
      enable_table_detection: true
    )
  )
)
R
library(kreuzberg)

ocr_cfg <- ocr_config(
  backend = "tesseract",
  language = "eng+deu",
  dpi = 300L
)
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("Detected language: %s\n", result$detected_language))
cat(sprintf("Content length: %d characters\n", nchar(result$content)))
Rust
use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig};

fn main() {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            language: "eng+fra+deu".to_string(),
            tesseract_config: Some(TesseractConfig {
                psm: 6,
                oem: 1,
                min_confidence: 0.8,
                tessedit_char_whitelist: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?".to_string(),
                enable_table_detection: true,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };
    println!("{:?}", config.ocr);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
        language: 'eng+fra+deu',
        tesseractConfig: {
            psm: 6,
            tesseditCharWhitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?',
            enableTableDetection: true,
        },
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

ChunkingConfig

Configuration for splitting extracted text into overlapping chunks, useful for vector databases and LLM processing.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_characters | int | 1000 | Maximum characters per chunk |
| overlap | int | 200 | Overlap between consecutive chunks in characters |
| embedding | EmbeddingConfig? | None | Optional embedding generation for each chunk |
| preset | str? | None | Chunking preset: "small" (500/100), "medium" (1000/200), "large" (2000/400) |
| trim | bool | true | Whether to trim whitespace from chunk boundaries |
| chunker_type | ChunkerType | Text | Type of chunker: Text, Markdown, or Yaml |
| sizing (v4.5.0) | ChunkSizing | Characters | Controls how chunk size is measured. Characters counts characters (default). Tokenizer counts tokens using a HuggingFace tokenizer model. Requires the chunking-tokenizers feature. |

Note: max_chars and max_overlap are accepted as aliases for max_characters and overlap respectively for backwards compatibility.
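
To make the interaction between max_characters and overlap concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is an illustration only: the helper `chunk_text` is hypothetical, and Kreuzberg's actual chunker may pick boundaries differently (for example at word or sentence breaks).

```python
def chunk_text(text: str, max_characters: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_characters characters.

    Each window repeats the last `overlap` characters of the previous one.
    Hypothetical helper, not part of the Kreuzberg API.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    step = max_characters - overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghij" * 3, max_characters=10, overlap=4)
print(len(chunks))  # 5
print(chunks[0][-4:] == chunks[1][:4])  # True
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks, which is the trade-off the presets (500/100, 1000/200, 2000/400) encode.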

When chunker_type is set to "markdown", the chunker populates heading_context on each chunk's metadata with the heading hierarchy (for example, # Title > ## Section) that the chunk falls under. This is useful for preserving semantic context in RAG pipelines.
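
The heading hierarchy described above can be pictured with a small sketch. The helper `heading_breadcrumbs` is hypothetical and only handles ATX-style headings (lines starting with `#`); it ignores code fences and is not how Kreuzberg itself parses Markdown.

```python
import re

# Matches ATX headings such as "## Section" (1 to 6 leading '#').
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_breadcrumbs(markdown: str) -> list[tuple[str, str]]:
    """Return (breadcrumb, line) pairs for every non-blank body line."""
    stack: list[tuple[int, str]] = []  # (level, text) of currently open headings
    pairs: list[tuple[str, str]] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # A new heading closes all open headings at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        elif line.strip():
            crumb = " > ".join(f"{'#' * lvl} {text}" for lvl, text in stack)
            pairs.append((crumb, line))
    return pairs


doc = "# Title\nintro\n## Section\nbody\n## Other\nmore"
pairs = heading_breadcrumbs(doc)
for crumb, line in pairs:
    print(f"{crumb!r}: {line}")
```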

Example

C#
using Kreuzberg;

class Program
{
static async Task Main()
{
    var config = new ExtractionConfig
    {
        Chunking = new ChunkingConfig
        {
            MaxChars = 1000,
            MaxOverlap = 200,
            Embedding = new EmbeddingConfig
            {
                Model = EmbeddingModelType.Preset("all-minilm-l6-v2"),
                Normalize = true,
                BatchSize = 32
            }
        }
    };

    try
    {
        var result = await KreuzbergClient.ExtractFileAsync(
            "document.pdf",
            config
        ).ConfigureAwait(false);

        Console.WriteLine($"Chunks: {result.Chunks.Count}");
        foreach (var chunk in result.Chunks)
        {
            Console.WriteLine($"Content length: {chunk.Content.Length}");
            if (chunk.Embedding != null)
            {
                Console.WriteLine($"Embedding dimensions: {chunk.Embedding.Length}");
            }
        }
    }
    catch (KreuzbergException ex)
    {
        Console.WriteLine($"Error: {ex.Message}");
    }
}

static async Task PrependHeadingContextExample()
{
    var config = new ExtractionConfig
    {
        Chunking = new ChunkingConfig
        {
            MaxChars = 500,
            MaxOverlap = 50,
            PrependHeadingContext = true
        }
    };

    try
    {
        var result = await KreuzbergClient.ExtractFileAsync(
            "document.md",
            config
        ).ConfigureAwait(false);

        foreach (var chunk in result.Chunks)
        {
            // Each chunk's content is prefixed with its heading breadcrumb
            Console.WriteLine(chunk.Content[..Math.Min(100, chunk.Content.Length)]);
        }
    }
    catch (KreuzbergException ex)
    {
        Console.WriteLine($"Error: {ex.Message}");
    }
}

}

Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    maxChars := 1000
    maxOverlap := 200
    config := &kreuzberg.ExtractionConfig{
        Chunking: &kreuzberg.ChunkingConfig{
            MaxChars:   &maxChars,
            MaxOverlap: &maxOverlap,
        },
    }

    fmt.Printf("Config: MaxChars=%d, MaxOverlap=%d\n", *config.Chunking.MaxChars, *config.Chunking.MaxOverlap)
}
Go - Markdown with Heading Context
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    maxChars := 500
    maxOverlap := 50

    config := &kreuzberg.ExtractionConfig{
        Chunking: &kreuzberg.ChunkingConfig{
            MaxChars:   &maxChars,
            MaxOverlap: &maxOverlap,
            Sizing: &kreuzberg.ChunkSizingConfig{
                Type:  "tokenizer",
                Model: "Xenova/gpt-4o",
            },
        },
    }

    result, err := kreuzberg.ExtractFile("document.md", nil, config)
    if err != nil {
        panic(err)
    }

    for _, chunk := range result.Chunks {
        if chunk.Metadata != nil && chunk.Metadata.HeadingContext != nil {
            for _, heading := range chunk.Metadata.HeadingContext.Headings {
                fmt.Printf("Heading L%d: %s\n", heading.Level, heading.Text)
            }
        }
        fmt.Printf("Content: %.100s...\n", chunk.Content)
    }
}
Go - Prepend Heading Context
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func boolPtr(b bool) *bool { return &b }

func main() {
    maxChars := 500
    maxOverlap := 50

    config := &kreuzberg.ExtractionConfig{
        Chunking: &kreuzberg.ChunkingConfig{
            MaxChars:              &maxChars,
            MaxOverlap:            &maxOverlap,
            PrependHeadingContext: boolPtr(true),
        },
    }

    result, err := kreuzberg.ExtractFile("document.md", nil, config)
    if err != nil {
        panic(err)
    }

    for _, chunk := range result.Chunks {
        // Each chunk's content is prefixed with its heading breadcrumb
        fmt.Printf("Content: %.100s...\n", chunk.Content)
    }
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .chunking(ChunkingConfig.builder()
        .maxChars(1000)
        .maxOverlap(200)
        .build())
    .build();
Java - Markdown with Heading Context
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;
import dev.kreuzberg.HeadingContext;
import dev.kreuzberg.HeadingLevel;

ExtractionConfig config = ExtractionConfig.builder()
    .chunking(ChunkingConfig.builder()
        .chunkerType("markdown")
        .maxChars(500)
        .maxOverlap(50)
        .sizingTokenizer("Xenova/gpt-4o")
        .build())
    .build();

ExtractionResult result = KreuzbergClient.extractFile("document.md", config);

result.getChunks().forEach(chunk -> {
    var headingContext = chunk.getMetadata().getHeadingContext();
    if (headingContext.isPresent()) {
        System.out.println("Headings:");
        headingContext.get().getHeadings().forEach(heading ->
            System.out.println("  Level " + heading.getLevel() + ": " + heading.getText())
        );
    }
});
Java - Prepend Heading Context
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .chunking(ChunkingConfig.builder()
        .chunkerType("markdown")
        .maxChars(500)
        .maxOverlap(50)
        .prependHeadingContext(true)
        .build())
    .build();

ExtractionResult result = KreuzbergClient.extractFile("document.md", config);

result.getChunks().forEach(chunk -> {
    // Each chunk's content is prefixed with its heading breadcrumb
    System.out.println(chunk.getContent().substring(0, Math.min(100, chunk.getContent().length())));
});
Python
import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            max_chars=1000,
            max_overlap=200,
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Chunks: {len(result.chunks or [])}")
    for chunk in result.chunks or []:
        print(f"Length: {len(chunk.content)}")

asyncio.run(main())
Python - Markdown with Heading Context
import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            chunker_type="markdown",
            max_chars=500,
            max_overlap=50,
            sizing_type="tokenizer",
            sizing_model="Xenova/gpt-4o",
        )
    )
    result = await extract_file("document.md", config=config)
    for chunk in result.chunks or []:
        heading_context = chunk.metadata.get("heading_context")
        if heading_context:
            headings = heading_context.get("headings", [])
            for h in headings:
                print(f"Heading L{h['level']}: {h['text']}")
        print(f"Content: {chunk.content[:100]}...")

asyncio.run(main())
Python - Prepend Heading Context
import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            chunker_type="markdown",
            max_chars=500,
            max_overlap=50,
            prepend_heading_context=True,
        )
    )
    result = await extract_file("document.md", config=config)
    for chunk in result.chunks or []:
        # Each chunk's content is prefixed with its heading breadcrumb
        print(f"Content: {chunk.content[:100]}...")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(
    max_characters: 1000,
    overlap: 200
  )
)
Ruby - Markdown with Heading Context
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(
    chunker_type: "markdown",
    max_characters: 500,
    overlap: 50,
    sizing_type: "tokenizer",
    sizing_model: "Xenova/gpt-4o"
  )
)

result = Kreuzberg.extract_file("document.md", config)

result.chunks.each do |chunk|
  if chunk.metadata.heading_context
    puts "Headings:"
    chunk.metadata.heading_context.headings.each do |heading|
      puts "  #{' ' * (heading.level - 1) * 2}Level #{heading.level}: #{heading.text}"
    end
  end
end
Ruby - Prepend Heading Context
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(
    chunker_type: "markdown",
    max_characters: 500,
    overlap: 50,
    prepend_heading_context: true
  )
)

result = Kreuzberg.extract_file("document.md", config)

result.chunks.each do |chunk|
  # Each chunk's content is prefixed with its heading breadcrumb
  puts chunk.content[0, 100]
end
R
library(kreuzberg)

# Example 1: Basic character-based chunking
chunking_cfg <- chunking_config(max_characters = 1000L, overlap = 200L)
config <- extraction_config(chunking = chunking_cfg)

result <- extract_file_sync("document.pdf", "application/pdf", config)
num_chunks <- length(result$chunks)
cat(sprintf("Document split into %d chunks\n", num_chunks))
for (i in seq_len(min(3L, num_chunks))) {
  cat(sprintf("Chunk %d: %d characters\n", i, nchar(result$chunks[[i]]$content)))
}

# Example 2: Markdown chunker with token-based sizing and heading context
chunking_cfg2 <- chunking_config(
  chunker_type = "markdown",
  sizing = list(
    type = "tokenizer",
    model = "Xenova/gpt-4o"
  )
)
config2 <- extraction_config(chunking = chunking_cfg2)

result2 <- extract_file_sync("document.md", "text/markdown", config2)
num_chunks2 <- length(result2$chunks)
cat(sprintf("\nMarkdown document split into %d chunks\n", num_chunks2))

for (i in seq_len(min(3L, num_chunks2))) {
  chunk <- result2$chunks[[i]]
  cat(sprintf("\nChunk %d:\n", i))
  cat(sprintf("  Preview: %s...\n", substr(chunk$content, 1, 60)))

  # Access heading context
  if (!is.null(chunk$metadata$heading_context)) {
    headings <- chunk$metadata$heading_context$headings
    if (length(headings) > 0) {
      cat("  Headings in context:\n")
      for (h in headings) {
        cat(sprintf("    - Level %d: %s\n", h$level, h$text))
      }
    }
  }
}

# Example 3: Prepend heading context to chunk content
chunking_cfg3 <- chunking_config(
  chunker_type = "markdown",
  prepend_heading_context = TRUE
)
config3 <- extraction_config(chunking = chunking_cfg3)

result3 <- extract_file_sync("document.md", "text/markdown", config3)
num_chunks3 <- length(result3$chunks)
cat(sprintf("\nDocument split into %d chunks with prepended headings\n", num_chunks3))

for (i in seq_len(min(3L, num_chunks3))) {
  chunk <- result3$chunks[[i]]
  # Each chunk's content is prefixed with its heading breadcrumb
  cat(sprintf("Chunk %d: %s...\n", i, substr(chunk$content, 1, 80)))
}
Rust
use kreuzberg::{ExtractionConfig, ChunkingConfig};

let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        embedding: None,
        // Remaining fields (preset, trim, chunker_type, sizing) keep their defaults
        ..Default::default()
    }),
    ..Default::default()
};
Rust - Prepend Heading Context
use kreuzberg::{ExtractionConfig, ChunkingConfig, ChunkerType};

let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_characters: 500,
        overlap: 50,
        chunker_type: ChunkerType::Markdown,
        prepend_heading_context: true,
        ..Default::default()
    }),
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    chunking: {
        maxChars: 1000,
        maxOverlap: 200,
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(`Total chunks: ${result.chunks?.length ?? 0}`);
TypeScript - Markdown with Heading Context
import { extractFile } from '@kreuzberg/node';

const config = {
    chunking: {
        chunkerType: 'markdown',
        maxChars: 500,
        maxOverlap: 50,
        sizingType: 'tokenizer',
        sizingModel: 'Xenova/gpt-4o',
    },
};

const result = await extractFile('document.md', null, config);
for (const chunk of result.chunks ?? []) {
    const headings = chunk.metadata?.headingContext?.headings ?? [];
    for (const heading of headings) {
        console.log(`Heading L${heading.level}: ${heading.text}`);
    }
    console.log(`Content: ${chunk.content.slice(0, 100)}...`);
}
TypeScript - Prepend Heading Context
import { extractFile } from '@kreuzberg/node';

const config = {
    chunking: {
        chunkerType: 'markdown',
        maxChars: 500,
        maxOverlap: 50,
        prependHeadingContext: true,
    },
};

const result = await extractFile('document.md', null, config);
for (const chunk of result.chunks ?? []) {
    // Each chunk's content is prefixed with its heading breadcrumb
    console.log(`Content: ${chunk.content.slice(0, 100)}...`);
}

EmbeddingConfig

Configuration for generating vector embeddings for text chunks. Enables semantic search and similarity matching by converting text into high-dimensional vector representations.

Overview

EmbeddingConfig is used to control embedding generation when chunking documents. It allows you to choose from pre-optimized models or specify custom models from HuggingFace. Embeddings can be generated for each chunk to enable vector database integration and semantic search capabilities.

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | EmbeddingModelType | Preset { name: "balanced" } | Embedding model selection (preset or custom) |
| batch_size | usize | 32 | Number of texts to process in each batch (higher = faster but more memory) |
| normalize | bool | true | Normalize embedding vectors to unit length (recommended for cosine similarity) |
| show_download_progress | bool | false | Show progress when downloading model files |
| cache_dir | String? | ~/.cache/kreuzberg/embeddings/ | Custom cache directory for downloaded models |

Model Types

Preset models are pre-optimized configurations for common use cases. They automatically download and cache the necessary model files.

| Preset | Model | Dims | Speed | Quality | Use Case |
| --- | --- | --- | --- | --- | --- |
| fast | AllMiniLML6V2Q | 384 | Very Fast | Good | Development, prototyping, resource-constrained environments |
| balanced | BGEBaseENV15 | 768 | Fast | Excellent | Default: General-purpose RAG, production deployments, English documents |
| quality | BGELargeENV15 | 1024 | Moderate | Outstanding | Complex documents, maximum accuracy, sufficient compute resources |
| multilingual | MultilingualE5Base | 768 | Fast | Excellent | International documents, 100+ languages, mixed-language content |

Preset models require the embeddings feature to be enabled in Kreuzberg.

Model Characteristics:

  • Fast: ~22M parameters, 384-dimensional vectors. Best for quick prototyping and development where speed is prioritized over quality.
  • Balanced: ~109M parameters, 768-dimensional vectors. Excellent general-purpose model with strong semantic understanding for most use cases.
  • Quality: ~335M parameters, 1024-dimensional vectors. Large model for maximum semantic accuracy when compute resources are available.
  • Multilingual: ~109M parameters, 768-dimensional vectors. Trained on multilingual data, effective for 100+ languages including rare languages.
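
Vector dimensionality also drives how much space the stored embeddings take. The arithmetic below is a rough sketch that assumes each value is stored as a 4-byte float (an assumption, not something this page states) and ignores any index overhead added by a vector database. The helper name is hypothetical.

```python
# Dimensions per preset, taken from the preset table above.
PRESET_DIMS = {"fast": 384, "balanced": 768, "quality": 1024, "multilingual": 768}


def embedding_storage_bytes(preset: str, num_chunks: int, bytes_per_value: int = 4) -> int:
    """Raw storage for num_chunks vectors (assumes 4-byte floats by default)."""
    return PRESET_DIMS[preset] * bytes_per_value * num_chunks


for name in PRESET_DIMS:
    mib = embedding_storage_bytes(name, 100_000) / (1024 ** 2)
    print(f"{name}: {mib:.1f} MiB for 100,000 chunks")
```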

FastEmbed Models

FastEmbed is a library for fast embedding generation. You can specify any supported FastEmbed model by name.

Common FastEmbed models:

  • AllMiniLML6V2Q - 384 dims, fast, quantized (same as fast preset)
  • BGEBaseENV15 - 768 dims, balanced (same as balanced preset)
  • BGELargeENV15 - 1024 dims, high quality (same as quality preset)
  • MultilingualE5Base - 768 dims, multilingual (same as multilingual preset)

FastEmbed models require the embeddings feature and an explicit dimensions value.

Custom Models

Custom ONNX models from HuggingFace can be specified for specialized use cases. Provide the HuggingFace model ID and vector dimensions.

Note: Custom model support for full embedding generation is planned for future releases. Currently, custom models can be loaded and used via the Rust API.

LLM Provider-Hosted Embeddings

Instead of running local ONNX models, you can delegate embedding generation to a cloud provider's embedding API via liter-llm. This is useful when you want to use the same embedding model as your vector database provider or when local model hosting is impractical.

llm_embedding.rs
use kreuzberg::core::{EmbeddingConfig, EmbeddingModelType, LlmConfig};

let config = EmbeddingConfig {
    model: EmbeddingModelType::Llm {
        llm: LlmConfig {
            model: "openai/text-embedding-3-small".to_string(),
            api_key: None, // Falls back to OPENAI_API_KEY env var
            base_url: None,
        },
    },
    batch_size: 32,
    ..Default::default()
};
kreuzberg.toml
[chunking.embedding]
model = { type = "llm", model = "openai/text-embedding-3-small" }
batch_size = 32

Note: When api_key is not set in LlmConfig, liter-llm falls back to provider-standard environment variables (for example, OPENAI_API_KEY, ANTHROPIC_API_KEY). Requires the llm feature.
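
The fallback rule in the note can be sketched as follows. This is an illustration of the described behavior, not liter-llm's implementation: `resolve_api_key` and the `PROVIDER_ENV_VARS` mapping are hypothetical, and the mapping lists only the two variables named on this page.

```python
import os
from typing import Mapping, Optional

# Illustrative mapping only; real providers and variable names may differ.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def resolve_api_key(
    model: str,
    api_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer the explicit key, else look up the provider's environment variable."""
    if api_key is not None:
        return api_key
    provider, _, _ = model.partition("/")  # "openai/gpt-4o-mini" -> "openai"
    variable = PROVIDER_ENV_VARS.get(provider)
    return env.get(variable) if variable else None


# A fake environment keeps the demo independent of the real one.
demo_env = {"OPENAI_API_KEY": "sk-demo"}
print(resolve_api_key("openai/text-embedding-3-small", env=demo_env))
```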

Cache Directory

Model files are cached locally to avoid re-downloading on subsequent runs.

Default cache location:

~/.cache/kreuzberg/embeddings/

Features:

  • Tilde (~) expansion: Home directory automatically resolved
  • Automatic creation: Cache directory created if it doesn't exist
  • Persistent across runs: Models cached indefinitely until manually removed
  • Multi-process safe: Thread-safe concurrent access
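
As a rough picture of the tilde expansion and automatic creation listed above, the sketch below resolves a cache path with the Python standard library. The helper `resolve_cache_dir` is hypothetical and says nothing about how Kreuzberg handles locking or concurrent access.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional

DEFAULT_CACHE_DIR = "~/.cache/kreuzberg/embeddings/"


def resolve_cache_dir(cache_dir: Optional[str] = None) -> Path:
    """Expand '~' and create the directory if it does not exist yet."""
    path = Path(cache_dir or DEFAULT_CACHE_DIR).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demo against a temporary directory so nothing is written to the real home.
with tempfile.TemporaryDirectory() as tmp:
    resolved = resolve_cache_dir(os.path.join(tmp, "models", "cache"))
    print(resolved.is_dir())  # True
```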

Custom cache directory:

Custom Embedding Cache Directory
[chunking.embedding]
model = { type = "preset", name = "balanced" }
cache_dir = "/custom/cache/path"

Performance Considerations

Batch Size Tuning

  • Default: 32 texts per batch
  • Small values (8-16): Lower memory usage, slower processing
  • Large values (64-128): Faster processing, higher memory usage
  • Adjust based on available GPU/CPU memory and document sizes
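
What batch_size means in practice is simply how many texts are handed to the model per call. A minimal sketch, with a hypothetical `batched` helper standing in for the internal batching:

```python
from typing import Iterator, Sequence


def batched(texts: Sequence[str], batch_size: int = 32) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = [f"chunk {i}" for i in range(70)]
sizes = [len(batch) for batch in batched(texts, batch_size=32)]
print(sizes)  # [32, 32, 6]
```

Peak memory scales with the largest batch, while the number of model calls scales with the number of batches, which is why larger values trade memory for speed.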

Normalization

  • Enabled (default): Vectors normalized to unit length, suitable for cosine similarity
  • Disabled: Raw vectors suitable for other distance metrics (Euclidean, dot product)
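
The reason unit-length vectors suit cosine similarity is that, once both vectors are normalized, a plain dot product equals their cosine similarity. The sketch below checks this with the standard library only; the function names are illustrative.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine(a, b))                          # cosine similarity of the raw vectors
print(dot(l2_normalize(a), l2_normalize(b)))  # same value via a dot product
```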

Model Size Trade-offs

| Model | Size | Speed | Quality | Memory | Network |
| --- | --- | --- | --- | --- | --- |
| Fast | 20 MB | Fastest | Good | 200 MB | 100 MB |
| Balanced | 250 MB | Fast | Excellent | 500 MB | 250 MB |
| Quality | 800 MB | Moderate | Outstanding | 1.5 GB | 800 MB |
| Multilingual | 250 MB | Fast | Excellent | 500 MB | 250 MB |
Configuration Examples

embedding_basic.rs
use kreuzberg::core::{ExtractionConfig, ChunkingConfig, EmbeddingConfig, EmbeddingModelType};

// Basic embedding with default balanced preset
let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        embedding: Some(EmbeddingConfig::default()),
        preset: None,
        // Remaining fields (trim, chunker_type, sizing) keep their defaults
        ..Default::default()
    }),
    ..Default::default()
};
embedding_preset.rs
use kreuzberg::core::{EmbeddingConfig, EmbeddingModelType};

// Use fast preset for quick processing
let config = EmbeddingConfig {
    model: EmbeddingModelType::Preset {
        name: "fast".to_string(),
    },
    normalize: true,
    batch_size: 16,
    show_download_progress: true,
    cache_dir: None,
};

// Use quality preset for best accuracy
let config = EmbeddingConfig {
    model: EmbeddingModelType::Preset {
        name: "quality".to_string(),
    },
    batch_size: 32,
    ..Default::default()
};

// Use multilingual for international content
let config = EmbeddingConfig {
    model: EmbeddingModelType::Preset {
        name: "multilingual".to_string(),
    },
    ..Default::default()
};
embedding_custom_onnx.rs
use kreuzberg::core::{EmbeddingConfig, EmbeddingModelType};

// Explicit ONNX model specification
let config = EmbeddingConfig {
    model: EmbeddingModelType::FastEmbed {
        model: "BGEBaseENV15".to_string(),
        dimensions: 768,
    },
    batch_size: 32,
    ..Default::default()
};
embedding_cache.rs
use kreuzberg::core::{EmbeddingConfig, EmbeddingModelType};
use std::path::PathBuf;

let config = EmbeddingConfig {
    model: EmbeddingModelType::Preset {
        name: "balanced".to_string(),
    },
    cache_dir: Some(PathBuf::from("/custom/models/cache")),
    show_download_progress: true,
    ..Default::default()
};

Configuration File Examples

TOML Format

kreuzberg.toml
[chunking]
max_characters = 1000
overlap = 200

# Use balanced preset (default)
[chunking.embedding]
model = { type = "preset", name = "balanced" }
batch_size = 32
normalize = true

# Or use fast preset
# [chunking.embedding]
# model = { type = "preset", name = "fast" }
# batch_size = 16

# Or use custom cache directory
# [chunking.embedding]
# model = { type = "preset", name = "quality" }
# cache_dir = "/data/models"
# show_download_progress = true

Token-Based Sizing (TOML)

kreuzberg.toml
[chunking]
max_chars = 512
max_overlap = 50

[chunking.sizing]
type = "tokenizer"
model = "Xenova/gpt-4o"

Note

Token-based sizing requires the chunking-tokenizers feature to be enabled.

YAML Format

kreuzberg.yaml
chunking:
  max_characters: 1000
  overlap: 200
  embedding:
    model:
      type: preset
      name: balanced
    batch_size: 32
    normalize: true

JSON Format

kreuzberg.json
{
  "chunking": {
    "max_characters": 1000,
    "overlap": 200,
    "embedding": {
      "model": {
        "type": "preset",
        "name": "balanced"
      },
      "batch_size": 32,
      "normalize": true
    }
  }
}

LlmConfig

Configuration for LLM provider connections used by structured extraction, VLM-based OCR, and provider-hosted embeddings. Uses liter-llm for provider-agnostic model access.

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | String | - | Model identifier in provider/model-name format (for example, "openai/gpt-4o-mini", "anthropic/claude-sonnet-4-20250514") |
| api_key | String? | None | API key for the provider. When None, falls back to provider-standard env vars (for example, OPENAI_API_KEY, ANTHROPIC_API_KEY) |
| base_url | String? | None | Custom base URL for the provider API. When None, uses the provider's default endpoint. Useful for proxies or self-hosted API-compatible servers |

Configuration Examples

llm_config.rs
use kreuzberg::core::LlmConfig;

// Minimal config (uses provider env var for API key)
let config = LlmConfig {
    model: "openai/gpt-4o-mini".to_string(),
    api_key: None,
    base_url: None,
};

// Explicit API key and custom endpoint
let config = LlmConfig {
    model: "openai/gpt-4o".to_string(),
    api_key: Some("sk-...".to_string()),
    base_url: Some("https://api.example.com".to_string()),
};
llm_config.py
config = {
    "model": "openai/gpt-4o-mini",
    "api_key": None,       # Falls back to OPENAI_API_KEY
    "base_url": None,
}
llm_config.ts
const config: LlmConfig = {
  model: "openai/gpt-4o-mini",
  apiKey: undefined,     // Falls back to OPENAI_API_KEY
  baseUrl: undefined,
};
llm_config.go
config := kreuzberg.LlmConfig{
    Model:   "openai/gpt-4o-mini",
    ApiKey:  nil,  // Falls back to OPENAI_API_KEY
    BaseUrl: nil,
}

Configuration File Examples

kreuzberg.toml
[llm]
model = "openai/gpt-4o-mini"
# api_key = "sk-..."       # Optional: falls back to OPENAI_API_KEY
# base_url = "https://..."  # Optional: uses provider default
kreuzberg.yaml
llm:
  model: openai/gpt-4o-mini
  # api_key: sk-...
  # base_url: https://...

StructuredExtractionConfig

Configuration for LLM-powered structured data extraction. Enables extracting structured data from documents by providing a JSON schema that defines the expected output format. The LLM processes the document content and returns data conforming to the schema.

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| llm | LlmConfig | - | LLM provider configuration for the structured extraction model |
| schema | JsonValue | - | JSON Schema defining the expected output structure. Must be a valid JSON Schema object. |
| prompt | String? | None | Custom system prompt for structured extraction. Overrides the default prompt. Useful for domain-specific instructions. |
| max_tokens | usize? | None | Maximum tokens for LLM response. When None, uses the provider's default limit. |
| temperature | f64? | None | Sampling temperature (0.0-2.0). Lower values produce more deterministic output. When None, defaults to 0.0 for maximum consistency. |
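
Because the model's output is only expected to conform to the schema, it can be worth checking the result before using it. The sketch below is a deliberately small check of the `required` list, not full JSON Schema validation; `missing_required` is a hypothetical helper and the response string is invented for illustration.

```python
import json

# Schema shaped like the invoice example used in this section.
SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}


def missing_required(data: dict, schema: dict) -> list[str]:
    """Return required top-level keys that are absent from data."""
    return [key for key in schema.get("required", []) if key not in data]


# Stand-in for a structured extraction result.
response = json.loads('{"invoice_number": "INV-001"}')
print(missing_required(response, SCHEMA))  # ['total_amount']
```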

Configuration Examples

structured_extraction.rs
use kreuzberg::core::{ExtractionConfig, StructuredExtractionConfig, LlmConfig};
use serde_json::json;

let config = ExtractionConfig {
    structured_extraction: Some(StructuredExtractionConfig {
        llm: LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            api_key: None,
            base_url: None,
        },
        schema: json!({
            "type": "object",
            "properties": {
                "invoice_number": { "type": "string" },
                "total_amount": { "type": "number" },
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": { "type": "string" },
                            "amount": { "type": "number" }
                        }
                    }
                }
            },
            "required": ["invoice_number", "total_amount"]
        }),
        prompt: None,
        max_tokens: None,
        temperature: Some(0.0),
    }),
    ..Default::default()
};
structured_extraction.py
config = {
    "structured_extraction": {
        "llm": {
            "model": "openai/gpt-4o-mini",
        },
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "total_amount": {"type": "number"},
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "amount": {"type": "number"},
                        },
                    },
                },
            },
            "required": ["invoice_number", "total_amount"],
        },
        "temperature": 0.0,
    },
}
structured_extraction.ts
const config: ExtractionConfig = {
  structuredExtraction: {
    llm: {
      model: "openai/gpt-4o-mini",
    },
    schema: {
      type: "object",
      properties: {
        invoice_number: { type: "string" },
        total_amount: { type: "number" },
        line_items: {
          type: "array",
          items: {
            type: "object",
            properties: {
              description: { type: "string" },
              amount: { type: "number" },
            },
          },
        },
      },
      required: ["invoice_number", "total_amount"],
    },
    temperature: 0.0,
  },
};

Configuration File Examples

kreuzberg.toml
[structured_extraction]
prompt = "Extract invoice data from the document."
max_tokens = 4096
temperature = 0.0

[structured_extraction.llm]
model = "openai/gpt-4o-mini"

[structured_extraction.schema]
type = "object"

[structured_extraction.schema.properties.invoice_number]
type = "string"

[structured_extraction.schema.properties.total_amount]
type = "number"
kreuzberg.yaml
structured_extraction:
  llm:
    model: openai/gpt-4o-mini
  schema:
    type: object
    properties:
      invoice_number:
        type: string
      total_amount:
        type: number
    required:
      - invoice_number
      - total_amount
  temperature: 0.0

EmailConfig

Configuration for .msg (Outlook/MAPI) and .eml email file extraction. Controls how legacy Windows codepage encodings are handled when reading email headers and bodies that lack explicit character set declarations.

Overview

Many older email messages — particularly those created by Microsoft Outlook on Windows — encode text using a Windows code page rather than UTF-8. When no charset is declared in the message headers, Kreuzberg defaults to Windows-1252 (Western European). Use msg_fallback_codepage to override this default for mailboxes that predominantly contain messages in a different encoding.

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| msg_fallback_codepage | int? | None (Windows-1252) | Windows code page number used when no charset is declared in the message. None = use 1252. |

Common Codepage Values

| Code Page | Encoding | Region / Language |
| --- | --- | --- |
| 1250 | Windows Central European | Polish, Czech, Hungarian, and so on |
| 1251 | Windows Cyrillic | Russian, Ukrainian, Bulgarian |
| 1252 | Windows Western European | English, German, French (default) |
| 1253 | Windows Greek | Greek |
| 1254 | Windows Turkish | Turkish |
| 1255 | Windows Hebrew | Hebrew |
| 1256 | Windows Arabic | Arabic |
| 932 | Shift-JIS | Japanese |
| 936 | GBK (Simplified Chinese) | Simplified Chinese |
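
The effect of choosing the wrong fallback code page is easy to reproduce with Python's built-in codecs. The sketch below is an illustration of the fallback rule described in the overview; `decode_with_fallback` is a hypothetical helper, not Kreuzberg's decoder.

```python
from typing import Optional


def decode_with_fallback(
    data: bytes,
    declared_charset: Optional[str] = None,
    fallback_codepage: Optional[int] = None,
) -> str:
    """Use the declared charset if present, else the fallback code page (1252 by default)."""
    if declared_charset:
        return data.decode(declared_charset, errors="replace")
    codepage = 1252 if fallback_codepage is None else fallback_codepage
    return data.decode(f"cp{codepage}", errors="replace")


raw = "Привет".encode("cp1251")  # bytes as a Cyrillic Outlook client might store them

print(decode_with_fallback(raw))                          # mojibake under the 1252 default
print(decode_with_fallback(raw, fallback_codepage=1251))  # readable Cyrillic text
```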

Configuration Examples

email_config.py
from kreuzberg import ExtractionConfig, PdfConfig
from kreuzberg.email import EmailConfig

# Extract a Russian Outlook .msg file with Cyrillic encoding
config = ExtractionConfig(
    pdf_options=PdfConfig(
        email=EmailConfig(msg_fallback_codepage=1251)
    )
)
email_config.ts
import { extract } from "kreuzberg";

// Extract a Japanese .msg file encoded in Shift-JIS
const result = await extract("message.msg", {
  pdfOptions: {
    email: { msgFallbackCodepage: 932 },
  },
});
email_config.rs
use kreuzberg::core::{ExtractionConfig, PdfConfig, EmailConfig};

// Extract a Central European .msg file
let config = ExtractionConfig {
    pdf_options: Some(PdfConfig {
        email: Some(EmailConfig {
            msg_fallback_codepage: Some(1250),
        }),
        ..Default::default()
    }),
    ..Default::default()
};

LanguageDetectionConfig

Configuration for automatic language detection in extracted text.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Enable language detection |
| min_confidence | float | 0.8 | Minimum confidence threshold (0.0-1.0) for reporting detected languages |
| detect_multiple | bool | false | Detect multiple languages (vs. dominant language only) |
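
One way to read min_confidence and detect_multiple together is as a filter over per-language scores. The sketch below assumes the detector produces a score between 0.0 and 1.0 for each candidate language; `select_languages` is hypothetical and the exact semantics inside Kreuzberg may differ.

```python
def select_languages(
    scores: dict[str, float],
    min_confidence: float = 0.8,
    detect_multiple: bool = False,
) -> list[str]:
    """Keep languages at or above the threshold, best first."""
    ranked = sorted(
        (item for item in scores.items() if item[1] >= min_confidence),
        key=lambda item: item[1],
        reverse=True,
    )
    languages = [language for language, _ in ranked]
    # Without detect_multiple, only the dominant language is reported.
    return languages if detect_multiple else languages[:1]


scores = {"eng": 0.93, "deu": 0.85, "fra": 0.40}
print(select_languages(scores))                        # ['eng']
print(select_languages(scores, detect_multiple=True))  # ['eng', 'deu']
```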

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    LanguageDetection = new LanguageDetectionConfig
    {
        Enabled = true,
        MinConfidence = 0.9,
        DetectMultiple = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Languages: {string.Join(", ", result.DetectedLanguages ?? new List<string>())}");
Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    minConfidence := 0.8
    config := &kreuzberg.ExtractionConfig{
        LanguageDetection: &kreuzberg.LanguageDetectionConfig{
            Enabled:        true,
            MinConfidence:  &minConfidence,
            DetectMultiple: false,
        },
    }

    fmt.Printf("Language detection enabled: %v\n", config.LanguageDetection.Enabled)
    fmt.Printf("Min confidence: %f\n", *config.LanguageDetection.MinConfidence)
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.LanguageDetectionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .languageDetection(LanguageDetectionConfig.builder()
        .enabled(true)
        .minConfidence(0.8)
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        language_detection=LanguageDetectionConfig(
            enabled=True,
            min_confidence=0.85,
            detect_multiple=False
        )
    )
    result = await extract_file("document.pdf", config=config)
    if result.detected_languages:
        print(f"Primary language: {result.detected_languages[0]}")
    print(f"Content length: {len(result.content)} chars")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  language_detection: Kreuzberg::Config::LanguageDetection.new(
    enabled: true,
    min_confidence: 0.8,
    detect_multiple: false
  )
)
R
library(kreuzberg)

config <- extraction_config(
  language_detection = list(enabled = TRUE)
)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("Detected language: %s\n", result$detected_language))
cat(sprintf("Content preview: %.60s...\n", result$content))
Rust
use kreuzberg::{ExtractionConfig, LanguageDetectionConfig};

let config = ExtractionConfig {
    language_detection: Some(LanguageDetectionConfig {
        enabled: true,
        min_confidence: 0.8,
        detect_multiple: false,
    }),
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    languageDetection: {
        enabled: true,
        minConfidence: 0.8,
        detectMultiple: false,
    },
};

const result = await extractFile('document.pdf', null, config);
if (result.detectedLanguages) {
    console.log(`Detected languages: ${result.detectedLanguages.join(', ')}`);
}

KeywordConfig

Configuration for automatic keyword extraction from document text using YAKE or RAKE algorithms.

Feature Gate: Requires either the keywords-yake or the keywords-rake Cargo feature. Keyword extraction is only available when at least one of these features is enabled.

Overview

Keyword extraction automatically identifies important terms and phrases in extracted text without manual labeling. Two algorithms are available:

  • YAKE: Statistical approach based on term frequency and co-occurrence analysis
  • RAKE: Rapid Automatic Keyword Extraction using word co-occurrence and frequency

Both algorithms analyze each document on its own and require no external training data, making them suitable for documents in any domain.

Configuration Fields

| Field | Type | Default | Description |
|---|---|---|---|
| algorithm | KeywordAlgorithm | Yake (if available) | Algorithm to use: yake or rake |
| max_keywords | usize | 10 | Maximum number of keywords to extract |
| min_score | f32 | 0.0 | Minimum score threshold for keyword filtering (0.0-1.0 for YAKE; RAKE scores are unbounded) |
| ngram_range | (usize, usize) | (1, 3) | N-gram range: (min, max) words per keyword phrase |
| language | Option<String> | Some("en") | Language code for stopword filtering (for example, "en", "de", "fr"); None disables filtering |
| yake_params | Option<YakeParams> | None | YAKE-specific tuning parameters |
| rake_params | Option<RakeParams> | None | RAKE-specific tuning parameters |

Algorithm Comparison

YAKE (Yet Another Keyword Extractor)

Approach: Statistical scoring based on term statistics and co-occurrence patterns.

| Aspect | Details |
|---|---|
| Best For | General-purpose documents, balanced keyword distribution |
| Strengths | No training required, handles rare terms well, language-independent |
| Limitations | May extract very common terms, single-word focus |
| Score Range | 0.0-1.0 (lower scores = more relevant) |
| Tuning | window_size (default: 2) - context window for co-occurrence |
| Use Cases | Research papers, news articles, general text |

Characteristic: YAKE assigns lower scores to more relevant keywords, so use higher min_score to be more selective.

RAKE (Rapid Automatic Keyword Extraction)

Approach: Splits text into candidate phrases at stop words and punctuation, then scores each phrase from a word co-occurrence graph.

| Aspect | Details |
|---|---|
| Best For | Multi-word phrases, domain-specific terminology |
| Strengths | Excellent for extracting multi-word phrases, fast, domain-aware |
| Limitations | Requires good stopword list, less effective with poorly structured text |
| Score Range | 0.0+ (higher scores = more relevant, unbounded) |
| Tuning | min_word_length, max_words_per_phrase |
| Use Cases | Technical documentation, scientific papers, product descriptions |

Characteristic: RAKE assigns higher scores to more relevant keywords, so use lower min_score thresholds.
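
To make the unbounded score range concrete, here is a minimal sketch of the textbook RAKE formulation (phrase score = sum of degree/frequency over its words). It is an illustration under stated assumptions, not Kreuzberg's implementation: the tiny `STOPWORDS` set and both function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Deliberately tiny stop word list, for illustration only.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def candidate_phrases(text: str) -> List[List[str]]:
    """Split text into candidate phrases at punctuation and stop words."""
    phrases: List[List[str]] = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current: List[str] = []
        for word in re.findall(r"[a-z0-9]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text: str) -> Dict[str, float]:
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq: Dict[str, int] = defaultdict(int)
    degree: Dict[str, int] = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}


scores = rake_scores("Keyword extraction is the task of finding key phrases in a document.")
for phrase, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.1f}  {phrase}")
```

Because each word's degree grows with phrase length, longer phrases accumulate larger scores, which is why RAKE scores are not confined to 0.0-1.0.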

N-gram Range Explanation

The ngram_range parameter controls the size of keyword phrases:

ngram_range: (1, 1)  → Single words only: "python", "machine", "learning"
ngram_range: (1, 2)  → 1-2 word phrases: "python", "machine learning", "deep learning"
ngram_range: (1, 3)  → 1-3 word phrases: "python", "machine learning", "deep neural networks"
ngram_range: (2, 3)  → 2-3 word phrases only: "machine learning", "neural networks"

Recommendations:

  • Use (1, 1) for single-word indexing (tagging, classification)
  • Use (1, 2) for balanced coverage of terms and phrases
  • Use (1, 3) for comprehensive phrase extraction (default)
  • Use (2, 3) if you only want multi-word phrases
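
The effect of each range can be sketched by enumerating contiguous word windows. This is only an illustration of the (min, max) semantics; `candidate_ngrams` is a hypothetical helper, and real extractors also apply stopword filtering and scoring before returning keywords.

```python
from typing import List, Tuple


def candidate_ngrams(words: List[str], ngram_range: Tuple[int, int]) -> List[str]:
    """Return every contiguous phrase whose word count lies in [min_n, max_n]."""
    min_n, max_n = ngram_range
    if min_n < 1 or max_n < min_n:
        raise ValueError("ngram_range must satisfy 1 <= min <= max")
    return [
        " ".join(words[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


words = ["deep", "neural", "networks"]
print(candidate_ngrams(words, (1, 1)))  # ['deep', 'neural', 'networks']
print(candidate_ngrams(words, (2, 3)))  # ['deep neural', 'neural networks', 'deep neural networks']
```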

Keyword Output Format

Keywords are returned as a list of Keyword structures in the extraction result:

Keyword Output Structure
{
  "text": "machine learning",
  "score": 0.85,
  "algorithm": "yake",
  "positions": [42, 156, 203]
}

Fields:

  • text: The keyword or phrase text
  • score: Relevance score (algorithm-specific range and meaning)
  • algorithm: Which algorithm extracted this keyword
  • positions: Optional character offsets where the keyword appears in text
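
When consuming this structure, treat positions as optional. The sketch below assumes the keywords arrive as a JSON array of objects shaped like the one above; the `Keyword` dataclass, `parse_keywords`, and the second sample entry are hypothetical and not part of the Kreuzberg API.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Keyword:
    text: str
    score: float
    algorithm: str
    positions: Optional[List[int]] = None  # may be absent in the output


def parse_keywords(payload: str) -> List[Keyword]:
    """Parse a JSON array of keyword objects into typed records."""
    return [Keyword(**item) for item in json.loads(payload)]


# Sample data for illustration only.
payload = """[
  {"text": "machine learning", "score": 0.85, "algorithm": "yake", "positions": [42, 156, 203]},
  {"text": "data", "score": 0.40, "algorithm": "yake"}
]"""

for kw in parse_keywords(payload):
    occurrences = len(kw.positions) if kw.positions is not None else "unknown"
    print(f"{kw.text!r}: score={kw.score}, occurrences={occurrences}")
```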

Example: YAKE Configuration

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Keywords = new KeywordConfig
    {
        Algorithm = KeywordAlgorithm.Yake,
        MaxKeywords = 10,
        MinScore = 0.3,
        NgramRange = (1, 3),
        Language = "en"
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
Go
config := &ExtractionConfig{
    Keywords: &KeywordConfig{
        Algorithm:   KeywordAlgorithm.Yake,
        MaxKeywords: 10,
        MinScore:    0.3,
        NgramRange:  [2]uint32{1, 3},
        Language:    "en",
    },
}
Java
var config = ExtractionConfig.builder()
    .keywords(KeywordConfig.builder()
        .algorithm(KeywordAlgorithm.YAKE)
        .maxKeywords(10)
        .minScore(0.3f)
        .ngramRange(1, 3)
        .language("en")
        .build())
    .build();
Python
from kreuzberg import ExtractionConfig, KeywordConfig, KeywordAlgorithm

config = ExtractionConfig(
    keywords=KeywordConfig(
        algorithm=KeywordAlgorithm.YAKE,
        max_keywords=10,
        min_score=0.3,
        ngram_range=(1, 3),
        language="en"
    )
)
Ruby
require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  keywords: Kreuzberg::KeywordConfig.new(
    algorithm: :yake,
    max_keywords: 10,
    min_score: 0.3,
    ngram_range: [1, 3],
    language: "en"
  )
)
Rust
use kreuzberg::{ExtractionConfig, KeywordConfig, KeywordAlgorithm};

let config = ExtractionConfig {
    keywords: Some(KeywordConfig {
        algorithm: KeywordAlgorithm::Yake,
        max_keywords: 10,
        min_score: 0.3,
        ngram_range: (1, 3),
        language: Some("en".to_string()),
        ..Default::default()
    }),
    ..Default::default()
};
TypeScript
import { ExtractionConfig, KeywordConfig, KeywordAlgorithm } from 'kreuzberg';

const config: ExtractionConfig = {
  keywords: {
    algorithm: KeywordAlgorithm.Yake,
    maxKeywords: 10,
    minScore: 0.3,
    ngramRange: [1, 3],
    language: "en"
  }
};

Example: RAKE Configuration with Multi-word Phrases

Python
from kreuzberg import ExtractionConfig, KeywordConfig, KeywordAlgorithm, RakeParams

config = ExtractionConfig(
    keywords=KeywordConfig(
        algorithm=KeywordAlgorithm.RAKE,
        max_keywords=15,
        min_score=0.1,
        ngram_range=(1, 4),
        language="en",
        rake_params=RakeParams(
            min_word_length=2,
            max_words_per_phrase=4
        )
    )
)
Rust
use kreuzberg::{ExtractionConfig, KeywordConfig, KeywordAlgorithm, RakeParams};

let config = ExtractionConfig {
    keywords: Some(KeywordConfig {
        algorithm: KeywordAlgorithm::Rake,
        max_keywords: 15,
        min_score: 0.1,
        ngram_range: (1, 4),
        language: Some("en".to_string()),
        rake_params: Some(RakeParams {
            min_word_length: 2,
            max_words_per_phrase: 4,
        }),
        ..Default::default()
    }),
    ..Default::default()
};

Language Support

Stopword filtering is applied when a language is specified. Common supported languages:

  • en - English
  • es - Spanish
  • fr - French
  • de - German
  • pt - Portuguese
  • it - Italian
  • ru - Russian
  • ja - Japanese
  • zh - Chinese
  • ar - Arabic

Set language: None to disable stopword filtering and extract keywords in any language without filtering.


PdfConfig

PDF-specific extraction configuration.

| Field | Type | Default | Description |
|---|---|---|---|
| extract_images | bool | false | Extract embedded images from PDF pages |
| extract_metadata | bool | true | Extract PDF metadata (title, author, creation date, etc.) |
| passwords | list[str]? | None | List of passwords to try for encrypted PDFs (tries in order) |
| hierarchy | HierarchyConfig? | None | Hierarchy extraction configuration (None = hierarchy extraction disabled) |
| allow_single_column_tables (v4.5.0) | bool | false | Relax the minimum column count from 2-3 to 1, allowing single-column table extraction |

Bounding boxes require explicit opt-in

Element bounding box coordinates are not extracted by default. To enable them, set pdf_options=PdfConfig(hierarchy=HierarchyConfig(enabled=True, include_bbox=True)). Coordinates are currently only available for text elements (headings and body blocks); table and image regions do not carry per-element bbox data from this path.

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        ExtractImages = true,
        ExtractMetadata = true,
        Passwords = new List<string> { "password1", "password2" },
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 6,
            IncludeBbox = true,
            OcrCoverageThreshold = 0.5f
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    pw := []string{"password1", "password2"}
    result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            ExtractImages:   kreuzberg.BoolPtr(true),
            ExtractMetadata: kreuzberg.BoolPtr(true),
            Passwords:       pw,
            Hierarchy:       &kreuzberg.HierarchyConfig{},
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.PdfConfig;
import dev.kreuzberg.config.HierarchyConfig;
import java.util.Arrays;

ExtractionConfig config = ExtractionConfig.builder()
    .pdfOptions(PdfConfig.builder()
        .extractImages(true)
        .extractMetadata(true)
        .passwords(Arrays.asList("password1", "password2"))
        .hierarchyConfig(HierarchyConfig.builder().build())
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, PdfConfig, HierarchyConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        pdf_options=PdfConfig(
            extract_images=True,
            extract_metadata=True,
            passwords=["password1", "password2"],
            hierarchy=HierarchyConfig(enabled=True, k_clusters=6)
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  pdf_options: Kreuzberg::Config::PDF.new(
    extract_images: true,
    extract_metadata: true,
    passwords: ['password1', 'password2'],
    hierarchy: Kreuzberg::Config::Hierarchy.new(
      enabled: true,
      k_clusters: 6,
      include_bbox: true
    )
  )
)
R
library(kreuzberg)

config <- extraction_config(
  pdf_options = list(extract_tables = TRUE)
)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("Tables extracted: %d\n", length(result$tables)))
cat(sprintf("Total elements: %d\n", length(result$elements)))
cat(sprintf("Content preview: %.50s...\n", result$content))
Rust
use kreuzberg::{ExtractionConfig, PdfConfig};

fn main() {
    let config = ExtractionConfig {
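        pdf_options: Some(PdfConfig {
            // Plain bools, matching the field table above.
            extract_images: true,
            extract_metadata: true,
            passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
            // Leave hierarchy and any remaining fields at their defaults.
            ..Default::default()
        }),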
        ..Default::default()
    };
    println!("{:?}", config.pdf_options);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    pdfOptions: {
        extractImages: true,
        extractMetadata: true,
        passwords: ['password1', 'password2'],
        hierarchy: { enabled: true, kClusters: 6, includeBbox: true }
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

HierarchyConfig

PDF document hierarchy extraction configuration for semantic text structure analysis.

Overview

HierarchyConfig enables automatic extraction of document hierarchy levels (H1-H6) from PDF text by analyzing font size patterns. This is particularly useful for:

  • Building semantic document representations for RAG (Retrieval Augmented Generation) systems
  • Automatic table of contents extraction
  • Document structure understanding and analysis
  • Content organization and outlining

The hierarchy detection works by:

  1. Extracting text blocks with font size metadata from the PDF
  2. Performing K-means clustering on font sizes to identify distinct size groups
  3. Mapping clusters to heading levels (h1-h6) and body text
  4. Merging adjacent blocks with the same hierarchy level
  5. Optionally including bounding box information for spatial awareness

Fields

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable hierarchy extraction |
| k_clusters | usize | 6 | Number of font size clusters (1-7). The default of 6 yields h1-h5 plus body text |
| include_bbox | bool | true | Include bounding box coordinates in output |
| ocr_coverage_threshold | Option<f32> | None | Smart OCR triggering threshold (0.0-1.0). Triggers OCR if text blocks cover less than this fraction of the page |

How It Works

Font Size Extraction

Text blocks are extracted from PDFs with their precise font sizes. This metadata is preserved for analysis.

K-means Clustering

The font sizes are clustered using the K-means algorithm with the specified number of clusters. Each cluster represents a distinct text hierarchy level, from the largest fonts (headings) to the smallest (body text).

Cluster-to-Level Mapping:

  • For k_clusters=6 (recommended): Creates 6 clusters → h1 (largest), h2, h3, h4, h5, body (smallest)
  • For k_clusters=3: Fast mode with just h1, h3, body (minimal detail)
  • For k_clusters=7: Maximum detail separating h1-h6 with distinct body text
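
The clustering and mapping steps can be sketched with a plain one-dimensional K-means. This is a simplified illustration, not Kreuzberg's implementation: the deterministic initialisation and the function names are assumptions, and the consecutive naming used here (h1, h2, ..., body) does not reproduce the h1/h3/body labels listed above for k_clusters=3.

```python
from typing import Dict, List


def kmeans_1d(values: List[float], k: int, iterations: int = 50) -> List[float]:
    """Plain 1-D k-means (Lloyd's algorithm). Returns centroids, largest first."""
    distinct = sorted(set(values))
    k = max(1, min(k, len(distinct)))  # no more clusters than distinct sizes
    if k == 1:
        return [sum(values) / len(values)]
    # Deterministic start: spread the initial centroids across the distinct sizes.
    step = (len(distinct) - 1) / (k - 1)
    centroids: List[float] = [distinct[round(i * step)] for i in range(k)]
    for _ in range(iterations):
        groups: List[List[float]] = [[] for _ in centroids]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centroids[i]))
            groups[nearest].append(value)
        # An empty cluster keeps its previous centroid.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, reverse=True)


def assign_levels(font_sizes: List[float], k_clusters: int) -> Dict[float, str]:
    """Map each distinct font size to a level name via its nearest centroid."""
    centroids = kmeans_1d(font_sizes, k_clusters)
    # Simplified naming: largest cluster is h1, the smallest is body.
    names = [f"h{i + 1}" for i in range(len(centroids) - 1)] + ["body"]
    return {
        size: names[min(range(len(centroids)), key=lambda i: abs(size - centroids[i]))]
        for size in sorted(set(font_sizes), reverse=True)
    }


sizes = [24, 24, 18, 18, 12, 12, 12, 12.5, 11.5]
print(assign_levels(sizes, k_clusters=3))
```

Note how the near-identical sizes 11.5, 12 and 12.5 collapse into one cluster: clustering, rather than exact matching, is what absorbs small font size variations.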

Block Merging

Adjacent blocks with the same hierarchy level are merged to create logical content units. This merge process considers:

  • Spatial proximity (vertical and horizontal distance)
  • Bounding box overlap ratio
  • Text flow direction
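
A simplified sketch of the merge step follows. It considers only the hierarchy level and the vertical gap between consecutive blocks; overlap ratio and text flow direction are left out. The `Block` type, the `max_gap` parameter, and the downward-growing y axis are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    level: str
    top: float     # y of the upper edge; y grows downward in this sketch
    bottom: float  # y of the lower edge


def merge_adjacent(blocks: List[Block], max_gap: float = 6.0) -> List[Block]:
    """Merge consecutive blocks that share a level and sit close together vertically."""
    merged: List[Block] = []
    for block in blocks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.level == block.level
            and block.top - prev.bottom <= max_gap
        ):
            # Extend the previous block instead of starting a new one.
            merged[-1] = Block(
                text=f"{prev.text} {block.text}",
                level=prev.level,
                top=prev.top,
                bottom=max(prev.bottom, block.bottom),
            )
        else:
            merged.append(block)
    return merged


blocks = [
    Block("Intro", "h1", 0, 20),
    Block("First line", "body", 30, 42),
    Block("second line", "body", 44, 56),
    Block("Far away paragraph", "body", 120, 132),
]
for b in merge_adjacent(blocks):
    print(f"[{b.level}] {b.text}")
```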

Output Structure

Each extracted block contains:

  • Text content
  • Font size (in points)
  • Hierarchy level (h1-h6 or body)
  • Optional bounding box (left, top, right, bottom in PDF units)

Use Cases

Semantic Document Understanding

Extract hierarchical structure for understanding document semantics and building knowledge graphs:

H1: Document Title
  H2: Section 1
    H3: Subsection 1.1
      Body text...
    H3: Subsection 1.2
      Body text...
  H2: Section 2
    H3: Subsection 2.1

Automatic Table of Contents Generation

Build a dynamic table of contents from extracted hierarchy levels (h1-h3) for document navigation.
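
One way to do this, sketched on plain (level, text) pairs rather than on Kreuzberg's result objects (whose exact shape is not assumed here); `build_toc` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def build_toc(blocks: Iterable[Tuple[str, str]], max_depth: int = 3) -> List[str]:
    """Turn (level, text) pairs into indented table-of-contents lines."""
    lines: List[str] = []
    for level, text in blocks:
        if not (level.startswith("h") and level[1:].isdigit()):
            continue  # skip body text
        depth = int(level[1:])
        if depth <= max_depth:
            lines.append("  " * (depth - 1) + text)
    return lines


blocks = [
    ("h1", "Document Title"),
    ("h2", "Section 1"),
    ("body", "Body text..."),
    ("h3", "Subsection 1.1"),
    ("h4", "Detail"),
    ("h2", "Section 2"),
]
print("\n".join(build_toc(blocks)))
```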

RAG System Optimization

Use hierarchy information to improve context retrieval by chunking at appropriate heading boundaries rather than arbitrary character counts. This preserves semantic relationships.
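
A minimal sketch of heading-boundary chunking, again on plain (level, text) pairs. The choice of h1 and h2 as split points and the `chunk_by_heading` name are assumptions for illustration; a production chunker would typically also enforce a maximum chunk size.

```python
from typing import Iterable, List, Tuple


def chunk_by_heading(
    blocks: Iterable[Tuple[str, str]],
    split_levels: Tuple[str, ...] = ("h1", "h2"),
) -> List[dict]:
    """Start a new chunk at each heading in split_levels; attach other blocks to it."""
    chunks: List[dict] = []
    for level, text in blocks:
        if level in split_levels:
            chunks.append({"heading": text, "parts": []})
        else:
            if not chunks:
                # Text before the first heading gets its own untitled chunk.
                chunks.append({"heading": None, "parts": []})
            chunks[-1]["parts"].append(text)
    return chunks


blocks = [
    ("body", "Preamble."),
    ("h1", "Title"),
    ("body", "Intro."),
    ("h2", "Section 1"),
    ("h3", "Subsection 1.1"),
    ("body", "Details."),
]
for chunk in chunk_by_heading(blocks):
    print(chunk["heading"], "->", " ".join(chunk["parts"]))
```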

Document Analysis

Extract and analyze document structure programmatically for compliance checking, content validation, or metadata extraction.

Configuration Examples

Basic Hierarchy Extraction

basic_hierarchy.cs
using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

// Access hierarchy from pages
if (result.Pages != null)
{
    foreach (var page in result.Pages)
    {
        if (page.Hierarchy != null)
        {
            Console.WriteLine($"Page {page.PageNumber}: {page.Hierarchy.BlockCount} blocks");
            foreach (var block in page.Hierarchy.Blocks)
            {
                // Guard against blocks shorter than 50 characters.
                Console.WriteLine($"  [{block.Level}] {block.Text[..Math.Min(50, block.Text.Length)]}...");
            }
        }
    }
}
basic_hierarchy.go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    config := &kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            Hierarchy: &kreuzberg.HierarchyConfig{
                Enabled: true,
            },
        },
    }

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        panic(err)
    }

    if result.Pages != nil {
        for _, page := range result.Pages {
            if page.Hierarchy != nil {
                fmt.Printf("Page %d: %d blocks\n", page.PageNumber, page.Hierarchy.BlockCount)
                for _, block := range page.Hierarchy.Blocks {
                    // %.50s truncates to at most 50 characters and is safe for shorter text.
                    fmt.Printf("  [%s] %.50s...\n", block.Level, block.Text)
                }
            }
        }
    }
}
BasicHierarchy.java
import com.kreuzberg.*;

public class BasicHierarchy {
    public static void main(String[] args) throws Exception {
        ExtractionConfig config = ExtractionConfig.builder()
            .pdfOptions(PdfConfig.builder()
                .hierarchy(HierarchyConfig.builder()
                    .enabled(true)
                    .build())
                .build())
            .build();

        ExtractionResult result = KreuzbergClient.extractFileSync("document.pdf", config);

        if (result.getPages() != null) {
            for (PageContent page : result.getPages()) {
                if (page.getHierarchy() != null) {
                    System.out.println("Page " + page.getPageNumber() + ": " +
                        page.getHierarchy().getBlockCount() + " blocks");
                    for (HierarchicalBlock block : page.getHierarchy().getBlocks()) {
                        // Guard against blocks shorter than 50 characters.
                        String text = block.getText();
                        System.out.println("  [" + block.getLevel() + "] " +
                            text.substring(0, Math.min(50, text.length())) + "...");
                    }
                }
            }
        }
    }
}
Python
from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig

config: ExtractionConfig = ExtractionConfig(
    pdf_options=PdfConfig(
        extract_metadata=True,
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=6,
            include_bbox=True,
            ocr_coverage_threshold=0.8
        )
    )
)

result = extract_file_sync("document.pdf", config=config)

# Iterate over per-page results
for page in result.pages or []:
    print(f"Page {page.page_number}:")
    print(f"  Content: {page.content[:100]}...")
basic_hierarchy.rb
require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    hierarchy: Kreuzberg::HierarchyConfig.new(
      enabled: true
    )
  )
)

result = Kreuzberg.extract_file_sync("document.pdf", config: config)

if result.pages
  result.pages.each do |page|
    if page.hierarchy
      puts "Page #{page.page_number}: #{page.hierarchy.block_count} blocks"
      page.hierarchy.blocks.each do |block|
        puts "  [#{block.level}] #{block.text[0..49]}..."
      end
    end
  end
end
Rust
use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                // Fields as documented in the HierarchyConfig table above.
                enabled: true,
                k_clusters: 6,
                include_bbox: true,
                ocr_coverage_threshold: Some(0.8),
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None::<&str>, &config)?;
    println!("Content length: {}", result.content.len());
    Ok(())
}
basic_hierarchy.ts
import { extractFileSync, ExtractionConfig, PdfConfig, HierarchyConfig } from 'kreuzberg';

const config: ExtractionConfig = {
    pdfOptions: {
        hierarchy: {
            enabled: true
        }
    }
};

const result = extractFileSync("document.pdf", config);

if (result.pages) {
    for (const page of result.pages) {
        if (page.hierarchy) {
            console.log(`Page ${page.pageNumber}: ${page.hierarchy.blockCount} blocks`);
            for (const block of page.hierarchy.blocks) {
                console.log(`  [${block.level}] ${block.text.substring(0, 50)}...`);
            }
        }
    }
}

Custom K-Clusters Configuration

Configure clustering granularity for different hierarchy detail levels:

custom_k_clusters.cs
using Kreuzberg;

// Fast mode: 3 clusters (h1, h3, body) - minimal detail
var fastConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 3  // Fast, identifies main structure only
        }
    }
};

// Balanced mode: 6 clusters (h1-h5 + body) - default, recommended
var balancedConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 6  // Balanced detail
        }
    }
};

// Detailed mode: 7 clusters (h1-h6 + distinct body) - maximum detail
var detailedConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 7  // Maximum detail with body text separation
        }
    }
};

HierarchyConfig

PDF document hierarchy extraction configuration for semantic text structure analysis.

Overview

HierarchyConfig enables automatic extraction of document hierarchy levels (H1-H6) from PDF text by analyzing font size patterns. This is particularly useful for:

  • Building semantic document representations for RAG (Retrieval Augmented Generation) systems
  • Automatic table of contents extraction
  • Document structure understanding and analysis
  • Content organization and outlining

The hierarchy detection works by:

  1. Extracting text blocks with font size metadata from the PDF
  2. Performing K-means clustering on font sizes to identify distinct size groups
  3. Mapping clusters to heading levels (h1-h6) and body text
  4. Merging adjacent blocks with the same hierarchy level
  5. Optionally including bounding box information for spatial awareness

Fields

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable hierarchy extraction |
| k_clusters | usize | 6 | Number of font size clusters (1-7). The default of 6 yields h1-h5 plus body text |
| include_bbox | bool | true | Include bounding box coordinates in output |
| ocr_coverage_threshold | Option<f32> | None | Smart OCR triggering threshold (0.0-1.0). Triggers OCR when text blocks cover less than this fraction of the page |

How It Works

Font Size Extraction

Text blocks are extracted from PDFs with their precise font sizes. This metadata is preserved for analysis.

K-means Clustering

The font sizes are clustered using the K-means algorithm with the specified number of clusters. Each cluster represents a distinct text hierarchy level, from the largest fonts (top-level headings) down to the smallest (body text).

Cluster-to-Level Mapping:

  • For k_clusters=6 (recommended): Creates 6 clusters → h1 (largest), h2, h3, h4, h5, body (smallest)
  • For k_clusters=3: Fast mode with just h1, h3, body (minimal detail)
  • For k_clusters=7: Maximum detail separating h1-h6 with distinct body text
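
The clustering and mapping steps can be sketched in a few lines of Python. This is an illustrative sketch only, not Kreuzberg's implementation: the helper names `kmeans_1d` and `assign_levels` are hypothetical, centroids are assumed to start evenly spaced between the smallest and largest font size, and clusters are simply named h1, h2, ... with the smallest cluster as body (so a 3-cluster run here yields h1, h2, body rather than the h1, h3, body naming listed above).

```python
def kmeans_1d(values, k, iterations=50):
    """Plain 1-D k-means (Lloyd's algorithm) with evenly spaced initial centroids."""
    if not values or k < 1:
        raise ValueError("need at least one value and k >= 1")
    lo, hi = min(values), max(values)
    if k == 1 or lo == hi:
        centroids = [sum(values) / len(values)]
    else:
        centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Empty clusters keep their previous centroid
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def assign_levels(font_sizes, k):
    """Label each font size with a level: largest cluster is h1, smallest is body."""
    centroids = sorted(kmeans_1d(font_sizes, k), reverse=True)
    names = [f"h{i + 1}" for i in range(len(centroids) - 1)] + ["body"]

    def level(size):
        idx = min(range(len(centroids)), key=lambda i: abs(size - centroids[i]))
        return names[idx]

    return [level(size) for size in font_sizes]


sizes = [24.0, 18.0, 18.0, 12.0, 12.0, 12.0]
print(assign_levels(sizes, 3))
```

A fixed, deterministic initialization keeps the sketch reproducible; a production clusterer would more likely use a randomized or k-means++ style initialization.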

Block Merging

Adjacent blocks with the same hierarchy level are merged to create logical content units. This merge process considers:

  • Spatial proximity (vertical and horizontal distance)
  • Bounding box overlap ratio
  • Text flow direction
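
As a rough illustration of the merge step, the sketch below joins consecutive blocks that share a level and sit close together vertically. It covers only the vertical-proximity criterion; overlap ratio and text flow direction are left out. The `Block` type, the `merge_adjacent` helper, and the `max_gap` value of 6 points are assumptions made for this example, not Kreuzberg's actual types or thresholds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    text: str
    level: str
    bbox: Tuple[float, float, float, float]  # (left, top, right, bottom), y grows downward


def merge_adjacent(blocks: List[Block], max_gap: float = 6.0) -> List[Block]:
    """Merge consecutive blocks with the same level whose vertical gap is at most max_gap."""
    merged: List[Block] = []
    for block in blocks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.level == block.level
            and block.bbox[1] - prev.bbox[3] <= max_gap
        ):
            # Join the text and grow the bounding box to cover both blocks
            merged[-1] = Block(
                text=f"{prev.text} {block.text}",
                level=prev.level,
                bbox=(
                    min(prev.bbox[0], block.bbox[0]),
                    min(prev.bbox[1], block.bbox[1]),
                    max(prev.bbox[2], block.bbox[2]),
                    max(prev.bbox[3], block.bbox[3]),
                ),
            )
        else:
            merged.append(block)
    return merged


blocks = [
    Block("Introduction", "h2", (50.0, 150.0, 300.0, 175.0)),
    Block("First line", "body", (50.0, 200.0, 500.0, 212.0)),
    Block("second line", "body", (50.0, 214.0, 480.0, 226.0)),
    Block("Far away", "body", (50.0, 400.0, 500.0, 412.0)),
]
for b in merge_adjacent(blocks):
    print(b.level, b.text, b.bbox)
```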

Output Structure

Each extracted block contains:

  • Text content
  • Font size (in points)
  • Hierarchy level (h1-h6 or body)
  • Optional bounding box (left, top, right, bottom in PDF units)

Use Cases

Semantic Document Understanding

Extract hierarchical structure for understanding document semantics and building knowledge graphs:

H1: Document Title
  H2: Section 1
    H3: Subsection 1.1
      Body text...
    H3: Subsection 1.2
      Body text...
  H2: Section 2
    H3: Subsection 2.1

Automatic Table of Contents Generation

Build a dynamic table of contents from the extracted heading levels (for example, h1-h3) for document navigation.

RAG System Optimization

Use hierarchy information to improve context retrieval by chunking at appropriate heading boundaries rather than arbitrary character counts. This preserves semantic relationships.
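
A minimal sketch of heading-boundary chunking follows. It assumes the hierarchy blocks have already been flattened into `(level, text)` tuples in reading order; the `chunk_by_headings` helper and its `split_levels` parameter are hypothetical names introduced for illustration.

```python
def chunk_by_headings(blocks, split_levels=("h1", "h2")):
    """Start a new chunk at every heading whose level is in split_levels.

    Deeper headings and body text are appended to the current chunk.
    """
    chunks = []
    current = None
    for level, text in blocks:
        if level in split_levels:
            current = {"heading": text, "level": level, "body": []}
            chunks.append(current)
        else:
            if current is None:
                # Text that appears before the first heading gets its own chunk
                current = {"heading": None, "level": None, "body": []}
                chunks.append(current)
            current["body"].append(text)
    return chunks


blocks = [
    ("h1", "Document Title"),
    ("h2", "Introduction"),
    ("body", "Intro text."),
    ("h3", "Background"),
    ("body", "More text."),
    ("h2", "Key Findings"),
    ("body", "Findings text."),
]
for chunk in chunk_by_headings(blocks):
    print(chunk["heading"], "->", " ".join(chunk["body"]))
```

Splitting only at h1 and h2 keeps subsections together with their parent section; widening `split_levels` produces smaller chunks.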

Document Analysis

Extract and analyze document structure programmatically for compliance checking, content validation, or metadata extraction.

Configuration Examples

Basic Hierarchy Extraction

basic_hierarchy.cs
using Kreuzberg;

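var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true
        }
    },
    // Hierarchy data is only populated when pages are extracted
    Pages = new PageConfig
    {
        ExtractPages = true
    }
};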

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

// Access hierarchy from pages
if (result.Pages != null)
{
    foreach (var page in result.Pages)
    {
        if (page.Hierarchy != null)
        {
            Console.WriteLine($"Page {page.PageNumber}: {page.Hierarchy.BlockCount} blocks");
            foreach (var block in page.Hierarchy.Blocks)
            {
                // Truncate safely: Substring(0, 50) throws on blocks shorter than 50 characters
                Console.WriteLine($"  [{block.Level}] {block.Text[..Math.Min(50, block.Text.Length)]}...");
            }
        }
    }
}
basic_hierarchy.go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
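    config := &kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            Hierarchy: &kreuzberg.HierarchyConfig{
                Enabled: true,
            },
        },
        // Hierarchy data is only populated when pages are extracted
        Pages: &kreuzberg.PageConfig{
            ExtractPages: true,
        },
    }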

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        panic(err)
    }

    if result.Pages != nil {
        for _, page := range result.Pages {
            if page.Hierarchy != nil {
                fmt.Printf("Page %d: %d blocks\n", page.PageNumber, page.Hierarchy.BlockCount)
                for _, block := range page.Hierarchy.Blocks {
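                    // Truncate safely: slicing [:50] panics on blocks shorter than 50 bytes
                    preview := block.Text
                    if len(preview) > 50 {
                        preview = preview[:50]
                    }
                    fmt.Printf("  [%s] %s...\n", block.Level, preview)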
                }
            }
        }
    }
}
BasicHierarchy.java
import com.kreuzberg.*;

public class BasicHierarchy {
    public static void main(String[] args) throws Exception {
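        ExtractionConfig config = ExtractionConfig.builder()
            .pdfOptions(PdfConfig.builder()
                .hierarchy(HierarchyConfig.builder()
                    .enabled(true)
                    .build())
                .build())
            // Hierarchy data is only populated when pages are extracted
            .pages(PageConfig.builder()
                .extractPages(true)
                .build())
            .build();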

        ExtractionResult result = KreuzbergClient.extractFileSync("document.pdf", config);

        if (result.getPages() != null) {
            for (PageContent page : result.getPages()) {
                if (page.getHierarchy() != null) {
                    System.out.println("Page " + page.getPageNumber() + ": " +
                        page.getHierarchy().getBlockCount() + " blocks");
                    for (HierarchicalBlock block : page.getHierarchy().getBlocks()) {
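                        // Truncate safely: substring(0, 50) throws on blocks shorter than 50 characters
                        String text = block.getText();
                        System.out.println("  [" + block.getLevel() + "] " +
                            text.substring(0, Math.min(50, text.length())) + "...");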
                    }
                }
            }
        }
    }
}
Python
from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig, PageConfig

config: ExtractionConfig = ExtractionConfig(
    pdf_options=PdfConfig(
        extract_metadata=True,
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=6,
            include_bbox=True,
            ocr_coverage_threshold=0.8
        )
    ),
    # Hierarchy data is only populated when pages are extracted
    pages=PageConfig(extract_pages=True)
)

result = extract_file_sync("document.pdf", config=config)

# Access hierarchy information
for page in result.pages or []:
    if page.hierarchy:
        print(f"Page {page.page_number}: {page.hierarchy.block_count} blocks")
        for block in page.hierarchy.blocks:
            print(f"  [{block.level}] {block.text[:50]}...")
basic_hierarchy.rb
require 'kreuzberg'

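config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    hierarchy: Kreuzberg::HierarchyConfig.new(
      enabled: true
    )
  ),
  # Hierarchy data is only populated when pages are extracted
  pages: Kreuzberg::PageConfig.new(extract_pages: true)
)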

result = Kreuzberg.extract_file_sync("document.pdf", config: config)

if result.pages
  result.pages.each do |page|
    if page.hierarchy
      puts "Page #{page.page_number}: #{page.hierarchy.block_count} blocks"
      page.hierarchy.blocks.each do |block|
        puts "  [#{block.level}] #{block.text[0..49]}..."
      end
    end
  end
end
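Rust
use kreuzberg::{extract_file_sync, ExtractionConfig, HierarchyConfig, PageConfig, PdfConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                enabled: true,
                k_clusters: 6,
                include_bbox: true,
                ocr_coverage_threshold: Some(0.8),
                ..Default::default()
            }),
            ..Default::default()
        }),
        // Hierarchy data is only populated when pages are extracted
        pages: Some(PageConfig {
            extract_pages: true,
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None::<&str>, &config)?;

    // Access hierarchy from pages
    for page in result.pages.iter().flat_map(|p| p.iter()) {
        if let Some(hierarchy) = &page.hierarchy {
            println!("Page {}: {} blocks", page.page_number, hierarchy.block_count);
        }
    }
    Ok(())
}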
basic_hierarchy.ts
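import { extractFileSync, type ExtractionConfig } from '@kreuzberg/node';

const config: ExtractionConfig = {
    pdfOptions: {
        hierarchy: {
            enabled: true
        }
    },
    // Hierarchy data is only populated when pages are extracted
    pages: {
        extractPages: true
    }
};

// The second argument is the MIME type; null lets it be detected
const result = extractFileSync("document.pdf", null, config);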

if (result.pages) {
    for (const page of result.pages) {
        if (page.hierarchy) {
            console.log(`Page ${page.pageNumber}: ${page.hierarchy.blockCount} blocks`);
            for (const block of page.hierarchy.blocks) {
                console.log(`  [${block.level}] ${block.text.substring(0, 50)}...`);
            }
        }
    }
}

Custom K-Clusters Configuration

Configure clustering granularity for different hierarchy detail levels:

custom_k_clusters.cs
using Kreuzberg;

// Fast mode: 3 clusters (h1, h3, body) - minimal detail
var fastConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 3  // Fast, identifies main structure only
        }
    }
};

// Balanced mode: 6 clusters (h1-h5 + body) - default, recommended
var balancedConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 6  // Balanced detail
        }
    }
};

// Detailed mode: 7 clusters (h1-h6 + distinct body) - maximum detail
var detailedConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 7  // Maximum detail with body text separation
        }
    }
};
custom_k_clusters.py
from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig

# Fast mode: 3 clusters
fast_config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=3  # Fast, identifies main structure only
        )
    )
)

# Balanced mode: 6 clusters (recommended)
balanced_config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=6  # Balanced detail
        )
    )
)

# Detailed mode: 7 clusters
detailed_config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=7  # Maximum detail with body text separation
        )
    )
)

result = extract_file_sync("document.pdf", config=balanced_config)
custom_k_clusters.rs
use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig};

fn main() -> kreuzberg::Result<()> {
    // Fast mode: 3 clusters
    let fast_config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                k_clusters: 3,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    // Balanced mode: 6 clusters (recommended)
    let balanced_config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                k_clusters: 6,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    // Detailed mode: 7 clusters
    let detailed_config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                k_clusters: 7,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None::<&str>, &balanced_config)?;
    Ok(())
}

OCR Coverage Threshold

Smart OCR triggering based on text coverage:

ocr_coverage_threshold.cs
using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            OcrCoverageThreshold = 0.5f  // Trigger OCR if <50% of page has text
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
ocr_coverage_threshold.py
from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(
            enabled=True,
            ocr_coverage_threshold=0.5  # Trigger OCR if <50% of page has text
        )
    )
)

result = extract_file_sync("document.pdf", config=config)
ocr_coverage_threshold.rs
use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                ocr_coverage_threshold: Some(0.5),
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None::<&str>, &config)?;
    Ok(())
}

Disabling Bounding Boxes

Reduce output size by excluding spatial information:

no_bbox.cs
using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            IncludeBbox = false  // Exclude bounding boxes
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
no_bbox.py
from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(
            enabled=True,
            include_bbox=False  # Exclude bounding boxes
        )
    )
)

result = extract_file_sync("document.pdf", config=config)

Performance Tuning

K-clusters Selection

Choose k_clusters based on your performance vs. detail requirements:

| Setting | Speed | Detail | Best For |
|---|---|---|---|
| k_clusters=3 | Very fast | Minimal (h1, h3, body) | Quick document structure identification, real-time processing |
| k_clusters=6 | Balanced | Standard (h1-h5, body) | General purpose, RAG systems, recommended default |
| k_clusters=7 | Moderate | Detailed (h1-h6, body) | Fine-grained content analysis, content organization |

Bounding Box Optimization

Include bounding boxes (include_bbox=true, default) when:

  • Building visually-aware document processors
  • Correlating text with its position in the document
  • Processing layout-sensitive documents (brochures, forms)

Exclude bounding boxes (include_bbox=false) when:

  • Output size must be minimized for network transmission
  • Bandwidth is constrained
  • Spatial information is not needed

Excluding bounding boxes typically reduces output size by 10-15%.

OCR Integration

The ocr_coverage_threshold parameter enables smart OCR triggering:

if (text_block_coverage < ocr_coverage_threshold) {
    run_ocr()  // Trigger OCR on pages with insufficient text coverage
}

Common Scenarios:

  • ocr_coverage_threshold=0.5: Trigger OCR on pages where text blocks cover less than 50% of the page (typical of scanned or image-heavy pages)
  • ocr_coverage_threshold=0.8: Trigger OCR more aggressively, on any page where text blocks cover less than 80% of the page
  • ocr_coverage_threshold=None: Disable smart OCR triggering and rely on the force_ocr flag
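
The threshold check can be made concrete with a small sketch. How Kreuzberg measures coverage internally is not specified here, so this example assumes coverage is the summed area of the text block bounding boxes divided by the page area, with overlaps ignored. The helper names `text_coverage` and `should_run_ocr` are hypothetical.

```python
def text_coverage(block_bboxes, page_width, page_height):
    """Fraction of the page area covered by text block bounding boxes.

    Each bbox is (left, top, right, bottom) in the same units as the page size.
    Overlapping boxes are counted twice, so the result is capped at 1.0.
    """
    page_area = page_width * page_height
    if page_area <= 0:
        return 0.0
    covered = sum(
        max(0.0, right - left) * max(0.0, bottom - top)
        for left, top, right, bottom in block_bboxes
    )
    return min(1.0, covered / page_area)


def should_run_ocr(coverage, threshold):
    """Mirror the rule above: OCR runs only when a threshold is set and coverage is below it."""
    return threshold is not None and coverage < threshold


coverage = text_coverage([(0.0, 0.0, 600.0, 200.0)], page_width=600.0, page_height=800.0)
print(coverage)                        # fraction of the page covered by text
print(should_run_ocr(coverage, 0.5))   # threshold set: compare against it
print(should_run_ocr(coverage, None))  # no threshold: smart triggering disabled
```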

Output Format

PageHierarchy Structure

The extracted hierarchy is returned in PageContent.hierarchy when pages are extracted:

PageHierarchy Output Structure
{
  "block_count": 4,
  "blocks": [
    {
      "text": "Document Title",
      "font_size": 24.0,
      "level": "h1",
      "bbox": [50.0, 100.0, 500.0, 130.0]
    },
    {
      "text": "Introduction",
      "font_size": 18.0,
      "level": "h2",
      "bbox": [50.0, 150.0, 300.0, 175.0]
    },
    {
      "text": "This is the introductory paragraph with standard body text content.",
      "font_size": 12.0,
      "level": "body",
      "bbox": [50.0, 200.0, 500.0, 250.0]
    },
    {
      "text": "Key Findings",
      "font_size": 18.0,
      "level": "h2",
      "bbox": [50.0, 280.0, 300.0, 305.0]
    }
  ]
}

Field Meanings

  • block_count: Total number of hierarchical blocks on the page
  • blocks: Array of hierarchical blocks
  • text: The text content of the block
  • font_size: Font size in points (useful for verification and styling)
  • level: Hierarchy level - "h1" through "h6" for headings, "body" for body text
  • bbox: Optional bounding box as [left, top, right, bottom] in PDF units (points). Only present when include_bbox=true
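
When the hierarchy arrives as JSON (for example from a serialized result), it can be loaded into typed objects. The sketch below follows the field names listed above; the `HierarchicalBlock` dataclass and the `parse_page_hierarchy` helper are illustrative definitions written for this example, not part of the Kreuzberg API.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HierarchicalBlock:
    text: str
    font_size: float
    level: str  # "h1" through "h6", or "body"
    bbox: Optional[Tuple[float, float, float, float]] = None  # absent when include_bbox=false


def parse_page_hierarchy(raw: str) -> List[HierarchicalBlock]:
    """Parse a PageHierarchy JSON document into a list of blocks."""
    data = json.loads(raw)
    blocks = []
    for item in data["blocks"]:
        bbox = item.get("bbox")
        blocks.append(
            HierarchicalBlock(
                text=item["text"],
                font_size=float(item["font_size"]),
                level=item["level"],
                bbox=tuple(bbox) if bbox is not None else None,
            )
        )
    return blocks


raw = """
{
  "block_count": 2,
  "blocks": [
    {"text": "Document Title", "font_size": 24.0, "level": "h1",
     "bbox": [50.0, 100.0, 500.0, 130.0]},
    {"text": "Body text.", "font_size": 12.0, "level": "body"}
  ]
}
"""
for block in parse_page_hierarchy(raw):
    print(block.level, block.text, block.bbox)
```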

Accessing Hierarchy in Code

Python
result = extract_file_sync("document.pdf", config=config)

for page in result.pages or []:
    if page.hierarchy:
        # Get all h1 headings
        h1_blocks = [b for b in page.hierarchy.blocks if b.level == "h1"]

        # Get all heading levels (h1-h6)
        headings = [b for b in page.hierarchy.blocks if b.level.startswith("h")]

        # Build outline with hierarchy
        for block in page.hierarchy.blocks:
            indent = int(block.level[1]) if block.level.startswith("h") else 0
            print("  " * indent + block.text)
Rust
for page in result.pages.iter().flat_map(|p| p.iter()) {
    if let Some(hierarchy) = &page.hierarchy {
        // Get all h1 headings
        let h1_blocks: Vec<_> = hierarchy.blocks
            .iter()
            .filter(|b| b.level == "h1")
            .collect();

        // Build outline
        for block in &hierarchy.blocks {
            let level = if block.level.starts_with('h') {
                block.level[1..].parse::<usize>().unwrap_or(0)
            } else {
                0
            };
            println!("{}{}", "  ".repeat(level), block.text);
        }
    }
}

Best Practices

  1. Always enable page extraction when using hierarchy, because hierarchy data is only populated when pages are extracted:
pages = PageConfig(extract_pages=True)

  2. Use k_clusters=6 by default (recommended). It provides a good balance between detail and performance for most documents.

  3. Include bounding boxes for RAG systems that need spatial awareness for relevance ranking.

  4. Test ocr_coverage_threshold against your own document set to find the value that triggers OCR on the right pages.

  5. Chunk at heading boundaries in RAG systems so that each chunk keeps its section context within the context window.

Example: Building a Table of Contents

from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig, PageConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(enabled=True, k_clusters=6)
    ),
    pages=PageConfig(extract_pages=True)
)

result = extract_file_sync("document.pdf", config=config)

toc = []
for page in result.pages or []:
    if page.hierarchy:
        for block in page.hierarchy.blocks:
            if block.level.startswith("h"):
                level = int(block.level[1])
                toc.append({
                    "level": level,
                    "text": block.text,
                    "page": page.page_number
                })

# Print hierarchical TOC
for entry in toc:
    indent = "  " * (entry["level"] - 1)
    print(f"{indent}{entry['text']} (p. {entry['page']})")

PageConfig

Configuration for page extraction and tracking.

Controls whether to extract per-page content and how to mark page boundaries in the combined text output.

Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| extract_pages | bool | false | Extract pages as a separate array in results |
| insert_page_markers | bool | false | Insert page markers in the combined content string |
| marker_format | String | "\n\n<!-- PAGE {page_num} -->\n\n" | Template for page markers (use the {page_num} placeholder) |

Example

page_config.cs
var config = new ExtractionConfig
{
    Pages = new PageConfig
    {
        ExtractPages = true,
        InsertPageMarkers = true,
        MarkerFormat = "\n\n--- Page {page_num} ---\n\n"
    }
};
page_config.go
config := &ExtractionConfig{
    Pages: &PageConfig{
        ExtractPages:      true,
        InsertPageMarkers: true,
        MarkerFormat:      "\n\n--- Page {page_num} ---\n\n",
    },
}
PageConfig.java
var config = ExtractionConfig.builder()
    .pages(PageConfig.builder()
        .extractPages(true)
        .insertPageMarkers(true)
        .markerFormat("\n\n--- Page {page_num} ---\n\n")
        .build())
    .build();
page_config.py
config = ExtractionConfig(
    pages=PageConfig(
        extract_pages=True,
        insert_page_markers=True,
        marker_format="\n\n--- Page {page_num} ---\n\n"
    )
)
page_config.rb
config = ExtractionConfig.new(
  pages: PageConfig.new(
    extract_pages: true,
    insert_page_markers: true,
    marker_format: "\n\n--- Page {page_num} ---\n\n"
  )
)
page_config.rs
let config = ExtractionConfig {
    pages: Some(PageConfig {
        extract_pages: true,
        insert_page_markers: true,
        marker_format: "\n\n--- Page {page_num} ---\n\n".to_string(),
    }),
    ..Default::default()
};
page_config.ts
const config: ExtractionConfig = {
  pages: {
    extractPages: true,
    insertPageMarkers: true,
    markerFormat: "\n\n--- Page {page_num} ---\n\n"
  }
};

Field Details

extract_pages: When true, populates ExtractionResult.pages with per-page content. Each page contains its text, tables, and images separately.

insert_page_markers: When true, inserts page markers into the combined content string at page boundaries. Useful for LLMs to understand document structure.

marker_format: Template string for page markers. Use {page_num} placeholder for the page number. Default HTML comment format is LLM-friendly.
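
Once markers are inserted, the combined content can be split back into pages on the consumer side. The sketch below builds a regular expression from the marker template; it assumes each marker immediately precedes the page it names, and the `split_by_markers` helper is a hypothetical name, not a Kreuzberg function.

```python
import re

DEFAULT_MARKER = "\n\n<!-- PAGE {page_num} -->\n\n"


def split_by_markers(content, marker_format=DEFAULT_MARKER):
    """Return {page_number: page_text} for content that contains page markers."""
    # Escape the template, then turn the placeholder into a capturing group
    pattern = re.escape(marker_format).replace(re.escape("{page_num}"), r"(\d+)")
    parts = re.split(pattern, content)
    # parts looks like [text_before_first_marker, "1", page_1_text, "2", page_2_text, ...]
    pages = {}
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1]
    return pages


content = "\n\n<!-- PAGE 1 -->\n\nfirst page\n\n<!-- PAGE 2 -->\n\nsecond page"
print(split_by_markers(content))
```

The same function works with a custom template such as "\n\n--- Page {page_num} ---\n\n", as long as the value passed matches the configured marker_format.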

Format Support

  • PDF: Full byte-accurate page tracking with O(1) lookup performance
  • PPTX: Slide boundary tracking with per-slide content
  • DOCX: Best-effort page break detection using explicit page breaks
  • Other formats: Page tracking not available (returns None/null)

ImageExtractionConfig

Configuration for extracting and processing images from documents.

| Field | Type | Default | Description |
|---|---|---|---|
| extract_images | bool | true | Extract images from documents |
| target_dpi | int | 300 | Target DPI for extracted/normalized images |
| max_image_dimension | int | 4096 | Maximum image dimension (width or height) in pixels |
| inject_placeholders | bool | true | Inject image reference placeholders (for example `![Image](embedded:p1_i0)`) into markdown output. Set to false to extract images as data without modifying the text content. |
| auto_adjust_dpi | bool | true | Automatically adjust DPI based on image size and content |
| min_dpi | int | 72 | Minimum DPI when auto-adjusting |
| max_dpi | int | 600 | Maximum DPI when auto-adjusting |
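
To show how these limits can interact, here is one plausible DPI adjustment rule written as a sketch. The algorithm Kreuzberg actually uses for auto_adjust_dpi is not described in this reference, so treat the logic below as an assumption: it lowers the DPI until the longest page side fits within max_image_dimension, then clamps the result to the min_dpi and max_dpi range. The `effective_dpi` helper is hypothetical.

```python
POINTS_PER_INCH = 72.0


def effective_dpi(width_pt, height_pt, target_dpi=300, max_image_dimension=4096,
                  min_dpi=72, max_dpi=600):
    """Pick a rendering DPI for a page whose size is given in PDF points."""
    longest_inches = max(width_pt, height_pt) / POINTS_PER_INCH
    dpi = float(target_dpi)
    # Scale down when the rendered longest side would exceed the pixel limit
    if longest_inches * dpi > max_image_dimension:
        dpi = max_image_dimension / longest_inches
    # The DPI bounds win over the dimension limit in this sketch
    return max(min_dpi, min(max_dpi, dpi))


print(effective_dpi(612, 792))     # US Letter page
print(effective_dpi(2384, 3370))   # large-format page
```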

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Images = new ImageExtractionConfig
    {
        ExtractImages = true,
        TargetDpi = 200,
        MaxImageDimension = 2048,
        InjectPlaceholders = true, // set to false to extract images without markdown references
        AutoAdjustDpi = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Extracted: {result.Content[..Math.Min(100, result.Content.Length)]}");
Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    targetDPI := 200
    maxDim := 2048
    result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
        ImageExtraction: &kreuzberg.ImageExtractionConfig{
            ExtractImages:      kreuzberg.BoolPtr(true),
            TargetDPI:          &targetDPI,
            MaxImageDimension:  &maxDim,
            InjectPlaceholders: kreuzberg.BoolPtr(true), // set to false to extract images without markdown references
            AutoAdjustDPI:      kreuzberg.BoolPtr(true),
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ImageExtractionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .imageExtraction(ImageExtractionConfig.builder()
        .extractImages(true)
        .targetDpi(200)
        .maxImageDimension(2048)
        .injectPlaceholders(true) // set to false to extract images without markdown references
        .autoAdjustDpi(true)
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, ImageExtractionConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        images=ImageExtractionConfig(
            extract_images=True,
            target_dpi=200,
            max_image_dimension=2048,
            inject_placeholders=True,  # set to False to extract images without markdown references
            auto_adjust_dpi=True,
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Extracted: {result.content[:100]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  images: Kreuzberg::Config::ImageExtraction.new(
    extract_images: true,
    target_dpi: 200,
    max_image_dimension: 2048,
    inject_placeholders: true, # set to false to extract images without markdown references
    auto_adjust_dpi: true
  )
)
R
library(kreuzberg)

ocr_cfg <- ocr_config(backend = "tesseract", language = "eng", dpi = 300L)
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)

result <- extract_file_sync("scan.png", "image/png", config)

cat(sprintf("Image extraction via OCR:\n"))
cat(sprintf("Content length: %d characters\n", nchar(result$content)))
cat(sprintf("Mime type: %s\n", result$mime_type))
cat(sprintf("Detected language: %s\n", result$detected_language))
Rust
use kreuzberg::{ExtractionConfig, ImageExtractionConfig};

fn main() {
    let config = ExtractionConfig {
        images: Some(ImageExtractionConfig {
            extract_images: Some(true),
            target_dpi: Some(200),
            max_image_dimension: Some(2048),
            inject_placeholders: Some(true), // set to false to extract images without markdown references
            auto_adjust_dpi: Some(true),
            ..Default::default()
        }),
        ..Default::default()
    };
    println!("{:?}", config.images);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    images: {
        extractImages: true,
        targetDpi: 200,
        maxImageDimension: 2048,
        injectPlaceholders: true, // set to false to extract images without markdown references
        autoAdjustDpi: true,
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(`Extracted ${result.images?.length ?? 0} images`);

ImagePreprocessingConfig

Image preprocessing configuration for improving OCR quality on scanned documents.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| target_dpi | int | 300 | Target DPI for OCR processing (300 standard, 600 for small text) |
| auto_rotate | bool | true | Auto-detect and correct image rotation |
| deskew | bool | true | Correct skew (tilted images) |
| denoise | bool | false | Apply noise reduction filter |
| contrast_enhance | bool | false | Enhance image contrast for better text visibility |
| binarization_method | str | "otsu" | Binarization method: "otsu", "sauvola", "adaptive", "none" |
| invert_colors | bool | false | Invert colors (useful for white text on black background) |

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        TesseractConfig = new TesseractConfig
        {
            Preprocessing = new ImagePreprocessingConfig
            {
                TargetDpi = 300,
                Denoise = true,
                Deskew = true,
                ContrastEnhance = true,
                BinarizationMethod = "otsu"
            }
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("scanned.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    targetDPI := 300
    config := &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Tesseract: &kreuzberg.TesseractConfig{
                Preprocessing: &kreuzberg.ImagePreprocessingConfig{
                    TargetDPI:         &targetDPI,
                    Denoise:           kreuzberg.BoolPtr(true),
                    Deskew:            kreuzberg.BoolPtr(true),
                    ContrastEnhance:   kreuzberg.BoolPtr(true),
                    BinarizationMode:  kreuzberg.StringPtr("otsu"),
                },
            },
        },
    }

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ImagePreprocessingConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .tesseractConfig(TesseractConfig.builder()
            .preprocessing(ImagePreprocessingConfig.builder()
                .targetDpi(300)
                .denoise(true)
                .deskew(true)
                .contrastEnhance(true)
                .binarizationMethod("otsu")
                .build())
            .build())
        .build())
    .build();
Python
import asyncio
from kreuzberg import (
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ImagePreprocessingConfig,
    extract_file,
)

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            tesseract_config=TesseractConfig(
                preprocessing=ImagePreprocessingConfig(
                    target_dpi=300,
                    denoise=True,
                    deskew=True,
                    contrast_enhance=True,
                    binarization_method="otsu",
                )
            )
        )
    )
    result = await extract_file("scanned.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    tesseract_config: Kreuzberg::Config::Tesseract.new(
      preprocessing: Kreuzberg::Config::ImagePreprocessing.new(
        target_dpi: 300,
        denoise: true,
        deskew: true,
        contrast_enhance: true,
        binarization_method: 'otsu'
      )
    )
  )
)
R
library(kreuzberg)

dpi_settings <- c(150L, 300L, 600L)
results <- list()

for (dpi in dpi_settings) {
  ocr_cfg <- ocr_config(backend = "tesseract", language = "eng", dpi = dpi)
  config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg,
                              enable_quality_processing = TRUE)
  results[[as.character(dpi)]] <- extract_file_sync("scan.png", "image/png", config)
}

for (dpi in dpi_settings) {
  quality <- results[[as.character(dpi)]]$quality_score
  length <- nchar(results[[as.character(dpi)]]$content)
  cat(sprintf("DPI %d: quality=%.2f, length=%d\n", dpi, quality, length))
}
Rust
use kreuzberg::{ExtractionConfig, ImagePreprocessingConfig, OcrConfig, TesseractConfig};

fn main() {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            tesseract_config: Some(TesseractConfig {
                preprocessing: Some(ImagePreprocessingConfig {
                    target_dpi: 300,
                    denoise: true,
                    deskew: true,
                    contrast_enhance: true,
                    binarization_method: "otsu".to_string(),
                    ..Default::default()
                }),
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    println!("{:?}", config.ocr);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
        tesseractConfig: {
            psm: 6,
            enableTableDetection: true,
        },
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

PostProcessorConfig

Configuration for the post-processing pipeline that runs after extraction.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Enable post-processing pipeline |
| enabled_processors | list[str]? | None | Specific processors to enable (if None, all enabled by default) |
| disabled_processors | list[str]? | None | Specific processors to disable (takes precedence over enabled_processors) |

Built-in post-processors include:

  • deduplication - Remove duplicate text blocks
  • whitespace_normalization - Normalize whitespace and line breaks
  • mojibake_fix - Fix mojibake (encoding corruption)
  • quality_scoring - Score and filter low-quality text
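
How the three fields combine can be sketched in plain Python. This is an illustration only and does not call Kreuzberg: `resolve_processors` and `BUILTIN_PROCESSORS` are hypothetical names, and the logic simply mirrors the field descriptions above (a global switch, an optional allow-list, and a deny-list that takes precedence). The real pipeline may behave differently, for example when custom processors are registered.

```python
from typing import Optional

# Built-in processor names as listed in this section.
BUILTIN_PROCESSORS = [
    "deduplication",
    "whitespace_normalization",
    "mojibake_fix",
    "quality_scoring",
]


def resolve_processors(
    enabled: bool = True,
    enabled_processors: Optional[list[str]] = None,
    disabled_processors: Optional[list[str]] = None,
) -> list[str]:
    """Hypothetical helper: return the processors that would run."""
    if not enabled:
        return []  # the whole pipeline is switched off
    # None means "no allow-list": start from every built-in processor.
    candidates = BUILTIN_PROCESSORS if enabled_processors is None else enabled_processors
    blocked = set(disabled_processors or [])
    # disabled_processors wins over enabled_processors.
    return [name for name in candidates if name not in blocked]


print(resolve_processors(
    enabled_processors=["deduplication", "mojibake_fix"],
    disabled_processors=["mojibake_fix"],
))  # ['deduplication']
```

Listing a processor in both lists therefore leaves it disabled, which matches the precedence note in the table.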

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Postprocessor = new PostProcessorConfig
    {
        Enabled = true,
        EnabledProcessors = new List<string> { "deduplication" }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
Go
package main

import "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"

func main() {
    enabled := true
    cfg := &kreuzberg.ExtractionConfig{
        Postprocessor: &kreuzberg.PostProcessorConfig{
            Enabled:            &enabled,
            EnabledProcessors:  []string{"deduplication", "whitespace_normalization"},
            DisabledProcessors: []string{"mojibake_fix"},
        },
    }

    _ = cfg
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.PostProcessorConfig;
import java.util.Arrays;

ExtractionConfig config = ExtractionConfig.builder()
    .postprocessor(PostProcessorConfig.builder()
        .enabled(true)
        .enabledProcessors(Arrays.asList("deduplication", "whitespace_normalization"))
        .disabledProcessors(Arrays.asList("mojibake_fix"))
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, PostProcessorConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        postprocessor=PostProcessorConfig(
            enabled=True,
            enabled_processors=["deduplication"],
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  postprocessor: Kreuzberg::Config::PostProcessor.new(
    enabled: true,
    enabled_processors: ['deduplication', 'whitespace_normalization'],
    disabled_processors: ['mojibake_fix']
  )
)
R
library(kreuzberg)

config <- extraction_config(
  postprocessor = list(enabled = TRUE)
)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("Content length: %d characters\n", nchar(result$content)))
cat(sprintf("Mime type: %s\n", result$mime_type))
Rust
use kreuzberg::{ExtractionConfig, PostProcessorConfig};

fn main() {
    let config = ExtractionConfig {
        postprocessor: Some(PostProcessorConfig {
            enabled: Some(true),
            enabled_processors: Some(vec![
                "deduplication".to_string(),
                "whitespace_normalization".to_string(),
            ]),
            disabled_processors: Some(vec!["mojibake_fix".to_string()]),
        }),
        ..Default::default()
    };
    println!("{:?}", config.postprocessor);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    postprocessor: {
        enabled: true,
        enabledProcessors: ['deduplication', 'whitespace_normalization'],
        disabledProcessors: ['mojibake_fix'],
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

TokenReductionConfig

Configuration for reducing token count in extracted text, useful for optimizing LLM context windows.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| mode | str | "off" | Reduction mode: "off", "light", "moderate", "aggressive", "maximum" |
| preserve_important_words | bool | true | Preserve important words (capitalized, technical terms) during reduction |

Reduction Modes

  • off: No token reduction
  • light: Remove redundant whitespace and line breaks (~5-10% reduction)
  • moderate: Light + remove stopwords in low-information contexts (~15-25% reduction)
  • aggressive: Moderate + abbreviate common phrases (~30-40% reduction)
  • maximum: Aggressive + remove all stopwords (~50-60% reduction, may impact quality)
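
To get a feel for what these percentages mean for a context-window budget, the ranges above can be turned into a rough estimate. The sketch below is plain arithmetic, not a Kreuzberg API: `REDUCTION_RANGES` and `estimate_tokens_after` are hypothetical names, and the figures are the approximate ranges quoted in the list, so actual savings will vary by document.

```python
# Approximate reduction ranges from the mode list above, in percent removed.
REDUCTION_RANGES = {
    "off": (0, 0),
    "light": (5, 10),
    "moderate": (15, 25),
    "aggressive": (30, 40),
    "maximum": (50, 60),
}


def estimate_tokens_after(tokens: int, mode: str) -> tuple[int, int]:
    """Hypothetical helper: (fewest, most) tokens expected to remain."""
    low, high = REDUCTION_RANGES[mode]  # raises KeyError for an unknown mode
    # Integer arithmetic keeps the estimate free of float rounding noise.
    return tokens * (100 - high) // 100, tokens * (100 - low) // 100


print(estimate_tokens_after(10_000, "moderate"))  # (7500, 8500)
```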

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    TokenReduction = new TokenReductionConfig
    {
        Mode = "moderate",
        PreserveImportantWords = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content length: {result.Content.Length}");
Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    config := &kreuzberg.ExtractionConfig{
        TokenReduction: &kreuzberg.TokenReductionConfig{
            Mode:                   "moderate",
            PreserveImportantWords: kreuzberg.BoolPtr(true),
        },
    }

    fmt.Printf("Mode: %s, Preserve Important Words: %v\n",
        config.TokenReduction.Mode,
        *config.TokenReduction.PreserveImportantWords)
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.TokenReductionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .tokenReduction(TokenReductionConfig.builder()
        .mode("moderate")
        .preserveImportantWords(true)
        .build())
    .build();
Python
from kreuzberg import ExtractionConfig, TokenReductionConfig

config: ExtractionConfig = ExtractionConfig(
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_important_words=True,
    )
)
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  token_reduction: Kreuzberg::Config::TokenReduction.new(
    mode: 'moderate',
    preserve_important_words: true
  )
)
R
library(kreuzberg)

config <- extraction_config(
  token_reduction = list(enabled = TRUE)
)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("Original content length: %d characters\n", nchar(result$content)))
cat(sprintf("Content preview: %.60s...\n", result$content))
Rust
use kreuzberg::{ExtractionConfig, TokenReductionConfig};

let config = ExtractionConfig {
    token_reduction: Some(TokenReductionConfig {
        mode: "moderate".to_string(),
        preserve_important_words: true,
        ..Default::default()
    }),
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    tokenReduction: {
        mode: 'moderate',
        preserveImportantWords: true,
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

LayoutDetectionConfig v4.5.0

Configuration for ONNX-based document layout detection. Analyzes PDF pages to identify structural regions such as tables, figures, headers, and text blocks.

Feature Gate: Requires the layout-detection Cargo feature. Layout detection is only available when this feature is enabled.

preset removed

The preset field was removed. If present in a config file it is silently ignored. The RT-DETR v2 model is now the only layout detection model.
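
Because a leftover preset key is ignored without any warning, it can be worth checking config files before deploying them. The following is a minimal sketch of such a check using only the standard library. `find_removed_layout_keys` is a hypothetical helper, not part of Kreuzberg, and it assumes the config has already been parsed into a dictionary.

```python
import json


def find_removed_layout_keys(config: dict) -> list[str]:
    """Hypothetical lint: report layout keys this page documents as removed."""
    removed = {"preset"}  # per the note above; extend if more fields are retired
    layout = config.get("layout") or {}
    return sorted(key for key in layout if key in removed)


config = json.loads('{"layout": {"preset": "fast", "apply_heuristics": true}}')
for key in find_removed_layout_keys(config):
    print(f"warning: layout.{key} is no longer supported and will be ignored")
```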

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| confidence_threshold | float? | None | Confidence threshold override (0.0-1.0). If None, uses the model's built-in default threshold |
| apply_heuristics | bool | true | Apply postprocessing heuristics (containment filtering, deduplication) |
| table_model | str? | None (uses "tatr") | Table structure recognition model. Options: "tatr" (30MB, default), "slanet_wired" (365MB, bordered tables), "slanet_wireless" (365MB, borderless tables), "slanet_plus" (7.78MB, lightweight), "slanet_auto" (~737MB, classifier-routed). See Table Structure Models. |

Table detection requires layout detection

Table extraction only runs when layout is set in ExtractionConfig. Setting only table_model has no effect without an enclosing LayoutDetectionConfig.
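
When disk space or download size is a constraint, the model sizes quoted in the table_model row can guide the choice. The sketch below only filters those published sizes. `TABLE_MODEL_SIZES_MB` and `models_within_budget` are hypothetical names, the slanet_auto figure is approximate, and size says nothing about accuracy on a given table style.

```python
# Approximate download sizes in MB, copied from the table_model row above.
TABLE_MODEL_SIZES_MB = {
    "tatr": 30.0,
    "slanet_wired": 365.0,
    "slanet_wireless": 365.0,
    "slanet_plus": 7.78,
    "slanet_auto": 737.0,
}


def models_within_budget(budget_mb: float) -> list[str]:
    """Hypothetical helper: table models that fit a size budget, smallest first."""
    fitting = [(size, name) for name, size in TABLE_MODEL_SIZES_MB.items() if size <= budget_mb]
    return [name for _, name in sorted(fitting)]


print(models_within_budget(50))  # ['slanet_plus', 'tatr']
```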

Configuration Examples

layout_detection_config.py
from kreuzberg import ExtractionConfig, LayoutDetectionConfig

config = ExtractionConfig(
    layout=LayoutDetectionConfig(
        confidence_threshold=0.5,
        apply_heuristics=True,
        table_model="slanet_auto",  # or "tatr", "slanet_wired", "slanet_wireless", "slanet_plus"
    )
)
layout_detection_config.ts
import { extract } from "kreuzberg";

const result = await extract("document.pdf", {
  layout: {
    confidenceThreshold: 0.5,
    applyHeuristics: true,
    tableModel: "slanet_auto", // or "tatr", "slanet_wired", "slanet_wireless", "slanet_plus"
  },
});
layout_detection_config.rs
use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig};

let config = ExtractionConfig {
    layout: Some(LayoutDetectionConfig {
        confidence_threshold: Some(0.5),
        apply_heuristics: true,
        table_model: Some("slanet_auto".to_string()),
        ..Default::default()
    }),
    ..Default::default()
};

Configuration File Examples

kreuzberg.toml
[layout]
confidence_threshold = 0.5
apply_heuristics = true
# table_model = "slanet_auto"
kreuzberg.yaml
layout:
  confidence_threshold: 0.5
  apply_heuristics: true
  # table_model: slanet_auto

AccelerationConfig v4.5.0

Controls hardware acceleration for ONNX Runtime inference (layout detection and embeddings).

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| provider | str | "auto" | Execution provider: "auto", "cpu", "coreml", "cuda", "tensorrt" |
| device_id | int | 0 | GPU device ID (for CUDA/TensorRT) |

Provider Behavior

  • auto: CoreML on macOS; CUDA on Linux (x86_64) and Windows when available; CPU otherwise (see Platform Defaults)
  • cpu: CPU-only inference (always available)
  • coreml: Apple CoreML (macOS Neural Engine / GPU)
  • cuda: NVIDIA CUDA GPU acceleration
  • tensorrt: NVIDIA TensorRT (optimized CUDA inference)

cuda and tensorrt only work when the Kreuzberg build was compiled against an ONNX Runtime that ships those execution providers. If a requested provider isn't compiled in or isn't available at runtime, ORT falls back to CPU silently. To verify which provider is actually selected, run with RUST_LOG=ort=info and check the startup log.

Platform Defaults

| Platform | provider="auto" resolves to |
| --- | --- |
| macOS (arm64) | coreml |
| macOS (x86_64) | coreml |
| Linux (x86_64) | cuda if available, else cpu |
| Linux (aarch64) | cpu |
| Windows | cuda if available, else cpu |

The device_id field only matters for cuda and tensorrt. Set it to the GPU index (0, 1, ...) when running on multi-GPU hosts; it is ignored for every other provider.
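
The platform table can be read as a small decision function. The sketch below restates it in Python for illustration. `resolve_auto_provider` is a hypothetical name, the `cuda_available` flag stands in for whatever runtime check the library performs, and platforms not listed in the table are assumed to fall back to CPU.

```python
import platform


def resolve_auto_provider(system: str, machine: str, cuda_available: bool) -> str:
    """Hypothetical mirror of the Platform Defaults table for provider="auto"."""
    if system == "Darwin":
        return "coreml"  # macOS on both arm64 and x86_64
    if system == "Windows" or (system == "Linux" and machine in ("x86_64", "AMD64")):
        return "cuda" if cuda_available else "cpu"
    return "cpu"  # Linux aarch64 and anything not listed in the table


# platform.system() and platform.machine() describe the current host.
print(resolve_auto_provider(platform.system(), platform.machine(), cuda_available=False))
```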

Configuration Examples

acceleration_config.py
from kreuzberg import ExtractionConfig, AccelerationConfig

# Force CUDA on GPU 0; falls back to CPU if CUDA isn't compiled in
config = ExtractionConfig(
    acceleration=AccelerationConfig(provider="cuda", device_id=0)
)

# macOS: explicitly use CoreML for ONNX inference
coreml_config = ExtractionConfig(
    acceleration=AccelerationConfig(provider="coreml")
)
acceleration_config.ts
import { extract } from "kreuzberg";

const result = await extract("document.pdf", {
  acceleration: { provider: 'cuda', deviceId: 0 },
});
acceleration_config.rs
use kreuzberg::core::{ExtractionConfig, AccelerationConfig};

let config = ExtractionConfig {
    acceleration: Some(AccelerationConfig {
        provider: "cuda".to_string(),
        device_id: 0,
    }),
    ..Default::default()
};

Configuration File Examples

kreuzberg.toml
[acceleration]
provider = "cpu"
device_id = 0
kreuzberg.yaml
acceleration:
  provider: cpu
  device_id: 0

ConcurrencyConfig v4.5.0

Controls thread pool and concurrency limits for Rayon parallelism, ONNX Runtime intra-op threading, and the batch extraction semaphore.

Fields

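| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_threads | int? | None | Upper bound applied to the Rayon thread pool, ONNX Runtime intra-op threads, and batch extraction concurrency. None leaves all three unlimited |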

Overview

Use ConcurrencyConfig to constrain resource usage on systems with limited hardware. When set, max_threads caps:

  • Rayon thread pool size for text extraction and parsing parallelism
  • ONNX Runtime intra-op parallelism for layout detection and embeddings inference
  • Batch extraction semaphore for limiting concurrent file extractions

Setting max_threads: None disables concurrency limits and allows libraries to use all available cores (default behavior).
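
A common way to choose a value is to derive it from the number of cores on the host while leaving some headroom for other work. The sketch below shows one such policy using only the standard library. `pick_max_threads` and its `reserve` parameter are hypothetical, and the right headroom depends on what else runs on the machine.

```python
import os
from typing import Optional


def pick_max_threads(reserve: int = 1, cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper: leave `reserve` cores free, but never go below 1."""
    # os.cpu_count() can return None, so fall back to a single core.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores - reserve)


max_threads = pick_max_threads(reserve=2)
print(f"max_threads={max_threads}")
# The value could then be passed as ConcurrencyConfig(max_threads=max_threads).
```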

Configuration Examples

concurrency_config.py
from kreuzberg import ExtractionConfig, ConcurrencyConfig

# Limit to 4 threads for constrained hardware
config = ExtractionConfig(
    concurrency=ConcurrencyConfig(max_threads=4)
)
concurrency_config.ts
import { extract } from "kreuzberg";

const result = await extract("document.pdf", {
  concurrency: { maxThreads: 4 },
});
concurrency_config.rs
use kreuzberg::core::{ExtractionConfig, ConcurrencyConfig};

let config = ExtractionConfig {
    concurrency: Some(ConcurrencyConfig {
        max_threads: Some(4),
    }),
    ..Default::default()
};
concurrency_config.go
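package main

import "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"

func main() {
    maxThreads := 4 // cap Rayon, ONNX intra-op, and batch concurrency at 4 threads
    config := &kreuzberg.ExtractionConfig{
        Concurrency: &kreuzberg.ConcurrencyConfig{
            MaxThreads: &maxThreads,
        },
    }

    _ = config // pass config to an extraction call such as kreuzberg.ExtractFileSync
}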
ConcurrencyConfig.java
ConcurrencyConfig concurrency = new ConcurrencyConfig(4);
ExtractionConfig config = new ExtractionConfig(
    /* ... other fields ... */
    Optional.of(concurrency)
);
concurrency_config.cs
using Kreuzberg;

var config = new ExtractionConfig
{
    Concurrency = new ConcurrencyConfig { MaxThreads = 4 }
};

TreeSitterConfig

Configuration for tree-sitter language pack integration. Controls grammar caching and code analysis options when extracting source code files. Requires the tree-sitter feature flag.

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Enable code intelligence processing. When false, tree-sitter analysis is skipped even if config is present |
| cache_dir | PathBuf? | None | Custom cache directory for downloaded grammars. Default: ~/.cache/tree-sitter-language-pack/v{version}/libs/ |
| languages | Vec<String>? | None | Languages to pre-download on init (for example, ["python", "rust"]) |
| groups | Vec<String>? | None | Language groups to pre-download (for example, ["web", "systems", "scripting"]) |
| process | TreeSitterProcessConfig | default | Processing options for code analysis |

TreeSitterProcessConfig

Controls which analysis features are enabled when extracting code files.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| structure | bool | true | Extract structural items (functions, classes, structs, etc.) |
| imports | bool | true | Extract import statements |
| exports | bool | true | Extract export statements |
| comments | bool | false | Extract comments |
| docstrings | bool | false | Extract docstrings |
| symbols | bool | false | Extract symbol definitions (variables, constants, type aliases) |
| diagnostics | bool | false | Include parse diagnostics (errors and warnings from tree-sitter) |
| chunk_max_size | usize? | None | Maximum chunk size in bytes. None uses the default chunking size |
| content_mode | CodeContentMode | chunks | Controls how code content is rendered in the content field: chunks (semantic chunks, default), raw (raw source code), or structure (function/class headings + docstrings, no code bodies) |

Configuration Examples

kreuzberg.toml
[tree_sitter]
languages = ["python", "rust", "typescript"]
groups = ["web"]

[tree_sitter.process]
structure = true
imports = true
exports = true
comments = true
docstrings = true
symbols = false
diagnostics = false
tree_sitter_config.rs
use kreuzberg::{ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

let config = ExtractionConfig {
    tree_sitter: Some(TreeSitterConfig {
        process: TreeSitterProcessConfig {
            structure: true,
            imports: true,
            exports: true,
            comments: true,
            docstrings: true,
            ..Default::default()
        },
        ..Default::default()
    }),
    ..Default::default()
};
tree_sitter_config.py
import kreuzberg

config = kreuzberg.ExtractionConfig(
    tree_sitter={
        "process": {
            "structure": True,
            "imports": True,
            "exports": True,
            "comments": True,
            "docstrings": True,
        }
    }
)
tree_sitter_config.ts
import { ExtractionConfig } from "@kreuzberg/node";

const config: ExtractionConfig = {
  treeSitter: {
    process: {
      structure: true,
      imports: true,
      exports: true,
      comments: true,
      docstrings: true,
    },
  },
};
tree_sitter_config.go
config := &kreuzberg.ExtractionConfig{
    TreeSitter: &kreuzberg.TreeSitterConfig{
        Process: &kreuzberg.TreeSitterProcessConfig{
            Structure:  kreuzberg.BoolPtr(true),
            Imports:    kreuzberg.BoolPtr(true),
            Exports:    kreuzberg.BoolPtr(true),
            Comments:   kreuzberg.BoolPtr(true),
            Docstrings: kreuzberg.BoolPtr(true),
        },
    },
}

Configuration File Examples

TOML Format

kreuzberg.toml
use_cache = true
enable_quality_processing = true
force_ocr = false

[ocr]
backend = "tesseract"
language = "eng+fra"

[ocr.tesseract_config]
psm = 6
oem = 1
min_confidence = 0.8
enable_table_detection = true

[ocr.tesseract_config.preprocessing]
target_dpi = 300
denoise = true
deskew = true
contrast_enhance = true
binarization_method = "otsu"

[pdf_options]
extract_images = true
extract_metadata = true
passwords = ["password1", "password2"]

[images]
extract_images = true
target_dpi = 200
max_image_dimension = 4096

[chunking]
max_characters = 1000
overlap = 200

[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false

[token_reduction]
mode = "moderate"
preserve_important_words = true

[layout]
confidence_threshold = 0.5
apply_heuristics = true

[postprocessor]
enabled = true

YAML Format

kreuzberg.yaml
# kreuzberg.yaml
use_cache: true
enable_quality_processing: true
force_ocr: false

ocr:
  backend: tesseract
  language: eng+fra
  tesseract_config:
    psm: 6
    oem: 1
    min_confidence: 0.8
    enable_table_detection: true
    preprocessing:
      target_dpi: 300
      denoise: true
      deskew: true
      contrast_enhance: true
      binarization_method: otsu

pdf_options:
  extract_images: true
  extract_metadata: true
  passwords:
    - password1
    - password2

images:
  extract_images: true
  target_dpi: 200
  max_image_dimension: 4096

chunking:
  max_characters: 1000
  overlap: 200

language_detection:
  enabled: true
  min_confidence: 0.8
  detect_multiple: false

token_reduction:
  mode: moderate
  preserve_important_words: true

layout:
  confidence_threshold: 0.5
  apply_heuristics: true

postprocessor:
  enabled: true

JSON Format

kreuzberg.json
{
  "use_cache": true,
  "enable_quality_processing": true,
  "force_ocr": false,
  "ocr": {
    "backend": "tesseract",
    "language": "eng+fra",
    "tesseract_config": {
      "psm": 6,
      "oem": 1,
      "min_confidence": 0.8,
      "enable_table_detection": true,
      "preprocessing": {
        "target_dpi": 300,
        "denoise": true,
        "deskew": true,
        "contrast_enhance": true,
        "binarization_method": "otsu"
      }
    }
  },
  "pdf_options": {
    "extract_images": true,
    "extract_metadata": true,
    "passwords": ["password1", "password2"]
  },
  "images": {
    "extract_images": true,
    "target_dpi": 200,
    "max_image_dimension": 4096
  },
  "chunking": {
    "max_characters": 1000,
    "overlap": 200
  },
  "language_detection": {
    "enabled": true,
    "min_confidence": 0.8,
    "detect_multiple": false
  },
  "token_reduction": {
    "mode": "moderate",
    "preserve_important_words": true
  },
  "layout": {
    "confidence_threshold": 0.5,
    "apply_heuristics": true
  },
  "postprocessor": {
    "enabled": true
  }
}

For complete working examples, see the examples directory.


Best Practices

When to Use Config Files vs Programmatic Config

Use config files when:

  • Settings are shared across multiple scripts/applications
  • Configuration needs to be version controlled
  • Non-developers need to modify settings
  • Deploying to multiple environments (dev/staging/prod)

Use programmatic config when:

  • Settings vary per execution or are computed dynamically
  • Configuration depends on runtime conditions
  • Building SDKs or libraries that wrap Kreuzberg
  • Rapid prototyping and experimentation

Performance Considerations

Caching:

  • Keep use_cache=true for repeated processing of the same files
  • Cache is automatically invalidated when files change
  • Cache location: platform-specific global cache (for example, ~/.cache/kreuzberg/ on Linux, ~/Library/Caches/kreuzberg/ on macOS), configurable via KREUZBERG_CACHE_DIR env var or cache_dir option
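
The cache location rules above can be sketched as a small lookup. This is an illustration, not Kreuzberg's implementation: `resolve_cache_dir` is a hypothetical helper, only the Linux and macOS defaults named above are covered, and how the cache_dir option interacts with the environment variable is not modeled here.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_cache_dir(env: Mapping[str, str], system: str, home: Path) -> Path:
    """Hypothetical lookup: env override first, then the per-platform default."""
    override = env.get("KREUZBERG_CACHE_DIR")
    if override:
        return Path(override)
    if system == "Darwin":
        return home / "Library" / "Caches" / "kreuzberg"
    # Linux default from the bullet above; other platforms are not covered here.
    return home / ".cache" / "kreuzberg"


print(resolve_cache_dir(os.environ, "Linux", Path.home()))
```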

OCR Settings:

  • Lower target_dpi (for example, 150-200) for faster processing of low-quality scans
  • Higher target_dpi (for example, 400-600) for small text or high-quality documents
  • Disable enable_table_detection if tables aren't needed (10-20% speedup)
  • Use psm=6 for clean single-column documents (faster than psm=3)

Batch Processing:

  • Set max_concurrent_extractions to balance speed and memory usage
  • Default (num_cpus * 2) works well for most systems
  • Reduce for memory-constrained environments
  • Increase for I/O-bound workloads on systems with fast storage

Token Reduction:

  • Use "light" or "moderate" modes for minimal quality impact
  • "aggressive" and "maximum" modes may affect semantic meaning
  • Benchmark with your specific LLM to measure quality vs. cost tradeoff

Security Considerations

API Keys and Secrets:

  • Never commit config files containing API keys or passwords to version control
  • Use environment variables for sensitive data:
Terminal
export KREUZBERG_OCR_API_KEY="your-key-here"
  • Add kreuzberg.toml to .gitignore if it contains secrets
  • Use separate config files for development vs. production

PDF Passwords:

  • The passwords field is tried in order until one password succeeds
  • Passwords are not logged or cached
  • Use environment variables for sensitive passwords:
secure_config.py
import os
from kreuzberg import PdfConfig

config = PdfConfig(passwords=[os.environ["PDF_PASSWORD"]])  # raises KeyError if the variable is unset

File System Access:

  • Kreuzberg only reads files you explicitly pass to extraction functions
  • Cache directory permissions should be restricted to the running user
  • Temporary files are automatically cleaned up after extraction

Data Privacy:

  • Extraction results are never sent to external services (except explicit OCR backends)
  • Tesseract OCR runs locally with no network access
  • EasyOCR and PaddleOCR may download models on first run (cached locally)
  • Consider disabling cache for sensitive documents requiring ephemeral processing

ApiSizeLimits

Configuration for API server request and file upload size limits.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_request_body_bytes | int | 104857600 | Maximum size of entire request body in bytes (100 MB default) |
| max_multipart_field_bytes | int | 104857600 | Maximum size of individual file in multipart upload in bytes (100 MB default) |

About Size Limits

Size limits protect your server from resource exhaustion and memory spikes. Both limits default to 100 MB, suitable for typical document processing workloads. Users can configure higher limits via environment variables for processing larger files.

Default Configuration:

  • Total request body: 100 MB (104,857,600 bytes)
  • Individual file: 100 MB (104,857,600 bytes)

Environment Variable Configuration:

Terminal
# Set multipart field limit to 200 MB via environment variable
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=209715200
kreuzberg serve -H 0.0.0.0 -p 8000
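
The byte values used on this page follow from treating 1 MB as 1024 * 1024 bytes. A small conversion helper makes that explicit when computing a value for the environment variable. `mb_to_bytes` is a hypothetical name and is not part of Kreuzberg.

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert MB to bytes using 1 MB = 1024 * 1024 bytes, as this page does."""
    return megabytes * 1024 * 1024


for mb in (50, 100, 200):
    print(f"{mb} MB = {mb_to_bytes(mb)} bytes")
# 50 MB = 52428800 bytes
# 100 MB = 104857600 bytes
# 200 MB = 209715200 bytes
```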

Example

C#
using Kreuzberg;
using Kreuzberg.Api;

// Default limits: 100 MB for both request body and individual files
var limits = new ApiSizeLimits();

// Custom limits: 200 MB for both request body and individual files
var customLimits = ApiSizeLimits.FromMB(200, 200);

// Or specify byte values directly
var customLimits2 = new ApiSizeLimits
{
    MaxRequestBodyBytes = 200 * 1024 * 1024,
    MaxMultipartFieldBytes = 200 * 1024 * 1024
};
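Go
package main

import "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"

func main() {
    // Explicit limits: 100 MB for both request body and individual files (the defaults)
    limits := kreuzberg.NewApiSizeLimits(
        100*1024*1024,
        100*1024*1024,
    )

    // Or use the convenience function for custom limits (200 MB each)
    customLimits := kreuzberg.ApiSizeLimitsFromMB(200, 200)

    _, _ = limits, customLimits
}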
Java
import com.kreuzberg.api.ApiSizeLimits;

// Default limits: 100 MB for both request body and individual files
ApiSizeLimits limits = new ApiSizeLimits();

// Custom limits via convenience method
ApiSizeLimits customLimits = ApiSizeLimits.fromMB(200, 200);

// Or specify byte values
ApiSizeLimits customLimits2 = new ApiSizeLimits(
    200 * 1024 * 1024,
    200 * 1024 * 1024
);
Python
from kreuzberg.api import ApiSizeLimits

# Default limits: 100 MB for both request body and individual files
limits = ApiSizeLimits()

# Custom limits via convenience method
limits = ApiSizeLimits.from_mb(200, 200)

# Or specify byte values
limits = ApiSizeLimits(
    max_request_body_bytes=200 * 1024 * 1024,
    max_multipart_field_bytes=200 * 1024 * 1024
)

Ruby

require 'kreuzberg'

# Default limits: 100 MB for both request body and individual files
limits = Kreuzberg::Api::ApiSizeLimits.new

# Custom limits via convenience method
limits = Kreuzberg::Api::ApiSizeLimits.from_mb(200, 200)

# Or specify byte values
limits = Kreuzberg::Api::ApiSizeLimits.new(
  max_request_body_bytes: 200 * 1024 * 1024,
  max_multipart_field_bytes: 200 * 1024 * 1024
)

Rust

use kreuzberg::api::ApiSizeLimits;

// Default limits: 100 MB for both request body and individual files
let limits = ApiSizeLimits::default();

// Custom limits via convenience method
let limits = ApiSizeLimits::from_mb(200, 200);

// Or specify byte values
let limits = ApiSizeLimits::new(
    200 * 1024 * 1024,  // max_request_body_bytes
    200 * 1024 * 1024,  // max_multipart_field_bytes
);

TypeScript

import { ApiSizeLimits } from 'kreuzberg';

// Default limits: 100 MB for both request body and individual files
const limits = new ApiSizeLimits();

// Custom limits via convenience method (200 MB each)
const customLimits = ApiSizeLimits.fromMb(200, 200);

// Or specify byte values
const customLimits2 = new ApiSizeLimits({
    maxRequestBodyBytes: 200 * 1024 * 1024,
    maxMultipartFieldBytes: 200 * 1024 * 1024
});

Configuration Scenarios

| Use Case | Recommended Limit | Rationale |
|---|---|---|
| Small documents (standard PDFs, Office files) | 100 MB (default) | Optimal for typical business documents |
| Medium documents (large scans, batches) | 200 MB | Good balance for batching without excessive memory |
| Large documents (archives, high-res scans) | 500-1000 MB | Suitable for specialized workflows with adequate RAM |
| Development/testing | 50 MB | Conservative limit to catch issues early |
| Memory-constrained environments | 50 MB | Prevents out-of-memory errors on limited systems |
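
The recommendations above can be turned into a ready-to-paste shell line. The following is a rough sketch: the scenario keys and the `export_line` helper are hypothetical names chosen for illustration; only the limit values and the `KREUZBERG_MAX_MULTIPART_FIELD_BYTES` variable come from this page.

```python
# Recommended limits from the scenarios table, in MB.
# Keys are hypothetical labels, not Kreuzberg identifiers.
RECOMMENDED_LIMIT_MB = {
    "small": 100,               # default
    "medium": 200,
    "large": 500,               # lower end of the 500-1000 MB range
    "development": 50,
    "memory_constrained": 50,
}


def export_line(scenario: str) -> str:
    """Build a shell export line for the given scenario's per-file limit."""
    limit_bytes = RECOMMENDED_LIMIT_MB[scenario] * 1024 * 1024
    return f"export KREUZBERG_MAX_MULTIPART_FIELD_BYTES={limit_bytes}"


print(export_line("medium"))
# export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=209715200
```

For the large-document scenario, pick a value within the 500-1000 MB range that matches the RAM actually available rather than relying on the lower bound used here.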

For comprehensive documentation including memory impact calculations, reverse proxy configuration, and troubleshooting, see the File Size Limits Reference.