Configuration Reference

This page provides complete documentation for all Kreuzberg configuration types and fields. For quick-start examples and common use cases, see the Configuration Guide.

Getting Started

New users should start with the Configuration Guide, which covers:

Configuration discovery mechanism
Quick-start examples in all languages
Common use cases (OCR setup, chunking for RAG)
Configuration file formats (TOML, YAML, JSON)

This reference page is the comprehensive source for:

All configuration field details
Default values and constraints
Technical specifications for each config type

ServerConfig

NEW in v4.2.7: ServerConfig controls API server and network settings.

ServerConfig configures the Kreuzberg HTTP server, including host/port settings, CORS configuration, and upload size limits. All settings can be overridden via environment variables.

Overview

ServerConfig is used to customize the Kreuzberg API server behavior when running kreuzberg serve or embedding a Kreuzberg API server in your application. It controls network binding, cross-origin resource sharing (CORS), and file upload size constraints.

Fields

Field	Type	Default	Description
`host`	`String`	`"127.0.0.1"`	Server host address (for example, “127.0.0.1”, “0.0.0.0”)
`port`	`u16`	`8000`	Server port number (1-65535)
`cors_origins`	`Vec<String>`	empty	CORS allowed origins. Empty list allows all origins.
`max_request_body_bytes`	`usize`	`104857600`	Maximum request body size in bytes (100 MB default)
`max_multipart_field_bytes`	`usize`	`104857600`	Maximum multipart field size in bytes (100 MB default)

Configuration Precedence

Settings are applied in this order (highest priority first):

Environment Variables - KREUZBERG_* variables override everything
Configuration File - TOML, YAML, or JSON values
Programmatic Defaults - Hard-coded defaults
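
The precedence order above can be sketched in a few lines of Python. This is only an illustration of the layering, not Kreuzberg's implementation: `resolve_settings` and `DEFAULTS` are hypothetical names, and the sketch assumes each variable is named `KREUZBERG_` plus the upper-cased field name, as in the environment variable examples on this page.

```python
import os
from typing import Any, Mapping

# Hard-coded defaults, copied from the Fields table on this page.
DEFAULTS: dict[str, Any] = {
    "host": "127.0.0.1",
    "port": 8000,
    "max_request_body_bytes": 104_857_600,
}


def resolve_settings(
    file_values: Mapping[str, Any],
    env: Mapping[str, str] = os.environ,
) -> dict[str, Any]:
    """Layer defaults, config-file values, and KREUZBERG_* variables (lowest to highest priority)."""
    resolved = dict(DEFAULTS)
    resolved.update(file_values)  # the config file overrides the defaults
    for key, default in DEFAULTS.items():
        raw = env.get(f"KREUZBERG_{key.upper()}")
        if raw is not None:
            # Environment values arrive as strings; coerce to the default's type.
            resolved[key] = type(default)(raw)
    return resolved


settings = resolve_settings(
    file_values={"host": "0.0.0.0", "port": 9000},
    env={"KREUZBERG_PORT": "3000"},
)
print(settings)
```

Here the file sets both `host` and `port`, but the environment variable wins for `port`, while `max_request_body_bytes` falls back to its default.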

CORS Security Warning

The default configuration (empty cors_origins list) allows requests from any origin. This is suitable for development and internal APIs, but you should explicitly configure cors_origins for production deployments to prevent unauthorized cross-origin requests.

Recommended for production:

cors_origins = ["https://yourdomain.com", "https://app.yourdomain.com"]
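
To make the "empty list allows all origins" rule concrete, here is a minimal Python model of the check. It is a sketch under assumptions: the function name mirrors the Rust helper shown in the examples on this page, but the exact-string comparison is an assumption about matching behavior, not a statement of how the server compares origins.

```python
def is_origin_allowed(cors_origins: list[str], origin: str) -> bool:
    """Return True if `origin` passes the allow-list; an empty list allows everything."""
    if not cors_origins:
        # Default configuration: no allow-list means any origin is accepted.
        return True
    # Assumed: origins are compared as exact strings (scheme + host + optional port).
    return origin in cors_origins


production = ["https://yourdomain.com", "https://app.yourdomain.com"]
print(is_origin_allowed([], "https://evil.com"))          # True
print(is_origin_allowed(production, "https://evil.com"))  # False
```

The first call shows why the default is unsafe for production: with no allow-list, even an untrusted origin passes.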

Configuration Examples

use kreuzberg::core::ServerConfig;

// Basic configuration with defaults
let config = ServerConfig::default();
assert_eq!(config.host, "127.0.0.1");
assert_eq!(config.port, 8000);

// Custom configuration
let mut config = ServerConfig::default();
config.host = "0.0.0.0".to_string();
config.port = 3000;

// Listen address helper
println!("Server listening on: {}", config.listen_addr());

use kreuzberg::core::ServerConfig;

// Allow specific origins only (secure)
let mut config = ServerConfig::default();
config.cors_origins = vec![
    "https://app.example.com".to_string(),
    "https://admin.example.com".to_string(),
];

// Check if origin is allowed
assert!(config.is_origin_allowed("https://app.example.com"));
assert!(!config.is_origin_allowed("https://evil.com"));

// Check if allowing all origins
assert!(!config.cors_allows_all());

use kreuzberg::core::ServerConfig;

// Custom size limits (200 MB)
let mut config = ServerConfig::default();
config.max_request_body_bytes = 200 * 1_048_576;  // 200 MB
config.max_multipart_field_bytes = 200 * 1_048_576;  // 200 MB

// Get sizes in MB
println!("Max request body: {} MB", config.max_request_body_mb());
println!("Max file upload: {} MB", config.max_multipart_field_mb());

use kreuzberg::core::ServerConfig;

// Auto-detect format from extension (.toml, .yaml, .json)
let mut config = ServerConfig::from_file("server.toml")?;

// Or use a format-specific loader (distinct names, so the mutable `config` above is not shadowed)
let toml_config = ServerConfig::from_toml_file("server.toml")?;
let yaml_config = ServerConfig::from_yaml_file("server.yaml")?;
let json_config = ServerConfig::from_json_file("server.json")?;

// Apply environment variable overrides
config.apply_env_overrides()?;

Environment Variable Overrides

All settings can be overridden via environment variables with the KREUZBERG_ prefix:

# Network settings
export KREUZBERG_HOST="0.0.0.0"
export KREUZBERG_PORT="3000"

# CORS configuration (comma-separated)
export KREUZBERG_CORS_ORIGINS="https://app1.com, https://app2.com"

# Size limits (in bytes)
export KREUZBERG_MAX_REQUEST_BODY_BYTES="209715200"      # 200 MB
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES="209715200"   # 200 MB

kreuzberg serve
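
The byte values and the comma-separated origin list above are easy to get wrong by hand. The following Python helpers show one way to compute and check them. Both helper names are hypothetical, and trimming whitespace around each origin is an assumption about how the variable is parsed.

```python
MIB = 1_048_576  # bytes per "MB" as used on this page (100 MB == 104857600)


def mb_to_bytes(megabytes: int) -> int:
    """Convert a size in MB to the byte count expected by the *_BYTES settings."""
    return megabytes * MIB


def split_origins(raw: str) -> list[str]:
    """Split a comma-separated origin list, trimming whitespace and dropping empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]


print(mb_to_bytes(200))  # 209715200
print(split_origins("https://app1.com, https://app2.com"))
```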

Configuration File Examples

TOML Format

# Basic server configuration
host = "0.0.0.0"          # Listen on all interfaces
port = 8000               # API port

# CORS configuration (empty = allow all)
cors_origins = [
    "https://app.example.com",
    "https://admin.example.com"
]

# Upload size limits (default: 100 MB)
max_request_body_bytes = 104857600      # 100 MB
max_multipart_field_bytes = 104857600   # 100 MB

YAML Format

host: 0.0.0.0
port: 8000

cors_origins:
  - https://app.example.com
  - https://admin.example.com

max_request_body_bytes: 104857600
max_multipart_field_bytes: 104857600

JSON Format

{
  "host": "0.0.0.0",
  "port": 8000,
  "cors_origins": ["https://app.example.com", "https://admin.example.com"],
  "max_request_body_bytes": 104857600,
  "max_multipart_field_bytes": 104857600
}

Docker Integration

When deploying Kreuzberg in Docker, use environment variables to configure the server:

FROM kreuzberg:latest

ENV KREUZBERG_HOST="0.0.0.0"
ENV KREUZBERG_PORT="8000"
ENV KREUZBERG_CORS_ORIGINS="https://yourdomain.com"
ENV KREUZBERG_MAX_MULTIPART_FIELD_BYTES="524288000"

EXPOSE 8000

CMD ["kreuzberg", "serve"]

docker run -it \
  -e KREUZBERG_HOST="0.0.0.0" \
  -e KREUZBERG_PORT="3000" \
  -e KREUZBERG_CORS_ORIGINS="https://api.example.com" \
  -p 3000:3000 \
  kreuzberg:latest kreuzberg serve

ExtractionConfig

Main extraction configuration controlling all aspects of document processing.

Field	Type	Default	Description
`use_cache`	`bool`	`true`	Enable caching of extraction results for faster re-processing
`enable_quality_processing`	`bool`	`true`	Enable quality post-processing (deduplication, mojibake fixing, etc.)
`force_ocr`	`bool`	`false`	Force OCR even for searchable PDFs with text layers
`disable_ocr`	`bool`	`false`	Disable OCR entirely — image files return empty content instead of raising errors (v4.7.0+)
`ocr`	`OcrConfig?`	`None`	OCR configuration (if None, OCR disabled)
`pdf_options`	`PdfConfig?`	`None`	PDF-specific configuration options
`images`	`ImageExtractionConfig?`	`None`	Image extraction configuration
`chunking`	`ChunkingConfig?`	`None`	Text chunking configuration for splitting into chunks
`content_filter`	`ContentFilterConfig?` v4.8.0	`None`	Header, footer, watermark, and repeating-text filtering. See ContentFilterConfig.
`token_reduction`	`TokenReductionConfig?`	`None`	Token reduction configuration for optimizing LLM context
`language_detection`	`LanguageDetectionConfig?`	`None`	Automatic language detection configuration
`postprocessor`	`PostProcessorConfig?`	`None`	Post-processing pipeline configuration
`pages`	`PageConfig?`	`None`	Page extraction and tracking configuration
`max_concurrent_extractions`	`int?`	`None`	Maximum concurrent batch extractions (defaults to num_cpus * 2)
`concurrency`	`ConcurrencyConfig?` v4.5.0	`None`	Concurrency configuration for threading (max_threads caps Rayon, ONNX intra-op threads, and batch semaphore)
`result_format`	`OutputFormat`	`Unified`	Result structure format: `Unified` (content in single field) or `ElementBased` (semantic elements array)
`output_format`	`OutputFormat`	`Plain`	Output format for extracted text content (Plain, Markdown, Djot, Html, Structured)
`html_options`	`ConversionOptions?`	`None`	HTML to Markdown conversion options (heading styles, list formatting, code block styles). Only available with `html` feature.
`html_output`	`HtmlOutputConfig?` v4.8.1	`None`	Styled HTML output configuration: theme selection, custom CSS, class prefix. When set alongside `output_format = Html`, activates the styled renderer with `kb-*` class hooks. Only available with `html` feature.
`security_limits`	`SecurityLimits?`	`None` (uses defaults)	Archive security thresholds: max archive size (500MB), compression ratio (100:1), file count (10K), nesting depth, content size, XML depth, table cells. Only available with `archives` feature.
`layout`	`LayoutDetectionConfig?`	`None`	Layout detection configuration for document structure analysis. Only available with `layout-detection` feature.
`acceleration`	`AccelerationConfig?`	`None`	Hardware acceleration configuration for ONNX Runtime inference (layout detection and embeddings). See AccelerationConfig.
`include_document_structure`	`bool`	`false`	Enable structured document model output. When true, the `document` field on ExtractionResult is populated with a tree-based representation of document content.
`tree_sitter`	`TreeSitterConfig?`	`None`	Tree-sitter code intelligence configuration. Controls code analysis features when extracting source code files. Only available with `tree-sitter` feature.
`structured_extraction`	`StructuredExtractionConfig?`	`None`	Structured extraction configuration for LLM-powered schema-based extraction. When set, extraction results include a `structured_output` field with data conforming to the provided JSON schema. Only available with `liter-llm` feature.

Result Format vs Output Format

Important distinction: These two fields control different aspects of extraction results:

result_format - Controls the structure of the result:
- Unified (default): All content returned in the content field as a single string
- ElementBased: Content returned as semantic elements in the elements array (Unstructured-compatible format)

output_format - Controls the text format within the content:
- Plain (default): Raw extracted text
- Markdown: Markdown formatted output
- Djot: Djot markup format
- Html: HTML formatted output
- Structured: Structured JSON with full OCR element data (bounding boxes, confidence)

OutputFormat (result_format field)

Controls the structure of extraction results:

Value	Description
`unified`	All content in single `content` field (default)
`element_based`	Semantic elements with type classification, IDs, and metadata

When result_format is set to ElementBased, the elements field contains an array of semantic elements with unique identifiers, element types (title, heading, narrative_text, etc.), and metadata for Unstructured-compatible processing.

OutputFormat (output_format field)

Output format for extraction content. Controls how extracted text is formatted in the result.

Value	Description
`plain`	Plain text content only (default)
`markdown`	Markdown formatted output
`djot`	Djot markup format
`html`	HTML formatted output
`structured`	Structured JSON with full OCR element data (bounding boxes, confidence)

Environment Variable: KREUZBERG_OUTPUT_FORMAT - Set output format via environment (plain, markdown, djot, html, structured)

HtmlOutputConfig

Configuration for the styled HTML renderer. When set on ExtractionConfig.html_output alongside output_format = Html, the pipeline produces HTML with semantic kb-* class hooks instead of plain HTML.

Field	Type	Default	Description
`theme`	`HtmlTheme`	`Unstyled`	Built-in colour/typography theme
`css`	`string?`	`None`	Inline CSS string appended after theme stylesheet
`css_file`	`path?`	`None`	CSS file loaded at render time (max 1 MiB)
`class_prefix`	`string`	`"kb-"`	CSS class prefix (alphanumeric + hyphens + underscores only)
`embed_css`	`bool`	`true`	Embed CSS in `<style>` block. Set `false` for external stylesheets

HtmlTheme

Built-in theme selection for styled HTML output.

Value	Description
`Unstyled` (default)	No built-in stylesheet. CSS custom properties defined on `:root` for user stylesheets
`Default`	System font stack, neutral colours, readable line measure
`GitHub`	GitHub Markdown-inspired palette and spacing
`Dark`	Dark background, light text
`Light`	Minimal light theme with generous whitespace

Example

using Kreuzberg;

var config = new ExtractionConfig
{
    UseCache = true,
    EnableQualityProcessing = true,
    ForceOcr = false,
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

package main

import (
  "log"

  "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func main() {
  useCache := true
  enableQP := true

  result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
    UseCache:                &useCache,
    EnableQualityProcessing: &enableQP,
  })
  if err != nil {
    log.Fatalf("extract failed: %v", err)
  }

  log.Println("content length:", len(result.Content))
}

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .useCache(true)
    .enableQualityProcessing(true)
    .build();
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);

import asyncio
from kreuzberg import extract_file, ExtractionConfig

async def main() -> None:
    config = ExtractionConfig(
        use_cache=True,
        enable_quality_processing=True
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  use_cache: true,
  enable_quality_processing: true
)

result = Kreuzberg.extract_file_sync('document.pdf', config: config)

library(kreuzberg)

file_path <- "document.pdf"

config <- extraction_config(
  output_format = "markdown"
)

result <- extract_file_sync(file_path, config = config)

cat(sprintf("MIME type: %s\n", result$mime_type))
cat(sprintf("Content length: %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))

use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        use_cache: true,
        enable_quality_processing: true,
        ..Default::default()
    };

    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}

import { extractFile } from '@kreuzberg/node';

const config = {
  useCache: true,
  enableQualityProcessing: true,
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

FileExtractionConfig v4.5.0

Per-file extraction configuration overrides for batch operations. All fields are optional — None means “use the batch-level default from ExtractionConfig.”

Pass it as an optional parameter to batch_extract_file / batch_extract_bytes (or their sync variants) so that each file in the batch can specify its own overrides, which are merged with the shared batch-level ExtractionConfig.

Overridable Fields

Field	Type	Description
`enable_quality_processing`	`bool?`	Override quality post-processing for this file
`ocr`	`OcrConfig?`	Override OCR configuration
`force_ocr`	`bool?`	Override force OCR
`disable_ocr`	`bool?`	Override disable OCR (v4.7.0+)
`chunking`	`ChunkingConfig?`	Override text chunking
`content_filter`	`ContentFilterConfig?`	Override content filtering
`images`	`ImageExtractionConfig?`	Override image extraction
`pdf_options`	`PdfConfig?`	Override PDF-specific options
`token_reduction`	`TokenReductionConfig?`	Override token reduction
`language_detection`	`LanguageDetectionConfig?`	Override language detection
`pages`	`PageConfig?`	Override page extraction
`keywords`	`KeywordConfig?`	Override keyword extraction
`postprocessor`	`PostProcessorConfig?`	Override post-processing
`html_options`	`ConversionOptions?`	Override HTML conversion options
`result_format`	`OutputFormat?`	Override result structure format
`output_format`	`OutputFormat?`	Override output content format
`include_document_structure`	`bool?`	Override document structure output
`layout`	`LayoutDetectionConfig?`	Override layout detection

Batch-Level Only Fields (Not Overridable)

These ExtractionConfig fields cannot be overridden per file:

max_concurrent_extractions — controls batch parallelism
use_cache — global caching policy
acceleration — shared ONNX execution provider
security_limits — global archive security policy

Merge Semantics

For each file in a batch, the effective configuration is computed by overlaying the per-file FileExtractionConfig onto the batch-level ExtractionConfig. A field set to None in FileExtractionConfig falls through to the batch default. A field set to Some(value) replaces the batch default entirely for that file.
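
The overlay rule can be modeled with two small dataclasses. This is a sketch of the semantics only: `BatchConfig`, `FileOverrides`, and `effective_config` are hypothetical stand-ins that carry three fields each, not the library's types.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class BatchConfig:
    """Stand-in for a few ExtractionConfig fields."""
    force_ocr: bool = False
    enable_quality_processing: bool = True
    output_format: str = "plain"


@dataclass(frozen=True)
class FileOverrides:
    """Stand-in for FileExtractionConfig: every field is optional."""
    force_ocr: Optional[bool] = None
    enable_quality_processing: Optional[bool] = None
    output_format: Optional[str] = None


def effective_config(batch: BatchConfig, overrides: Optional[FileOverrides]) -> BatchConfig:
    """Overlay per-file overrides onto the batch config; None falls through."""
    if overrides is None:
        return batch
    changes = {
        f.name: getattr(overrides, f.name)
        for f in fields(overrides)
        if getattr(overrides, f.name) is not None
    }
    return replace(batch, **changes)


batch = BatchConfig(output_format="markdown")
merged = effective_config(batch, FileOverrides(force_ocr=True))
print(merged)
```

Note that an explicit `False` override still replaces the batch value; only `None` means "inherit".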

Example

use kreuzberg::{
    batch_extract_file, ExtractionConfig, FileExtractionConfig, OcrConfig,
};
use std::path::PathBuf;

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let batch_config = ExtractionConfig::default();

    let paths = vec![
        PathBuf::from("report.pdf"),
        PathBuf::from("scanned.pdf"),
    ];

    let file_configs = vec![
        None, // Use batch defaults for this PDF
        Some(FileExtractionConfig { // Force OCR for this scanned document
            force_ocr: Some(true),
            ocr: Some(OcrConfig {
                backend: "tesseract".to_string(),
                language: "deu".to_string(),
                ..Default::default()
            }),
            ..Default::default()
        }),
    ];

    let results = batch_extract_file(paths, &batch_config, Some(&file_configs)).await?;
    Ok(())
}

from kreuzberg import (
    batch_extract_files_sync,
    ExtractionConfig,
    FileExtractionConfig,
    OcrConfig,
)

config = ExtractionConfig()

paths = ["report.pdf", "scanned.pdf"]
file_configs = [
    None,  # use batch defaults
    FileExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(backend="tesseract", language="deu"),
    ),
]

results = batch_extract_files_sync(paths, config, file_configs=file_configs)

import { batchExtractFilesSync } from '@kreuzberg/node';

const results = batchExtractFilesSync(
  ['report.pdf', 'scanned.pdf'],
  undefined, // use default config
  [
    null,  // use batch defaults
    {      // per-file overrides
      forceOcr: true,
      ocr: { backend: 'tesseract', language: 'deu' },
    },
  ],
);

ContentFilterConfig v4.8.0

Controls whether headers, footers, watermarks, and repeating cross-page text are kept in or stripped from extraction output. Applies to PDF, DOCX, RTF, ODT, HTML, EPUB, and PPT extractors with format-specific behavior.

When content_filter is None on ExtractionConfig, each extractor uses its built-in defaults (the same values listed below).

Fields

Field	Type	Default	Description
`include_headers`	`bool`	`False`	Keep running headers. PDF skips top-margin furniture stripping; DOCX includes header parts; HTML/EPUB keep `<header>` content.
`include_footers`	`bool`	`False`	Keep running footers. PDF skips bottom-margin furniture stripping; DOCX includes footer parts; HTML/EPUB keep `<footer>` content.
`strip_repeating_text`	`bool`	`True`	Detect text that repeats verbatim across most pages and remove it. Disable if brand names or repeated headings are being incorrectly stripped. Primarily PDF.
`include_watermarks`	`bool`	`False`	Keep watermark text and arXiv-style identifiers. PDF only.

The strip_repeating_text flag also gates paragraph deduplication: when set to False, near-duplicate paragraphs are preserved as well (kreuzberg/kreuzberg#681, fixed in v4.8.1).
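
To show why strip_repeating_text can remove brand names or repeated headings, here is a simplified Python heuristic for cross-page repetition. It is not Kreuzberg's algorithm: the function name, the line-level granularity, and the `min_fraction` threshold of 0.6 are all assumptions chosen for illustration.

```python
from collections import Counter


def strip_repeating_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines that appear verbatim on at least `min_fraction` of the pages."""
    if len(pages) < 2:
        return list(pages)  # repetition across pages is undefined for a single page
    page_lines = [page.splitlines() for page in pages]
    # Count each distinct non-blank line once per page.
    counts = Counter(
        line for lines in page_lines for line in set(lines) if line.strip()
    )
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [
        "\n".join(line for line in lines if line not in repeated)
        for lines in page_lines
    ]


pages = [
    "ACME Corp\nIntroduction",
    "ACME Corp\nMethods",
    "ACME Corp\nResults",
]
print(strip_repeating_lines(pages))
```

In this example the line "ACME Corp" is removed from every page because it repeats on all three, which is the kind of false positive the flag lets you disable.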

When a layout-detection model is active, it can independently classify regions as PageHeader or PageFooter and strip them per page. To preserve those regions in addition to disabling the cross-page heuristic, set include_headers = True and/or include_footers = True.

Configuration Examples

from kreuzberg import ExtractionConfig, ContentFilterConfig

# Keep headers and footers for legal/forms work
config = ExtractionConfig(
    content_filter=ContentFilterConfig(
        include_headers=True,
        include_footers=True,
    ),
)

import { extract } from "@kreuzberg/node";

// Disable cross-page repeating-text detection
const result = await extract("report.pdf", {
  contentFilter: {
    stripRepeatingText: false,
  },
});

use kreuzberg::{ExtractionConfig, ContentFilterConfig};

let config = ExtractionConfig {
    content_filter: Some(ContentFilterConfig {
        include_headers: true,
        include_footers: true,
        strip_repeating_text: true,
        include_watermarks: false,
    }),
    ..Default::default()
};

[content_filter]
include_headers = true
include_footers = true
strip_repeating_text = true
include_watermarks = false

content_filter:
  include_headers: true
  include_footers: true
  strip_repeating_text: true
  include_watermarks: false

OcrConfig

Configuration for OCR (Optical Character Recognition) processing on images and scanned PDFs.

Field	Type	Default	Description
`backend`	`str`	`"tesseract"`	OCR backend to use: `"tesseract"`, `"easyocr"`, `"paddleocr"`
`language`	`str`	`"eng"`	Language code(s) for OCR, for example, `"eng"`, `"eng+fra"`, `"eng+deu+fra"`
`tesseract_config`	`TesseractConfig?`	`None`	Tesseract-specific configuration options
`paddle_ocr_config`	`PaddleOcrConfig?`	`None`	PaddleOCR-specific configuration options
`vlm_config`	`LlmConfig?`	`None`	Vision Language Model configuration for VLM-based OCR. When set, enables using a VLM as an OCR backend. Requires the `liter-llm` feature.
`vlm_prompt`	`String?`	`None`	Custom prompt for VLM-based OCR. Overrides the default OCR prompt sent to the vision model. Useful for domain-specific extraction instructions.

Example

using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Backend = "tesseract",
        Language = "eng+fra",
        TesseractConfig = new TesseractConfig { Psm = 3 }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine(result.Content);

package main

import "github.com/kreuzberg-dev/kreuzberg-lts/v4"

func main() {
  language := "eng+fra"
  psm := 3

  _ = &kreuzberg.ExtractionConfig{
    OCR: &kreuzberg.OCRConfig{
      Backend:  "tesseract",
      Language: &language,
      Tesseract: &kreuzberg.TesseractConfig{
        PSM: &psm,
      },
    },
  }
}

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .language("eng+fra")
        .tesseractConfig(TesseractConfig.builder()
            .psm(3)
            .build())
        .build())
    .build();

import asyncio
from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            backend="tesseract", language="eng+fra",
            tesseract_config=TesseractConfig(psm=3)
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    backend: 'tesseract',
    language: 'eng+fra',
    tesseract_config: Kreuzberg::Config::Tesseract.new(psm: 3)
  )
)

library(kreuzberg)

ocr_cfg <- ocr_config(backend = "tesseract", language = "eng", dpi = 300L)
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)

result <- extract_file_sync("document.pdf", "application/pdf", config)
cat(sprintf("Extracted content length: %d\n", nchar(result$content)))
cat(sprintf("Detected language: %s\n", result$detected_language))

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng+deu+fra".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("multilingual.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

import { extractFile } from '@kreuzberg/node';

const config = {
  ocr: {
    backend: 'tesseract',
    language: 'eng+fra',
    tesseractConfig: {
      psm: 3,
    },
  },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

PaddleOcrConfig v4.5.0

PaddleOCR-specific configuration for model selection and detection tuning.

Field	Type	Default	Description
`model_tier` v4.5.0	`str`	`"mobile"`	Model tier: `"mobile"` (lightweight, ~21MB total, fast) or `"server"` (high accuracy, ~172MB, best with GPU)
`padding` v4.5.0	`int`	`10`	Padding in pixels (0-100) added around the image before detection

TesseractConfig

Tesseract OCR engine configuration with fine-grained control over recognition parameters.

Field	Type	Default	Description
`language`	`str`	`"eng"`	Language code(s), for example, `"eng"`, `"eng+fra"`
`psm`	`int`	`3`	Page Segmentation Mode (0-13, see below)
`output_format`	`str`	`"markdown"`	Output format: `"text"`, `"markdown"`, `"hocr"`
`oem`	`int`	`3`	OCR Engine Mode (0-3, see below)
`min_confidence`	`float`	`0.0`	Minimum confidence threshold (0.0-100.0)
`preprocessing`	`ImagePreprocessingConfig?`	`None`	Image preprocessing configuration
`enable_table_detection`	`bool`	`true`	Enable automatic table detection and reconstruction
`table_min_confidence`	`float`	`0.0`	Minimum confidence for table cell recognition (0.0-1.0)
`table_column_threshold`	`int`	`50`	Pixel threshold for detecting table columns
`table_row_threshold_ratio`	`float`	`0.5`	Row threshold ratio for table detection (0.0-1.0)
`use_cache`	`bool`	`true`	Enable OCR result caching for faster re-processing
`classify_use_pre_adapted_templates`	`bool`	`true`	Use pre-adapted templates for character classification
`language_model_ngram_on`	`bool`	`false`	Enable N-gram language model for better word recognition
`tessedit_dont_blkrej_good_wds`	`bool`	`true`	Don’t reject good words during block-level processing
`tessedit_dont_rowrej_good_wds`	`bool`	`true`	Don’t reject good words during row-level processing
`tessedit_enable_dict_correction`	`bool`	`true`	Enable dictionary-based word correction
`tessedit_char_whitelist`	`str`	`""`	Allowed characters (empty = all allowed)
`tessedit_char_blacklist`	`str`	`""`	Forbidden characters (empty = none forbidden)
`tessedit_use_primary_params_model`	`bool`	`true`	Use primary language params model
`textord_space_size_is_variable`	`bool`	`true`	Enable variable-width space detection
`thresholding_method`	`bool`	`false`	Use adaptive thresholding method

Page Segmentation Modes (PSM)

0: Orientation and script detection only (no OCR)
1: Automatic page segmentation with OSD (Orientation and Script Detection)
2: Automatic page segmentation (no OSD, no OCR)
3: Fully automatic page segmentation, but no OSD (default, best for most documents)
4: Single column of text of variable sizes
5: Single uniform block of vertically aligned text
6: Single uniform block of text (best for clean documents)
7: Single text line
8: Single word
9: Single word in a circle
10: Single character
11: Sparse text with no particular order (best for forms, invoices)
12: Sparse text with OSD
13: Raw line (bypass Tesseract’s layout analysis)

OCR Engine Modes (OEM)

0: Legacy Tesseract engine only (pre-2016)
1: Neural nets LSTM engine only (recommended for best quality)
2: Legacy + LSTM engines combined
3: Default based on what’s available (recommended for compatibility)
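
The documented ranges for psm, oem, and min_confidence can be checked before a config is built. The helper below is a hypothetical pre-flight check written for this page. It does not reproduce the library's own validation, and it takes its bounds directly from the TesseractConfig table.

```python
def validate_tesseract_options(
    psm: int = 3,
    oem: int = 3,
    min_confidence: float = 0.0,
) -> None:
    """Raise ValueError if a value falls outside the ranges documented on this page."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be in 0-13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be in 0-3, got {oem}")
    if not 0.0 <= min_confidence <= 100.0:
        raise ValueError(f"min_confidence must be in 0.0-100.0, got {min_confidence}")


# Clean single-block document, LSTM engine only.
validate_tesseract_options(psm=6, oem=1, min_confidence=80.0)
```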

Example

using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Language = "eng+fra+deu",
        TesseractConfig = new TesseractConfig
        {
            Psm = 6,
            Oem = 1,
            MinConfidence = 0.8m,
            EnableTableDetection = true
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");

package main

import (
  "log"

  "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func main() {
  psm := 6
  oem := 1
  minConf := 0.8
  lang := "eng+fra+deu"
  whitelist := "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?"

  config := &kreuzberg.ExtractionConfig{
    OCR: &kreuzberg.OCRConfig{
      Backend:  "tesseract",
      Language: &lang,
      Tesseract: &kreuzberg.TesseractConfig{
        PSM:                   &psm,
        OEM:                   &oem,
        MinConfidence:         &minConf,
        EnableTableDetection:  kreuzberg.BoolPtr(true),
        TesseditCharWhitelist: whitelist,
      },
    },
  }

  result, err := kreuzberg.ExtractFileSync("document.pdf", config)
  if err != nil {
    log.Fatalf("extract failed: %v", err)
  }

  log.Println("content length:", len(result.Content))
}

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .language("eng+fra+deu")
        .tesseractConfig(TesseractConfig.builder()
            .psm(6)
            .oem(1)
            .minConfidence(0.8)
            .tesseditCharWhitelist("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?")
            .enableTableDetection(true)
            .build())
        .build())
    .build();

import asyncio
from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            language="eng+fra+deu",
            tesseract_config=TesseractConfig(
                psm=6,
                oem=1,
                min_confidence=0.8,
                enable_table_detection=True,
            ),
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    language: 'eng+fra+deu',
    tesseract_config: Kreuzberg::Config::Tesseract.new(
      psm: 6,
      oem: 1,
      min_confidence: 0.8,
      tessedit_char_whitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?',
      enable_table_detection: true
    )
  )
)

library(kreuzberg)

ocr_cfg <- ocr_config(
  backend = "tesseract",
  language = "eng+deu",
  dpi = 300L
)
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("Detected language: %s\n", result$detected_language))
cat(sprintf("Content length: %d characters\n", nchar(result$content)))

use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig};

fn main() {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            language: "eng+fra+deu".to_string(),
            tesseract_config: Some(TesseractConfig {
                psm: 6,
                oem: 1,
                min_confidence: 0.8,
                tessedit_char_whitelist: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?".to_string(),
                enable_table_detection: true,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };
    println!("{:?}", config.ocr);
}

import { extractFile } from '@kreuzberg/node';

const config = {
  ocr: {
    backend: 'tesseract',
    language: 'eng+fra+deu',
    tesseractConfig: {
      psm: 6,
      tesseditCharWhitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?',
      enableTableDetection: true,
    },
  },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

ChunkingConfig

Configuration for splitting extracted text into overlapping chunks, useful for vector databases and LLM processing.

Field	Type	Default	Description
`max_characters`	`int`	`1000`	Maximum characters per chunk
`overlap`	`int`	`200`	Overlap between consecutive chunks in characters
`embedding`	`EmbeddingConfig?`	`None`	Optional embedding generation for each chunk
`preset`	`str?`	`None`	Chunking preset: `"small"` (500/100), `"medium"` (1000/200), `"large"` (2000/400)
`trim`	`bool`	`true`	Whether to trim whitespace from chunk boundaries
`chunker_type`	`ChunkerType`	`Text`	Type of chunker: `Text`, `Markdown`, `Yaml`, or `Semantic`. Set to `"semantic"` for topic-aware chunking, which needs no extra configuration.
`topic_threshold`	`float` / `None`	`0.75`	Optional. Cosine similarity threshold for topic boundary detection (0.0-1.0). Only used with `chunker_type="semantic"` and an embedding config. Rarely needs tuning.
`sizing` v4.5.0	`ChunkSizing`	`Characters`	Controls how chunk size is measured. `Characters` counts characters (default). `Tokenizer` counts tokens using a HuggingFace tokenizer model. Requires the `chunking-tokenizers` feature.

Note: max_chars and max_overlap are accepted as aliases for max_characters and overlap respectively for backwards compatibility.
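To make the interaction between max_characters and overlap concrete, here is a minimal Python sketch of a plain sliding window. It is an illustration under simplifying assumptions, not Kreuzberg's chunker: the helper name chunk_text is hypothetical, and a real chunker would normally also respect word or sentence boundaries and the trim setting. The PRESETS mapping mirrors the preset values listed in the table above.

```python
# Illustrative only: a naive sliding window, not Kreuzberg's implementation.
# Preset sizes copied from the ChunkingConfig table: (max_characters, overlap).
PRESETS = {"small": (500, 100), "medium": (1000, 200), "large": (2000, 400)}


def chunk_text(text: str, max_characters: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_characters characters.

    Each window starts (max_characters - overlap) characters after the
    previous one, so consecutive windows share `overlap` characters.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    step = max_characters - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        # Stop once the current window already reaches the end of the text.
        if start + max_characters >= len(text):
            break
        start += step
    return chunks
```

With the defaults (1000/200), each new chunk begins 800 characters after the previous one, so a 1200-character text yields two chunks of 1000 and 400 characters.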

When chunker_type is set to "markdown", the chunker populates heading_context on each chunk’s metadata with the heading hierarchy (for example, # Title > ## Section) that the chunk falls under. This is useful for preserving semantic context in RAG pipelines.
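The heading hierarchy described above can be pictured as a stack of open headings. The following Python sketch shows one way such a breadcrumb could be computed for each line of a Markdown document. The function name heading_breadcrumbs is hypothetical, and the sketch ignores fenced code blocks and Setext-style headings, so treat it as a simplified model rather than the library's algorithm.

```python
import re

# ATX headings only ("# Title" through "###### Title"); an assumption of this sketch.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_breadcrumbs(markdown: str) -> list[tuple[str, str]]:
    """Return (breadcrumb, line) pairs for every non-blank, non-heading line."""
    stack: list[tuple[int, str]] = []  # open headings as (level, text)
    result: list[tuple[str, str]] = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            # A new heading closes every open heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        elif line.strip():
            crumb = " > ".join(f"{'#' * lvl} {text}" for lvl, text in stack)
            result.append((crumb, line))
    return result
```

For a document with "# Title" followed by "## Section", body text under the section is paired with the breadcrumb "# Title > ## Section".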

When chunker_type is set to "semantic", the chunker groups paragraphs by topic similarity. No extra configuration is required: the defaults (max_characters=1000, overlap=200, topic_threshold=0.75) are tuned for typical RAG use cases. If an embedding config is provided, adjacent segments are compared and the text is split at topic boundaries where cosine similarity falls below topic_threshold. Without embeddings, the chunker falls back to structural-only splitting.
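The topic-boundary rule can be sketched in a few lines of Python. The example below assumes that each segment already has an embedding vector and starts a new group whenever the cosine similarity between neighbouring segments drops below the threshold. The function names and the toy two-dimensional vectors are illustrative assumptions, not part of the Kreuzberg API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_at_topic_boundaries(
    segments: list[str],
    embeddings: list[list[float]],
    topic_threshold: float = 0.75,
) -> list[list[str]]:
    """Group adjacent segments; start a new group when similarity < threshold."""
    if len(segments) != len(embeddings):
        raise ValueError("segments and embeddings must have the same length")
    if not segments:
        return []
    groups = [[segments[0]]]
    for i in range(1, len(segments)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < topic_threshold:
            groups.append([])  # topic boundary detected
        groups[-1].append(segments[i])
    return groups
```

A higher topic_threshold produces more, smaller groups because more neighbouring pairs fall below it; a lower value merges more segments together.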

Example

using Kreuzberg;

class Program
{
    static async Task Main()
    {
        var config = new ExtractionConfig
        {
            Chunking = new ChunkingConfig
            {
                MaxChars = 1000,
                MaxOverlap = 200,
                Embedding = new EmbeddingConfig
                {
                    Model = EmbeddingModelType.Preset("all-minilm-l6-v2"),
                    Normalize = true,
                    BatchSize = 32
                }
            }
        };

        try
        {
            var result = await KreuzbergClient.ExtractFileAsync(
                "document.pdf",
                config
            ).ConfigureAwait(false);

            Console.WriteLine($"Chunks: {result.Chunks.Count}");
            foreach (var chunk in result.Chunks)
            {
                Console.WriteLine($"Content length: {chunk.Content.Length}");
                if (chunk.Embedding != null)
                {
                    Console.WriteLine($"Embedding dimensions: {chunk.Embedding.Length}");
                }
            }
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }

    static async Task PrependHeadingContextExample()
    {
        var config = new ExtractionConfig
        {
            Chunking = new ChunkingConfig
            {
                MaxChars = 500,
                MaxOverlap = 50,
                PrependHeadingContext = true
            }
        };

        try
        {
            var result = await KreuzbergClient.ExtractFileAsync(
                "document.md",
                config
            ).ConfigureAwait(false);

            foreach (var chunk in result.Chunks)
            {
                // Each chunk's content is prefixed with its heading breadcrumb
                Console.WriteLine(chunk.Content[..Math.Min(100, chunk.Content.Length)]);
            }
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}

package main

import (
  "fmt"

  "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func main() {
  maxChars := 1000
  maxOverlap := 200
  config := &kreuzberg.ExtractionConfig{
    Chunking: &kreuzberg.ChunkingConfig{
      MaxChars:   &maxChars,
      MaxOverlap: &maxOverlap,
    },
  }

  fmt.Printf("Config: MaxChars=%d, MaxOverlap=%d\n", *config.Chunking.MaxChars, *config.Chunking.MaxOverlap)
}

package main

import (
  "fmt"

  "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func main() {
  maxChars := 500
  maxOverlap := 50

  config := &kreuzberg.ExtractionConfig{
    Chunking: &kreuzberg.ChunkingConfig{
      MaxChars:   &maxChars,
      MaxOverlap: &maxOverlap,
      Sizing: &kreuzberg.ChunkSizingConfig{
        Type:  "tokenizer",
        Model: "Xenova/gpt-4o",
      },
    },
  }

  result, err := kreuzberg.ExtractFile("document.md", nil, config)
  if err != nil {
    panic(err)
  }

  for _, chunk := range result.Chunks {
    if chunk.Metadata != nil && chunk.Metadata.HeadingContext != nil {
      for _, heading := range chunk.Metadata.HeadingContext.Headings {
        fmt.Printf("Heading L%d: %s\n", heading.Level, heading.Text)
      }
    }
    fmt.Printf("Content: %.100s...\n", chunk.Content)
  }
}

package main

import (
  "fmt"

  "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func boolPtr(b bool) *bool { return &b }

func main() {
  maxChars := 500
  maxOverlap := 50

  config := &kreuzberg.ExtractionConfig{
    Chunking: &kreuzberg.ChunkingConfig{
      MaxChars:              &maxChars,
      MaxOverlap:            &maxOverlap,
      PrependHeadingContext: boolPtr(true),
    },
  }

  result, err := kreuzberg.ExtractFile("document.md", nil, config)
  if err != nil {
    panic(err)
  }

  for _, chunk := range result.Chunks {
    // Each chunk's content is prefixed with its heading breadcrumb
    fmt.Printf("Content: %.100s...\n", chunk.Content)
  }
}

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .chunking(ChunkingConfig.builder()
        .maxChars(1000)
        .maxOverlap(200)
        .build())
    .build();

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;
import dev.kreuzberg.HeadingContext;
import dev.kreuzberg.HeadingLevel;

ExtractionConfig config = ExtractionConfig.builder()
    .chunking(ChunkingConfig.builder()
        .chunkerType("markdown")
        .maxChars(500)
        .maxOverlap(50)
        .sizingTokenizer("Xenova/gpt-4o")
        .build())
    .build();

ExtractionResult result = KreuzbergClient.extractFile("document.md", config);

result.getChunks().forEach(chunk -> {
    var headingContext = chunk.getMetadata().getHeadingContext();
    if (headingContext.isPresent()) {
        System.out.println("Headings:");
        headingContext.get().getHeadings().forEach(heading ->
            System.out.println("  Level " + heading.getLevel() + ": " + heading.getText())
        );
    }
});

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .chunking(ChunkingConfig.builder()
        .chunkerType("markdown")
        .maxChars(500)
        .maxOverlap(50)
        .prependHeadingContext(true)
        .build())
    .build();

ExtractionResult result = KreuzbergClient.extractFile("document.md", config);

result.getChunks().forEach(chunk -> {
    // Each chunk's content is prefixed with its heading breadcrumb
    System.out.println(chunk.getContent().substring(0, Math.min(100, chunk.getContent().length())));
});

import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            max_chars=1000,
            max_overlap=200,
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Chunks: {len(result.chunks or [])}")
    for chunk in result.chunks or []:
        print(f"Length: {len(chunk.content)}")

asyncio.run(main())

import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            chunker_type="markdown",
            max_chars=500,
            max_overlap=50,
            sizing_type="tokenizer",
            sizing_model="Xenova/gpt-4o",
        )
    )
    result = await extract_file("document.md", config=config)
    for chunk in result.chunks or []:
        heading_context = chunk.metadata.get("heading_context")
        if heading_context:
            headings = heading_context.get("headings", [])
            for h in headings:
                print(f"Heading L{h['level']}: {h['text']}")
        print(f"Content: {chunk.content[:100]}...")

asyncio.run(main())

import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(chunker_type="semantic")
    )
    result = await extract_file("document.pdf", config=config)
    for chunk in result.chunks or []:
        print(f"Content: {chunk.content[:100]}...")

asyncio.run(main())

import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            chunker_type="markdown",
            max_chars=500,
            max_overlap=50,
            prepend_heading_context=True,
        )
    )
    result = await extract_file("document.md", config=config)
    for chunk in result.chunks or []:
        # Each chunk's content is prefixed with its heading breadcrumb
        print(f"Content: {chunk.content[:100]}...")

asyncio.run(main())

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(
    max_characters: 1000,
    overlap: 200
  )
)

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(
    chunker_type: "markdown",
    max_characters: 500,
    overlap: 50,
    sizing_type: "tokenizer",
    sizing_model: "Xenova/gpt-4o"
  )
)

result = Kreuzberg.extract_file("document.md", config)

result.chunks.each do |chunk|
  if chunk.metadata.heading_context
    puts "Headings:"
    chunk.metadata.heading_context.headings.each do |heading|
      puts "  #{' ' * (heading.level - 1) * 2}Level #{heading.level}: #{heading.text}"
    end
  end
end

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(
    chunker_type: "markdown",
    max_characters: 500,
    overlap: 50,
    prepend_heading_context: true
  )
)

result = Kreuzberg.extract_file("document.md", config)

result.chunks.each do |chunk|
  # Each chunk's content is prefixed with its heading breadcrumb
  puts chunk.content[0, 100]
end

library(kreuzberg)

# Example 1: Basic character-based chunking
chunking_cfg <- chunking_config(max_characters = 1000L, overlap = 200L)
config <- extraction_config(chunking = chunking_cfg)

result <- extract_file_sync("document.pdf", "application/pdf", config)
num_chunks <- length(result$chunks)
cat(sprintf("Document split into %d chunks\n", num_chunks))
for (i in seq_len(min(3L, num_chunks))) {
  cat(sprintf("Chunk %d: %d characters\n", i, nchar(result$chunks[[i]]$content)))
}

# Example 2: Markdown chunker with token-based sizing and heading context
chunking_cfg2 <- chunking_config(
  chunker_type = "markdown",
  sizing = list(
    type = "tokenizer",
    model = "Xenova/gpt-4o"
  )
)
config2 <- extraction_config(chunking = chunking_cfg2)

result2 <- extract_file_sync("document.md", "text/markdown", config2)
num_chunks2 <- length(result2$chunks)
cat(sprintf("\nMarkdown document split into %d chunks\n", num_chunks2))

for (i in seq_len(min(3L, num_chunks2))) {
  chunk <- result2$chunks[[i]]
  cat(sprintf("\nChunk %d:\n", i))
  cat(sprintf("  Preview: %s...\n", substr(chunk$content, 1, 60)))

  # Access heading context
  if (!is.null(chunk$metadata$heading_context)) {
    headings <- chunk$metadata$heading_context$headings
    if (length(headings) > 0) {
      cat("  Headings in context:\n")
      for (h in headings) {
        cat(sprintf("    - Level %d: %s\n", h$level, h$text))
      }
    }
  }
}

# Example 3: Prepend heading context to chunk content
chunking_cfg3 <- chunking_config(
  chunker_type = "markdown",
  prepend_heading_context = TRUE
)
config3 <- extraction_config(chunking = chunking_cfg3)

result3 <- extract_file_sync("document.md", "text/markdown", config3)
num_chunks3 <- length(result3$chunks)
cat(sprintf("\nDocument split into %d chunks with prepended headings\n", num_chunks3))

for (i in seq_len(min(3L, num_chunks3))) {
  chunk <- result3$chunks[[i]]
  # Each chunk's content is prefixed with its heading breadcrumb
  cat(sprintf("Chunk %d: %s...\n", i, substr(chunk$content, 1, 80)))
}

use kreuzberg::{ExtractionConfig, ChunkingConfig};

let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    ..Default::default()
};

use kreuzberg::{ExtractionConfig, ChunkingConfig, ChunkerType};

let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        chunker_type: ChunkerType::Semantic,
        ..Default::default()
    }),
    ..Default::default()
};

use kreuzberg::{ExtractionConfig, ChunkingConfig, ChunkerType};

let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_characters: 500,
        overlap: 50,
        chunker_type: ChunkerType::Markdown,
        prepend_heading_context: true,
        ..Default::default()
    }),
    ..Default::default()
};

import { extractFile } from '@kreuzberg/node';

const config = {
  chunking: {
    maxChars: 1000,
    maxOverlap: 200,
  },
};

const result = await extractFile('document.pdf', null, config);
console.log(`Total chunks: ${result.chunks?.length ?? 0}`);

import { extractFile } from '@kreuzberg/node';

const config = {
  chunking: {
    chunkerType: 'markdown',
    maxChars: 500,
    maxOverlap: 50,
    sizingType: 'tokenizer',
    sizingModel: 'Xenova/gpt-4o',
  },
};

const result = await extractFile('document.md', null, config);
for (const chunk of result.chunks ?? []) {
  const headings = chunk.metadata?.headingContext?.headings ?? [];
  for (const heading of headings) {
    console.log(`Heading L${heading.level}: ${heading.text}`);
  }
  console.log(`Content: ${chunk.content.slice(0, 100)}...`);
}

import { extractFile } from '@kreuzberg/node';

const config = {
  chunking: {
    chunkerType: 'semantic',
  },
};

const result = await extractFile('document.pdf', null, config);
for (const chunk of result.chunks ?? []) {
  console.log(`Content: ${chunk.content.slice(0, 100)}...`);
}

import { extractFile } from '@kreuzberg/node';

const config = {
  chunking: {
    chunkerType: 'markdown',
    maxChars: 500,
    maxOverlap: 50,
    prependHeadingContext: true,
  },
};

const result = await extractFile('document.md', null, config);
for (const chunk of result.chunks ?? []) {
  // Each chunk's content is prefixed with its heading breadcrumb
  console.log(`Content: ${chunk.content.slice(0, 100)}...`);
}

EmbeddingConfig

Configuration for generating vector embeddings for text chunks. Enables semantic search and similarity matching by converting text into high-dimensional vector representations.

Overview

EmbeddingConfig is used to control embedding generation when chunking documents. It allows you to choose from pre-optimized models or specify custom models from HuggingFace. Embeddings can be generated for each chunk to enable vector database integration and semantic search capabilities.

Fields

Field	Type	Default	Description
`model`	`EmbeddingModelType`	`Preset { name: "balanced" }`	Embedding model selection (preset or custom)
`batch_size`	`usize`	`32`	Number of texts to process in each batch (higher = faster but more memory)
`normalize`	`bool`	`true`	Normalize embedding vectors to unit length (recommended for cosine similarity)
`show_download_progress`	`bool`	`false`	Show progress when downloading model files
`cache_dir`	`String?`	`~/.cache/kreuzberg/embeddings/`	Custom cache directory for downloaded models

Model Types

Preset Models (Recommended)

Preset models are pre-optimized configurations for common use cases. They automatically download and cache the necessary model files.

Preset	Model	Dims	Speed	Quality	Use Case
`fast`	AllMiniLML6V2Q	384	Very Fast	Good	Development, prototyping, resource-constrained environments
`balanced`	BGEBaseENV15	768	Fast	Excellent	Default: General-purpose RAG, production deployments, English documents
`quality`	BGELargeENV15	1024	Moderate	Outstanding	Complex documents, maximum accuracy, sufficient compute resources
`multilingual`	MultilingualE5Base	768	Fast	Excellent	International documents, 100+ languages, mixed-language content

Preset models require the embeddings feature to be enabled in Kreuzberg.

Model Characteristics:

Fast: ~22M parameters, 384-dimensional vectors. Best for quick prototyping and development where speed is prioritized over quality.
Balanced: ~109M parameters, 768-dimensional vectors. Excellent general-purpose model with strong semantic understanding for most use cases.
Quality: ~335M parameters, 1024-dimensional vectors. Large model for maximum semantic accuracy when compute resources are available.
Multilingual: ~109M parameters, 768-dimensional vectors. Trained on multilingual data, effective for 100+ languages including rare languages.

FastEmbed Models

FastEmbed is a library for fast embedding generation. You can specify any supported FastEmbed model by name.

Common FastEmbed models:

AllMiniLML6V2Q - 384 dims, fast, quantized (same as fast preset)
BGEBaseENV15 - 768 dims, balanced (same as balanced preset)
BGELargeENV15 - 1024 dims, high quality (same as quality preset)
MultilingualE5Base - 768 dims, multilingual (same as multilingual preset)

Requires the embeddings feature and explicit dimensions specification.

Custom Models

Custom ONNX models from HuggingFace can be specified for specialized use cases. Provide the HuggingFace model ID and vector dimensions.

Note: Custom model support for full embedding generation is planned for future releases. Currently, custom models can be loaded and used via the Rust API.

LLM Provider-Hosted Embeddings

Instead of running local ONNX models, you can delegate embedding generation to a cloud provider’s embedding API via liter-llm. This is useful when you want to use the same embedding model as your vector database provider or when local model hosting is impractical.

use kreuzberg::core::{EmbeddingConfig, EmbeddingModelType, LlmConfig};

let config = EmbeddingConfig {
    model: EmbeddingModelType::Llm {
        llm: LlmConfig {
            model: "openai/text-embedding-3-small".to_string(),
            api_key: None, // Falls back to OPENAI_API_KEY env var
            base_url: None,
        },
    },
    batch_size: 32,
    ..Default::default()
};

[chunking.embedding]
model = { type = "llm", model = "openai/text-embedding-3-small" }
batch_size = 32

Note: When api_key is not set in LlmConfig, liter-llm falls back to provider-standard environment variables (for example, OPENAI_API_KEY, ANTHROPIC_API_KEY). Requires the liter-llm feature.

Cache Directory

Model files are cached locally to avoid re-downloading on subsequent runs.

Default cache location:

~/.cache/kreuzberg/embeddings/

Features:

Tilde (~) expansion: Home directory automatically resolved
Automatic creation: Cache directory created if it doesn’t exist
Persistent across runs: Models cached indefinitely until manually removed
Multi-process safe: Thread-safe concurrent access
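The tilde expansion and automatic creation listed above correspond to a common pattern, sketched here in Python with the standard library. The helper name resolve_cache_dir and its create flag are hypothetical; the default path is the one documented on this page.

```python
from pathlib import Path
from typing import Optional

# Default location as documented above.
DEFAULT_CACHE_DIR = "~/.cache/kreuzberg/embeddings/"


def resolve_cache_dir(cache_dir: Optional[str] = None, create: bool = True) -> Path:
    """Expand a leading '~' and optionally create the directory.

    Falls back to DEFAULT_CACHE_DIR when cache_dir is None or empty.
    """
    path = Path(cache_dir or DEFAULT_CACHE_DIR).expanduser()
    if create:
        # parents=True builds intermediate directories;
        # exist_ok=True makes repeated calls harmless.
        path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling resolve_cache_dir() with no arguments returns the expanded default location and creates it if it is missing.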

Custom cache directory:

[chunking.embedding]
model = { type = "preset", name = "balanced" }
cache_dir = "/custom/cache/path"

Performance Considerations

Batch Size Tuning

Default: 32 texts per batch
Small values (8-16): Lower memory usage, slower processing
Large values (64-128): Faster processing, higher memory usage
Adjust based on available GPU/CPU memory and document sizes
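As a rough picture of what batch_size controls, the sketch below splits a list of chunk texts into batches before they would be handed to an embedding model. The batched helper is an illustrative assumption, not a Kreuzberg function; peak memory grows with the size of one batch, while the number of model invocations shrinks as batches get larger.

```python
from typing import Iterator, Sequence


def batched(texts: Sequence[str], batch_size: int = 32) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]
```

With 70 chunks and the default batch size of 32, this produces three batches of 32, 32, and 6 texts.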

Normalization

Enabled (default): Vectors normalized to unit length, suitable for cosine similarity
Disabled: Raw vectors suitable for other distance metrics (Euclidean, dot product)
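The reason normalization pairs well with cosine similarity is that, for unit-length vectors, the dot product and the cosine similarity are the same number. The short Python check below demonstrates this with toy vectors; the helper names are illustrative and not part of Kreuzberg.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

For a = [3, 4] and b = [4, 3], cosine(a, b) and dot(l2_normalize(a), l2_normalize(b)) both evaluate to 0.96 (up to floating-point rounding), so a vector store can use the cheaper dot product when vectors are normalized.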

Model Size Trade-offs

Model	Size	Speed	Quality	Memory	Network
Fast	20 MB	Fastest	Good	200 MB	100 MB
Balanced	250 MB	Fast	Excellent	500 MB	250 MB
Quality	800 MB	Moderate	Outstanding	1.5 GB	800 MB
Multilingual	250 MB	Fast	Excellent	500 MB	250 MB

Configuration Examples

use kreuzberg::core::{ExtractionConfig, ChunkingConfig, EmbeddingConfig, EmbeddingModelType};

// Basic embedding with default balanced preset
let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        embedding: Some(EmbeddingConfig::default()),
        ..Default::default()
    }),
    ..Default::default()
};

use kreuzberg::core::{EmbeddingConfig, EmbeddingModelType};

// Use fast preset for quick processing
let config = EmbeddingConfig {
    model: EmbeddingModelType::Preset {
        name: "fast".to_string(),
    },
    normalize: true,
    batch_size: 16,
    show_download_progress: true,
    cache_dir: None,
};

// Use quality preset for best accuracy
let config = EmbeddingConfig {
    model: EmbeddingModelType::Preset {
        name: "quality".to_string(),
    },
    batch_size: 32,
    ..Default::default()
};

// Use multilingual for international content
let config = EmbeddingConfig {
    model: EmbeddingModelType::Preset {
        name: "multilingual".to_string(),
    },
    ..Default::default()
};

use kreuzberg::core::{EmbeddingConfig, EmbeddingModelType};

// Explicit ONNX model specification
let config = EmbeddingConfig {
    model: EmbeddingModelType::FastEmbed {
        model: "BGEBaseENV15".to_string(),
        dimensions: 768,
    },
    batch_size: 32,
    ..Default::default()
};

use kreuzberg::core::{EmbeddingConfig, EmbeddingModelType};
use std::path::PathBuf;

let config = EmbeddingConfig {
    model: EmbeddingModelType::Preset {
        name: "balanced".to_string(),
    },
    cache_dir: Some(PathBuf::from("/custom/models/cache")),
    show_download_progress: true,
    ..Default::default()
};

Configuration File Examples

TOML Format

[chunking]
max_characters = 1000
overlap = 200

# Use balanced preset (default)
[chunking.embedding]
model = { type = "preset", name = "balanced" }
batch_size = 32
normalize = true

# Or use fast preset
# [chunking.embedding]
# model = { type = "preset", name = "fast" }
# batch_size = 16

# Or use custom cache directory
# [chunking.embedding]
# model = { type = "preset", name = "quality" }
# cache_dir = "/data/models"
# show_download_progress = true

Token-Based Sizing (TOML)

[chunking]
max_chars = 512
max_overlap = 50

[chunking.sizing]
type = "tokenizer"
model = "Xenova/gpt-4o"

YAML Format

chunking:
  max_characters: 1000
  overlap: 200
  embedding:
    model:
      type: preset
      name: balanced
    batch_size: 32
    normalize: true

JSON Format

{
  "chunking": {
    "max_characters": 1000,
    "overlap": 200,
    "embedding": {
      "model": {
        "type": "preset",
        "name": "balanced"
      },
      "batch_size": 32,
      "normalize": true
    }
  }
}

LlmConfig

Configuration for LLM provider connections used by structured extraction, VLM-based OCR, and provider-hosted embeddings. Uses liter-llm for provider-agnostic model access.

Fields

Field	Type	Default	Description
`model`	`String`	—	Model identifier in `provider/model-name` format (for example, `"openai/gpt-4o-mini"`, `"anthropic/claude-sonnet-4-20250514"`)
`api_key`	`String?`	`None`	API key for the provider. When `None`, falls back to provider-standard env vars (for example, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`)
`base_url`	`String?`	`None`	Custom base URL for the provider API. When `None`, uses the provider’s default endpoint. Useful for proxies or self-hosted API-compatible servers
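The api_key fallback described in the table can be sketched as follows. This is a simplified Python model of the documented behaviour, not liter-llm's actual code: the function name resolve_api_key is hypothetical, and the provider table contains only the two environment variables named on this page.

```python
import os
from typing import Mapping, Optional

# Only the providers mentioned in this reference; real providers may differ.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def resolve_api_key(
    model: str,
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the explicit key, else the provider's environment variable."""
    if api_key:
        return api_key
    provider, separator, _ = model.partition("/")
    if not separator:
        raise ValueError("model must use the provider/model-name format")
    env = os.environ if env is None else env
    variable = PROVIDER_ENV_VARS.get(provider)
    return env.get(variable) if variable else None
```

An explicit api_key always wins; the environment is consulted only when it is unset, which keeps secrets out of configuration files.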

Configuration Examples

use kreuzberg::core::LlmConfig;

// Minimal config (uses provider env var for API key)
let config = LlmConfig {
    model: "openai/gpt-4o-mini".to_string(),
    api_key: None,
    base_url: None,
};

// Explicit API key and custom endpoint
let config = LlmConfig {
    model: "openai/gpt-4o".to_string(),
    api_key: Some("sk-...".to_string()),
    base_url: Some("https://api.example.com".to_string()),
};

config = {
    "model": "openai/gpt-4o-mini",
    "api_key": None,       # Falls back to OPENAI_API_KEY
    "base_url": None,
}

const config: LlmConfig = {
  model: "openai/gpt-4o-mini",
  apiKey: undefined,     // Falls back to OPENAI_API_KEY
  baseUrl: undefined,
};

config := kreuzberg.LlmConfig{
    Model:   "openai/gpt-4o-mini",
    ApiKey:  nil,  // Falls back to OPENAI_API_KEY
    BaseUrl: nil,
}

Configuration File Examples

[llm]
model = "openai/gpt-4o-mini"
# api_key = "sk-..."       # Optional: falls back to OPENAI_API_KEY
# base_url = "https://..."  # Optional: uses provider default

llm:
  model: openai/gpt-4o-mini
  # api_key: sk-...
  # base_url: https://...

StructuredExtractionConfig

Configuration for LLM-powered structured data extraction. Enables extracting structured data from documents by providing a JSON schema that defines the expected output format. The LLM processes the document content and returns data conforming to the schema.
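Because the output comes from an LLM, it can be worth sanity-checking the returned data against the schema in application code. The Python sketch below is a deliberately small checker that understands only the type, properties, required, and items keywords; it is not a full JSON Schema validator and is not part of Kreuzberg. The function name find_schema_problems and the INVOICE_SCHEMA constant are illustrative assumptions.

```python
from typing import Any

JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "boolean": bool,
}

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}


def find_schema_problems(value: Any, schema: dict, path: str = "$") -> list[str]:
    """Return human-readable problems; an empty list means the value passed."""
    expected = schema.get("type")
    if expected in JSON_TYPES:
        matches = isinstance(value, JSON_TYPES[expected])
        # bool is a subclass of int in Python, so reject it for "number".
        if expected == "number" and isinstance(value, bool):
            matches = False
        if not matches:
            return [f"{path}: expected {expected}"]

    problems: list[str] = []
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing required field")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                problems.extend(find_schema_problems(value[key], sub_schema, f"{path}.{key}"))
    elif isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            problems.extend(find_schema_problems(item, schema["items"], f"{path}[{index}]"))
    return problems
```

For production use, a complete JSON Schema validator is the safer choice; this sketch only shows the shape of the check.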

Fields

Field	Type	Default	Description
`llm`	`LlmConfig`	—	LLM provider configuration for the structured extraction model
`schema`	`JsonValue`	—	JSON Schema defining the expected output structure. Must be a valid JSON Schema object.
`prompt`	`String?`	`None`	Custom system prompt for structured extraction. Overrides the default prompt. Useful for domain-specific instructions.
`max_tokens`	`usize?`	`None`	Maximum tokens for LLM response. When `None`, uses the provider’s default limit.
`temperature`	`f64?`	`None`	Sampling temperature (0.0-2.0). Lower values produce more deterministic output. When `None`, defaults to `0.0` for maximum consistency.

Configuration Examples

use kreuzberg::core::{ExtractionConfig, StructuredExtractionConfig, LlmConfig};
use serde_json::json;

let config = ExtractionConfig {
    structured_extraction: Some(StructuredExtractionConfig {
        llm: LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            api_key: None,
            base_url: None,
        },
        schema: json!({
            "type": "object",
            "properties": {
                "invoice_number": { "type": "string" },
                "total_amount": { "type": "number" },
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": { "type": "string" },
                            "amount": { "type": "number" }
                        }
                    }
                }
            },
            "required": ["invoice_number", "total_amount"]
        }),
        prompt: None,
        max_tokens: None,
        temperature: Some(0.0),
    }),
    ..Default::default()
};

config = {
    "structured_extraction": {
        "llm": {
            "model": "openai/gpt-4o-mini",
        },
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "total_amount": {"type": "number"},
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "amount": {"type": "number"},
                        },
                    },
                },
            },
            "required": ["invoice_number", "total_amount"],
        },
        "temperature": 0.0,
    },
}

const config: ExtractionConfig = {
  structuredExtraction: {
    llm: {
      model: "openai/gpt-4o-mini",
    },
    schema: {
      type: "object",
      properties: {
        invoice_number: { type: "string" },
        total_amount: { type: "number" },
        line_items: {
          type: "array",
          items: {
            type: "object",
            properties: {
              description: { type: "string" },
              amount: { type: "number" },
            },
          },
        },
      },
      required: ["invoice_number", "total_amount"],
    },
    temperature: 0.0,
  },
};

Configuration File Examples

[structured_extraction]
prompt = "Extract invoice data from the document."
max_tokens = 4096
temperature = 0.0

[structured_extraction.llm]
model = "openai/gpt-4o-mini"

[structured_extraction.schema]
type = "object"

[structured_extraction.schema.properties.invoice_number]
type = "string"

[structured_extraction.schema.properties.total_amount]
type = "number"

structured_extraction:
  llm:
    model: openai/gpt-4o-mini
  schema:
    type: object
    properties:
      invoice_number:
        type: string
      total_amount:
        type: number
    required:
      - invoice_number
      - total_amount
  temperature: 0.0

EmailConfig

Configuration for .msg (Outlook/MAPI) and .eml email file extraction. Controls how legacy Windows codepage encodings are handled when reading email headers and bodies that lack explicit character set declarations.

Overview

Many older email messages, particularly those created by Microsoft Outlook on Windows, encode text using a Windows code page rather than UTF-8. When no charset is declared in the message headers, Kreuzberg defaults to Windows-1252 (Western European). Use msg_fallback_codepage to override this default for mailboxes that predominantly contain messages in a different encoding.

Fields

Field	Type	Default	Description
`msg_fallback_codepage`	`int?`	`None` (Windows-1252)	Windows code page number used when no charset is declared in the message. `None` = use 1252.

Common Codepage Values

Code Page	Encoding	Region / Language
`1250`	Windows Central European	Polish, Czech, Hungarian, and others
`1251`	Windows Cyrillic	Russian, Ukrainian, Bulgarian
`1252`	Windows Western European	English, German, French (default)
`1253`	Windows Greek	Greek
`1254`	Windows Turkish	Turkish
`1255`	Windows Hebrew	Hebrew
`1256`	Windows Arabic	Arabic
`932`	Shift-JIS	Japanese
`936`	GBK (Simplified Chinese)	Simplified Chinese
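To see why the fallback code page matters, the Python sketch below decodes the same bytes with two different code pages. It models the documented rule (use the declared charset if present, otherwise the fallback, otherwise 1252) with Python's built-in cpNNNN codecs. The function name decode_with_fallback is hypothetical and this is not Kreuzberg's decoder.

```python
from typing import Optional


def decode_with_fallback(
    raw: bytes,
    declared_charset: Optional[str] = None,
    msg_fallback_codepage: Optional[int] = None,
) -> str:
    """Decode message bytes, preferring the declared charset.

    Without a declared charset, use the fallback code page, and
    default to Windows-1252 when no fallback is configured.
    """
    if declared_charset:
        return raw.decode(declared_charset, errors="replace")
    codepage = 1252 if msg_fallback_codepage is None else msg_fallback_codepage
    # Python names its Windows code page codecs "cp1251", "cp932", and so on.
    return raw.decode(f"cp{codepage}", errors="replace")


# The Russian word for "hello", stored as Windows-1251 bytes.
raw = "Привет".encode("cp1251")
print(decode_with_fallback(raw, msg_fallback_codepage=1251))  # Привет
print(decode_with_fallback(raw))                              # Ïðèâåò
```

The second call shows the typical mojibake produced when Cyrillic bytes are read with the Western European default.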

Configuration Examples

from kreuzberg import ExtractionConfig, PdfConfig
from kreuzberg.email import EmailConfig

# Extract a Russian Outlook .msg file with Cyrillic encoding
config = ExtractionConfig(
    pdf_options=PdfConfig(
        email=EmailConfig(msg_fallback_codepage=1251)
    )
)

import { extractFile } from '@kreuzberg/node';

// Extract a Japanese .msg file encoded in Shift-JIS
const result = await extractFile('message.msg', null, {
  pdfOptions: {
    email: { msgFallbackCodepage: 932 },
  },
});

use kreuzberg::core::{ExtractionConfig, PdfConfig, EmailConfig};

// Extract a Central European .msg file
let config = ExtractionConfig {
    pdf_options: Some(PdfConfig {
        email: Some(EmailConfig {
            msg_fallback_codepage: Some(1250),
        }),
        ..Default::default()
    }),
    ..Default::default()
};

LanguageDetectionConfig

Configuration for automatic language detection in extracted text.

Field	Type	Default	Description
`enabled`	`bool`	`true`	Enable language detection
`min_confidence`	`float`	`0.8`	Minimum confidence threshold (0.0-1.0) for reporting detected languages
`detect_multiple`	`bool`	`false`	Detect multiple languages (vs. dominant language only)
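The interplay of min_confidence and detect_multiple can be pictured with a small Python sketch. It assumes a detector has already produced (language, confidence) candidates; the candidate list, the scores, and the function name select_languages are hypothetical and do not reflect Kreuzberg internals.

```python
def select_languages(
    candidates: list[tuple[str, float]],
    min_confidence: float = 0.8,
    detect_multiple: bool = False,
) -> list[str]:
    """Keep candidates at or above min_confidence, best first.

    Returns every surviving language when detect_multiple is True,
    otherwise only the dominant one (or an empty list).
    """
    kept = sorted(
        (c for c in candidates if c[1] >= min_confidence),
        key=lambda c: c[1],
        reverse=True,
    )
    codes = [language for language, _ in kept]
    return codes if detect_multiple else codes[:1]
```

With candidates eng (0.92), deu (0.85), and fra (0.40), the defaults return only eng; enabling detect_multiple returns eng and deu, and raising min_confidence to 0.95 returns an empty list.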

Example

using Kreuzberg;

var config = new ExtractionConfig
{
    LanguageDetection = new LanguageDetectionConfig
    {
        Enabled = true,
        MinConfidence = 0.9,
        DetectMultiple = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Languages: {string.Join(", ", result.DetectedLanguages ?? new List<string>())}");

package main

import (
  "fmt"

  "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func main() {
  minConfidence := 0.8
  config := &kreuzberg.ExtractionConfig{
    LanguageDetection: &kreuzberg.LanguageDetectionConfig{
      Enabled:        true,
      MinConfidence:  &minConfidence,
      DetectMultiple: false,
    },
  }

  fmt.Printf("Language detection enabled: %v\n", config.LanguageDetection.Enabled)
  fmt.Printf("Min confidence: %f\n", *config.LanguageDetection.MinConfidence)
}

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.LanguageDetectionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .languageDetection(LanguageDetectionConfig.builder()
        .enabled(true)
        .minConfidence(0.8)
        .build())
    .build();

import asyncio
from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        language_detection=LanguageDetectionConfig(
            enabled=True,
            min_confidence=0.85,
            detect_multiple=False
        )
    )
    result = await extract_file("document.pdf", config=config)
    if result.detected_languages:
        print(f"Primary language: {result.detected_languages[0]}")
    print(f"Content length: {len(result.content)} chars")

asyncio.run(main())

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  language_detection: Kreuzberg::Config::LanguageDetection.new(
    enabled: true,
    min_confidence: 0.8,
    detect_multiple: false
  )
)

library(kreuzberg)

config <- extraction_config(
  language_detection = list(enabled = TRUE)
)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("Detected language: %s\n", result$detected_language))
cat(sprintf("Content preview: %.60s...\n", result$content))

use kreuzberg::{ExtractionConfig, LanguageDetectionConfig};

let config = ExtractionConfig {
    language_detection: Some(LanguageDetectionConfig {
        enabled: true,
        min_confidence: 0.8,
        detect_multiple: false,
    }),
    ..Default::default()
};

import { extractFile } from '@kreuzberg/node';

const config = {
  languageDetection: {
    enabled: true,
    minConfidence: 0.8,
    detectMultiple: false,
  },
};

const result = await extractFile('document.pdf', null, config);
if (result.detectedLanguages) {
  console.log(`Detected languages: ${result.detectedLanguages.join(', ')}`);
}

KeywordConfig

Configuration for automatic keyword extraction from document text using YAKE or RAKE algorithms.

Feature Gate: Keyword extraction is only available when at least one of the keywords-yake or keywords-rake Cargo features is enabled.

Overview

Keyword extraction automatically identifies important terms and phrases in extracted text without manual labeling. Two algorithms are available:

YAKE: Statistical approach based on term frequency and co-occurrence analysis
RAKE: Rapid Automatic Keyword Extraction using word co-occurrence and frequency

Both algorithms work on each document in isolation and require no external training data, making them suitable for documents in any domain.

Configuration Fields

Field	Type	Default	Description
`algorithm`	`KeywordAlgorithm`	`Yake` (if available)	Algorithm to use: `yake` or `rake`
`max_keywords`	`usize`	`10`	Maximum number of keywords to extract
`min_score`	`f32`	`0.0`	Minimum score threshold for keyword filtering. YAKE scores fall in 0.0-1.0; RAKE scores are unbounded
`ngram_range`	`(usize, usize)`	`(1, 3)`	N-gram range: (min, max) words per keyword phrase
`language`	`Option<String>`	`Some("en")`	Language code for stopword filtering (for example, “en”, “de”, “fr”), `None` disables filtering
`yake_params`	`Option<YakeParams>`	`None`	YAKE-specific tuning parameters
`rake_params`	`Option<RakeParams>`	`None`	RAKE-specific tuning parameters

Algorithm Comparison

YAKE (Yet Another Keyword Extractor)

Approach: Statistical scoring based on term statistics and co-occurrence patterns.

Aspect	Details
Best For	General-purpose documents, balanced keyword distribution
Strengths	No training required, handles rare terms well, language-independent
Limitations	May extract very common terms, single-word focus
Score Range	0.0-1.0 (lower scores = more relevant)
Tuning	`window_size` (default: 2) - context window for co-occurrence
Use Cases	Research papers, news articles, general text

Characteristic: YAKE assigns lower scores to more relevant keywords, so use higher min_score to be more selective.

RAKE (Rapid Automatic Keyword Extraction)

Approach: Splits text into candidate phrases at stop words, then scores each phrase using a word co-occurrence graph.

Aspect	Details
Best For	Multi-word phrases, domain-specific terminology
Strengths	Excellent for extracting multi-word phrases, fast, domain-aware
Limitations	Requires good stopword list, less effective with poorly structured text
Score Range	0.0+ (higher scores = more relevant, unbounded)
Tuning	`min_word_length`, `max_words_per_phrase`
Use Cases	Technical documentation, scientific papers, product descriptions

Characteristic: RAKE assigns higher scores to more relevant keywords, so use lower min_score thresholds.
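
Because the two algorithms score in opposite directions, code that ranks keywords needs to know which algorithm produced them. The sketch below shows one way to handle that; `rank_keywords` is a hypothetical helper and the scores are made up for illustration.

```python
from typing import Dict, List


def rank_keywords(keywords: List[Dict], algorithm: str) -> List[Dict]:
    """Order keywords from most to least relevant for the given algorithm."""
    # YAKE: lower score = more relevant, so sort ascending.
    # RAKE: higher score = more relevant, so sort descending.
    return sorted(keywords, key=lambda k: k["score"], reverse=(algorithm == "rake"))


# Illustrative scores only; real values depend on the document.
yake_hits = [
    {"text": "extraction", "score": 0.42},
    {"text": "machine learning", "score": 0.08},
]
rake_hits = [
    {"text": "extraction", "score": 1.5},
    {"text": "machine learning", "score": 7.0},
]

print([k["text"] for k in rank_keywords(yake_hits, "yake")])
print([k["text"] for k in rank_keywords(rake_hits, "rake")])
```

Both calls put "machine learning" first, even though its YAKE score is the smallest in its list and its RAKE score is the largest.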

N-gram Range Explanation

The ngram_range parameter controls the size of keyword phrases:

ngram_range: (1, 1)  → Single words only: "python", "machine", "learning"
ngram_range: (1, 2)  → 1-2 word phrases: "python", "machine learning", "deep learning"
ngram_range: (1, 3)  → 1-3 word phrases: "python", "machine learning", "deep neural networks"
ngram_range: (2, 3)  → 2-3 word phrases only: "machine learning", "neural networks"

Recommendations:

Use (1, 1) for single-word indexing (tagging, classification)
Use (1, 2) for balanced coverage of terms and phrases
Use (1, 3) for comprehensive phrase extraction (default)
Use (2, 3) if you only want multi-word phrases
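
To make the effect of `ngram_range` concrete, the sketch below enumerates every contiguous word sequence whose length falls within a given range. It only illustrates candidate generation under simple whitespace tokenization, which is an assumption; the actual algorithms also score and filter candidates. `candidate_phrases` is a hypothetical name.

```python
from typing import List, Tuple


def candidate_phrases(text: str, ngram_range: Tuple[int, int] = (1, 3)) -> List[str]:
    """Return every contiguous word sequence whose length is within ngram_range."""
    words = text.lower().split()
    shortest, longest = ngram_range
    phrases = []
    for n in range(shortest, longest + 1):
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            phrases.append(" ".join(words[start:start + n]))
    return phrases


print(candidate_phrases("deep neural networks", (1, 1)))
print(candidate_phrases("deep neural networks", (2, 3)))
```

The number of candidates grows with the width of the range, so a narrower range also reduces the work done per document.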

Keyword Output Format

Keywords are returned as a list of Keyword structures in the extraction result:

{
  "text": "machine learning",
  "score": 0.85,
  "algorithm": "yake",
  "positions": [42, 156, 203]
}

Fields:

text: The keyword or phrase text
score: Relevance score (algorithm-specific range and meaning)
algorithm: Which algorithm extracted this keyword
positions: Optional character offsets where the keyword appears in text
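
If you serialize results to JSON, the keyword entries can be loaded back into a small typed structure. The sketch below assumes the field names shown above; the `Keyword` dataclass here is a hypothetical mirror written for this example, not the library's own type, and the second payload entry is made up to show a missing `positions` field.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Keyword:
    text: str
    score: float
    algorithm: str
    positions: Optional[List[int]] = None  # optional, so it may be absent


def parse_keywords(payload: str) -> List[Keyword]:
    """Parse a JSON array of keyword objects into Keyword instances."""
    return [Keyword(**item) for item in json.loads(payload)]


payload = """
[
  {"text": "machine learning", "score": 0.85, "algorithm": "yake", "positions": [42, 156, 203]},
  {"text": "python", "score": 0.4, "algorithm": "yake"}
]
"""

keywords = parse_keywords(payload)
for kw in keywords:
    occurrences = len(kw.positions) if kw.positions is not None else "unknown"
    print(f"{kw.text}: score={kw.score}, occurrences={occurrences}")
```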

Example: YAKE Configuration

using Kreuzberg;

var config = new ExtractionConfig
{
    Keywords = new KeywordConfig
    {
        Algorithm = KeywordAlgorithm.Yake,
        MaxKeywords = 10,
        MinScore = 0.3,
        NgramRange = (1, 3),
        Language = "en"
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

config := &ExtractionConfig{
    Keywords: &KeywordConfig{
        Algorithm:   KeywordAlgorithm.Yake,
        MaxKeywords: 10,
        MinScore:    0.3,
        NgramRange:  [2]uint32{1, 3},
        Language:    "en",
    },
}

var config = ExtractionConfig.builder()
    .keywords(KeywordConfig.builder()
        .algorithm(KeywordAlgorithm.YAKE)
        .maxKeywords(10)
        .minScore(0.3f)
        .ngramRange(1, 3)
        .language("en")
        .build())
    .build();

from kreuzberg import ExtractionConfig, KeywordConfig, KeywordAlgorithm

config = ExtractionConfig(
    keywords=KeywordConfig(
        algorithm=KeywordAlgorithm.YAKE,
        max_keywords=10,
        min_score=0.3,
        ngram_range=(1, 3),
        language="en"
    )
)

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  keywords: Kreuzberg::KeywordConfig.new(
    algorithm: :yake,
    max_keywords: 10,
    min_score: 0.3,
    ngram_range: [1, 3],
    language: "en"
  )
)

use kreuzberg::{ExtractionConfig, KeywordConfig, KeywordAlgorithm};

let config = ExtractionConfig {
    keywords: Some(KeywordConfig {
        algorithm: KeywordAlgorithm::Yake,
        max_keywords: 10,
        min_score: 0.3,
        ngram_range: (1, 3),
        language: Some("en".to_string()),
        ..Default::default()
    }),
    ..Default::default()
};

import { ExtractionConfig, KeywordConfig, KeywordAlgorithm } from 'kreuzberg';

const config: ExtractionConfig = {
  keywords: {
    algorithm: KeywordAlgorithm.Yake,
    maxKeywords: 10,
    minScore: 0.3,
    ngramRange: [1, 3],
    language: "en"
  }
};

Example: RAKE Configuration with Multi-word Phrases

from kreuzberg import ExtractionConfig, KeywordConfig, KeywordAlgorithm, RakeParams

config = ExtractionConfig(
    keywords=KeywordConfig(
        algorithm=KeywordAlgorithm.RAKE,
        max_keywords=15,
        min_score=0.1,
        ngram_range=(1, 4),
        language="en",
        rake_params=RakeParams(
            min_word_length=2,
            max_words_per_phrase=4
        )
    )
)

use kreuzberg::{ExtractionConfig, KeywordConfig, KeywordAlgorithm, RakeParams};

let config = ExtractionConfig {
    keywords: Some(KeywordConfig {
        algorithm: KeywordAlgorithm::Rake,
        max_keywords: 15,
        min_score: 0.1,
        ngram_range: (1, 4),
        language: Some("en".to_string()),
        rake_params: Some(RakeParams {
            min_word_length: 2,
            max_words_per_phrase: 4,
        }),
        ..Default::default()
    }),
    ..Default::default()
};

Language Support

Stopword filtering is applied when a language is specified. Common supported languages:

en - English
es - Spanish
fr - French
de - German
pt - Portuguese
it - Italian
ru - Russian
ja - Japanese
zh - Chinese
ar - Arabic

Set language to None to disable stopword filtering, which lets you extract keywords from text in any language.
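
The effect of the `language` setting can be pictured with a toy stopword filter. This is a sketch only: the stopword lists below are tiny placeholders, `filter_stopwords` is a hypothetical helper, and treating an unknown language code as "no stopwords" is an assumption.

```python
from typing import List, Optional

# Tiny illustrative stopword lists; real lists are much longer.
STOPWORDS = {
    "en": {"the", "of", "and", "a", "in"},
    "de": {"der", "die", "das", "und", "in"},
}


def filter_stopwords(words: List[str], language: Optional[str]) -> List[str]:
    """Drop stopwords for the given language; None leaves the words untouched."""
    if language is None:
        return list(words)
    stop = STOPWORDS.get(language, set())
    return [w for w in words if w.lower() not in stop]


words = "the extraction of text and tables".split()
print(filter_stopwords(words, "en"))
print(filter_stopwords(words, None))
```

With filtering disabled, function words such as "the" and "of" remain eligible as keyword candidates, which tends to lower result quality for RAKE in particular.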

PdfConfig

PDF-specific extraction configuration.

Field	Type	Default	Description
`extract_images`	`bool`	`false`	Extract embedded images from PDF pages
`extract_metadata`	`bool`	`true`	Extract PDF metadata (title, author, creation date, etc.)
`passwords`	`list[str]?`	`None`	List of passwords to try for encrypted PDFs (tries in order)
`hierarchy`	`HierarchyConfig?`	`None`	Hierarchy extraction configuration (None = hierarchy extraction disabled)
`allow_single_column_tables` v4.5.0	`bool`	`false`	Lower the minimum column count for table detection (normally 2-3) to 1, allowing single-column tables to be extracted

Example

using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        ExtractImages = true,
        ExtractMetadata = true,
        Passwords = new List<string> { "password1", "password2" },
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 6,
            IncludeBbox = true,
            OcrCoverageThreshold = 0.5f
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");

package main

import (
  "log"

  "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func main() {
  pw := []string{"password1", "password2"}
  result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
    PdfOptions: &kreuzberg.PdfConfig{
      ExtractImages:   kreuzberg.BoolPtr(true),
      ExtractMetadata: kreuzberg.BoolPtr(true),
      Passwords:       pw,
      Hierarchy:       &kreuzberg.HierarchyConfig{},
    },
  })
  if err != nil {
    log.Fatalf("extract failed: %v", err)
  }

  log.Println("content length:", len(result.Content))
}

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.PdfConfig;
import dev.kreuzberg.config.HierarchyConfig;
import java.util.Arrays;

ExtractionConfig config = ExtractionConfig.builder()
    .pdfOptions(PdfConfig.builder()
        .extractImages(true)
        .extractMetadata(true)
        .passwords(Arrays.asList("password1", "password2"))
        .hierarchyConfig(HierarchyConfig.builder().build())
        .build())
    .build();

import asyncio
from kreuzberg import ExtractionConfig, PdfConfig, HierarchyConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        pdf_options=PdfConfig(
            extract_images=True,
            extract_metadata=True,
            passwords=["password1", "password2"],
            hierarchy=HierarchyConfig(enabled=True, k_clusters=6)
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  pdf_options: Kreuzberg::Config::PDF.new(
    extract_images: true,
    extract_metadata: true,
    passwords: ['password1', 'password2'],
    hierarchy: Kreuzberg::Config::Hierarchy.new(
      enabled: true,
      k_clusters: 6,
      include_bbox: true
    )
  )
)

library(kreuzberg)

config <- extraction_config(
  pdf_options = list(extract_tables = TRUE)
)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("Tables extracted: %d\n", length(result$tables)))
cat(sprintf("Total elements: %d\n", length(result$elements)))
cat(sprintf("Content preview: %.50s...\n", result$content))

use kreuzberg::{ExtractionConfig, PdfConfig};

fn main() {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            extract_images: Some(true),
            extract_metadata: Some(true),
            passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
            // Leave the remaining fields (such as hierarchy) at their defaults.
            ..Default::default()
        }),
        ..Default::default()
    };
    println!("{:?}", config.pdf_options);
}

import { extractFile } from '@kreuzberg/node';

const config = {
  pdfOptions: {
    extractImages: true,
    extractMetadata: true,
    passwords: ['password1', 'password2'],
    hierarchy: { enabled: true, kClusters: 6, includeBbox: true }
  },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

HierarchyConfig

PDF document hierarchy extraction configuration for semantic text structure analysis.

Overview

HierarchyConfig enables automatic extraction of document hierarchy levels (H1-H6) from PDF text by analyzing font size patterns. This is particularly useful for:

Building semantic document representations for RAG (Retrieval Augmented Generation) systems
Automatic table of contents extraction
Document structure understanding and analysis
Content organization and outlining

The hierarchy detection works by:

Extracting text blocks with font size metadata from the PDF
Performing K-means clustering on font sizes to identify distinct size groups
Mapping clusters to heading levels (h1-h6) and body text
Merging adjacent blocks with the same hierarchy level
Optionally including bounding box information for spatial awareness

Fields

Field	Type	Default	Description
`enabled`	`bool`	`true`	Enable hierarchy extraction
`k_clusters`	`usize`	`6`	Number of font size clusters (1-7). The default of 6 yields h1-h5 plus body text
`include_bbox`	`bool`	`true`	Include bounding box coordinates in output
`ocr_coverage_threshold`	`Option<f32>`	`None`	Smart OCR triggering threshold (0.0-1.0). Triggers OCR if text blocks cover less than this fraction of the page

How It Works

Font Size Extraction

Text blocks are extracted from PDFs with their precise font sizes. This metadata is preserved for analysis.

K-means Clustering

The font sizes are clustered using the K-means algorithm with the specified number of clusters. Each cluster represents a distinct text hierarchy level, ordered from the largest fonts (headings) to the smallest (body text).

Cluster-to-Level Mapping:

For k_clusters=6 (recommended): Creates 6 clusters → h1 (largest), h2, h3, h4, h5, body (smallest)
For k_clusters=3: Fast mode with just h1, h3, body (minimal detail)
For k_clusters=7: Maximum detail separating h1-h6 with distinct body text
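
The clustering and mapping steps can be sketched with a one-dimensional K-means over font sizes. This is a minimal illustration, not Kreuzberg's implementation: the function names, the evenly spaced centroid initialization, and the sample font sizes are assumptions. The label tables follow the mapping listed above.

```python
from typing import Dict, List

# Level labels per cluster count, largest font first (taken from the list above).
LEVELS_BY_K: Dict[int, List[str]] = {
    3: ["h1", "h3", "body"],
    6: ["h1", "h2", "h3", "h4", "h5", "body"],
    7: ["h1", "h2", "h3", "h4", "h5", "h6", "body"],
}


def kmeans_1d(values: List[float], k: int, iterations: int = 20) -> List[float]:
    """Plain Lloyd's algorithm in one dimension; returns k centroids."""
    distinct = sorted(set(values))
    if k < 2 or len(distinct) < k:
        raise ValueError("need k >= 2 and at least k distinct font sizes")
    # Deterministic start: spread initial centroids evenly over the sorted sizes.
    centroids = [distinct[i * (len(distinct) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def assign_levels(font_sizes: List[float], k: int) -> Dict[float, str]:
    """Map each font size to a level label, largest centroid first."""
    centroids = sorted(kmeans_1d(font_sizes, k), reverse=True)
    labels = LEVELS_BY_K[k]
    return {
        size: labels[min(range(k), key=lambda i: abs(size - centroids[i]))]
        for size in font_sizes
    }


# Made-up font sizes (in points) for a short document.
sizes = [24.0, 24.0, 16.0, 16.0, 16.0, 11.0, 11.0, 11.0, 11.0]
print(assign_levels(sizes, 3))
```

K-means can settle on a poor grouping when the initial centroids are badly placed, so a production implementation would typically use a more careful initialization than the one shown here.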

Block Merging

Adjacent blocks with the same hierarchy level are merged to create logical content units. This merge process considers:

Spatial proximity (vertical and horizontal distance)
Bounding box overlap ratio
Text flow direction
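
A simplified version of the merge step is sketched below. It considers only vertical proximity between consecutive blocks; overlap ratio and text flow direction are left out. The `Block` type, the `max_gap` value, and the assumption that y coordinates grow downward are all hypothetical choices for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    level: str
    top: float     # y coordinate of the upper edge (assumed to grow downward)
    bottom: float  # y coordinate of the lower edge


def merge_adjacent(blocks: List[Block], max_gap: float = 6.0) -> List[Block]:
    """Merge consecutive blocks that share a level and sit close together."""
    merged: List[Block] = []
    for block in blocks:
        prev = merged[-1] if merged else None
        close_enough = prev is not None and block.top - prev.bottom <= max_gap
        if prev is not None and prev.level == block.level and close_enough:
            # Extend the previous block instead of starting a new one.
            merged[-1] = Block(f"{prev.text} {block.text}", prev.level, prev.top, block.bottom)
        else:
            merged.append(block)
    return merged


blocks = [
    Block("Intro", "h1", 0, 20),
    Block("First line", "body", 30, 40),
    Block("second line", "body", 42, 52),
    Block("Far paragraph", "body", 90, 100),
]
for b in merge_adjacent(blocks):
    print(f"[{b.level}] {b.text}")
```

The two nearby body lines collapse into one block, while the distant paragraph stays separate because its gap exceeds `max_gap`.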

Output Structure

Each extracted block contains:

Text content
Font size (in points)
Hierarchy level (h1-h6 or body)
Optional bounding box (left, top, right, bottom in PDF units)

Use Cases

Semantic Document Understanding

Extract hierarchical structure for understanding document semantics and building knowledge graphs:

H1: Document Title
  H2: Section 1
    H3: Subsection 1.1
      Body text...
    H3: Subsection 1.2
      Body text...
  H2: Section 2
    H3: Subsection 2.1

Automatic Table of Contents Generation

Build a dynamic table of contents from extracted hierarchy levels (h1-h3) for document navigation.
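
One possible way to turn hierarchy blocks into a table of contents is sketched below. It assumes blocks are available as (level, text) pairs with levels named "h1" to "h6" or "body"; `build_toc` is a hypothetical helper, not a Kreuzberg function.

```python
from typing import List, Tuple


def build_toc(blocks: List[Tuple[str, str]], max_level: int = 3) -> List[str]:
    """Turn (level, text) pairs into indented table-of-contents lines."""
    lines = []
    for level, text in blocks:
        if not level.startswith("h"):
            continue  # skip body text
        depth = int(level[1:])
        if depth <= max_level:
            # Indent two spaces per nesting level below h1.
            lines.append("  " * (depth - 1) + text)
    return lines


blocks = [
    ("h1", "Document Title"),
    ("h2", "Section 1"),
    ("body", "Body text..."),
    ("h3", "Subsection 1.1"),
    ("h4", "Minor detail"),
]
print("\n".join(build_toc(blocks)))
```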

RAG System Optimization

Use hierarchy information to improve context retrieval by chunking at appropriate heading boundaries rather than arbitrary character counts. This preserves semantic relationships.
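
Heading-aware chunking can be sketched as follows. The example assumes blocks arrive in reading order as (level, text) pairs; `chunk_by_heading` and the chunk dictionary layout are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple


def chunk_by_heading(blocks: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Start a new chunk at every heading and attach the body text that follows."""
    chunks: List[Dict[str, str]] = []
    for level, text in blocks:
        if level != "body":
            chunks.append({"heading": text, "level": level, "content": ""})
        elif chunks:
            separator = " " if chunks[-1]["content"] else ""
            chunks[-1]["content"] += separator + text
        else:
            # Body text before any heading gets its own untitled chunk.
            chunks.append({"heading": "", "level": "body", "content": text})
    return chunks


blocks = [
    ("h1", "Guide"),
    ("body", "Overview text."),
    ("h2", "Setup"),
    ("body", "Install it."),
    ("body", "Configure it."),
]
for chunk in chunk_by_heading(blocks):
    print(f"{chunk['heading']}: {chunk['content']}")
```

Each chunk keeps its heading alongside the text, so a retriever can return the section title as context together with the matching passage.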

Document Analysis

Extract and analyze document structure programmatically for compliance checking, content validation, or metadata extraction.

Configuration Examples

Basic Hierarchy Extraction

using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

// Access hierarchy from pages
if (result.Pages != null)
{
    foreach (var page in result.Pages)
    {
        if (page.Hierarchy != null)
        {
            Console.WriteLine($"Page {page.PageNumber}: {page.Hierarchy.BlockCount} blocks");
            foreach (var block in page.Hierarchy.Blocks)
            {
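                // Truncate safely: Substring(0, 50) throws when the text is shorter than 50 characters.
                var preview = block.Text[..Math.Min(50, block.Text.Length)];
                Console.WriteLine($"  [{block.Level}] {preview}...");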
            }
        }
    }
}

package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func main() {
    config := &kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            Hierarchy: &kreuzberg.HierarchyConfig{
                Enabled: true,
            },
        },
    }

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        panic(err)
    }

    if result.Pages != nil {
        for _, page := range result.Pages {
            if page.Hierarchy != nil {
                fmt.Printf("Page %d: %d blocks\n", page.PageNumber, page.Hierarchy.BlockCount)
                for _, block := range page.Hierarchy.Blocks {
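                    // Truncate safely: slicing past the end of a short string would panic.
                    preview := block.Text
                    if len(preview) > 50 {
                        preview = preview[:50]
                    }
                    fmt.Printf("  [%s] %s...\n", block.Level, preview)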
                }
            }
        }
    }
}

import com.kreuzberg.*;

public class BasicHierarchy {
    public static void main(String[] args) throws Exception {
        ExtractionConfig config = ExtractionConfig.builder()
            .pdfOptions(PdfConfig.builder()
                .hierarchy(HierarchyConfig.builder()
                    .enabled(true)
                    .build())
                .build())
            .build();

        ExtractionResult result = KreuzbergClient.extractFileSync("document.pdf", config);

        if (result.getPages() != null) {
            for (PageContent page : result.getPages()) {
                if (page.getHierarchy() != null) {
                    System.out.println("Page " + page.getPageNumber() + ": " +
                        page.getHierarchy().getBlockCount() + " blocks");
                    for (HierarchicalBlock block : page.getHierarchy().getBlocks()) {
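                        // Truncate safely: substring(0, 50) throws on shorter text.
                        String text = block.getText();
                        System.out.println("  [" + block.getLevel() + "] " +
                            text.substring(0, Math.min(50, text.length())) + "...");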
                    }
                }
            }
        }
    }
}

from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig

config: ExtractionConfig = ExtractionConfig(
    pdf_options=PdfConfig(
        extract_metadata=True,
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=6,
            include_bbox=True,
            ocr_coverage_threshold=0.8
        )
    )
)

result = extract_file_sync("document.pdf", config=config)

# Access hierarchy information
for page in result.pages or []:
    print(f"Page {page.page_number}:")
    print(f"  Content: {page.content[:100]}...")

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    hierarchy: Kreuzberg::HierarchyConfig.new(
      enabled: true
    )
  )
)

result = Kreuzberg.extract_file_sync("document.pdf", config: config)

if result.pages
  result.pages.each do |page|
    if page.hierarchy
      puts "Page #{page.page_number}: #{page.hierarchy.block_count} blocks"
      page.hierarchy.blocks.each do |block|
        puts "  [#{block.level}] #{block.text[0..49]}..."
      end
    end
  end
end

use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                enabled: true,
                k_clusters: 6,
                include_bbox: true,
                ocr_coverage_threshold: Some(0.8),
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None::<&str>, &config)?;
    println!("Hierarchy levels: {}", result.hierarchy.len());
    Ok(())
}

import { extractFileSync, ExtractionConfig, PdfConfig, HierarchyConfig } from 'kreuzberg';

const config: ExtractionConfig = {
    pdfOptions: {
        hierarchy: {
            enabled: true
        }
    }
};

const result = extractFileSync("document.pdf", config);

if (result.pages) {
    for (const page of result.pages) {
        if (page.hierarchy) {
            console.log(`Page ${page.pageNumber}: ${page.hierarchy.blockCount} blocks`);
            for (const block of page.hierarchy.blocks) {
                console.log(`  [${block.level}] ${block.text.substring(0, 50)}...`);
            }
        }
    }
}

Custom K-Clusters Configuration

Configure clustering granularity for different hierarchy detail levels:

using Kreuzberg;

// Fast mode: 3 clusters (h1, h3, body) - minimal detail
var fastConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 3  // Fast, identifies main structure only
        }
    }
};

// Balanced mode: 6 clusters (h1-h5 + body) - default, recommended
var balancedConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 6  // Balanced detail
        }
    }
};

// Detailed mode: 7 clusters (h1-h6 + distinct body) - maximum detail
var detailedConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 7  // Maximum detail with body text separation
        }
    }
};


HierarchyConfig

PDF document hierarchy extraction configuration for semantic text structure analysis.

Overview

HierarchyConfig enables automatic extraction of document hierarchy levels (H1-H6) from PDF text by analyzing font size patterns. This is particularly useful for:

Building semantic document representations for RAG (Retrieval Augmented Generation) systems
Automatic table of contents extraction
Document structure understanding and analysis
Content organization and outlining

The hierarchy detection works by:

Extracting text blocks with font size metadata from the PDF
Performing K-means clustering on font sizes to identify distinct size groups
Mapping clusters to heading levels (h1-h6) and body text
Merging adjacent blocks with the same hierarchy level
Optionally including bounding box information for spatial awareness

Fields

Field	Type	Default	Description
`enabled`	`bool`	`true`	Enable hierarchy extraction
`k_clusters`	`usize`	`6`	Number of font size clusters (1-7). The default of 6 yields h1-h5 plus body text; 7 yields h1-h6 plus body text
`include_bbox`	`bool`	`true`	Include bounding box coordinates in output
`ocr_coverage_threshold`	`Option<f32>`	`None`	Smart OCR triggering threshold (0.0-1.0). Triggers OCR if text blocks cover less than this fraction of the page; `None` disables the trigger
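
The documented ranges can be checked before a configuration is built. The following is a minimal sketch only: `validate_hierarchy_options` is a hypothetical helper, not part of the Kreuzberg API, and it does nothing more than mirror the constraints listed in the table above.

```python
from typing import Optional


def validate_hierarchy_options(
    k_clusters: int = 6,
    ocr_coverage_threshold: Optional[float] = None,
) -> None:
    """Raise ValueError if a value falls outside the documented range.

    Hypothetical helper: it mirrors the table's constraints and is not a Kreuzberg API.
    """
    if not 1 <= k_clusters <= 7:
        raise ValueError(f"k_clusters must be in 1-7, got {k_clusters}")
    if ocr_coverage_threshold is not None and not 0.0 <= ocr_coverage_threshold <= 1.0:
        raise ValueError(
            f"ocr_coverage_threshold must be in 0.0-1.0, got {ocr_coverage_threshold}"
        )


# Values inside the documented ranges pass silently
validate_hierarchy_options(k_clusters=6, ocr_coverage_threshold=0.5)
```

Running a check like this early turns an out-of-range value into a clear error message before any document is processed.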

How It Works

Font Size Extraction

Text blocks are extracted from PDFs with their precise font sizes. This metadata is preserved for analysis.

K-means Clustering

Font sizes are grouped into k_clusters clusters with K-means, and the clusters are then ordered from largest to smallest font size.

Cluster-to-Level Mapping:

For k_clusters=6 (recommended): Creates 6 clusters → h1 (largest), h2, h3, h4, h5, body (smallest)
For k_clusters=3: Fast mode with just h1, h3, body (minimal detail)
For k_clusters=7: Maximum detail, with h1-h6 and a separate body text cluster
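
The clustering and mapping steps can be illustrated with a small one-dimensional K-means. This is a sketch under stated assumptions, not Kreuzberg's implementation: `kmeans_1d` and `assign_levels` are hypothetical names, the centroid initialisation is a simple deterministic choice, and clusters are named consecutively (h1, h2, ..., body), so the documented h1/h3/body naming for k_clusters=3 is not reproduced.

```python
def kmeans_1d(values, k, iterations=50):
    """Cluster scalar values into at most k groups; return centroids, largest first."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))  # cannot have more clusters than distinct sizes
    if k == 1:
        centroids = [sum(values) / len(values)]
    else:
        # Deterministic start: spread centroids evenly over the sorted distinct values
        last = len(distinct) - 1
        centroids = [distinct[(i * last) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            groups[nearest].append(value)
        # Keep the old centroid if a cluster ends up empty
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, reverse=True)


def assign_levels(font_sizes, k_clusters=6):
    """Label each font size with the level of its nearest cluster centroid."""
    centroids = kmeans_1d(font_sizes, k_clusters)
    # Assumption: the smallest cluster is body text, the rest are h1, h2, ... by size
    names = [f"h{i + 1}" for i in range(len(centroids) - 1)] + ["body"]
    return [
        names[min(range(len(centroids)), key=lambda i: abs(size - centroids[i]))]
        for size in font_sizes
    ]


sizes = [28.0, 22.0, 18.0, 16.0, 14.0, 11.0, 11.0]
print(assign_levels(sizes))  # ['h1', 'h2', 'h3', 'h4', 'h5', 'body', 'body']
```

Because the number of clusters is capped at the number of distinct font sizes, a document that uses a single font size is labelled entirely as body text in this sketch.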

Block Merging

Adjacent blocks with the same hierarchy level are merged to create logical content units. This merge process considers:

Spatial proximity (vertical and horizontal distance)
Bounding box overlap ratio
Text flow direction
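
A simplified version of the merge step is sketched below. It considers only vertical distance between consecutive blocks; overlap ratio and text flow direction are left out. The function name `merge_adjacent` and the `max_gap` threshold are hypothetical, and the sketch assumes bounding boxes are [left, top, right, bottom] with the vertical coordinate increasing down the page, as in the sample output in this section.

```python
def merge_adjacent(blocks, max_gap=10.0):
    """Merge consecutive blocks that share a level and lie within max_gap points vertically."""
    merged = []
    for block in blocks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["level"] == block["level"]
            and block["bbox"][1] - prev["bbox"][3] <= max_gap  # next top minus previous bottom
        ):
            l1, t1, r1, b1 = prev["bbox"]
            l2, t2, r2, b2 = block["bbox"]
            prev["text"] = prev["text"] + " " + block["text"]
            # The merged box is the union of both boxes
            prev["bbox"] = [min(l1, l2), min(t1, t2), max(r1, r2), max(b1, b2)]
        else:
            # Copy so the caller's input blocks are not mutated
            merged.append({**block, "bbox": list(block["bbox"])})
    return merged


lines = [
    {"text": "First line", "level": "body", "bbox": [50.0, 200.0, 500.0, 212.0]},
    {"text": "second line", "level": "body", "bbox": [50.0, 214.0, 480.0, 226.0]},
    {"text": "Next Section", "level": "h2", "bbox": [50.0, 260.0, 300.0, 280.0]},
]
for block in merge_adjacent(lines):
    print(block["level"], block["text"], block["bbox"])
```

Here the two body lines are two points apart and collapse into one block, while the h2 heading stays separate because its level differs.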

Output Structure

Each extracted block contains:

Text content
Font size (in points)
Hierarchy level (h1-h6 or body)
Optional bounding box (left, top, right, bottom in PDF units)

Use Cases

Semantic Document Understanding

Extract hierarchical structure for understanding document semantics and building knowledge graphs:

H1: Document Title
  H2: Section 1
    H3: Subsection 1.1
      Body text...
    H3: Subsection 1.2
      Body text...
  H2: Section 2
    H3: Subsection 2.1

Automatic Table of Contents Generation

Build a dynamic table of contents from the extracted hierarchy levels (for example, h1-h3) for document navigation.

RAG System Optimization

Use hierarchy information to improve context retrieval by chunking at appropriate heading boundaries rather than arbitrary character counts. This preserves semantic relationships.
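
Heading-based chunking can be sketched as follows. The helper `chunk_by_heading` and its `max_heading_level` parameter are hypothetical and not part of Kreuzberg; blocks are plain dictionaries with the `text` and `level` fields described in this section.

```python
def chunk_by_heading(blocks, max_heading_level=2):
    """Start a new chunk at every heading whose level is <= max_heading_level."""
    chunks = []
    for block in blocks:
        level = block["level"]
        starts_chunk = level != "body" and int(level[1:]) <= max_heading_level
        if starts_chunk or not chunks:
            # Content before the first qualifying heading gets a chunk with no heading
            chunks.append({"heading": block["text"] if starts_chunk else None, "text": []})
            if starts_chunk:
                continue
        chunks[-1]["text"].append(block["text"])
    return chunks


blocks = [
    {"level": "h1", "text": "Title"},
    {"level": "body", "text": "Intro."},
    {"level": "h2", "text": "Methods"},
    {"level": "h3", "text": "Setup"},
    {"level": "body", "text": "Details."},
    {"level": "h2", "text": "Results"},
    {"level": "body", "text": "Numbers."},
]
for chunk in chunk_by_heading(blocks):
    print(chunk["heading"], "->", " ".join(chunk["text"]))
```

Deeper headings (h3 and below with the default setting) stay inside the chunk of their parent section, so each chunk carries a complete section rather than a fixed number of characters.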

Document Analysis

Extract and analyze document structure programmatically for compliance checking, content validation, or metadata extraction.

Configuration Examples

Basic Hierarchy Extraction

using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

// Access hierarchy from pages
if (result.Pages != null)
{
    foreach (var page in result.Pages)
    {
        if (page.Hierarchy != null)
        {
            Console.WriteLine($"Page {page.PageNumber}: {page.Hierarchy.BlockCount} blocks");
            foreach (var block in page.Hierarchy.Blocks)
            {
                Console.WriteLine($"  [{block.Level}] {block.Text[..Math.Min(50, block.Text.Length)]}...");
            }
        }
    }
}

package main

import (
    "fmt"
    "kreuzberg"
)

func main() {
    config := &kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            Hierarchy: &kreuzberg.HierarchyConfig{
                Enabled: true,
            },
        },
    }

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        panic(err)
    }

    if result.Pages != nil {
        for _, page := range result.Pages {
            if page.Hierarchy != nil {
                fmt.Printf("Page %d: %d blocks\n", page.PageNumber, page.Hierarchy.BlockCount)
                for _, block := range page.Hierarchy.Blocks {
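                    // Clamp the preview so blocks shorter than 50 bytes do not panic (min is built in since Go 1.21)
                    fmt.Printf("  [%s] %s...\n", block.Level, block.Text[:min(50, len(block.Text))])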
                }
            }
        }
    }
}

import com.kreuzberg.*;

public class BasicHierarchy {
    public static void main(String[] args) throws Exception {
        ExtractionConfig config = ExtractionConfig.builder()
            .pdfOptions(PdfConfig.builder()
                .hierarchy(HierarchyConfig.builder()
                    .enabled(true)
                    .build())
                .build())
            .build();

        ExtractionResult result = KreuzbergClient.extractFileSync("document.pdf", config);

        if (result.getPages() != null) {
            for (PageContent page : result.getPages()) {
                if (page.getHierarchy() != null) {
                    System.out.println("Page " + page.getPageNumber() + ": " +
                        page.getHierarchy().getBlockCount() + " blocks");
                    for (HierarchicalBlock block : page.getHierarchy().getBlocks()) {
                        String text = block.getText();
                        System.out.println("  [" + block.getLevel() + "] " +
                            text.substring(0, Math.min(50, text.length())) + "...");
                    }
                }
            }
        }
    }
}

from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig

config: ExtractionConfig = ExtractionConfig(
    pdf_options=PdfConfig(
        extract_metadata=True,
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=6,
            include_bbox=True,
            ocr_coverage_threshold=0.8
        )
    )
)

result = extract_file_sync("document.pdf", config=config)

# Access hierarchy information
for page in result.pages or []:
    if page.hierarchy:
        print(f"Page {page.page_number}: {page.hierarchy.block_count} blocks")
        for block in page.hierarchy.blocks:
            print(f"  [{block.level}] {block.text[:50]}...")

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    hierarchy: Kreuzberg::HierarchyConfig.new(
      enabled: true
    )
  )
)

result = Kreuzberg.extract_file_sync("document.pdf", config: config)

if result.pages
  result.pages.each do |page|
    if page.hierarchy
      puts "Page #{page.page_number}: #{page.hierarchy.block_count} blocks"
      page.hierarchy.blocks.each do |block|
        puts "  [#{block.level}] #{block.text[0..49]}..."
      end
    end
  end
end

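use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                enabled: true,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None::<&str>, &config)?;

    // Access hierarchy from pages
    for page in result.pages.iter().flat_map(|p| p.iter()) {
        if let Some(hierarchy) = &page.hierarchy {
            println!("Page {}: {} blocks", page.page_number, hierarchy.block_count);
            for block in &hierarchy.blocks {
                // Take at most 50 characters so short blocks are handled safely
                let preview: String = block.text.chars().take(50).collect();
                println!("  [{}] {}...", block.level, preview);
            }
        }
    }
    Ok(())
}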

import { extractFileSync, ExtractionConfig, PdfConfig, HierarchyConfig } from 'kreuzberg';

const config: ExtractionConfig = {
    pdfOptions: {
        hierarchy: {
            enabled: true
        }
    }
};

const result = extractFileSync("document.pdf", config);

if (result.pages) {
    for (const page of result.pages) {
        if (page.hierarchy) {
            console.log(`Page ${page.pageNumber}: ${page.hierarchy.blockCount} blocks`);
            for (const block of page.hierarchy.blocks) {
                console.log(`  [${block.level}] ${block.text.substring(0, 50)}...`);
            }
        }
    }
}

Custom K-Clusters Configuration

Configure clustering granularity for different hierarchy detail levels:

using Kreuzberg;

// Fast mode: 3 clusters (h1, h3, body) - minimal detail
var fastConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 3  // Fast, identifies main structure only
        }
    }
};

// Balanced mode: 6 clusters (h1-h5 + body) - default, recommended
var balancedConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 6  // Balanced detail
        }
    }
};

// Detailed mode: 7 clusters (h1-h6 + distinct body) - maximum detail
var detailedConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 7  // Maximum detail with body text separation
        }
    }
};

from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig

# Fast mode: 3 clusters
fast_config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=3  # Fast, identifies main structure only
        )
    )
)

# Balanced mode: 6 clusters (recommended)
balanced_config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=6  # Balanced detail
        )
    )
)

# Detailed mode: 7 clusters
detailed_config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=7  # Maximum detail with body text separation
        )
    )
)

result = extract_file_sync("document.pdf", config=balanced_config)

use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig};

fn main() -> kreuzberg::Result<()> {
    // Fast mode: 3 clusters
    let fast_config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                k_clusters: 3,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    // Balanced mode: 6 clusters (recommended)
    let balanced_config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                k_clusters: 6,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    // Detailed mode: 7 clusters
    let detailed_config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                k_clusters: 7,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None::<&str>, &balanced_config)?;
    Ok(())
}

OCR Coverage Threshold

Smart OCR triggering based on text coverage:

using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            OcrCoverageThreshold = 0.5f  // Trigger OCR if <50% of page has text
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(
            enabled=True,
            ocr_coverage_threshold=0.5  # Trigger OCR if <50% of page has text
        )
    )
)

result = extract_file_sync("document.pdf", config=config)

use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                ocr_coverage_threshold: Some(0.5),
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None::<&str>, &config)?;
    Ok(())
}

Disabling Bounding Boxes

Reduce output size by excluding spatial information:

using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            IncludeBbox = false  // Exclude bounding boxes
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(
            enabled=True,
            include_bbox=False  # Exclude bounding boxes
        )
    )
)

result = extract_file_sync("document.pdf", config=config)

Performance Tuning

K-clusters Selection

Choose k_clusters based on your performance vs. detail requirements:

Setting	Speed	Detail	Best For
`k_clusters=3`	Very Fast	Minimal (h1, h3, body)	Quick document structure identification, real-time processing
`k_clusters=6`	Balanced	Standard (h1-h5, body)	General purpose, RAG systems, recommended default
`k_clusters=7`	Moderate	Detailed (h1-h6, separate body)	Fine-grained content analysis, content organization

Bounding Box Optimization

Include bounding boxes (include_bbox=true, default) when:

Building visually aware document processors
Correlating text with its position in the document
Processing layout-sensitive documents (brochures, forms)

Exclude bounding boxes (include_bbox=false) when:

Minimizing output size for network transmission
Bandwidth is constrained
Spatial information is not needed

Excluding bounding boxes typically reduces output size by 10-15%.

OCR Integration

The ocr_coverage_threshold parameter enables smart OCR triggering:

if text_block_coverage < ocr_coverage_threshold:
    run_ocr()  # Trigger OCR on pages with insufficient text coverage

Common Scenarios:

ocr_coverage_threshold=0.5: Trigger OCR on pages where text blocks cover less than 50% of the page (typically scanned or image-heavy pages)
ocr_coverage_threshold=0.8: Trigger OCR more aggressively, on any page where text blocks cover less than 80% of the page
ocr_coverage_threshold=None: Disable smart OCR triggering and rely on the force_ocr flag
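
The decision rule can be made concrete with a small sketch. How Kreuzberg measures coverage internally is not described here, so `text_coverage` below is only one plausible definition (summed bounding box area divided by page area); both function names are hypothetical.

```python
def text_coverage(bboxes, page_width, page_height):
    """Fraction of the page area covered by text block boxes.

    Assumption: boxes are [left, top, right, bottom]; overlapping boxes are
    counted more than once, so the result is capped at 1.0.
    """
    page_area = page_width * page_height
    if page_area <= 0:
        return 0.0
    covered = sum(max(0.0, r - l) * max(0.0, b - t) for l, t, r, b in bboxes)
    return min(1.0, covered / page_area)


def should_run_ocr(coverage, ocr_coverage_threshold):
    """Apply the documented rule: OCR runs when coverage is below the threshold."""
    if ocr_coverage_threshold is None:
        return False  # smart triggering disabled
    return coverage < ocr_coverage_threshold


coverage = text_coverage([[50.0, 100.0, 500.0, 130.0]], page_width=612.0, page_height=792.0)
print(should_run_ocr(coverage, 0.5))   # True: a single title line covers very little of the page
print(should_run_ocr(0.6, 0.5))        # False: 60% coverage is above the 50% threshold
print(should_run_ocr(0.6, 0.8))        # True: the same page is below the 80% threshold
```

A higher threshold therefore sends more pages to OCR, not fewer.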

Output Format

PageHierarchy Structure

The extracted hierarchy is returned in PageContent.hierarchy when pages are extracted:

{
  "block_count": 4,
  "blocks": [
    {
      "text": "Document Title",
      "font_size": 24.0,
      "level": "h1",
      "bbox": [50.0, 100.0, 500.0, 130.0]
    },
    {
      "text": "Introduction",
      "font_size": 18.0,
      "level": "h2",
      "bbox": [50.0, 150.0, 300.0, 175.0]
    },
    {
      "text": "This is the introductory paragraph with standard body text content.",
      "font_size": 12.0,
      "level": "body",
      "bbox": [50.0, 200.0, 500.0, 250.0]
    },
    {
      "text": "Key Findings",
      "font_size": 18.0,
      "level": "h2",
      "bbox": [50.0, 280.0, 300.0, 305.0]
    }
  ]
}

Field Meanings

block_count: Total number of hierarchical blocks on the page
blocks: Array of hierarchical blocks
- text: The text content of the block
- font_size: Font size in points (useful for verification and styling)
- level: Hierarchy level - “h1” through “h6” for headings, “body” for body text
- bbox: Optional bounding box as [left, top, right, bottom] in PDF units (points). Only present when include_bbox=true
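
When the hierarchy is consumed as JSON, the fields above map naturally onto a small typed model. The sketch below uses only the Python standard library; `Block`, `parse_hierarchy`, and `bbox_size` are hypothetical names, not Kreuzberg types.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    text: str
    font_size: float
    level: str
    bbox: Optional[List[float]] = None  # missing when include_bbox=false


def parse_hierarchy(payload: str) -> List[Block]:
    """Parse a PageHierarchy JSON document into Block objects."""
    data = json.loads(payload)
    return [
        Block(b["text"], b["font_size"], b["level"], b.get("bbox"))
        for b in data["blocks"]
    ]


def bbox_size(bbox: List[float]) -> Tuple[float, float]:
    """Width and height of a [left, top, right, bottom] box, in points."""
    left, top, right, bottom = bbox
    return right - left, bottom - top


sample = """
{"block_count": 1,
 "blocks": [{"text": "Document Title", "font_size": 24.0, "level": "h1",
             "bbox": [50.0, 100.0, 500.0, 130.0]}]}
"""
blocks = parse_hierarchy(sample)
print(blocks[0].level, bbox_size(blocks[0].bbox))  # h1 (450.0, 30.0)
```

Reading `bbox` with `.get` keeps the parser working for output produced with include_bbox=false.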

Accessing Hierarchy in Code

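from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig, PageConfig

# Page extraction must be enabled for hierarchy data to be populated
config = ExtractionConfig(
    pdf_options=PdfConfig(hierarchy=HierarchyConfig(enabled=True)),
    pages=PageConfig(extract_pages=True)
)

result = extract_file_sync("document.pdf", config=config)

for page in result.pages or []:
    if page.hierarchy:
        # Get all h1 headings
        h1_blocks = [b for b in page.hierarchy.blocks if b.level == "h1"]

        # Get all heading levels (h1-h6)
        headings = [b for b in page.hierarchy.blocks if b.level.startswith("h")]

        # Build outline: h1 is flush left, deeper headings are indented further
        for block in headings:
            indent = int(block.level[1]) - 1
            print("  " * indent + block.text)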

for page in result.pages.iter().flat_map(|p| p.iter()) {
    if let Some(hierarchy) = &page.hierarchy {
        // Get all h1 headings
        let h1_blocks: Vec<_> = hierarchy.blocks
            .iter()
            .filter(|b| b.level == "h1")
            .collect();

        // Build outline
        for block in &hierarchy.blocks {
            let level = if block.level.starts_with('h') {
                block.level[1..].parse::<usize>().unwrap_or(0)
            } else {
                0
            };
            println!("{}{}", "  ".repeat(level), block.text);
        }
    }
}

Best Practices

Always enable page extraction when using hierarchy:
```
pages = PageConfig(extract_pages=True)
```
Hierarchy data is only populated when pages are extracted.
Use k_clusters=6 by default (recommended). It provides a good balance between detail and performance for most documents.
Include bounding boxes for RAG systems that need spatial awareness for relevance ranking.
Test ocr_coverage_threshold against your own document set to find the best OCR trigger point.
In RAG systems, split chunks at heading boundaries to preserve semantic relationships within context windows.

Example: Building a Table of Contents

Python

from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig, PageConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        hierarchy=HierarchyConfig(enabled=True, k_clusters=6)
    ),
    pages=PageConfig(extract_pages=True)
)

result = extract_file_sync("document.pdf", config=config)

toc = []
for page in result.pages or []:
    if page.hierarchy:
        for block in page.hierarchy.blocks:
            if block.level.startswith("h"):
                level = int(block.level[1])
                toc.append({
                    "level": level,
                    "text": block.text,
                    "page": page.page_number
                })

# Print hierarchical TOC
for entry in toc:
    indent = "  " * (entry["level"] - 1)
    print(f"{indent}{entry['text']} (p. {entry['page']})")

PageConfig

Configuration for page extraction and tracking.

Controls whether to extract per-page content and how to mark page boundaries in the combined text output.

Configuration

Field	Type	Default	Description
`extract_pages`	`bool`	`false`	Extract pages as a separate array in the results
`insert_page_markers`	`bool`	`false`	Insert page markers into the combined content string
`marker_format`	`String`	`"\n\n<!-- PAGE {page_num} -->\n\n"`	Template for page markers (use `{page_num}` placeholder)

Example

var config = new ExtractionConfig
{
    Pages = new PageConfig
    {
        ExtractPages = true,
        InsertPageMarkers = true,
        MarkerFormat = "\n\n--- Page {page_num} ---\n\n"
    }
};

config := &ExtractionConfig{
    Pages: &PageConfig{
        ExtractPages:      true,
        InsertPageMarkers: true,
        MarkerFormat:      "\n\n--- Page {page_num} ---\n\n",
    },
}

var config = ExtractionConfig.builder()
    .pages(PageConfig.builder()
        .extractPages(true)
        .insertPageMarkers(true)
        .markerFormat("\n\n--- Page {page_num} ---\n\n")
        .build())
    .build();

config = ExtractionConfig(
    pages=PageConfig(
        extract_pages=True,
        insert_page_markers=True,
        marker_format="\n\n--- Page {page_num} ---\n\n"
    )
)

config = ExtractionConfig.new(
  pages: PageConfig.new(
    extract_pages: true,
    insert_page_markers: true,
    marker_format: "\n\n--- Page {page_num} ---\n\n"
  )
)

let config = ExtractionConfig {
    pages: Some(PageConfig {
        extract_pages: true,
        insert_page_markers: true,
        marker_format: "\n\n--- Page {page_num} ---\n\n".to_string(),
    }),
    ..Default::default()
};

const config: ExtractionConfig = {
  pages: {
    extractPages: true,
    insertPageMarkers: true,
    markerFormat: "\n\n--- Page {page_num} ---\n\n"
  }
};

Field Details

extract_pages: When true, populates ExtractionResult.pages with per-page content. Each page contains its text, tables, and images separately.

insert_page_markers: When true, inserts page markers into the combined content string at page boundaries. Useful for LLMs to understand document structure.

marker_format: Template string for page markers. Use the {page_num} placeholder for the page number. The default HTML comment format is LLM-friendly.
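
The marker template can also be used to split the combined content back into pages. The sketch below is illustrative only: `render_marker` and `split_on_markers` are hypothetical helpers, and it assumes that each marker precedes the content of its page and that the template contains {page_num} exactly once.

```python
import re

DEFAULT_MARKER = "\n\n<!-- PAGE {page_num} -->\n\n"


def render_marker(marker_format: str, page_num: int) -> str:
    """Fill the {page_num} placeholder of a marker template."""
    return marker_format.replace("{page_num}", str(page_num))


def split_on_markers(content: str, marker_format: str = DEFAULT_MARKER) -> dict:
    """Split combined content into {page_num: text} using the marker template."""
    before, _, after = marker_format.partition("{page_num}")
    pattern = re.compile(re.escape(before) + r"(\d+)" + re.escape(after))
    # With one capture group, split() yields [prefix, num, text, num, text, ...]
    parts = pattern.split(content)
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}


combined = (
    render_marker(DEFAULT_MARKER, 1) + "First page text"
    + render_marker(DEFAULT_MARKER, 2) + "Second page text"
)
print(split_on_markers(combined))  # {1: 'First page text', 2: 'Second page text'}
```

Escaping the template with `re.escape` keeps custom formats such as "--- Page {page_num} ---" working without treating their characters as regex syntax.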

Format Support

PDF: Full byte-accurate page tracking with O(1) lookup performance
PPTX: Slide boundary tracking with per-slide content
DOCX: Best-effort page break detection using explicit page breaks
Other formats: Page tracking not available (returns None/null)

ImageExtractionConfig

Configuration for extracting and processing images from documents.

Field	Type	Default	Description
`extract_images`	`bool`	`true`	Extract images from documents
`target_dpi`	`int`	`300`	Target DPI for extracted/normalized images
`max_image_dimension`	`int`	`4096`	Maximum image dimension (width or height) in pixels
`inject_placeholders`	`bool`	`true`	Inject image reference placeholders (for example `![Image](embedded:p1_i0)`) into markdown output. Set to `false` to extract images as data without modifying the text content.
`auto_adjust_dpi`	`bool`	`true`	Automatically adjust DPI based on image size and content
`min_dpi`	`int`	`72`	Minimum DPI when auto-adjusting
`max_dpi`	`int`	`600`	Maximum DPI when auto-adjusting

Example

using Kreuzberg;

var config = new ExtractionConfig
{
    Images = new ImageExtractionConfig
    {
        ExtractImages = true,
        TargetDpi = 200,
        MaxImageDimension = 2048,
        InjectPlaceholders = true, // set to false to extract images without markdown references
        AutoAdjustDpi = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Extracted: {result.Content[..Math.Min(100, result.Content.Length)]}");

package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func main() {
    targetDPI := 200
    maxDim := 2048
    result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
        ImageExtraction: &kreuzberg.ImageExtractionConfig{
            ExtractImages:      kreuzberg.BoolPtr(true),
            TargetDPI:          &targetDPI,
            MaxImageDimension:  &maxDim,
            InjectPlaceholders: kreuzberg.BoolPtr(true), // set to false to extract images without markdown references
            AutoAdjustDPI:      kreuzberg.BoolPtr(true),
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ImageExtractionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .imageExtraction(ImageExtractionConfig.builder()
        .extractImages(true)
        .targetDpi(200)
        .maxImageDimension(2048)
        .injectPlaceholders(true) // set to false to extract images without markdown references
        .autoAdjustDpi(true)
        .build())
    .build();

import asyncio
from kreuzberg import ExtractionConfig, ImageExtractionConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        images=ImageExtractionConfig(
            extract_images=True,
            target_dpi=200,
            max_image_dimension=2048,
            inject_placeholders=True,  # set to False to extract images without markdown references
            auto_adjust_dpi=True,
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Extracted: {result.content[:100]}")

asyncio.run(main())

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  images: Kreuzberg::Config::ImageExtraction.new(
    extract_images: true,
    target_dpi: 200,
    max_image_dimension: 2048,
    inject_placeholders: true, # set to false to extract images without markdown references
    auto_adjust_dpi: true
  )
)

library(kreuzberg)

ocr_cfg <- ocr_config(backend = "tesseract", language = "eng", dpi = 300L)
config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg)

result <- extract_file_sync("scan.png", "image/png", config)

cat(sprintf("Image extraction via OCR:\n"))
cat(sprintf("Content length: %d characters\n", nchar(result$content)))
cat(sprintf("Mime type: %s\n", result$mime_type))
cat(sprintf("Detected language: %s\n", result$detected_language))

use kreuzberg::{ExtractionConfig, ImageExtractionConfig};

fn main() {
    let config = ExtractionConfig {
        images: Some(ImageExtractionConfig {
            extract_images: Some(true),
            target_dpi: Some(200),
            max_image_dimension: Some(2048),
            inject_placeholders: Some(true), // set to false to extract images without markdown references
            auto_adjust_dpi: Some(true),
            ..Default::default()
        }),
        ..Default::default()
    };
    println!("{:?}", config.images);
}

import { extractFile } from '@kreuzberg/node';

const config = {
  images: {
    extractImages: true,
    targetDpi: 200,
    maxImageDimension: 2048,
    injectPlaceholders: true, // set to false to extract images without markdown references
    autoAdjustDpi: true,
  },
};

const result = await extractFile('document.pdf', null, config);
console.log(`Extracted ${result.images?.length ?? 0} images`);

ImagePreprocessingConfig

Image preprocessing configuration for improving OCR quality on scanned documents.

Field	Type	Default	Description
`target_dpi`	`int`	`300`	Target DPI for OCR processing (300 standard, 600 for small text)
`auto_rotate`	`bool`	`true`	Auto-detect and correct image rotation
`deskew`	`bool`	`true`	Correct skew (tilted images)
`denoise`	`bool`	`false`	Apply noise reduction filter
`contrast_enhance`	`bool`	`false`	Enhance image contrast for better text visibility
`binarization_method`	`str`	`"otsu"`	Binarization method: `"otsu"`, `"sauvola"`, `"adaptive"`, `"none"`
`invert_colors`	`bool`	`false`	Invert colors (useful for white text on black background)

Example

using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        TesseractConfig = new TesseractConfig
        {
            Preprocessing = new ImagePreprocessingConfig
            {
                TargetDpi = 300,
                Denoise = true,
                Deskew = true,
                ContrastEnhance = true,
                BinarizationMethod = "otsu"
            }
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("scanned.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");

package main

import (
  "log"

  "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func main() {
  targetDPI := 300
  config := &kreuzberg.ExtractionConfig{
    OCR: &kreuzberg.OCRConfig{
      Tesseract: &kreuzberg.TesseractConfig{
        Preprocessing: &kreuzberg.ImagePreprocessingConfig{
          TargetDPI:         &targetDPI,
          Denoise:           kreuzberg.BoolPtr(true),
          Deskew:            kreuzberg.BoolPtr(true),
          ContrastEnhance:   kreuzberg.BoolPtr(true),
          BinarizationMode:  kreuzberg.StringPtr("otsu"),
        },
      },
    },
  }

  result, err := kreuzberg.ExtractFileSync("document.pdf", config)
  if err != nil {
    log.Fatalf("extract failed: %v", err)
  }

  log.Println("content length:", len(result.Content))
}

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ImagePreprocessingConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .tesseractConfig(TesseractConfig.builder()
            .preprocessing(ImagePreprocessingConfig.builder()
                .targetDpi(300)
                .denoise(true)
                .deskew(true)
                .contrastEnhance(true)
                .binarizationMethod("otsu")
                .build())
            .build())
        .build())
    .build();

import asyncio
from kreuzberg import (
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ImagePreprocessingConfig,
    extract_file,
)

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            tesseract_config=TesseractConfig(
                preprocessing=ImagePreprocessingConfig(
                    target_dpi=300,
                    denoise=True,
                    deskew=True,
                    contrast_enhance=True,
                    binarization_method="otsu",
                )
            )
        )
    )
    result = await extract_file("scanned.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    tesseract_config: Kreuzberg::Config::Tesseract.new(
      preprocessing: Kreuzberg::Config::ImagePreprocessing.new(
        target_dpi: 300,
        denoise: true,
        deskew: true,
        contrast_enhance: true,
        binarization_method: 'otsu'
      )
    )
  )
)

library(kreuzberg)

dpi_settings <- c(150L, 300L, 600L)
results <- list()

for (dpi in dpi_settings) {
  ocr_cfg <- ocr_config(backend = "tesseract", language = "eng", dpi = dpi)
  config <- extraction_config(force_ocr = TRUE, ocr = ocr_cfg,
                              enable_quality_processing = TRUE)
  results[[as.character(dpi)]] <- extract_file_sync("scan.png", "image/png", config)
}

for (dpi in dpi_settings) {
  quality <- results[[as.character(dpi)]]$quality_score
  n_chars <- nchar(results[[as.character(dpi)]]$content)  # avoid shadowing base::length
  cat(sprintf("DPI %d: quality=%.2f, length=%d\n", dpi, quality, n_chars))
}

use kreuzberg::{ExtractionConfig, ImagePreprocessingConfig, OcrConfig, TesseractConfig};

fn main() {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            tesseract_config: Some(TesseractConfig {
                preprocessing: Some(ImagePreprocessingConfig {
                    target_dpi: 300,
                    denoise: true,
                    deskew: true,
                    contrast_enhance: true,
                    binarization_method: "otsu".to_string(),
                    ..Default::default()
                }),
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    println!("{:?}", config.ocr);
}

import { extractFile } from '@kreuzberg/node';

const config = {
  ocr: {
    backend: 'tesseract',
    tesseractConfig: {
      psm: 6,
      enableTableDetection: true,
    },
  },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

PostProcessorConfig

Configuration for the post-processing pipeline that runs after extraction.

Field	Type	Default	Description
`enabled`	`bool`	`true`	Enable post-processing pipeline
`enabled_processors`	`list[str]?`	`None`	Specific processors to enable (if None, all enabled by default)
`disabled_processors`	`list[str]?`	`None`	Specific processors to disable (takes precedence over enabled_processors)

Built-in post-processors include:

deduplication - Remove duplicate text blocks
whitespace_normalization - Normalize whitespace and line breaks
mojibake_fix - Fix mojibake (encoding corruption)
quality_scoring - Score and filter low-quality text
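The interaction between `enabled_processors` and `disabled_processors` can be pictured with a small, library-independent sketch. The helper name `resolve_processors` and the `BUILT_IN` list are illustrative assumptions, not part of Kreuzberg's API. The logic only mirrors the rules documented in the table above: `None` for `enabled_processors` means every processor, and `disabled_processors` takes precedence.

```python
from typing import List, Optional

# Built-in processor names as listed above.
BUILT_IN = [
    "deduplication",
    "whitespace_normalization",
    "mojibake_fix",
    "quality_scoring",
]


def resolve_processors(
    enabled: bool = True,
    enabled_processors: Optional[List[str]] = None,
    disabled_processors: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: return the processors that would run."""
    if not enabled:
        return []  # the whole pipeline is switched off
    # None means "no allow-list": start from every built-in processor.
    candidates = BUILT_IN if enabled_processors is None else enabled_processors
    blocked = set(disabled_processors or [])
    # The deny-list wins over the allow-list.
    return [name for name in candidates if name not in blocked]


print(resolve_processors(
    enabled_processors=["deduplication", "mojibake_fix"],
    disabled_processors=["mojibake_fix"],
))  # prints ['deduplication']
```

A processor named in both lists is dropped, which is the behavior to keep in mind when a shared config file and a per-call override disagree.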

Example

using Kreuzberg;

var config = new ExtractionConfig
{
    Postprocessor = new PostProcessorConfig
    {
        Enabled = true,
        EnabledProcessors = new List<string> { "deduplication" }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");

package main

import "github.com/kreuzberg-dev/kreuzberg-lts/v4"

func main() {
  enabled := true
  cfg := &kreuzberg.ExtractionConfig{
    Postprocessor: &kreuzberg.PostProcessorConfig{
      Enabled:            &enabled,
      EnabledProcessors:  []string{"deduplication", "whitespace_normalization"},
      DisabledProcessors: []string{"mojibake_fix"},
    },
  }

  _ = cfg
}

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.PostProcessorConfig;
import java.util.Arrays;

ExtractionConfig config = ExtractionConfig.builder()
    .postprocessor(PostProcessorConfig.builder()
        .enabled(true)
        .enabledProcessors(Arrays.asList("deduplication", "whitespace_normalization"))
        .disabledProcessors(Arrays.asList("mojibake_fix"))
        .build())
    .build();

import asyncio
from kreuzberg import ExtractionConfig, PostProcessorConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        postprocessor=PostProcessorConfig(
            enabled=True,
            enabled_processors=["deduplication"],
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  postprocessor: Kreuzberg::Config::PostProcessor.new(
    enabled: true,
    enabled_processors: ['deduplication', 'whitespace_normalization'],
    disabled_processors: ['mojibake_fix']
  )
)

library(kreuzberg)

config <- extraction_config(
  postprocessor = list(enabled = TRUE)
)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("Content length: %d characters\n", nchar(result$content)))
cat(sprintf("Mime type: %s\n", result$mime_type))

use kreuzberg::{ExtractionConfig, PostProcessorConfig};

fn main() {
    let config = ExtractionConfig {
        postprocessor: Some(PostProcessorConfig {
            enabled: Some(true),
            enabled_processors: Some(vec![
                "deduplication".to_string(),
                "whitespace_normalization".to_string(),
            ]),
            disabled_processors: Some(vec!["mojibake_fix".to_string()]),
        }),
        ..Default::default()
    };
    println!("{:?}", config.postprocessor);
}

import { extractFile } from '@kreuzberg/node';

const config = {
  postprocessor: {
    enabled: true,
    enabledProcessors: ['deduplication', 'whitespace_normalization'],
    disabledProcessors: ['mojibake_fix'],
  },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

TokenReductionConfig

Configuration for reducing token count in extracted text, useful for optimizing LLM context windows.

Field	Type	Default	Description
`mode`	`str`	`"off"`	Reduction mode: `"off"`, `"light"`, `"moderate"`, `"aggressive"`, `"maximum"`
`preserve_important_words`	`bool`	`true`	Preserve important words (capitalized, technical terms) during reduction

Reduction Modes

off: No token reduction
light: Remove redundant whitespace and line breaks (~5-10% reduction)
moderate: Light + remove stopwords in low-information contexts (~15-25% reduction)
aggressive: Moderate + abbreviate common phrases (~30-40% reduction)
maximum: Aggressive + remove all stopwords (~50-60% reduction, may impact quality)
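To see what the approximate percentages above mean for a context budget, the following sketch turns them into a rough size range. The `REDUCTION_RANGES` table copies the figures from the list above. The helper `estimate_size` is hypothetical and not part of Kreuzberg, and actual results depend on the document.

```python
# Approximate reduction ranges (fraction of tokens removed), copied from the list above.
REDUCTION_RANGES = {
    "off": (0.0, 0.0),
    "light": (0.05, 0.10),
    "moderate": (0.15, 0.25),
    "aggressive": (0.30, 0.40),
    "maximum": (0.50, 0.60),
}


def estimate_size(original_tokens: int, mode: str) -> tuple:
    """Hypothetical helper: (smallest, largest) expected token count after reduction."""
    low, high = REDUCTION_RANGES[mode]
    # The larger reduction fraction yields the smaller remaining size.
    smallest = round(original_tokens * (1 - high))
    largest = round(original_tokens * (1 - low))
    return (smallest, largest)


for mode in REDUCTION_RANGES:
    print(mode, estimate_size(10_000, mode))
```

For a 10,000-token document, `"moderate"` would leave roughly 7,500 to 8,500 tokens under these assumptions.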

Example

using Kreuzberg;

var config = new ExtractionConfig
{
    TokenReduction = new TokenReductionConfig
    {
        Mode = "moderate",
        PreserveImportantWords = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content length: {result.Content.Length}");

package main

import (
  "fmt"

  "github.com/kreuzberg-dev/kreuzberg-lts/v4"
)

func main() {
  config := &kreuzberg.ExtractionConfig{
    TokenReduction: &kreuzberg.TokenReductionConfig{
      Mode:                   "moderate",
      PreserveImportantWords: kreuzberg.BoolPtr(true),
    },
  }

  fmt.Printf("Mode: %s, Preserve Important Words: %v\n",
    config.TokenReduction.Mode,
    *config.TokenReduction.PreserveImportantWords)
}

import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.TokenReductionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .tokenReduction(TokenReductionConfig.builder()
        .mode("moderate")
        .preserveImportantWords(true)
        .build())
    .build();

from kreuzberg import ExtractionConfig, TokenReductionConfig

config: ExtractionConfig = ExtractionConfig(
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_important_words=True,
    )
)

require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  token_reduction: Kreuzberg::Config::TokenReduction.new(
    mode: 'moderate',
    preserve_markdown: true,
    preserve_code: true,
    language_hint: 'eng'
  )
)

library(kreuzberg)

config <- extraction_config(
  token_reduction = list(enabled = TRUE)
)

result <- extract_file_sync("document.pdf", "application/pdf", config)

cat(sprintf("Original content length: %d characters\n", nchar(result$content)))
cat(sprintf("Content preview: %.60s...\n", result$content))

use kreuzberg::{ExtractionConfig, TokenReductionConfig};

let config = ExtractionConfig {
    token_reduction: Some(TokenReductionConfig {
        mode: "moderate".to_string(),
        preserve_markdown: true,
        preserve_code: true,
        language_hint: Some("eng".to_string()),
        ..Default::default()
    }),
    ..Default::default()
};

import { extractFile } from '@kreuzberg/node';

const config = {
  tokenReduction: {
    mode: 'moderate',
    preserveImportantWords: true,
  },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

LayoutDetectionConfig v4.5.0

Configuration for ONNX-based document layout detection. Analyzes PDF pages to identify structural regions such as tables, figures, headers, and text blocks.

Feature Gate: Requires the layout-detection Cargo feature. Layout detection is only available when this feature is enabled.

Fields

Field	Type	Default	Description
`confidence_threshold`	`float?`	`None`	Confidence threshold override (0.0-1.0). If None, uses the model’s built-in default threshold
`apply_heuristics`	`bool`	`true`	Apply postprocessing heuristics (containment filtering, deduplication)
`table_model`	`str?`	`None` (uses `"tatr"`)	Table structure recognition model. Options: `"tatr"` (30MB, default), `"slanet_wired"` (365MB, bordered tables), `"slanet_wireless"` (365MB, borderless tables), `"slanet_plus"` (7.78MB, lightweight), `"slanet_auto"` (~737MB, classifier-routed).

Table Structure Models

Choose table_model based on the tables in your documents and your size budget:

tatr (30MB) — default. General-purpose Table Transformer; a good balance of accuracy and size for most documents.
slanet_wired (365MB) — tuned for bordered tables with visible ruling lines.
slanet_wireless (365MB) — tuned for borderless tables where structure is inferred from alignment.
slanet_plus (7.78MB) — lightweight SLANet variant for size-constrained deployments.
slanet_auto (~737MB) — bundles a classifier that routes each table to the best-fit SLANet model; highest accuracy at the largest footprint.
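The trade-off above can be sketched as a small selection helper. This is only an illustration: `pick_table_model` is a hypothetical function, and its preference order (specialized model first, then `tatr`, then `slanet_plus`) is an assumption of this sketch rather than a Kreuzberg recommendation. The sizes are the approximate figures from the list above.

```python
from typing import Optional

# Approximate model sizes in MB, taken from the list above.
MODEL_SIZES_MB = {
    "tatr": 30.0,
    "slanet_wired": 365.0,
    "slanet_wireless": 365.0,
    "slanet_plus": 7.78,
    "slanet_auto": 737.0,
}


def pick_table_model(budget_mb: float, bordered: Optional[bool] = None) -> str:
    """Hypothetical helper: choose a table_model value that fits a size budget.

    bordered=True  -> tables with visible ruling lines
    bordered=False -> borderless tables
    bordered=None  -> mixed or unknown
    """
    if bordered is None:
        preferred = ["slanet_auto", "tatr", "slanet_plus"]
    elif bordered:
        preferred = ["slanet_wired", "tatr", "slanet_plus"]
    else:
        preferred = ["slanet_wireless", "tatr", "slanet_plus"]
    # Return the first preferred model that fits within the budget.
    for name in preferred:
        if MODEL_SIZES_MB[name] <= budget_mb:
            return name
    raise ValueError(f"no table model fits in {budget_mb} MB")


print(pick_table_model(50))                   # prints tatr
print(pick_table_model(400, bordered=True))   # prints slanet_wired
```

The returned string can be passed as `table_model` in any of the configuration forms shown in this section.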

Configuration Examples

from kreuzberg import ExtractionConfig, LayoutDetectionConfig

config = ExtractionConfig(
    layout=LayoutDetectionConfig(
        confidence_threshold=0.5,
        apply_heuristics=True,
        table_model="slanet_auto",  # or "tatr", "slanet_wired", "slanet_wireless", "slanet_plus"
    )
)

import { extract } from "kreuzberg";

const result = await extract("document.pdf", {
  layout: {
    confidenceThreshold: 0.5,
    applyHeuristics: true,
    tableModel: "slanet_auto", // or "tatr", "slanet_wired", "slanet_wireless", "slanet_plus"
  },
});

use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig};

let config = ExtractionConfig {
    layout: Some(LayoutDetectionConfig {
        confidence_threshold: Some(0.5),
        apply_heuristics: true,
        table_model: Some("slanet_auto".to_string()),
        ..Default::default()
    }),
    ..Default::default()
};

Configuration File Examples

TOML

[layout]
confidence_threshold = 0.5
apply_heuristics = true
# table_model = "slanet_auto"

YAML

layout:
  confidence_threshold: 0.5
  apply_heuristics: true
  # table_model: slanet_auto

AccelerationConfig v4.5.0

Controls hardware acceleration for ONNX Runtime inference (layout detection and embeddings).

Fields

Field	Type	Default	Description
`provider`	`str`	`"auto"`	Execution provider: `"auto"`, `"cpu"`, `"coreml"`, `"cuda"`, `"tensorrt"`
`device_id`	`int`	`0`	GPU device ID (for CUDA/TensorRT)

Provider Behavior

auto: CoreML on macOS, CUDA on Linux x86_64 and Windows when available, CPU otherwise (see Platform Defaults)
cpu: CPU-only inference (always available)
coreml: Apple CoreML (macOS Neural Engine / GPU)
cuda: NVIDIA CUDA GPU acceleration
tensorrt: NVIDIA TensorRT (optimized CUDA inference)

Kreuzberg bundles a CPU-only ONNX Runtime by default. When a GPU provider (cuda, tensorrt, coreml) is explicitly requested and the corresponding execution provider is not available, Kreuzberg returns an error with instructions to install a GPU-enabled ONNX Runtime and set ORT_DYLIB_PATH. When auto is used, unavailable GPU providers fall back to CPU gracefully with an info-level log. To verify which provider is active, run with RUST_LOG=kreuzberg=info.

Platform Defaults

Platform	`provider="auto"` resolves to
macOS (arm64)	`coreml`
macOS (x86_64)	`coreml`
Linux (x86_64)	`cuda` if available, else `cpu`
Linux (aarch64)	`cpu`
Windows	`cuda` if available, else `cpu`

The device_id field only matters for cuda and tensorrt. Set it to the GPU index (0, 1, …) when running on multi-GPU hosts; it is ignored for every other provider.
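The Platform Defaults table can be read as a simple lookup. The sketch below restates it in Python. `resolve_auto_provider` is a hypothetical helper, and the `cuda_available` flag stands in for whatever availability check the runtime actually performs. It does not model the CPU fallback that applies when CoreML itself is unavailable.

```python
import platform


def resolve_auto_provider(system: str, machine: str, cuda_available: bool) -> str:
    """Hypothetical helper mirroring the Platform Defaults table above."""
    system = system.lower()
    machine = machine.lower()
    if system == "darwin":
        return "coreml"  # macOS, both arm64 and x86_64
    if system == "windows":
        return "cuda" if cuda_available else "cpu"
    if system == "linux" and machine in ("x86_64", "amd64"):
        return "cuda" if cuda_available else "cpu"
    return "cpu"  # Linux aarch64 and anything not listed in the table


# Describe the current host; CUDA availability is assumed to be False here.
print(resolve_auto_provider(platform.system(), platform.machine(), cuda_available=False))
```

To confirm which provider is actually selected, the documented approach remains running with RUST_LOG=kreuzberg=info.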

Configuration Examples

from kreuzberg import ExtractionConfig, AccelerationConfig

# Request CUDA on GPU 0; an explicitly requested GPU provider returns an error if it is unavailable (use "auto" for CPU fallback)
config = ExtractionConfig(
    acceleration=AccelerationConfig(provider="cuda", device_id=0)
)

# macOS: explicitly use CoreML for ONNX inference
coreml_config = ExtractionConfig(
    acceleration=AccelerationConfig(provider="coreml")
)

import { extract } from "kreuzberg";

const result = await extract("document.pdf", {
  acceleration: { provider: 'cuda', deviceId: 0 },
});

use kreuzberg::core::{ExtractionConfig, AccelerationConfig};

let config = ExtractionConfig {
    acceleration: Some(AccelerationConfig {
        provider: "cuda".to_string(),
        device_id: 0,
    }),
    ..Default::default()
};

Configuration File Examples

TOML

[acceleration]
provider = "cpu"
device_id = 0

YAML

acceleration:
  provider: cpu
  device_id: 0

ConcurrencyConfig v4.5.0

Controls thread pool and concurrency limits for Rayon parallelism, ONNX Runtime intra-op threading, and the batch extraction semaphore.

Fields

Field	Type	Default	Description
`max_threads`	`int?`	`None`	Maximum number of threads for the Rayon thread pool, ONNX Runtime intra-op parallelism, and batch concurrency. `None` applies no limit

Overview

Use ConcurrencyConfig to constrain resource usage on systems with limited hardware. When set, max_threads caps:

Rayon thread pool size for text extraction and parsing parallelism
ONNX Runtime intra-op parallelism for layout detection and embeddings inference
Batch extraction semaphore for limiting concurrent file extractions

Leaving max_threads unset (None, the default) applies no limit, so the underlying libraries may use all available cores.
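How a semaphore caps concurrent extractions can be shown without Kreuzberg at all. In the sketch below, `run_batch` and `extract_one` are hypothetical stand-ins for batch extraction, and the `asyncio.sleep` call is a placeholder for real work. It illustrates the general technique, not the library's internal implementation.

```python
import asyncio


async def run_batch(paths, max_threads):
    """Hypothetical stand-in for batch extraction with a concurrency cap."""
    # A semaphore with max_threads permits lets at most that many jobs run at once.
    semaphore = asyncio.Semaphore(max_threads)
    running = 0
    peak = 0

    async def extract_one(path):
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # placeholder for real extraction work
            running -= 1
            return f"extracted:{path}"

    results = await asyncio.gather(*(extract_one(p) for p in paths))
    return results, peak


results, peak = asyncio.run(run_batch([f"doc{i}.pdf" for i in range(10)], max_threads=4))
print(len(results), "files processed, peak concurrency:", peak)
```

All ten jobs complete, but no more than four are ever in flight at once, which is the effect a thread cap has on memory use and CPU contention.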

Configuration Examples

from kreuzberg import ExtractionConfig, ConcurrencyConfig

# Limit to 4 threads for constrained hardware
config = ExtractionConfig(
    concurrency=ConcurrencyConfig(max_threads=4)
)

import { extract } from "kreuzberg";

const result = await extract("document.pdf", {
  concurrency: { maxThreads: 4 },
});

use kreuzberg::core::{ExtractionConfig, ConcurrencyConfig};

let config = ExtractionConfig {
    concurrency: Some(ConcurrencyConfig {
        max_threads: Some(4),
    }),
    ..Default::default()
};

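package main

import "github.com/kreuzberg-dev/kreuzberg-lts/v4"

func main() {
    // MaxThreads is optional, so it is passed as a pointer
    maxThreads := 4
    config := &kreuzberg.ExtractionConfig{
        Concurrency: &kreuzberg.ConcurrencyConfig{
            MaxThreads: &maxThreads,
        },
    }

    _ = config
}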

ConcurrencyConfig concurrency = new ConcurrencyConfig(4);
ExtractionConfig config = new ExtractionConfig(
    /* ... other fields ... */
    Optional.of(concurrency)
);

using Kreuzberg;

var config = new ExtractionConfig
{
    Concurrency = new ConcurrencyConfig { MaxThreads = 4 }
};

TreeSitterConfig

Configuration for tree-sitter language pack integration. Controls grammar caching and code analysis options when extracting source code files. Requires the tree-sitter feature flag.

Fields

Field	Type	Default	Description
`enabled`	`bool`	`true`	Enable code intelligence processing. When `false`, tree-sitter analysis is skipped even if config is present
`cache_dir`	`PathBuf?`	`None`	Custom cache directory for downloaded grammars. Default: `~/.cache/tree-sitter-language-pack/v{version}/libs/`
`languages`	`Vec<String>?`	`None`	Languages to pre-download on init (for example, `["python", "rust"]`)
`groups`	`Vec<String>?`	`None`	Language groups to pre-download (for example, `["web", "systems", "scripting"]`)
`process`	`TreeSitterProcessConfig`	default	Processing options for code analysis

TreeSitterProcessConfig

Controls which analysis features are enabled when extracting code files.

Field	Type	Default	Description
`structure`	`bool`	`true`	Extract structural items (functions, classes, structs, etc.)
`imports`	`bool`	`true`	Extract import statements
`exports`	`bool`	`true`	Extract export statements
`comments`	`bool`	`false`	Extract comments
`docstrings`	`bool`	`false`	Extract docstrings
`symbols`	`bool`	`false`	Extract symbol definitions (variables, constants, type aliases)
`diagnostics`	`bool`	`false`	Include parse diagnostics (errors and warnings from tree-sitter)
`chunk_max_size`	`usize?`	`None`	Maximum chunk size in bytes. `None` uses the default chunking size
`content_mode`	`CodeContentMode`	`chunks`	Controls how code content is rendered in the `content` field: `chunks` (semantic chunks, default), `raw` (raw source code), or `structure` (function/class headings + docstrings, no code bodies)

Configuration Examples

[tree_sitter]
languages = ["python", "rust", "typescript"]
groups = ["web"]

[tree_sitter.process]
structure = true
imports = true
exports = true
comments = true
docstrings = true
symbols = false
diagnostics = false

use kreuzberg::{ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

let config = ExtractionConfig {
    tree_sitter: Some(TreeSitterConfig {
        process: TreeSitterProcessConfig {
            structure: true,
            imports: true,
            exports: true,
            comments: true,
            docstrings: true,
            ..Default::default()
        },
        ..Default::default()
    }),
    ..Default::default()
};

import kreuzberg

config = kreuzberg.ExtractionConfig(
    tree_sitter={
        "process": {
            "structure": True,
            "imports": True,
            "exports": True,
            "comments": True,
            "docstrings": True,
        }
    }
)

import { ExtractionConfig } from "@kreuzberg/node";

const config: ExtractionConfig = {
  treeSitter: {
    process: {
      structure: true,
      imports: true,
      exports: true,
      comments: true,
      docstrings: true,
    },
  },
};

config := &kreuzberg.ExtractionConfig{
    TreeSitter: &kreuzberg.TreeSitterConfig{
        Process: &kreuzberg.TreeSitterProcessConfig{
            Structure:  kreuzberg.BoolPtr(true),
            Imports:    kreuzberg.BoolPtr(true),
            Exports:    kreuzberg.BoolPtr(true),
            Comments:   kreuzberg.BoolPtr(true),
            Docstrings: kreuzberg.BoolPtr(true),
        },
    },
}

Configuration File Examples

TOML Format

use_cache = true
enable_quality_processing = true
force_ocr = false

[ocr]
backend = "tesseract"
language = "eng+fra"

[ocr.tesseract_config]
psm = 6
oem = 1
min_confidence = 0.8
enable_table_detection = true

[ocr.tesseract_config.preprocessing]
target_dpi = 300
denoise = true
deskew = true
contrast_enhance = true
binarization_method = "otsu"

[pdf_options]
extract_images = true
extract_metadata = true
passwords = ["password1", "password2"]

[images]
extract_images = true
target_dpi = 200
max_image_dimension = 4096

[chunking]
max_characters = 1000
overlap = 200

[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false

[token_reduction]
mode = "moderate"
preserve_important_words = true

[layout]
confidence_threshold = 0.5

[postprocessor]
enabled = true

YAML Format

# kreuzberg.yaml
use_cache: true
enable_quality_processing: true
force_ocr: false

ocr:
  backend: tesseract
  language: eng+fra
  tesseract_config:
    psm: 6
    oem: 1
    min_confidence: 0.8
    enable_table_detection: true
    preprocessing:
      target_dpi: 300
      denoise: true
      deskew: true
      contrast_enhance: true
      binarization_method: otsu

pdf_options:
  extract_images: true
  extract_metadata: true
  passwords:
    - password1
    - password2

images:
  extract_images: true
  target_dpi: 200
  max_image_dimension: 4096

chunking:
  max_characters: 1000
  overlap: 200

language_detection:
  enabled: true
  min_confidence: 0.8
  detect_multiple: false

token_reduction:
  mode: moderate
  preserve_important_words: true

layout:
  confidence_threshold: 0.5

postprocessor:
  enabled: true

JSON Format

{
  "use_cache": true,
  "enable_quality_processing": true,
  "force_ocr": false,
  "ocr": {
    "backend": "tesseract",
    "language": "eng+fra",
    "tesseract_config": {
      "psm": 6,
      "oem": 1,
      "min_confidence": 0.8,
      "enable_table_detection": true,
      "preprocessing": {
        "target_dpi": 300,
        "denoise": true,
        "deskew": true,
        "contrast_enhance": true,
        "binarization_method": "otsu"
      }
    }
  },
  "pdf_options": {
    "extract_images": true,
    "extract_metadata": true,
    "passwords": ["password1", "password2"]
  },
  "images": {
    "extract_images": true,
    "target_dpi": 200,
    "max_image_dimension": 4096
  },
  "chunking": {
    "max_characters": 1000,
    "overlap": 200
  },
  "language_detection": {
    "enabled": true,
    "min_confidence": 0.8,
    "detect_multiple": false
  },
  "token_reduction": {
    "mode": "moderate",
    "preserve_important_words": true
  },
  "layout": {
    "confidence_threshold": 0.5
  },
  "postprocessor": {
    "enabled": true
  }
}

For complete working examples, see the e2e test suites.

Best Practices

When to Use Config Files vs Programmatic Config

Use config files when:

Settings are shared across multiple scripts/applications
Configuration needs to be version controlled
Non-developers need to modify settings
Deploying to multiple environments (dev/staging/prod)

Use programmatic config when:

Settings vary per execution or are computed dynamically
Configuration depends on runtime conditions
Building SDKs or libraries that wrap Kreuzberg
Rapid prototyping and experimentation

Performance Considerations

Caching:

Keep use_cache=true for repeated processing of the same files
Cache is automatically invalidated when files change
Cache location: platform-specific global cache (for example, ~/.cache/kreuzberg/ on Linux, ~/Library/Caches/kreuzberg/ on macOS), configurable via KREUZBERG_CACHE_DIR env var or cache_dir option

OCR Settings:

Lower target_dpi (for example, 150-200) for faster processing of low-quality scans
Higher target_dpi (for example, 400-600) for small text or high-quality documents
Disable enable_table_detection if tables aren’t needed (10-20% speedup)
Use psm=6 for clean single-column documents (faster than psm=3)

Batch Processing:

Set max_concurrent_extractions to balance speed and memory usage
Default (num_cpus * 2) works well for most systems
Reduce for memory-constrained environments
Increase for I/O-bound workloads on systems with fast storage

Token Reduction:

Use "light" or "moderate" modes for minimal quality impact
"aggressive" and "maximum" modes may affect semantic meaning
Benchmark with your specific LLM to measure quality vs. cost tradeoff

Security Considerations

API Keys and Secrets:

Never commit config files containing API keys or passwords to version control
Use environment variables for sensitive data:
```
export KREUZBERG_OCR_API_KEY="your-key-here"
```
Add kreuzberg.toml to .gitignore if it contains secrets
Use separate config files for development vs. production

PDF Passwords:

The passwords field lists candidate passwords, which are tried in order until one succeeds
Passwords are not logged or cached

Use environment variables for sensitive passwords:

import os
from kreuzberg import PdfConfig

config = PdfConfig(passwords=[os.environ["PDF_PASSWORD"]])  # KeyError if unset

File System Access:

Kreuzberg only reads files you explicitly pass to extraction functions
Cache directory permissions should be restricted to the running user
Temporary files are automatically cleaned up after extraction

Data Privacy:

Extraction results are never sent to external services (except explicit OCR backends)
Tesseract OCR runs locally with no network access
EasyOCR and PaddleOCR may download models on first run (cached locally)
Consider disabling cache for sensitive documents requiring ephemeral processing

ApiSizeLimits

Configuration for API server request and file upload size limits.

Field	Type	Default	Description
`max_request_body_bytes`	`int`	`104857600`	Maximum size of entire request body in bytes (100 MB default)
`max_multipart_field_bytes`	`int`	`104857600`	Maximum size of individual file in multipart upload in bytes (100 MB default)

About Size Limits

Size limits protect your server from resource exhaustion and memory spikes. Both limits default to 100 MB, suitable for typical document processing workloads. Users can configure higher limits via environment variables for processing larger files.

Default Configuration:

Total request body: 100 MB (104,857,600 bytes)
Individual file: 100 MB (104,857,600 bytes)

Environment Variable Configuration:

# Set multipart field limit to 200 MB via environment variable
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=209715200
kreuzberg serve -H 0.0.0.0 -p 8000
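The byte values used above follow from 1 MB = 1024 × 1024 bytes. A minimal sketch of the arithmetic is shown below. `mb_to_bytes` is a hypothetical helper written for this illustration, not a Kreuzberg function.

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert megabytes (1 MB = 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


print(mb_to_bytes(100))  # prints 104857600 (the default limit)
print(mb_to_bytes(200))  # prints 209715200
# Build the shell line for a 200 MB multipart field limit.
print(f"export KREUZBERG_MAX_MULTIPART_FIELD_BYTES={mb_to_bytes(200)}")
```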

Example

using Kreuzberg;
using Kreuzberg.Api;

// Default limits: 100 MB for both request body and individual files
var limits = new ApiSizeLimits();

// Custom limits: 200 MB for both request body and individual files
var customLimits = ApiSizeLimits.FromMB(200, 200);

// Or specify byte values directly
var customLimits2 = new ApiSizeLimits
{
    MaxRequestBodyBytes = 200 * 1024 * 1024,
    MaxMultipartFieldBytes = 200 * 1024 * 1024
};

import "kreuzberg"

// Default limits: 100 MB for both request body and individual files
limits := kreuzberg.NewApiSizeLimits(
    100 * 1024 * 1024,
    100 * 1024 * 1024,
)

// Or use the convenience function for custom limits (distinct name: := cannot redeclare limits)
customLimits := kreuzberg.ApiSizeLimitsFromMB(200, 200)

import com.kreuzberg.api.ApiSizeLimits;

// Default limits: 100 MB for both request body and individual files
ApiSizeLimits limits = new ApiSizeLimits();

// Custom limits via convenience method
ApiSizeLimits customLimits = ApiSizeLimits.fromMB(200, 200);

// Or specify byte values
ApiSizeLimits customLimits2 = new ApiSizeLimits(
    200 * 1024 * 1024,
    200 * 1024 * 1024
);

from kreuzberg.api import ApiSizeLimits

# Default limits: 100 MB for both request body and individual files
limits = ApiSizeLimits()

# Custom limits via convenience method
limits = ApiSizeLimits.from_mb(200, 200)

# Or specify byte values
limits = ApiSizeLimits(
    max_request_body_bytes=200 * 1024 * 1024,
    max_multipart_field_bytes=200 * 1024 * 1024
)

require 'kreuzberg'

# Default limits: 100 MB for both request body and individual files
limits = Kreuzberg::Api::ApiSizeLimits.new

# Custom limits via convenience method
limits = Kreuzberg::Api::ApiSizeLimits.from_mb(200, 200)

# Or specify byte values
limits = Kreuzberg::Api::ApiSizeLimits.new(
  max_request_body_bytes: 200 * 1024 * 1024,
  max_multipart_field_bytes: 200 * 1024 * 1024
)

use kreuzberg::api::ApiSizeLimits;

// Default limits: 100 MB for both request body and individual files
let limits = ApiSizeLimits::default();

// Custom limits via convenience method
let limits = ApiSizeLimits::from_mb(200, 200);

// Or specify byte values
let limits = ApiSizeLimits::new(
    200 * 1024 * 1024,  // max_request_body_bytes
    200 * 1024 * 1024,  // max_multipart_field_bytes
);

import { ApiSizeLimits } from 'kreuzberg';

// Default limits: 100 MB for both request body and individual files
const limits = new ApiSizeLimits();

// Custom limits via convenience method
const customLimits = ApiSizeLimits.fromMb(200, 200);

// Or specify byte values
const customLimits2 = new ApiSizeLimits({
    maxRequestBodyBytes: 200 * 1024 * 1024,
    maxMultipartFieldBytes: 200 * 1024 * 1024
});

Configuration Scenarios

Use Case	Recommended Limit	Rationale
Small documents (standard PDFs, Office files)	100 MB (default)	Optimal for typical business documents
Medium documents (large scans, batches)	200 MB	Good balance for batching without excessive memory
Large documents (archives, high-res scans)	500-1000 MB	Suitable for specialized workflows with adequate RAM
Development/testing	50 MB	Conservative limit to catch issues early
Memory-constrained environments	50 MB	Prevents out-of-memory errors on limited systems

For comprehensive documentation including memory impact calculations, reverse proxy configuration, and troubleshooting, see the File Size Limits Reference.

Configuration Guide - Usage guide with examples
API Server Guide - HTTP API server setup and deployment
File Size Limits Reference - Complete size limits documentation with performance tuning
OCR Guide - OCR-specific configuration and troubleshooting
E2E Test Suites - Complete working examples