Output Formats¶

Choose the format that matches your downstream processing:

Unified (default) — Plain text/Markdown, for LLM prompts and full-text search
Element-Based — Flat array of typed elements with metadata, for RAG chunking and semantic search
Document Structure — Hierarchical tree with explicit parent-child references, for knowledge graphs and structured apps
PDF Hierarchy — Font-size classification into heading levels (H1–H6) for PDFs
Image Output Formats — Normalize extracted images to PNG, JPEG, WebP, HEIF, or SVG

Unified Output (Default)¶

No configuration required. The result contains:

content — Full document text with minimal formatting
pages — Per-page breakdown for PDFs, DOCX, and PPTX
tables — Extracted tables in structured format
images — Image metadata and paths

Image Output Formats¶

Normalize extracted images to a single format. Re-encoding runs after extraction and before any post-processors.

By default, images are returned in their native format (JPEG from PDFs, PNG from rasterization, etc.). Set ImageExtractionConfig.output_format to re-encode all images to a single target format. This is useful for cloud pipelines that require uniform storage, thumbnails, or downstream processing.

Supported Formats¶

Format	Quality param	Use case	Notes
`Native`	—	Default; preserve source format	No re-encode pass. Fastest.
`Png`	—	Lossless archival	Large file sizes; recommended for quality-critical workflows.
`Jpeg`	`1`–`100`	Web/cloud storage	Default quality `85`. Lossy; good balance of size and quality.
`Webp`	`1`–`100`	Modern web use	Default quality `80`. Better compression than JPEG; requires browser support.
`Heif`	`1`–`100`	Apple ecosystem	Default quality `80`. Requires `heic` feature. Superior compression ratio vs JPEG/WebP.
`Svg`	—	Archival of vector images	(v5+) Lossless vector output. Raster sources return a warning; not auto-vectorized. Requires `svg` feature.

SVG Support (v5+)¶

When the svg feature is active and output_format is set to Svg:

SVG → SVG: Re-parses the source via usvg and re-serializes. When svg.sanitize = true (default), strips <script> elements, external xlink:href/href attributes, <foreignObject> containers, and JavaScript event handlers. This is a lossy normalization for security.
SVG → PNG/JPEG/WebP/HEIF: Rasterizes to pixel format using resvg at the specified render_dpi (default 96.0, clamped 1.0–600.0 DPI).
Raster → SVG: Returns EncodeWarning::UnsupportedDirection; bytes are left untouched. No auto-vectorization.
Security: SVG input capped at 10 MB; rasterized output capped at 16384² pixels (~1 GB peak). External resource loading is disabled.
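To make the sanitization rules above concrete, here is a minimal sketch in Python using only the standard library. It is an illustration of the same four rules (drop script elements, drop foreignObject containers, drop event-handler attributes, drop external href targets), not Kreuzberg's implementation, which the text above says goes through usvg. The function name sanitize_svg and the "attribute name starts with on" heuristic are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"
BLOCKED_TAGS = {"script", "foreignObject"}
HREF_ATTRS = {"href", f"{{{XLINK_NS}}}href"}


def local_name(qualified: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on names."""
    return qualified.rsplit("}", 1)[-1]


def sanitize_svg(svg_text: str) -> str:
    """Hypothetical helper: apply the four sanitization rules to an SVG string."""
    ET.register_namespace("", SVG_NS)
    ET.register_namespace("xlink", XLINK_NS)
    root = ET.fromstring(svg_text)

    # Rules 1 and 2: remove <script> and <foreignObject> elements entirely.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) in BLOCKED_TAGS:
                parent.remove(child)

    # Rules 3 and 4: remove event handlers and external references.
    for element in root.iter():
        for name in list(element.attrib):
            value = element.attrib[name]
            if local_name(name).lower().startswith("on"):
                del element.attrib[name]  # crude match for onload, onclick, ...
            elif name in HREF_ATTRS and not value.startswith("#"):
                del element.attrib[name]  # keep same-document "#id" links only

    return ET.tostring(root, encoding="unicode")


dirty = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" onload="x()">'
    "<script>alert(1)</script>"
    '<use xlink:href="http://evil.example/a.svg#b"/>'
    '<use href="#local"/>'
    '<rect width="1" height="1" onclick="y()"/>'
    "</svg>"
)
clean = sanitize_svg(dirty)
print(clean)
```

The sketch shows why the step is described as lossy: anything removed here cannot be recovered from the output, so disable sanitization only for trusted inputs.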

Configuration¶

Python | TypeScript | Rust

from kreuzberg import ExtractionConfig, ImageExtractionConfig, ImageOutputFormat

# Normalize all images to WebP at quality 80
config = ExtractionConfig(
    images=ImageExtractionConfig(
        output_format=ImageOutputFormat.Webp(quality=80)
    )
)

import { ExtractionConfig, ImageOutputFormat } from "@kreuzberg/node";

const config: ExtractionConfig = {
  images: {
    outputFormat: { type: "webp", quality: 80 }
  }
};

use kreuzberg::{ExtractionConfig, ImageExtractionConfig, ImageOutputFormat};

let config = ExtractionConfig {
    images: Some(ImageExtractionConfig {
        output_format: ImageOutputFormat::Webp { quality: 80 },
        ..Default::default()
    }),
    ..Default::default()
};

SVG Sanitization (v5+)¶

Enable or disable SVG security filtering:

Python | Rust

from kreuzberg import ExtractionConfig, ImageExtractionConfig, ImageOutputFormat, SvgOptions

# Re-encode SVG with sanitization disabled
config = ExtractionConfig(
    images=ImageExtractionConfig(
        output_format=ImageOutputFormat.Svg,
        svg=SvgOptions(sanitize=False, render_dpi=96.0)
    )
)

use kreuzberg::{ExtractionConfig, ImageExtractionConfig, ImageOutputFormat, SvgOptions};

let config = ExtractionConfig {
    images: Some(ImageExtractionConfig {
        output_format: ImageOutputFormat::Svg,
        svg: Some(SvgOptions { sanitize: false, render_dpi: 96.0 }),
        ..Default::default()
    }),
    ..Default::default()
};

Element-Based Output¶

A flat array of typed elements (titles, paragraphs, tables, list items, code blocks, images, etc.). Each carries a page number; PDF text elements also carry bounding boxes when hierarchy extraction is enabled.

Use for RAG chunking, semantic search, or Unstructured.io-compatible pipelines.

Enable¶

Python | TypeScript | Rust | Go | Ruby | R | PHP

Element-Based Output (Python)

from kreuzberg import extract_file_sync, ExtractionConfig

# Configure element-based output
config = ExtractionConfig(result_format="element_based")

# Extract document
result = extract_file_sync("document.pdf", config=config)

# Access elements
for element in result.elements:
    print(f"Type: {element.element_type}")
    print(f"Text: {element.text[:100]}")

    if element.metadata.page_number:
        print(f"Page: {element.metadata.page_number}")

    if element.metadata.coordinates:
        coords = element.metadata.coordinates
        print(f"Coords: ({coords.left}, {coords.top}) - ({coords.right}, {coords.bottom})")

    print("---")

# Filter by element type
titles = [e for e in result.elements if e.element_type == "title"]
for title in titles:
    level = title.metadata.additional.get("level", "unknown")
    print(f"[{level}] {title.text}")

Element-Based Output (TypeScript)

import { extractFileSync, ExtractionConfig } from "@kreuzberg/node";

// Configure element-based output
const config: ExtractionConfig = {
  outputFormat: "element_based",
};

// Extract document
const result = extractFileSync("document.pdf", null, config);

// Access elements
for (const element of result.elements) {
  console.log(`Type: ${element.elementType}`);
  console.log(`Text: ${element.text.slice(0, 100)}`);

  if (element.metadata.pageNumber) {
    console.log(`Page: ${element.metadata.pageNumber}`);
  }

  if (element.metadata.coordinates) {
    const coords = element.metadata.coordinates;
    console.log(`Coords: (${coords.left}, ${coords.top}) - (${coords.right}, ${coords.bottom})`);
  }

  console.log("---");
}

// Filter by element type
const titles = result.elements.filter((e) => e.elementType === "title");
for (const title of titles) {
  const level = title.metadata.additional?.level || "unknown";
  console.log(`[${level}] ${title.text}`);
}

Element-Based Output (Rust)

use kreuzberg::{extract_file_sync, ExtractionConfig};
use kreuzberg::types::OutputFormat as ResultFormat;

fn main() -> kreuzberg::Result<()> {
    // Configure element-based output (result_format controls Unified vs ElementBased)
    let config = ExtractionConfig {
        result_format: ResultFormat::ElementBased,
        ..Default::default()
    };

    // Extract document
    let result = extract_file_sync("document.pdf", None, &config)?;

    // Access elements
    if let Some(elements) = result.elements {
        for element in &elements {
            println!("Type: {:?}", element.element_type);
            // Truncate by characters; slicing by bytes can panic inside a multi-byte character
            println!("Text: {}", element.text.chars().take(100).collect::<String>());

            if let Some(page) = element.metadata.page_number {
                println!("Page: {}", page);
            }

            if let Some(coords) = &element.metadata.coordinates {
                println!("Coords: ({}, {}) - ({}, {})",
                    coords.x0, coords.y0, coords.x1, coords.y1);
            }

            println!("---");
        }

        // Filter by element type
        let titles: Vec<_> = elements.iter()
            .filter(|e| matches!(e.element_type, kreuzberg::types::ElementType::Title))
            .collect();

        for title in titles {
            let level = title.metadata.additional.get("level")
                .map(|v| v.as_ref())
                .unwrap_or("unknown");
            println!("[{}] {}", level, title.text);
        }
    }

    Ok(())
}

Element-Based Output (Go)

package main

import (
    "fmt"

    kreuzberg "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)

func main() {
    // Configure element-based output
    config := &kreuzberg.ExtractionConfig{
        OutputFormat: "element_based",
    }

    // Extract document
    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        panic(err)
    }

    // Access elements
    for _, element := range result.Elements {
        fmt.Printf("Type: %s\n", element.ElementType)

        text := element.Text
        if len(text) > 100 {
            text = text[:100]
        }
        fmt.Printf("Text: %s\n", text)

        if element.Metadata.PageNumber != nil {
            fmt.Printf("Page: %d\n", *element.Metadata.PageNumber)
        }

        if element.Metadata.Coordinates != nil {
            coords := element.Metadata.Coordinates
            fmt.Printf("Coords: (%f, %f) - (%f, %f)\n",
                coords.Left, coords.Top, coords.Right, coords.Bottom)
        }

        fmt.Println("---")
    }

    // Filter by element type
    var titles []kreuzberg.Element
    for _, element := range result.Elements {
        if element.ElementType == "title" {
            titles = append(titles, element)
        }
    }

    for _, title := range titles {
        level, ok := title.Metadata.Additional["level"].(string)
        if !ok {
            level = "unknown"
        }
        fmt.Printf("[%s] %s\n", level, title.Text)
    }
}

Element-Based Output (Ruby)

require 'kreuzberg'

# Configure element-based output
config = Kreuzberg::ExtractionConfig.new(output_format: 'element_based')

# Extract document
result = Kreuzberg.extract_file_sync('document.pdf', config: config)

# Access elements
result.elements.each do |element|
  puts "Type: #{element.element_type}"
  puts "Text: #{element.text[0...100]}"

  puts "Page: #{element.metadata.page_number}" if element.metadata.page_number

  if element.metadata.coordinates
    coords = element.metadata.coordinates
    puts "Coords: (#{coords.left}, #{coords.top}) - (#{coords.right}, #{coords.bottom})"
  end

  puts "---"
end

# Filter by element type
titles = result.elements.select { |e| e.element_type == 'title' }
titles.each do |title|
  level = title.metadata.additional['level'] || 'unknown'
  puts "[#{level}] #{title.text}"
end

Element-Based Output (R)

library(kreuzberg)

config <- list(
  result_format = "element_based",
  output_format = "markdown"
)

json <- extract_file_sync("document.pdf", "application/pdf", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)

cat(sprintf("Total elements: %d\n\n", length(result$elements)))

for (i in seq_along(result$elements)) {
  element <- result$elements[[i]]
  cat(sprintf("Element %d:\n", i))
  cat(sprintf("  Type: %s\n", element$element_type))
  cat(sprintf("  Text: %s\n\n", substr(element$text, 1, 100)))
}

Element-Based Output (PHP)

<?php
use Kreuzberg\ExtractionConfig;
use Kreuzberg\Kreuzberg;

// Configure element-based output
$config = new ExtractionConfig();
$config->setOutputFormat('element_based');

// Extract document
$result = Kreuzberg::extractFileSync('document.pdf', $config);

// Access elements
foreach ($result->getElements() as $element) {
    echo "Type: " . $element->getElementType() . "\n";
    echo "Text: " . substr($element->getText(), 0, 100) . "\n";

    if ($element->getMetadata()->getPageNumber()) {
        echo "Page: " . $element->getMetadata()->getPageNumber() . "\n";
    }

    if ($element->getMetadata()->getCoordinates()) {
        $coords = $element->getMetadata()->getCoordinates();
        echo sprintf("Coords: (%s, %s) - (%s, %s)\n",
            $coords->getLeft(), $coords->getTop(),
            $coords->getRight(), $coords->getBottom());
    }

    echo "---\n";
}

// Filter by element type
$titles = array_filter($result->getElements(), function($e) {
    return $e->getElementType() === 'title';
});

foreach ($titles as $title) {
    $level = $title->getMetadata()->getAdditional()['level'] ?? 'unknown';
    echo "[{$level}] {$title->getText()}\n";
}
?>

Elements are in result.elements. Each element has element_id, element_type, text, and metadata.

Element Types¶

`element_type`	Description	Key `additional` fields
`title`	Main title or top-level heading	`level` (h1–h6), `font_size`, `font_name`
`heading`	Section/subsection heading	`level` (h1–h6)
`narrative_text`	Body paragraph	—
`list_item`	Bullet, numbered, or indented item	`list_type`, `list_marker`, `indent_level`
`table`	Tabular data	`row_count`, `column_count`, `format`
`image`	Embedded image	`format`, `width`, `height`, `alt_text`
`code_block`	Code snippet	`language`, `line_count`
`block_quote`	Quoted text	—
`header`	Recurring page header	`position`
`footer`	Recurring page footer	`position`
`page_break`	Page boundary marker	`next_page`

Metadata¶

Every element's metadata contains:

Field	Type	Description
`page_number`	`int | None`	1-indexed page number (PDF, DOCX, PPTX)
`filename`	`str | None`	Source filename
`coordinates`	`BoundingBox | None`	`x0`, `y0`, `x1`, `y1` in PDF points. Only populated for text elements when `pdf_options.hierarchy` is enabled with `include_bbox=True`. Table and image elements do not carry coordinates.
`element_index`	`int`	Zero-based position in the elements array
`additional`	`dict[str, str]`	Element-type-specific fields (see table above)

PDF coordinates use bottom-left origin in points (1/72 inch).
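Most image and UI toolkits use a top-left origin in pixels, so a conversion is often needed before drawing a box on a rendered page. The sketch below shows the arithmetic: scale by dpi / 72 and flip the vertical axis against the page height. It assumes that y1 is the upper edge of the box (y1 > y0) and that you know the page height in points; the helper name and the 792-point (US Letter) page height are illustrative, not part of the API.

```python
def pdf_bbox_to_pixels(bbox, page_height_pt, dpi=96.0):
    """Convert a bottom-left-origin bbox in points to top-left-origin pixels."""
    scale = dpi / 72.0  # 72 points per inch
    return {
        "left": bbox["x0"] * scale,
        "top": (page_height_pt - bbox["y1"]) * scale,     # upper edge, flipped
        "right": bbox["x1"] * scale,
        "bottom": (page_height_pt - bbox["y0"]) * scale,  # lower edge, flipped
    }


# Assumed sample values: a heading near the top of a US Letter page.
bbox = {"x0": 72.0, "y0": 700.0, "x1": 540.0, "y1": 730.0}
pixels = pdf_bbox_to_pixels(bbox, page_height_pt=792.0, dpi=144.0)
print(pixels)
# {'left': 144.0, 'top': 124.0, 'right': 1080.0, 'bottom': 184.0}
```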

Example Output¶

{
  "element_id": "elem-a3f2b1c4",
  "element_type": "title",
  "text": "Introduction to Machine Learning",
  "metadata": {
    "page_number": 1,
    "element_index": 0,
    "coordinates": { "x0": 72.0, "y0": 700.0, "x1": 540.0, "y1": 730.0 },
    "additional": { "level": "h1", "font_size": "24" }
  }
}
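Because elements arrive as a flat, ordered list, a simple RAG chunking strategy is to start a new chunk at every title or heading and append the text elements that follow it. The sketch below works on plain dictionaries shaped like the example above; the function name, the choice of which element types count as text, and the sample data are assumptions for illustration.

```python
HEADING_TYPES = {"title", "heading"}
TEXT_TYPES = {"narrative_text", "list_item", "code_block", "block_quote"}


def chunk_by_heading(elements):
    """Group text elements under the most recent title or heading."""
    chunks = []
    current = None
    for element in elements:
        kind = element["element_type"]
        page = element["metadata"].get("page_number")
        if kind in HEADING_TYPES:
            current = {"heading": element["text"], "page": page, "parts": []}
            chunks.append(current)
        elif kind in TEXT_TYPES:
            if current is None:  # text that appears before any heading
                current = {"heading": None, "page": page, "parts": []}
                chunks.append(current)
            current["parts"].append(element["text"])
        # other types (page_break, header, footer, ...) are skipped here
    return [
        {"heading": c["heading"], "page": c["page"], "text": "\n".join(c["parts"])}
        for c in chunks
    ]


elements = [
    {"element_type": "title", "text": "Introduction", "metadata": {"page_number": 1}},
    {"element_type": "narrative_text", "text": "ML is a field of study.", "metadata": {"page_number": 1}},
    {"element_type": "page_break", "text": "", "metadata": {"page_number": 1}},
    {"element_type": "heading", "text": "Supervised Learning", "metadata": {"page_number": 2}},
    {"element_type": "list_item", "text": "Classification", "metadata": {"page_number": 2}},
    {"element_type": "list_item", "text": "Regression", "metadata": {"page_number": 2}},
]
chunks = chunk_by_heading(elements)
for chunk in chunks:
    print(chunk)
```

Keeping the heading and page number on each chunk makes it easy to cite the source section when a chunk is retrieved.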

Filtering Elements¶

from kreuzberg import extract_file_sync, ExtractionConfig

config = ExtractionConfig(result_format="element_based")
result = extract_file_sync("document.pdf", config=config)

titles = [e for e in result.elements if e.element_type == "title"]
tables = [e for e in result.elements if e.element_type == "table"]

for title in titles:
    level = title.metadata.additional.get("level", "h1")
    print(f"[{level}] {title.text}")

Migrating from Unstructured.io¶

If you're migrating from Unstructured.io, element-based output follows a similar structure with these key differences:

Aspect	Unstructured.io	Kreuzberg
Type names	PascalCase (`Title`, `NarrativeText`)	snake_case (`title`, `narrative_text`)
Element IDs	Not always present	Always present (deterministic hash)
Metadata	Basic (`page_number`, `filename`)	Extended (coordinates, `additional` fields)
Config key	—	`result_format="element_based"`
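If existing code switches on Unstructured.io type names, the naming difference in the table can be bridged mechanically. The sketch below converts PascalCase to snake_case and checks the result against the element types listed on this page. It only handles the naming convention; whether a given Unstructured.io type has a Kreuzberg counterpart is not covered here, and the helper name is hypothetical.

```python
import re

# Element types documented in the "Element Types" table on this page.
KREUZBERG_TYPES = {
    "title", "heading", "narrative_text", "list_item", "table", "image",
    "code_block", "block_quote", "header", "footer", "page_break",
}


def to_snake_case(pascal_name: str) -> str:
    """Insert '_' before every capital letter except the first, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", pascal_name).lower()


for name in ("Title", "NarrativeText"):
    converted = to_snake_case(name)
    print(name, "->", converted, "(documented)" if converted in KREUZBERG_TYPES else "(not listed)")
```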

Document Structure¶

A flat list of nodes with explicit parent-child index references — a traversable tree with heading levels, content layers, inline annotations, and structured table grids.

Use when you need hierarchical relationships between sections.

Comparison¶

Aspect	Unified (default)	Element-based	Document structure
Output shape	`content: string`	`elements: array`	`nodes: array` with index refs
Hierarchy	None	Inferred from levels	Explicit parent/child indices
Inline annotations	No	No	Bold, italic, links per node
Tables	`result.tables`	Table elements	`TableGrid` with cell coords
Content layers	Not classified	Not classified	body, header, footer, footnote
Best for	LLM prompts, full-text	RAG chunking	Knowledge graphs, structured apps

Enable¶

Python | TypeScript | Rust | Go | Java | C# | Ruby | R

Document Structure Config (Python)

from kreuzberg import extract_file_sync, ExtractionConfig

# Enable document structure output
config = ExtractionConfig(include_document_structure=True)

result = extract_file_sync("document.pdf", config=config)

# Access the document tree
if result.document:
    for node in result.document["nodes"]:
        node_type = node["content"]["node_type"]
        text = node["content"].get("text", "")
        print(f"[{node_type}] {text[:80]}")

Document Structure Config (TypeScript)

import { extractFileSync, ExtractionConfig } from "@kreuzberg/node";

const config: ExtractionConfig = {
  includeDocumentStructure: true,
};

const result = extractFileSync("document.pdf", undefined, config);

if (result.document) {
  for (const node of result.document.nodes) {
    console.log(`[${node.content.nodeType}] ${node.content.text ?? ""}`);
  }
}

Document Structure Config (Rust)

use kreuzberg::{extract_file_sync, ExtractionConfig};

let config = ExtractionConfig {
    include_document_structure: true,
    ..Default::default()
};

let result = extract_file_sync("document.pdf", None, &config)?;

if let Some(document) = &result.document {
    for node in &document.nodes {
        let text = node.content.text().unwrap_or("");
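        // Truncate by characters; slicing by bytes can panic inside a multi-byte character
        let preview: String = text.chars().take(80).collect();
        println!("[{}] {}", node.content.node_type_str(), preview);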
    }
}

Document Structure Config (Go)

package main

import (
    "fmt"
    kreuzberg "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)

func main() {
    config := kreuzberg.NewExtractionConfig(
        kreuzberg.WithIncludeDocumentStructure(true),
    )

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        panic(err)
    }

    if result.Document != nil {
        for _, node := range result.Document.Nodes {
            fmt.Printf("[%s]\n", node.Content.NodeType)
        }
    }
}

Document Structure Config (Java)

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.ExtractionResult;

ExtractionConfig config = ExtractionConfig.builder()
    .includeDocumentStructure(true)
    .build();

ExtractionResult result = Kreuzberg.extractFileSync("document.pdf", config);

if (result.getDocumentStructure().isPresent()) {
    var document = result.getDocumentStructure().get();
    for (var node : document.nodes()) {
        System.out.println("[" + node.content().nodeType() + "]");
    }
}

Document Structure Config (C#)

using Kreuzberg;

var config = new ExtractionConfig
{
    IncludeDocumentStructure = true
};

var result = KreuzbergLib.ExtractFileSync("document.pdf", config);

if (result.Document is not null)
{
    foreach (var node in result.Document.Nodes)
    {
        Console.WriteLine($"[{node.Content.NodeType}]");
    }
}

Document Structure Config (Ruby)

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(include_document_structure: true)

result = Kreuzberg.extract_file_sync('document.pdf', config: config)

if result.document
  result.document['nodes'].each do |node|
    node_type = node['content']['node_type']
    text = node['content']['text'] || ''
    puts "[#{node_type}] #{text[0...80]}"
  end
end

Document Structure Config (R)

library(kreuzberg)

config <- list(
  include_document_structure = TRUE,
  output_format = "markdown"
)

json <- extract_file_sync("document.pdf", "application/pdf", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)

# Walk the document tree (nodes are returned as a list of lists)
if (!is.null(result$document)) {
  for (node in result$document$nodes) {
    node_type <- node$content$node_type
    text <- if (is.null(node$content$text)) "" else node$content$text
    cat(sprintf("[%s] %s\n", node_type, substr(text, 1, 80)))
  }
}

Node Shape¶

Each node in result.document.nodes:

{
  "id": "node-a3f2b1c4",
  "content": { "node_type": "heading", "level": 2, "text": "Supervised Learning" },
  "parent": 0,
  "children": [4, 5, 6],
  "content_layer": "body",
  "page": 5,
  "page_end": null,
  "bbox": { "x0": 72.0, "y0": 600.0, "x1": 400.0, "y1": 620.0 },
  "annotations": []
}

parent and children are integer indices into the nodes array (null if absent)
bbox is present when bounding box data is available
annotations contains inline formatting spans
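Since parent and children are indices rather than nested objects, walking the tree means starting from nodes without a parent and following children recursively. The following sketch prints an indented outline from a list of node dictionaries shaped like the example above. The sample nodes (including the assumption that a title node is the root) are made up for illustration.

```python
def outline(nodes):
    """Return indented '[node_type] text' lines by following child indices."""
    lines = []

    def visit(index, depth):
        node = nodes[index]
        content = node["content"]
        label = content.get("text", "")  # containers such as 'list' have no text
        lines.append(f"{'  ' * depth}[{content['node_type']}] {label}".rstrip())
        for child_index in node.get("children") or []:
            visit(child_index, depth + 1)

    for index, node in enumerate(nodes):
        if node.get("parent") is None:  # roots have no parent index
            visit(index, 0)
    return lines


nodes = [
    {"content": {"node_type": "title", "text": "ML Guide"}, "parent": None, "children": [1]},
    {"content": {"node_type": "heading", "level": 2, "text": "Supervised Learning"}, "parent": 0, "children": [2, 3]},
    {"content": {"node_type": "paragraph", "text": "Models learn from labeled data."}, "parent": 1, "children": []},
    {"content": {"node_type": "list", "ordered": False}, "parent": 1, "children": [4]},
    {"content": {"node_type": "list_item", "text": "Classification"}, "parent": 3, "children": []},
]
print("\n".join(outline(nodes)))
```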

Node Types¶

`node_type`	Key fields	Notes
`title`	`text`	Document title
`heading`	`level` (1–6), `text`	Section heading
`paragraph`	`text`	Body paragraph; may have `annotations`
`list`	`ordered` (bool)	Container; children are `list_item` nodes
`list_item`	`text`	Child of `list`
`table`	`grid` (TableGrid)	Grid with cell-level data
`image`	`description`, `image_index`	`image_index` references `result.images`
`code`	`text`, `language`	Code block
`quote`	(container)	Children are typically paragraphs
`formula`	`text`	Math formula (plain text, LaTeX, or MathML)
`footnote`	`text`	Usually `content_layer: "footnote"`
`group`	`label`, `heading_level`, `heading_text`	Section grouping container
`page_break`	(marker)	Page boundary

Content Layers¶

Layer	Description
`body`	Main document content
`header`	Page header area (repeated chapter titles)
`footer`	Page footer area (page numbers, copyright)
`footnote`	Footnotes and endnotes

# Keep only main-body nodes; process_main_content is a placeholder for your own handler
for node in result.document["nodes"]:
    if node["content_layer"] == "body":
        process_main_content(node)

Text Annotations¶

Paragraphs carry a list of annotations marking character spans:

{ "start": 0, "end": 16, "kind": { "annotation_type": "bold" } }

`annotation_type`	Extra fields
`bold`, `italic`, `underline`, `strikethrough`	—
`code`, `subscript`, `superscript`	—
`link`	`url`, `title` (optional)

for node in result.document["nodes"]:
    for ann in node.get("annotations", []):
        text = node["content"].get("text", "")
        span = text[ann["start"]:ann["end"]]
        kind = ann["kind"]["annotation_type"]
        if kind == "link":
            print(f"Link: {span} -> {ann['kind']['url']}")
        else:
            print(f"{kind}: {span}")
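Annotations can also be turned back into markup. The sketch below renders a subset of annotation types as Markdown by applying spans from the end of the text toward the start, so earlier offsets stay valid while the string grows. It assumes that spans do not overlap and that start and end are character offsets usable as Python string indices; both are assumptions, and the helper name is hypothetical.

```python
WRAPPERS = {"bold": "**", "italic": "*", "code": "`", "strikethrough": "~~"}


def annotations_to_markdown(text, annotations):
    """Apply non-overlapping annotation spans to text as Markdown markup."""
    result = text
    # Work right to left so that earlier offsets are not shifted by insertions.
    for ann in sorted(annotations, key=lambda a: a["start"], reverse=True):
        kind = ann["kind"]
        span = result[ann["start"]:ann["end"]]
        if kind["annotation_type"] == "link":
            replacement = f"[{span}]({kind['url']})"
        elif kind["annotation_type"] in WRAPPERS:
            marker = WRAPPERS[kind["annotation_type"]]
            replacement = f"{marker}{span}{marker}"
        else:
            continue  # underline, subscript, superscript: no plain Markdown form
        result = result[:ann["start"]] + replacement + result[ann["end"]:]
    return result


text = "Machine learning is covered in the guide."
link_start = text.index("guide")
annotations = [
    {"start": 0, "end": 16, "kind": {"annotation_type": "bold"}},
    {"start": link_start, "end": link_start + 5,
     "kind": {"annotation_type": "link", "url": "https://example.com/guide"}},
]
markdown = annotations_to_markdown(text, annotations)
print(markdown)
```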

Table Grid¶

Table nodes contain a grid with cell-level data:

{
  "rows": 3,
  "cols": 3,
  "cells": [
    { "content": "Algorithm", "row": 0, "col": 0, "row_span": 1, "col_span": 1, "is_header": true },
    {
      "content": "Decision Tree",
      "row": 1,
      "col": 0,
      "row_span": 1,
      "col_span": 1,
      "is_header": false
    }
  ]
}

Each cell has content, row, col, row_span, col_span, is_header, and optionally bbox.

for node in result.document["nodes"]:
    if node["content"]["node_type"] == "table":
        grid = node["content"]["grid"]
        rows, cols = grid["rows"], grid["cols"]
        table = [[None] * cols for _ in range(rows)]
        for cell in grid["cells"]:
            table[cell["row"]][cell["col"]] = cell["content"]
        for row in table:
            print(" | ".join(str(c or "") for c in row))
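The loop above writes each cell only at its anchor position, so merged cells leave gaps. One option, sketched below, is to copy a cell's content into every position covered by its row_span and col_span. Whether repeating the content is the right choice depends on your downstream format; the helper name and the sample grid are assumptions for illustration.

```python
def expand_grid(grid):
    """Build a rows x cols matrix, repeating content across spanned positions."""
    table = [[""] * grid["cols"] for _ in range(grid["rows"])]
    for cell in grid["cells"]:
        last_row = min(cell["row"] + cell.get("row_span", 1), grid["rows"])
        last_col = min(cell["col"] + cell.get("col_span", 1), grid["cols"])
        for r in range(cell["row"], last_row):
            for c in range(cell["col"], last_col):
                table[r][c] = cell["content"]
    return table


grid = {
    "rows": 2,
    "cols": 3,
    "cells": [
        {"content": "Model", "row": 0, "col": 0, "row_span": 1, "col_span": 2, "is_header": True},
        {"content": "Score", "row": 0, "col": 2, "row_span": 1, "col_span": 1, "is_header": True},
        {"content": "Tree", "row": 1, "col": 0, "row_span": 1, "col_span": 1, "is_header": False},
        {"content": "CART", "row": 1, "col": 1, "row_span": 1, "col_span": 1, "is_header": False},
        {"content": "0.9", "row": 1, "col": 2, "row_span": 1, "col_span": 1, "is_header": False},
    ],
}
expanded = expand_grid(grid)
for row in expanded:
    print(" | ".join(row))
```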

PDF Hierarchy Detection¶

Classifies PDF text blocks into heading levels (H1–H6) and body text via K-means clustering on font sizes: the cluster with the largest font size becomes H1, the next largest H2, and so on.
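The idea can be sketched in a few lines of pure Python: cluster the font sizes in one dimension, rank the cluster centers from largest to smallest, and label each block by its nearest center. This is an illustration of the technique, not Kreuzberg's implementation. In particular, treating the smallest cluster as body text and the way initial centers are chosen are assumptions of this sketch.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain 1-D k-means; returns the final cluster centers."""
    distinct = sorted(set(values))
    k = max(1, min(k, len(distinct)))  # cannot have more clusters than sizes
    if k == 1:
        centers = [distinct[0]]
    else:
        # Spread the initial centers evenly over the distinct sizes.
        centers = [distinct[(i * (len(distinct) - 1)) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        updated = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers


def assign_levels(font_sizes, k):
    """Label each font size 'h1', 'h2', ... by cluster rank; smallest is 'body'."""
    if not font_sizes:
        return []
    ranked = sorted(kmeans_1d(font_sizes, k), reverse=True)
    names = [f"h{i + 1}" for i in range(len(ranked) - 1)] + ["body"]  # assumption
    return [
        names[min(range(len(ranked)), key=lambda i: abs(size - ranked[i]))]
        for size in font_sizes
    ]


sizes = [24.0, 18.0, 18.0, 12.0, 12.0, 12.0, 12.0]
levels = assign_levels(sizes, k=3)
print(levels)
```

Because levels come from rank rather than absolute size, the same 18-point text could be H2 in one document and H1 in another.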

Quick Start¶

Python | TypeScript | Rust | Go | Java | C# | Ruby

Python

from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig

config: ExtractionConfig = ExtractionConfig(
    pdf_options=PdfConfig(
        extract_metadata=True,
        hierarchy=HierarchyConfig(
            enabled=True,
            k_clusters=6,
            include_bbox=True,
            ocr_coverage_threshold=0.8
        )
    )
)

result = extract_file_sync("document.pdf", config=config)

# Access hierarchy information
for page in result.pages or []:
    print(f"Page {page.page_number}:")
    print(f"  Content: {page.content[:100]}...")

TypeScript

import { extractFile } from "@kreuzberg/node";

const config = {
  pdfOptions: {
    extractMetadata: true,
    hierarchy: {
      enabled: true,
      kClusters: 6,
      includeBbox: true,
      ocrCoverageThreshold: 0.8,
    },
  },
};

const result = await extractFile("document.pdf", null, config);
if (result.pages) {
  result.pages.forEach((page) => {
    console.log(`Page ${page.pageNumber}:`);
    console.log(`  Content: ${page.content.substring(0, 100)}...`);
  });
}

Rust

use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig, HierarchyConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            hierarchy: Some(HierarchyConfig {
                enabled: true,
                detection_threshold: Some(0.75),
                ocr_coverage_threshold: Some(0.8),
                min_level: Some(1),
                max_level: Some(5),
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None::<&str>, &config)?;
    println!("Hierarchy levels: {}", result.hierarchy.len());
    Ok(())
}

Go

package main

import "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"

func main() {
    enabled := true
    includeBbox := true
    kClusters := uint(6)
    kClustersAdvanced := uint(8) // stays within the documented 2–10 range
    threshold := float32(0.8)

    // Basic hierarchy configuration
    config := kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            ExtractImages: true,
            Hierarchy: &kreuzberg.HierarchyConfig{
                Enabled:              &enabled,
                KClusters:            &kClusters,
                IncludeBbox:          &includeBbox,
                OcrCoverageThreshold: &threshold,
            },
        },
    }

    // Advanced hierarchy configuration with more clusters
    advancedConfig := kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            ExtractImages: true,
            Hierarchy: &kreuzberg.HierarchyConfig{
                Enabled:              &enabled,
                KClusters:            &kClustersAdvanced,
                IncludeBbox:          &includeBbox,
                OcrCoverageThreshold: &threshold,
            },
        },
    }

    _ = config
    _ = advancedConfig
}

Java

import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.PdfConfig;
import dev.kreuzberg.HierarchyConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .pdfOptions(PdfConfig.builder()
        .hierarchyConfig(HierarchyConfig.builder()
            .enabled(true)
            .detectionThreshold(0.75)
            .ocrCoverageThreshold(0.8)
            .minLevel(1)
            .maxLevel(5)
            .build())
        .build())
    .build();

C#

using Kreuzberg;

// Basic hierarchy configuration with properties
var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        ExtractImages = true,
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 6,
            IncludeBbox = true,
            OcrCoverageThreshold = 0.8f
        }
    }
};

var result = await KreuzbergLib.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content length: {result.Content.Length}");

// Advanced hierarchy detection with custom parameters
var advancedConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        ExtractImages = true,
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            KClusters = 8,            // More clusters for detailed hierarchy (documented range: 2–10)
            IncludeBbox = true,       // Include bounding box coordinates
            OcrCoverageThreshold = 0.7f  // Trigger OCR when text coverage falls below 70%
        }
    }
};

var advancedResult = await KreuzbergLib.ExtractFileAsync("complex_document.pdf", advancedConfig);
Console.WriteLine($"Advanced hierarchy detection completed: {advancedResult.Content.Length} chars");

// Minimal configuration with only enabled flag
var minimalConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = true,
            // Other properties use defaults:
            // KClusters = 6
            // IncludeBbox = true
        }
    }
};

var minimalResult = await KreuzbergLib.ExtractFileAsync("document.pdf", minimalConfig);
Console.WriteLine("Extraction with default hierarchy settings complete");

// Disabling hierarchy detection
var noHierarchyConfig = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Hierarchy = new HierarchyConfig
        {
            Enabled = false
        }
    }
};

var noHierarchyResult = await KreuzbergLib.ExtractFileAsync("document.pdf", noHierarchyConfig);
Console.WriteLine("Extraction without hierarchy detection complete");

Ruby

require 'kreuzberg'

# Using keyword arguments with defaults
config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    extract_images: true,
    hierarchy: Kreuzberg::HierarchyConfig.new(
      enabled: true,
      k_clusters: 6,
      include_bbox: true,
      ocr_coverage_threshold: 0.8
    )
  )
)

# Using hash syntax alternative
config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    extract_images: true,
    hierarchy: {
      enabled: true,
      k_clusters: 6,
      include_bbox: true,
      ocr_coverage_threshold: 0.8
    }
  )
)

Output¶

Hierarchy data is in result.pages[n].hierarchy. Each page has a blocks list:

{
  "block_count": 3,
  "blocks": [
    {
      "text": "Chapter 1: Introduction",
      "level": "h1",
      "font_size": 24.0,
      "bbox": [50.0, 100.0, 400.0, 125.0]
    },
    { "text": "Background", "level": "h2", "font_size": 18.0, "bbox": [50.0, 150.0, 300.0, 168.0] },
    {
      "text": "This chapter provides...",
      "level": "body",
      "font_size": 12.0,
      "bbox": [50.0, 200.0, 550.0, 450.0]
    }
  ]
}

bbox: [left, top, right, bottom] in PDF points (present when include_bbox=True). This is the only way to obtain bounding box coordinates for text elements — they are not included by default.
level: "h1" – "h6" or "body"
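As a usage sketch, the blocks list can be reduced to an indented outline by keeping only heading blocks and indenting by level. The code below operates on a plain dictionary shaped like the JSON above, so it makes no assumption about how the Python binding exposes the hierarchy object; the helper name is hypothetical.

```python
def page_outline(hierarchy):
    """Return heading texts indented by level, skipping body blocks."""
    lines = []
    for block in hierarchy["blocks"]:
        level = block["level"]
        if level == "body":
            continue
        depth = int(level[1:]) - 1  # "h1" -> 0, "h2" -> 1, ...
        lines.append("  " * depth + block["text"])
    return lines


hierarchy = {
    "block_count": 3,
    "blocks": [
        {"text": "Chapter 1: Introduction", "level": "h1", "font_size": 24.0},
        {"text": "Background", "level": "h2", "font_size": 18.0},
        {"text": "This chapter provides...", "level": "body", "font_size": 12.0},
    ],
}
print("\n".join(page_outline(hierarchy)))
```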

Configuration¶

Parameter	Type	Default	Description
`enabled`	`bool`	`true`	Enable hierarchy extraction
`k_clusters`	`int`	`6`	Font size clusters (2–10), maps to heading levels
`include_bbox`	`bool`	`true`	Include bounding box coordinates
`ocr_coverage_threshold`	`float | None`	`None`	Trigger OCR if text coverage is below this fraction

Choosing k_clusters¶

`k_clusters`	Heading levels	Use when
2–3	H1–H2	Simple documents with 1–2 heading sizes
4–5	H1–H4	Standard documents
6 (default)	H1–H6	Most documents
7–8	H1–H6+	Books, specs with deep nesting

ocr_coverage_threshold¶

Threshold	Behavior
`None`	OCR never triggered by coverage
`0.3`	OCR if < 30% of page has text
`0.5`	OCR if < 50% of page has text

Requires an OCR backend to be configured separately.
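The rule in the table reduces to a single comparison. The sketch below spells it out; how the library measures text coverage for a page is not described here, so the coverage value is simply passed in, and the function name is hypothetical.

```python
from typing import Optional


def should_trigger_ocr(text_coverage: float, threshold: Optional[float]) -> bool:
    """Return True when coverage falls below the configured threshold."""
    if threshold is None:  # None: coverage never triggers OCR
        return False
    return text_coverage < threshold


print(should_trigger_ocr(0.2, 0.3))   # True: only 20% of the page has text
print(should_trigger_ocr(0.4, 0.3))   # False: coverage is above 30%
print(should_trigger_ocr(0.0, None))  # False: threshold disabled
```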

Troubleshooting¶

hierarchy is None — Check hierarchy.enabled is True. If the PDF is image-only, enable OCR. If fewer text blocks than k_clusters, reduce k_clusters.
Most blocks classified as body — Document may use uniform font sizes. Reduce k_clusters (try 3–4).
Heading levels don't match visual inspection — Levels are assigned by font size rank, not absolute size. Filter on block.font_size directly for absolute thresholds.
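For the last case, a fixed mapping from font size to level can replace the rank-based labels. The sketch below applies absolute thresholds to block dictionaries shaped like the Output example; the threshold values (22 and 16 points) and the helper name are illustrative assumptions that you would tune per document set.

```python
def level_from_size(font_size, thresholds=((22.0, "h1"), (16.0, "h2"))):
    """Map a font size to a level using absolute minimum sizes, largest first."""
    for minimum, level in thresholds:
        if font_size >= minimum:
            return level
    return "body"


blocks = [
    {"text": "Chapter 1: Introduction", "font_size": 24.0},
    {"text": "Background", "font_size": 18.0},
    {"text": "This chapter provides...", "font_size": 12.0},
]
relabeled = [(block["text"], level_from_size(block["font_size"])) for block in blocks]
print(relabeled)
```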

See the HierarchyConfig reference for the full parameter list.
