Advanced Features

Kreuzberg provides text processing, analysis, and optimization features beyond basic extraction.

Text Chunking

flowchart TD
    Start[Extracted Text] --> Detect{Detect Format}

    Detect -->|Markdown| MarkdownChunker[Markdown Chunker]
    Detect -->|Plain Text| TextChunker[Text Chunker]

    MarkdownChunker --> MDStrategy[Structure-Aware Splitting]
    MDStrategy --> MDPreserve[Preserve:<br/>- Headings<br/>- Lists<br/>- Code blocks<br/>- Formatting]

    TextChunker --> TextStrategy[Generic Text Splitting]
    TextStrategy --> TextBoundaries[Smart Boundaries:<br/>- Whitespace<br/>- Punctuation<br/>- Sentence breaks]

    MDPreserve --> CreateChunks[Create Chunks]
    TextBoundaries --> CreateChunks

    CreateChunks --> Config[Apply ChunkingConfig]
    Config --> MaxChars[max_chars: Max size]
    Config --> Overlap[max_overlap: Overlap]

    MaxChars --> FinalChunks[Final Chunks]
    Overlap --> FinalChunks

    FinalChunks --> Metadata[Add Metadata:<br/>- char_start/end<br/>- chunk_index<br/>- total_chunks<br/>- token_count]

    Metadata --> Embeddings{Generate<br/>Embeddings?}
    Embeddings -->|Yes| AddEmbeddings[Add Embedding Vectors]
    Embeddings -->|No| Return[Return Chunks]

    AddEmbeddings --> Return

    style MarkdownChunker fill:#FFD700
    style TextChunker fill:#87CEEB
    style CreateChunks fill:#90EE90
    style AddEmbeddings fill:#FFB6C1

Split extracted text into chunks for downstream processing like RAG (Retrieval-Augmented Generation) systems, vector databases, or LLM context windows.

Overview

Kreuzberg uses the text-splitter library with two chunking strategies:

  • Text Chunker: Generic text splitting with smart boundaries (whitespace, punctuation)
  • Markdown Chunker: Structure-aware splitting that preserves headings, lists, code blocks, and formatting
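
Conceptually, max_chars bounds each chunk's size while max_overlap lets consecutive chunks share trailing text so context is not lost at boundaries. A minimal sketch of the idea in plain Python (fixed-width windows for illustration only, not the library's boundary-aware algorithm):

max_chars, max_overlap = 1000, 200
step = max_chars - max_overlap  # each new chunk starts 800 chars after the previous one

def naive_chunks(text: str) -> list[str]:
    # Fixed-width windows; the real chunkers additionally snap to whitespace,
    # punctuation, or Markdown structure as described above.
    return [text[i:i + max_chars] for i in range(0, len(text), step)]

print(len(naive_chunks("x" * 2500)))  # 4 windows; consecutive windows share 200 chars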

Configuration

C#
using Kreuzberg;
using System;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var config = new ExtractionConfig
        {
            Chunking = new ChunkingConfig
            {
                MaxChars = 1000,
                MaxOverlap = 200,
                Embedding = new EmbeddingConfig
                {
                    Model = EmbeddingModelType.Preset("all-minilm-l6-v2"),
                    Normalize = true,
                    BatchSize = 32
                }
            }
        };

        try
        {
            var result = await KreuzbergClient.ExtractFileAsync(
                "document.pdf",
                config
            ).ConfigureAwait(false);

            Console.WriteLine($"Chunks: {result.Chunks.Count}");
            foreach (var chunk in result.Chunks)
            {
                Console.WriteLine($"Content length: {chunk.Content.Length}");
                if (chunk.Embedding != null)
                {
                    Console.WriteLine($"Embedding dimensions: {chunk.Embedding.Length}");
                }
            }
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}

Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    maxChars := 1000
    maxOverlap := 200
    config := &kreuzberg.ExtractionConfig{
        Chunking: &kreuzberg.ChunkingConfig{
            MaxChars:   &maxChars,
            MaxOverlap: &maxOverlap,
        },
    }

    fmt.Printf("Config: MaxChars=%d, MaxOverlap=%d\n", *config.Chunking.MaxChars, *config.Chunking.MaxOverlap)
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .chunking(ChunkingConfig.builder()
        .maxChars(1000)
        .maxOverlap(200)
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            max_chars=1000,
            max_overlap=200,
            separator="sentence"
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Chunks: {len(result.chunks or [])}")
    for chunk in result.chunks or []:
        print(f"Length: {len(chunk.content)}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(
    max_chars: 1000,
    max_overlap: 200
  )
)
Rust
use kreuzberg::{ExtractionConfig, ChunkingConfig};

let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_chars: 1000,
        max_overlap: 200,
        embedding: None,
    }),
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    chunking: {
        maxChars: 1000,
        maxOverlap: 200,
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(`Total chunks: ${result.chunks?.length ?? 0}`);
WASM
import { initWasm, extractBytes } from '@kreuzberg/wasm';

await initWasm();

const config = {
  chunking: {
    maxChars: 1000,
    chunkOverlap: 100
  }
};

const bytes = new Uint8Array(buffer);
const result = await extractBytes(bytes, 'application/pdf', config);

result.chunks?.forEach((chunk, idx) => {
  console.log(`Chunk ${idx}: ${chunk.content.substring(0, 50)}...`);
  console.log(`Tokens: ${chunk.metadata?.token_count}`);
});

Chunk Output

Each chunk includes:

  • content: The chunk text
  • metadata:
      • char_start: Start position in original text
      • char_end: End position in original text
      • chunk_index: Zero-based chunk number
      • total_chunks: Total number of chunks
      • token_count: Token count (if embeddings enabled)
  • embedding: Optional embedding vector (if configured)
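
A minimal sketch of reading these fields from Python, assuming the Python bindings expose the metadata attributes under the snake_case names above (the Rust and TypeScript examples below access the same fields in their respective casings):

import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config = ExtractionConfig(chunking=ChunkingConfig(max_chars=1000, max_overlap=200))
    result = await extract_file("document.pdf", config=config)
    for chunk in result.chunks or []:
        md = chunk.metadata  # char_start, char_end, chunk_index, total_chunks, token_count
        print(f"Chunk {md.chunk_index + 1}/{md.total_chunks}: chars {md.char_start}-{md.char_end}")

asyncio.run(main())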

Example: RAG Pipeline

C#
using Kreuzberg;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class RagPipelineExample
{
    static async Task Main()
    {
        var config = new ExtractionConfig
        {
            Chunking = new ChunkingConfig
            {
                MaxChars = 500,
                MaxOverlap = 50,
                Embedding = new EmbeddingConfig
                {
                    Model = EmbeddingModelType.Preset("all-mpnet-base-v2"),
                    Normalize = true,
                    BatchSize = 16
                }
            }
        };

        try
        {
            var result = await KreuzbergClient.ExtractFileAsync(
                "research_paper.pdf",
                config
            ).ConfigureAwait(false);

            var vectorStore = await BuildVectorStoreAsync(result.Chunks)
                .ConfigureAwait(false);

            var query = "machine learning optimization";
            var relevantChunks = await SearchAsync(vectorStore, query)
                .ConfigureAwait(false);

            Console.WriteLine($"Found {relevantChunks.Count} relevant chunks");
            foreach (var chunk in relevantChunks.Take(3))
            {
                Console.WriteLine($"Content: {chunk.Content[..80]}...");
                Console.WriteLine($"Similarity: {chunk.Similarity:F3}\n");
            }
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }

    static async Task<List<VectorEntry>> BuildVectorStoreAsync(
        IEnumerable<Chunk> chunks)
    {
        return await Task.Run(() =>
        {
            return chunks.Select(c => new VectorEntry
            {
                Content = c.Content,
                Embedding = c.Embedding?.ToArray() ?? Array.Empty<float>(),
                Similarity = 0f
            }).ToList();
        }).ConfigureAwait(false);
    }

    static async Task<List<VectorEntry>> SearchAsync(
        List<VectorEntry> store,
        string query)
    {
        // Placeholder ranking: Similarity is never computed here. A real
        // pipeline embeds the query and ranks by cosine similarity; see the
        // sketch after these examples.
        return await Task.Run(() =>
        {
            return store
                .OrderByDescending(e => e.Similarity)
                .ToList();
        }).ConfigureAwait(false);
    }

    class VectorEntry
    {
        public string Content { get; set; } = string.Empty;
        public float[] Embedding { get; set; } = Array.Empty<float>();
        public float Similarity { get; set; }
    }
}

Go
package main

import (
    "fmt"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    maxChars := 500
    maxOverlap := 50
    normalize := true
    batchSize := int32(16)

    config := &kreuzberg.ExtractionConfig{
        Chunking: &kreuzberg.ChunkingConfig{
            MaxChars:   &maxChars,
            MaxOverlap: &maxOverlap,
            Embedding: &kreuzberg.EmbeddingConfig{
                Model:     kreuzberg.EmbeddingModelType_Preset("all-mpnet-base-v2"),
                Normalize: &normalize,
                BatchSize: &batchSize,
            },
        },
    }

    result, err := kreuzberg.ExtractFileSync("research_paper.pdf", config)
    if err != nil {
        log.Fatalf("RAG extraction failed: %v", err)
    }

    chunks := result.Chunks
    fmt.Printf("Found %d chunks for RAG pipeline\n", len(chunks))

    for i := 0; i < len(chunks) && i < 3; i++ {
        chunk := chunks[i]
        content := chunk.Content
        if len(content) > 80 {
            content = content[:80]
        }
        fmt.Printf("Chunk %d: %s...\n", i, content)
    }
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;
import dev.kreuzberg.config.EmbeddingConfig;
import dev.kreuzberg.config.EmbeddingModelType;
import java.util.List;

ExtractionConfig config = ExtractionConfig.builder()
    .chunking(ChunkingConfig.builder()
        .maxChars(500)
        .maxOverlap(50)
        .embedding(EmbeddingConfig.builder()
            .model(EmbeddingModelType.preset("all-mpnet-base-v2"))
            .normalize(true)
            .batchSize(16)
            .build())
        .build())
    .build();

try {
    ExtractionResult result = Kreuzberg.extractFile("research_paper.pdf", config);

    List<Object> chunks = result.getChunks() != null ? result.getChunks() : List.of();
    System.out.println("Found " + chunks.size() + " chunks for RAG pipeline");

    for (int i = 0; i < Math.min(3, chunks.size()); i++) {
        Object chunk = chunks.get(i);
        System.out.println("Chunk " + i + ": " + chunk.toString().substring(0, Math.min(80, chunk.toString().length())) + "...");
    }
} catch (Exception ex) {
    System.err.println("RAG extraction failed: " + ex.getMessage());
}
Python
import asyncio
from kreuzberg import (
    extract_file,
    ExtractionConfig,
    ChunkingConfig,
    EmbeddingConfig,
    EmbeddingModelType,
)

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            max_chars=500,
            max_overlap=50,
            embedding=EmbeddingConfig(
                model=EmbeddingModelType.preset("balanced"),
                normalize=True,
                batch_size=16
            )
        )
    )
    result = await extract_file("research_paper.pdf", config=config)

    chunks_with_embeddings: list = []
    for chunk in result.chunks or []:
        if chunk.embedding:
            chunks_with_embeddings.append({
                "content": chunk.content[:100],
                "embedding_dims": len(chunk.embedding)
            })

    print(f"Chunks with embeddings: {len(chunks_with_embeddings)}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(
    max_chars: 500,
    max_overlap: 50,
    embedding: Kreuzberg::Config::Embedding.new(
      model: Kreuzberg::EmbeddingModelType.new(
        type: 'preset',
        name: 'all-mpnet-base-v2'
      ),
      normalize: true,
      batch_size: 16
    )
  )
)

result = Kreuzberg.extract_file_sync('research_paper.pdf', config: config)

def build_vector_store(chunks)
  chunks.map.with_index do |chunk, idx|
    {
      id: idx,
      content: chunk.content,
      embedding: chunk.embedding,
      similarity: 0.0
    }
  end
end

def search_vector_store(store, query)
  # Placeholder ranking: similarity is never computed here.
  store.sort_by { |entry| entry[:similarity] }.reverse
end

vector_store = build_vector_store(result.chunks)
query = 'machine learning optimization'
relevant_chunks = search_vector_store(vector_store, query)

puts "Found #{relevant_chunks.length} relevant chunks"
relevant_chunks.take(3).each do |chunk|
  puts "Content: #{chunk[:content][0..80]}..."
  puts "Similarity: #{chunk[:similarity]&.round(3)}\n"
end
Rust
use kreuzberg::{extract_file, ExtractionConfig, ChunkingConfig, EmbeddingConfig};

let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_chars: 500,
        max_overlap: 50,
        embedding: Some(EmbeddingConfig {
            model: "balanced".to_string(),
            normalize: true,
            ..Default::default()
        }),
        ..Default::default()
    }),
    ..Default::default()
};

let result = extract_file("research_paper.pdf", None, &config).await?;

if let Some(chunks) = result.chunks {
    for chunk in chunks {
        println!("Chunk {}/{}",
            chunk.metadata.chunk_index + 1,
            chunk.metadata.total_chunks
        );
        println!("Position: {}-{}",
            chunk.metadata.char_start,
            chunk.metadata.char_end
        );
        println!("Content: {}...", &chunk.content[..100.min(chunk.content.len())]);
        if let Some(embedding) = chunk.embedding {
            println!("Embedding: {} dimensions", embedding.len());
        }
    }
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    chunking: {
        maxChars: 500,
        maxOverlap: 50,
        embedding: {
            preset: 'balanced',
        },
    },
};

const result = await extractFile('research_paper.pdf', null, config);

if (result.chunks) {
    for (const chunk of result.chunks) {
        console.log(`Chunk ${chunk.metadata.chunkIndex + 1}/${chunk.metadata.totalChunks}`);
        console.log(`Position: ${chunk.metadata.charStart}-${chunk.metadata.charEnd}`);
        console.log(`Content: ${chunk.content.slice(0, 100)}...`);
        if (chunk.embedding) {
            console.log(`Embedding: ${chunk.embedding.length} dimensions`);
        }
    }
}
WASM
import { initWasm, extractBytes } from '@kreuzberg/wasm';

await initWasm();

const config = {
  chunking: {
    maxChars: 1000,
    chunkOverlap: 100,
    embedding: {
      model: { preset: 'all-MiniLM-L6-v2' }
    }
  }
};

const bytes = new Uint8Array(buffer);
const result = await extractBytes(bytes, 'application/pdf', config);

for (const chunk of result.chunks || []) {
  console.log(`Chunk: ${chunk.content.substring(0, 100)}...`);
  console.log(`Embedding: ${chunk.embedding?.slice(0, 5).join(', ')}...`);
}
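
The SearchAsync / search_vector_store helpers above are deliberate placeholders: the similarity field is never actually computed. A real pipeline embeds the query with the same model used for the chunks and ranks entries by cosine similarity. A minimal sketch of that ranking step in plain Python (query_embedding is assumed to come from the same embedding model; no Kreuzberg API is involved):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(store: list[dict], query_embedding: list[float], top_k: int = 3) -> list[dict]:
    # store entries: {"content": str, "embedding": list[float]} built from result.chunks
    ranked = sorted(
        store,
        key=lambda entry: cosine_similarity(entry["embedding"], query_embedding),
        reverse=True,
    )
    return ranked[:top_k]

Since the examples set Normalize = true, the stored vectors are unit-length, so ranking by the dot product alone yields the same order.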

Language Detection

flowchart TD
    Start[Extracted Text] --> Config{detect_multiple?}

    Config -->|false| SingleMode[Single Language Mode]
    Config -->|true| MultiMode[Multiple Languages Mode]

    SingleMode --> FullText[Analyze Full Text]
    FullText --> DetectSingle[Detect Dominant Language]
    DetectSingle --> CheckConfSingle{Confidence ≥<br/>min_confidence?}

    CheckConfSingle -->|Yes| ReturnSingle[Return Language]
    CheckConfSingle -->|No| EmptySingle[Return Empty]

    MultiMode --> ChunkText[Split into 200-char Chunks]
    ChunkText --> AnalyzeChunks[Analyze Each Chunk]
    AnalyzeChunks --> DetectPerChunk[Detect Language per Chunk]

    DetectPerChunk --> FilterConfidence[Filter by min_confidence]
    FilterConfidence --> CountFrequency[Count Language Frequency]
    CountFrequency --> SortByFrequency[Sort by Frequency]
    SortByFrequency --> ReturnMultiple[Return Language List]

    ReturnSingle --> Result[detected_languages]
    EmptySingle --> Result
    ReturnMultiple --> Result

    Result --> ISOCodes[ISO 639-3 Codes:<br/>eng, spa, fra, deu, cmn,<br/>jpn, ara, rus, etc.]

    style SingleMode fill:#87CEEB
    style MultiMode fill:#FFD700
    style ReturnSingle fill:#90EE90
    style ReturnMultiple fill:#90EE90
    style ISOCodes fill:#FFB6C1

Detect languages in extracted text using the fast whatlang library. Supports 60+ languages with ISO 639-3 codes.

Configuration

C#
using Kreuzberg;
using System;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var config = new ExtractionConfig
        {
            LanguageDetection = new LanguageDetectionConfig
            {
                Enabled = true,
                MinConfidence = 0.8m,
                DetectMultiple = false
            }
        };

        try
        {
            var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);

            if (result.DetectedLanguages?.Count > 0)
            {
                Console.WriteLine($"Detected Language: {result.DetectedLanguages[0]}");
            }
            else
            {
                Console.WriteLine("No language detected");
            }

            Console.WriteLine($"Content length: {result.Content.Length} characters");
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Extraction failed: {ex.Message}");
        }
    }
}

Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    minConfidence := 0.8
    config := &kreuzberg.ExtractionConfig{
        LanguageDetection: &kreuzberg.LanguageDetectionConfig{
            Enabled:        true,
            MinConfidence:  &minConfidence,
            DetectMultiple: false,
        },
    }

    fmt.Printf("Language detection enabled: %v\n", config.LanguageDetection.Enabled)
    fmt.Printf("Min confidence: %f\n", *config.LanguageDetection.MinConfidence)
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.LanguageDetectionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .languageDetection(LanguageDetectionConfig.builder()
        .enabled(true)
        .minConfidence(0.8)
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        language_detection=LanguageDetectionConfig(
            enabled=True,
            min_confidence=0.85,
            detect_multiple=False
        )
    )
    result = await extract_file("document.pdf", config=config)
    if result.detected_languages:
        print(f"Primary language: {result.detected_languages[0]}")
    print(f"Content length: {len(result.content)} chars")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  language_detection: Kreuzberg::Config::LanguageDetection.new(
    enabled: true,
    min_confidence: 0.8,
    detect_multiple: false
  )
)
Rust
use kreuzberg::{ExtractionConfig, LanguageDetectionConfig};

let config = ExtractionConfig {
    language_detection: Some(LanguageDetectionConfig {
        enabled: true,
        min_confidence: 0.8,
        detect_multiple: false,
    }),
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    languageDetection: {
        enabled: true,
        minConfidence: 0.8,
        detectMultiple: false,
    },
};

const result = await extractFile('document.pdf', null, config);
if (result.detectedLanguages) {
    console.log(`Detected languages: ${result.detectedLanguages.join(', ')}`);
}
WASM
import { initWasm, extractBytes } from '@kreuzberg/wasm';

await initWasm();

const config = {
  language_detection: {
    detect_multiple: true,
    min_confidence: 0.5
  }
};

const bytes = new Uint8Array(buffer);
const result = await extractBytes(bytes, 'application/pdf', config);

console.log('Detected languages:', result.language);

Detection Modes

Single Language (detect_multiple: false):

  • Detects the dominant language only
  • Faster, single-pass detection
  • Best for monolingual documents

Multiple Languages (detect_multiple: true):

  • Chunks text into 200-character segments
  • Detects the language of each chunk
  • Returns languages sorted by frequency
  • Best for multilingual documents
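
A short sketch contrasting the two modes with the Python API shown above (same file, different detect_multiple settings):

import asyncio
from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file

async def detect(path: str, multiple: bool) -> list[str]:
    config = ExtractionConfig(
        language_detection=LanguageDetectionConfig(
            enabled=True, min_confidence=0.8, detect_multiple=multiple
        )
    )
    result = await extract_file(path, config=config)
    return result.detected_languages or []

async def main() -> None:
    print("single:", await detect("document.pdf", False))   # at most the dominant language
    print("multiple:", await detect("document.pdf", True))  # languages sorted by frequency

asyncio.run(main())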

Supported Languages

ISO 639-3 codes including:

  • European: eng (English), spa (Spanish), fra (French), deu (German), ita (Italian), por (Portuguese), rus (Russian), nld (Dutch), pol (Polish), swe (Swedish)
  • Asian: cmn (Chinese), jpn (Japanese), kor (Korean), tha (Thai), vie (Vietnamese), ind (Indonesian)
  • Middle Eastern: ara (Arabic), pes (Persian), urd (Urdu), heb (Hebrew)
  • And 40+ more

Example

C#
using Kreuzberg;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var config = new ExtractionConfig
        {
            LanguageDetection = new LanguageDetectionConfig
            {
                Enabled = true,
                MinConfidence = 0.8m,
                DetectMultiple = true
            }
        };

        try
        {
            var result = await KreuzbergClient.ExtractFileAsync("multilingual_document.pdf", config);

            var languages = result.DetectedLanguages ?? new List<string>();

            if (languages.Count > 0)
            {
                Console.WriteLine($"Detected {languages.Count} language(s): {string.Join(", ", languages)}");
            }
            else
            {
                Console.WriteLine("No languages detected");
            }

            Console.WriteLine($"Total content: {result.Content.Length} characters");
            Console.WriteLine($"MIME type: {result.MimeType}");
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Processing failed: {ex.Message}");
        }
    }
}

Go
package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    enabled := true
    detectMultiple := true
    minConfidence := 0.8

    config := &kreuzberg.ExtractionConfig{
        LanguageDetection: &kreuzberg.LanguageDetectionConfig{
            Enabled:        &enabled,
            MinConfidence:  &minConfidence,
            DetectMultiple: &detectMultiple,
        },
    }

    result, err := kreuzberg.ExtractFileSync("multilingual_document.pdf", config)
    if err != nil {
        log.Fatalf("Processing failed: %v", err)
    }

    languages := result.DetectedLanguages
    if len(languages) > 0 {
        fmt.Printf("Detected %d language(s): %s\n", len(languages), strings.Join(languages, ", "))
    } else {
        fmt.Println("No languages detected")
    }

    fmt.Printf("Total content: %d characters\n", len(result.Content))
    fmt.Printf("MIME type: %s\n", result.MimeType)
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.LanguageDetectionConfig;
import java.math.BigDecimal;
import java.util.List;

ExtractionConfig config = ExtractionConfig.builder()
    .languageDetection(LanguageDetectionConfig.builder()
        .enabled(true)
        .minConfidence(new BigDecimal("0.8"))
        .detectMultiple(true)
        .build())
    .build();

try {
    ExtractionResult result = Kreuzberg.extractFile("multilingual_document.pdf", config);

    List<String> languages = result.getDetectedLanguages() != null
        ? result.getDetectedLanguages()
        : List.of();

    if (!languages.isEmpty()) {
        System.out.println("Detected " + languages.size() + " language(s): " + String.join(", ", languages));
    } else {
        System.out.println("No languages detected");
    }

    System.out.println("Total content: " + result.getContent().length() + " characters");
    System.out.println("MIME type: " + result.getMimeType());
} catch (Exception ex) {
    System.err.println("Processing failed: " + ex.getMessage());
}
Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig, LanguageDetectionConfig

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        language_detection=LanguageDetectionConfig(
            enabled=True,
            min_confidence=0.7,
            detect_multiple=True
        )
    )
    result = await extract_file("multilingual_document.pdf", config=config)
    languages: list[str] = result.detected_languages or []
    print(f"Detected {len(languages)} languages: {languages}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  language_detection: Kreuzberg::Config::LanguageDetection.new(
    enabled: true,
    min_confidence: 0.8,
    detect_multiple: true
  )
)

result = Kreuzberg.extract_file_sync('multilingual_document.pdf', config: config)

languages = result.detected_languages || []

if languages.any?
  puts "Detected #{languages.length} language(s): #{languages.join(', ')}"
else
  puts "No languages detected"
end

puts "Total content: #{result.content.length} characters"
puts "MIME type: #{result.mime_type}"
Rust
use kreuzberg::{extract_file, ExtractionConfig, LanguageDetectionConfig};

let config = ExtractionConfig {
    language_detection: Some(LanguageDetectionConfig {
        enabled: true,
        min_confidence: 0.8,
        detect_multiple: true,
    }),
    ..Default::default()
};

let result = extract_file("multilingual_document.pdf", None, &config).await?;

println!("Detected languages: {:?}", result.detected_languages);
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    languageDetection: {
        enabled: true,
        minConfidence: 0.8,
        detectMultiple: true,
    },
};

const result = await extractFile('multilingual_document.pdf', null, config);
if (result.detectedLanguages) {
    console.log(`Detected languages: ${result.detectedLanguages.join(', ')}`);
}

Embedding Generation

Generate embeddings for vector databases, semantic search, and RAG systems using ONNX models via fastembed-rs.

Available Presets

Preset        Model               Dimensions  Max Tokens  Use Case
fast          AllMiniLML6V2Q      384         512         Rapid prototyping, development
balanced      BGEBaseENV15        768         1024        Production RAG, general-purpose
quality       BGELargeENV15       1024        2000        Maximum accuracy, complex docs
multilingual  MultilingualE5Base  768         1024        100+ languages, international

Max Tokens vs. max_chars

The "Max Tokens" values shown are the model's maximum token limits. These don't directly correspond to the max_chars setting in ChunkingConfig, which controls character-based chunking. The embedding model will process chunks up to its token limit.
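
Since the chunker counts characters while the model counts tokens, it can help to size max_chars conservatively against the model's token limit. A back-of-the-envelope sketch (the ~4 characters per English token ratio is a common heuristic, not a library guarantee):

CHARS_PER_TOKEN = 4  # rough heuristic for English text

def safe_max_chars(model_max_tokens: int, margin: float = 0.9) -> int:
    # Leave headroom so boundary-snapped chunks stay under the token limit.
    return int(model_max_tokens * CHARS_PER_TOKEN * margin)

print(safe_max_chars(512))   # ~1843 chars for the "fast" preset
print(safe_max_chars(1024))  # ~3686 chars for "balanced"/"multilingual"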

Configuration

C#
using Kreuzberg;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var config = new ExtractionConfig
{
    Chunking = new ChunkingConfig
    {
        MaxChars = 512,
        MaxOverlap = 50,
        Embedding = new EmbeddingConfig
        {
            Model = EmbeddingModelType.Preset("balanced"),
            Normalize = true,
            BatchSize = 32,
            ShowDownloadProgress = false
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);

var chunks = result.Chunks ?? new List<Chunk>();
foreach (var (index, chunk) in chunks.WithIndex())
{
    var chunkId = $"doc_chunk_{index}";
    Console.WriteLine($"Chunk {chunkId}: {chunk.Content[..Math.Min(50, chunk.Content.Length)]}");

    if (chunk.Embedding != null)
    {
        Console.WriteLine($"  Embedding dimensions: {chunk.Embedding.Length}");
    }
}

internal static class EnumerableExtensions
{
    public static IEnumerable<(int Index, T Item)> WithIndex<T>(
        this IEnumerable<T> items)
    {
        var index = 0;
        foreach (var item in items)
        {
            yield return (index++, item);
        }
    }
}
Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    maxChars := 512
    maxOverlap := 50
    normalize := true
    batchSize := int32(32)
    showProgress := false

    config := &kreuzberg.ExtractionConfig{
        Chunking: &kreuzberg.ChunkingConfig{
            MaxChars:   &maxChars,
            MaxOverlap: &maxOverlap,
            Embedding: &kreuzberg.EmbeddingConfig{
                Model:                kreuzberg.EmbeddingModelType_Preset("balanced"),
                Normalize:            &normalize,
                BatchSize:            &batchSize,
                ShowDownloadProgress: &showProgress,
            },
        },
    }

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }

    for index, chunk := range result.Chunks {
        chunkID := fmt.Sprintf("doc_chunk_%d", index)
        content := chunk.Content
        if len(content) > 50 {
            content = content[:50]
        }
        fmt.Printf("Chunk %s: %s\n", chunkID, content)

        if len(chunk.Embedding) > 0 {
            fmt.Printf("  Embedding dimensions: %d\n", len(chunk.Embedding))
        }
    }
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;
import dev.kreuzberg.config.EmbeddingConfig;
import dev.kreuzberg.config.EmbeddingModelType;
import java.util.List;

ExtractionConfig config = ExtractionConfig.builder()
    .chunking(ChunkingConfig.builder()
        .maxChars(512)
        .maxOverlap(50)
        .embedding(EmbeddingConfig.builder()
            .model(EmbeddingModelType.preset("balanced"))
            .normalize(true)
            .batchSize(32)
            .showDownloadProgress(false)
            .build())
        .build())
    .build();

ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);

List<Object> chunks = result.getChunks() != null ? result.getChunks() : List.of();
for (int index = 0; index < chunks.size(); index++) {
    Object chunk = chunks.get(index);
    String chunkId = "doc_chunk_" + index;
    System.out.println("Chunk " + chunkId + ": " + chunk.toString().substring(0, Math.min(50, chunk.toString().length())));

    if (chunk instanceof java.util.Map) {
        Object embedding = ((java.util.Map<String, Object>) chunk).get("embedding");
        if (embedding != null) {
            System.out.println("  Embedding dimensions: " + ((float[]) embedding).length);
        }
    }
}
Python
from kreuzberg import (
    ExtractionConfig,
    ChunkingConfig,
    EmbeddingConfig,
    EmbeddingModelType,
)

config: ExtractionConfig = ExtractionConfig(
    chunking=ChunkingConfig(
        max_chars=1024,
        max_overlap=100,
        embedding=EmbeddingConfig(
            model=EmbeddingModelType.preset("balanced"),
            normalize=True,
            batch_size=32,
            show_download_progress=False,
        ),
    )
)
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(
    max_chars: 512,
    max_overlap: 50,
    embedding: Kreuzberg::Config::Embedding.new(
      model: Kreuzberg::EmbeddingModelType.new(
        type: 'preset',
        name: 'balanced'
      ),
      normalize: true,
      batch_size: 32,
      show_download_progress: false
    )
  )
)

result = Kreuzberg.extract_file_sync('document.pdf', config: config)

chunks = result.chunks || []
chunks.each_with_index do |chunk, idx|
  chunk_id = "doc_chunk_#{idx}"
  puts "Chunk #{chunk_id}: #{chunk.content[0...50]}"

  if chunk.embedding
    puts "  Embedding dimensions: #{chunk.embedding.length}"
  end
end
Rust
use kreuzberg::{ExtractionConfig, ChunkingConfig, EmbeddingConfig};

let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_chars: 1024,
        max_overlap: 100,
        embedding: Some(EmbeddingConfig {
            model: "balanced".to_string(),
            normalize: true,
            batch_size: 32,
            show_download_progress: false,
            ..Default::default()
        }),
        ..Default::default()
    }),
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    chunking: {
        maxChars: 1024,
        maxOverlap: 100,
        embedding: {
            preset: 'balanced',
        },
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(`Chunks: ${result.chunks?.length ?? 0}`);

Example: Vector Database Integration

C#
using Kreuzberg;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class VectorDatabaseIntegration
{
    public class VectorRecord
    {
        public string Id { get; set; }
        public float[] Embedding { get; set; }
        public string Content { get; set; }
        public Dictionary<string, string> Metadata { get; set; }
    }

    public async Task<List<VectorRecord>> ExtractAndVectorize(
        string documentPath,
        string documentId)
    {
        var config = new ExtractionConfig
        {
            Chunking = new ChunkingConfig
            {
                MaxChars = 512,
                MaxOverlap = 50,
                Embedding = new EmbeddingConfig
                {
                    Model = EmbeddingModelType.Preset("balanced"),
                    Normalize = true,
                    BatchSize = 32
                }
            }
        };

        var result = await KreuzbergClient.ExtractFileAsync(documentPath, config);
        var chunks = result.Chunks ?? new List<Chunk>();

        var vectorRecords = chunks
            .Select((chunk, index) => new VectorRecord
            {
                Id = $"{documentId}_chunk_{index}",
                Content = chunk.Content,
                Embedding = chunk.Embedding,
                Metadata = new Dictionary<string, string>
                {
                    { "document_id", documentId },
                    { "chunk_index", index.ToString() },
                    { "content_length", chunk.Content.Length.ToString() }
                }
            })
            .ToList();

        await StoreInVectorDatabase(vectorRecords);
        return vectorRecords;
    }

    private async Task StoreInVectorDatabase(List<VectorRecord> records)
    {
        foreach (var record in records)
        {
            if (record.Embedding != null && record.Embedding.Length > 0)
            {
                Console.WriteLine(
                    $"Storing {record.Id}: {record.Content.Length} chars, " +
                    $"{record.Embedding.Length} dims");
            }
        }

        await Task.CompletedTask;
    }
}
Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

type VectorRecord struct {
    ID        string
    Embedding []float32
    Content   string
    Metadata  map[string]string
}

func extractAndVectorize(documentPath string, documentID string) ([]VectorRecord, error) {
    maxChars := 512
    maxOverlap := 50
    normalize := true
    batchSize := int32(32)

    config := &kreuzberg.ExtractionConfig{
        Chunking: &kreuzberg.ChunkingConfig{
            MaxChars:   &maxChars,
            MaxOverlap: &maxOverlap,
            Embedding: &kreuzberg.EmbeddingConfig{
                Model:     kreuzberg.EmbeddingModelType_Preset("balanced"),
                Normalize: &normalize,
                BatchSize: &batchSize,
            },
        },
    }

    result, err := kreuzberg.ExtractFileSync(documentPath, config)
    if err != nil {
        return nil, err
    }

    var vectorRecords []VectorRecord
    for index, chunk := range result.Chunks {
        record := VectorRecord{
            ID:        fmt.Sprintf("%s_chunk_%d", documentID, index),
            Content:   chunk.Content,
            Embedding: chunk.Embedding,
            Metadata: map[string]string{
                "document_id":  documentID,
                "chunk_index":  fmt.Sprintf("%d", index),
                "content_length": fmt.Sprintf("%d", len(chunk.Content)),
            },
        }
        vectorRecords = append(vectorRecords, record)
    }

    storeInVectorDatabase(vectorRecords)
    return vectorRecords, nil
}

func storeInVectorDatabase(records []VectorRecord) {
    for _, record := range records {
        if len(record.Embedding) > 0 {
            fmt.Printf("Storing %s: %d chars, %d dims\n",
                record.ID, len(record.Content), len(record.Embedding))
        }
    }
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;
import dev.kreuzberg.config.EmbeddingConfig;
import dev.kreuzberg.config.EmbeddingModelType;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VectorDatabaseIntegration {
    public static class VectorRecord {
        public String id;
        public float[] embedding;
        public String content;
        public Map<String, String> metadata;
    }

    public static List<VectorRecord> extractAndVectorize(String documentPath, String documentId) throws Exception {
        ExtractionConfig config = ExtractionConfig.builder()
            .chunking(ChunkingConfig.builder()
                .maxChars(512)
                .maxOverlap(50)
                .embedding(EmbeddingConfig.builder()
                    .model(EmbeddingModelType.preset("balanced"))
                    .normalize(true)
                    .batchSize(32)
                    .build())
                .build())
            .build();

        ExtractionResult result = Kreuzberg.extractFile(documentPath, config);
        List<Object> chunks = result.getChunks() != null ? result.getChunks() : List.of();

        List<VectorRecord> vectorRecords = new java.util.ArrayList<>();
        for (int index = 0; index < chunks.size(); index++) {
            Object chunk = chunks.get(index);
            VectorRecord record = new VectorRecord();
            record.id = documentId + "_chunk_" + index;
            record.metadata = new HashMap<>();
            record.metadata.put("document_id", documentId);
            record.metadata.put("chunk_index", String.valueOf(index));

            if (chunk instanceof java.util.Map) {
                Map<String, Object> chunkMap = (Map<String, Object>) chunk;
                record.content = (String) chunkMap.get("content");
                record.embedding = (float[]) chunkMap.get("embedding");
                record.metadata.put("content_length", String.valueOf(record.content.length()));
            }

            vectorRecords.add(record);
        }

        storeInVectorDatabase(vectorRecords);
        return vectorRecords;
    }

    private static void storeInVectorDatabase(List<VectorRecord> records) {
        for (VectorRecord record : records) {
            if (record.embedding != null && record.embedding.length > 0) {
                System.out.println("Storing " + record.id + ": " + record.content.length()
                    + " chars, " + record.embedding.length + " dims");
            }
        }
    }
}
Python
import asyncio
from kreuzberg import (
    extract_file,
    ExtractionConfig,
    ChunkingConfig,
    EmbeddingConfig,
    EmbeddingModelType,
)

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            max_chars=512,
            max_overlap=50,
            embedding=EmbeddingConfig(
                model=EmbeddingModelType.preset("balanced"), normalize=True
            ),
        )
    )
    result = await extract_file("document.pdf", config=config)
    chunks = result.chunks or []
    for i, chunk in enumerate(chunks):
        chunk_id: str = f"doc_chunk_{i}"
        print(f"Chunk {chunk_id}: {chunk.content[:50]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

class VectorDatabaseIntegration
  VectorRecord = Struct.new(:id, :embedding, :content, :metadata, keyword_init: true)

  def extract_and_vectorize(document_path, document_id)
    config = Kreuzberg::Config::Extraction.new(
      chunking: Kreuzberg::Config::Chunking.new(
        max_chars: 512,
        max_overlap: 50,
        embedding: Kreuzberg::Config::Embedding.new(
          model: Kreuzberg::EmbeddingModelType.new(
            type: 'preset',
            name: 'balanced'
          ),
          normalize: true,
          batch_size: 32
        )
      )
    )

    result = Kreuzberg.extract_file_sync(document_path, config: config)
    chunks = result.chunks || []

    vector_records = chunks.map.with_index do |chunk, idx|
      VectorRecord.new(
        id: "#{document_id}_chunk_#{idx}",
        content: chunk.content,
        embedding: chunk.embedding,
        metadata: {
          document_id: document_id,
          chunk_index: idx,
          content_length: chunk.content.length
        }
      )
    end

    store_in_vector_database(vector_records)
    vector_records
  end

  private

  def store_in_vector_database(records)
    records.each do |record|
      if record.embedding&.any?
        puts "Storing #{record.id}: #{record.content.length} chars, #{record.embedding.length} dims"
      end
    end
  end
end
Rust
use kreuzberg::{extract_file, ExtractionConfig, ChunkingConfig, EmbeddingConfig};

struct VectorRecord {
    id: String,
    content: String,
    embedding: Vec<f32>,
    metadata: std::collections::HashMap<String, String>,
}

async fn extract_and_vectorize(
    document_path: &str,
    document_id: &str,
) -> Result<Vec<VectorRecord>, Box<dyn std::error::Error>> {
    let config = ExtractionConfig {
        chunking: Some(ChunkingConfig {
            max_chars: 512,
            max_overlap: 50,
            embedding: Some(EmbeddingConfig {
                model: kreuzberg::EmbeddingModelType::Preset {
                    name: "balanced".to_string(),
                },
                normalize: true,
                batch_size: 32,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file(document_path, None, &config).await?;

    let mut records = Vec::new();
    if let Some(chunks) = result.chunks {
        for (index, chunk) in chunks.iter().enumerate() {
            if let Some(embedding) = &chunk.embedding {
                let mut metadata = std::collections::HashMap::new();
                metadata.insert("document_id".to_string(), document_id.to_string());
                metadata.insert("chunk_index".to_string(), index.to_string());
                metadata.insert("content_length".to_string(), chunk.content.len().to_string());

                records.push(VectorRecord {
                    id: format!("{}_chunk_{}", document_id, index),
                    content: chunk.content.clone(),
                    embedding: embedding.clone(),
                    metadata,
                });
            }
        }
    }

    Ok(records)
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    chunking: {
        maxChars: 512,
        maxOverlap: 50,
        embedding: {
            preset: 'balanced',
        },
    },
};

const result = await extractFile('document.pdf', null, config);

if (result.chunks) {
    for (const chunk of result.chunks) {
        console.log(`Chunk: ${chunk.content.slice(0, 100)}...`);
        if (chunk.embedding) {
            console.log(`Embedding dims: ${chunk.embedding.length}`);
        }
    }
}

Token Reduction

Intelligently reduce token count while preserving meaning by removing stopwords, stripping redundancy, and applying compression.

Reduction Levels

Level       Reduction  Features
off         0%         No reduction, pass-through
moderate    15-25%     Stopwords + redundancy removal
aggressive  30-50%     Semantic clustering, importance scoring
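
As plain arithmetic (not a library call), the ranges above translate into roughly these remaining token counts:

original = 10_000  # tokens before reduction
for mode, (low, high) in {"moderate": (0.15, 0.25), "aggressive": (0.30, 0.50)}.items():
    remaining = (int(original * (1 - high)), int(original * (1 - low)))
    print(f"{mode}: ~{remaining[0]:,}-{remaining[1]:,} tokens remain")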

Configuration

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    TokenReduction = new TokenReductionConfig
    {
        Mode = "moderate",              // "off", "moderate", or "aggressive"
        PreserveMarkdown = true,
        PreserveCode = true,
        LanguageHint = "eng"
    }
};
Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    config := &kreuzberg.ExtractionConfig{
        TokenReduction: &kreuzberg.TokenReductionConfig{
            Mode:                   "moderate",
            PreserveImportantWords: kreuzberg.BoolPtr(true),
        },
    }

    fmt.Printf("Mode: %s, Preserve Important Words: %v\n",
        config.TokenReduction.Mode,
        *config.TokenReduction.PreserveImportantWords)
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.TokenReductionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .tokenReduction(TokenReductionConfig.builder()
        .mode("moderate")
        .preserveImportantWords(true)
        .build())
    .build();
Python
from kreuzberg import ExtractionConfig, TokenReductionConfig

config: ExtractionConfig = ExtractionConfig(
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_markdown=True,
        preserve_code=True,
        language_hint="eng"
    )
)
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  token_reduction: Kreuzberg::Config::TokenReduction.new(
    mode: 'moderate',
    preserve_markdown: true,
    preserve_code: true,
    language_hint: 'eng'
  )
)
Rust
use kreuzberg::{ExtractionConfig, TokenReductionConfig};

let config = ExtractionConfig {
    token_reduction: Some(TokenReductionConfig {
        mode: "moderate".to_string(),
        preserve_markdown: true,
        preserve_code: true,
        language_hint: Some("eng".to_string()),
        ..Default::default()
    }),
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    tokenReduction: {
        mode: 'moderate',
        preserveImportantWords: true,
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    TokenReduction = new TokenReductionConfig
    {
        Mode = "moderate",
        PreserveMarkdown = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync(
    "verbose_document.pdf",
    config
);

var original = result.Metadata.ContainsKey("original_token_count")
    ? (int)result.Metadata["original_token_count"]
    : 0;

var reduced = result.Metadata.ContainsKey("token_count")
    ? (int)result.Metadata["token_count"]
    : 0;

var ratio = result.Metadata.ContainsKey("token_reduction_ratio")
    ? (double)result.Metadata["token_reduction_ratio"]
    : 0.0;

Console.WriteLine($"Reduced from {original} to {reduced} tokens");
Console.WriteLine($"Reduction: {ratio * 100:F1}%");
Go
package main

import (
    "fmt"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    preserveMarkdown := true
    mode := "moderate"

    config := &kreuzberg.ExtractionConfig{
        TokenReduction: &kreuzberg.TokenReductionConfig{
            Mode:             &mode,
            PreserveMarkdown: &preserveMarkdown,
        },
    }

    result, err := kreuzberg.ExtractFileSync("verbose_document.pdf", config)
    if err != nil {
        log.Fatalf("extraction failed: %v", err)
    }

    original := 0
    reduced := 0
    ratio := 0.0

    if val, ok := result.Metadata["original_token_count"]; ok {
        original = val.(int)
    }

    if val, ok := result.Metadata["token_count"]; ok {
        reduced = val.(int)
    }

    if val, ok := result.Metadata["token_reduction_ratio"]; ok {
        ratio = val.(float64)
    }

    fmt.Printf("Reduced from %d to %d tokens\n", original, reduced)
    fmt.Printf("Reduction: %.1f%%\n", ratio*100)
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.TokenReductionConfig;
import java.util.Map;

ExtractionConfig config = ExtractionConfig.builder()
    .tokenReduction(TokenReductionConfig.builder()
        .mode("moderate")
        .preserveMarkdown(true)
        .build())
    .build();

ExtractionResult result = Kreuzberg.extractFile("verbose_document.pdf", config);

Map<String, Object> metadata = result.getMetadata() != null ? result.getMetadata() : Map.of();

int original = metadata.containsKey("original_token_count")
    ? ((Number) metadata.get("original_token_count")).intValue()
    : 0;

int reduced = metadata.containsKey("token_count")
    ? ((Number) metadata.get("token_count")).intValue()
    : 0;

double ratio = metadata.containsKey("token_reduction_ratio")
    ? ((Number) metadata.get("token_reduction_ratio")).doubleValue()
    : 0.0;

System.out.println("Reduced from " + original + " to " + reduced + " tokens");
System.out.println(String.format("Reduction: %.1f%%", ratio * 100));
Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig, TokenReductionConfig

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        token_reduction=TokenReductionConfig(
            mode="moderate", preserve_markdown=True
        )
    )
    result = await extract_file("verbose_document.pdf", config=config)
    original: int = result.metadata.get("original_token_count", 0)
    reduced: int = result.metadata.get("token_count", 0)
    ratio: float = result.metadata.get("token_reduction_ratio", 0.0)
    print(f"Reduced from {original} to {reduced} tokens")
    print(f"Reduction: {ratio * 100:.1f}%")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  token_reduction: Kreuzberg::Config::TokenReduction.new(
    mode: 'moderate',
    preserve_markdown: true
  )
)

result = Kreuzberg.extract_file_sync('verbose_document.pdf', config: config)

original_tokens = result.metadata&.dig('original_token_count') || 0
reduced_tokens = result.metadata&.dig('token_count') || 0
reduction_ratio = result.metadata&.dig('token_reduction_ratio') || 0.0

puts "Reduced from #{original_tokens} to #{reduced_tokens} tokens"
puts "Reduction: #{(reduction_ratio * 100).round(1)}%"
Rust
use kreuzberg::{extract_file, ExtractionConfig, TokenReductionConfig};

let config = ExtractionConfig {
    token_reduction: Some(TokenReductionConfig {
        mode: "moderate".to_string(),
        preserve_markdown: true,
        ..Default::default()
    }),
    ..Default::default()
};

let result = extract_file("verbose_document.pdf", None, &config).await?;

if let Some(original) = result.metadata.additional.get("original_token_count") {
    println!("Original tokens: {}", original);
}
if let Some(reduced) = result.metadata.additional.get("token_count") {
    println!("Reduced tokens: {}", reduced);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    tokenReduction: {
        mode: 'moderate',
        preserveImportantWords: true,
    },
};

const result = await extractFile('verbose_document.pdf', null, config);
console.log(`Content length: ${result.content.length}`);
console.log(`Metadata: ${JSON.stringify(result.metadata)}`);

Keyword Extraction

Extract important keywords and phrases using YAKE or RAKE algorithms.

Feature Flag Required

Keyword extraction requires Kreuzberg to be built with the keywords feature flag enabled.

Available Algorithms

YAKE (Yet Another Keyword Extractor):

  • Statistical/unsupervised approach
  • Factors: term frequency, position, capitalization, context
  • Best for: General-purpose extraction

RAKE (Rapid Automatic Keyword Extraction):

  • Co-occurrence based
  • Analyzes word frequency and degree in phrases
  • Best for: Domain-specific terms, phrase extraction

Configuration

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Keywords = new KeywordConfig
    {
        Algorithm = KeywordAlgorithm.Yake,
        MaxKeywords = 10,
        MinScore = 0.3,
        NgramRange = (1, 3),
        Language = "en"
    }
};
Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    config := &kreuzberg.ExtractionConfig{
        Keywords: &kreuzberg.KeywordConfig{
            Algorithm:  "YAKE",
            MaxKeywords: 10,
            MinScore:   0.3,
            NgramRange: "1,3",
            Language:   "en",
        },
    }

    fmt.Printf("Keywords config: Algorithm=%s, MaxKeywords=%d, MinScore=%f\n",
        config.Keywords.Algorithm,
        config.Keywords.MaxKeywords,
        config.Keywords.MinScore)
}
Java
// Note: Keyword extraction is not yet available in Java bindings
// This feature requires the 'keywords' feature flag and is planned for a future release
Python
import asyncio
from kreuzberg import (
    ExtractionConfig,
    KeywordConfig,
    KeywordAlgorithm,
    extract_file,
)

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        keywords=KeywordConfig(
            algorithm=KeywordAlgorithm.YAKE,
            max_keywords=10,
            min_score=0.3,
            ngram_range=(1, 3),
            language="en"
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content extracted: {len(result.content)} chars")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  keywords: Kreuzberg::Config::Keywords.new(
    algorithm: Kreuzberg::KeywordAlgorithm::YAKE,
    max_keywords: 10,
    min_score: 0.3,
    ngram_range: [1, 3],
    language: 'en'
  )
)
Rust
use kreuzberg::{ExtractionConfig, KeywordConfig, KeywordAlgorithm};

let config = ExtractionConfig {
    keywords: Some(KeywordConfig {
        algorithm: KeywordAlgorithm::Yake,
        max_keywords: 10,
        min_score: 0.3,
        ngram_range: (1, 3),
        language: Some("en".to_string()),
        ..Default::default()
    }),
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    keywords: {
        algorithm: 'yake',
        maxKeywords: 10,
        minScore: 0.3,
        ngramRange: [1, 3],
        language: 'en',
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(`Content: ${result.content}`);

Example

C#
using Kreuzberg;
using System.Collections.Generic;

var config = new ExtractionConfig
{
    Keywords = new KeywordConfig
    {
        Algorithm = KeywordAlgorithm.Yake,
        MaxKeywords = 10,
        MinScore = 0.3
    }
};

var result = await KreuzbergClient.ExtractFileAsync(
    "research_paper.pdf",
    config
);

if (result.Metadata.ContainsKey("keywords"))
{
    var keywords = (List<Dictionary<string, object>>)result.Metadata["keywords"];
    foreach (var kw in keywords)
    {
        var text = (string)kw["text"];
        var score = (double)kw["score"];
        Console.WriteLine($"{text}: {score:F3}");
    }
}
Go
package main

import (
    "fmt"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    maxKeywords := int32(10)
    minScore := 0.3

    config := &kreuzberg.ExtractionConfig{
        Keywords: &kreuzberg.KeywordConfig{
            Algorithm:   kreuzberg.KeywordAlgorithm_YAKE,
            MaxKeywords: &maxKeywords,
            MinScore:    &minScore,
        },
    }

    result, err := kreuzberg.ExtractFileSync("research_paper.pdf", config)
    if err != nil {
        log.Fatalf("extraction failed: %v", err)
    }

    if keywords, ok := result.Metadata["keywords"]; ok {
        keywordList := keywords.([]map[string]interface{})
        for _, kw := range keywordList {
            text := kw["text"].(string)
            score := kw["score"].(float64)
            fmt.Printf("%s: %.3f\n", text, score)
        }
    }
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.KeywordConfig;
import dev.kreuzberg.config.KeywordAlgorithm;
import java.util.List;
import java.util.Map;

ExtractionConfig config = ExtractionConfig.builder()
    .keywords(KeywordConfig.builder()
        .algorithm(KeywordAlgorithm.YAKE)
        .maxKeywords(10)
        .minScore(0.3)
        .build())
    .build();

ExtractionResult result = Kreuzberg.extractFile("research_paper.pdf", config);

Map<String, Object> metadata = result.getMetadata() != null ? result.getMetadata() : Map.of();

if (metadata.containsKey("keywords")) {
    List<Map<String, Object>> keywords = (List<Map<String, Object>>) metadata.get("keywords");
    for (Map<String, Object> kw : keywords) {
        String text = (String) kw.get("text");
        Double score = ((Number) kw.get("score")).doubleValue();
        System.out.println(text + ": " + String.format("%.3f", score));
    }
}
Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig, KeywordConfig, KeywordAlgorithm

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        keywords=KeywordConfig(
            algorithm=KeywordAlgorithm.YAKE,
            max_keywords=10,
            min_score=0.3
        )
    )
    result = await extract_file("research_paper.pdf", config=config)

    keywords: list = result.metadata.get("keywords", [])
    for kw in keywords:
        score: float = kw.get("score", 0.0)
        text: str = kw.get("text", "")
        print(f"{text}: {score:.3f}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  keywords: Kreuzberg::Config::Keywords.new(
    algorithm: Kreuzberg::KeywordAlgorithm::YAKE,
    max_keywords: 10,
    min_score: 0.3
  )
)

result = Kreuzberg.extract_file_sync('research_paper.pdf', config: config)

keywords = result.metadata&.dig('keywords') || []
keywords.each do |kw|
  text = kw['text']
  score = kw['score']
  puts "#{text}: #{score.round(3)}"
end
Rust
use kreuzberg::{extract_file, ExtractionConfig, KeywordConfig, KeywordAlgorithm};

let config = ExtractionConfig {
    keywords: Some(KeywordConfig {
        algorithm: KeywordAlgorithm::Yake,
        max_keywords: 10,
        min_score: 0.3,
        ..Default::default()
    }),
    ..Default::default()
};

let result = extract_file("research_paper.pdf", None, &config).await?;

if let Some(keywords) = result.metadata.additional.get("keywords") {
    println!("Keywords: {:?}", keywords);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    keywords: {
        algorithm: 'yake',
        maxKeywords: 10,
        minScore: 0.3,
    },
};

const result = await extractFile('research_paper.pdf', null, config);
console.log(`Content length: ${result.content.length}`);
console.log(`Metadata: ${JSON.stringify(result.metadata)}`);

Quality Processing

Automatic text quality scoring detects OCR artifacts, script content, and navigation elements, and evaluates document structure and metadata.

Quality Factors

| Factor              | Weight | Detects                                                |
| ------------------- | ------ | ------------------------------------------------------ |
| OCR Artifacts       | 30%    | Scattered chars, repeated punctuation, malformed words |
| Script Content      | 20%    | JavaScript, CSS, HTML tags                             |
| Navigation Elements | 10%    | Breadcrumbs, pagination, skip links                    |
| Document Structure  | 20%    | Sentence/paragraph length, punctuation                 |
| Metadata Quality    | 10%    | Title, author, subject presence                        |
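
These factor scores combine into a single quality score. Here is a minimal sketch of how such a weighted combination could work; the actual scoring internals aren't documented here, the factor names are assumptions, and since the listed weights sum to 0.9 the sketch normalizes:

# Hypothetical illustration only; factor names and the linear combination are assumptions.
FACTOR_WEIGHTS = {
    "ocr_artifacts": 0.30,
    "script_content": 0.20,
    "navigation_elements": 0.10,
    "document_structure": 0.20,
    "metadata_quality": 0.10,
}

def composite_quality(factor_scores: dict[str, float]) -> float:
    """Combine per-factor scores (0.0-1.0, higher is cleaner) into one weighted score."""
    weighted = sum(FACTOR_WEIGHTS[name] * factor_scores.get(name, 0.0) for name in FACTOR_WEIGHTS)
    return weighted / sum(FACTOR_WEIGHTS.values())  # normalize: listed weights sum to 0.9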

Configuration

Quality processing is enabled by default:

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    EnableQualityProcessing = true
};

var result = await KreuzbergClient.ExtractFileAsync(
    "document.pdf",
    config
);

var qualityScore = result.Metadata.ContainsKey("quality_score")
    ? (double)result.Metadata["quality_score"]
    : 0.0;

Console.WriteLine($"Quality score: {qualityScore:F2}");
Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    config := &kreuzberg.ExtractionConfig{
        EnableQualityProcessing: true,  // Default
    }

    fmt.Printf("Quality processing enabled: %v\n", config.EnableQualityProcessing)
}
Java
import dev.kreuzberg.config.ExtractionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .enableQualityProcessing(true)  // Default
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        enable_quality_processing=True
    )
    result = await extract_file("document.pdf", config=config)

    quality_score: float = result.metadata.get("quality_score", 0.0)
    print(f"Quality score: {quality_score:.2f}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  enable_quality_processing: true
)
Rust
use kreuzberg::ExtractionConfig;

let config = ExtractionConfig {
    enable_quality_processing: true,
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    enableQualityProcessing: true,
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

Quality Score

The quality score ranges from 0.0 (lowest quality) to 1.0 (highest quality):

  • 0.0-0.3: Very low quality (heavy OCR artifacts, script content)
  • 0.3-0.6: Low quality (some artifacts, poor structure)
  • 0.6-0.8: Moderate quality (clean text, decent structure)
  • 0.8-1.0: High quality (excellent structure, no artifacts)
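
Since the listed ranges share their endpoints, a hypothetical helper that maps a score onto these bands (treating each upper bound as exclusive) makes the boundaries unambiguous:

def quality_band(score: float) -> str:
    """Map a 0.0-1.0 quality score to the documented bands (upper bounds exclusive)."""
    if score < 0.3:
        return "very low"
    if score < 0.6:
        return "low"
    if score < 0.8:
        return "moderate"
    return "high"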

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    EnableQualityProcessing = true
};

var result = KreuzbergClient.ExtractFile(
    "scanned_document.pdf",
    config
);

var qualityScore = result.Metadata.ContainsKey("quality_score")
    ? (double)result.Metadata["quality_score"]
    : 0.0;

if (qualityScore < 0.5)
{
    Console.WriteLine(
        $"Warning: Low quality extraction ({qualityScore:F2})"
    );
    Console.WriteLine(
        "Consider re-scanning with higher DPI or adjusting OCR settings"
    );
}
else
{
    Console.WriteLine($"Quality score: {qualityScore:F2}");
}
Go
package main

import (
    "fmt"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    config := &kreuzberg.ExtractionConfig{
        EnableQualityProcessing: true,
    }

    result, err := kreuzberg.ExtractFileSync("scanned_document.pdf", config)
    if err != nil {
        log.Fatalf("extraction failed: %v", err)
    }

    qualityScore := 0.0
    if val, ok := result.Metadata["quality_score"].(float64); ok {
        qualityScore = val
    }

    if qualityScore < 0.5 {
        fmt.Printf("Warning: Low quality extraction (%.2f)\n", qualityScore)
        fmt.Println("Consider re-scanning with higher DPI or adjusting OCR settings")
    } else {
        fmt.Printf("Quality score: %.2f\n", qualityScore)
    }
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import java.util.Map;

ExtractionConfig config = ExtractionConfig.builder()
    .enableQualityProcessing(true)
    .build();

ExtractionResult result = Kreuzberg.extractFile("scanned_document.pdf", config);

Map<String, Object> metadata = result.getMetadata() != null ? result.getMetadata() : Map.of();

double qualityScore = metadata.containsKey("quality_score")
    ? ((Number) metadata.get("quality_score")).doubleValue()
    : 0.0;

if (qualityScore < 0.5) {
    System.out.println(String.format("Warning: Low quality extraction (%.2f)", qualityScore));
    System.out.println("Consider re-scanning with higher DPI or adjusting OCR settings");
} else {
    System.out.println(String.format("Quality score: %.2f", qualityScore));
}
Python
from kreuzberg import extract_file_sync, ExtractionConfig

config = ExtractionConfig(enable_quality_processing=True)
result = extract_file_sync("scanned_document.pdf", config=config)

quality_score = result.metadata.get("quality_score", 0.0)

if quality_score < 0.5:
    print(f"Warning: Low quality extraction ({quality_score:.2f})")
    print("Consider re-scanning with higher DPI or adjusting OCR settings")
else:
    print(f"Quality score: {quality_score:.2f}")
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  enable_quality_processing: true
)

result = Kreuzberg.extract_file_sync('scanned_document.pdf', config: config)

quality_score = result.metadata&.dig('quality_score') || 0.0

if quality_score < 0.5
  puts "Warning: Low quality extraction (#{quality_score.round(2)})"
  puts "Consider re-scanning with higher DPI or adjusting OCR settings"
else
  puts "Quality score: #{quality_score.round(2)}"
end
Rust
use kreuzberg::{extract_file, ExtractionConfig};

let config = ExtractionConfig {
    enable_quality_processing: true,
    ..Default::default()
};
let result = extract_file("scanned_document.pdf", None, &config).await?;

if let Some(quality) = result.metadata.additional.get("quality_score") {
    let score: f64 = quality.as_f64().unwrap_or(0.0);
    if score < 0.5 {
        println!("Warning: Low quality extraction ({:.2})", score);
    } else {
        println!("Quality score: {:.2}", score);
    }
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    enableQualityProcessing: true,
};

const result = await extractFile('scanned_document.pdf', null, config);
console.log(`Content length: ${result.content.length} characters`);
console.log(`Metadata: ${JSON.stringify(result.metadata)}`);

Combining Features

Advanced features can be combined in a single configuration. For example, a RAG ingestion pipeline can score quality, detect languages, reduce tokens, chunk, embed, and extract keywords in one pass:

C#
using System;
using System.Threading.Tasks;
using Kreuzberg;

async Task RunRagPipeline()
{
    var config = new ExtractionConfig
    {
        EnableQualityProcessing = true,

        LanguageDetection = new LanguageDetectionConfig
        {
            Enabled = true,
            DetectMultiple = true,
            MinConfidence = 0.8,
        },

        TokenReduction = new TokenReductionConfig
        {
            Mode = "moderate",
            PreserveImportantWords = true,
        },

        Chunking = new ChunkingConfig
        {
            MaxChars = 512,
            MaxOverlap = 50,
            Embedding = new EmbeddingConfig
            {
                Model = EmbeddingModelType.Preset("balanced"),
            },
            Enabled = true,
        },

        Keywords = new KeywordConfig
        {
            Algorithm = "yake",
            MaxKeywords = 10,
        },
    };

    var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);

    Console.WriteLine($"Content length: {result.Content.Length} characters");

    if (result.DetectedLanguages?.Count > 0)
    {
        Console.WriteLine($"Languages: {string.Join(", ", result.DetectedLanguages)}");
    }

    if (result.Chunks?.Count > 0)
    {
        Console.WriteLine($"Total chunks: {result.Chunks.Count}");
        var firstChunk = result.Chunks[0];
        Console.WriteLine($"First chunk tokens: {firstChunk.Metadata.TokenCount}");
        if (firstChunk.Embedding?.Length > 0)
        {
            Console.WriteLine($"Embedding dimensions: {firstChunk.Embedding.Length}");
        }
    }

    if (result.Metadata?.Additional?.ContainsKey("quality_score") == true)
    {
        Console.WriteLine($"Quality score: {result.Metadata.Additional["quality_score"]}");
    }

    if (result.Metadata?.Additional?.ContainsKey("keywords") == true)
    {
        Console.WriteLine($"Keywords: {result.Metadata.Additional["keywords"]}");
    }
}

await RunRagPipeline();
Go
package main

import (
    "fmt"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    maxChars := 512
    maxOverlap := 50
    minConfidence := 0.8
    maxKeywords := 10
    config := &kreuzberg.ExtractionConfig{
        EnableQualityProcessing: true,

        LanguageDetection: &kreuzberg.LanguageDetectionConfig{
            Enabled:        true,
            MinConfidence:  &minConfidence,
            DetectMultiple: true,
        },

        TokenReduction: &kreuzberg.TokenReductionConfig{
            Mode:             "moderate",
            PreserveMarkdown: true,
        },

        Chunking: &kreuzberg.ChunkingConfig{
            MaxChars:   &maxChars,
            MaxOverlap: &maxOverlap,
            Embedding: &kreuzberg.EmbeddingConfig{
                Model:     "balanced",
                Normalize: true,
            },
        },

        Keywords: &kreuzberg.KeywordConfig{
            Algorithm:   kreuzberg.KeywordAlgorithm_YAKE,
            MaxKeywords: &maxKeywords,
        },
    }

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    fmt.Printf("Quality: %v\n", result.Metadata.Additional["quality_score"])
    fmt.Printf("Languages: %v\n", result.DetectedLanguages)
    fmt.Printf("Keywords: %v\n", result.Metadata.Additional["keywords"])
    if len(result.Chunks) > 0 && result.Chunks[0].Embedding != nil {
        fmt.Printf("Chunks: %d with %d dimensions\n", len(result.Chunks), len(result.Chunks[0].Embedding))
    }
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;
import dev.kreuzberg.config.LanguageDetectionConfig;
import dev.kreuzberg.config.TokenReductionConfig;
import java.util.Map;

ExtractionConfig config = ExtractionConfig.builder()
    .enableQualityProcessing(true)
    .languageDetection(LanguageDetectionConfig.builder()
        .enabled(true)
        .minConfidence(0.8)
        .build())
    .tokenReduction(TokenReductionConfig.builder()
        .mode("moderate")
        .preserveImportantWords(true)
        .build())
    .chunking(ChunkingConfig.builder()
        .maxChars(512)
        .maxOverlap(50)
        .embedding("balanced")
        .build())
    .build();

ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);

Map<String, Object> metadata = result.getMetadata() != null ? result.getMetadata() : Map.of();
Object qualityScore = metadata.get("quality_score");
if (qualityScore != null) {
    System.out.printf("Quality: %.2f%n", ((Number) qualityScore).doubleValue());
}
System.out.println("Languages: " + result.getDetectedLanguages());
System.out.println("Content length: " + result.getContent().length() + " characters");
Python
import asyncio
from kreuzberg import (
    extract_file,
    ExtractionConfig,
    ChunkingConfig,
    EmbeddingConfig,
    EmbeddingModelType,
    LanguageDetectionConfig,
    TokenReductionConfig,
)

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        enable_quality_processing=True,
        language_detection=LanguageDetectionConfig(enabled=True),
        token_reduction=TokenReductionConfig(mode="moderate"),
        chunking=ChunkingConfig(
            max_chars=512,
            max_overlap=50,
            embedding=EmbeddingConfig(
                model=EmbeddingModelType.preset("balanced"), normalize=True
            ),
        ),
    )
    result = await extract_file("document.pdf", config=config)
    quality = result.metadata.get("quality_score", 0)
    print(f"Quality: {quality:.2f}")
    print(f"Languages: {result.detected_languages}")
    if result.chunks:
        print(f"Chunks: {len(result.chunks)}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  enable_quality_processing: true,
  language_detection: Kreuzberg::Config::LanguageDetection.new(
    enabled: true,
    detect_multiple: true
  ),
  token_reduction: Kreuzberg::Config::TokenReduction.new(mode: 'moderate'),
  chunking: Kreuzberg::Config::Chunking.new(
    max_chars: 512,
    max_overlap: 50,
    embedding: { normalize: true }
  ),
  keywords: Kreuzberg::Config::Keywords.new(
    algorithm: 'yake',
    max_keywords: 10
  )
)

result = Kreuzberg.extract_file_sync('document.pdf', config: config)
puts "Languages: #{result.detected_languages.inspect}"
puts "Chunks: #{result.chunks&.length || 0}"
Rust
use kreuzberg::{
    extract_file, ExtractionConfig, ChunkingConfig, EmbeddingConfig,
    LanguageDetectionConfig, TokenReductionConfig,
    KeywordConfig, KeywordAlgorithm
};

let config = ExtractionConfig {
    enable_quality_processing: true,

    language_detection: Some(LanguageDetectionConfig {
        enabled: true,
        detect_multiple: true,
        ..Default::default()
    }),

    token_reduction: Some(TokenReductionConfig {
        mode: "moderate".to_string(),
        preserve_markdown: true,
        ..Default::default()
    }),

    chunking: Some(ChunkingConfig {
        max_chars: 512,
        max_overlap: 50,
        embedding: Some(EmbeddingConfig {
            model: kreuzberg::EmbeddingModelType::Preset { name: "balanced".to_string() },
            normalize: true,
            ..Default::default()
        }),
        ..Default::default()
    }),

    keywords: Some(KeywordConfig {
        algorithm: KeywordAlgorithm::Yake,
        max_keywords: 10,
        ..Default::default()
    }),

    ..Default::default()
};

let result = extract_file("document.pdf", None, &config).await?;

if let Some(quality) = result.metadata.additional.get("quality_score") {
    println!("Quality: {:?}", quality);
}
println!("Languages: {:?}", result.detected_languages);
println!("Keywords: {:?}", result.metadata.additional.get("keywords"));
if let Some(chunks) = result.chunks {
    if let Some(first_chunk) = chunks.first() {
        if let Some(embedding) = &first_chunk.embedding {
            println!("Chunks: {} with {} dimensions", chunks.len(), embedding.len());
        }
    }
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    enableQualityProcessing: true,
    languageDetection: {
        enabled: true,
        detectMultiple: true,
    },
    tokenReduction: {
        mode: 'moderate',
        preserveImportantWords: true,
    },
    chunking: {
        maxChars: 512,
        maxOverlap: 50,
        embedding: {
            preset: 'balanced',
        },
    },
    keywords: {
        algorithm: 'yake',
        maxKeywords: 10,
    },
};

const result = await extractFile('document.pdf', null, config);

console.log(`Content length: ${result.content.length}`);
if (result.detectedLanguages) {
    console.log(`Languages: ${result.detectedLanguages.join(', ')}`);
}
if (result.chunks && result.chunks.length > 0) {
    console.log(`Chunks: ${result.chunks.length}`);
}

Page Tracking Patterns

Advanced patterns for using page tracking in real-world applications.

Chunk-to-Page Mapping

When both chunking and page tracking are enabled, chunks automatically include page metadata:

from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig, PageConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=500, max_overlap=50),
    pages=PageConfig(extract_pages=True)
)

result = extract_file_sync("document.pdf", config=config)

if result.chunks:
    for chunk in result.chunks:
        if chunk.metadata.first_page:
            page_range = (
                f"Page {chunk.metadata.first_page}"
                if chunk.metadata.first_page == chunk.metadata.last_page
                else f"Pages {chunk.metadata.first_page}-{chunk.metadata.last_page}"
            )
            print(f"Chunk: {chunk.text[:50]}... ({page_range})")

Filter chunks by page range for focused retrieval:

def search_in_pages(chunks: list[Chunk], query: str, page_start: int, page_end: int) -> list[Chunk]:
    """Search only within the specified page range."""
    page_chunks = [
        c for c in chunks
        if c.metadata.first_page and c.metadata.last_page
        and c.metadata.first_page >= page_start
        and c.metadata.last_page <= page_end
    ]
    # search_chunks is your application's own retrieval routine, not a Kreuzberg API.
    return search_chunks(page_chunks, query)

Page-Aware Embeddings

Include page context in embeddings for better retrieval:

# embed() and store_with_metadata() are placeholders for your embedding model and vector store.
for chunk in result.chunks:
    if chunk.metadata.first_page:
        context = f"Page {chunk.metadata.first_page}: {chunk.text}"
        embedding = embed(context)
        store_with_metadata(embedding, {
            "page": chunk.metadata.first_page,
            "text": chunk.text
        })

Per-Page Processing

Process each page independently:

from kreuzberg import extract_file_sync, ExtractionConfig, PageConfig

config = ExtractionConfig(
    pages=PageConfig(extract_pages=True)
)

result = extract_file_sync("document.pdf", config=config)

if result.pages:
    for page in result.pages:
        print(f"Page {page.page_number}:")
        print(f"  Content: {len(page.content)} chars")
        print(f"  Tables: {len(page.tables)}")
        print(f"  Images: {len(page.images)}")

Format-Specific Strategies

PDF Documents: Use byte boundaries for precise page lookups. Ideal for legal documents, research papers.
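
For precise lookups, a chunk's character offset can be mapped back to its page by searching the boundary list. A minimal sketch, assuming boundaries are per-page (start, end) offsets; the exact shape of PageStructure.boundaries may differ:

from bisect import bisect_right

def page_for_offset(boundaries: list[tuple[int, int]], offset: int) -> int | None:
    """Binary-search (start, end) page boundaries for the 1-based page containing offset."""
    starts = [start for start, _ in boundaries]
    idx = bisect_right(starts, offset) - 1
    if idx >= 0 and boundaries[idx][0] <= offset < boundaries[idx][1]:
        return idx + 1
    return None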

Presentations (PPTX): Process slides independently. Use PageUnitType::Slide to distinguish from regular pages.

Word Documents (DOCX): Page breaks may be approximate. Verify PageStructure.boundaries exists before using.

Multi-Format: Check PageStructure availability:

if result.metadata.pages and result.metadata.pages.boundaries:
    # Page tracking available
    process_with_pages(result)
else:
    # Fallback to page-less processing
    process_without_pages(result)