Advanced Features¶
Kreuzberg provides text processing, analysis, and optimization features beyond basic extraction.
Text Chunking¶
flowchart TD
Start[Extracted Text] --> Detect{Detect Format}
Detect -->|Markdown| MarkdownChunker[Markdown Chunker]
Detect -->|Plain Text| TextChunker[Text Chunker]
MarkdownChunker --> MDStrategy[Structure-Aware Splitting]
MDStrategy --> MDPreserve[Preserve:<br/>- Headings<br/>- Lists<br/>- Code blocks<br/>- Formatting]
TextChunker --> TextStrategy[Generic Text Splitting]
TextStrategy --> TextBoundaries[Smart Boundaries:<br/>- Whitespace<br/>- Punctuation<br/>- Sentence breaks]
MDPreserve --> CreateChunks[Create Chunks]
TextBoundaries --> CreateChunks
CreateChunks --> Config[Apply ChunkingConfig]
Config --> MaxChars[max_chars: Max size]
Config --> Overlap[max_overlap: Overlap]
MaxChars --> FinalChunks[Final Chunks]
Overlap --> FinalChunks
FinalChunks --> Metadata[Add Metadata:<br/>- char_start/end<br/>- chunk_index<br/>- total_chunks<br/>- token_count]
Metadata --> Embeddings{Generate<br/>Embeddings?}
Embeddings -->|Yes| AddEmbeddings[Add Embedding Vectors]
Embeddings -->|No| Return[Return Chunks]
AddEmbeddings --> Return
style MarkdownChunker fill:#FFD700
style TextChunker fill:#87CEEB
style CreateChunks fill:#90EE90
style AddEmbeddings fill:#FFB6C1

Split extracted text into chunks for downstream processing like RAG (Retrieval-Augmented Generation) systems, vector databases, or LLM context windows.
Overview¶
Kreuzberg uses the text-splitter library with two chunking strategies:
- Text Chunker: Generic text splitting with smart boundaries (whitespace, punctuation)
- Markdown Chunker: Structure-aware splitting that preserves headings, lists, code blocks, and formatting
Configuration¶
using Kreuzberg;
class Program
{
static async Task Main()
{
var config = new ExtractionConfig
{
Chunking = new ChunkingConfig
{
MaxChars = 1000,
MaxOverlap = 200,
Embedding = new EmbeddingConfig
{
Model = EmbeddingModelType.Preset("all-minilm-l6-v2"),
Normalize = true,
BatchSize = 32
}
}
};
try
{
var result = await KreuzbergClient.ExtractFileAsync(
"document.pdf",
config
).ConfigureAwait(false);
Console.WriteLine($"Chunks: {result.Chunks.Count}");
foreach (var chunk in result.Chunks)
{
Console.WriteLine($"Content length: {chunk.Content.Length}");
if (chunk.Embedding != null)
{
Console.WriteLine($"Embedding dimensions: {chunk.Embedding.Length}");
}
}
}
catch (KreuzbergException ex)
{
Console.WriteLine($"Error: {ex.Message}");
}
}
}
package main
import (
"fmt"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
maxChars := 1000
maxOverlap := 200
config := &kreuzberg.ExtractionConfig{
Chunking: &kreuzberg.ChunkingConfig{
MaxChars: &maxChars,
MaxOverlap: &maxOverlap,
},
}
fmt.Printf("Config: MaxChars=%d, MaxOverlap=%d\n", *config.Chunking.MaxChars, *config.Chunking.MaxOverlap)
}
import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file
async def main() -> None:
config: ExtractionConfig = ExtractionConfig(
chunking=ChunkingConfig(
max_chars=1000,
max_overlap=200,
separator="sentence"
)
)
result = await extract_file("document.pdf", config=config)
print(f"Chunks: {len(result.chunks or [])}")
for chunk in result.chunks or []:
print(f"Length: {len(chunk.content)}")
asyncio.run(main())
import { initWasm, extractBytes } from '@kreuzberg/wasm';
await initWasm();
const config = {
chunking: {
maxChars: 1000,
chunkOverlap: 100
}
};
const bytes = new Uint8Array(buffer);
const result = await extractBytes(bytes, 'application/pdf', config);
result.chunks?.forEach((chunk, idx) => {
console.log(`Chunk ${idx}: ${chunk.content.substring(0, 50)}...`);
console.log(`Tokens: ${chunk.metadata?.token_count}`);
});
Chunk Output¶
Each chunk includes:
- content: The chunk text
- metadata:
  - char_start: Start position in original text
  - char_end: End position in original text
  - chunk_index: Zero-based chunk number
  - total_chunks: Total number of chunks
  - token_count: Token count (if embeddings enabled)
- embedding: Optional embedding vector (if configured)
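For a concrete view of these fields, here is a minimal Python sketch that prints them for each chunk; the attribute names follow the list above, and token_count and embedding are only populated when an embedding model is configured:

```python
import asyncio

from kreuzberg import ChunkingConfig, ExtractionConfig, extract_file


async def main() -> None:
    config = ExtractionConfig(chunking=ChunkingConfig(max_chars=1000, max_overlap=200))
    result = await extract_file("document.pdf", config=config)
    for chunk in result.chunks or []:
        meta = chunk.metadata
        # char_start/char_end are offsets into the original extracted text.
        print(f"Chunk {meta.chunk_index + 1}/{meta.total_chunks}: chars {meta.char_start}-{meta.char_end}")
        # token_count is reported only when embeddings are enabled.
        if getattr(meta, "token_count", None) is not None:
            print(f"  tokens: {meta.token_count}")
        if chunk.embedding:
            print(f"  embedding dimensions: {len(chunk.embedding)}")


asyncio.run(main())
```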
Example: RAG Pipeline¶
using Kreuzberg;
using System.Collections.Generic;
using System.Linq;

class RagPipelineExample
{
static async Task Main()
{
var config = new ExtractionConfig
{
Chunking = new ChunkingConfig
{
MaxChars = 500,
MaxOverlap = 50,
Embedding = new EmbeddingConfig
{
Model = EmbeddingModelType.Preset("all-mpnet-base-v2"),
Normalize = true,
BatchSize = 16
}
}
};
try
{
var result = await KreuzbergClient.ExtractFileAsync(
"research_paper.pdf",
config
).ConfigureAwait(false);
var vectorStore = await BuildVectorStoreAsync(result.Chunks)
.ConfigureAwait(false);
var query = "machine learning optimization";
var relevantChunks = await SearchAsync(vectorStore, query)
.ConfigureAwait(false);
Console.WriteLine($"Found {relevantChunks.Count} relevant chunks");
foreach (var chunk in relevantChunks.Take(3))
{
Console.WriteLine($"Content: {chunk.Content[..Math.Min(80, chunk.Content.Length)]}...");
Console.WriteLine($"Similarity: {chunk.Similarity:F3}\n");
}
}
catch (KreuzbergException ex)
{
Console.WriteLine($"Error: {ex.Message}");
}
}
static async Task<List<VectorEntry>> BuildVectorStoreAsync(
IEnumerable<Chunk> chunks)
{
return await Task.Run(() =>
{
return chunks.Select(c => new VectorEntry
{
Content = c.Content,
Embedding = c.Embedding?.ToArray() ?? Array.Empty<float>(),
Similarity = 0f
}).ToList();
}).ConfigureAwait(false);
}
static async Task<List<VectorEntry>> SearchAsync(
List<VectorEntry> store,
string query)
{
return await Task.Run(() =>
{
return store
.OrderByDescending(e => e.Similarity)
.ToList();
}).ConfigureAwait(false);
}
class VectorEntry
{
public string Content { get; set; } = string.Empty;
public float[] Embedding { get; set; } = Array.Empty<float>();
public float Similarity { get; set; }
}
}
package main
import (
"fmt"
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
maxChars := 500
maxOverlap := 50
normalize := true
batchSize := int32(16)
config := &kreuzberg.ExtractionConfig{
Chunking: &kreuzberg.ChunkingConfig{
MaxChars: &maxChars,
MaxOverlap: &maxOverlap,
Embedding: &kreuzberg.EmbeddingConfig{
Model: kreuzberg.EmbeddingModelType_Preset("all-mpnet-base-v2"),
Normalize: &normalize,
BatchSize: &batchSize,
},
},
}
result, err := kreuzberg.ExtractFileSync("research_paper.pdf", config)
if err != nil {
log.Fatalf("RAG extraction failed: %v", err)
}
chunks := result.Chunks
fmt.Printf("Found %d chunks for RAG pipeline\n", len(chunks))
for i := 0; i < len(chunks) && i < 3; i++ {
chunk := chunks[i]
content := chunk.Content
if len(content) > 80 {
content = content[:80]
}
fmt.Printf("Chunk %d: %s...\n", i, content)
}
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;
import dev.kreuzberg.config.EmbeddingConfig;
import dev.kreuzberg.config.EmbeddingModelType;
import java.util.List;
ExtractionConfig config = ExtractionConfig.builder()
.chunking(ChunkingConfig.builder()
.maxChars(500)
.maxOverlap(50)
.embedding(EmbeddingConfig.builder()
.model(EmbeddingModelType.preset("all-mpnet-base-v2"))
.normalize(true)
.batchSize(16)
.build())
.build())
.build();
try {
ExtractionResult result = Kreuzberg.extractFile("research_paper.pdf", config);
List<Object> chunks = result.getChunks() != null ? result.getChunks() : List.of();
System.out.println("Found " + chunks.size() + " chunks for RAG pipeline");
for (int i = 0; i < Math.min(3, chunks.size()); i++) {
Object chunk = chunks.get(i);
System.out.println("Chunk " + i + ": " + chunk.toString().substring(0, Math.min(80, chunk.toString().length())) + "...");
}
} catch (Exception ex) {
System.err.println("RAG extraction failed: " + ex.getMessage());
}
import asyncio
from kreuzberg import (
extract_file,
ExtractionConfig,
ChunkingConfig,
EmbeddingConfig,
EmbeddingModelType,
)
async def main() -> None:
config: ExtractionConfig = ExtractionConfig(
chunking=ChunkingConfig(
max_chars=500,
max_overlap=50,
embedding=EmbeddingConfig(
model=EmbeddingModelType.preset("balanced"),
normalize=True,
batch_size=16
)
)
)
result = await extract_file("research_paper.pdf", config=config)
chunks_with_embeddings: list = []
for chunk in result.chunks or []:
if chunk.embedding:
chunks_with_embeddings.append({
"content": chunk.content[:100],
"embedding_dims": len(chunk.embedding)
})
print(f"Chunks with embeddings: {len(chunks_with_embeddings)}")
asyncio.run(main())
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
chunking: Kreuzberg::Config::Chunking.new(
max_chars: 500,
max_overlap: 50,
embedding: Kreuzberg::Config::Embedding.new(
model: Kreuzberg::EmbeddingModelType.new(
type: 'preset',
name: 'all-mpnet-base-v2'
),
normalize: true,
batch_size: 16
)
)
)
result = Kreuzberg.extract_file_sync('research_paper.pdf', config: config)
vector_store = build_vector_store(result.chunks)
query = 'machine learning optimization'
relevant_chunks = search_vector_store(vector_store, query)
puts "Found #{relevant_chunks.length} relevant chunks"
relevant_chunks.take(3).each do |chunk|
puts "Content: #{chunk[:content][0..80]}..."
puts "Similarity: #{chunk[:similarity]&.round(3)}\n"
end
def build_vector_store(chunks)
chunks.map.with_index do |chunk, idx|
{
id: idx,
content: chunk.content,
embedding: chunk.embedding,
similarity: 0.0
}
end
end
def search_vector_store(store, query)
store.sort_by { |entry| entry[:similarity] }.reverse
end
use kreuzberg::{extract_file, ExtractionConfig, ChunkingConfig, EmbeddingConfig};
let config = ExtractionConfig {
chunking: Some(ChunkingConfig {
max_chars: 500,
max_overlap: 50,
embedding: Some(EmbeddingConfig {
model: "balanced".to_string(),
normalize: true,
..Default::default()
}),
..Default::default()
}),
..Default::default()
};
let result = extract_file("research_paper.pdf", None, &config).await?;
if let Some(chunks) = result.chunks {
for chunk in chunks {
println!("Chunk {}/{}",
chunk.metadata.chunk_index + 1,
chunk.metadata.total_chunks
);
println!("Position: {}-{}",
chunk.metadata.char_start,
chunk.metadata.char_end
);
println!("Content: {}...", &chunk.content[..100.min(chunk.content.len())]);
if let Some(embedding) = chunk.embedding {
println!("Embedding: {} dimensions", embedding.len());
}
}
}
import { extractFile } from '@kreuzberg/node';
const config = {
chunking: {
maxChars: 500,
maxOverlap: 50,
embedding: {
preset: 'balanced',
},
},
};
const result = await extractFile('research_paper.pdf', null, config);
if (result.chunks) {
for (const chunk of result.chunks) {
console.log(`Chunk ${chunk.metadata.chunkIndex + 1}/${chunk.metadata.totalChunks}`);
console.log(`Position: ${chunk.metadata.charStart}-${chunk.metadata.charEnd}`);
console.log(`Content: ${chunk.content.slice(0, 100)}...`);
if (chunk.embedding) {
console.log(`Embedding: ${chunk.embedding.length} dimensions`);
}
}
}
import { initWasm, extractBytes } from '@kreuzberg/wasm';
await initWasm();
const config = {
chunking: {
maxChars: 1000,
chunkOverlap: 100,
embedding: {
model: { preset: 'all-MiniLM-L6-v2' }
}
}
};
const bytes = new Uint8Array(buffer);
const result = await extractBytes(bytes, 'application/pdf', config);
for (const chunk of result.chunks || []) {
console.log(`Chunk: ${chunk.content.substring(0, 100)}...`);
console.log(`Embedding: ${chunk.embedding?.slice(0, 5).join(', ')}...`);
}
Language Detection¶
flowchart TD
Start[Extracted Text] --> Config{detect_multiple?}
Config -->|false| SingleMode[Single Language Mode]
Config -->|true| MultiMode[Multiple Languages Mode]
SingleMode --> FullText[Analyze Full Text]
FullText --> DetectSingle[Detect Dominant Language]
DetectSingle --> CheckConfSingle{Confidence ≥<br/>min_confidence?}
CheckConfSingle -->|Yes| ReturnSingle[Return Language]
CheckConfSingle -->|No| EmptySingle[Return Empty]
MultiMode --> ChunkText[Split into 200-char Chunks]
ChunkText --> AnalyzeChunks[Analyze Each Chunk]
AnalyzeChunks --> DetectPerChunk[Detect Language per Chunk]
DetectPerChunk --> FilterConfidence[Filter by min_confidence]
FilterConfidence --> CountFrequency[Count Language Frequency]
CountFrequency --> SortByFrequency[Sort by Frequency]
SortByFrequency --> ReturnMultiple[Return Language List]
ReturnSingle --> Result[detected_languages]
EmptySingle --> Result
ReturnMultiple --> Result
Result --> ISOCodes[ISO 639-3 Codes:<br/>eng, spa, fra, deu, cmn,<br/>jpn, ara, rus, etc.]
style SingleMode fill:#87CEEB
style MultiMode fill:#FFD700
style ReturnSingle fill:#90EE90
style ReturnMultiple fill:#90EE90
style ISOCodes fill:#FFB6C1

Detect languages in extracted text using the fast whatlang library. Supports 60+ languages with ISO 639-3 codes.
Configuration¶
using Kreuzberg;
class Program
{
static async Task Main()
{
var config = new ExtractionConfig
{
LanguageDetection = new LanguageDetectionConfig
{
Enabled = true,
MinConfidence = 0.8m,
DetectMultiple = false
}
};
try
{
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
if (result.DetectedLanguages?.Count > 0)
{
Console.WriteLine($"Detected Language: {result.DetectedLanguages[0]}");
}
else
{
Console.WriteLine("No language detected");
}
Console.WriteLine($"Content length: {result.Content.Length} characters");
}
catch (KreuzbergException ex)
{
Console.WriteLine($"Extraction failed: {ex.Message}");
}
}
}
package main
import (
"fmt"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
minConfidence := 0.8
config := &kreuzberg.ExtractionConfig{
LanguageDetection: &kreuzberg.LanguageDetectionConfig{
Enabled: true,
MinConfidence: &minConfidence,
DetectMultiple: false,
},
}
fmt.Printf("Language detection enabled: %v\n", config.LanguageDetection.Enabled)
fmt.Printf("Min confidence: %f\n", *config.LanguageDetection.MinConfidence)
}
import asyncio
from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file
async def main() -> None:
config: ExtractionConfig = ExtractionConfig(
language_detection=LanguageDetectionConfig(
enabled=True,
min_confidence=0.85,
detect_multiple=False
)
)
result = await extract_file("document.pdf", config=config)
if result.detected_languages:
print(f"Primary language: {result.detected_languages[0]}")
print(f"Content length: {len(result.content)} chars")
asyncio.run(main())
import { extractFile } from '@kreuzberg/node';
const config = {
languageDetection: {
enabled: true,
minConfidence: 0.8,
detectMultiple: false,
},
};
const result = await extractFile('document.pdf', null, config);
if (result.detectedLanguages) {
console.log(`Detected languages: ${result.detectedLanguages.join(', ')}`);
}
import { initWasm, extractBytes } from '@kreuzberg/wasm';
await initWasm();
const config = {
language_detection: {
detect_multiple: true,
min_confidence: 0.5
}
};
const bytes = new Uint8Array(buffer);
const result = await extractBytes(bytes, 'application/pdf', config);
console.log('Detected languages:', result.language);
Detection Modes¶
Single Language (detect_multiple: false):

- Detects dominant language only
- Faster, single-pass detection
- Best for monolingual documents

Multiple Languages (detect_multiple: true):

- Chunks text into 200-character segments
- Detects language in each chunk
- Returns languages sorted by frequency
- Best for multilingual documents
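A compact Python comparison of the two modes on the same document, using the configuration fields shown above:

```python
import asyncio

from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file


async def compare_modes(path: str) -> None:
    for detect_multiple in (False, True):
        config = ExtractionConfig(
            language_detection=LanguageDetectionConfig(
                enabled=True,
                min_confidence=0.8,
                detect_multiple=detect_multiple,
            )
        )
        result = await extract_file(path, config=config)
        mode = "multiple" if detect_multiple else "single"
        # Single mode returns at most one language; multiple mode returns
        # languages ordered by frequency across 200-character segments.
        print(f"{mode}: {result.detected_languages or []}")


asyncio.run(compare_modes("multilingual_document.pdf"))
```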
Supported Languages¶
ISO 639-3 codes including:
- European: eng (English), spa (Spanish), fra (French), deu (German), ita (Italian), por (Portuguese), rus (Russian), nld (Dutch), pol (Polish), swe (Swedish)
- Asian: cmn (Chinese), jpn (Japanese), kor (Korean), tha (Thai), vie (Vietnamese), ind (Indonesian)
- Middle Eastern: ara (Arabic), pes (Persian), urd (Urdu), heb (Hebrew)
- And 40+ more
Example¶
using Kreuzberg;
class Program
{
static async Task Main()
{
var config = new ExtractionConfig
{
LanguageDetection = new LanguageDetectionConfig
{
Enabled = true,
MinConfidence = 0.8m,
DetectMultiple = true
}
};
try
{
var result = await KreuzbergClient.ExtractFileAsync("multilingual_document.pdf", config);
var languages = result.DetectedLanguages ?? new List<string>();
if (languages.Count > 0)
{
Console.WriteLine($"Detected {languages.Count} language(s): {string.Join(", ", languages)}");
}
else
{
Console.WriteLine("No languages detected");
}
Console.WriteLine($"Total content: {result.Content.Length} characters");
Console.WriteLine($"MIME type: {result.MimeType}");
}
catch (KreuzbergException ex)
{
Console.WriteLine($"Processing failed: {ex.Message}");
}
}
}
package main
import (
"fmt"
"log"
"strings"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
enabled := true
detectMultiple := true
minConfidence := 0.8
config := &kreuzberg.ExtractionConfig{
LanguageDetection: &kreuzberg.LanguageDetectionConfig{
Enabled: &enabled,
MinConfidence: &minConfidence,
DetectMultiple: &detectMultiple,
},
}
result, err := kreuzberg.ExtractFileSync("multilingual_document.pdf", config)
if err != nil {
log.Fatalf("Processing failed: %v", err)
}
languages := result.DetectedLanguages
if len(languages) > 0 {
fmt.Printf("Detected %d language(s): %s\n", len(languages), strings.Join(languages, ", "))
} else {
fmt.Println("No languages detected")
}
fmt.Printf("Total content: %d characters\n", len(result.Content))
fmt.Printf("MIME type: %s\n", result.MimeType)
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.LanguageDetectionConfig;
import java.math.BigDecimal;
import java.util.List;
ExtractionConfig config = ExtractionConfig.builder()
.languageDetection(LanguageDetectionConfig.builder()
.enabled(true)
.minConfidence(new BigDecimal("0.8"))
.detectMultiple(true)
.build())
.build();
try {
ExtractionResult result = Kreuzberg.extractFile("multilingual_document.pdf", config);
List<String> languages = result.getDetectedLanguages() != null
? result.getDetectedLanguages()
: List.of();
if (!languages.isEmpty()) {
System.out.println("Detected " + languages.size() + " language(s): " + String.join(", ", languages));
} else {
System.out.println("No languages detected");
}
System.out.println("Total content: " + result.getContent().length() + " characters");
System.out.println("MIME type: " + result.getMimeType());
} catch (Exception ex) {
System.err.println("Processing failed: " + ex.getMessage());
}
import asyncio
from kreuzberg import extract_file, ExtractionConfig, LanguageDetectionConfig
async def main() -> None:
config: ExtractionConfig = ExtractionConfig(
language_detection=LanguageDetectionConfig(
enabled=True,
min_confidence=0.7,
detect_multiple=True
)
)
result = await extract_file("multilingual_document.pdf", config=config)
languages: list[str] = result.detected_languages or []
print(f"Detected {len(languages)} languages: {languages}")
asyncio.run(main())
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
language_detection: Kreuzberg::Config::LanguageDetection.new(
enabled: true,
min_confidence: 0.8,
detect_multiple: true
)
)
result = Kreuzberg.extract_file_sync('multilingual_document.pdf', config: config)
languages = result.detected_languages || []
if languages.any?
puts "Detected #{languages.length} language(s): #{languages.join(', ')}"
else
puts "No languages detected"
end
puts "Total content: #{result.content.length} characters"
puts "MIME type: #{result.mime_type}"
use kreuzberg::{extract_file, ExtractionConfig, LanguageDetectionConfig};
let config = ExtractionConfig {
language_detection: Some(LanguageDetectionConfig {
enabled: true,
min_confidence: 0.8,
detect_multiple: true,
}),
..Default::default()
};
let result = extract_file("multilingual_document.pdf", None, &config).await?;
println!("Detected languages: {:?}", result.detected_languages);
import { extractFile } from '@kreuzberg/node';
const config = {
languageDetection: {
enabled: true,
minConfidence: 0.8,
detectMultiple: true,
},
};
const result = await extractFile('multilingual_document.pdf', null, config);
if (result.detectedLanguages) {
console.log(`Detected languages: ${result.detectedLanguages.join(', ')}`);
}
Embedding Generation¶
Generate embeddings for vector databases, semantic search, and RAG systems using ONNX models via fastembed-rs.
Available Presets¶
| Preset | Model | Dimensions | Max Tokens | Use Case |
|---|---|---|---|---|
| fast | AllMiniLML6V2Q | 384 | 512 | Rapid prototyping, development |
| balanced | BGEBaseENV15 | 768 | 1024 | Production RAG, general-purpose |
| quality | BGELargeENV15 | 1024 | 2000 | Maximum accuracy, complex docs |
| multilingual | MultilingualE5Base | 768 | 1024 | 100+ languages, international |
Max Tokens vs. max_chars
The "Max Tokens" values shown are the model's maximum token limits. These don't directly correspond to the max_chars setting in ChunkingConfig, which controls character-based chunking. The embedding model will process chunks up to its token limit.
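If you want chunk sizes that stay within a preset's token limit, you can derive max_chars from a rough characters-per-token estimate. The ~4 characters per token figure below is a common heuristic for English text, not a library constant, so treat this as a sketch:

```python
from kreuzberg import ChunkingConfig, EmbeddingConfig, EmbeddingModelType, ExtractionConfig

# Assumption: roughly 4 characters per token for English prose.
CHARS_PER_TOKEN = 4


def chunking_for_token_limit(max_tokens: int, margin: float = 0.9) -> ChunkingConfig:
    """Size character-based chunks so they fit under a model's token limit."""
    max_chars = int(max_tokens * CHARS_PER_TOKEN * margin)
    return ChunkingConfig(
        max_chars=max_chars,
        max_overlap=max_chars // 10,
        embedding=EmbeddingConfig(model=EmbeddingModelType.preset("balanced")),
    )


# The "balanced" preset (BGEBaseENV15) accepts up to 1024 tokens per the table above.
config = ExtractionConfig(chunking=chunking_for_token_limit(1024))
```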
Configuration¶
using Kreuzberg;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
var config = new ExtractionConfig
{
Chunking = new ChunkingConfig
{
MaxChars = 512,
MaxOverlap = 50,
Embedding = new EmbeddingConfig
{
Model = EmbeddingModelType.Preset("balanced"),
Normalize = true,
BatchSize = 32,
ShowDownloadProgress = false
}
}
};
var result = await Kreuzberg.ExtractFileAsync("document.pdf", config);
var chunks = result.Chunks ?? new List<Chunk>();
foreach (var (index, chunk) in chunks.WithIndex())
{
var chunkId = $"doc_chunk_{index}";
Console.WriteLine($"Chunk {chunkId}: {chunk.Content[..Math.Min(50, chunk.Content.Length)]}");
if (chunk.Embedding != null)
{
Console.WriteLine($" Embedding dimensions: {chunk.Embedding.Length}");
}
}
internal static class EnumerableExtensions
{
public static IEnumerable<(int Index, T Item)> WithIndex<T>(
this IEnumerable<T> items)
{
var index = 0;
foreach (var item in items)
{
yield return (index++, item);
}
}
}
package main
import (
"fmt"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
maxChars := 512
maxOverlap := 50
normalize := true
batchSize := int32(32)
showProgress := false
config := &kreuzberg.ExtractionConfig{
Chunking: &kreuzberg.ChunkingConfig{
MaxChars: &maxChars,
MaxOverlap: &maxOverlap,
Embedding: &kreuzberg.EmbeddingConfig{
Model: kreuzberg.EmbeddingModelType_Preset("balanced"),
Normalize: &normalize,
BatchSize: &batchSize,
ShowDownloadProgress: &showProgress,
},
},
}
result, err := kreuzberg.ExtractFileSync("document.pdf", config)
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
for index, chunk := range result.Chunks {
chunkID := fmt.Sprintf("doc_chunk_%d", index)
content := chunk.Content
if len(content) > 50 {
content = content[:50]
}
fmt.Printf("Chunk %s: %s\n", chunkID, content)
if chunk.Embedding != nil && len(chunk.Embedding) > 0 {
fmt.Printf(" Embedding dimensions: %d\n", len(chunk.Embedding))
}
}
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;
import dev.kreuzberg.config.EmbeddingConfig;
import dev.kreuzberg.config.EmbeddingModelType;
import java.util.List;
ExtractionConfig config = ExtractionConfig.builder()
.chunking(ChunkingConfig.builder()
.maxChars(512)
.maxOverlap(50)
.embedding(EmbeddingConfig.builder()
.model(EmbeddingModelType.preset("balanced"))
.normalize(true)
.batchSize(32)
.showDownloadProgress(false)
.build())
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
List<Object> chunks = result.getChunks() != null ? result.getChunks() : List.of();
for (int index = 0; index < chunks.size(); index++) {
Object chunk = chunks.get(index);
String chunkId = "doc_chunk_" + index;
System.out.println("Chunk " + chunkId + ": " + chunk.toString().substring(0, Math.min(50, chunk.toString().length())));
if (chunk instanceof java.util.Map) {
Object embedding = ((java.util.Map<String, Object>) chunk).get("embedding");
if (embedding != null) {
System.out.println(" Embedding dimensions: " + ((float[]) embedding).length);
}
}
}
from kreuzberg import (
ExtractionConfig,
ChunkingConfig,
EmbeddingConfig,
EmbeddingModelType,
)
config: ExtractionConfig = ExtractionConfig(
chunking=ChunkingConfig(
max_chars=1024,
max_overlap=100,
embedding=EmbeddingConfig(
model=EmbeddingModelType.preset("balanced"),
normalize=True,
batch_size=32,
show_download_progress=False,
),
)
)
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
chunking: Kreuzberg::Config::Chunking.new(
max_chars: 512,
max_overlap: 50,
embedding: Kreuzberg::Config::Embedding.new(
model: Kreuzberg::EmbeddingModelType.new(
type: 'preset',
name: 'balanced'
),
normalize: true,
batch_size: 32,
show_download_progress: false
)
)
)
result = Kreuzberg.extract_file_sync('document.pdf', config: config)
chunks = result.chunks || []
chunks.each_with_index do |chunk, idx|
chunk_id = "doc_chunk_#{idx}"
puts "Chunk #{chunk_id}: #{chunk.content[0...50]}"
if chunk.embedding
puts " Embedding dimensions: #{chunk.embedding.length}"
end
end
use kreuzberg::{ExtractionConfig, ChunkingConfig, EmbeddingConfig};
let config = ExtractionConfig {
chunking: Some(ChunkingConfig {
max_chars: 1024,
max_overlap: 100,
embedding: Some(EmbeddingConfig {
model: "balanced".to_string(),
normalize: true,
batch_size: 32,
show_download_progress: false,
..Default::default()
}),
..Default::default()
}),
..Default::default()
};
Example: Vector Database Integration¶
using Kreuzberg;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
public class VectorDatabaseIntegration
{
public class VectorRecord
{
public string Id { get; set; }
public float[] Embedding { get; set; }
public string Content { get; set; }
public Dictionary<string, string> Metadata { get; set; }
}
public async Task<List<VectorRecord>> ExtractAndVectorize(
string documentPath,
string documentId)
{
var config = new ExtractionConfig
{
Chunking = new ChunkingConfig
{
MaxChars = 512,
MaxOverlap = 50,
Embedding = new EmbeddingConfig
{
Model = EmbeddingModelType.Preset("balanced"),
Normalize = true,
BatchSize = 32
}
}
};
var result = await Kreuzberg.ExtractFileAsync(documentPath, config);
var chunks = result.Chunks ?? new List<Chunk>();
var vectorRecords = chunks
.Select((chunk, index) => new VectorRecord
{
Id = $"{documentId}_chunk_{index}",
Content = chunk.Content,
Embedding = chunk.Embedding,
Metadata = new Dictionary<string, string>
{
{ "document_id", documentId },
{ "chunk_index", index.ToString() },
{ "content_length", chunk.Content.Length.ToString() }
}
})
.ToList();
await StoreInVectorDatabase(vectorRecords);
return vectorRecords;
}
private async Task StoreInVectorDatabase(List<VectorRecord> records)
{
foreach (var record in records)
{
if (record.Embedding != null && record.Embedding.Length > 0)
{
Console.WriteLine(
$"Storing {record.Id}: {record.Content.Length} chars, " +
$"{record.Embedding.Length} dims");
}
}
await Task.CompletedTask;
}
}
package main
import (
"fmt"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
type VectorRecord struct {
ID string
Embedding []float32
Content string
Metadata map[string]string
}
func extractAndVectorize(documentPath string, documentID string) ([]VectorRecord, error) {
maxChars := 512
maxOverlap := 50
normalize := true
batchSize := int32(32)
config := &kreuzberg.ExtractionConfig{
Chunking: &kreuzberg.ChunkingConfig{
MaxChars: &maxChars,
MaxOverlap: &maxOverlap,
Embedding: &kreuzberg.EmbeddingConfig{
Model: kreuzberg.EmbeddingModelType_Preset("balanced"),
Normalize: &normalize,
BatchSize: &batchSize,
},
},
}
result, err := kreuzberg.ExtractFileSync(documentPath, config)
if err != nil {
return nil, err
}
var vectorRecords []VectorRecord
for index, chunk := range result.Chunks {
record := VectorRecord{
ID: fmt.Sprintf("%s_chunk_%d", documentID, index),
Content: chunk.Content,
Embedding: chunk.Embedding,
Metadata: map[string]string{
"document_id": documentID,
"chunk_index": fmt.Sprintf("%d", index),
"content_length": fmt.Sprintf("%d", len(chunk.Content)),
},
}
vectorRecords = append(vectorRecords, record)
}
storeInVectorDatabase(vectorRecords)
return vectorRecords, nil
}
func storeInVectorDatabase(records []VectorRecord) {
for _, record := range records {
if len(record.Embedding) > 0 {
fmt.Printf("Storing %s: %d chars, %d dims\n",
record.ID, len(record.Content), len(record.Embedding))
}
}
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;
import dev.kreuzberg.config.EmbeddingConfig;
import dev.kreuzberg.config.EmbeddingModelType;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class VectorDatabaseIntegration {
public static class VectorRecord {
public String id;
public float[] embedding;
public String content;
public Map<String, String> metadata;
}
public static List<VectorRecord> extractAndVectorize(String documentPath, String documentId) throws Exception {
ExtractionConfig config = ExtractionConfig.builder()
.chunking(ChunkingConfig.builder()
.maxChars(512)
.maxOverlap(50)
.embedding(EmbeddingConfig.builder()
.model(EmbeddingModelType.preset("balanced"))
.normalize(true)
.batchSize(32)
.build())
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile(documentPath, config);
List<Object> chunks = result.getChunks() != null ? result.getChunks() : List.of();
List<VectorRecord> vectorRecords = new java.util.ArrayList<>();
for (int index = 0; index < chunks.size(); index++) {
VectorRecord record = new VectorRecord();
record.id = documentId + "_chunk_" + index;
record.metadata = new HashMap<>();
record.metadata.put("document_id", documentId);
record.metadata.put("chunk_index", String.valueOf(index));
Object chunk = chunks.get(index);
if (chunk instanceof java.util.Map) {
Map<String, Object> chunkMap = (Map<String, Object>) chunks.get(index);
record.content = (String) chunkMap.get("content");
record.embedding = (float[]) chunkMap.get("embedding");
record.metadata.put("content_length", String.valueOf(record.content.length()));
}
vectorRecords.add(record);
}
storeInVectorDatabase(vectorRecords);
return vectorRecords;
}
private static void storeInVectorDatabase(List<VectorRecord> records) {
for (VectorRecord record : records) {
if (record.embedding != null && record.embedding.length > 0) {
System.out.println("Storing " + record.id + ": " + record.content.length()
+ " chars, " + record.embedding.length + " dims");
}
}
}
}
import asyncio
from kreuzberg import (
extract_file,
ExtractionConfig,
ChunkingConfig,
EmbeddingConfig,
EmbeddingModelType,
)
async def main() -> None:
config: ExtractionConfig = ExtractionConfig(
chunking=ChunkingConfig(
max_chars=512,
max_overlap=50,
embedding=EmbeddingConfig(
model=EmbeddingModelType.preset("balanced"), normalize=True
),
)
)
result = await extract_file("document.pdf", config=config)
chunks = result.chunks or []
for i, chunk in enumerate(chunks):
chunk_id: str = f"doc_chunk_{i}"
print(f"Chunk {chunk_id}: {chunk.content[:50]}")
asyncio.run(main())
require 'kreuzberg'
class VectorDatabaseIntegration
VectorRecord = Struct.new(:id, :embedding, :content, :metadata, keyword_init: true)
def extract_and_vectorize(document_path, document_id)
config = Kreuzberg::Config::Extraction.new(
chunking: Kreuzberg::Config::Chunking.new(
max_chars: 512,
max_overlap: 50,
embedding: Kreuzberg::Config::Embedding.new(
model: Kreuzberg::EmbeddingModelType.new(
type: 'preset',
name: 'balanced'
),
normalize: true,
batch_size: 32
)
)
)
result = Kreuzberg.extract_file_sync(document_path, config: config)
chunks = result.chunks || []
vector_records = chunks.map.with_index do |chunk, idx|
VectorRecord.new(
id: "#{document_id}_chunk_#{idx}",
content: chunk.content,
embedding: chunk.embedding,
metadata: {
document_id: document_id,
chunk_index: idx,
content_length: chunk.content.length
}
)
end
store_in_vector_database(vector_records)
vector_records
end
private
def store_in_vector_database(records)
records.each do |record|
if record.embedding&.any?
puts "Storing #{record.id}: #{record.content.length} chars, #{record.embedding.length} dims"
end
end
end
end
use kreuzberg::{extract_file, ExtractionConfig, ChunkingConfig, EmbeddingConfig};
struct VectorRecord {
id: String,
content: String,
embedding: Vec<f32>,
metadata: std::collections::HashMap<String, String>,
}
async fn extract_and_vectorize(
document_path: &str,
document_id: &str,
) -> Result<Vec<VectorRecord>, Box<dyn std::error::Error>> {
let config = ExtractionConfig {
chunking: Some(ChunkingConfig {
max_chars: 512,
max_overlap: 50,
embedding: Some(EmbeddingConfig {
model: kreuzberg::EmbeddingModelType::Preset {
name: "balanced".to_string(),
},
normalize: true,
batch_size: 32,
..Default::default()
}),
..Default::default()
}),
..Default::default()
};
let result = extract_file(document_path, None, &config).await?;
let mut records = Vec::new();
if let Some(chunks) = result.chunks {
for (index, chunk) in chunks.iter().enumerate() {
if let Some(embedding) = &chunk.embedding {
let mut metadata = std::collections::HashMap::new();
metadata.insert("document_id".to_string(), document_id.to_string());
metadata.insert("chunk_index".to_string(), index.to_string());
metadata.insert("content_length".to_string(), chunk.content.len().to_string());
records.push(VectorRecord {
id: format!("{}_chunk_{}", document_id, index),
content: chunk.content.clone(),
embedding: embedding.clone(),
metadata,
});
}
}
}
Ok(records)
}
import { extractFile } from '@kreuzberg/node';
const config = {
chunking: {
maxChars: 512,
maxOverlap: 50,
embedding: {
preset: 'balanced',
},
},
};
const result = await extractFile('document.pdf', null, config);
if (result.chunks) {
for (const chunk of result.chunks) {
console.log(`Chunk: ${chunk.content.slice(0, 100)}...`);
if (chunk.embedding) {
console.log(`Embedding dims: ${chunk.embedding.length}`);
}
}
}
Token Reduction¶
Intelligently reduce token count while preserving meaning. Removes stopwords and redundant text, and applies compression.
Reduction Levels¶
| Level | Reduction | Features |
|---|---|---|
| off | 0% | No reduction, pass-through |
| moderate | 15-25% | Stopwords + redundancy removal |
| aggressive | 30-50% | Semantic clustering, importance scoring |
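To see what each level does to your own documents, a small Python sketch can run the same file at every mode and report the reduction ratio recorded in the result metadata (the mode strings mirror the table above):

```python
import asyncio

from kreuzberg import ExtractionConfig, TokenReductionConfig, extract_file


async def compare_levels(path: str) -> None:
    for mode in ("off", "moderate", "aggressive"):
        config = ExtractionConfig(
            token_reduction=TokenReductionConfig(mode=mode, preserve_markdown=True)
        )
        result = await extract_file(path, config=config)
        # token_reduction_ratio is written into the result metadata when reduction runs.
        ratio = result.metadata.get("token_reduction_ratio", 0.0)
        print(f"{mode:>10}: {len(result.content)} chars, reduction {ratio * 100:.1f}%")


asyncio.run(compare_levels("verbose_document.pdf"))
```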
Configuration¶
package main
import (
"fmt"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
config := &kreuzberg.ExtractionConfig{
TokenReduction: &kreuzberg.TokenReductionConfig{
Mode: "moderate",
PreserveImportantWords: kreuzberg.BoolPtr(true),
},
}
fmt.Printf("Mode: %s, Preserve Important Words: %v\n",
config.TokenReduction.Mode,
*config.TokenReduction.PreserveImportantWords)
}
use kreuzberg::{ExtractionConfig, TokenReductionConfig};
let config = ExtractionConfig {
token_reduction: Some(TokenReductionConfig {
mode: "moderate".to_string(),
preserve_markdown: true,
preserve_code: true,
language_hint: Some("eng".to_string()),
..Default::default()
}),
..Default::default()
};
Example¶
using Kreuzberg;
var config = new ExtractionConfig
{
TokenReduction = new TokenReductionConfig
{
Mode = "moderate",
PreserveMarkdown = true
}
};
var result = await KreuzbergClient.ExtractFileAsync(
"verbose_document.pdf",
config
);
var original = result.Metadata.ContainsKey("original_token_count")
? (int)result.Metadata["original_token_count"]
: 0;
var reduced = result.Metadata.ContainsKey("token_count")
? (int)result.Metadata["token_count"]
: 0;
var ratio = result.Metadata.ContainsKey("token_reduction_ratio")
? (double)result.Metadata["token_reduction_ratio"]
: 0.0;
Console.WriteLine($"Reduced from {original} to {reduced} tokens");
Console.WriteLine($"Reduction: {ratio * 100:F1}%");
package main
import (
"fmt"
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
preserveMarkdown := true
mode := "moderate"
config := &kreuzberg.ExtractionConfig{
TokenReduction: &kreuzberg.TokenReductionConfig{
Mode: &mode,
PreserveMarkdown: &preserveMarkdown,
},
}
result, err := kreuzberg.ExtractFileSync("verbose_document.pdf", config)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
original := 0
reduced := 0
ratio := 0.0
if val, ok := result.Metadata["original_token_count"]; ok {
original = val.(int)
}
if val, ok := result.Metadata["token_count"]; ok {
reduced = val.(int)
}
if val, ok := result.Metadata["token_reduction_ratio"]; ok {
ratio = val.(float64)
}
fmt.Printf("Reduced from %d to %d tokens\n", original, reduced)
fmt.Printf("Reduction: %.1f%%\n", ratio*100)
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.TokenReductionConfig;
import java.util.Map;
ExtractionConfig config = ExtractionConfig.builder()
.tokenReduction(TokenReductionConfig.builder()
.mode("moderate")
.preserveMarkdown(true)
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("verbose_document.pdf", config);
Map<String, Object> metadata = result.getMetadata() != null ? result.getMetadata() : Map.of();
int original = metadata.containsKey("original_token_count")
? ((Number) metadata.get("original_token_count")).intValue()
: 0;
int reduced = metadata.containsKey("token_count")
? ((Number) metadata.get("token_count")).intValue()
: 0;
double ratio = metadata.containsKey("token_reduction_ratio")
? ((Number) metadata.get("token_reduction_ratio")).doubleValue()
: 0.0;
System.out.println("Reduced from " + original + " to " + reduced + " tokens");
System.out.println(String.format("Reduction: %.1f%%", ratio * 100));
import asyncio
from kreuzberg import extract_file, ExtractionConfig, TokenReductionConfig
async def main() -> None:
config: ExtractionConfig = ExtractionConfig(
token_reduction=TokenReductionConfig(
mode="moderate", preserve_markdown=True
)
)
result = await extract_file("verbose_document.pdf", config=config)
original: int = result.metadata.get("original_token_count", 0)
reduced: int = result.metadata.get("token_count", 0)
ratio: float = result.metadata.get("token_reduction_ratio", 0.0)
print(f"Reduced from {original} to {reduced} tokens")
print(f"Reduction: {ratio * 100:.1f}%")
asyncio.run(main())
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
token_reduction: Kreuzberg::Config::TokenReduction.new(
mode: 'moderate',
preserve_markdown: true
)
)
result = Kreuzberg.extract_file_sync('verbose_document.pdf', config: config)
original_tokens = result.metadata&.dig('original_token_count') || 0
reduced_tokens = result.metadata&.dig('token_count') || 0
reduction_ratio = result.metadata&.dig('token_reduction_ratio') || 0.0
puts "Reduced from #{original_tokens} to #{reduced_tokens} tokens"
puts "Reduction: #{(reduction_ratio * 100).round(1)}%"
use kreuzberg::{extract_file, ExtractionConfig, TokenReductionConfig};
let config = ExtractionConfig {
token_reduction: Some(TokenReductionConfig {
mode: "moderate".to_string(),
preserve_markdown: true,
..Default::default()
}),
..Default::default()
};
let result = extract_file("verbose_document.pdf", None, &config).await?;
if let Some(original) = result.metadata.additional.get("original_token_count") {
println!("Original tokens: {}", original);
}
if let Some(reduced) = result.metadata.additional.get("token_count") {
println!("Reduced tokens: {}", reduced);
}
import { extractFile } from '@kreuzberg/node';
const config = {
tokenReduction: {
mode: 'moderate',
preserveImportantWords: true,
},
};
const result = await extractFile('verbose_document.pdf', null, config);
console.log(`Content length: ${result.content.length}`);
console.log(`Metadata: ${JSON.stringify(result.metadata)}`);
Keyword Extraction¶
Extract important keywords and phrases using YAKE or RAKE algorithms.
Feature Flag Required
Keyword extraction requires the keywords feature flag to be enabled when building Kreuzberg.
Available Algorithms¶
YAKE (Yet Another Keyword Extractor):

- Statistical/unsupervised approach
- Factors: term frequency, position, capitalization, context
- Best for: General-purpose extraction

RAKE (Rapid Automatic Keyword Extraction):

- Co-occurrence based
- Analyzes word frequency and degree in phrases
- Best for: Domain-specific terms, phrase extraction
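To compare the two algorithms on the same document, something along these lines works; note that the RAKE enum member below is assumed by analogy with KeywordAlgorithm.YAKE used elsewhere on this page:

```python
import asyncio

from kreuzberg import ExtractionConfig, KeywordAlgorithm, KeywordConfig, extract_file


async def compare_algorithms(path: str) -> None:
    # KeywordAlgorithm.RAKE is an assumption; only YAKE appears in the examples below.
    for algorithm in (KeywordAlgorithm.YAKE, KeywordAlgorithm.RAKE):
        config = ExtractionConfig(
            keywords=KeywordConfig(algorithm=algorithm, max_keywords=10, min_score=0.3)
        )
        result = await extract_file(path, config=config)
        keywords = result.metadata.get("keywords", [])
        terms = ", ".join(kw.get("text", "") for kw in keywords)
        print(f"{algorithm}: {terms}")


asyncio.run(compare_algorithms("research_paper.pdf"))
```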
Configuration¶
package main
import (
"fmt"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
config := &kreuzberg.ExtractionConfig{
Keywords: &kreuzberg.KeywordConfig{
Algorithm: "YAKE",
MaxKeywords: 10,
MinScore: 0.3,
NgramRange: "1,3",
Language: "en",
},
}
fmt.Printf("Keywords config: Algorithm=%s, MaxKeywords=%d, MinScore=%f\n",
config.Keywords.Algorithm,
config.Keywords.MaxKeywords,
config.Keywords.MinScore)
}
import asyncio
from kreuzberg import (
ExtractionConfig,
KeywordConfig,
KeywordAlgorithm,
extract_file,
)
async def main() -> None:
config: ExtractionConfig = ExtractionConfig(
keywords=KeywordConfig(
algorithm=KeywordAlgorithm.YAKE,
max_keywords=10,
min_score=0.3,
ngram_range=(1, 3),
language="en"
)
)
result = await extract_file("document.pdf", config=config)
print(f"Content extracted: {len(result.content)} chars")
asyncio.run(main())
use kreuzberg::{ExtractionConfig, KeywordConfig, KeywordAlgorithm};
let config = ExtractionConfig {
keywords: Some(KeywordConfig {
algorithm: KeywordAlgorithm::Yake,
max_keywords: 10,
min_score: 0.3,
ngram_range: (1, 3),
language: Some("en".to_string()),
..Default::default()
}),
..Default::default()
};
Example¶
using Kreuzberg;
using System.Collections.Generic;
var config = new ExtractionConfig
{
Keywords = new KeywordConfig
{
Algorithm = KeywordAlgorithm.Yake,
MaxKeywords = 10,
MinScore = 0.3
}
};
var result = await KreuzbergClient.ExtractFileAsync(
"research_paper.pdf",
config
);
if (result.Metadata.ContainsKey("keywords"))
{
var keywords = (List<Dictionary<string, object>>)result.Metadata["keywords"];
foreach (var kw in keywords)
{
var text = (string)kw["text"];
var score = (double)kw["score"];
Console.WriteLine($"{text}: {score:F3}");
}
}
package main
import (
"fmt"
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
maxKeywords := int32(10)
minScore := 0.3
config := &kreuzberg.ExtractionConfig{
Keywords: &kreuzberg.KeywordConfig{
Algorithm: kreuzberg.KeywordAlgorithm_YAKE,
MaxKeywords: &maxKeywords,
MinScore: &minScore,
},
}
result, err := kreuzberg.ExtractFileSync("research_paper.pdf", config)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
if keywords, ok := result.Metadata["keywords"]; ok {
keywordList := keywords.([]map[string]interface{})
for _, kw := range keywordList {
text := kw["text"].(string)
score := kw["score"].(float64)
fmt.Printf("%s: %.3f\n", text, score)
}
}
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.KeywordConfig;
import dev.kreuzberg.config.KeywordAlgorithm;
import java.util.List;
import java.util.Map;
ExtractionConfig config = ExtractionConfig.builder()
.keywords(KeywordConfig.builder()
.algorithm(KeywordAlgorithm.YAKE)
.maxKeywords(10)
.minScore(0.3)
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("research_paper.pdf", config);
Map<String, Object> metadata = result.getMetadata() != null ? result.getMetadata() : Map.of();
if (metadata.containsKey("keywords")) {
List<Map<String, Object>> keywords = (List<Map<String, Object>>) metadata.get("keywords");
for (Map<String, Object> kw : keywords) {
String text = (String) kw.get("text");
Double score = ((Number) kw.get("score")).doubleValue();
System.out.println(text + ": " + String.format("%.3f", score));
}
}
import asyncio
from kreuzberg import extract_file, ExtractionConfig, KeywordConfig, KeywordAlgorithm
async def main() -> None:
config: ExtractionConfig = ExtractionConfig(
keywords=KeywordConfig(
algorithm=KeywordAlgorithm.YAKE,
max_keywords=10,
min_score=0.3
)
)
result = await extract_file("research_paper.pdf", config=config)
keywords: list = result.metadata.get("keywords", [])
for kw in keywords:
score: float = kw.get("score", 0.0)
text: str = kw.get("text", "")
print(f"{text}: {score:.3f}")
asyncio.run(main())
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
keywords: Kreuzberg::Config::Keywords.new(
algorithm: Kreuzberg::KeywordAlgorithm::YAKE,
max_keywords: 10,
min_score: 0.3
)
)
result = Kreuzberg.extract_file_sync('research_paper.pdf', config: config)
keywords = result.metadata&.dig('keywords') || []
keywords.each do |kw|
text = kw['text']
score = kw['score']
puts "#{text}: #{score.round(3)}"
end
use kreuzberg::{extract_file, ExtractionConfig, KeywordConfig, KeywordAlgorithm};
let config = ExtractionConfig {
keywords: Some(KeywordConfig {
algorithm: KeywordAlgorithm::Yake,
max_keywords: 10,
min_score: 0.3,
..Default::default()
}),
..Default::default()
};
let result = extract_file("research_paper.pdf", None, &config).await?;
if let Some(keywords) = result.metadata.additional.get("keywords") {
println!("Keywords: {:?}", keywords);
}
import { extractFile } from '@kreuzberg/node';
const config = {
keywords: {
algorithm: 'yake',
maxKeywords: 10,
minScore: 0.3,
},
};
const result = await extractFile('research_paper.pdf', null, config);
console.log(`Content length: ${result.content.length}`);
console.log(`Metadata: ${JSON.stringify(result.metadata)}`);
Quality Processing¶
Automatic text quality scoring that detects OCR artifacts, script content, and navigation elements, and evaluates document structure.
Quality Factors¶
| Factor | Weight | Detects |
|---|---|---|
| OCR Artifacts | 30% | Scattered chars, repeated punctuation, malformed words |
| Script Content | 20% | JavaScript, CSS, HTML tags |
| Navigation Elements | 10% | Breadcrumbs, pagination, skip links |
| Document Structure | 20% | Sentence/paragraph length, punctuation |
| Metadata Quality | 10% | Title, author, subject presence |
Configuration¶
Quality processing is enabled by default:
using Kreuzberg;
var config = new ExtractionConfig
{
EnableQualityProcessing = true
};
var result = await KreuzbergClient.ExtractFileAsync(
"document.pdf",
config
);
var qualityScore = result.Metadata.ContainsKey("quality_score")
? (double)result.Metadata["quality_score"]
: 0.0;
Console.WriteLine($"Quality score: {qualityScore:F2}");
import asyncio
from kreuzberg import ExtractionConfig, extract_file
async def main() -> None:
config: ExtractionConfig = ExtractionConfig(
enable_quality_processing=True
)
result = await extract_file("document.pdf", config=config)
quality_score: float = result.metadata.get("quality_score", 0.0)
print(f"Quality score: {quality_score:.2f}")
asyncio.run(main())
Quality Score¶
The quality score ranges from 0.0 (lowest quality) to 1.0 (highest quality):
- 0.0-0.3: Very low quality (heavy OCR artifacts, script content)
- 0.3-0.6: Low quality (some artifacts, poor structure)
- 0.6-0.8: Moderate quality (clean text, decent structure)
- 0.8-1.0: High quality (excellent structure, no artifacts)
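These bands can be turned into a simple label for logging or routing decisions; the thresholds below come straight from the list above, and the helper itself is not part of the library:

```python
def quality_label(score: float) -> str:
    """Map a quality_score value to the bands described above."""
    if score < 0.3:
        return "very low"
    if score < 0.6:
        return "low"
    if score < 0.8:
        return "moderate"
    return "high"


# Usage: quality_label(result.metadata.get("quality_score", 0.0))
```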
Example¶
using Kreuzberg;
var config = new ExtractionConfig
{
EnableQualityProcessing = true
};
var result = KreuzbergClient.ExtractFile(
"scanned_document.pdf",
config
);
var qualityScore = result.Metadata.ContainsKey("quality_score")
? (double)result.Metadata["quality_score"]
: 0.0;
if (qualityScore < 0.5)
{
Console.WriteLine(
$"Warning: Low quality extraction ({qualityScore:F2})"
);
Console.WriteLine(
"Consider re-scanning with higher DPI or adjusting OCR settings"
);
}
else
{
Console.WriteLine($"Quality score: {qualityScore:F2}");
}
package main
import (
"fmt"
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
enableQualityProcessing := true
config := &kreuzberg.ExtractionConfig{
EnableQualityProcessing: &enableQualityProcessing,
}
result, err := kreuzberg.ExtractFileSync("scanned_document.pdf", config)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
qualityScore := 0.0
if val, ok := result.Metadata["quality_score"]; ok {
qualityScore = val.(float64)
}
if qualityScore < 0.5 {
fmt.Printf("Warning: Low quality extraction (%.2f)\n", qualityScore)
fmt.Println("Consider re-scanning with higher DPI or adjusting OCR settings")
} else {
fmt.Printf("Quality score: %.2f\n", qualityScore)
}
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import java.util.Map;
ExtractionConfig config = ExtractionConfig.builder()
.enableQualityProcessing(true)
.build();
ExtractionResult result = Kreuzberg.extractFile("scanned_document.pdf", config);
Map<String, Object> metadata = result.getMetadata() != null ? result.getMetadata() : Map.of();
double qualityScore = metadata.containsKey("quality_score")
? ((Number) metadata.get("quality_score")).doubleValue()
: 0.0;
if (qualityScore < 0.5) {
System.out.println(String.format("Warning: Low quality extraction (%.2f)", qualityScore));
System.out.println("Consider re-scanning with higher DPI or adjusting OCR settings");
} else {
System.out.println(String.format("Quality score: %.2f", qualityScore));
}
from kreuzberg import extract_file_sync, ExtractionConfig
config = ExtractionConfig(enable_quality_processing=True)
result = extract_file_sync("scanned_document.pdf", config=config)
quality_score = result.metadata.get("quality_score", 0.0)
if quality_score < 0.5:
print(f"Warning: Low quality extraction ({quality_score:.2f})")
print("Consider re-scanning with higher DPI or adjusting OCR settings")
else:
print(f"Quality score: {quality_score:.2f}")
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
enable_quality_processing: true
)
result = Kreuzberg.extract_file_sync('scanned_document.pdf', config: config)
quality_score = result.metadata&.dig('quality_score') || 0.0
if quality_score < 0.5
puts "Warning: Low quality extraction (#{quality_score.round(2)})"
puts "Consider re-scanning with higher DPI or adjusting OCR settings"
else
puts "Quality score: #{quality_score.round(2)}"
end
use kreuzberg::{extract_file, ExtractionConfig};
let config = ExtractionConfig {
enable_quality_processing: true,
..Default::default()
};
let result = extract_file("scanned_document.pdf", None, &config).await?;
if let Some(quality) = result.metadata.additional.get("quality_score") {
let score: f64 = quality.as_f64().unwrap_or(0.0);
if score < 0.5 {
println!("Warning: Low quality extraction ({:.2})", score);
} else {
println!("Quality score: {:.2}", score);
}
}
import { extractFile } from '@kreuzberg/node';
const config = {
enableQualityProcessing: true,
};
const result = await extractFile('scanned_document.pdf', null, config);
console.log(`Content length: ${result.content.length} characters`);
console.log(`Metadata: ${JSON.stringify(result.metadata)}`);
Combining Features¶
Advanced features work together:
using System;
using System.Threading.Tasks;
using Kreuzberg;
async Task RunRagPipeline()
{
var config = new ExtractionConfig
{
EnableQualityProcessing = true,
LanguageDetection = new LanguageDetectionConfig
{
Enabled = true,
DetectMultiple = true,
MinConfidence = 0.8m,
},
TokenReduction = new TokenReductionConfig
{
Mode = "moderate",
PreserveImportantWords = true,
},
Chunking = new ChunkingConfig
{
MaxChars = 512,
MaxOverlap = 50,
Embedding = new EmbeddingConfig
{
Model = EmbeddingModelType.Preset("balanced"),
Normalize = true,
},
Enabled = true,
},
Keywords = new KeywordConfig
{
Algorithm = KeywordAlgorithm.Yake,
MaxKeywords = 10,
},
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content length: {result.Content.Length} characters");
if (result.DetectedLanguages?.Count > 0)
{
Console.WriteLine($"Languages: {string.Join(", ", result.DetectedLanguages)}");
}
if (result.Chunks?.Count > 0)
{
Console.WriteLine($"Total chunks: {result.Chunks.Count}");
var firstChunk = result.Chunks[0];
Console.WriteLine($"First chunk tokens: {firstChunk.Metadata.TokenCount}");
if (firstChunk.Embedding?.Length > 0)
{
Console.WriteLine($"Embedding dimensions: {firstChunk.Embedding.Length}");
}
}
if (result.Metadata?.Additional?.ContainsKey("quality_score") == true)
{
Console.WriteLine($"Quality score: {result.Metadata.Additional["quality_score"]}");
}
if (result.Metadata?.Additional?.ContainsKey("keywords") == true)
{
Console.WriteLine($"Keywords: {result.Metadata.Additional["keywords"]}");
}
}
await RunRagPipeline();
package main
import (
"fmt"
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
maxChars := 512
maxOverlap := 50
minConfidence := 0.8
config := &kreuzberg.ExtractionConfig{
EnableQualityProcessing: true,
LanguageDetection: &kreuzberg.LanguageDetectionConfig{
Enabled: true,
MinConfidence: &minConfidence,
DetectMultiple: true,
},
TokenReduction: &kreuzberg.TokenReductionConfig{
Mode: "moderate",
PreserveMarkdown: true,
},
Chunking: &kreuzberg.ChunkingConfig{
MaxChars: &maxChars,
MaxOverlap: &maxOverlap,
Embedding: &kreuzberg.EmbeddingConfig{
Model: "balanced",
Normalize: true,
},
},
Keywords: &kreuzberg.KeywordConfig{
Algorithm: "YAKE",
MaxKeywords: 10,
},
}
result, err := kreuzberg.ExtractFileSync("document.pdf", config)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
fmt.Printf("Quality: %v\n", result.Metadata.Additional["quality_score"])
fmt.Printf("Languages: %v\n", result.DetectedLanguages)
fmt.Printf("Keywords: %v\n", result.Metadata.Additional["keywords"])
if result.Chunks != nil && len(result.Chunks) > 0 && result.Chunks[0].Embedding != nil {
fmt.Printf("Chunks: %d with %d dimensions\n", len(result.Chunks), len(result.Chunks[0].Embedding))
}
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;
import dev.kreuzberg.config.LanguageDetectionConfig;
import dev.kreuzberg.config.TokenReductionConfig;
ExtractionConfig config = ExtractionConfig.builder()
.enableQualityProcessing(true)
.languageDetection(LanguageDetectionConfig.builder()
.enabled(true)
.minConfidence(0.8)
.build())
.tokenReduction(TokenReductionConfig.builder()
.mode("moderate")
.preserveImportantWords(true)
.build())
.chunking(ChunkingConfig.builder()
.maxChars(512)
.maxOverlap(50)
.embedding("balanced")
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
Object qualityScore = result.getMetadata().get("quality_score");
System.out.printf("Quality: %.2f%n", ((Number)qualityScore).doubleValue());
System.out.println("Languages: " + result.getDetectedLanguages());
System.out.println("Content length: " + result.getContent().length() + " characters");
import asyncio
from kreuzberg import (
extract_file,
ExtractionConfig,
ChunkingConfig,
EmbeddingConfig,
EmbeddingModelType,
LanguageDetectionConfig,
TokenReductionConfig,
)
async def main() -> None:
config: ExtractionConfig = ExtractionConfig(
enable_quality_processing=True,
language_detection=LanguageDetectionConfig(enabled=True),
token_reduction=TokenReductionConfig(mode="moderate"),
chunking=ChunkingConfig(
max_chars=512,
max_overlap=50,
embedding=EmbeddingConfig(
model=EmbeddingModelType.preset("balanced"), normalize=True
),
),
)
result = await extract_file("document.pdf", config=config)
quality = result.metadata.get("quality_score", 0)
print(f"Quality: {quality:.2f}")
print(f"Languages: {result.detected_languages}")
if result.chunks:
print(f"Chunks: {len(result.chunks)}")
asyncio.run(main())
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
enable_quality_processing: true,
language_detection: Kreuzberg::Config::LanguageDetection.new(
enabled: true,
detect_multiple: true
),
token_reduction: Kreuzberg::Config::TokenReduction.new(mode: 'moderate'),
chunking: Kreuzberg::Config::Chunking.new(
max_chars: 512,
max_overlap: 50,
embedding: { normalize: true }
),
keywords: Kreuzberg::Config::Keywords.new(
algorithm: 'yake',
max_keywords: 10
)
)
result = Kreuzberg.extract_file_sync('document.pdf', config: config)
puts "Languages: #{result.detected_languages.inspect}"
puts "Chunks: #{result.chunks&.length || 0}"
use kreuzberg::{
extract_file, ExtractionConfig, ChunkingConfig, EmbeddingConfig,
LanguageDetectionConfig, TokenReductionConfig,
KeywordConfig, KeywordAlgorithm
};
let config = ExtractionConfig {
enable_quality_processing: true,
language_detection: Some(LanguageDetectionConfig {
enabled: true,
detect_multiple: true,
..Default::default()
}),
token_reduction: Some(TokenReductionConfig {
mode: "moderate".to_string(),
preserve_markdown: true,
..Default::default()
}),
chunking: Some(ChunkingConfig {
max_chars: 512,
max_overlap: 50,
embedding: Some(EmbeddingConfig {
model: kreuzberg::EmbeddingModelType::Preset { name: "balanced".to_string() },
normalize: true,
..Default::default()
}),
..Default::default()
}),
keywords: Some(KeywordConfig {
algorithm: KeywordAlgorithm::Yake,
max_keywords: 10,
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
if let Some(quality) = result.metadata.additional.get("quality_score") {
println!("Quality: {:?}", quality);
}
println!("Languages: {:?}", result.detected_languages);
println!("Keywords: {:?}", result.metadata.additional.get("keywords"));
if let Some(chunks) = result.chunks {
if let Some(first_chunk) = chunks.first() {
if let Some(embedding) = &first_chunk.embedding {
println!("Chunks: {} with {} dimensions", chunks.len(), embedding.len());
}
}
}
import { extractFile } from '@kreuzberg/node';
const config = {
enableQualityProcessing: true,
languageDetection: {
enabled: true,
detectMultiple: true,
},
tokenReduction: {
mode: 'moderate',
preserveImportantWords: true,
},
chunking: {
maxChars: 512,
maxOverlap: 50,
embedding: {
preset: 'balanced',
},
},
keywords: {
algorithm: 'yake',
maxKeywords: 10,
},
};
const result = await extractFile('document.pdf', null, config);
console.log(`Content length: ${result.content.length}`);
if (result.detectedLanguages) {
console.log(`Languages: ${result.detectedLanguages.join(', ')}`);
}
if (result.chunks && result.chunks.length > 0) {
console.log(`Chunks: ${result.chunks.length}`);
}
Page Tracking Patterns¶
Advanced patterns for using page tracking in real-world applications.
Chunk-to-Page Mapping¶
When both chunking and page tracking are enabled, chunks automatically include page metadata:
from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig, PageConfig
config = ExtractionConfig(
chunking=ChunkingConfig(max_chars=500, max_overlap=50),
pages=PageConfig(extract_pages=True)
)
result = extract_file_sync("document.pdf", config=config)
if result.chunks:
for chunk in result.chunks:
if chunk.metadata.first_page:
page_range = (
f"Page {chunk.metadata.first_page}"
if chunk.metadata.first_page == chunk.metadata.last_page
else f"Pages {chunk.metadata.first_page}-{chunk.metadata.last_page}"
)
print(f"Chunk: {chunk.content[:50]}... ({page_range})")
Page-Filtered Search¶
Filter chunks by page range for focused retrieval:
def search_in_pages(chunks: list[Chunk], query: str, page_start: int, page_end: int) -> list[Chunk]:
"""Search only within specified page range."""
page_chunks = [
c for c in chunks
if c.metadata.first_page and c.metadata.last_page
and c.metadata.first_page >= page_start
and c.metadata.last_page <= page_end
]
# search_chunks is a user-supplied ranking function (e.g. BM25 or vector search).
return search_chunks(page_chunks, query)
Page-Aware Embeddings¶
Include page context in embeddings for better retrieval:
# embed() and store_with_metadata() stand in for your embedding model and vector store.
for chunk in result.chunks:
    if chunk.metadata.first_page:
        context = f"Page {chunk.metadata.first_page}: {chunk.content}"
        embedding = embed(context)
        store_with_metadata(embedding, {
            "page": chunk.metadata.first_page,
            "text": chunk.content,
        })
Per-Page Processing¶
Process each page independently:
from kreuzberg import extract_file_sync, ExtractionConfig, PageConfig
config = ExtractionConfig(
pages=PageConfig(extract_pages=True)
)
result = extract_file_sync("document.pdf", config=config)
if result.pages:
for page in result.pages:
print(f"Page {page.page_number}:")
print(f" Content: {len(page.content)} chars")
print(f" Tables: {len(page.tables)}")
print(f" Images: {len(page.images)}")
Format-Specific Strategies¶
PDF Documents: Use byte boundaries for precise page lookups. Ideal for legal documents, research papers.
Presentations (PPTX): Process slides independently. Use PageUnitType::Slide to distinguish from regular pages.
Word Documents (DOCX): Page breaks may be approximate. Verify PageStructure.boundaries exists before using.
Multi-Format: Check PageStructure availability:
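(A minimal sketch of that check follows; the page_structure and boundaries attribute names are assumptions for illustration, so verify them against the API reference for your binding.)

```python
from kreuzberg import ExtractionConfig, PageConfig, extract_file_sync

config = ExtractionConfig(pages=PageConfig(extract_pages=True))
result = extract_file_sync("document.docx", config=config)

# Not every format reports precise page boundaries (DOCX breaks may be approximate),
# so guard before relying on page-level offsets. Attribute names here are assumed.
structure = getattr(result, "page_structure", None)
if structure is not None and getattr(structure, "boundaries", None):
    print(f"Page boundaries available for {len(structure.boundaries)} pages")
else:
    # Fall back to per-page content when boundary offsets are unavailable.
    for page in result.pages or []:
        print(f"Page {page.page_number}: {len(page.content)} chars")
```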