C# Bindings v4.0.0¶
.NET bindings for Kreuzberg document extraction via P/Invoke to the native Rust library. All extraction, configuration, and plugin APIs are exposed through the static KreuzbergClient class.
For cross-language extraction concepts, see Extraction Basics. For configuration options, see Configuration Guide.
Installation¶
Requires .NET 10.0+ on Windows, macOS, or Linux.
For OCR support, install Tesseract separately:
Quick Start¶
Extract a File¶
using Kreuzberg;
var result = KreuzbergClient.ExtractFileSync("document.pdf", new ExtractionConfig());
Console.WriteLine(result.Content);
Console.WriteLine($"Tables: {result.Tables.Count}");
Console.WriteLine($"Metadata: {result.Metadata.FormatType}");
Async Extraction¶
using Kreuzberg;
var result = await KreuzbergClient.ExtractFileAsync("document.pdf");
Console.WriteLine(result.Content);
Console.WriteLine(result.MimeType);
Extract from Bytes¶
When you already have file contents in memory, pass the byte array with its MIME type:
using Kreuzberg;
var data = await File.ReadAllBytesAsync("document.pdf");
var result = KreuzbergClient.ExtractBytesSync(data, "application/pdf");
Console.WriteLine(result.Content);
Console.WriteLine(result.MimeType);
Batch Processing¶
Process multiple files in a single call with automatic parallelization:
using Kreuzberg;
var files = new[] { "doc1.pdf", "doc2.docx", "doc3.pptx" };
var results = KreuzbergClient.BatchExtractFilesSync(files, new ExtractionConfig());
foreach (var result in results)
{
Console.WriteLine($"Content length: {result.Content.Length}");
}
Configuration¶
Pass an ExtractionConfig to any extraction method. All fields are optional — defaults work for most cases.
Basic Configuration¶
using Kreuzberg;
var config = new ExtractionConfig
{
UseCache = true,
EnableQualityProcessing = true
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine(result.Content);
OCR Configuration¶
using Kreuzberg;
var config = new ExtractionConfig
{
Ocr = new OcrConfig
{
Backend = "tesseract",
Language = "eng+fra",
TesseractConfig = new TesseractConfig { Psm = 3 }
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine(result.Content);
For fine-grained Tesseract tuning (PSM modes, confidence thresholds, table detection):
using Kreuzberg;
var config = new ExtractionConfig
{
Ocr = new OcrConfig
{
Language = "eng+fra+deu",
TesseractConfig = new TesseractConfig
{
Psm = 6,
Oem = 1,
MinConfidence = 0.8m,
EnableTableDetection = true
}
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
See OCR Guide for backend options and language setup.
Load from File¶
Kreuzberg discovers configuration files automatically by walking up the directory tree, or you can load from an explicit path:
using Kreuzberg;
var config = new ExtractionConfig();
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine(result.Content[..Math.Min(100, result.Content.Length)]);
Console.WriteLine($"Total length: {result.Content.Length}");
See Configuration Guide for file formats (TOML, YAML, JSON) and discovery order.
Full Configuration Example¶
using Kreuzberg;
var config = new ExtractionConfig
{
UseCache = true,
EnableQualityProcessing = true,
Ocr = new OcrConfig
{
Backend = "tesseract",
Language = "eng+fra",
TesseractConfig = new TesseractConfig { Psm = 3 }
},
PdfOptions = new PdfConfig { ExtractImages = true },
Chunking = new ChunkingConfig
{
MaxChars = 1000,
MaxOverlap = 200,
Embedding = new EmbeddingConfig
{
Model = EmbeddingModelType.Preset("all-MiniLM-L6-v2")
}
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
Working with Results¶
Tables¶
using Kreuzberg;
var result = KreuzbergClient.ExtractFileSync("document.pdf", new ExtractionConfig());
foreach (var table in result.Tables)
{
Console.WriteLine($"Table with {table.Cells.Count} rows");
Console.WriteLine(table.Markdown);
foreach (var row in table.Cells)
{
Console.WriteLine(string.Join(" | ", row));
}
}
Metadata¶
using Kreuzberg;
var config = new ExtractionConfig
{
PdfOptions = new PdfConfig { ExtractMetadata = true }
};
var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
if (result.Metadata?.Format.Pdf != null)
{
var pdfMeta = result.Metadata.Format.Pdf;
Console.WriteLine($"Pages: {pdfMeta.PageCount}");
Console.WriteLine($"Author: {pdfMeta.Author}");
Console.WriteLine($"Title: {pdfMeta.Title}");
}
var htmlResult = KreuzbergClient.ExtractFileSync("page.html", config);
if (htmlResult.Metadata?.Format.Html != null)
{
var htmlMeta = htmlResult.Metadata.Format.Html;
Console.WriteLine($"Title: {htmlMeta.Title}");
Console.WriteLine($"Description: {htmlMeta.Description}");
// Access keywords as array
if (htmlMeta.Keywords != null && htmlMeta.Keywords.Count > 0)
{
Console.WriteLine($"Keywords: {string.Join(", ", htmlMeta.Keywords)}");
}
// Access canonical URL (renamed from canonical)
if (htmlMeta.CanonicalUrl != null)
{
Console.WriteLine($"Canonical URL: {htmlMeta.CanonicalUrl}");
}
// Access Open Graph fields from dictionary
if (htmlMeta.OpenGraph != null && htmlMeta.OpenGraph.Count > 0)
{
if (htmlMeta.OpenGraph.ContainsKey("image"))
Console.WriteLine($"Open Graph Image: {htmlMeta.OpenGraph["image"]}");
if (htmlMeta.OpenGraph.ContainsKey("title"))
Console.WriteLine($"Open Graph Title: {htmlMeta.OpenGraph["title"]}");
if (htmlMeta.OpenGraph.ContainsKey("type"))
Console.WriteLine($"Open Graph Type: {htmlMeta.OpenGraph["type"]}");
}
// Access Twitter Card fields from dictionary
if (htmlMeta.TwitterCard != null && htmlMeta.TwitterCard.Count > 0)
{
if (htmlMeta.TwitterCard.ContainsKey("card"))
Console.WriteLine($"Twitter Card Type: {htmlMeta.TwitterCard["card"]}");
if (htmlMeta.TwitterCard.ContainsKey("creator"))
Console.WriteLine($"Twitter Creator: {htmlMeta.TwitterCard["creator"]}");
}
// Access new fields
if (htmlMeta.Language != null)
Console.WriteLine($"Language: {htmlMeta.Language}");
if (htmlMeta.TextDirection != null)
Console.WriteLine($"Text Direction: {htmlMeta.TextDirection}");
// Access headers
if (htmlMeta.Headers != null && htmlMeta.Headers.Count > 0)
Console.WriteLine($"Headers: {string.Join(", ", htmlMeta.Headers.Select(h => h.Text))}");
// Access links
if (htmlMeta.Links != null && htmlMeta.Links.Count > 0)
{
foreach (var link in htmlMeta.Links)
Console.WriteLine($"Link: {link.Href} ({link.Text})");
}
// Access images
if (htmlMeta.Images != null && htmlMeta.Images.Count > 0)
Console.WriteLine($"Images: {string.Join(", ", htmlMeta.Images.Select(i => i.Src))}");
// Access structured data
if (htmlMeta.StructuredData != null && htmlMeta.StructuredData.Count > 0)
Console.WriteLine($"Structured Data items: {htmlMeta.StructuredData.Count}");
}
Image Extraction¶
using Kreuzberg;
var config = new ExtractionConfig
{
Images = new ImageExtractionConfig
{
ExtractImages = true,
TargetDpi = 200,
MaxImageDimension = 2048,
InjectPlaceholders = true, // set to false to extract images without markdown references
AutoAdjustDpi = true
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Extracted: {result.Content[..Math.Min(100, result.Content.Length)]}");
Language Detection¶
using Kreuzberg;
var config = new ExtractionConfig
{
LanguageDetection = new LanguageDetectionConfig
{
Enabled = true,
MinConfidence = 0.9m,
DetectMultiple = true
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Languages: {string.Join(", ", result.DetectedLanguages ?? new List<string>())}");
Token Reduction¶
Reduce token count for LLM pipelines while preserving meaning:
using Kreuzberg;
var config = new ExtractionConfig
{
TokenReduction = new TokenReductionConfig
{
Mode = "moderate",
PreserveImportantWords = true
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content length: {result.Content.Length}");
Error Handling¶
using Kreuzberg;
try
{
var result = KreuzbergClient.ExtractFileSync("missing.pdf");
Console.WriteLine(result.Content);
}
catch (KreuzbergValidationException ex)
{
Console.Error.WriteLine($"Validation error: {ex.Message}");
}
catch (KreuzbergIOException ex)
{
Console.Error.WriteLine($"IO error: {ex.Message}");
throw;
}
catch (KreuzbergException ex)
{
Console.Error.WriteLine($"Extraction failed: {ex.Message}");
throw;
}
Exception hierarchy:
KreuzbergException— base exception for all Kreuzberg errorsKreuzbergValidationException— invalid configuration or inputKreuzbergParsingException— document parsing failureKreuzbergOcrException— OCR processing failureKreuzbergIOException— file I/O failureKreuzbergMissingDependencyException— missing optional dependency (for example, Tesseract)KreuzbergSerializationException— JSON serialization failure
Thread Safety¶
KreuzbergClient static methods are thread-safe. Configuration objects are safe for concurrent reads. Plugin registrations use thread-safe collections. Individual ExtractionResult instances should not be shared across threads for mutation.
Troubleshooting¶
DLL not found
Ensure the native library is in your runtime directory at runtimes/{rid}/native/.
P/Invoke errors
Verify: (1) native library is installed, (2) architecture matches (x64/arm64), (3) Tesseract is available if using OCR.
Next Steps¶
- Extraction Basics — sync vs async, batch processing, supported formats
- Configuration Guide — all configuration options and file formats
- OCR Guide — OCR backend setup and language packs
- Plugins Guide — custom post-processors, validators, and OCR backends
- Advanced Features — chunking, embeddings, and page tracking