Skip to content

C# Bindings v4.0.0

.NET bindings for Kreuzberg document extraction via P/Invoke to the native Rust library. All extraction, configuration, and plugin APIs are exposed through the static KreuzbergClient class.

For cross-language extraction concepts, see Extraction Basics. For configuration options, see Configuration Guide.

Installation

Terminal
dotnet add package Kreuzberg

Requires .NET 10.0+ on Windows, macOS, or Linux.

For OCR support, install Tesseract separately:

Terminal
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

Quick Start

Extract a File

C#
using Kreuzberg;

var result = KreuzbergClient.ExtractFileSync("document.pdf", new ExtractionConfig());

Console.WriteLine(result.Content);
Console.WriteLine($"Tables: {result.Tables.Count}");
Console.WriteLine($"Metadata: {result.Metadata.FormatType}");

Async Extraction

C#
using Kreuzberg;

var result = await KreuzbergClient.ExtractFileAsync("document.pdf");

Console.WriteLine(result.Content);
Console.WriteLine(result.MimeType);

Extract from Bytes

When you already have file contents in memory, pass the byte array with its MIME type:

C#
using Kreuzberg;

var data = await File.ReadAllBytesAsync("document.pdf");
var result = KreuzbergClient.ExtractBytesSync(data, "application/pdf");

Console.WriteLine(result.Content);
Console.WriteLine(result.MimeType);

Batch Processing

Process multiple files in a single call with automatic parallelization:

C#
using Kreuzberg;

var files = new[] { "doc1.pdf", "doc2.docx", "doc3.pptx" };
var results = KreuzbergClient.BatchExtractFilesSync(files, new ExtractionConfig());

foreach (var result in results)
{
    Console.WriteLine($"Content length: {result.Content.Length}");
}

Configuration

Pass an ExtractionConfig to any extraction method. All fields are optional — defaults work for most cases.

Basic Configuration

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    UseCache = true,
    EnableQualityProcessing = true
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine(result.Content);

OCR Configuration

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Backend = "tesseract",
        Language = "eng+fra",
        TesseractConfig = new TesseractConfig { Psm = 3 }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine(result.Content);

For fine-grained Tesseract tuning (PSM modes, confidence thresholds, table detection):

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Language = "eng+fra+deu",
        TesseractConfig = new TesseractConfig
        {
            Psm = 6,
            Oem = 1,
            MinConfidence = 0.8m,
            EnableTableDetection = true
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");

See OCR Guide for backend options and language setup.

Load from File

Kreuzberg discovers configuration files automatically by walking up the directory tree, or you can load from an explicit path:

C#
using Kreuzberg;

var config = new ExtractionConfig();
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);

Console.WriteLine(result.Content[..Math.Min(100, result.Content.Length)]);
Console.WriteLine($"Total length: {result.Content.Length}");

See Configuration Guide for file formats (TOML, YAML, JSON) and discovery order.

Full Configuration Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    UseCache = true,
    EnableQualityProcessing = true,
    Ocr = new OcrConfig
    {
        Backend = "tesseract",
        Language = "eng+fra",
        TesseractConfig = new TesseractConfig { Psm = 3 }
    },
    PdfOptions = new PdfConfig { ExtractImages = true },
    Chunking = new ChunkingConfig
    {
        MaxChars = 1000,
        MaxOverlap = 200,
        Embedding = new EmbeddingConfig
        {
            Model = EmbeddingModelType.Preset("all-MiniLM-L6-v2")
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");

Working with Results

Tables

C#
using Kreuzberg;

var result = KreuzbergClient.ExtractFileSync("document.pdf", new ExtractionConfig());

foreach (var table in result.Tables)
{
    Console.WriteLine($"Table with {table.Cells.Count} rows");
    Console.WriteLine(table.Markdown);

    foreach (var row in table.Cells)
    {
        Console.WriteLine(string.Join(" | ", row));
    }
}

Metadata

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig { ExtractMetadata = true }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

if (result.Metadata?.Format.Pdf != null)
{
    var pdfMeta = result.Metadata.Format.Pdf;
    Console.WriteLine($"Pages: {pdfMeta.PageCount}");
    Console.WriteLine($"Author: {pdfMeta.Author}");
    Console.WriteLine($"Title: {pdfMeta.Title}");
}

var htmlResult = KreuzbergClient.ExtractFileSync("page.html", config);
if (htmlResult.Metadata?.Format.Html != null)
{
    var htmlMeta = htmlResult.Metadata.Format.Html;
    Console.WriteLine($"Title: {htmlMeta.Title}");
    Console.WriteLine($"Description: {htmlMeta.Description}");

    // Access keywords as array
    if (htmlMeta.Keywords != null && htmlMeta.Keywords.Count > 0)
    {
        Console.WriteLine($"Keywords: {string.Join(", ", htmlMeta.Keywords)}");
    }

    // Access canonical URL (renamed from canonical)
    if (htmlMeta.CanonicalUrl != null)
    {
        Console.WriteLine($"Canonical URL: {htmlMeta.CanonicalUrl}");
    }

    // Access Open Graph fields from dictionary
    if (htmlMeta.OpenGraph != null && htmlMeta.OpenGraph.Count > 0)
    {
        if (htmlMeta.OpenGraph.ContainsKey("image"))
            Console.WriteLine($"Open Graph Image: {htmlMeta.OpenGraph["image"]}");
        if (htmlMeta.OpenGraph.ContainsKey("title"))
            Console.WriteLine($"Open Graph Title: {htmlMeta.OpenGraph["title"]}");
        if (htmlMeta.OpenGraph.ContainsKey("type"))
            Console.WriteLine($"Open Graph Type: {htmlMeta.OpenGraph["type"]}");
    }

    // Access Twitter Card fields from dictionary
    if (htmlMeta.TwitterCard != null && htmlMeta.TwitterCard.Count > 0)
    {
        if (htmlMeta.TwitterCard.ContainsKey("card"))
            Console.WriteLine($"Twitter Card Type: {htmlMeta.TwitterCard["card"]}");
        if (htmlMeta.TwitterCard.ContainsKey("creator"))
            Console.WriteLine($"Twitter Creator: {htmlMeta.TwitterCard["creator"]}");
    }

    // Access new fields
    if (htmlMeta.Language != null)
        Console.WriteLine($"Language: {htmlMeta.Language}");

    if (htmlMeta.TextDirection != null)
        Console.WriteLine($"Text Direction: {htmlMeta.TextDirection}");

    // Access headers
    if (htmlMeta.Headers != null && htmlMeta.Headers.Count > 0)
        Console.WriteLine($"Headers: {string.Join(", ", htmlMeta.Headers.Select(h => h.Text))}");

    // Access links
    if (htmlMeta.Links != null && htmlMeta.Links.Count > 0)
    {
        foreach (var link in htmlMeta.Links)
            Console.WriteLine($"Link: {link.Href} ({link.Text})");
    }

    // Access images
    if (htmlMeta.Images != null && htmlMeta.Images.Count > 0)
        Console.WriteLine($"Images: {string.Join(", ", htmlMeta.Images.Select(i => i.Src))}");

    // Access structured data
    if (htmlMeta.StructuredData != null && htmlMeta.StructuredData.Count > 0)
        Console.WriteLine($"Structured Data items: {htmlMeta.StructuredData.Count}");
}

Image Extraction

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Images = new ImageExtractionConfig
    {
        ExtractImages = true,
        TargetDpi = 200,
        MaxImageDimension = 2048,
        InjectPlaceholders = true, // set to false to extract images without markdown references
        AutoAdjustDpi = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Extracted: {result.Content[..Math.Min(100, result.Content.Length)]}");

Language Detection

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    LanguageDetection = new LanguageDetectionConfig
    {
        Enabled = true,
        MinConfidence = 0.9m,
        DetectMultiple = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Languages: {string.Join(", ", result.DetectedLanguages ?? new List<string>())}");

Token Reduction

Reduce token count for LLM pipelines while preserving meaning:

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    TokenReduction = new TokenReductionConfig
    {
        Mode = "moderate",
        PreserveImportantWords = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content length: {result.Content.Length}");

Error Handling

C#
using Kreuzberg;

try
{
    var result = KreuzbergClient.ExtractFileSync("missing.pdf");
    Console.WriteLine(result.Content);
}
catch (KreuzbergValidationException ex)
{
    Console.Error.WriteLine($"Validation error: {ex.Message}");
}
catch (KreuzbergIOException ex)
{
    Console.Error.WriteLine($"IO error: {ex.Message}");
    throw;
}
catch (KreuzbergException ex)
{
    Console.Error.WriteLine($"Extraction failed: {ex.Message}");
    throw;
}

Exception hierarchy:

  • KreuzbergException — base exception for all Kreuzberg errors
  • KreuzbergValidationException — invalid configuration or input
  • KreuzbergParsingException — document parsing failure
  • KreuzbergOcrException — OCR processing failure
  • KreuzbergIOException — file I/O failure
  • KreuzbergMissingDependencyException — missing optional dependency (for example, Tesseract)
  • KreuzbergSerializationException — JSON serialization failure

Thread Safety

KreuzbergClient static methods are thread-safe. Configuration objects are safe for concurrent reads. Plugin registrations use thread-safe collections. Individual ExtractionResult instances should not be shared across threads for mutation.

Troubleshooting

DLL not found

Ensure the native library is in your runtime directory at runtimes/{rid}/native/.

Terminal
echo $LD_LIBRARY_PATH     # Linux
echo $DYLD_LIBRARY_PATH   # macOS
P/Invoke errors

Verify: (1) native library is installed, (2) architecture matches (x64/arm64), (3) Tesseract is available if using OCR.

OCR not working
Terminal
tesseract --version

Ensure Tesseract is installed and in PATH.

Next Steps