Skip to content

Go API Reference

Complete reference for the Kreuzberg Go bindings using cgo to access the Rust-powered extraction pipeline.

The Go binding exposes the same extraction capabilities as the other languages through C FFI bindings to kreuzberg-ffi. You get identical metadata extraction, OCR processing, chunking, embeddings, and plugin support—with synchronous and context-aware async APIs.

Requirements

  • Go 1.25+ (with cgo support)
  • C compiler (gcc/clang for cgo compilation)
  • libkreuzberg_ffi.a static library (at build time only)
  • Tesseract/EasyOCR/PaddleOCR (optional, for OCR functionality)

Installation

Kreuzberg Go binaries are statically linked — once built, they are self-contained and require no runtime library dependencies. Only the static library is needed at build time.

Add the package to your go.mod:

Terminal
go get github.com/kreuzberg-dev/kreuzberg/packages/go/v4@latest

Monorepo Development

For development in the Kreuzberg monorepo:

Terminal
# Build the FFI crate (produces static library)
cargo build -p kreuzberg-ffi --release

# Go will automatically link against target/release/libkreuzberg_ffi.a
cd packages/go/v4
go build -v

# Run your binary - no library paths needed, it's statically linked!
./v4

External Projects

When building outside the monorepo, provide the static library via CGO_LDFLAGS:

Terminal
# Option 1: Download pre-built from GitHub Releases
curl -LO https://github.com/kreuzberg-dev/kreuzberg/releases/download/v4.0.0/go-ffi-linux-x86_64.tar.gz
tar -xzf go-ffi-linux-x86_64.tar.gz
mkdir -p ~/kreuzberg/lib
cp kreuzberg-ffi/lib/libkreuzberg_ffi.a ~/kreuzberg/lib/

# Option 2: Build static library yourself
git clone https://github.com/kreuzberg-dev/kreuzberg.git
cd kreuzberg && cargo build -p kreuzberg-ffi --release
cp target/release/libkreuzberg_ffi.a ~/kreuzberg/lib/

# Build your Go project with static linking
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build

# Run - no library paths needed!
./myapp

Quickstart

Basic file extraction (synchronous)

main.go
package main

import (
    "fmt"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    fmt.Printf("Format: %s\n", result.MimeType)
    fmt.Printf("Content length: %d\n", len(result.Content))
    fmt.Printf("Success: %v\n", result.Success)
}

Async extraction with timeout

async_extraction.go
package main

import (
    "context"
    "errors"
    "log"
    "time"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
    defer cancel()

    result, err := kreuzberg.ExtractFile(ctx, "large-document.pdf", nil)
    if errors.Is(err, context.DeadlineExceeded) {
        log.Println("extraction timed out")
        return
    }
    if err != nil {
        log.Fatalf("extraction failed: %v", err)
    }

    log.Printf("Extracted %d characters\n", len(result.Content))
}

Core Functions

ExtractFileSync

Extract content and metadata from a file synchronously.

Signature:

Go
func ExtractFileSync(path string, config *ExtractionConfig) (*ExtractionResult, error)

Parameters:

  • path (string): Path to the file to extract (absolute or relative)
  • config (*ExtractionConfig): Optional extraction configuration; uses defaults if nil

Returns:

  • *ExtractionResult: Populated result containing content, metadata, tables, chunks, and images
  • error: KreuzbergError or standard Go error (see Error Handling section)

Error Handling:

  • ValidationError: If path is empty
  • IOError: If file not found or not readable
  • ParsingError: If document parsing fails
  • MissingDependencyError: If required OCR/processing library is missing
  • UnsupportedFormatError: If MIME type is not supported

Example - Extract PDF:

extract_pdf.go
result, err := kreuzberg.ExtractFileSync("report.pdf", nil)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

fmt.Printf("Title: %s\n", *result.Metadata.PdfMetadata().Title)
fmt.Printf("Page count: %d\n", *result.Metadata.PdfMetadata().PageCount)
fmt.Printf("Content preview: %s...\n", result.Content[:100])

Example - Extract with configuration:

extract_with_config.go
cfg := &kreuzberg.ExtractionConfig{
    UseCache: boolPtr(true),
    OCR: &kreuzberg.OCRConfig{
        Backend:  "tesseract",
        Language: stringPtr("eng"),
    },
}

result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

ExtractFile

Extract content from a file asynchronously with context support.

Signature:

Go
func ExtractFile(ctx context.Context, path string, config *ExtractionConfig) (*ExtractionResult, error)

Parameters:

  • ctx (context.Context): Context for cancellation and timeout
  • path (string): Path to the file
  • config (*ExtractionConfig): Optional configuration

Returns:

  • *ExtractionResult: Extraction result
  • error: May include context errors (context.DeadlineExceeded, context.Canceled)

Note: Context cancellation is best-effort. The underlying C call cannot be interrupted, but the function returns immediately with ctx.Err() when the context deadline is exceeded or cancelled.

Example - With deadline:

extract_with_deadline.go
ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(30*time.Second))
defer cancel()

result, err := kreuzberg.ExtractFile(ctx, "large.docx", nil)
if errors.Is(err, context.DeadlineExceeded) {
    log.Println("extraction took too long")
    return
}
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

ExtractBytesSync

Extract content from an in-memory byte slice with specified MIME type.

Signature:

Go
func ExtractBytesSync(data []byte, mimeType string, config *ExtractionConfig) (*ExtractionResult, error)

Parameters:

  • data ([]byte): Document bytes
  • mimeType (string): MIME type (e.g., "application/pdf", "text/plain")
  • config (*ExtractionConfig): Optional configuration

Returns:

  • *ExtractionResult: Extraction result
  • error: KreuzbergError on extraction failure

Example - Extract from downloaded PDF:

extract_from_http.go
httpResp, err := http.Get("https://example.com/document.pdf")
if err != nil {
    log.Fatal(err)
}
defer httpResp.Body.Close()

data, err := io.ReadAll(httpResp.Body)
if err != nil {
    log.Fatal(err)
}

result, err := kreuzberg.ExtractBytesSync(data, "application/pdf", nil)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

fmt.Printf("Extracted %d words\n", len(strings.Fields(result.Content)))

ExtractBytes

Extract content from in-memory bytes asynchronously.

Signature:

Go
func ExtractBytes(ctx context.Context, data []byte, mimeType string, config *ExtractionConfig) (*ExtractionResult, error)

Parameters:

  • ctx (context.Context): Context for cancellation and timeout
  • data ([]byte): Document bytes
  • mimeType (string): MIME type
  • config (*ExtractionConfig): Optional configuration

Returns:

  • *ExtractionResult: Extraction result
  • error: KreuzbergError or context error

Example:

extract_bytes_async.go
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

result, err := kreuzberg.ExtractBytes(ctx, data, "text/html", nil)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

BatchExtractFilesSync

Extract multiple files sequentially using the optimized batch pipeline.

Signature:

Go
func BatchExtractFilesSync(paths []string, config *ExtractionConfig) ([]*ExtractionResult, error)

Parameters:

  • paths ([]string): Slice of file paths
  • config (*ExtractionConfig): Configuration applied to all files

Returns:

  • []*ExtractionResult: Slice of results (one per input file; may contain nils for failed extractions)
  • error: Returned only if batch setup fails; individual file errors are captured in ErrorMetadata

Example - Batch extract multiple PDFs:

batch_extract_pdfs.go
files := []string{"doc1.pdf", "doc2.pdf", "doc3.pdf"}

results, err := kreuzberg.BatchExtractFilesSync(files, nil)
if err != nil {
    log.Fatalf("batch extraction setup failed: %v", err)
}

for i, result := range results {
    if result == nil {
        fmt.Printf("File %d: extraction failed\n", i)
        continue
    }

    if result.Metadata.Error != nil {
        fmt.Printf("File %d: %s (%s)\n", i, result.Metadata.Error.ErrorType, result.Metadata.Error.Message)
        continue
    }

    fmt.Printf("File %d: extracted %d chars\n", i, len(result.Content))
}

BatchExtractFiles

Batch extract multiple files asynchronously.

Signature:

Go
func BatchExtractFiles(ctx context.Context, paths []string, config *ExtractionConfig) ([]*ExtractionResult, error)

Parameters:

  • ctx (context.Context): Context for cancellation
  • paths ([]string): File paths
  • config (*ExtractionConfig): Configuration for all files

Returns:

  • []*ExtractionResult: Results slice
  • error: Context or setup errors

BatchExtractBytesSync

Extract multiple in-memory documents in a single batch operation.

Signature:

Go
func BatchExtractBytesSync(items []BytesWithMime, config *ExtractionConfig) ([]*ExtractionResult, error)

Parameters:

  • items ([]BytesWithMime): Slice of {Data, MimeType} pairs
  • config (*ExtractionConfig): Configuration applied to all items

Returns:

  • []*ExtractionResult: Results slice
  • error: Setup error or validation error

BytesWithMime structure:

Go
type BytesWithMime struct {
    Data     []byte
    MimeType string
}

Example - Batch extract multiple formats:

batch_extract_bytes.go
items := []kreuzberg.BytesWithMime{
    {Data: pdfData, MimeType: "application/pdf"},
    {Data: docxData, MimeType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
    {Data: htmlData, MimeType: "text/html"},
}

results, err := kreuzberg.BatchExtractBytesSync(items, nil)
if err != nil {
    log.Fatalf("batch extraction failed: %v", err)
}

for i, result := range results {
    if result == nil || !result.Success {
        log.Printf("Item %d extraction failed\n", i)
        continue
    }
    log.Printf("Item %d: %s format\n", i, result.MimeType)
}

BatchExtractBytes

Batch extract in-memory documents asynchronously.

Signature:

Go
func BatchExtractBytes(ctx context.Context, items []BytesWithMime, config *ExtractionConfig) ([]*ExtractionResult, error)

Parameters:

  • ctx (context.Context): Context for cancellation
  • items ([]BytesWithMime): Document slice
  • config (*ExtractionConfig): Configuration

Returns:

  • []*ExtractionResult: Results slice
  • error: Context or setup errors

LibraryVersion

Get the version of the underlying Rust library.

Signature:

Go
func LibraryVersion() string

Returns:

  • string: Version string (e.g., "4.0.0")

Example:

check_version.go
fmt.Printf("Kreuzberg version: %s\n", kreuzberg.LibraryVersion())

Configuration

ExtractionConfig

Root configuration struct for all extraction operations. All fields are optional (pointers); omitted fields use Kreuzberg defaults.

Signature:

Go
type ExtractionConfig struct {
    UseCache                 *bool                          // Enable result caching
    EnableQualityProcessing  *bool                          // Run quality improvements
    OCR                      *OCRConfig                     // OCR backend and settings
    ForceOCR                 *bool                          // Force OCR even for text-extractable docs
    Chunking                 *ChunkingConfig                // Text chunking and embeddings
    Images                   *ImageExtractionConfig         // Image extraction from docs
    PdfOptions               *PdfConfig                     // PDF-specific options
    TokenReduction           *TokenReductionConfig          // Token pruning before embeddings
    LanguageDetection        *LanguageDetectionConfig       // Language detection settings
    Keywords                 *KeywordConfig                 // Keyword extraction
    Postprocessor            *PostProcessorConfig           // Post-processor selection
    HTMLOptions              *HTMLConversionOptions         // HTML-to-Markdown conversion
    MaxConcurrentExtractions *int                           // Batch concurrency limit
}

OCRConfig

Configure OCR backend selection and language.

Signature:

Go
type OCRConfig struct {
    Backend   string           // OCR backend name: "tesseract", "easyocr", "paddle", etc.
    Language  *string          // Language code (e.g., "eng", "deu", "fra")
    Tesseract *TesseractConfig // Tesseract-specific fine-tuning
}

Example:

ocr_config.go
cfg := &kreuzberg.ExtractionConfig{
    OCR: &kreuzberg.OCRConfig{
        Backend:  "tesseract",
        Language: stringPtr("eng"),
        Tesseract: &kreuzberg.TesseractConfig{
            PSM:           intPtr(3),
            MinConfidence: float64Ptr(0.5),
        },
    },
}

TesseractConfig

Fine-grained Tesseract OCR tuning.

Signature:

Go
type TesseractConfig struct {
    Language                       string                    // Language code
    PSM                            *int                      // Page segmentation mode (0-13)
    OutputFormat                   string                    // Output format: "text", "pdf", "hocr"
    OEM                            *int                      // Engine mode (0-3)
    MinConfidence                  *float64                  // Confidence threshold (0.0-1.0)
    Preprocessing                  *ImagePreprocessingConfig // Image preprocessing
    EnableTableDetection           *bool                     // Detect and extract tables
    TableMinConfidence             *float64                  // Table detection confidence
    TableColumnThreshold           *int                      // Column separation threshold
    TableRowThresholdRatio         *float64                  // Row separation ratio
    UseCache                       *bool                     // Cache OCR results
    // Additional Tesseract parameters...
    TesseditCharWhitelist          string                    // Character whitelist
    TesseditCharBlacklist          string                    // Character blacklist
}

ImagePreprocessingConfig

Configure OCR image preprocessing (DPI normalization, rotation, denoising, etc.).

Signature:

Go
type ImagePreprocessingConfig struct {
    TargetDPI        *int   // Target DPI for OCR (typically 300)
    AutoRotate       *bool  // Auto-detect and correct image rotation
    Deskew           *bool  // Correct skewed text
    Denoise          *bool  // Remove noise
    ContrastEnhance  *bool  // Enhance contrast
    BinarizationMode string // Binarization method: "otsu", "adaptive"
    InvertColors     *bool  // Invert black/white
}

ChunkingConfig

Configure text chunking for RAG and retrieval workloads.

Signature:

Go
type ChunkingConfig struct {
    MaxChars     *int             // Maximum characters per chunk
    MaxOverlap   *int             // Overlap between chunks
    ChunkSize    *int             // Alias for MaxChars
    ChunkOverlap *int             // Alias for MaxOverlap
    Preset       *string          // Preset: "semantic", "sliding", "recursive"
    Embedding    *EmbeddingConfig // Embedding generation
    Enabled      *bool            // Enable chunking
}

ImageExtractionConfig

Configure image extraction from documents.

Signature:

Go
type ImageExtractionConfig struct {
    ExtractImages     *bool // Extract embedded images
    TargetDPI         *int  // Target DPI for extraction
    MaxImageDimension *int  // Maximum dimension (width/height)
    AutoAdjustDPI     *bool // Auto-adjust DPI for small images
    MinDPI            *int  // Minimum DPI threshold
    MaxDPI            *int  // Maximum DPI threshold
}

PdfConfig

PDF-specific extraction options.

Signature:

Go
type PdfConfig struct {
    ExtractImages   *bool    // Extract embedded images
    Passwords       []string // List of passwords for encrypted PDFs
    ExtractMetadata *bool    // Extract document metadata
}

EmbeddingConfig

Configure embedding generation for chunks.

Signature:

Go
type EmbeddingConfig struct {
    Model                *EmbeddingModelType // Model selection
    Normalize            *bool               // L2 normalization
    BatchSize            *int                // Batch size for inference
    ShowDownloadProgress *bool               // Show download progress
    CacheDir             *string             // Cache directory
}

type EmbeddingModelType struct {
    Type       string // "preset", "fastembed", "custom"
    Name       string // For preset models
    Model      string // For fastembed/custom
    ModelID    string // Alias for custom
    Dimensions *int   // Embedding dimensions
}

KeywordConfig

Configure keyword extraction.

Signature:

Go
type KeywordConfig struct {
    Algorithm   string      // "yake" or "rake"
    MaxKeywords *int        // Maximum keywords to extract
    MinScore    *float64    // Minimum keyword score
    NgramRange  *[2]int     // N-gram range: [min, max]
    Language    *string     // Language code
    Yake        *YakeParams // YAKE-specific tuning
    Rake        *RakeParams // RAKE-specific tuning
}

type YakeParams struct {
    WindowSize *int
}

type RakeParams struct {
    MinWordLength     *int
    MaxWordsPerPhrase *int
}

PostProcessorConfig

Configure post-processing steps.

Signature:

Go
type PostProcessorConfig struct {
    Enabled            *bool    // Enable post-processing
    EnabledProcessors  []string // Specific processors to run
    DisabledProcessors []string // Processors to skip
}

Results & Types

ExtractionResult

The main result struct containing all extracted data.

Signature:

Go
type ExtractionResult struct {
    Content           string           // Extracted text content
    MimeType          string           // Detected MIME type
    Metadata          Metadata         // Document metadata
    Tables            []Table          // Extracted tables
    DetectedLanguages []string         // Detected languages
    Chunks            []Chunk          // Text chunks (if enabled)
    Images            []ExtractedImage // Embedded images (if enabled)
    Pages             []PageContent    // Per-page content (if enabled)
    Success           bool             // Extraction success flag
}

Example - Accessing results:

inspect_extraction_result.go
result, err := kreuzberg.ExtractFileSync("report.pdf", nil)
if err != nil || !result.Success {
    log.Fatal("extraction failed")
}

fmt.Printf("Detected MIME type: %s\n", result.MimeType)
fmt.Printf("Content length: %d\n", len(result.Content))
fmt.Printf("Detected languages: %v\n", result.DetectedLanguages)
fmt.Printf("Number of tables: %d\n", len(result.Tables))
fmt.Printf("Number of chunks: %d\n", len(result.Chunks))
fmt.Printf("Number of images: %d\n", len(result.Images))

Pages

Type: []PageContent

Per-page extracted content when page extraction is enabled via PageConfig.ExtractPages = true.

Each page contains: - Page number (1-indexed) - Text content for that page - Tables on that page - Images on that page

Example:

page_extraction.go
config := &kreuzberg.ExtractionConfig{
    Pages: &kreuzberg.PageConfig{
        ExtractPages: boolPtr(true),
    },
}

result, err := kreuzberg.ExtractFileSync("document.pdf", config)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

if result.Pages != nil {
    for _, page := range result.Pages {
        fmt.Printf("Page %d:\n", page.PageNumber)
        fmt.Printf("  Content: %d chars\n", len(page.Content))
        fmt.Printf("  Tables: %d\n", len(page.Tables))
        fmt.Printf("  Images: %d\n", len(page.Images))
    }
}

Accessing Per-Page Content

When page extraction is enabled, access individual pages and iterate over them:

iterate_pages.go
config := &kreuzberg.ExtractionConfig{
    Pages: &kreuzberg.PageConfig{
        ExtractPages:      boolPtr(true),
        InsertPageMarkers: boolPtr(true),
        MarkerFormat:      stringPtr("\n\n--- Page {page_num} ---\n\n"),
    },
}

result, err := kreuzberg.ExtractFileSync("document.pdf", config)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

// Access combined content with page markers
fmt.Println("Combined content with markers:")
if len(result.Content) > 500 {
    fmt.Println(result.Content[:500])
} else {
    fmt.Println(result.Content)
}
fmt.Println()

// Access per-page content
if result.Pages != nil {
    for _, page := range result.Pages {
        fmt.Printf("Page %d:\n", page.PageNumber)
        preview := page.Content
        if len(preview) > 100 {
            preview = preview[:100]
        }
        fmt.Printf("  %s...\n", preview)
        if len(page.Tables) > 0 {
            fmt.Printf("  Found %d table(s)\n", len(page.Tables))
        }
        if len(page.Images) > 0 {
            fmt.Printf("  Found %d image(s)\n", len(page.Images))
        }
    }
}

Metadata

Aggregated document metadata with format-specific fields.

Signature:

Go
type Metadata struct {
    Language           *string                     // Detected language code
    Date               *string                     // Extracted document date
    Subject            *string                     // Document subject
    Format             FormatMetadata              // Format-specific metadata
    ImagePreprocessing *ImagePreprocessingMetadata // OCR preprocessing info
    JSONSchema         json.RawMessage             // JSON Schema if available
    Error              *ErrorMetadata              // Error info for batch operations
    Additional         map[string]json.RawMessage  // Custom/additional fields
}

Access format-specific metadata:

inspect_format_metadata.go
fmt.Println("Format type:", result.Metadata.FormatType())

if pdfMeta, ok := result.Metadata.PdfMetadata(); ok {
    fmt.Printf("Title: %s\n", *pdfMeta.Title)
    fmt.Printf("Pages: %d\n", *pdfMeta.PageCount)
    fmt.Printf("Author: %s\n", *pdfMeta.Authors[0])
}

if excelMeta, ok := result.Metadata.ExcelMetadata(); ok {
    fmt.Printf("Sheets: %d\n", excelMeta.SheetCount)
    fmt.Printf("Sheet names: %v\n", excelMeta.SheetNames)
}

if htmlMeta, ok := result.Metadata.HTMLMetadata(); ok {
    fmt.Printf("Page title: %s\n", *htmlMeta.Title)
    fmt.Printf("OG image: %s\n", *htmlMeta.OGImage)
}

Table

Extracted table structure.

Signature:

Go
type Table struct {
    Cells      [][]string // 2D cell array [row][col]
    Markdown   string     // Markdown representation
    PageNumber int        // Page number (PDF/Image documents)
}

Example:

extract_tables.go
for tableIdx, table := range result.Tables {
    fmt.Printf("Table %d (page %d):\n", tableIdx, table.PageNumber)
    for _, row := range table.Cells {
        fmt.Println(strings.Join(row, " | "))
    }
    fmt.Println("Markdown:", table.Markdown)
}

Chunk

Text chunk with optional embeddings and metadata.

Signature:

Go
type Chunk struct {
    Content   string        // Chunk text
    Embedding []float32     // Embedding vector (if enabled)
    Metadata  ChunkMetadata // Chunk positioning
}

type ChunkMetadata struct {
    ByteStart   int  // UTF-8 byte offset (inclusive)
    ByteEnd     int  // UTF-8 byte offset (exclusive)
    CharCount   int  // Number of characters in chunk
    TokenCount  *int // Token count (if available)
    FirstPage   *int // First page this chunk appears on (1-indexed)
    LastPage    *int // Last page this chunk appears on (1-indexed)
    ChunkIndex  int  // Index in chunk sequence
    TotalChunks int  // Total number of chunks
}

Fields:

  • ByteStart (int): UTF-8 byte offset in content (inclusive)
  • ByteEnd (int): UTF-8 byte offset in content (exclusive)
  • CharCount (int): Number of characters in chunk
  • TokenCount (*int): Estimated token count (if configured)
  • FirstPage (*int): First page this chunk appears on (1-indexed, only when page boundaries available)
  • LastPage (*int): Last page this chunk appears on (1-indexed, only when page boundaries available)

Page tracking: When PageStructure.Boundaries is available and chunking is enabled, FirstPage and LastPage are automatically calculated based on byte offsets.

Example:

inspect_chunks.go
for _, chunk := range result.Chunks {
    fmt.Printf("Chunk %d/%d\n", chunk.Metadata.ChunkIndex, chunk.Metadata.TotalChunks)
    fmt.Printf("Content: %s...\n", chunk.Content[:min(50, len(chunk.Content))])
    fmt.Printf("Bytes: [%d:%d], %d chars\n", chunk.Metadata.ByteStart, chunk.Metadata.ByteEnd, chunk.Metadata.CharCount)
    if chunk.Metadata.TokenCount != nil {
        fmt.Printf("Tokens: %d\n", *chunk.Metadata.TokenCount)
    }

    // Show page information if available
    if chunk.Metadata.FirstPage != nil {
        first := *chunk.Metadata.FirstPage
        last := *chunk.Metadata.LastPage
        if first == last {
            fmt.Printf("Page: %d\n", first)
        } else {
            fmt.Printf("Pages: %d-%d\n", first, last)
        }
    }

    if len(chunk.Embedding) > 0 {
        fmt.Printf("Embedding dim: %d\n", len(chunk.Embedding))
        fmt.Printf("First 5 values: %v\n", chunk.Embedding[:5])
    }
}

ExtractedImage

Image extracted from document with optional OCR results.

Signature:

Go
type ExtractedImage struct {
    Data             []byte            // Raw image bytes
    Format           string            // Image format: "jpeg", "png", "webp"
    ImageIndex       int               // Index in images list
    PageNumber       *int              // Page number (if applicable)
    Width            *uint32           // Image width in pixels
    Height           *uint32           // Image height in pixels
    Colorspace       *string           // Colorspace (sRGB, CMYK, etc.)
    BitsPerComponent *uint32           // Bits per color component
    IsMask           bool              // Is image a mask?
    Description      *string           // Image description/alt text
    OCRResult        *ExtractionResult // Nested OCR extraction
}

Example:

extract_images.go
for imgIdx, img := range result.Images {
    fmt.Printf("Image %d: %s, %dx%d\n", imgIdx, img.Format, *img.Width, *img.Height)

    filename := fmt.Sprintf("image_%d.%s", imgIdx, img.Format)
    os.WriteFile(filename, img.Data, 0644)

    if img.OCRResult != nil {
        fmt.Printf("Image %d OCR: %s\n", imgIdx, img.OCRResult.Content)
    }
}

Error Handling

Error Types

Kreuzberg defines a type hierarchy of errors via the KreuzbergError interface:

Go
type KreuzbergError interface {
    error
    Kind() ErrorKind
}

type ErrorKind string

const (
    ErrorKindUnknown           ErrorKind = "unknown"
    ErrorKindIO                ErrorKind = "io"
    ErrorKindValidation        ErrorKind = "validation"
    ErrorKindParsing           ErrorKind = "parsing"
    ErrorKindOCR               ErrorKind = "ocr"
    ErrorKindCache             ErrorKind = "cache"
    ErrorKindImageProcessing   ErrorKind = "image_processing"
    ErrorKindSerialization     ErrorKind = "serialization"
    ErrorKindMissingDependency ErrorKind = "missing_dependency"
    ErrorKindPlugin            ErrorKind = "plugin"
    ErrorKindUnsupportedFormat ErrorKind = "unsupported_format"
    ErrorKindRuntime           ErrorKind = "runtime"
)

Error type classes:

  • ValidationError: Input validation failed (empty paths, missing MIME types)
  • ParsingError: Document parsing failed (malformed file, unsupported format)
  • OCRError: OCR backend failure (library missing, invalid language)
  • CacheError: Cache operation failed
  • ImageProcessingError: Image manipulation failed
  • SerializationError: JSON encoding/decoding failed
  • MissingDependencyError: Required library not found (Tesseract, EasyOCR, etc.)
  • PluginError: Plugin registration or execution failed
  • UnsupportedFormatError: MIME type not supported
  • IOError: File I/O failure
  • RuntimeError: Unexpected runtime failure (lock poisoning, etc.)

Error Classification

Errors are automatically classified based on native error messages. Use errors.As() and errors.Is() to handle specific error types:

error_classification.go
import (
    "errors"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
if err != nil {
    var parsingErr *kreuzberg.ParsingError
    if errors.As(err, &parsingErr) {
        log.Printf("Parsing failed: %v\n", parsingErr)
        return
    }

    var missingDep *kreuzberg.MissingDependencyError
    if errors.As(err, &missingDep) {
        log.Printf("Missing dependency: %s\n", missingDep.Dependency)
        return
    }

    log.Printf("Extraction failed: %v\n", err)
}

Error Unwrapping

All Kreuzberg errors support error unwrapping via errors.Unwrap():

error_unwrapping.go
result, err := kreuzberg.ExtractFileSync("doc.pdf", nil)
if err != nil {
    rootErr := errors.Unwrap(err)
    if rootErr != nil {
        log.Printf("Root cause: %v\n", rootErr)
    }

    if krErr, ok := err.(kreuzberg.KreuzbergError); ok {
        log.Printf("Error kind: %v\n", krErr.Kind())
    }
}

Error Handling Examples

Handle file not found:

handle_file_not_found.go
result, err := kreuzberg.ExtractFileSync("missing.pdf", nil)
if err != nil {
    var ioErr *kreuzberg.IOError
    if errors.As(err, &ioErr) {
        log.Println("File not found or unreadable")
        return
    }
    log.Fatalf("unexpected error: %v\n", err)
}

Handle missing OCR dependency:

handle_missing_ocr.go
cfg := &kreuzberg.ExtractionConfig{
    OCR: &kreuzberg.OCRConfig{
        Backend:  "tesseract",
        Language: stringPtr("eng"),
    },
}

result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
    var missingDep *kreuzberg.MissingDependencyError
    if errors.As(err, &missingDep) {
        log.Printf("Install %s to use OCR\n", missingDep.Dependency)
        return
    }
    log.Fatalf("extraction failed: %v\n", err)
}

Batch error handling:

batch_error_handling.go
results, err := kreuzberg.BatchExtractFilesSync(files, nil)
if err != nil {
    log.Fatalf("batch setup failed: %v\n", err)
}

for i, result := range results {
    if result == nil {
        log.Printf("File %d: extraction failed (nil result)\n", i)
        continue
    }

    if result.Metadata.Error != nil {
        log.Printf("File %d: %s - %s\n", i, result.Metadata.Error.ErrorType, result.Metadata.Error.Message)
        continue
    }

    if !result.Success {
        log.Printf("File %d: extraction unsuccessful\n", i)
        continue
    }

    log.Printf("File %d: success (%d chars)\n", i, len(result.Content))
}

Advanced Usage

MIME Type Detection

Detect MIME type from file extension or content:

mime_detection.go
mimeType := "application/pdf"

CGO-Specific Patterns

Memory Management

Go's cgo automatically manages C memory for simple types. Kreuzberg handles C pointer cleanup internally via defer statements:

memory_safety.go
result, err := kreuzberg.ExtractFileSync("doc.pdf", nil)

result, err := kreuzberg.ExtractBytesSync(data, "application/pdf", nil)

Static Linking Configuration

Go binaries are statically linked against the FFI library, so no runtime library paths are needed. Configuration is done at build time:

Monorepo Development:

Terminal
# Build FFI library first
cargo build -p kreuzberg-ffi --release

# Go automatically finds target/release/libkreuzberg_ffi.a
go build -v ./...

# Run directly - no environment variables needed
./myapp

External Projects:

Terminal
# Set CGO_LDFLAGS to point to the static library
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build

# Run directly - no runtime dependencies
./myapp

Configuration as JSON

Internally, ExtractionConfig is serialized to JSON and passed to the C FFI:

json_serialization.go
cfg := &kreuzberg.ExtractionConfig{
    UseCache: boolPtr(true),
    OCR: &kreuzberg.OCRConfig{
        Backend:  "tesseract",
        Language: stringPtr("eng"),
    },
}

result, err := kreuzberg.ExtractFileSync("doc.pdf", cfg)

Custom Post-Processors

Register custom post-processing logic in Go:

custom_post_processor.go
package main

import (
    "C"
    "encoding/json"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

//export myCustomProcessor
func myCustomProcessor(resultJSON *C.char) *C.char {
    jsonStr := C.GoString(resultJSON)
    var result kreuzberg.ExtractionResult
    if err := json.Unmarshal([]byte(jsonStr), &result); err != nil {
        errMsg := C.CString("failed to parse JSON")
        return errMsg
    }

    result.Content = strings.ToUpper(result.Content)

    modified, _ := json.Marshal(result)
    return C.CString(string(modified))
}

func init() {
    err := kreuzberg.RegisterPostProcessor(
        "go-uppercase",
        100, // priority
        (C.PostProcessorCallback)(C.myCustomProcessor),
    )
    if err != nil {
        log.Fatalf("failed to register post-processor: %v\n", err)
    }
}

func main() {
    cfg := &kreuzberg.ExtractionConfig{
        Postprocessor: &kreuzberg.PostProcessorConfig{
            EnabledProcessors: []string{"go-uppercase"},
        },
    }

    result, _ := kreuzberg.ExtractFileSync("doc.pdf", cfg)
}

Custom Validators

Validate extraction results:

custom_validator.go
//export myValidator
func myValidator(resultJSON *C.char) *C.char {
    jsonStr := C.GoString(resultJSON)
    var result kreuzberg.ExtractionResult
    json.Unmarshal([]byte(jsonStr), &result)

    if len(result.Content) == 0 {
        errMsg := C.CString("content is empty")
        return errMsg
    }

    return nil
}

func init() {
    kreuzberg.RegisterValidator(
        "content-not-empty",
        50,
        (C.ValidatorCallback)(C.myValidator),
    )
}

Custom OCR Backends

Register a custom OCR backend:

custom_ocr_backend.go
//export customOCR
func customOCR(imageData *C.uint8_t, width C.uint32_t, height C.uint32_t, lang *C.char) *C.char {
    result := kreuzberg.ExtractionResult{
        Content:  "extracted text from custom OCR",
        MimeType: "text/plain",
        Success:  true,
    }
    data, _ := json.Marshal(result)
    return C.CString(string(data))
}

func init() {
    kreuzberg.RegisterOCRBackend(
        "custom-ocr",
        (C.OcrBackendCallback)(C.customOCR),
    )
}

Plugin Management

List and manage registered plugins:

plugin_management.go
validators, err := kreuzberg.ListValidators()
if err == nil {
    fmt.Printf("Validators: %v\n", validators)
}

processors, err := kreuzberg.ListPostProcessors()
if err == nil {
    fmt.Printf("Post-processors: %v\n", processors)
}

backends, err := kreuzberg.ListOCRBackends()
if err == nil {
    fmt.Printf("OCR backends: %v\n", backends)
}

if err := kreuzberg.ClearValidators(); err != nil {
    log.Fatalf("failed to clear validators: %v\n", err)
}

if err := kreuzberg.UnregisterValidator("my-validator"); err != nil {
    log.Fatalf("failed to unregister: %v\n", err)
}

Performance Tips

  1. Batch Processing: Use BatchExtractFilesSync() for multiple files to leverage internal optimizations
  2. Context Timeouts: Set realistic timeouts; OCR can be slow on large documents
  3. Caching: Enable UseCache: boolPtr(true) to cache frequently extracted documents
  4. Static Linking: Binaries are self-contained after build; no runtime library paths needed
  5. Configuration Reuse: Create and reuse ExtractionConfig objects across multiple calls
  6. Goroutines: Use ExtractFile() / ExtractBytes() variants in goroutines for concurrency

Troubleshooting

Static Library Not Found

Error: cannot find -lkreuzberg_ffi or undefined reference to 'kreuzberg_...'

Solution:

Terminal
# Verify static library exists
ls -la target/release/libkreuzberg_ffi.a

# For monorepo development, just build the FFI crate:
cargo build -p kreuzberg-ffi --release

# For external projects, provide the path via CGO_LDFLAGS:
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build

The binary will be statically linked and have no runtime dependencies on Kreuzberg libraries.


CGO Compilation Errors

Error: error: kreuzberg.h: No such file or directory

Solution:

Ensure kreuzberg-ffi is built before building your Go module:

Terminal
cargo build -p kreuzberg-ffi --release
go build ./...

Missing OCR Library

Error: MissingDependencyError: Missing dependency: tesseract

Solution:

Install Tesseract or use a different OCR backend:

Terminal
# macOS
brew install tesseract

# Debian/Ubuntu
apt-get install tesseract-ocr

# Or use EasyOCR/PaddleOCR (Python packages)

Context Timeout on Large Documents

Issue: Extraction times out before completion

Solution:

Increase timeout or disable OCR for large documents:

handle_large_documents.go
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()

cfg := &kreuzberg.ExtractionConfig{
    ForceOCR: boolPtr(false),
}

result, err := kreuzberg.ExtractFile(ctx, "large.pdf", cfg)

Testing

Run the test suite:

Terminal
# Unit tests (from packages/go)
task go:test

# Lint (gofmt + golangci-lint)
task go:lint

# E2E tests (from e2e/go, auto-generated from fixtures)
task e2e:go:verify

# Manual test (build FFI library first)
cargo build -p kreuzberg-ffi --release
go test -v ./packages/go/v4

Helper Functions

Add these utility functions to your code:

Go
func stringPtr(s string) *string {
    return &s
}

func boolPtr(b bool) *bool {
    return &b
}

func intPtr(i int) *int {
    return &i
}

func float64Ptr(f float64) *float64 {
    return &f
}

func uint32Ptr(u uint32) *uint32 {
    return &u
}