Go API Reference¶
Complete reference for the Kreuzberg Go bindings using cgo to access the Rust-powered extraction pipeline.
The Go binding exposes the same extraction capabilities as the other languages through C FFI bindings to kreuzberg-ffi. You get identical metadata extraction, OCR processing, chunking, embeddings, and plugin support—with synchronous and context-aware async APIs.
Requirements¶
- Go 1.25+ (with cgo support)
- C compiler (gcc/clang for cgo compilation)
- libkreuzberg_ffi.a static library (at build time only)
- Tesseract/EasyOCR/PaddleOCR (optional, for OCR functionality)
Installation¶
Kreuzberg Go binaries are statically linked — once built, they are self-contained and require no runtime library dependencies. Only the static library is needed at build time.
Add the package to your go.mod:¶
go get github.com/kreuzberg-dev/kreuzberg/packages/go/v4
Monorepo Development¶
For development in the Kreuzberg monorepo:
# Build the FFI crate (produces static library)
cargo build -p kreuzberg-ffi --release
# Go will automatically link against target/release/libkreuzberg_ffi.a
cd packages/go/v4
go build -v
# Run your binary - no library paths needed, it's statically linked!
./v4
External Projects¶
When building outside the monorepo, provide the static library via CGO_LDFLAGS:
# Option 1: Download pre-built from GitHub Releases
curl -LO https://github.com/kreuzberg-dev/kreuzberg/releases/download/v4.0.0/go-ffi-linux-x86_64.tar.gz
tar -xzf go-ffi-linux-x86_64.tar.gz
mkdir -p ~/kreuzberg/lib
cp kreuzberg-ffi/lib/libkreuzberg_ffi.a ~/kreuzberg/lib/
# Option 2: Build static library yourself
git clone https://github.com/kreuzberg-dev/kreuzberg.git
cd kreuzberg && cargo build -p kreuzberg-ffi --release
cp target/release/libkreuzberg_ffi.a ~/kreuzberg/lib/
# Build your Go project with static linking
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build
# Run - no library paths needed!
./myapp
Quickstart¶
Basic file extraction (synchronous)¶
package main
import (
"fmt"
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
fmt.Printf("Format: %s\n", result.MimeType)
fmt.Printf("Content length: %d\n", len(result.Content))
fmt.Printf("Success: %v\n", result.Success)
}
Async extraction with timeout¶
package main
import (
"context"
"errors"
"log"
"time"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
func main() {
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()
result, err := kreuzberg.ExtractFile(ctx, "large-document.pdf", nil)
if errors.Is(err, context.DeadlineExceeded) {
log.Println("extraction timed out")
return
}
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
log.Printf("Extracted %d characters\n", len(result.Content))
}
Core Functions¶
ExtractFileSync¶
Extract content and metadata from a file synchronously.
Signature:
func ExtractFileSync(path string, config *ExtractionConfig) (*ExtractionResult, error)
Parameters:
- path (string): Path to the file to extract (absolute or relative)
- config (*ExtractionConfig): Optional extraction configuration; uses defaults if nil
Returns:
- *ExtractionResult: Populated result containing content, metadata, tables, chunks, and images
- error: KreuzbergError or standard Go error (see Error Handling section)
Error Handling:
- ValidationError: If path is empty
- IOError: If file not found or not readable
- ParsingError: If document parsing fails
- MissingDependencyError: If required OCR/processing library is missing
- UnsupportedFormatError: If MIME type is not supported
Example - Extract PDF:
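result, err := kreuzberg.ExtractFileSync("report.pdf", nil)
if err != nil {
	log.Fatalf("extraction failed: %v", err)
}
// PdfMetadata returns ok=false for non-PDF documents; its fields are optional pointers.
if pdfMeta, ok := result.Metadata.PdfMetadata(); ok && pdfMeta.Title != nil && pdfMeta.PageCount != nil {
	fmt.Printf("Title: %s\n", *pdfMeta.Title)
	fmt.Printf("Page count: %d\n", *pdfMeta.PageCount)
}
// Clamp the slice bound: slicing past the end of a short document would panic.
fmt.Printf("Content preview: %s...\n", result.Content[:min(100, len(result.Content))])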
Example - Extract with configuration:
cfg := &kreuzberg.ExtractionConfig{
UseCache: boolPtr(true),
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: stringPtr("eng"),
},
}
result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
ExtractFile¶
Extract content from a file asynchronously with context support.
Signature:
func ExtractFile(ctx context.Context, path string, config *ExtractionConfig) (*ExtractionResult, error)
Parameters:
- ctx (context.Context): Context for cancellation and timeout
- path (string): Path to the file
- config (*ExtractionConfig): Optional configuration
Returns:
- *ExtractionResult: Extraction result
- error: May include context errors (context.DeadlineExceeded, context.Canceled)
Note: Context cancellation is best-effort. The underlying C call cannot be interrupted, so when the context deadline is exceeded or the context is cancelled, the function returns ctx.Err() immediately while the native extraction keeps running until it finishes on its own.
Example - With deadline:
ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(30*time.Second))
defer cancel()
result, err := kreuzberg.ExtractFile(ctx, "large.docx", nil)
if errors.Is(err, context.DeadlineExceeded) {
log.Println("extraction took too long")
return
}
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
ExtractBytesSync¶
Extract content from an in-memory byte slice with specified MIME type.
Signature:
func ExtractBytesSync(data []byte, mimeType string, config *ExtractionConfig) (*ExtractionResult, error)
Parameters:
- data ([]byte): Document bytes
- mimeType (string): MIME type (e.g., "application/pdf", "text/plain")
- config (*ExtractionConfig): Optional configuration
Returns:
- *ExtractionResult: Extraction result
- error: KreuzbergError on extraction failure
Example - Extract from downloaded PDF:
httpResp, err := http.Get("https://example.com/document.pdf")
if err != nil {
log.Fatal(err)
}
defer httpResp.Body.Close()
data, err := io.ReadAll(httpResp.Body)
if err != nil {
log.Fatal(err)
}
result, err := kreuzberg.ExtractBytesSync(data, "application/pdf", nil)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
fmt.Printf("Extracted %d words\n", len(strings.Fields(result.Content)))
ExtractBytes¶
Extract content from in-memory bytes asynchronously.
Signature:
func ExtractBytes(ctx context.Context, data []byte, mimeType string, config *ExtractionConfig) (*ExtractionResult, error)
Parameters:
- ctx (context.Context): Context for cancellation and timeout
- data ([]byte): Document bytes
- mimeType (string): MIME type
- config (*ExtractionConfig): Optional configuration
Returns:
- *ExtractionResult: Extraction result
- error: KreuzbergError or context error
Example:
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
result, err := kreuzberg.ExtractBytes(ctx, data, "text/html", nil)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
BatchExtractFilesSync¶
Extract multiple files sequentially using the optimized batch pipeline.
Signature:
func BatchExtractFilesSync(paths []string, config *ExtractionConfig) ([]*ExtractionResult, error)
Parameters:
- paths ([]string): Slice of file paths
- config (*ExtractionConfig): Configuration applied to all files
Returns:
- []*ExtractionResult: Slice of results (one per input file; may contain nils for failed extractions)
- error: Returned only if batch setup fails; individual file errors are captured in ErrorMetadata
Example - Batch extract multiple PDFs:
files := []string{"doc1.pdf", "doc2.pdf", "doc3.pdf"}
results, err := kreuzberg.BatchExtractFilesSync(files, nil)
if err != nil {
log.Fatalf("batch extraction setup failed: %v", err)
}
for i, result := range results {
if result == nil {
fmt.Printf("File %d: extraction failed\n", i)
continue
}
if result.Metadata.Error != nil {
fmt.Printf("File %d: %s (%s)\n", i, result.Metadata.Error.ErrorType, result.Metadata.Error.Message)
continue
}
fmt.Printf("File %d: extracted %d chars\n", i, len(result.Content))
}
BatchExtractFiles¶
Batch extract multiple files asynchronously.
Signature:
func BatchExtractFiles(ctx context.Context, paths []string, config *ExtractionConfig) ([]*ExtractionResult, error)
Parameters:
- ctx (context.Context): Context for cancellation
- paths ([]string): File paths
- config (*ExtractionConfig): Configuration for all files
Returns:
- []*ExtractionResult: Results slice
- error: Context or setup errors
BatchExtractBytesSync¶
Extract multiple in-memory documents in a single batch operation.
Signature:
func BatchExtractBytesSync(items []BytesWithMime, config *ExtractionConfig) ([]*ExtractionResult, error)
Parameters:
- items ([]BytesWithMime): Slice of {Data, MimeType} pairs
- config (*ExtractionConfig): Configuration applied to all items
Returns:
- []*ExtractionResult: Results slice
- error: Setup error or validation error
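BytesWithMime structure:
type BytesWithMime struct {
	Data []byte // Raw document bytes
	MimeType string // MIME type of Data
}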
Example - Batch extract multiple formats:
items := []kreuzberg.BytesWithMime{
{Data: pdfData, MimeType: "application/pdf"},
{Data: docxData, MimeType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
{Data: htmlData, MimeType: "text/html"},
}
results, err := kreuzberg.BatchExtractBytesSync(items, nil)
if err != nil {
log.Fatalf("batch extraction failed: %v", err)
}
for i, result := range results {
if result == nil || !result.Success {
log.Printf("Item %d extraction failed\n", i)
continue
}
log.Printf("Item %d: %s format\n", i, result.MimeType)
}
BatchExtractBytes¶
Batch extract in-memory documents asynchronously.
Signature:
func BatchExtractBytes(ctx context.Context, items []BytesWithMime, config *ExtractionConfig) ([]*ExtractionResult, error)
Parameters:
- ctx (context.Context): Context for cancellation
- items ([]BytesWithMime): Document slice
- config (*ExtractionConfig): Configuration
Returns:
- []*ExtractionResult: Results slice
- error: Context or setup errors
LibraryVersion¶
Get the version of the underlying Rust library.
Signature:
func LibraryVersion() string
Returns:
- string: Version string (e.g., "4.0.0")
Example:
fmt.Printf("Kreuzberg version: %s\n", kreuzberg.LibraryVersion())
Configuration¶
ExtractionConfig¶
Root configuration struct for all extraction operations. All fields are optional (pointers); omitted fields use Kreuzberg defaults.
Signature:
type ExtractionConfig struct {
UseCache *bool // Enable result caching
EnableQualityProcessing *bool // Run quality improvements
OCR *OCRConfig // OCR backend and settings
ForceOCR *bool // Force OCR even for text-extractable docs
Chunking *ChunkingConfig // Text chunking and embeddings
Images *ImageExtractionConfig // Image extraction from docs
PdfOptions *PdfConfig // PDF-specific options
TokenReduction *TokenReductionConfig // Token pruning before embeddings
LanguageDetection *LanguageDetectionConfig // Language detection settings
Keywords *KeywordConfig // Keyword extraction
Postprocessor *PostProcessorConfig // Post-processor selection
HTMLOptions *HTMLConversionOptions // HTML-to-Markdown conversion
MaxConcurrentExtractions *int // Batch concurrency limit
Pages *PageConfig // Per-page extraction and page markers (see Pages)
}
OCRConfig¶
Configure OCR backend selection and language.
Signature:
type OCRConfig struct {
Backend string // OCR backend name: "tesseract", "easyocr", "paddle", etc.
Language *string // Language code (e.g., "eng", "deu", "fra")
Tesseract *TesseractConfig // Tesseract-specific fine-tuning
}
Example:
cfg := &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: stringPtr("eng"),
Tesseract: &kreuzberg.TesseractConfig{
PSM: intPtr(3),
MinConfidence: float64Ptr(0.5),
},
},
}
TesseractConfig¶
Fine-grained Tesseract OCR tuning.
Signature:
type TesseractConfig struct {
Language string // Language code
PSM *int // Page segmentation mode (0-13)
OutputFormat string // Output format: "text", "pdf", "hocr"
OEM *int // Engine mode (0-3)
MinConfidence *float64 // Confidence threshold (0.0-1.0)
Preprocessing *ImagePreprocessingConfig // Image preprocessing
EnableTableDetection *bool // Detect and extract tables
TableMinConfidence *float64 // Table detection confidence
TableColumnThreshold *int // Column separation threshold
TableRowThresholdRatio *float64 // Row separation ratio
UseCache *bool // Cache OCR results
// Additional Tesseract parameters...
TesseditCharWhitelist string // Character whitelist
TesseditCharBlacklist string // Character blacklist
}
ImagePreprocessingConfig¶
Configure OCR image preprocessing (DPI normalization, rotation, denoising, etc.).
Signature:
type ImagePreprocessingConfig struct {
TargetDPI *int // Target DPI for OCR (typically 300)
AutoRotate *bool // Auto-detect and correct image rotation
Deskew *bool // Correct skewed text
Denoise *bool // Remove noise
ContrastEnhance *bool // Enhance contrast
BinarizationMode string // Binarization method: "otsu", "adaptive"
InvertColors *bool // Invert black/white
}
ChunkingConfig¶
Configure text chunking for RAG and retrieval workloads.
Signature:
type ChunkingConfig struct {
MaxChars *int // Maximum characters per chunk
MaxOverlap *int // Overlap between chunks
ChunkSize *int // Alias for MaxChars
ChunkOverlap *int // Alias for MaxOverlap
Preset *string // Preset: "semantic", "sliding", "recursive"
Embedding *EmbeddingConfig // Embedding generation
Enabled *bool // Enable chunking
}
ImageExtractionConfig¶
Configure image extraction from documents.
Signature:
type ImageExtractionConfig struct {
ExtractImages *bool // Extract embedded images
TargetDPI *int // Target DPI for extraction
MaxImageDimension *int // Maximum dimension (width/height)
AutoAdjustDPI *bool // Auto-adjust DPI for small images
MinDPI *int // Minimum DPI threshold
MaxDPI *int // Maximum DPI threshold
}
PdfConfig¶
PDF-specific extraction options.
Signature:
type PdfConfig struct {
ExtractImages *bool // Extract embedded images
Passwords []string // List of passwords for encrypted PDFs
ExtractMetadata *bool // Extract document metadata
}
EmbeddingConfig¶
Configure embedding generation for chunks.
Signature:
type EmbeddingConfig struct {
Model *EmbeddingModelType // Model selection
Normalize *bool // L2 normalization
BatchSize *int // Batch size for inference
ShowDownloadProgress *bool // Show download progress
CacheDir *string // Cache directory
}
type EmbeddingModelType struct {
Type string // "preset", "fastembed", "custom"
Name string // For preset models
Model string // For fastembed/custom
ModelID string // Alias for custom
Dimensions *int // Embedding dimensions
}
KeywordConfig¶
Configure keyword extraction.
Signature:
type KeywordConfig struct {
Algorithm string // "yake" or "rake"
MaxKeywords *int // Maximum keywords to extract
MinScore *float64 // Minimum keyword score
NgramRange *[2]int // N-gram range: [min, max]
Language *string // Language code
Yake *YakeParams // YAKE-specific tuning
Rake *RakeParams // RAKE-specific tuning
}
type YakeParams struct {
WindowSize *int
}
type RakeParams struct {
MinWordLength *int
MaxWordsPerPhrase *int
}
PostProcessorConfig¶
Configure post-processing steps.
Signature:
type PostProcessorConfig struct {
Enabled *bool // Enable post-processing
EnabledProcessors []string // Specific processors to run
DisabledProcessors []string // Processors to skip
}
Results & Types¶
ExtractionResult¶
The main result struct containing all extracted data.
Signature:
type ExtractionResult struct {
Content string // Extracted text content
MimeType string // Detected MIME type
Metadata Metadata // Document metadata
Tables []Table // Extracted tables
DetectedLanguages []string // Detected languages
Chunks []Chunk // Text chunks (if enabled)
Images []ExtractedImage // Embedded images (if enabled)
Pages []PageContent // Per-page content (if enabled)
Success bool // Extraction success flag
}
Example - Accessing results:
result, err := kreuzberg.ExtractFileSync("report.pdf", nil)
if err != nil || !result.Success {
log.Fatal("extraction failed")
}
fmt.Printf("Detected MIME type: %s\n", result.MimeType)
fmt.Printf("Content length: %d\n", len(result.Content))
fmt.Printf("Detected languages: %v\n", result.DetectedLanguages)
fmt.Printf("Number of tables: %d\n", len(result.Tables))
fmt.Printf("Number of chunks: %d\n", len(result.Chunks))
fmt.Printf("Number of images: %d\n", len(result.Images))
Pages¶
Type: []PageContent
Per-page extracted content when page extraction is enabled via PageConfig.ExtractPages = true.
Each page contains:
- Page number (1-indexed)
- Text content for that page
- Tables on that page
- Images on that page
Example:
config := &kreuzberg.ExtractionConfig{
Pages: &kreuzberg.PageConfig{
ExtractPages: boolPtr(true),
},
}
result, err := kreuzberg.ExtractFileSync("document.pdf", config)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
if result.Pages != nil {
for _, page := range result.Pages {
fmt.Printf("Page %d:\n", page.PageNumber)
fmt.Printf(" Content: %d chars\n", len(page.Content))
fmt.Printf(" Tables: %d\n", len(page.Tables))
fmt.Printf(" Images: %d\n", len(page.Images))
}
}
Accessing Per-Page Content¶
When page extraction is enabled, access individual pages and iterate over them:
config := &kreuzberg.ExtractionConfig{
Pages: &kreuzberg.PageConfig{
ExtractPages: boolPtr(true),
InsertPageMarkers: boolPtr(true),
MarkerFormat: stringPtr("\n\n--- Page {page_num} ---\n\n"),
},
}
result, err := kreuzberg.ExtractFileSync("document.pdf", config)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
// Access combined content with page markers
fmt.Println("Combined content with markers:")
if len(result.Content) > 500 {
fmt.Println(result.Content[:500])
} else {
fmt.Println(result.Content)
}
fmt.Println()
// Access per-page content
if result.Pages != nil {
for _, page := range result.Pages {
fmt.Printf("Page %d:\n", page.PageNumber)
preview := page.Content
if len(preview) > 100 {
preview = preview[:100]
}
fmt.Printf(" %s...\n", preview)
if len(page.Tables) > 0 {
fmt.Printf(" Found %d table(s)\n", len(page.Tables))
}
if len(page.Images) > 0 {
fmt.Printf(" Found %d image(s)\n", len(page.Images))
}
}
}
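If only the combined content is kept, the page markers can also be used to split it back into pages. The sketch below assumes the marker format from the example above ("\n\n--- Page {page_num} ---\n\n") and assumes that each marker precedes the content of its page; splitByMarkers is a hypothetical helper, not part of the Kreuzberg package.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// markerRe matches markers rendered from "\n\n--- Page {page_num} ---\n\n".
var markerRe = regexp.MustCompile(`\n\n--- Page (\d+) ---\n\n`)

// splitByMarkers maps each page number to the text that follows its marker.
// Assumption: every marker is placed before the content of its page.
func splitByMarkers(content string) map[int]string {
	pages := make(map[int]string)
	locs := markerRe.FindAllStringSubmatchIndex(content, -1)
	for i, loc := range locs {
		// loc[2]:loc[3] is the captured page number.
		num, err := strconv.Atoi(content[loc[2]:loc[3]])
		if err != nil {
			continue
		}
		// The page runs from the end of this marker to the start of the next.
		end := len(content)
		if i+1 < len(locs) {
			end = locs[i+1][0]
		}
		pages[num] = content[loc[1]:end]
	}
	return pages
}

func main() {
	content := "\n\n--- Page 1 ---\n\nIntroduction\n\n--- Page 2 ---\n\nResults"
	pages := splitByMarkers(content)
	fmt.Printf("%q\n%q\n", pages[1], pages[2])
}
```

When result.Pages is available it is the more direct source; marker parsing is mainly useful for content that has already been stored or passed on as a single string.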
Metadata¶
Aggregated document metadata with format-specific fields.
Signature:
type Metadata struct {
Language *string // Detected language code
Date *string // Extracted document date
Subject *string // Document subject
Format FormatMetadata // Format-specific metadata
ImagePreprocessing *ImagePreprocessingMetadata // OCR preprocessing info
JSONSchema json.RawMessage // JSON Schema if available
Error *ErrorMetadata // Error info for batch operations
Additional map[string]json.RawMessage // Custom/additional fields
}
Access format-specific metadata:
fmt.Println("Format type:", result.Metadata.FormatType())
// Pointer fields are optional, so check them before dereferencing.
if pdfMeta, ok := result.Metadata.PdfMetadata(); ok {
	if pdfMeta.Title != nil {
		fmt.Printf("Title: %s\n", *pdfMeta.Title)
	}
	if pdfMeta.PageCount != nil {
		fmt.Printf("Pages: %d\n", *pdfMeta.PageCount)
	}
	if len(pdfMeta.Authors) > 0 {
		fmt.Printf("Author: %s\n", *pdfMeta.Authors[0])
	}
}
if excelMeta, ok := result.Metadata.ExcelMetadata(); ok {
	fmt.Printf("Sheets: %d\n", excelMeta.SheetCount)
	fmt.Printf("Sheet names: %v\n", excelMeta.SheetNames)
}
if htmlMeta, ok := result.Metadata.HTMLMetadata(); ok {
	if htmlMeta.Title != nil {
		fmt.Printf("Page title: %s\n", *htmlMeta.Title)
	}
	if htmlMeta.OGImage != nil {
		fmt.Printf("OG image: %s\n", *htmlMeta.OGImage)
	}
}
Table¶
Extracted table structure.
Signature:
type Table struct {
Cells [][]string // 2D cell array [row][col]
Markdown string // Markdown representation
PageNumber int // Page number (PDF/Image documents)
}
Example:
for tableIdx, table := range result.Tables {
fmt.Printf("Table %d (page %d):\n", tableIdx, table.PageNumber)
for _, row := range table.Cells {
fmt.Println(strings.Join(row, " | "))
}
fmt.Println("Markdown:", table.Markdown)
}
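Because Cells is a plain [row][col] grid of strings, it can be handed straight to the standard library's CSV writer. The following is a minimal sketch; tableToCSV is a hypothetical helper and the sample cells are made up for illustration.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"log"
)

// tableToCSV encodes a [row][col] cell grid as CSV text.
// Cells containing commas, quotes, or newlines are quoted by encoding/csv.
func tableToCSV(cells [][]string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	// WriteAll writes every record and flushes the writer.
	if err := w.WriteAll(cells); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Illustrative data in the same shape as Table.Cells.
	cells := [][]string{
		{"Region", "Revenue"},
		{"EMEA", "1,200"},
	}
	out, err := tableToCSV(cells)
	if err != nil {
		log.Fatalf("csv encoding failed: %v", err)
	}
	fmt.Print(out)
}
```

In real code the argument would be table.Cells from a result; the Markdown field remains the simpler choice when the table is only going to be displayed.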
Chunk¶
Text chunk with optional embeddings and metadata.
Signature:
type Chunk struct {
Content string // Chunk text
Embedding []float32 // Embedding vector (if enabled)
Metadata ChunkMetadata // Chunk positioning
}
type ChunkMetadata struct {
ByteStart int // UTF-8 byte offset (inclusive)
ByteEnd int // UTF-8 byte offset (exclusive)
CharCount int // Number of characters in chunk
TokenCount *int // Token count (if available)
FirstPage *int // First page this chunk appears on (1-indexed)
LastPage *int // Last page this chunk appears on (1-indexed)
ChunkIndex int // Index in chunk sequence
TotalChunks int // Total number of chunks
}
Fields:
- ByteStart (int): UTF-8 byte offset in content (inclusive)
- ByteEnd (int): UTF-8 byte offset in content (exclusive)
- CharCount (int): Number of characters in chunk
- TokenCount (*int): Estimated token count (if configured)
- FirstPage (*int): First page this chunk appears on (1-indexed, only when page boundaries available)
- LastPage (*int): Last page this chunk appears on (1-indexed, only when page boundaries available)
Page tracking: When PageStructure.Boundaries is available and chunking is enabled, FirstPage and LastPage are automatically calculated based on byte offsets.
Example:
for _, chunk := range result.Chunks {
fmt.Printf("Chunk %d/%d\n", chunk.Metadata.ChunkIndex, chunk.Metadata.TotalChunks)
fmt.Printf("Content: %s...\n", chunk.Content[:min(50, len(chunk.Content))])
fmt.Printf("Bytes: [%d:%d], %d chars\n", chunk.Metadata.ByteStart, chunk.Metadata.ByteEnd, chunk.Metadata.CharCount)
if chunk.Metadata.TokenCount != nil {
fmt.Printf("Tokens: %d\n", *chunk.Metadata.TokenCount)
}
// Show page information if available
if chunk.Metadata.FirstPage != nil && chunk.Metadata.LastPage != nil {
first := *chunk.Metadata.FirstPage
last := *chunk.Metadata.LastPage
if first == last {
fmt.Printf("Page: %d\n", first)
} else {
fmt.Printf("Pages: %d-%d\n", first, last)
}
}
if len(chunk.Embedding) > 0 {
fmt.Printf("Embedding dim: %d\n", len(chunk.Embedding))
fmt.Printf("First 5 values: %v\n", chunk.Embedding[:min(5, len(chunk.Embedding))])
}
}
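ByteStart and ByteEnd are byte offsets, while CharCount counts characters, so the two differ for any non-ASCII text. The sketch below shows the distinction with a simplified stand-in type (chunkSpan) and a hypothetical helper (sliceChunk); neither is part of the Kreuzberg package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// chunkSpan mirrors only the two offset fields of ChunkMetadata (stand-in type).
type chunkSpan struct {
	ByteStart int // inclusive UTF-8 byte offset
	ByteEnd   int // exclusive UTF-8 byte offset
}

// sliceChunk returns the text covered by the span and its character count.
// Go strings index by byte, so the offsets can be used directly for slicing.
func sliceChunk(content string, s chunkSpan) (string, int, error) {
	if s.ByteStart < 0 || s.ByteEnd > len(content) || s.ByteStart > s.ByteEnd {
		return "", 0, fmt.Errorf("span [%d:%d] out of range for %d bytes",
			s.ByteStart, s.ByteEnd, len(content))
	}
	text := content[s.ByteStart:s.ByteEnd]
	return text, utf8.RuneCountInString(text), nil
}

func main() {
	content := "Grüße aus Berlin"

	// "Grüße" is 5 characters but 7 bytes: ü and ß take two bytes each in UTF-8.
	text, chars, err := sliceChunk(content, chunkSpan{ByteStart: 0, ByteEnd: 7})
	fmt.Println(text, chars, err)
}
```

Offsets that fall inside a multi-byte character would produce invalid UTF-8 when sliced, so it is safer to use the offsets exactly as reported rather than adjusting them by character counts.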
ExtractedImage¶
Image extracted from document with optional OCR results.
Signature:
type ExtractedImage struct {
Data []byte // Raw image bytes
Format string // Image format: "jpeg", "png", "webp"
ImageIndex int // Index in images list
PageNumber *int // Page number (if applicable)
Width *uint32 // Image width in pixels
Height *uint32 // Image height in pixels
Colorspace *string // Colorspace (sRGB, CMYK, etc.)
BitsPerComponent *uint32 // Bits per color component
IsMask bool // Is image a mask?
Description *string // Image description/alt text
OCRResult *ExtractionResult // Nested OCR extraction
}
Example:
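for imgIdx, img := range result.Images {
	// Width and Height are optional pointers; print dimensions only when both are set.
	if img.Width != nil && img.Height != nil {
		fmt.Printf("Image %d: %s, %dx%d\n", imgIdx, img.Format, *img.Width, *img.Height)
	}
	filename := fmt.Sprintf("image_%d.%s", imgIdx, img.Format)
	if err := os.WriteFile(filename, img.Data, 0o644); err != nil {
		log.Printf("failed to write %s: %v", filename, err)
	}
	if img.OCRResult != nil {
		fmt.Printf("Image %d OCR: %s\n", imgIdx, img.OCRResult.Content)
	}
}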
Error Handling¶
Error Types¶
Kreuzberg defines a type hierarchy of errors via the KreuzbergError interface:
type KreuzbergError interface {
error
Kind() ErrorKind
}
type ErrorKind string
const (
ErrorKindUnknown ErrorKind = "unknown"
ErrorKindIO ErrorKind = "io"
ErrorKindValidation ErrorKind = "validation"
ErrorKindParsing ErrorKind = "parsing"
ErrorKindOCR ErrorKind = "ocr"
ErrorKindCache ErrorKind = "cache"
ErrorKindImageProcessing ErrorKind = "image_processing"
ErrorKindSerialization ErrorKind = "serialization"
ErrorKindMissingDependency ErrorKind = "missing_dependency"
ErrorKindPlugin ErrorKind = "plugin"
ErrorKindUnsupportedFormat ErrorKind = "unsupported_format"
ErrorKindRuntime ErrorKind = "runtime"
)
Error type classes:
- ValidationError: Input validation failed (empty paths, missing MIME types)
- ParsingError: Document parsing failed (malformed file, unsupported format)
- OCRError: OCR backend failure (library missing, invalid language)
- CacheError: Cache operation failed
- ImageProcessingError: Image manipulation failed
- SerializationError: JSON encoding/decoding failed
- MissingDependencyError: Required library not found (Tesseract, EasyOCR, etc.)
- PluginError: Plugin registration or execution failed
- UnsupportedFormatError: MIME type not supported
- IOError: File I/O failure
- RuntimeError: Unexpected runtime failure (lock poisoning, etc.)
Error Classification¶
Errors are automatically classified based on native error messages. Use errors.As() and errors.Is() to handle specific error types:
import (
"errors"
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
if err != nil {
var parsingErr *kreuzberg.ParsingError
if errors.As(err, &parsingErr) {
log.Printf("Parsing failed: %v\n", parsingErr)
return
}
var missingDep *kreuzberg.MissingDependencyError
if errors.As(err, &missingDep) {
log.Printf("Missing dependency: %s\n", missingDep.Dependency)
return
}
log.Printf("Extraction failed: %v\n", err)
}
Error Unwrapping¶
All Kreuzberg errors support error unwrapping via errors.Unwrap():
result, err := kreuzberg.ExtractFileSync("doc.pdf", nil)
if err != nil {
rootErr := errors.Unwrap(err)
if rootErr != nil {
log.Printf("Root cause: %v\n", rootErr)
}
var krErr kreuzberg.KreuzbergError
if errors.As(err, &krErr) { // unlike a type assertion, errors.As also matches wrapped errors
log.Printf("Error kind: %v\n", krErr.Kind())
}
}
Error Handling Examples¶
Handle file not found:
result, err := kreuzberg.ExtractFileSync("missing.pdf", nil)
if err != nil {
var ioErr *kreuzberg.IOError
if errors.As(err, &ioErr) {
log.Println("File not found or unreadable")
return
}
log.Fatalf("unexpected error: %v\n", err)
}
Handle missing OCR dependency:
cfg := &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: stringPtr("eng"),
},
}
result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
var missingDep *kreuzberg.MissingDependencyError
if errors.As(err, &missingDep) {
log.Printf("Install %s to use OCR\n", missingDep.Dependency)
return
}
log.Fatalf("extraction failed: %v\n", err)
}
Batch error handling:
results, err := kreuzberg.BatchExtractFilesSync(files, nil)
if err != nil {
log.Fatalf("batch setup failed: %v\n", err)
}
for i, result := range results {
if result == nil {
log.Printf("File %d: extraction failed (nil result)\n", i)
continue
}
if result.Metadata.Error != nil {
log.Printf("File %d: %s - %s\n", i, result.Metadata.Error.ErrorType, result.Metadata.Error.Message)
continue
}
if !result.Success {
log.Printf("File %d: extraction unsuccessful\n", i)
continue
}
log.Printf("File %d: success (%d chars)\n", i, len(result.Content))
}
Advanced Usage¶
MIME Type Detection¶
Detect MIME type from file extension or content:
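The binding's own detection call is not reproduced in this section. As a standard-library stand-in (explicitly not Kreuzberg's API), content sniffing can be sketched with net/http and mime; sniffMime is a hypothetical helper that returns a bare media type suitable for the mimeType parameter of ExtractBytesSync.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
)

// sniffMime guesses a MIME type from the leading bytes of a document.
// http.DetectContentType inspects at most the first 512 bytes and may append
// parameters such as "; charset=utf-8", which are stripped here.
func sniffMime(data []byte) string {
	detected := http.DetectContentType(data)
	mediaType, _, err := mime.ParseMediaType(detected)
	if err != nil {
		return detected
	}
	return mediaType
}

func main() {
	fmt.Println(sniffMime([]byte("%PDF-1.7\n")))
	fmt.Println(sniffMime([]byte("<html><body>hi</body></html>")))
	fmt.Println(sniffMime([]byte("plain words")))
}
```

Standard-library sniffing is coarse: Office formats such as DOCX are ZIP containers and are reported as application/zip, so the file extension or the library's own detection is the better signal for those.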
CGO-Specific Patterns¶
Memory Management¶
cgo does not manage C memory automatically, but callers of this package never need to free anything themselves: Kreuzberg handles C pointer cleanup internally via defer statements, so both calls below can be used like ordinary Go functions:
fileResult, err := kreuzberg.ExtractFileSync("doc.pdf", nil)
bytesResult, err := kreuzberg.ExtractBytesSync(data, "application/pdf", nil)
Static Linking Configuration¶
Go binaries are statically linked against the FFI library, so no runtime library paths are needed. Configuration is done at build time:
Monorepo Development:
# Build FFI library first
cargo build -p kreuzberg-ffi --release
# Go automatically finds target/release/libkreuzberg_ffi.a
go build -v ./...
# Run directly - no environment variables needed
./myapp
External Projects:
# Set CGO_LDFLAGS to point to the static library
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build
# Run directly - no runtime dependencies
./myapp
Configuration as JSON¶
Internally, ExtractionConfig is serialized to JSON and passed to the C FFI:
cfg := &kreuzberg.ExtractionConfig{
UseCache: boolPtr(true),
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: stringPtr("eng"),
},
}
result, err := kreuzberg.ExtractFileSync("doc.pdf", cfg)
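Pointer fields matter for this serialization step because they let "not set" be distinguished from an explicit false or zero. The sketch below uses simplified stand-in structs; the JSON key names (use_cache, force_ocr, ocr, ...) are assumptions for illustration and are not taken from the binding's source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// ocrConfig and extractionConfig are simplified stand-ins.
// The JSON tags are assumed, not copied from the real package.
type ocrConfig struct {
	Backend  string  `json:"backend"`
	Language *string `json:"language,omitempty"`
}

type extractionConfig struct {
	UseCache *bool      `json:"use_cache,omitempty"`
	ForceOCR *bool      `json:"force_ocr,omitempty"`
	OCR      *ocrConfig `json:"ocr,omitempty"`
}

// encodeConfig renders the config as the JSON string that would cross the FFI boundary.
func encodeConfig(cfg extractionConfig) (string, error) {
	data, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	on, off, lang := true, false, "eng"

	// Nil pointers are omitted, so the native side can apply its defaults.
	s, err := encodeConfig(extractionConfig{
		UseCache: &on,
		OCR:      &ocrConfig{Backend: "tesseract", Language: &lang},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(s) // {"use_cache":true,"ocr":{"backend":"tesseract","language":"eng"}}

	// A non-nil pointer to false is still emitted: an explicit override.
	s, err = encodeConfig(extractionConfig{ForceOCR: &off})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(s) // {"force_ocr":false}
}
```

With plain bool fields and omitempty, an explicit false would be dropped and become indistinguishable from "use the default", which is the usual motivation for pointer-typed options.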
Custom Post-Processors¶
Register custom post-processing logic in Go:
package main
import (
"C"
"encoding/json"
"log"
"strings" // needed for strings.ToUpper in myCustomProcessor
"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)
//export myCustomProcessor
func myCustomProcessor(resultJSON *C.char) *C.char {
jsonStr := C.GoString(resultJSON)
var result kreuzberg.ExtractionResult
if err := json.Unmarshal([]byte(jsonStr), &result); err != nil {
errMsg := C.CString("failed to parse JSON")
return errMsg
}
result.Content = strings.ToUpper(result.Content)
modified, _ := json.Marshal(result)
return C.CString(string(modified))
}
func init() {
err := kreuzberg.RegisterPostProcessor(
"go-uppercase",
100, // priority
(C.PostProcessorCallback)(C.myCustomProcessor),
)
if err != nil {
log.Fatalf("failed to register post-processor: %v\n", err)
}
}
func main() {
cfg := &kreuzberg.ExtractionConfig{
Postprocessor: &kreuzberg.PostProcessorConfig{
EnabledProcessors: []string{"go-uppercase"},
},
}
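result, err := kreuzberg.ExtractFileSync("doc.pdf", cfg)
if err != nil {
	log.Fatalf("extraction failed: %v", err)
}
log.Printf("processed content: %d chars", len(result.Content))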
}
Custom Validators¶
Validate extraction results:
//export myValidator
func myValidator(resultJSON *C.char) *C.char {
jsonStr := C.GoString(resultJSON)
var result kreuzberg.ExtractionResult
if err := json.Unmarshal([]byte(jsonStr), &result); err != nil {
	return C.CString("failed to parse result JSON")
}
if len(result.Content) == 0 {
errMsg := C.CString("content is empty")
return errMsg
}
return nil
}
func init() {
kreuzberg.RegisterValidator(
"content-not-empty",
50,
(C.ValidatorCallback)(C.myValidator),
)
}
Custom OCR Backends¶
Register a custom OCR backend:
//export customOCR
func customOCR(imageData *C.uint8_t, width C.uint32_t, height C.uint32_t, lang *C.char) *C.char {
result := kreuzberg.ExtractionResult{
Content: "extracted text from custom OCR",
MimeType: "text/plain",
Success: true,
}
data, _ := json.Marshal(result)
return C.CString(string(data))
}
func init() {
kreuzberg.RegisterOCRBackend(
"custom-ocr",
(C.OcrBackendCallback)(C.customOCR),
)
}
Plugin Management¶
List and manage registered plugins:
validators, err := kreuzberg.ListValidators()
if err == nil {
fmt.Printf("Validators: %v\n", validators)
}
processors, err := kreuzberg.ListPostProcessors()
if err == nil {
fmt.Printf("Post-processors: %v\n", processors)
}
backends, err := kreuzberg.ListOCRBackends()
if err == nil {
fmt.Printf("OCR backends: %v\n", backends)
}
if err := kreuzberg.ClearValidators(); err != nil {
log.Fatalf("failed to clear validators: %v\n", err)
}
if err := kreuzberg.UnregisterValidator("my-validator"); err != nil {
log.Fatalf("failed to unregister: %v\n", err)
}
Performance Tips¶
- Batch Processing: Use BatchExtractFilesSync() for multiple files to leverage internal optimizations
- Context Timeouts: Set realistic timeouts; OCR can be slow on large documents
- Caching: Enable UseCache: boolPtr(true) to cache frequently extracted documents
- Static Linking: Binaries are self-contained after build; no runtime library paths needed
- Configuration Reuse: Create and reuse ExtractionConfig objects across multiple calls
- Goroutines: Use ExtractFile()/ExtractBytes() variants in goroutines for concurrency
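The goroutine tip can be sketched as a bounded fan-out: run one extraction per goroutine, but cap how many are in flight. In the sketch below, extractFunc, fileResult, and extractAll are hypothetical names; in real code the extract argument would wrap a call such as kreuzberg.ExtractFile with a context.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// extractFunc is a stand-in for a per-file extraction call.
type extractFunc func(path string) (string, error)

type fileResult struct {
	Path    string
	Content string
	Err     error
}

// extractAll runs extract for every path with at most limit calls in flight.
// Results keep the order of the input paths.
func extractAll(paths []string, limit int, extract extractFunc) []fileResult {
	if limit < 1 {
		limit = 1
	}
	results := make([]fileResult, len(paths))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup

	for i, p := range paths {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit goroutines are already running
		go func(i int, p string) {
			defer wg.Done()
			defer func() { <-sem }()
			content, err := extract(p)
			// Each goroutine writes only its own slot, so no mutex is needed.
			results[i] = fileResult{Path: p, Content: content, Err: err}
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	// Fake extractor used only for this illustration.
	fake := func(path string) (string, error) {
		if strings.HasSuffix(path, ".bad") {
			return "", fmt.Errorf("cannot parse %s", path)
		}
		return "text of " + path, nil
	}
	for _, r := range extractAll([]string{"a.pdf", "b.bad", "c.docx"}, 2, fake) {
		fmt.Println(r.Path, r.Content, r.Err)
	}
}
```

A limit is worth having because OCR-heavy extractions are CPU and memory intensive; for plain lists of files the batch functions remain the simpler option.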
Troubleshooting¶
Static Library Not Found¶
Error: cannot find -lkreuzberg_ffi or undefined reference to 'kreuzberg_...'
Solution:
# Verify static library exists
ls -la target/release/libkreuzberg_ffi.a
# For monorepo development, just build the FFI crate:
cargo build -p kreuzberg-ffi --release
# For external projects, provide the path via CGO_LDFLAGS:
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build
The binary will be statically linked and have no runtime dependencies on Kreuzberg libraries.
CGO Compilation Errors¶
Error: error: kreuzberg.h: No such file or directory
Solution:
Ensure kreuzberg-ffi is built before building your Go module:
cargo build -p kreuzberg-ffi --release
Missing OCR Library¶
Error: MissingDependencyError: Missing dependency: tesseract
Solution:
Install Tesseract or use a different OCR backend:
# macOS
brew install tesseract
# Debian/Ubuntu
apt-get install tesseract-ocr
# Or use EasyOCR/PaddleOCR (Python packages)
Context Timeout on Large Documents¶
Issue: Extraction times out before completion
Solution:
Increase the timeout, and leave ForceOCR off so that documents with extractable text are not sent through OCR:
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
cfg := &kreuzberg.ExtractionConfig{
ForceOCR: boolPtr(false),
}
result, err := kreuzberg.ExtractFile(ctx, "large.pdf", cfg)
Testing¶
Run the test suite:
# Unit tests (from packages/go)
task go:test
# Lint (gofmt + golangci-lint)
task go:lint
# E2E tests (from e2e/go, auto-generated from fixtures)
task e2e:go:verify
# Manual test (build FFI library first)
cargo build -p kreuzberg-ffi --release
go test -v ./packages/go/v4
Helper Functions¶
Add these utility functions to your code:
func stringPtr(s string) *string {
return &s
}
func boolPtr(b bool) *bool {
return &b
}
func intPtr(i int) *int {
return &i
}
func float64Ptr(f float64) *float64 {
return &f
}
func uint32Ptr(u uint32) *uint32 {
return &u
}
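Since the binding requires a Go version with generics, the five helpers above can also be collapsed into one. The ptr function below is a hypothetical helper for your own code, not something exported by the Kreuzberg package.

```go
package main

import "fmt"

// ptr returns a pointer to a copy of v, for filling optional config fields.
func ptr[T any](v T) *T {
	return &v
}

func main() {
	useCache := ptr(true)  // *bool
	psm := ptr(3)          // *int
	lang := ptr("eng")     // *string
	minConf := ptr(0.5)    // *float64
	width := ptr(uint32(800)) // non-default numeric types need an explicit conversion

	fmt.Println(*useCache, *psm, *lang, *minConf, *width)
}
```

The type-specific helpers stay slightly more readable for untyped constants such as uint32 values, so either style is reasonable.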
Related Resources¶
- Source: packages/go/v4/ (Go binding implementation)
- FFI Bridge: crates/kreuzberg-ffi/ (C FFI layer)
- Rust Core: crates/kreuzberg/ (extraction logic)
- E2E Tests: e2e/go/ (auto-generated test fixtures)
- CI: .github/workflows/go-test.yml (test pipeline)