LLM Integration v4.8.0

Kreuzberg integrates with 146 LLM providers (including local inference engines) via liter-llm for three capabilities: VLM OCR, structured extraction, and provider-hosted embeddings.

Feature gate

Requires the llm Cargo feature. Not included in the default feature set.

VLM OCR

Use vision-language models as an OCR backend. The document page is rendered as an image and sent to the VLM, which returns the extracted text.

When to Use

  • Low-quality scanned documents where traditional OCR struggles
  • Handwritten text recognition
  • Arabic, Farsi, and other scripts with poor Tesseract/PaddleOCR support
  • Complex layouts where traditional OCR fails (mixed tables, forms, diagrams)
  • When you need higher accuracy and can accept higher latency and API costs

Configuration

Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, LlmConfig

async def main() -> None:
    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(
            backend="vlm",
            vlm_config=LlmConfig(model="openai/gpt-4o-mini"),
        ),
    )
    result = await extract_file("scan.pdf", config=config)
    print(result.content)

asyncio.run(main())
TypeScript
import { extractFileSync } from '@kreuzberg/node';

const config = {
    forceOcr: true,
    ocr: {
        backend: 'vlm',
        vlmConfig: {
            model: 'openai/gpt-4o-mini',
        },
    },
};

const result = extractFileSync('scan.pdf', null, config);
console.log(result.content);
Rust
use kreuzberg::{extract_file, ExtractionConfig, LlmConfig, OcrConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        force_ocr: true,
        ocr: Some(OcrConfig {
            backend: "vlm".to_string(),
            vlm_config: Some(LlmConfig {
                model: "openai/gpt-4o-mini".to_string(),
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };
    let result = extract_file("scan.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
Terminal
kreuzberg extract scan.pdf --force-ocr true \
  --vlm-model openai/gpt-4o-mini
kreuzberg.toml
force_ocr = true

[ocr]
backend = "vlm"

[ocr.vlm_config]
model = "openai/gpt-4o-mini"
Terminal
export KREUZBERG_VLM_OCR_MODEL=openai/gpt-4o-mini
export OPENAI_API_KEY=sk-...

Custom VLM Prompt

Override the default prompt template for VLM OCR:

Python
from kreuzberg import ExtractionConfig, OcrConfig, LlmConfig

config = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(
        backend="vlm",
        vlm_config=LlmConfig(model="openai/gpt-4o-mini"),
        vlm_prompt="Extract all text from this document image. Preserve formatting.",
    ),
)

Supported Providers

Any liter-llm vision-capable provider works as a VLM OCR backend:

| Provider | Example Model |
| --- | --- |
| OpenAI | openai/gpt-4o, openai/gpt-4o-mini |
| Anthropic | anthropic/claude-sonnet-4-20250514 |
| Google | google/gemini-2.0-flash |
| Groq | groq/llama-3.2-90b-vision-preview |
| Ollama (local) | ollama/llama3.2-vision |
| LM Studio (local) | lmstudio/llava-1.5 |
| vLLM (local) | vllm/llava-next |

Structured Extraction

Extract structured JSON data from documents by providing a JSON schema. The document is first extracted as text, then sent to an LLM with the schema to produce conforming output.

Basic Usage

Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig, StructuredExtractionConfig, LlmConfig

async def main() -> None:
    config = ExtractionConfig(
        structured_extraction=StructuredExtractionConfig(
            schema={
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "authors": {"type": "array", "items": {"type": "string"}},
                    "date": {"type": "string"},
                },
                "required": ["title", "authors", "date"],
                "additionalProperties": False,
            },
            llm=LlmConfig(model="openai/gpt-4o-mini"),
            strict=True,
        ),
    )
    result = await extract_file("paper.pdf", config=config)
    print(result.structured_output)
    # {"title": "...", "authors": ["..."], "date": "..."}

asyncio.run(main())
TypeScript
import { extractFileSync } from '@kreuzberg/node';

const config = {
    structuredExtraction: {
        schema: {
            type: 'object',
            properties: {
                title: { type: 'string' },
                authors: { type: 'array', items: { type: 'string' } },
                date: { type: 'string' },
            },
            required: ['title', 'authors', 'date'],
            additionalProperties: false,
        },
        llm: {
            model: 'openai/gpt-4o-mini',
        },
        strict: true,
    },
};

const result = extractFileSync('paper.pdf', null, config);
console.log(result.structuredOutput);
Rust
use kreuzberg::{
    extract_file, ExtractionConfig, LlmConfig, StructuredExtractionConfig,
};
use serde_json::json;

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        structured_extraction: Some(StructuredExtractionConfig {
            schema: json!({
                "type": "object",
                "properties": {
                    "title": { "type": "string" },
                    "authors": { "type": "array", "items": { "type": "string" } },
                    "date": { "type": "string" }
                },
                "required": ["title", "authors", "date"],
                "additionalProperties": false
            }),
            llm: LlmConfig {
                model: "openai/gpt-4o-mini".to_string(),
                ..Default::default()
            },
            strict: true,
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file("paper.pdf", None, &config).await?;
    if let Some(structured) = &result.structured_output {
        println!("{}", structured);
    }
    Ok(())
}
Terminal
kreuzberg extract-structured paper.pdf \
  --schema schema.json \
  --model openai/gpt-4o-mini \
  --strict
kreuzberg.toml
[structured_extraction]
schema_name = "paper_metadata"
strict = true

[structured_extraction.schema]
type = "object"

[structured_extraction.schema.properties.title]
type = "string"

[structured_extraction.schema.properties.date]
type = "string"

[structured_extraction.llm]
model = "openai/gpt-4o-mini"

Custom Prompts (Jinja2)

Override the default extraction prompt with a Jinja2 template:

Python
from kreuzberg import ExtractionConfig, StructuredExtractionConfig, LlmConfig

config = ExtractionConfig(
    structured_extraction=StructuredExtractionConfig(
        schema={"type": "object", "properties": {"title": {"type": "string"}}},
        llm=LlmConfig(model="openai/gpt-4o-mini"),
        prompt=(
            "Analyze this document and extract key metadata.\n\n"
            "Document:\n{{ content }}\n\n"
            "Schema: {{ schema }}"
        ),
    ),
)

Available template variables:

| Variable | Description |
| --- | --- |
| {{ content }} | The extracted document text |
| {{ schema }} | The JSON schema as a formatted string |
| {{ schema_name }} | The schema name (default: "extraction") |
| {{ schema_description }} | The schema description (may be empty) |
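To see what a rendered prompt looks like, here is a minimal stand-in for the substitution step. The real renderer is Jinja2 (which also supports filters and conditionals); the document and schema values below are made up for illustration:

```python
import json

# Illustrative values standing in for what Kreuzberg supplies at render time.
variables = {
    "content": "Invoice #42 from ACME Corp, total $1,234.56",
    "schema": json.dumps(
        {"type": "object", "properties": {"total": {"type": "number"}}}, indent=2
    ),
    "schema_name": "extraction",
    "schema_description": "",
}

template = (
    "Extract data matching the {{ schema_name }} schema.\n\n"
    "Document:\n{{ content }}\n\n"
    "Schema: {{ schema }}"
)

# Naive replacement; Jinja2 does this (and much more) for real templates.
prompt = template
for name, value in variables.items():
    prompt = prompt.replace("{{ " + name + " }}", value)

print(prompt.splitlines()[0])  # Extract data matching the extraction schema.
```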

Cross-Provider Compatibility

Structured extraction handles provider differences automatically:

  • OpenAI: Full strict mode with additionalProperties enforcement
  • Anthropic/Gemini: additionalProperties automatically stripped (not supported by these providers)
  • All providers: Markdown code fence wrapping in responses is automatically handled
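The fence handling amounts to unwrapping a Markdown code fence around the model's reply before parsing the JSON. A rough sketch of that step (not Kreuzberg's actual code):

```python
import json
import re

FENCE = "`" * 3  # the three-backtick Markdown fence delimiter

def strip_code_fence(response: str) -> str:
    """Unwrap a Markdown code fence around the model's JSON, if present."""
    pattern = r"^\s*" + FENCE + r"[a-zA-Z]*\s*\n(.*?)\n" + FENCE + r"\s*$"
    match = re.match(pattern, response, re.DOTALL)
    return match.group(1) if match else response

raw = FENCE + 'json\n{"title": "Invoice", "total": 42}\n' + FENCE
data = json.loads(strip_code_fence(raw))
print(data["total"])  # 42
```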

Strict Mode

When strict=True, the LLM is instructed to produce output that exactly matches the schema. This enables OpenAI's structured output mode and adds validation on the response.
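Even with strict mode, it can be worth re-checking the response on your side. A minimal required-keys check (a proper validator such as jsonschema would also cover types; this helper is purely illustrative):

```python
def missing_required(output: dict, schema: dict) -> list[str]:
    """Return the required properties absent from an LLM response."""
    return [key for key in schema.get("required", []) if key not in output]

schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "date": {"type": "string"}},
    "required": ["title", "date"],
}
print(missing_required({"title": "Q3 Report"}, schema))  # ['date']
```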

LLM Embeddings

Use provider-hosted embedding models instead of local ONNX models. Useful when you want to match the embedding model used by your vector database or when local ONNX models are not available.

Configuration

Python
import asyncio
from kreuzberg import embed, EmbeddingConfig, EmbeddingModelType, LlmConfig

async def main() -> None:
    config = EmbeddingConfig(
        model=EmbeddingModelType.llm(
            LlmConfig(model="openai/text-embedding-3-small")
        ),
        normalize=True,
    )
    embeddings = await embed(["Hello world"], config=config)
    print(len(embeddings[0]))  # 1536

asyncio.run(main())
TypeScript
import { embedSync } from '@kreuzberg/node';

const embeddings = embedSync(['Hello world'], {
  model: {
    modelType: 'llm',
    value: 'openai/text-embedding-3-small',
  },
  normalize: true,
});
console.log(embeddings[0].length); // 1536
Rust
use kreuzberg::{embed_texts, EmbeddingConfig, EmbeddingModelType, LlmConfig};

fn main() -> kreuzberg::Result<()> {
    let config = EmbeddingConfig {
        model: EmbeddingModelType::Llm {
            llm: LlmConfig {
                model: "openai/text-embedding-3-small".to_string(),
                ..Default::default()
            },
        },
        normalize: true,
        ..Default::default()
    };
    let embeddings = embed_texts(&["Hello world"], &config)?;
    println!("{}", embeddings[0].len()); // 1536
    Ok(())
}
Terminal
kreuzberg embed \
  --provider llm \
  --model openai/text-embedding-3-small \
  --text "Hello world"

Available Models

| Model | Dimensions | Provider |
| --- | --- | --- |
| openai/text-embedding-3-small | 1536 | OpenAI |
| openai/text-embedding-3-large | 3072 | OpenAI |
| mistral/mistral-embed | 1024 | Mistral |
| Any liter-llm embedding-capable provider | Varies | Various |
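Once you have vectors back from embed(), comparing them is plain math. With normalize=True the vectors are unit-length, so the dot product alone already gives cosine similarity; the toy 3-dimensional vectors below stand in for real 1536-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for provider-returned embeddings.
query = [0.1, 0.9, 0.2]
doc = [0.2, 0.8, 0.1]
print(round(cosine_similarity(query, doc), 3))  # 0.987
```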

Local LLM Support

v4.8.0

Kreuzberg supports local LLM inference engines via liter-llm's built-in provider routing. No API key required — just point to your local server.

Supported Local Engines

| Engine | Prefix | Default URL | Install |
| --- | --- | --- | --- |
| Ollama | ollama/ | http://localhost:11434/v1 | brew install ollama |
| LM Studio | lmstudio/ | http://localhost:1234/v1 | Desktop app |
| vLLM | vllm/ | http://localhost:8000/v1 | pip install vllm |
| llama.cpp | llamacpp/ | http://localhost:8080/v1 | Build from source |
| LocalAI | localai/ | http://localhost:8080/v1 | Docker |
| llamafile | llamafile/ | http://localhost:8080/v1 | Single binary |

Example: Ollama

Terminal
# Pull a vision-capable model
ollama pull llama3.2-vision

# Use it for VLM OCR (no API key needed)
kreuzberg extract scan.pdf --force-ocr true \
  --vlm-model ollama/llama3.2-vision

# Use it for structured extraction
kreuzberg extract-structured doc.pdf \
  --schema schema.json \
  --model ollama/llama3.2

# Use it for embeddings
kreuzberg embed --provider llm \
  --model ollama/all-minilm \
  --text "Hello world"
Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig, StructuredExtractionConfig, LlmConfig

async def main() -> None:
    config = ExtractionConfig(
        structured_extraction=StructuredExtractionConfig(
            schema={"type": "object", "properties": {"title": {"type": "string"}}},
            llm=LlmConfig(model="ollama/llama3.2"),  # No api_key needed
        ),
    )
    result = await extract_file("doc.pdf", config=config)
    print(result.structured_output)

asyncio.run(main())
kreuzberg.toml
[structured_extraction.llm]
model = "ollama/llama3.2"
# No api_key needed for local providers

Custom Base URL

If your local server runs on a non-default port, set base_url:

Python
from kreuzberg import LlmConfig

config = LlmConfig(model="ollama/llama3.2", base_url="http://localhost:11435/v1")

API Key Configuration

API keys can be set via (in order of precedence):

  1. api_key field in LlmConfig — highest priority, per-request
  2. Provider standard env vars (OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, etc.)
  3. Kreuzberg-specific env var (KREUZBERG_LLM_API_KEY) — used as fallback for any provider

Local providers skip API key lookup

Local inference engines (Ollama, LM Studio, vLLM, llama.cpp, LocalAI, llamafile) do not require an API key. If you use a local provider prefix (for example, ollama/), the API key fields are ignored.

Python
from kreuzberg import LlmConfig

# Explicit API key
config = LlmConfig(model="openai/gpt-4o", api_key="sk-...")

# Custom base URL (e.g., Azure OpenAI, local proxy)
config = LlmConfig(
    model="openai/gpt-4o",
    base_url="https://my-proxy.example.com/v1",
)

LlmConfig Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | required | Provider/model in liter-llm format (for example, "openai/gpt-4o") |
| api_key | str \| None | None | API key (falls back to env vars) |
| base_url | str \| None | None | Custom endpoint URL |
| timeout_secs | int \| None | 60 | Request timeout in seconds |
| max_retries | int \| None | 3 | Maximum retry attempts |
| temperature | float \| None | None | Sampling temperature |
| max_tokens | int \| None | None | Maximum tokens to generate |
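For reference, the same fields can also be set in kreuzberg.toml. A sketch with every field populated (the values are illustrative only):

```toml
[structured_extraction.llm]
model = "openai/gpt-4o"                       # required, liter-llm format
api_key = "sk-..."                            # omit to fall back to env vars
base_url = "https://my-proxy.example.com/v1"  # custom endpoint (proxy, Azure)
timeout_secs = 120
max_retries = 5
temperature = 0.0
max_tokens = 4096
```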

REST API

Structured Extraction

POST /extract-structured — multipart form with file + schema + model configuration.

Terminal
curl -X POST http://localhost:4000/extract-structured \
  -F "file=@invoice.pdf" \
  -F 'schema={"type":"object","properties":{"vendor":{"type":"string"},"total":{"type":"number"}}}' \
  -F "model=openai/gpt-4o-mini" \
  -F "strict=true"

MCP Tools

When running Kreuzberg as an MCP server, LLM features are available as tools:

  • extract_structured — extract structured data from a document using a JSON schema
  • embed_text — extended with model parameter for LLM-hosted embeddings
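For example, an MCP client invokes extract_structured through a standard JSON-RPC tools/call request. The argument names below are hypothetical; check your server's tool listing for the exact input schema:

```python
import json

# Hypothetical tools/call payload; the argument names are illustrative only.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "extract_structured",
        "arguments": {
            "file_path": "paper.pdf",
            "schema": {"type": "object", "properties": {"title": {"type": "string"}}},
            "model": "ollama/llama3.2",
        },
    },
}
print(json.dumps(request)[:20])
```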