VLM Image Captions

Caption extracted images with a vision-language model (VLM). A caption is written to ExtractedImage.caption for every image whose area exceeds CaptioningConfig.min_image_area; smaller images are left uncaptioned.

Feature gate

Requires the captioning Cargo feature (included in full), plus liter-llm and a vision-capable provider.

When to Use

You need alt-text for accessibility-compliant exports
You need searchable text descriptions per image to feed into a retrieval pipeline alongside the document body
You need diagrams, charts, or photos described for downstream LLM consumption

When Not to Use

You only need the text that appears inside images: use OCR instead
You're processing high-volume batches where API spend is a concern — captioning calls an LLM per image
Images are mostly decorative or structural elements

Configuration

Python

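import asyncio

from kreuzberg import extract_file, ExtractionConfig, CaptioningConfig, LlmConfig

config = ExtractionConfig(
    captioning=CaptioningConfig(
        llm=LlmConfig(model="openai/gpt-4o-mini"),
    ),
)


async def main() -> None:
    # extract_file is awaited, so it has to run inside a coroutine.
    result = await extract_file("report.pdf", config=config)
    for image in result.images or []:
        # Images skipped by the area filter have no caption.
        if image.caption:
            print(image.caption)


asyncio.run(main())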

TypeScript

import { extractFile } from "@kreuzberg/node";

const result = await extractFile("report.pdf", {
    captioning: {
        llm: { model: "openai/gpt-4o-mini" },
    },
});

for (const image of result.images ?? []) {
    if (image.caption) {
        console.log(image.caption);
    }
}

Rust

use kreuzberg::{extract_file, ExtractionConfig, CaptioningConfig, LlmConfig};

let config = ExtractionConfig {
    captioning: Some(CaptioningConfig {
        llm: LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            ..Default::default()
        },
        prompt: None,
        min_image_area: 1000,
    }),
    ..Default::default()
};
let result = extract_file("report.pdf", None, &config).await?;
for image in &result.images {
    if let Some(caption) = &image.caption {
        println!("{caption}");
    }
}

kreuzberg.toml

[captioning]
min_image_area = 1000

[captioning.llm]
model = "openai/gpt-4o-mini"

Custom Prompt

Override the built-in caption prompt:

Python

from kreuzberg import ExtractionConfig, CaptioningConfig, LlmConfig

config = ExtractionConfig(
    captioning=CaptioningConfig(
        llm=LlmConfig(model="openai/gpt-4o-mini"),
        prompt="Describe this figure in one sentence suitable for alt-text.",
        min_image_area=4000,
    ),
)

Each image is sent together with the prompt in its own VLM request. The model sees only that image plus the prompt, and its response becomes the caption verbatim.

Filtering Small Images

min_image_area is an area in square pixels (width × height). Icons, bullets, and decorative glyphs below the threshold are skipped, and their caption field stays None. The default of 1000 excludes icons up to about 31×31 (961 pixels) but admits typical inline figures; note that a 32×32 icon has an area of 1024 and is therefore not filtered out by the default. Raise the threshold to skip thumbnails; lower it to caption everything.
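The area check can be illustrated with a minimal sketch. The helper name `passes_area_filter` is hypothetical and not part of the library, and the strict "greater than" comparison is an assumption based on the wording "exceeds" in the introduction; the library's behavior exactly at the boundary may differ.

```python
def passes_area_filter(width: int, height: int, min_image_area: int = 1000) -> bool:
    """Hypothetical helper: True when the image would be sent for captioning.

    Assumes the filter compares width * height against min_image_area
    with a strict "exceeds" comparison.
    """
    return width * height > min_image_area


# A 16x16 glyph (256 square pixels) falls below the default threshold.
small_icon = passes_area_filter(16, 16)

# A 300x200 inline figure (60,000 square pixels) is well above it.
inline_figure = passes_area_filter(300, 200)

# With the threshold raised to 100,000, the same figure is skipped.
figure_with_high_threshold = passes_area_filter(300, 200, min_image_area=100_000)
```

Running the numbers for your own smallest meaningful figure is a quick way to pick a threshold before spending API calls.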

Output Shape

{
  "images": [
    {
      "image_kind": "diagram",
      "caption": "A flowchart showing the data ingestion pipeline: source → cleaner → indexer → retrieval API.",
      "bounding_box": { "page": 3, "x": 72, "y": 144, "width": 468, "height": 312 }
    },
    {
      "image_kind": "icon",
      "caption": null
    }
  ]
}
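As a rough illustration of consuming this shape, the sketch below parses a JSON payload like the one above and keeps only the captioned images, for example to build alt-text or retrieval records. The helper name `caption_records` and the record keys are illustrative assumptions, not part of the library.

```python
import json

# Sample payload mirroring the output shape shown above.
payload = """
{
  "images": [
    {
      "image_kind": "diagram",
      "caption": "A flowchart showing the data ingestion pipeline.",
      "bounding_box": { "page": 3, "x": 72, "y": 144, "width": 468, "height": 312 }
    },
    { "image_kind": "icon", "caption": null }
  ]
}
"""


def caption_records(doc: dict) -> list:
    """Hypothetical helper: keep only images that received a caption."""
    records = []
    for image in doc.get("images") or []:
        caption = image.get("caption")
        if not caption:
            continue  # images skipped by the area filter carry a null caption
        box = image.get("bounding_box") or {}
        records.append({
            "kind": image.get("image_kind"),
            "page": box.get("page"),  # None when no bounding box is present
            "alt_text": caption,
        })
    return records


records = caption_records(json.loads(payload))
print(records)
```

Checking for a missing caption before use matters because skipped images still appear in the list.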

Supported Providers

Any vision-capable liter-llm provider works (see the VLM OCR provider table). For batch captioning, gpt-4o-mini, claude-3-5-haiku, and google/gemini-2.0-flash are typically the cheapest options.

The API-key precedence chain matches LLM Integration:

CaptioningConfig.llm.api_key
KREUZBERG_LLM_API_KEY
Per-provider env var

Local engines (Ollama, LM Studio with a VLM, vLLM) need no key.
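The precedence chain above can be sketched as a simple first-match lookup. This is only an illustration of the ordering, assuming the list runs from highest to lowest priority; `resolve_api_key` is a hypothetical helper, and `OPENAI_API_KEY` stands in for whatever per-provider variable applies.

```python
from typing import Mapping, Optional


def resolve_api_key(
    config_key: Optional[str],
    provider_env_var: str,
    env: Mapping[str, str],
) -> Optional[str]:
    """Hypothetical resolver: return the first key found, in precedence order."""
    candidates = (
        config_key,                        # 1. CaptioningConfig.llm.api_key
        env.get("KREUZBERG_LLM_API_KEY"),  # 2. shared environment variable
        env.get(provider_env_var),         # 3. per-provider environment variable
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return None  # nothing set, as with local engines that need no key


# Example environment; the provider variable name is an assumption.
env = {"KREUZBERG_LLM_API_KEY": "shared-key", "OPENAI_API_KEY": "provider-key"}

explicit = resolve_api_key("config-key", "OPENAI_API_KEY", env)
shared = resolve_api_key(None, "OPENAI_API_KEY", env)
provider_only = resolve_api_key(None, "OPENAI_API_KEY", {"OPENAI_API_KEY": "provider-key"})
no_key = resolve_api_key(None, "OPENAI_API_KEY", {})
```

In practice you would pass `os.environ` as `env`; taking the mapping as a parameter keeps the sketch easy to test.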

LLM Integration — provider matrix, local engines, VLM OCR
OCR — text-from-image extraction
Configuration Reference
