VLM Image Captions¶
Caption every extracted image with a vision-language model. Captions populate ExtractedImage.caption for every image whose area exceeds CaptioningConfig.min_image_area.
Feature gate
Requires the captioning Cargo feature. Included in full. Requires liter-llm and a vision-capable provider.
When to Use¶
- You need alt-text for accessibility-compliant exports
- You need searchable text descriptions per image to feed into a retrieval pipeline alongside the document body
- You need diagrams, charts, or photos described for LLM downstream consumption
When Not to Use¶
- You only need OCR'd text from images — use OCR for text extraction from images
- You're processing high-volume batches where API spend is a concern — captioning calls an LLM per image
- Images are mostly decorative or structural elements
Configuration¶
from kreuzberg import extract_file, ExtractionConfig, CaptioningConfig, LlmConfig
config = ExtractionConfig(
captioning=CaptioningConfig(
llm=LlmConfig(model="openai/gpt-4o-mini"),
),
)
result = await extract_file("report.pdf", config=config)
for image in result.images or []:
if image.caption:
print(image.caption)
use kreuzberg::{extract_file, ExtractionConfig, CaptioningConfig, LlmConfig};
let config = ExtractionConfig {
captioning: Some(CaptioningConfig {
llm: LlmConfig {
model: "openai/gpt-4o-mini".to_string(),
..Default::default()
},
prompt: None,
min_image_area: 1000,
}),
..Default::default()
};
let result = extract_file("report.pdf", None, &config).await?;
for image in &result.images {
if let Some(caption) = &image.caption {
println!("{caption}");
}
}
Custom Prompt¶
Override the built-in caption prompt:
The prompt is sent alongside each image as a single VLM request. The model sees the image plus the prompt; the response becomes the caption verbatim.
Filtering Small Images¶
min_image_area is in pixels (width × height). Icons, bullets, and decorative glyphs below the threshold are skipped — their caption field stays None. The default 1000 excludes 32×32 icons but admits typical inline figures. Raise the threshold to skip thumbnails; lower it to caption everything.
Output Shape¶
{
"images": [
{
"image_kind": "diagram",
"caption": "A flowchart showing the data ingestion pipeline: source → cleaner → indexer → retrieval API.",
"bounding_box": { "page": 3, "x": 72, "y": 144, "width": 468, "height": 312 }
},
{
"image_kind": "icon",
"caption": null
}
]
}
Supported Providers¶
Any vision-capable liter-llm provider works (see the VLM OCR provider table). For batch captioning, gpt-4o-mini, claude-3-5-haiku, and google/gemini-2.0-flash are typically the cheapest options.
API-key precedence chain matches LLM Integration:
CaptioningConfig.llm.api_keyKREUZBERG_LLM_API_KEY- Per-provider env var
Local engines (Ollama, LM Studio with a VLM, vLLM) need no key.
Related¶
- LLM Integration — provider matrix, local engines, VLM OCR
- OCR — text-from-image extraction
- Configuration Reference