Document Translation¶
Translate extracted content into a target language with an LLM. The translation covers content, optionally formatted_content, and every chunk's text in place. The result populates ExtractionResult.translation.
Feature gate
Requires the translation Cargo feature, which is included in full. Also requires liter-llm for the underlying provider.
When to Use¶
- You ingest documents in mixed languages and want a single normalised language for downstream search or analytics.
- You need per-chunk translation aligned with retrieval-augmented generation (RAG) indexes.
- You need Markdown/HTML preserved through translation (preserve_markup = true).
When Not to Use¶
- You only need machine translation of short user queries. In that case, call the LLM provider directly.
- You need a deterministic, network-free pipeline. Translation always calls an LLM.
Configuration¶
# Python
import asyncio

from kreuzberg import extract_file, ExtractionConfig, TranslationConfig, LlmConfig


async def main() -> None:
    # Translate the extracted text into German using the configured LLM.
    config = ExtractionConfig(
        translation=TranslationConfig(
            target_lang="de",
            llm=LlmConfig(model="openai/gpt-4o-mini"),
        ),
    )
    result = await extract_file("contract.pdf", config=config)
    # translation is only populated when translation was configured.
    if result.translation:
        print(result.translation.content)


asyncio.run(main())

// Rust
use kreuzberg::{extract_file, ExtractionConfig, TranslationConfig, LlmConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ExtractionConfig {
        translation: Some(TranslationConfig {
            target_lang: "de".to_string(),
            source_lang: None, // auto-detect the source language
            preserve_markup: false,
            llm: LlmConfig {
                model: "openai/gpt-4o-mini".to_string(),
                ..Default::default()
            },
        }),
        ..Default::default()
    };
    let result = extract_file("contract.pdf", None, &config).await?;
    // translation is None unless translation was configured.
    if let Some(translation) = result.translation {
        println!("{}", translation.content);
    }
    Ok(())
}
Preserve Markup¶
Set preserve_markup = true to translate formatted_content (Markdown / HTML) without losing formatting. The LLM is prompted to keep code fences, links, lists, and tables intact.
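Because the LLM is only prompted to keep markup intact, it may be worth checking the result before indexing it. The sketch below is a hypothetical helper, not part of Kreuzberg: it counts a few Markdown structures in the source and translated text and reports any counts that differ. The function names and the choice of structures are assumptions made for illustration.

```python
import re

# Hypothetical helper, not part of Kreuzberg. Each pattern counts one kind of
# Markdown structure so source and translated text can be compared.
PATTERNS = {
    "code_fences": re.compile(r"^`{3}", re.MULTILINE),
    "links": re.compile(r"\[[^\]]*\]\([^)]*\)"),
    "list_items": re.compile(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", re.MULTILINE),
    "table_rows": re.compile(r"^[ \t]*\|.*\|[ \t]*$", re.MULTILINE),
}


def markup_counts(markdown: str) -> dict[str, int]:
    """Count occurrences of each tracked structure."""
    return {name: len(p.findall(markdown)) for name, p in PATTERNS.items()}


def markup_mismatches(source: str, translated: str) -> dict[str, tuple[int, int]]:
    """Return {structure: (source_count, translated_count)} where counts differ."""
    before, after = markup_counts(source), markup_counts(translated)
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}
```

An empty result means the tracked structures survived with the same counts. It does not prove the markup is identical, so treat it as a cheap smoke test rather than a guarantee.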
Language Codes¶
target_lang is a BCP-47 tag. Common values:
| Tag | Language |
|---|---|
| en | English |
| de | German |
| fr | French |
| fr-CA | French (Canada) |
| es | Spanish |
| zh | Chinese |
| ja | Japanese |
| ar | Arabic |
| pt-BR | Portuguese (Brazil) |
source_lang follows the same format; leave it as None for auto-detection.
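If tags come from user input or file metadata, normalising their casing before building the config can avoid surprises. The following is a minimal sketch under stated assumptions: normalise_lang_tag is a hypothetical helper, and it only handles the simple language and language-region forms shown in the table, not full BCP-47 (scripts, variants, extensions). Whether Kreuzberg validates or normalises tags itself is not stated here.

```python
import re

# Simplified pattern: a 2-3 letter language subtag plus an optional 2-letter
# region. Full BCP-47 allows more subtags, which this sketch ignores.
_TAG = re.compile(r"^([A-Za-z]{2,3})(?:[-_]([A-Za-z]{2}))?$")


def normalise_lang_tag(tag: str) -> str:
    """Return the tag in conventional casing, e.g. 'pt-br' -> 'pt-BR'."""
    match = _TAG.match(tag.strip())
    if match is None:
        raise ValueError(f"unsupported language tag: {tag!r}")
    language, region = match.groups()
    if region is None:
        return language.lower()
    return f"{language.lower()}-{region.upper()}"
```

The returned string could then be passed as target_lang or source_lang.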
Output Shape¶
{
"translation": {
"target_lang": "de",
"source_lang": "en",
"content": "Der Vertrag legt eine dreijährige Supportvereinbarung mit vierteljährlicher Abrechnung fest.",
"formatted_content": "# Vertrag\n\nDie Laufzeit beträgt drei Jahre…"
}
}
Chunks (when chunking is enabled) carry the translated text in place — result.chunks[i].content holds the translated chunk, not the source.
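As a rough illustration of consuming this shape, the sketch below parses an abridged copy of the sample payload and picks the text to use downstream. preferred_text is a hypothetical helper, and it assumes formatted_content may be missing or null when markup was not preserved; the source does not confirm that detail.

```python
import json
from typing import Optional

# Abridged copy of the sample payload from the output shape above.
payload = json.loads("""
{
  "translation": {
    "target_lang": "de",
    "source_lang": "en",
    "content": "Der Vertrag legt eine dreijährige Supportvereinbarung fest.",
    "formatted_content": "# Vertrag\\n\\nDie Laufzeit beträgt drei Jahre."
  }
}
""")


def preferred_text(result: dict) -> Optional[str]:
    """Prefer formatted_content when present, else fall back to content."""
    translation = result.get("translation")
    if not translation:
        return None  # no translation present in this result
    # Assumption: formatted_content can be absent or null.
    return translation.get("formatted_content") or translation["content"]


print(preferred_text(payload))
```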
Provider Setup¶
Pick any liter-llm provider — see LLM Integration. For high-quality translation, gpt-4o, claude-3-5-sonnet, and google/gemini-2.5-pro are typical picks; gpt-4o-mini works for short documents.
API-key precedence:
- TranslationConfig.llm.api_key
- KREUZBERG_LLM_API_KEY
- Per-provider env var
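The precedence above can be pictured as a first-match lookup. This is an illustrative sketch only, assuming the sources are listed from highest to lowest priority: resolve_api_key is a hypothetical function, OPENAI_API_KEY merely stands in for whichever per-provider variable applies, and the real resolution happens inside the library.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit_key: Optional[str],
    provider_env_var: str = "OPENAI_API_KEY",  # assumed example variable name
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first non-empty key, checking sources in precedence order."""
    candidates = (
        explicit_key,                      # TranslationConfig.llm.api_key
        env.get("KREUZBERG_LLM_API_KEY"),  # shared Kreuzberg variable
        env.get(provider_env_var),         # per-provider variable
    )
    return next((key for key in candidates if key), None)
```

Passing env explicitly keeps the function easy to test without touching the real process environment.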
Related¶
- LLM Integration — provider matrix
- Document Summarisation — sibling LLM post-processor
- Configuration Reference