
Document Translation

Translate extracted content into a target language with an LLM. Translation rewrites content, formatted_content (optionally), and the text of every chunk in place, and the result populates ExtractionResult.translation.

Feature gate

Requires the translation Cargo feature, which is included in full. Also requires liter-llm for the underlying provider.

When to Use

  • You ingest documents in mixed languages and want a single normalised language for downstream search or analytics.
  • You need per-chunk translation aligned with retrieval-augmented generation (RAG) indexes.
  • You need Markdown/HTML preserved through translation (preserve_markup = true).

When Not to Use

  • You only need machine translation of short user queries. In that case, call the LLM provider directly.
  • You need a deterministic, network-free pipeline. Translation always calls an LLM.

Configuration

Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig, TranslationConfig, LlmConfig

async def main() -> None:
    config = ExtractionConfig(
        translation=TranslationConfig(
            target_lang="de",
            llm=LlmConfig(model="openai/gpt-4o-mini"),
        ),
    )
    result = await extract_file("contract.pdf", config=config)
    if result.translation:
        print(result.translation.content)

asyncio.run(main())
TypeScript
import { extractFile } from '@kreuzberg/node';

const result = await extractFile("contract.pdf", {
    translation: {
        targetLang: "de",
        llm: { model: "openai/gpt-4o-mini" },
    },
});
if (result.translation) {
    console.log(result.translation.content);
}
Rust
use kreuzberg::{extract_file, ExtractionConfig, TranslationConfig, LlmConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ExtractionConfig {
        translation: Some(TranslationConfig {
            target_lang: "de".to_string(),
            source_lang: None,
            preserve_markup: false,
            llm: LlmConfig {
                model: "openai/gpt-4o-mini".to_string(),
                ..Default::default()
            },
        }),
        ..Default::default()
    };
    let result = extract_file("contract.pdf", None, &config).await?;
    if let Some(translation) = result.translation {
        println!("{}", translation.content);
    }
    Ok(())
}
kreuzberg.toml
[translation]
target_lang = "de"
preserve_markup = false

[translation.llm]
model = "openai/gpt-4o-mini"

Preserve Markup

Set preserve_markup = true to translate formatted_content (Markdown / HTML) without losing formatting. The LLM is prompted to keep code fences, links, lists, and tables intact.

Python
from kreuzberg import ExtractionConfig, TranslationConfig, LlmConfig

config = ExtractionConfig(
    translation=TranslationConfig(
        target_lang="de",
        source_lang="en",
        preserve_markup=True,
        llm=LlmConfig(model="openai/gpt-4o"),
    ),
)
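
Because markup preservation depends on the LLM following its prompt, it can be worth comparing the structure of the source and translated Markdown afterwards. The following is a minimal sketch of such a check, not part of kreuzberg: markup_fingerprint and markup_preserved are hypothetical helpers, and the regular expressions only approximate Markdown (inline links, fence lines, and pipe-table rows).

```python
import re

# Build the fence marker programmatically so this example stays easy to embed.
FENCE = "`" * 3
FENCE_RE = re.compile("^" + re.escape(FENCE), re.MULTILINE)
# Captures the URL part of inline links such as [text](https://example.com).
LINK_URL_RE = re.compile(r"\]\(([^)\s]+)")


def markup_fingerprint(markdown: str) -> dict:
    """Summarise structure that translation should leave untouched."""
    lines = markdown.splitlines()
    return {
        "fences": len(FENCE_RE.findall(markdown)),
        "link_urls": sorted(LINK_URL_RE.findall(markdown)),
        "table_rows": sum(1 for line in lines if line.lstrip().startswith("|")),
    }


def markup_preserved(source: str, translated: str) -> bool:
    """Return True when both texts share the same structural fingerprint."""
    return markup_fingerprint(source) == markup_fingerprint(translated)
```

A mismatch does not prove the translation is wrong, only that fences, link targets, or table rows differ in number, which is usually a reason to inspect the output or retry.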

Language Codes

target_lang is a BCP-47 tag. Common values:

Tag      Language
en       English
de       German
fr       French
fr-CA    French (Canada)
es       Spanish
zh       Chinese
ja       Japanese
ar       Arabic
pt-BR    Portuguese (Brazil)

source_lang follows the same format; leave it as None for auto-detection.
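
To catch typos such as "german" or "pt_BR" before a request reaches the LLM, a simple shape check on the tag can help. The sketch below is an illustration, not a kreuzberg API: looks_like_language_tag and normalize_language_tag are hypothetical helpers, and the pattern covers only the language[-script][-region] subset of BCP-47 (no variants, extensions, or registry lookup).

```python
import re

# Simplified shape: 2-3 letter language, optional 4-letter script,
# optional region (2 letters or 3 digits). Not a full BCP-47 validator.
_TAG_RE = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z]{4})?(-([A-Za-z]{2}|[0-9]{3}))?")


def looks_like_language_tag(tag: str) -> bool:
    """Return True if `tag` has the basic language[-script][-region] shape."""
    return _TAG_RE.fullmatch(tag) is not None


def normalize_language_tag(tag: str) -> str:
    """Apply conventional casing: language lower, script Title, region UPPER."""
    if not looks_like_language_tag(tag):
        raise ValueError(f"not a simple BCP-47 tag: {tag!r}")
    language, *rest = tag.split("-")
    parts = [language.lower()]
    for part in rest:
        # Four-letter subtags are scripts (e.g. Hans); the rest are regions.
        parts.append(part.title() if len(part) == 4 else part.upper())
    return "-".join(parts)
```

For example, normalize_language_tag("PT-br") yields "pt-BR", while "pt_BR" is rejected because of the underscore.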

Output Shape

{
  "translation": {
    "target_lang": "de",
    "source_lang": "en",
    "content": "Der Vertrag legt eine dreijährige Supportvereinbarung mit vierteljährlicher Abrechnung fest.",
    "formatted_content": "# Vertrag\n\nDie Laufzeit beträgt drei Jahre…"
  }
}

Chunks (when chunking is enabled) carry the translated text in place: result.chunks[i].content holds the translated chunk, not the source.
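
When consuming the serialized result, the translation block may be missing (translation not configured) and formatted_content may be empty. The sketch below shows one way to read the payload defensively, assuming the JSON shape shown above has been loaded into a plain dict; pick_translated_text is a hypothetical helper, not part of kreuzberg.

```python
from typing import Optional


def pick_translated_text(result: dict, prefer_formatted: bool = True) -> Optional[str]:
    """Return translated text from a result payload, or None if there is none."""
    translation = result.get("translation")
    if not translation:
        return None  # translation was not configured or produced nothing
    if prefer_formatted and translation.get("formatted_content"):
        return translation["formatted_content"]
    # Fall back to the plain translated content.
    return translation.get("content")


payload = {
    "translation": {
        "target_lang": "de",
        "source_lang": "en",
        "content": "Der Vertrag legt eine dreijährige Supportvereinbarung "
                   "mit vierteljährlicher Abrechnung fest.",
        "formatted_content": None,  # assumed empty here for illustration
    }
}
print(pick_translated_text(payload))
```

Preferring formatted_content only when it is non-empty keeps the same call usable whether or not preserve_markup was set.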

Provider Setup

Pick any liter-llm provider; see LLM Integration. For high-quality translation, gpt-4o, claude-3-5-sonnet, and google/gemini-2.5-pro are typical choices; gpt-4o-mini works for short documents.

API-key precedence:

  1. TranslationConfig.llm.api_key
  2. KREUZBERG_LLM_API_KEY
  3. Per-provider env var
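
The precedence order above can be sketched as a simple first-match lookup. This is an illustration of the ordering only, not kreuzberg's implementation: resolve_api_key is a hypothetical helper, the provider variable name is passed in by the caller (OPENAI_API_KEY in the usage example is an assumption), and treating empty strings as unset is also an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    config_key: Optional[str],
    provider_env_var: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first non-empty key, checked in documented precedence order."""
    candidates = (
        config_key,                        # 1. TranslationConfig.llm.api_key
        env.get("KREUZBERG_LLM_API_KEY"),  # 2. global kreuzberg variable
        env.get(provider_env_var),         # 3. per-provider variable
    )
    for candidate in candidates:
        if candidate:  # assumption: empty strings count as unset
            return candidate
    return None


# Usage with a fake environment so the example has no side effects.
fake_env = {"KREUZBERG_LLM_API_KEY": "global-key", "OPENAI_API_KEY": "provider-key"}
print(resolve_api_key(None, "OPENAI_API_KEY", fake_env))
```

Passing the environment as a mapping keeps the lookup easy to test without mutating os.environ.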
