OCR Language Support¶
Each OCR backend supports different sets of languages. Select the backend and language based on your document requirements.
Backend Language Coverage¶
Tesseract¶
Default OCR backend with 100+ supported languages. Supports both ISO 639-3 (three-letter codes) and variant codes for historical and script-specific variants.
| Language | Code | Status |
|---|---|---|
| Afrikaans | afr |
Full |
| Amharic | amh |
Full |
| Arabic | ara |
Full |
| Assamese | asm |
Full |
| Azerbaijani (Latin) | aze |
Full |
| Azerbaijani (Cyrillic) | aze_cyrl |
Full |
| Belarusian | bel |
Full |
| Bengali | ben |
Full |
| Tibetan | bod |
Full |
| Bosnian | bos |
Full |
| Breton | bre |
Full |
| Bulgarian | bul |
Full |
| Catalan | cat |
Full |
| Cebuano | ceb |
Full |
| Czech | ces |
Full |
| Chinese (Simplified) | chi_sim |
Full |
| Chinese (Traditional) | chi_tra |
Full |
| Cherokee | chr |
Full |
| Corsican | cos |
Full |
| Welsh | cym |
Full |
| Danish | dan |
Full |
| German | deu |
Full |
| Dhivehi | div |
Full |
| Dzongkha | dzo |
Full |
| Greek | ell |
Full |
| English | eng |
Full |
| Middle English | enm |
Full |
| Esperanto | epo |
Full |
| Equations | equ |
Math formulas |
| Estonian | est |
Full |
| Basque | eus |
Full |
| Faroese | fao |
Full |
| Persian | fas |
Full |
| Filipino | fil |
Full |
| Finnish | fin |
Full |
| French | fra |
Full |
| Frankish (Old German) | frk |
Full |
| Middle French | frm |
Full |
| Frisian | fry |
Full |
| Scottish Gaelic | gla |
Full |
| Irish | gle |
Full |
| Galician | glg |
Full |
| Ancient Greek | grc |
Full |
| Gujarati | guj |
Full |
| Haitian Creole | hat |
Full |
| Hebrew | heb |
Full |
| Hindi | hin |
Full |
| Croatian | hrv |
Full |
| Hungarian | hun |
Full |
| Armenian | hye |
Full |
| Inuktitut | iku |
Full |
| Indonesian | ind |
Full |
| Icelandic | isl |
Full |
| Italian | ita |
Full |
| Italian (Old) | ita_old |
Full |
| Javanese | jav |
Full |
| Japanese | jpn |
Full |
| Kannada | kan |
Full |
| Georgian | kat |
Full |
| Georgian (Old) | kat_old |
Full |
| Kazakh | kaz |
Full |
| Khmer | khm |
Full |
| Kyrgyz | kir |
Full |
| Kurmanji (Kurdish) | kmr |
Full |
| Korean | kor |
Full |
| Lao | lao |
Full |
| Latin | lat |
Full |
| Latvian | lav |
Full |
| Lithuanian | lit |
Full |
| Luxembourgish | ltz |
Full |
| Malayalam | mal |
Full |
| Marathi | mar |
Full |
| Macedonian | mkd |
Full |
| Maltese | mlt |
Full |
| Mongolian | mon |
Full |
| Māori | mri |
Full |
| Malay | msa |
Full |
| Burmese | mya |
Full |
| Nepali | nep |
Full |
| Dutch | nld |
Full |
| Norwegian | nor |
Full |
| Occitan | oci |
Full |
| Odia | ori |
Full |
| Orientation/Script detection | osd |
Layout |
| Punjabi | pan |
Full |
| Polish | pol |
Full |
| Portuguese | por |
Full |
| Pushto | pus |
Full |
| Quechua | que |
Full |
| Romanian | ron |
Full |
| Russian | rus |
Full |
| Sanskrit | san |
Full |
| Sinhala | sin |
Full |
| Slovak | slk |
Full |
| Slovenian | slv |
Full |
| Sindhi | snd |
Full |
| Spanish | spa |
Full |
| Spanish (Old) | spa_old |
Full |
| Albanian | sqi |
Full |
| Serbian | srp |
Full |
| Serbian (Latin) | srp_latn |
Full |
| Sundanese | sun |
Full |
| Swahili | swa |
Full |
| Swedish | swe |
Full |
| Syriac | syr |
Full |
| Tamil | tam |
Full |
| Tatar | tat |
Full |
| Telugu | tel |
Full |
| Tajik | tgk |
Full |
| Thai | tha |
Full |
| Tigrinya | tir |
Full |
| Tonga | ton |
Full |
| Turkish | tur |
Full |
| Uyghur | uig |
Full |
| Ukrainian | ukr |
Full |
| Urdu | urd |
Full |
| Uzbek | uzb |
Full |
| Uzbek (Cyrillic) | uzb_cyrl |
Full |
| Vietnamese | vie |
Full |
| Yiddish | yid |
Full |
| Yoruba | yor |
Full |
Installation: Tesseract language packs are installed via your OS package manager. Install the base Tesseract, then add individual language packs as needed. See the OCR guide for installation steps by platform.
Selection: Use --ocr-language flag with a single code (e.g., --ocr-language eng). For multiple languages, join with +: --ocr-language eng+deu+fra.
PaddleOCR¶
Fast text detection and recognition for 80+ languages across 11 script families. Optimized for both mobile and server deployment.
| Language | Code | Family |
|---|---|---|
| Afrikaans | afr |
Latin |
| Arabic | ara |
Arabic |
| Bulgarian | bul |
Cyrillic |
| Chinese (Simplified) | ch_sim / zh_hans |
CJK |
| Chinese (Traditional) | ch_tra / zh_hant |
CJK |
| Czech | cs |
Latin |
| Danish | da |
Latin |
| Dutch | nl |
Latin |
| English | en |
Latin |
| Estonian | et |
Latin |
| Finnish | fi |
Latin |
| French | fr |
Latin |
| German | de |
Latin |
| Greek | el |
Greek |
| Hungarian | hu |
Latin |
| Indonesian | id |
Latin |
| Italian | it |
Latin |
| Japanese | ja |
CJK |
| Korean | ko |
CJK |
| Latin | la |
Latin |
| Latvian | lv |
Latin |
| Lithuanian | lt |
Latin |
| Norwegian | nb |
Latin |
| Persian | fa |
Arabic |
| Polish | pl |
Latin |
| Portuguese | pt |
Latin |
| Romanian | ro |
Latin |
| Russian | ru |
Cyrillic |
| Slovak | sk |
Latin |
| Slovenian | sl |
Latin |
| Spanish | es |
Latin |
| Swedish | sv |
Latin |
| Tagalog | tl |
Latin |
| Turkish | tr |
Latin |
| Ukrainian | uk |
Cyrillic |
| Vietnamese | vi |
Latin |
Installation: Built into Kreuzberg via the paddle-ocr feature. Models download automatically on first use.
Selection: Use --ocr-language or config ocr.languages with a single code or list. Join multiple codes with + for CLI: --ocr-language en+de+zh_hans.
Note: PaddleOCR uses two-letter ISO 639-1 codes and script-specific variants (e.g., zh_hans for Simplified Chinese). Consult the paddleocr-vl backend source for the authoritative list.
EasyOCR¶
High-accuracy deep learning models for 80+ languages. Python-only backend.
EasyOCR supports the same 80+ languages as PaddleOCR plus additional variants. Refer to the official EasyOCR language list for the complete set.
Installation: pip install "kreuzberg[easyocr]"
Selection: Use --ocr-language or config ocr.languages with language codes from the EasyOCR documentation.
Candle TrOCR v5.0¶
Lightweight line-level text recognition using Microsoft's TrOCR model. Trained primarily for English.
| Language | Code |
|---|---|
| English | eng / en |
Variants: Four model sizes are available:
base-printed(default) — Optimized for printed text, ~250 MBlarge-printed— Higher accuracy, ~400 MBbase-handwritten— Trained on handwritten textlarge-handwritten— Large variant for handwriting
Limitation: TrOCR is designed for single lines of text, not full pages. Pair with a layout detector to crop regions before OCR.
Installation: Enabled via candle-ocr or full feature in Cargo.
Selection: Configure via backend options:
Candle GLM-OCR v5.0¶
Multilingual vision-language model (0.9B) for full-page document parsing. Supports text, tables, formulas, and charts with region-aware layout dispatch.
| Language | Code |
|---|---|
| English | eng / en |
| Chinese | zho / zh |
| Japanese | jpn / ja |
| Korean | kor / ko |
| French | fra / fr |
| German | deu / de |
| Spanish | spa / es |
| Italian | ita / it |
| Portuguese | por / pt |
| Russian | rus / ru |
| Arabic | ara / ar |
| Hindi | hin / hi |
| Thai | tha / th |
| Vietnamese | vie / vi |
Note: GLM-OCR is trained on multilingual data and accepts any language code, falling back gracefully for unsupported languages.
Installation: Enabled via candle-ocr or full feature.
Selection: Use --ocr-backend candle-glm-ocr and optionally set --ocr-language from the list above.
Configuration:
{
"ocr": {
"backend": "candle-glm-ocr",
"backend_options": {
"task": "ocr",
"device": "auto",
"layout_mode": "whole_page"
}
}
}
Candle PaddleOCR-VL v5.0¶
Lightweight vision-language model for multilingual document parsing. Supports text, tables, formulas, and charts.
| Language | Code |
|---|---|
| English | eng / en |
| Chinese | zho / zh |
| Japanese | jpn / ja |
| Korean | kor / ko |
| French | fra / fr |
| German | deu / de |
| Spanish | spa / es |
| Italian | ita / it |
| Portuguese | por / pt |
| Russian | rus / ru |
| Arabic | ara / ar |
| Hindi | hin / hi |
| Thai | tha / th |
| Vietnamese | vie / vi |
Note: PaddleOCR-VL is trained on 109+ languages and accepts any language code.
Installation: Enabled via candle-ocr or full feature.
Selection: Use --ocr-backend candle-paddleocr-vl with optional --ocr-language.
Configuration: Requires a local model path:
{
"ocr": {
"backend": "candle-paddleocr-vl",
"backend_options": {
"task": "ocr",
"device": "auto",
"model_path": "/path/to/paddleocr-vl-model"
}
}
}
Candle DeepSeek-OCR v5.0¶
Vision-language model combining SAM vision encoder, ViT/Qwen2, CLIP, and language decoder. Supports multilingual document parsing.
| Language | Code |
|---|---|
| English | eng / en |
| Chinese | zho / zh |
| Japanese | jpn / ja |
| Korean | kor / ko |
| French | fra / fr |
| German | deu / de |
| Spanish | spa / es |
| Italian | ita / it |
| Portuguese | por / pt |
| Russian | rus / ru |
| Arabic | ara / ar |
| Hindi | hin / hi |
| Thai | tha / th |
| Vietnamese | vie / vi |
Note: DeepSeek-OCR is trained on multilingual data and accepts any language code.
Installation: Enabled via candle-ocr or full feature.
Selection: Use --ocr-backend candle-deepseek-ocr with optional --ocr-language.
Configuration: Requires a local model path:
{
"ocr": {
"backend": "candle-deepseek-ocr",
"backend_options": {
"device": "auto",
"model_path": "/path/to/deepseek-ocr-model",
"version": 2
}
}
}
Candle Hunyuan-OCR v5.0¶
Vision-language model for comprehensive document parsing. Supports CJK (Chinese, Japanese, Korean) and Latin scripts with multilingual capabilities.
| Language | Code |
|---|---|
| English | eng / en |
| Chinese | zho / zh |
| Japanese | jpn / ja |
| Korean | kor / ko |
| French | fra / fr |
| German | deu / de |
| Spanish | spa / es |
| Italian | ita / it |
| Portuguese | por / pt |
| Russian | rus / ru |
| Arabic | ara / ar |
| Hindi | hin / hi |
| Thai | tha / th |
| Vietnamese | vie / vi |
Note: Hunyuan-OCR is trained on multilingual data and accepts any language code.
Installation: Enabled via candle-ocr or full feature.
Selection: Use --ocr-backend candle-hunyuan-ocr with optional --ocr-language.
Configuration: Requires a local model path:
{
"ocr": {
"backend": "candle-hunyuan-ocr",
"backend_options": {
"device": "auto",
"model_path": "/path/to/hunyuan-ocr-model"
}
}
}
Selecting Languages¶
Configuration Precedence¶
Language selection follows the configuration cascade (highest to lowest priority):
- CLI flag:
--ocr-language engor--ocr-language eng+deu - Inline JSON config:
--config-json '{"ocr": {"languages": ["eng", "deu"]}}' - Config file:
kreuzberg.tomlorkreuzberg.yaml:
- Default: Uses backend's default language behavior (usually English or auto-detect)
Code Format¶
- Tesseract: ISO 639-3 three-letter codes (e.g.,
eng,deu,fra) - PaddleOCR: ISO 639-1 two-letter codes or script-specific variants (e.g.,
en,de,zh_hans) - Candle VLMs: Accept major language codes (e.g.,
eng,en,zho,zh)
Multiple Languages¶
Combine languages with + in CLI, or use arrays in config:
Fallback Strategy¶
If a requested language is unavailable:
- Tesseract: Fails with an error if the tessdata pack is not installed
- PaddleOCR: Falls back to the model's training base or returns an error
- Candle VLMs: Accept all language codes and attempt recognition (graceful degradation)
- EasyOCR: Falls back to the EasyOCR model's default behavior or returns an error
Install missing language packs for Tesseract via your OS package manager before OCR execution.
Language Detection¶
Use the language-detection feature to auto-detect document language, then pass the detected language to OCR:
from kreuzberg import Client
client = Client()
result = client.extract_file(
"document.pdf",
{
"language_detection": True,
"ocr": {"backend": "tesseract"}
}
)
print(f"Detected language: {result.metadata.language}")
See Also¶
- OCR Guide — Backend setup, configuration, and best practices
- Configuration Reference — Full list of extraction config options
- Supported Formats — Document format support matrix