Skip to content

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.


[Unreleased]

Changed

  • deps: bump alef pin v0.19.14 → v0.19.20. v0.19.15-v0.19.20 ship generator fixes addressing the systemic trait-bridge stub regressions surfaced by v5.0.0-rc.3 CI E2E (11 of 14 lang jobs failing on plugin_api stubs missing super-trait methods, wrong return types, internal type leakage, missing PHP interface emission, etc.) — plus the v0.19.20 hotfix that registers the new PHP interface Jinja templates in alef's embedded TEMPLATES array. Affects every binding (crates/kreuzberg-{node,wasm,ffi}, packages/*) and every e2e suite under e2e/.

[5.0.0-rc.3] - 2026-05-26

Changed

  • deps: bump alef pin to v0.19.14 — bundles all v0.19.12 + v0.19.13 + v0.19.14 generator fixes across every language emitter (Rust e2e unused result, Go missing encoding/json import, Zig undeclared types + callconv(.c), Python e2e initialize() stub, PHP e2e interface FQN, C# Bridge.Register(impl), Elixir unregister_* public wrappers, dart .as_ref() for Option<HashMap>, dart unit-variant default ctor parens, Swift RustStr typedef in placeholder, Ruby cargo-machete metadata, Java duplicate-method dedup, Kotlin Android e2e ExtractionConfig inference + .size for ByteArray + EmbeddingConfig model, dart RID-aware FRB native lib loader, kotlin streaming virtual fields gating, csharp per-binding exception class, go arch name mapping for FFI download, elixir always-include :jason dep).
  • e2e: add homebrew to [crates.e2e].languages — registry-mode test_app validates CLI + FFI Homebrew formulae post-publish.

Fixed

  • publish (workflow): release-finalize job now declares prepare in its needs: list. The job referenced needs.prepare.outputs.is_tag, dry_run, tag, is_prerelease but did not depend on prepare, producing an actionlint failure that gated CI Lint.
  • publish (ruby): exclude ext/*/native/target/ and ext/*/native/tmp/ directories from gem files list. The gemspec was including the entire Rust build artifact tree (compiled .rlib, .rmeta, .dylib files), bloating the .gem from 33 MB to 388 MB, causing gem spec to reject it as invalid due to size constraints. Now filters out build directories via reject { |f| f.include?("/native/target/") || f.include?("/native/tmp/") }.
  • publish (workflow ordering): add new finalize-github-release-after-uploads job that publishes the GitHub Release from draft to final state immediately after all native artifact uploads complete, but before publish-hex and publish-zig jobs run. Hex.pm and Zig package registries fetch asset URLs from the release page, which fails with 404 when the release is in DRAFT state. Now, publish-hex and publish-zig depend on the early finalize job, ensuring the release is already published before they attempt to download assets. The final release-finalize job runs after all publishes for post-publish validation and idempotently handles the Go module tag creation.
  • dart (ios): use wasm-target feature set for iOS x86_64 simulator build instead of android-target. The x86_64-apple-ios target was pulling in Android ABI pre-built binaries (pdfium, ort) with mismatched CPU types, causing lipo failures during XCFramework assembly. wasm-target is ORT-free and cross-compiles cleanly for all iOS simulators.

[5.0.0-rc.2] (skipped) - 2026-05-25

rc.2 was tagged + partially published (5/13 registries) on 2026-05-25 then superseded by rc.3. The fixes listed below were carried forward.

Changed

  • deps: bump alef pin to v0.19.10 (C# KreuzbergLib.Register*/Unregister* plugin forwarders + *Bridge.Register*(impl) static factories; C# trait-bridge stubs emit sync methods with real return types; Kotlin Android test-app switched to JUnit unit tests + correct Maven coordinate dev.kreuzberg:kreuzberg-android; Swift dead-code removal of legacy e2e helpers; PHP/Elixir ahash machete-ignore; Java output path aligned with pom.xml; Go test-app go.mod registry pin; pnpm-workspace allow-builds; Dart FRB_DART_LOAD_EXTERNAL_LIBRARY_NATIVE_LIB_DIR env var support; homebrew + php-ext codegen generalized from kreuzberg-specific defaults).

Fixed

  • publish (zig): mirror html-to-markdown's check-zig + publish-zig pattern (uses registry: github-release with full tarball asset name + package-name override matching alef-emitted test_app URL pattern + dependency on check-zig for idempotent reruns on already-published versions).
  • publish (pub.dev): add force_republish check to trigger-pubdev condition matching html-to-markdown pattern so already-published versions reroll cleanly when needed.
  • publish (ruby + elixir + r + dart): explicit vendor_mode = "registry" in alef.toml so path = "../../crates/kreuzberg" deps inside inner Rust crates are rewritten to crates.io version deps at publish time. rc.1 ruby gem failed at install with failed to read .../kreuzberg/Cargo.toml for exactly this reason.
  • publish (pypi): verify non-strict (sdist/wheel allowlist mismatch was aborting publish in rc.1).
  • publish (node-platform): verify non-strict (NAPI stub sub-packages don't contain .node binaries for unbuilt platforms; was aborting node publish in rc.1).
  • publish (wasm): bound build timeout (90 min) + dedicated cache key to survive GitHub runner preemption that cancelled rc.1's WASM build at ~12 min.
  • publish (elixir): NIF matrix expansion (Elixir NIF matrix include without standalone os key was silently overwriting and only producing the Windows NIF; now produces all 12 NIFs across linux + macOS + windows + arm).

  • java: align alef output path to pom.xml <sourceDirectory>. The alef backend appends dev/kreuzberg/ to the configured output; the previous setting produced files at packages/java/src/main/java/dev/kreuzberg/*.java but pom.xml's sourceDirectory=${project.basedir} expects packages/java/dev/kreuzberg/*.java. This fixes Maven compilation failures where the rc.1 JAR was missing generated classes like JsonUtil and the plugin registration methods. (alef.toml) The alef backend appends the package path (dev/kreuzberg/) to the configured output, so packages/java/src/main/java/ generated files to packages/java/src/main/java/dev/kreuzberg/*.java, but pom.xml's sourceDirectory=${project.basedir} expects files at packages/java/dev/kreuzberg/*.java. This fixes Maven compilation failures where the JAR was missing generated classes like JsonUtil and plugin registration methods. The pom.xml layout already matches the FFI-style bindings (Go, C#, Dart, etc.), so aligning Java to the same convention restores consistency. (alef.toml)

Changed

  • deps: bump alef pin to v0.19.5 (bundles AHashMap binding fixes for FFI/Dart/Swift/Elixir/Ruby/PHP and WASM emitter dedup + collect + From-impl + sub-config deserialize fixes).

Fixed

  • bindings: Stop emitting calculate_quality_score in language bindings — was made pub(crate) in commit 088168665a, exposing an internal detail of text::quality module that was never part of the public API. The function's signature (Option<&AHashMap<Cow<'static, str>, Value>>f64) references non-public generic types that don't FFI-convert cleanly (specifically, as_deref() failed on Option<HashMap> which doesn't implement Deref). Added calculate_quality_score to exclude_functions in alef.toml for all languages (Python, Node, Ruby, PHP, Go, Java, Dart, Kotlin Android, Swift, C#, R, Zig, Elixir, FFI) so the function is not exposed in any generated binding. All library builds now succeed without type-checking failures. (alef.toml)

  • swift: Regenerate packages/swift/rust/src/lib.rs and packages/swift/rust/Cargo.toml with alef v0.19.5 fixes for AHashMap<Cow, _> param types. The prior generated code for calculate_quality_score attempted &serde_json::from_str::<HashMap<String, String>>(&metadata).expect("...") on an Option<String> parameter, producing two compile errors: (1) HashMap does not implement Deref so .as_deref() was not available, and (2) the deserialized HashMap<String, V> did not match the core function's expected Option<&AHashMap<Cow<'static, str>, Value>>. Fixed by alef v0.19.5: the Swift shim now emits a pre-call let __metadata_ahash binding that deserializes the JSON string into HashMap<String, String>, converts to AHashMap<Cow::Owned, Value::String>, and passes .as_ref() to the core. Added unconditional ahash = "0.8" to packages/swift/rust/Cargo.toml so generated Swift crates reference ahash::AHashMap without manual additions. (packages/swift/rust/src/lib.rs, packages/swift/rust/Cargo.toml)

  • ffi + dart: Regenerate crates/kreuzberg-ffi/src/lib.rs and packages/dart/rust/src/lib.rs with alef v0.19.5 fixes for AHashMap<Cow, _> param types. In the FFI shim, the prior generated code emitted serde_json::from_str::<HashMap<String, V>> and .as_deref() for the metadata parameter of kreuzberg_calculate_quality_score, producing two compile errors: HashMap does not implement Deref (so .as_deref() fails) and the deserialized HashMap<String, V> did not match the core function's expected Option<&AHashMap<Cow<'static, str>, Value>>. In the Dart FRB bridge, the prior code attempted &metadata where metadata: Option<HashMap<String, String>>, which also fails to type-check against the core. Both are fixed by alef v0.19.5: the FFI emitter now deserializes into ahash::AHashMap<Cow<'static, str>, Value> and uses .as_ref(); the Dart bridge emits a pre-call let __metadata_ahash binding that converts HashMap<String, String> to AHashMap<Cow, Value> before borrowing. Added ahash = "0.8" to crates/kreuzberg-ffi/Cargo.toml and packages/dart/rust/Cargo.toml since both generated files reference ahash::AHashMap directly. (crates/kreuzberg-ffi/src/lib.rs, crates/kreuzberg-ffi/Cargo.toml, packages/dart/rust/src/lib.rs, packages/dart/rust/Cargo.toml)

  • dart: Regenerated packages/dart/rust/src/frb_generated.rs via flutter_rust_bridge_codegen generate to pick up type changes from alef v0.19.4 regen. The stale codegen file expected Option<String> for bbox fields (DocumentNode.bbox, ElementMetadata.coordinates, ExtractedImage.bounding_box, GridCell.bbox, LayoutRegion.bounding_box, PdfAnnotation.bounding_box, Table.bounding_box) and Vec<String> for attribute fields, but kreuzberg's recent type updates now expose BoundingBox as a concrete struct and attributes as nested vectors. The regenerated file now correctly type-checks against the current Rust core. Also fixed a type mismatch in the calculate_quality_score wrapper: the FFI-friendly signature takes Option<HashMap<String, String>> but the core function expects Option<&AHashMap<Cow, Value>>, so the wrapper now converts the Dart-friendly types to the Rust core format. Added ahash = "0.8" dependency to packages/dart/rust/Cargo.toml to support the conversion. Fixes macOS arm64 CI build failures. (packages/dart/rust/src/frb_generated.rs, packages/dart/rust/src/lib.rs, packages/dart/rust/Cargo.toml)

  • ci: .github/workflows/ci-docs.yaml lint-docs job pinned alef-ref to v0.16.69, so the docs linter ran against a stale alef while generated bindings were produced by the current version (v0.19.4). The version skew caused the docs validation to fail on the first regen push. Bumped to v0.19.4 to match alef.toml. (.github/workflows/ci-docs.yaml)

  • python: _to_rust_chunking_config no longer passes sizing=None to the PyO3 ChunkingConfig constructor, which raised TypeError: argument 'sizing': 'None' is not an instance of 'ChunkSizing'. The constructor's PyO3 signature provides ChunkSizing::default() when sizing is omitted, so the Python wrapper now omits the key from kwargs when the user-facing field is None, allowing the Rust default to apply. Fixes test_config_chunking_prepend_heading_context in the Python e2e suite. (packages/python/kreuzberg/api.py)

  • bindings: Expose BoundingBox in all alef-generated bindings (Java, C#, Go, Dart, Swift, Kotlin Android, etc.) by removing #[cfg_attr(alef, alef(skip))] from crates/kreuzberg/src/types/extraction.rs. The type appears as a field on Table, OcrElement, ElementMetadata, Annotation, and Page; with the type skipped alef fell back to opaque String/Object mappings, breaking Jackson/JSON deserialization (Java e2e testConfigSecurityLimits failed with Cannot deserialize value of type 'java.lang.String' from Object value for Table.boundingBox). Exposing the type generates a proper record/struct everywhere and unblocks the full Java e2e suite (88/88). (crates/kreuzberg/src/types/extraction.rs)

  • embeddings: Serialize concurrent first-time model downloads across processes. hf-hub's own download lock is non-blocking and retries for only ~5s — far shorter than a 100MB+ sentence-transformer model download — so racing processes (e.g. parallel e2e workers) failed outright with ApiError::LockAcquisition. download_model_files now holds a blocking flock(LOCK_EX) on a kreuzberg-owned lock file (<cache>/models--<repo>/.kbz-download.lock) for the full download; other processes block until release, then find the model already cached. The lock is released on drop or by the OS on process exit. Non-unix targets fall back to the prior hf-hub behavior. (crates/kreuzberg/src/embeddings/mod.rs)

  • core: The global OCR backend registry now self-heals after clear_ocr_backends. OcrBackendRegistry is the only plugin registry seeded with built-in backends (Tesseract/PaddleOCR/VLM) at construction, and the image extractor looks them up by name with no fallback — so calling clear_ocr_backends() permanently broke OCR for the rest of the process. This caused a test-pollution failure in the cross-language e2e suites: the ocr_backend_management category (which runs clear_ocr_backends) sorts before smoke, so the OCR smoke test then failed with OCR backend 'tesseract' not registered. Available backends: []. The built-in registration logic is now factored into OcrBackendRegistry::register_defaults, and a new crate-internal ensure_ocr_backends_initialized re-seeds the registry when it is empty — mirroring the existing extractors::ensure_initialized self-heal for the document extractor registry. The image extractor calls it before every OCR dispatch. (crates/kreuzberg/src/plugins/registry/ocr.rs, crates/kreuzberg/src/plugins/ocr.rs, crates/kreuzberg/src/plugins/mod.rs, crates/kreuzberg/src/extractors/image.rs)

  • kotlin-android: Exclude trait bridge interfaces from code generation by adding exclude_languages = ["kotlin_android"] to all trait bridge definitions in alef.toml (OcrBackend, PostProcessor, Validator, EmbeddingBackend, DocumentExtractor, Renderer). The alef-backend-java emits trait bridge classes with Java Panama FFM imports (java.lang.foreign.*), which are not available on Android. Kotlin-Android is JNI-only (no trait bridge support yet) and doesn't need these interfaces. This prevents compilation errors in packages/kotlin-android/src/main/java/ from missing Panama FFM API classes.

  • swift: Added render_pdf_page_to_png Swift call override in alef.toml with explicit argument definitions including optional dpi and password parameters, plus updated PDF fixtures to include null values for these fields. Resolved compilation errors in e2e/swift_e2e/Tests/KreuzbergE2ETests/PdfTests.swift where the generated function calls were missing the required named arguments. Also added swift to skip_languages for embed_texts_async call since async function naming conflicts prevent the binding from generating embedTextsAsync() — Callers should use the async wrapper from embed_texts instead. This resolves all 7 Swift compilation errors and reduces test failures to 1 (OCR backend registration runtime issue). (alef.toml, fixtures/pdf/, e2e/swift_e2e/)

  • kotlin-android: Added crates/kreuzberg/src/types/internal.rs to alef.toml sources list so InternalDocument is extracted and generated for Kotlin bindings. The type is used in trait bridge method signatures (e.g., IRenderer::render(doc: InternalDocument)) and must be available for the generated interface to compile. Previously the generated Kotlin code referenced InternalDocument but the type was never extracted (missing source file), causing "Unresolved reference" errors. The skip attribute was already removed in pass-2 (bf80c2fce7); this completes the fix by ensuring the type is extracted. (alef.toml)

  • kotlin-android: Added embed_texts_async to exclude_functions in alef.toml. The function creates a naming conflict with the suspend wrapper of embed_texts — both generate suspend fun embedTextsAsync() in Kotlin, causing overload ambiguity. Callers should use the suspend function from embed_texts instead. This resolves duplicate function declaration errors in the generated Kreuzberg.kt. (alef.toml)

Changed

  • alef: bumped to v0.17.24. Regenerated all bindings and e2e tests. v0.17.24 includes: conditional #[php(prop)] for Prop-compatible types only (fixes E0277 errors for non-Prop fields), Kotlin/Android codegen fixes (named struct field default-construction, sealed-class field annotations, JNI single-param is_optional propagation), Dart pubspec single-caret version constraint, alef-e2e/zig module_name path fix, valid build.zig.zon declarations, scaffold wasm filename underscore conversion, alef-e2e/csharp synthetic chunk assertion inline predicates (chunks_have_heading_context, first_chunk_starts_with_heading), alef-e2e/python equivalent. Known regression: e2e/csharp/tests/ContractTests.cs is no longer emitted by the C# e2e generator; other languages still emit their contract test file (tracked for upstream fix). (alef.toml, ~935 regenerated files)

  • ci: Split pub.dev publishing into a dedicated publish-pubdev.yaml workflow. pub.dev OIDC trusted publishing rejects tokens from release events; only push and workflow_dispatch are accepted. The Dart package embeds platform-specific native binaries (Android JNI, iOS XCFramework, server libs), so the main workflow now assembles them into a dart-package-assembled artifact and dispatches publish-pubdev.yaml via workflow_dispatch with the run ID; that workflow downloads the artifact and publishes. One-time setup required: configure pub.dev → kreuzberg package → Admin → Automated publishing with workflow path .github/workflows/publish-pubdev.yaml.

Fixed

  • demo: WASM extraction now runs in a per-file Web Worker, keeping the main thread unblocked throughout. Fixes page freeze on repeated use, stale output visible during a new extraction, and the 30s timeout that could never fire when the main-thread event loop was blocked. The ArrayBuffer is transferred (not copied) to halve peak memory on large files. Dead importmap entries (tesseract-wasm, comlink) and the top-level initWasm/enableOcr call were removed; WASM initialisation now happens inside the Worker. All console.error calls are replaced by the [kreuzberg/wasm]-prefixed log helper so DevTools filtering works. (docs/demo.html, #992)

  • e2e/csharp: Added nested_types mappings and options_via = "from_json" overrides to alef.toml C# call configurations. When fixture values are sealed-union or complex config types (EmbeddingModelType, EmbeddingConfig, ChunkingConfig, KeywordConfig, etc.), the codegen now emits JsonSerializer.Deserialize<T>(json, ConfigOptions)! instead of raw string literals, eliminating type-mismatch compile errors. Top-level nested_types apply across all calls; per-call overrides refine for specific functions. Reduces e2e/csharp compile errors from 11 to 3 (remaining: metadata.Format.Trim() sealed-union bug and chunks_have_heading_context synthetic predicate routing — tracked for upstream alef codegen fixes).

  • Taskfile: kotlin:e2e now passes --lang kotlin_android to alef test, matching the language key declared in alef.toml. Previously the task invoked alef test --e2e --lang kotlin which produced Language 'kotlin' not in config languages list or test configuration. (Taskfile.yml)

  • core: EmbeddingModelType::default() now returns Preset { name: "balanced" } instead of Preset { name: "" }. Language binding mirror structs (Ruby, PHP, and others) have their own EmbeddingModelType with a derived or hand-written Default that calls String::default() (empty) for the name field; when that flows through From<EmbeddingModelType> into kreuzberg::embed_texts_async, get_preset("") returns None, causing "Unknown embedding preset: " errors. All defaults across the codebase converge on "balanced", so the Default impl is now consistent with default_model() and EmbeddingConfig::default(). Added a unit test test_embedding_model_type_default_is_balanced to lock this in. (crates/kreuzberg/src/core/config/processing.rs)

  • e2e/elixir: Regenerated with alef v0.17.19, which fixes a keyword-opts threshold bug in the Elixir e2e codegen. When a call had 2+ trailing optional parameters (e.g., mime_type, config), the codegen now emits all optional args in keyword form (mime_type: "...", config: "..."), not mixed positional and keyword (mime_type: "...", "{}"). This respects Elixir's syntax requirement that all positional args come before keyword args. Fixes smoke_test and other e2e test compile errors. (alef.toml, e2e/elixir/test/*_test.exs)

  • e2e/r: Marked the config argument of embed_texts_async as optional = true in alef.toml so generated e2e tests for languages whose fixtures omit a config (R, Python, Node) no longer call the binding without it. Previously the R wrapper signature function(texts, config) had no default and the empty/happy fixtures failed with argument "config" is missing, with no default. Regenerated all per-language embed_async_pending test suites via alef generate. Brings R e2e from 155/158 PASS to 159/160 PASS; the only remaining failure (test_smoke.R tesseract OCR backend) is environmental. (alef.toml, e2e/*/...embed_async_pending*)

  • deps: Loosened tar requirement in crates/kreuzberg/Cargo.toml from ^0.4.46 back to ^0.4 so it remains compatible with tree-sitter-language-pack v1.8.1, which locks tar to 0.4.45. The pinned floor prevented the e2e/rust crate from resolving a consistent version. (crates/kreuzberg/Cargo.toml)

Fixed (core: code_intelligence Go type mismatch)

  • core: Changed ExtractionResult.code_intelligence field type from Option<crate::ProcessResult> to Option<serde_json::Value>. The ProcessResult type lives in tree_sitter_language_pack — an external crate whose struct layout alef cannot resolve — so the Go backend emitted *string, causing cannot unmarshal object into string at runtime. Using serde_json::Value maps to json.RawMessage in Go, preserving full JSON fidelity while being opaque to alef. Updated extraction/derive.rs to serialize the ProcessResult before assignment; a tracing::warn! guards the rare serialization failure. (crates/kreuzberg/src/types/extraction.rs, crates/kreuzberg/src/extraction/derive.rs, packages/go/v5/binding.go)

Fixed (core: chunk_size=0 panic)

  • core: build_chunk_config in crates/kreuzberg/src/chunking/builder.rs now clamps max_characters == 0 to 1000 (the default_chunk_size value) instead of forwarding zero to text-splitter, which panics with chunk size must be non-zero. Binding mirror structs whose serde(default) zeroes the field no longer abort the host process. A tracing::warn! line logs the clamp. (crates/kreuzberg/src/chunking/builder.rs)

Fixed (core: null-tolerant serde for ChunkSizing, EmbeddingModelType, FormatMetadata::Image uppercase)

  • core: ChunkingConfig.sizing and EmbeddingConfig.model fields now tolerate explicit JSON null (in addition to missing fields). Language binding mirror structs emit "sizing": null / "model": null from zero-valued structs; the previous #[serde(default)] annotation handled a missing key but not an explicit null for internally-tagged enums. Added deserialize_null_default<T> and deserialize_null_model helpers in processing.rs and switched both fields to use them. (crates/kreuzberg/src/core/config/processing.rs)
  • core: FormatMetadata::Image Display impl now emits the format string in uppercase (e.g., "PNG" instead of "png"), matching the fixture assertion convention used across all language e2e suites. (crates/kreuzberg/src/types/metadata.rs)

Fixed (e2e/php: hermetic ini)

  • scripts/setup-php-ext-ini.sh + alef.toml: PHP e2e runs were failing system-wide because a sibling project (tree-sitter-language-pack) had left a stale /opt/homebrew/etc/php/8.4/conf.d/ext-kreuzberg.ini pointing at a non-existent ts_pack_core_php.so. The local php -c php.ini flag only overrides the main php.ini path, not the scan-dir, so the stale entry kept being loaded. Hardened the e2e launcher to set PHP_INI_SCAN_DIR= (disabling conf.d scanning entirely) and made the generated e2e/php/php.ini set extension_dir explicitly so the hermetic config still finds the built extension. PHP e2e is now reproducible without depending on the host's conf.d state.

Fixed (R e2e: 159/160, only env tesseract failure remains)

  • e2e/r: Regenerated against alef with four R codegen fixes (no-arg-wrapper input leakage, empty Vec<String>character(0), R-side extra_args support, FormatMetadata tagged-enum collapse helper + result_is_bytes-aware length assertions). Added an R override on render_pdf_page_to_png with extra_args = ["NULL", "NULL"] to fill in the extendr-required dpi/password positionals when the fixture omits them. Combined, kreuzberg R e2e moves from 153/158 to 159/160 — the remaining failure (test_smoke.R:9:3 PNG-with-OCR) is the pre-existing tesseract not registered env condition. (alef.toml, e2e/r/)

Fixed (Zig e2e: all 88 tests passing)

  • e2e/zig: Bumped pinned alef_version to 0.17.17 and regenerated e2e/zig/ against the upstream codegen fix that skips the chunks_have_heading_context synthetic assertion. Combined with pre-existing zig-specific overrides (extract_file and extract_bytes redirected to their _sync variants with JSON-struct result parsing, render_pdf_page_to_png extra null, null args for dpi/password, and FormatMetadata-as-display inline accessor on the metadata.format assertion), task zig:e2e now reports Build Summary: 43/43 steps succeeded; 88/88 tests passed. (alef.toml, e2e/zig/)

Fixed (Go bindings: re-exported ProcessResult path)

  • core: Switched ExtractionResult.code_intelligence from Option<tree_sitter_language_pack::ProcessResult> to Option<crate::ProcessResult> so alef's type visitor resolves the re-exported name and emits the correct *ProcessResult Go binding type. Previously alef fell back to *string, producing cannot unmarshal object into string errors at runtime.

Fixed (DrawingType alef-skip, zig binding ffi_path)

  • core: Annotate DrawingType enum in crates/kreuzberg/src/extraction/docx/drawing.rs with #[cfg_attr(alef, alef(skip))]. The enclosing Drawing struct was already skipped but its inner enum was not, so alef R binding gen emitted impl From<kreuzberg::DrawingType> for ... and broke packages/r/src/rust/src/lib.rs with five kreuzberg::DrawingType not found errors.
  • packages/zig/build.zig: Default ffi_path updated from ../../target/debug to ../../target/release so zig build test finds the FFI artifact built by the alef test before-hook (cargo build --release -p kreuzberg-ffi). Mirrors the alef-scaffold/zig fix.

Fixed (FormatMetadata Display impl)

  • core: Added impl std::fmt::Display for FormatMetadata so generated rust e2e assertions on metadata.format can render the variant as a short string ("pdf", "docx", "image"…). The Image variant emits the inner format (e.g., "PNG"), matching the fixture's field_equals value. Required after alef-e2e/rust switched to Display formatting in field_access::rust_unwrap_binding.

Fixed (java e2e clear_* methods)

  • java e2e: clear_document_extractors(), clear_ocr_backends(), clear_embedding_backends(), clear_post_processors(), clear_renderers(), clear_validators() now generate in the Kreuzberg facade class. Root cause: #[cfg_attr(alef, alef(skip))] annotations on these functions prevented alef from picking them up during Rust API extraction. Removed the annotations from the 6 clear_*() functions (only; register_*/unregister_* still skipped since they take Arc<dyn Trait> params unsuitable for FFI). Re-generated all language bindings.

Fixed (kotlin-android e2e parity sweep)

  • kotlin-android: now honours #[cfg_attr(alef, alef(skip))] annotations during DTO and enum emission (alef-backend-kotlin-android filter fix).
  • alef-extract disambiguation: when two types collide and one carries binding_excluded=true, the non-excluded one now keeps the original name (instead of getting a "2" suffix). Eliminates spurious EmbeddingPreset2, NodeContent2, LayoutClass2, PaddleLanguage2, FormatMetadata2, ExtractionMethod2 from the IR.
  • pdf/metadata.rs::PdfMetadata: removed incorrect alef(skip) annotation — there's only one definition (no duplicate to disambiguate), so the annotation was hiding the only public type from bindings.
  • stale kotlin-android wrappers: 28+ leftover *2.kt, OcrTesseractConfig.kt, PdfMetadata.kt, ListType.kt, TableTableGrid.kt, ArchiveArchive*.kt files removed after the disambiguation fix.
  • alef-e2e trait-bridge fixture support: lifted the deferred-to-v0.14.5+ gate; trait-bridge fixtures (register/unregister/clear × 6 traits) now generate per-language test calls.
  • alef-extract embed_texts_async: confirmed the cfg-gated generic signature extracts cleanly; cleared the 3 stale skip blocks on embed_async_pending/embed_texts_async_*.json.
  • e2e generation surface: regenerated all 16 language e2e suites. All trait-bridge + plugin_api categories now exercise the full public API. Only legitimate WASM/Rust platform-imposed skips remain (7× WASM-async, 1× Rust-missing-file-path).

Fixed (#985 pdf image extraction hang)

  • PDF image extraction hang on dense pages (#985): extract_image_positions previously ran a full decompression pass over every page unconditionally — even when extract_images=false — causing multi-minute hangs on PDFs with many image objects per page (e.g. InDesign/Acrobat exports). The pre-pass is eliminated; image positions are now derived from the capped extraction result, so the decompression path is skipped entirely when images are not requested. Also adds cancellation-token support inside extract_images_with_data so extraction_timeout_secs can interrupt multi-page extraction between pages, and inside the inline-OCR image loop so a timeout can interrupt per-image OCR.

max_images_per_page now bounds decompression for XObject images: the pdf_oxide backend enumerates the page XObject resource dictionary and stops decompressing after limit images, avoiding the eager cost of extract_images() for images beyond the cap. Inline images (BI/EI content-stream operators) fall back to the eager path with a .take(limit) guard; full inline-image support is tracked in #989.

  • kreuzberg::utils::pool API simplified: Pool::acquire() is now infallible — it returns PoolGuard<T> directly instead of Result<PoolGuard<T>, PoolError>. PoolError is removed. parking_lot::Mutex cannot poison, so the Result wrapper was dead code. Pool, PoolGuard, Recyclable, StringBufferPool, and ByteBufferPool are marked #[doc(hidden)] — they were technically reachable via kreuzberg::utils::pool but were never part of the stable public API.

Added

  • kreuzberg crate root: plugin_api surface (list_*, clear_*, register_*, unregister_* for all six plugin types; detect_mime_type_from_bytes, get_extensions_for_mime) was already present in the internal module tree but not wired up as top-level pub use items. The re-exports are now confirmed in place and the 9 e2e fixtures that carried a stale "not re-exported" skip block have had that block removed — those fixtures are now active across all language targets.

Fixed

  • alef-backend-php: emit reverse From<binding> for kreuzberg::Core impls for metadata DTOs (ArchiveMetadata, BibtexMetadata, DocxMetadata, EmailMetadata, EpubMetadata, ExcelMetadata, FictionBookMetadata, HtmlMetadata, ImageMetadata, JatsMetadata, OcrMetadata, PptxMetadata, PstMetadata, TextMetadata, XmlMetadata, CitationMetadata, CsvMetadata, DbfMetadata, TableGrid). Reverse From impls are required for flat enum binding→core conversion which calls .into() on optional variant fields. Previously only forward impls were generated.

  • Java UnsatisfiedLinkError on kreuzberg_list_embedding_presets / kreuzberg_get_embedding_preset (#998): the alef-generated FFI surface references kreuzberg::EmbeddingPreset unconditionally in function return-type positions, so any build of kreuzberg-ffi that omitted the embedding-presets feature would silently drop these two C symbols — causing the Java static initialiser (which uses .orElseThrow()) to crash the entire JVM at class-load time. Added #[cfg(not(feature = "embedding-presets"))] stubs in crates/kreuzberg/src/lib.rs: a no-op EmbeddingPreset struct and empty-return implementations of both functions. The stubs ensure the symbols are always present and callers degrade gracefully (empty list / null handle) instead of crashing.

  • #962: PDF text extracted one character per line on glyph-spaced documents (programmatic reproducer in crates/kreuzberg/tests/pdf_glyph_spacing_issue_962.rs) — when a Word-exported PDF places each glyph via its own BT…ET block with a sinusoidal y-jitter, pdf_oxide's ColumnAware reading order groups spans by y-level rather than reading order, yielding single-character spans out of sequence. Detection: ≥ 3 same-line x-disorder events among short (≤ 3 char) spans, where "same-line" uses a 5 pt absolute ceiling (MAX_GLYPH_JITTER_PT) instead of a font-size fraction, preventing false positives on height-zero span lists from normal documents. Reconstruction: sort by y-descending, group by 5 pt y-proximity, re-sort each group by x-ascending, insert spaces at word gaps (x-gap > font_size × 0.5). Heuristic constants live in crate::pdf::structure::constants with measurement justification; upstream fix shipped in pdf_oxide v0.3.51 (issue #518, closed 2026-05-19); heuristic removable when kreuzberg upgrades to ≥ 0.3.51. Dutch word relatie (from "relatie-id" in the Word broken-image placeholder string) added to .typos.toml.

Changed

  • API surface lockdown via #[cfg_attr(alef, alef(skip))]: 41 internal types that were leaking into the public polyglot binding surface are now hidden from every binding. Skipped types: REST/MCP wire-envelope DTOs (ApiDoc, InfoResponse, ExtractResponse, EmbedRequest/EmbedResponse, ChunkRequest/ChunkResponse, DetectMimeTypeParams, CacheWarmParams, EmbedTextParams, ExtractStructuredParams, ChunkTextParams, ManifestEntryResponse/ManifestResponse, WarmResponse, OpenWebDocumentResponse, DoclingCompatResponse, StructuredExtractionResponse); HWP/DOCX format parser helpers (StreamReader, CharShape, HwpImage, Drawing, AnchorProperties, PageMarginsPoints, StyleDefinition, ResolvedStyle); office_metadata internals (CoreProperties, CustomProperties, OdtProperties, DocxAppProperties, XlsxAppProperties, PptxAppProperties); pool/service infrastructure (TracingLayer, SyncExtractor, Recyclable); OCR/chunking intermediates (ImageOcrResult, HtmlExtractionResult, ExtractedInlineImage, DetectedBoundary, ChunkingResult, MergedChunk). These types were never part of the polyglot SDK contract — extract_file / extract_bytes / Metadata / OcrConfig remain unchanged.

  • Migrated alef.toml [crates.exclude] to source-level annotations: 100+ type/function/method entries previously listed in alef.toml's global exclude block now carry #[cfg_attr(alef, alef(skip))] at their definition sites in crates/kreuzberg/src/**. Behaviour-neutral migration — the same items remain excluded, but the exclusion declaration lives next to the defined item and stays in sync across refactors instead of drifting away from the source. alef.toml [crates.exclude] now retains only the ~20 entries that cannot be source-annotated (fully-qualified path exclusions where the short name collides with a kept public type, generic structs whose bounds prevent cfg_attr propagation, trait-bridge register_* / clear_* functions that need deduplication against the bridge emitter, and trait impl methods like From::from / Display::fmt / Deref::deref auto-emitted by derive).

  • kreuzberg::CacheStats now refers to cache::core::CacheStats (was previously aliased to paddle_ocr::CacheStats). The paddle-OCR variant is renamed to ModelCacheStats and re-exported as kreuzberg::ModelCacheStats. Breaking change for Python/TypeScript/Ruby/PHP/Go/Java/C#/Elixir/Dart/Swift bindings — consumers of the previous kreuzberg.CacheStats (paddle model cache variant) must migrate to kreuzberg.ModelCacheStats.

  • kreuzberg::extraction::image::ImageMetadata renamed to ExtractedImageMetadata to disambiguate from kreuzberg::types::metadata::ImageMetadata. Internal use only; no public binding surface impact.

Removed

  • Orphan kreuzberg::types::formats::CacheStats (unused duplicate of kreuzberg::cache::core::CacheStats; superseded by the canonical re-export at the crate root).

Added

  • Serialize/Deserialize derives on DOCX and HWP parser types: Table, TableRow, TableCell, Paragraph, and Run in extraction::docx::parser, and Section, Paragraph, and ParaText in extraction::hwp::model. These types remain internal (still annotated #[cfg_attr(alef, alef(skip))] — out of binding surface) but can now flow through Rust-side caching, snapshot tests, and any internal JSON-based pipelines. No change to public binding API.

  • Plugin registry functions now exposed in every binding: register_ocr_backend, register_post_processor, register_validator, register_embedding_backend, register_renderer, register_document_extractor, their unregister_* siblings, and clear_* group counterparts. Bindings previously hid these because the alef codegen emitted duplicate definitions; alef ≥ v0.16.65 auto-deduplicates trait-bridge registrations, so the kreuzberg alef.toml global function exclusions are dropped.

  • kreuzberg-ffi register_ocr_backend / unregister_ocr_backend are now callable from C, Go, Java, C#. The previous *const c_void Send issue was resolved by the alef-backend-ffi Jinja migration; the binding now compiles with unsafe impl Send + Sync on the bridge struct.

  • WASM bindings now expose HwpxExtractor, process_images_with_ocr, and the canonical CacheStats. alef ≥ v0.16.65 auto-excludes feature-gated types, so the explicit WASM exclusions for these are redundant.

Added

  • task demo:dev:setup for WASM demo prerequisites (#1007): new task installs wasm-pack 0.13.1 (matches CI pin), wasm-bindgen-cli at the version pinned in Cargo.lock, and WASI SDK 25 — all idempotent with exact version checks and SHA256 verification on the WASI SDK download. Run once before task demo:dev.

Fixed

  • task demo:dev broken after native WASM OCR migration (#1006): commit 198c9e99e removed the JS OCR worker bridge files without updating demo.html, leaving the asset server with nothing to serve for the CDN imports. Restores dist/index.js, dist/extraction/files.js, and dist/ocr/enabler.js for the new architecture where TesseractWasmBackend auto-registers at WASM init time and enableOcr() is a no-op.

  • Isolated numbered headings in untagged PDFs (#961): Lines matching \d{1,2}\. [A-Z] that appear alone in their paragraph block are now promoted to Heading elements instead of being misclassified as ListItem. Affects the ElementBased transform fallback path (e.g. ReportLab-generated PDFs). Real numbered lists with sibling items are unaffected. Lone lowercase-initial lines (1. go there alone) remain ListItem and are not affected.

  • Bordered 2-column PDF tables now detected (#964): Added extract_tables_bordered as a second detection tier between the strict native pass and the text-heuristic fallback. Tables whose cells are drawn with stroke lines or rectangles but have only 2 columns are now recovered.

[5.0.0-rc.1] - 2026-05-16

Changed

  • Alef binding generator v0.16.14: payload-derived sealed variant parameter names in Kotlin Android (Pdf(PdfMetadata)val metadata instead of val field0), C# record properties marked required on non-nullable refs without defaults, Java records for fields-only DTOs without builder pattern.

Fixed

  • VLM / force_ocr path omitted image placeholders from Markdown output (#987): when force_ocr=true (e.g., a VLM OCR backend) with extract_images=true and inject_placeholders=true, the rendered Markdown contained no ![](image_N.ext) references. Root cause: the structured pre_rendered_doc path (which injects ElementKind::Image elements via assemble_internal_document) is skipped when force_ocr=true; the OCR-path document built from raw OCR text carried no image elements. Fixed by injecting image elements — with correct page attribution from ExtractedImage.page_number — into the OCR-path document when images are extracted and inject_placeholders is enabled.

  • MCID-tagged PDF content dropped in markdown/html output: two independent failure chains in the structured-PDF path caused content loss. (1) mark_cross_page_repeating_text and mark_cross_page_repeating_short_text had no first-occurrence exemption — a title block whose text also appeared as a running header on later pages was furniture-marked on every page including the first, silently dropping it from output. Fixed by preserving the first occurrence of any repeating span, matching the exemption already present in pdf_oxide's own mark_running_artifact_spans. (2) The OCR skip gate incorrectly suppressed OCR when fallback=true (a per-page quality check detected a scanned page): the gate now runs OCR whenever fallback=true, regardless of aggregate character count. OCR is still skipped when the pre-rendered doc is substantive and no per-page fallback is needed, or when the content is non-textual.

  • Djot renderer emitted dangling ![]() for orphaned image elements (#914): when an ElementKind::Image referenced an index outside doc.images, the Djot renderer fell through to its normal output path and produced empty image markup. Aligned with the existing comrak_bridge behavior: skip the element only when both the image lookup is None and the description text is empty. When alt text is present, it is still emitted so user-visible content is preserved.

  • Email encoding data corruption (#910): replaced brittle 4-byte heuristic for UTF-16 detection in EmailExtractor with a robust statistical approach using chardetng. Tiny 4-byte ASCII files (e.g., binary sequences with nulls) are no longer incorrectly transcoded as UTF-16LE, preventing silent data corruption. The fix maintains support for legitimate non-BOM UTF-16 messages by requiring a 16-byte minimum sample and verifying the encoding guess against chardetng when alternating null patterns are detected.

Removed

  • Gleam binding (kreuzberg_gleam Hex package): dropped entirely. Gleam targets the BEAM and Gleam consumers can keep using the Elixir binding via Erlang interop, so the dedicated package added negligible audience reach at a real maintenance cost (regen on every alef bump, dedicated CI workflow, Hex publish job, e2e fixtures, 92 doc snippets). Existing published versions of kreuzberg_gleam on Hex remain available for anyone still pinning them — no further releases will be made.
  • PageContent.images: Vec<Arc<ExtractedImage>> (#777): removed and replaced by PageContent.image_indices: Vec<u32>. All image data lives once in ExtractionResult.images; pages and chunks now carry indices into that collection. See docs/migration/v5.0-image-indices.md for migration guidance.

Security

  • [GHSA-gg9g-p963-p7x4]: HwpxExtractor now validates the ZIP container against SecurityLimits before passing bytes to unhwp::parse_bytes. Previously the ExtractionConfig was silently discarded (_config), allowing a crafted HWPX file with a >100:1 DEFLATE ratio to exhaust process memory (CWE-409). The fix adds an upfront byte-count check (max_archive_size) and a ZipBombValidator pass over the central directory before any decompression occurs. Affects all builds with the hwpx, formats, or full feature enabled since 5.0.0-rc.1.

Added

  • Heuristic PDF table extraction on the default path (#897): the default text-layer PDF extraction now falls back from pdf_oxide's native grid detector to a heuristic reconstruction when the native detector returns empty. The fallback clusters words into vertically-contiguous regions by abnormally-large row gaps, runs the existing reconstruct_table → post_process_table → is_well_formed_table chain (the same one used by the OCR pipeline and the layout-detection path), and emits the surviving grids as Table entries with bounding boxes. This recovers tables on text-layer PDFs (invoices, financial statements, scientific tables) that lack the explicit ruling lines pdf_oxide's grid detector requires — without needing the 12 GB ONNX layout-detection models. New config: PdfConfig.extract_tables: bool (default true) and CLI flag --pdf-extract-tables. Set to false to skip table extraction entirely.
  • Inline-image OCR for PDFs: Added ocr_inline_images field to PdfConfig and --pdf-ocr-inline-images CLI flag. When enabled, this performs OCR on inline images found within PDF pages and injects the results into the extracted document.
  • InternalDocument is serde-bridgeable: InternalDocument and its four previously-non-serde sub-types (InternalElement, ElementKind, Relationship, RelationshipTarget) gained Serialize + Deserialize derives. Combined with the Cow<'static, str>String migration on source_format and mime_type, foreign-language plugin authors can now construct InternalDocument values that round-trip through JSON at the FFI boundary — unblocking DocumentExtractor and Renderer trait bridges in alef 0.15.25+.
  • DocumentExtractor + Renderer cross-language plugins: both trait_bridges are now active in alef.toml. All 16 language bindings expose register_document_extractor / unregister_document_extractor / clear_document_extractors / list_document_extractors and the matching Renderer lifecycle. Foreign-language plugin authors can now implement arbitrary document extractors and renderers in their host language.
  • #619 follow-up: POST /extract-async now returns HTTP 429 when more than 100 jobs are active simultaneously, preventing unbounded memory growth under load.
  • backend_options passthrough on OcrConfig and OcrPipelineStage: both types now carry an Option<serde_json::Value> field backend_options. Custom and built-in backends that support runtime tuning (mode switching, preprocessing flags, inference parameters, etc.) can read this value via config.backend_options inside OcrBackend::process_image and deserialize only the keys they care about — unknown keys are silently ignored, so options from different backends coexist in the same config without conflict. When OcrConfig auto-constructs a two-stage pipeline (tesseract + paddleocr), backend_options is propagated to the primary (tesseract) stage only; the paddleocr fallback stage continues to use paddle_ocr_config.
  • WASM OCR backend: TesseractWasmBackend registered for the ocr-wasm feature set, exposing OCR on the WASM target via tesseract-wasm while the native path continues to use leptess.
  • Renderer plugin: Renderer now extends Plugin, picking up the shared name/version/initialize/shutdown lifecycle (with no-op defaults on Plugin so stateless renderers stay boilerplate-free). Public helpers register_renderer, unregister_renderer, list_renderers, clear_renderers match the convention of the other plugin registries and now all return Result for symmetric cross-language codegen.

Changed

  • BREAKING: PageContent.images replaced by PageContent.image_indices (#777): PageContent.images: Vec<Arc<ExtractedImage>> is removed. Pages now carry image_indices: Vec<u32> — zero-based indices into the top-level ExtractionResult.images collection. ChunkMetadata gains the same image_indices: Vec<u32> field (populated post-chunking by matching each image's page_number against the chunk's [first_page, last_page] range). Migration: replace page.images[i].data with result.images.as_ref().unwrap()[page.image_indices[i] as usize].data. All image data still lives in ExtractionResult.images; pages and chunks reference it by index rather than carrying full copies.

  • BREAKING (wire format): LayoutClass now serializes as snake_case in JSON output (e.g. "list_item" instead of "ListItem"). LayoutDetection.class_name returned by HTTP/MCP/Python APIs flips PascalCase → snake_case, matching the documentation that already used snake_case. The internal LayoutClass::name() accessor is renamed to as_str() and remains available for callers that need a &'static str.

  • TableModel serialization is now symmetric snake_case: previously serde_json::to_string(&TableModel::SlanetWired) returned "SlanetWired" (PascalCase) but from_str only accepted "slanet_wired" (snake_case), so JSON config round-trips were impossible. Both directions now use snake_case via #[serde(rename_all = "snake_case")]. The hand-rolled Deserialize impl is removed.
  • Plugin lifecycle defaults: Plugin::version(), initialize(), and shutdown() now have no-op default implementations (name() stays abstract). Existing impl Plugin blocks that returned Ok(()) / "1.0.0" continue to work; new stateless plugins can omit the boilerplate.

Fixed

  • Performance (#908): Removed a redundant config.normalized() call in the internal extract_file_uncached helper that caused unnecessary heap allocations when bypassing the extraction cache.
  • VLM OCR per-page propagation (#928): VLM OCR results now populate individual PageContent structures in PDF extractions, not just ExtractionResult.content. Per-page text segments are mapped to document pages, and stale native text is cleared on secondary pages when a VLM returns a single whole-document string.
  • LLM base URL normalization: fixed a regression where a custom KREUZBERG_LLM_BASE_URL with a trailing slash caused authentication (401) or 404 errors due to double-slash path construction (e.g. /v1//chat/completions). The LLM client now automatically trims trailing slashes from the base URL.
  • PDF layout-classify regression on text-heavy structure-tree PDFs: three coupled fixes to the layout-for-markdown pipeline that recover quality lost when RT-DETR hints were applied too aggressively.
  • apply_spatial_overrides now content-gates promotion-class hints (Title / SectionHeader / Caption / Footnote / ListItem). A hint is rejected when paragraph text doesn't match the hint type — short text (≤200 chars) for heading/caption/footnote classes; list-marker prefix for ListItem. Prevents promoting long body paragraphs whose bbox happens to overlap a heading hint.
  • apply_proportional_overrides is now bypassed for structure-tree pages (paragraphs without positional data). The fractional-position matching misaligned when paragraphs reordered between font-clustering and RT-DETR. The structure tree's font-size classification is the more reliable signal here; hints are skipped silently.
  • Heading-promotion thresholds tightened: MAX_BOLD_HEADING_WORD_COUNT 15 → 12, body→heading font-size delta +0.5pt → +1.0pt in three call sites of classify.rs.
  • PDF Markdown / Djot / Plain quality gates now run by default: test_pdf_quality_gate, test_pdf_djot_quality_gate, and test_pdf_plain_quality_gate in crates/kreuzberg/tests/pdf_markdown_regression.rs no longer carry #[ignore = "TODO: pdf_oxide upstream"]. The upstream issue (#484) hasn't been demonstrated to currently trigger these failures, and the calibrated PDFIUM_GROUND_TRUTH thresholds are the binding contract for layout-pipeline changes.
  • #911: extraction_timeout_secs now explicitly returns a KreuzbergError::Validation error when configured in non-tokio or WASM builds. Previously, timeouts in these environments were silently ignored, leading to unexpected hangs.
  • swift e2e command: [crates.test.swift].e2e now runs from packages/swift (was e2e/swift). The generated XCTest cases live inside packages/swift/Tests/<Module>Tests/ because SwiftPM 6.0 forbids inter-package .package(path:) references in a monorepo, so e2e/swift/Package.swift is a documentation-only stub with no buildable target. The previous setting failed with error: The package does not contain a buildable target. on every task swift:e2e invocation.
  • Java FFI compile error: readJsonList now wraps the null-check and checkLastError() call inside try-catch, resolving an unreported Throwable exception that blocked Java e2e test compilation.
  • alef.toml: TesseractWasmBackend added to [crates.exclude].types so non-WASM bindings (kreuzberg-py, kreuzberg-nif, etc.) no longer reference the WASM-only OCR backend (which is #[cfg(feature = "ocr-wasm")]-gated) and break the build under default features.
  • dart e2e: extract_file overridden to extract_bytes in alef.toml (dart cannot pass file paths through the FRB bridge); e2e generator regenerated to forward actual [BatchBytesItem] / Uint8List arguments rather than empty parameter lists.
  • e2e/gleam: regenerated against alef gleam codegen with contains_any OR logic + gleam/list import; full FFI shim (packages/gleam/src/kreuzberg_gleam_ffi.erl) wraps the Elixir.Kreuzberg.Native *_sync NIFs and converts the Erlang map results into Gleam-typed tagged tuples (extraction_result, metadata, document_structure, format_metadata, excel_metadata).
  • e2e/zig: regenerated for Zig 0.16 API (allocator + IO surface) and FormatMetadata internally-tagged enum path lookups now skip the variant-name segment.
  • Gleam dependency manifest: restored canonical hex version ranges in packages/gleam/gleam.toml (gleam_stdlib = ">= 0.34.0 and < 2.0.0", gleeunit = ">= 1.0.0 and < 2.0.0"). An earlier alef sync-versions had routed gleam.toml through the catch-all SEMVER replace path and overwrote both ranges with ">= 5.0.0-rc.1 and < 5.0.0-rc.1" (an empty range gleam refuses to resolve), wedging gleam test. The restore_gleam_dep_ranges helper in alef now keeps these stable on future syncs.
  • #853: HWP structured extraction now returns an error instead of silently returning an empty document when no BodyText sections are found. Fixes a regression introduced in the structured extraction refactor.
  • #619 follow-up: POST /extract-async handler no longer panics on mutex poison — returns HTTP 500 and marks the job as Failed instead.
  • Fixed dead conditional-import warning on KreuzbergError in plugins/registry/ocr.rs under non-OCR feature sets.
  • Zig e2e tests: Added default_extraction_config constant and extract_file_sync_default, extract_bytes_sync_default overloads for e2e test generation. Configured alef codegen to emit these default variants.

[5.0.0-rc.1] - 2026-05-05

Breaking Changes

  • Metadata.format is now a nested object, not flattened: The format field (and its format_type discriminator) moved from the root of the Metadata JSON object into a dedicated format key: {"format": {"format_type": "pdf", ...}} instead of {"format_type": "pdf", ...} at root. The additional postprocessor map is likewise nested under an "additional" key. The top-level sheet_count / sheet_names mirror fields are gone; access them via metadata.format.excel.sheet_count / .sheet_names. Affects REST, MCP, CLI (--output-format json), and every binding.
  • Go module path changed from v4 to v5: Import path is now github.com/kreuzberg-dev/kreuzberg/v5. Update your go.mod and all import statements.
  • PHP binding parameter names are now lowerCamelCase: Function parameters such as $mime_type are now $mimeType, $page_index$pageIndex, etc., matching PHP naming conventions.
  • Python _to_rust_extraction_config dict-coercion refactored: The isinstance(value, dict) branch now delegates to _coerce_dict_extraction_config(). No public API change; internal helper is not part of the public surface.

Changed

  • Version bump from 4.x to 5.0.0-rc.1 reflecting accumulated breaking changes across bindings since the 4.10.0 series.
  • All binding manifests (Node, Ruby, PHP, Java, C#, Go, Python, Elixir, Gleam, R, Dart, Swift, Zig) updated to 5.0.0-rc.1.

Changed

  • Inlined text-splitter into crates/kreuzberg/src/chunking/text_splitter/. The upstream crate pins tokenizers = "0.22", which conflicted with kreuzberg's direct tokenizers 0.23 dependency and produced a duplicate copy of tokenizers in the build graph plus a Tokenizer: ChunkSizer trait-bound failure in chunking::core. The inlined fork drops the unused code (tree-sitter) and tiktoken-rs features and rebuilds against tokenizers 0.23. Kreuzberg's own tree-sitter–based code splitter is unaffected. See ATTRIBUTIONS.md for full provenance and license terms.

Added

  • #619: Async extraction API — POST /extract-async accepts the same multipart form as POST /extract and immediately returns AsyncJobResponse (job_id) with HTTP 202. Background task runs the extraction pipeline (with a configurable timeout) and stores the result in an in-memory JobStore (5-minute TTL, evicted on restart). GET /jobs/{job_id} returns JobStatus (job_id, state, created_at, updated_at, result, error) allowing clients to poll for PendingRunningCompleted / Failed transitions. Both endpoints are gated behind the api feature flag.
  • HWPX Extraction Support (#875): Integrated the unhwp crate to natively extract text, structure, and comprehensive document metadata from modern Hangul Word Processor XML (application/haansofthwpx) documents. Features dedicated MIME routing distinct from legacy HWP5.
  • #761: ExtractionResult.extraction_method — new field exposing how text was extracted (native, ocr, mixed). Populated by PDF (native vs OCR vs force_ocr_pages mixed) and image (always ocr) extractors. Surfaced across every binding (Python, Node, PHP, Ruby, Java, C#, Go, R, Dart, Swift, Elixir, Gleam, Zig, WASM, C FFI).
  • #788 follow-up: Image classification + tile clustering on ExtractedImage. New optional fields image_kind (a public ImageKind enum: Photograph, Diagram, Chart, Drawing, TextBlock, Decoration, Logo, Icon, TileFragment, Mask, Unknown), kind_confidence (f32 ∈ [0.0, 1.0]), and cluster_id (u32). The classifier is offline, deterministic, and uses already-captured signals: dimensions, aspect ratio, colorspace, bits-per-component, format, plus Shannon entropy of a 64×64 thumbnail. The clusterer groups same-page images with similar dimensions whose bounding boxes sit within half a tile-side of each other and reclassifies the members to TileFragment, so a technical drawing composed of N raster fragments surfaces as one cluster_id. Wired through every extractor that produces ExtractedImage: PDF (lopdf and pdfium fallback), DOCX, PPTX, HTML, ODT, EPUB, FictionBook, Jupyter, RTF, and the standalone image extractor. Toggle via ImageExtractionConfig.classify (default true).
  • #784: PageInfo.has_vector_graphics: bool — page-level flag indicating that a PDF page contains non-trivial vector drawing content (paths, shapes, curves) that is otherwise invisible to ExtractionResult.images because it isn't embedded as a raster XObject. Populated by counting PdfPageObject::Path instances on the pdfium-rendered page; the flag flips when more than 8 paths are present. Lets downstream consumers (RAG / VLM pipelines) decide per-page whether to rasterise the page themselves to capture vector charts and diagrams produced by Adobe InDesign, LaTeX TikZ, etc. Bounding-box detection of vector regions and auto-rasterisation are deferred to a later minor.

Fixed

  • #799: Extract images nested inside PDF Form XObjects across XObject references — recursive Form XObject descent (depth-limited to 8) now follows indirect references through the resource chain, with cycle detection via a visited-set so self-referential XObject DAGs no longer hang. Both the lopdf and pdfium image-decoding paths benefit.
  • #824: Robust PDF image extraction across XObject references — fixes silent image drops when documents reference XObjects through indirect chains. Combined with the depth limit and cycle guard from #799 to harden the recursive walker against malformed structures.
  • #826: WASM loading on Next.js / Turbopack — @kreuzberg/wasm now bundles cleanly under webpack 5, Turbopack, and Next.js's app router. Dynamic imports of Node built-ins and the pdfium-js subsystem carry /* webpackIgnore: true */ markers so bundlers stop trying to inline platform-specific binaries.
  • #834: DOCX inject_placeholders flag honored end-to-end — image placeholders now appear in markdown/plain/djot output when ImageExtractionConfig.inject_placeholders = true, and the DOCX OCR pipeline runs before rendering so OCR text reaches the final document. Adds extractor-level security regression tests for LaTeX, EPUB, ODT, Jupyter, RST, and RTF inputs (deeply nested envs, unclosed math, oversized control words, entity bombs, depth bombs, large item lists).
  • #836: Prevent base64 image data leaking into structured PDF output when image extraction is disabled. The structure pipeline now suppresses populate_images_from_pdfium and inject_placeholders whenever neither ImageExtractionConfig.extract_images nor pdf_options.extract_images is enabled, so disabled-by-default users no longer see embedded raster blobs in their results.
  • #838: OCR elements are now propagated through the extraction pipeline — image and PDF OCR backends populate ExtractionResult.ocr_elements consistently, fixing downstream consumers that relied on per-token bounding boxes.
  • #839: extraction_timeout_secs now applies to the single-file extract_file / extract_bytes paths. Previously the timeout only fired in batch and async wrappers, so a hostile single document could hang past the configured limit. The timeout is gated on tokio-runtime; non-tokio builds remain timeout-less by design.
  • #797: Chunking presets no longer auto-inject an EmbeddingConfig. Setting ChunkingConfig.preset (e.g. "multilingual") without an explicit embedding field previously caused resolve_preset() to silently inject Some(EmbeddingConfig { model: Preset { name }, ..default() }), which made the extraction pipeline run generate_embeddings_for_chunks() and populate every chunk.embedding with vectors the caller never asked for. Presets now configure chunking parameters only; opt into embedding generation by providing an EmbeddingConfig explicitly.
  • #782: result_format = "element_based" now classifies headings and image placeholders correctly. process_hierarchy maps h1ElementType::Title and h2..h6ElementType::Heading, with the numeric level stored as metadata.additional["heading_level"]. process_content detects single-line markdown ATX headings (# Title, ## Section, …) and emits them as Title/Heading instead of NarrativeText. [Image: …] placeholder lines are now emitted as ElementType::Image carrying the description in metadata.additional["image_description"]. process_images writes metadata.additional["image_index"] so consumers can join elements back to the ExtractionResult.images array by index.
  • #844: Python wrapper extract_* functions no longer crash on every call. Picked up upstream alef 0.11.24+ codegen fixes (kreuzberg-dev/alef#44): #[serde(skip)] is propagated to wrapper structs (no more unknown field 'cancel_token'), api.py wrappers forward arguments by keyword (no more extract_file mime*type/config arg-reorder TypeError), async pyo3 functions emit async def + await, and trait-bridge register*\*helpers are re-exported throughapi.py/**init**.py **all**.
  • force_ocr_pages now reliably yields ExtractionMethod::Mixed even when pages config is not explicitly set. The PDF text path now synthesizes a default PageConfig when force-ocr-pages is non-empty so byte boundaries are always available for splicing OCR text into the right ranges.
  • PDF page extraction strategy enum renamed ExtractionMethodPageExtractionMethod in kreuzberg-pdfium-render to disambiguate from kreuzberg::ExtractionMethod (the new native/ocr/mixed strategy enum from #761). The pdfium variant remains exported via pdfium_render::prelude under the new name.

Changed

  • #787 (log hygiene): Default tracing subscriber now layers ureq=warn, ureq_proto=warn, rustls=warn, hyper_util=warn, hf_hub=info, tower_http=info on top of any user-supplied RUST_LOG, and the API router's TraceLayer demotes per-request/response events to DEBUG (failures stay at WARN). Default info log level no longer produces a wall of HTTP/TLS/transport DEBUG lines around HuggingFace model downloads. Per-target RUST_LOG rules continue to win, so RUST_LOG=ureq=debug still surfaces full transport detail when needed. The HuggingFace fetch chatter itself was never a re-fetch bug — models are cached on disk under HF_HOME and in-process via LazyLock engine caches; the noise was purely third-party transport DEBUG bleeding through.
  • Dependabot bumps: swift-actions/setup-swift 2 → 3 (#841), gradle/actions 4 → 6 (#840), docker/setup-qemu-action 3 → 4 (#833), mlugg/setup-zig 1 → 2 (#832).

4.10.0-rc.4 - 2026-04-28

Cycle 4 of the alef-backed publish-pipeline iteration. Cycle 3 surfaced fourteen build-stage publish failures across Elixir, Ruby, Python, PHP, Node, and C#; this RC bundles the targeted fixes for each.

Fixed

  • Elixir native libs build now uses the actual NIF directory name (kreuzberg_nif). The publish workflow had references to packages/elixir/native/kreuzberg_rustler that never existed in this repo (the crate is named kreuzberg_nif). All five Elixir matrix targets failed in cycle 3 with start process … working directory … invalid. Replaced 16 kreuzberg_rustler references in .github/workflows/publish.yaml.
  • Ruby gem build now declares its async-trait dependency. packages/ruby/ext/kreuzberg_rb/src/lib.rs (alef-generated) imports async_trait::async_trait for trait bridges, but the matching Cargo.toml was missing the dep. Both Ruby gem matrix targets failed in cycle 3 with unresolved import async_trait.
  • Ruby batch_reduce_tokens, chunk_text, chunk_texts_batch, chunk_semantic are now excluded from the alef-generated Magnus binding. alef's Magnus codegen referenced undeclared local bindings (texts_refs, page_boundaries_core) for these functions; the codegen fix is tracked upstream.
  • task php:build now exists (cargo build --release -p kreuzberg-php). The publish workflow invoked this task name; without it both PHP PIE matrix targets failed in cycle 3 with task: Task "php:build" does not exist.
  • Python wheel build on linux-aarch64 symlinks aarch64-linux-gnu-gcc to gcc when running natively in the manylinux container. The repo's .cargo/config.toml pins aarch64-unknown-linux-gnu's linker to the cross-compiler binary name, but that binary doesn't exist in the manylinux_2_28 container on a native ARM runner. Added a symlink in the cibuildwheel before-script-linux step.
  • C# native assets linux-musl-arm64 build no longer trips on packages/dart/rust. docker/Dockerfile.musl-ffi's sed pattern was missing the dart/swift workspace-member exclusions (added in earlier cycles to other Dockerfiles); cargo failed with failed to load manifest for workspace member /build/packages/dart/rust.
  • Node bindings build uses a path-based pnpm filter (pnpm --filter ./crates/kreuzberg-node) instead of the never-resolved pnpm --filter @kreuzberg/node. crates/kreuzberg-node/package.json's name is kreuzberg, so the scoped filter never matched. Updated three sites (workflow + two scripts).
  • alef bumped to v0.11.7. Carries two codegen fixes the regenerated bindings depend on: (1) optional string/bytes arguments in Rust e2e tests now bind to a typed Option<String>/Option<Vec<u8>> and pass via .as_deref() so signatures expecting Option<&str> no longer receive &Option<_>; (2) PHP backend correctly threads data_enum_names through the type mapper for tagged data enums.

4.10.0-rc.2 - 2026-04-28

Cycle 2 of the alef-backed publish-pipeline iteration. RC1 surfaced two failures: the actions/check-registry@v1 and actions/prepare-release-metadata@v1 shims now require alef ≥ 0.11.0 for the check-registry and release-metadata subcommands, but alef.toml's top-level version field still pinned 0.10.4 (which install-alef@v1 resolves "latest" against). Bump the alef pin to 0.11.0 so all kreuzberg jobs install an alef binary that has the new subcommands. Also fix the release-metadata.json artifact upload that was being wiped by the prepare job's re-checkout step (stash to /tmp before re-checkout, restore after).

4.10.0-rc.1 - 2026-04-28

First release candidate of v4.10.0. The release pipeline itself is the headline feature: this RC kicks off the iteration loop that proves out the alef-backed publish workflow against real registry endpoints. Substantive functional changes from v4.9.5 are listed below.

Changed (release pipeline)

  • The publish workflow now runs end-to-end on prerelease tags. Previously, if: !github.event.release.prerelease on the prepare job blocked every RC tag from triggering CI. The gate is removed; RCs publish for real with prerelease dist-tags (npm next, gemspec .pre.rc.N, PyPI rc{N}). Homebrew formula updates remain gated on stable releases via a new is_prerelease metadata flag.
  • task version:set -- <version> is the canonical way to set a release version (wraps alef sync-versions --set). Use it for both stable releases and RCs.
  • Cross-manifest version validation in scripts/publish/validate-version-consistency.sh now glob-discovers the Ruby version.rb file across all three rb-sys-style layouts and silently skips Go/PHP whose version-bearing manifest is absent (those ecosystems version via git tags, not in source).

Fixed

  • C FFI, PHP, Ruby, R bindings shipped 40+ stub functions that returned Not implemented at runtime. Every batch API (batch_extract_file_sync, batch_extract_bytes_sync, plus async variants), extract_file, extract_file_sync, and most of the Ruby gem's surface (21 functions) silently failed with error code 99. Root cause was in the alef binding generator: bare Path was misresolved to Named("Path") and sanitized to String; sanitized batch-tuple params (Vec<(PathBuf, Option<FileExtractionConfig>)>) were never handled by the PHP/Magnus/FFI codegen even though the IR carried the original type for JSON-roundtrip; Magnus rejected every extraction function via an over-strict is_named_ref_param check; and the R backend panicked on every async function. Fixed in alef and regenerated all bindings — Python, Node, FFI all build clean. Only kreuzberg_get_preset remains stubbed (return-type sanitization edge case; tracked separately).
  • #788: Extract images nested inside PDF Form XObjects — lopdf::get_page_images only scanned the page-level Resources dictionary and silently skipped images stored inside Subtype=Form XObjects, which is the structure used by technical drawings composed of tiled raster tiles. PDF image extraction now recursively descends into Form XObjects (up to 8 levels deep) in both the lopdf and pdfium code paths, so all constituent images are collected.
  • Security limits actually enforce now. SecurityLimits config fields (max_nesting_depth, max_entity_length, max_content_size, max_iterations, max_xml_depth, max_table_cells) and the matching SecurityError variants previously advertised protection that no extractor invoked — the validator helpers were #[cfg(test)]-gated and removed in commit c58069201 as dead code, leaving only the config knobs. Five internal validators (StringGrowthValidator, IterationValidator, DepthValidator, EntityValidator, TableValidator) are restored and now run on every extraction path that ingests user-controlled bytes — XML-class formats (DOCX/PPTX/XLSX/ODT/EPUB/SVG/JATS/DocBook/FictionBook/OPML), HTML, JSON/YAML/TOML, tabular extraction (CSV/Excel/HTML tables/DOCX cells), and final text accumulation for plain-text formats (Markdown/Org/RST/LaTeX/Jupyter/RTF). Hostile inputs (billion-laughs entity expansion, depth bombs, cell bombs, quadratic string growth, iteration bombs) now fail with a structured KreuzbergError::Security instead of OOMing or hanging. The validators are internal core-only types; bindings observe the protection through the new unified Security error variant returned from every extract_* entry point. Defaults relaxed where the previous values false-positived on legitimate documents: max_nesting_depth 100 → 1024, max_xml_depth 100 → 1024, max_entity_length 32 → 1 048 576 (per-token cap; cumulative size remains bounded by max_content_size).
  • #789: PDF image extraction would hang indefinitely on documents with thousands of image objects on a single page (observed: 2487 images). The max_images_per_page cap was added to ImageExtractionConfig in #766 but only wired to the structure pipeline's position counting, never to the byte-decoding path; pages exceeding the cap are now skipped with a WARN log before the FlateDecode loop runs. Both extract_images_from_pdf and the pdfium fallback now run inside tokio::task::spawn_blocking, so extraction_timeout_secs can interrupt them. (#800)
  • #794: Fix Helm chart default install broken by two conflicts: (1) the cache init container ran as root while podSecurityContext.runAsNonRoot: true is the default, causing kubelet to reject the pod; (2) Kubernetes service discovery injects KREUZBERG_PORT=tcp://... when the release is named kreuzberg, which the binary parses as a u16 and panics. Fixed by adding runAsNonRoot: false to the init container's securityContext, a new cache.initChown toggle (default true, set to false on fsGroup-aware storage to skip the init container entirely), and defaulting enableServiceLinks: false in the pod spec. (#822)
  • #825: kreuzberg cache manifest no longer fails with E0282 when kreuzberg-cli is built without paddle-ocr or layout-detection (e.g. --no-default-features --features bundled-pdfium). The command now bails with a clear actionable error if invoked at runtime in such a build.
  • @kreuzberg/node prebuilt bindings fail to load on RHEL 8 / AlmaLinux 8 / Rocky 8 / RHEL 9: the Linux x64/arm64 GNU prebuilds are now built via cargo-zigbuild, which caps the glibc floor at link time. Fixes the GLIBC_2.38 not found / GLIBCXX_3.4.31 not found / undefined symbol: __isoc23_strtoll load errors on RHEL 8, AlmaLinux 8, Rocky 8 (glibc 2.28) and RHEL 9 (glibc 2.34). Verified locally: the prebuilt .node drops from GLIBC_2.38 / GLIBCXX_3.4.31 down to GLIBC_2.28 / no GLIBCXX dependency. kreuzberg-tesseract/build.rs auto-detects the zigbuild toolchain and (1) disables tesseract's AVX512 codepath (zig/clang requires an explicit evex512 feature that tesseract's CMake doesn't pass) and (2) skips linking stdc++fs (zig's libstdc++ has std::filesystem inline). The publish pipeline now (a) runs objdump -T against each linux-gnu prebuild and rejects any artifact requiring GLIBC_* > 2.28, any GLIBCXX_* symbol, or any __isoc23_* symbol, and (b) loads the prebuilt .node inside redhat/ubi8 (glibc 2.28) and exercises the napi surface before publishing to npm. Refs #352.
  • #781: Fix DOCX OCR pipeline integration — reordered the extraction pipeline to ensure OCR processing runs before document rendering. Markdown, Plain, and Djot renderers now correctly receive and inject OCR text output instead of dropping it or omitting images.
  • #823: Fix WASM loading in Next.js / Turbopack by adding /* webpackIgnore: true */ to dynamic imports of Node.js built-ins and resolving bundling issues in the pdfium-js subsystem.

Added

  • #788: Extract images nested inside PDF Form XObjects — PDF image extraction now recursively descends into Form XObjects (up to 8 levels deep) in both the lopdf and pdfium code paths.

Changed

  • Fix use use duplicate-import syntax error in alef-generated elixir NIF binding (kreuzberg_nif/src/lib.rs).
  • Apply cargo fmt uniformly across all workspace crates (formatting only, no logic changes).
  • Fix typo entrysentries in auto-generated API reference docs.

Added

  • #788: Extract images nested inside PDF Form XObjects — PDF image extraction now recursively descends into Form XObjects (up to 8 levels deep) in both the lopdf and pdfium code paths.

Changed

  • Fix use use duplicate-import syntax error in alef-generated elixir NIF binding (kreuzberg_nif/src/lib.rs).
  • Apply cargo fmt uniformly across all workspace crates (formatting only, no logic changes).
  • Fix typo entrysentries in auto-generated API reference docs.

4.9.5 - 2026-04-23

Fixed

  • #790: Fix GPU acceleration — kreuzberg now bundles CPU-only ONNX Runtime by default (zero-config). When a GPU execution provider (cuda, tensorrt, coreml) is explicitly requested via AccelerationConfig but unavailable, kreuzberg returns an error with setup instructions instead of silently falling back to CPU. Auto mode gracefully falls back to CPU with an info log. For GPU support, set ORT_DYLIB_PATH to a GPU-enabled ONNX Runtime.
  • #791: Fix DOCX OCR extraction — OCR now runs on embedded images before document rendering, and OCR text is injected into the rendered output. Previously, OCR results were discarded and replaced with placeholder text.
  • #783: PaddleOCR backend not utilizing GPU (CUDA) despite AccelerationConfigAccelerationConfig from ExtractionConfig was never reaching PaddleOCR ONNX sessions, silently falling back to CPU. Acceleration is now propagated through OcrConfig to all OCR call sites (image extractor, PDF OCR).
  • #779: Expose PaddleOcrConfig in Python bindings and update OcrConfig for backward compatibility.
  • #792: Fix Ruby gem packaging — exclude staged libpdfium.dylib from gem artifacts by narrowing the native extension glob to only include the compiled kreuzberg_rb.* extension.

Added

  • GPU CI workflow (ci-gpu.yaml) targeting self-hosted GPU runners with NVIDIA GPUs.
  • Comprehensive GPU integration tests covering all ORT-accelerated paths: PaddleOCR (det/cls/rec), layout detection (RT-DETR), embeddings, document orientation detection, and end-to-end extraction. Tests use tracing log capture to verify CUDA EP is actually invoked.

4.9.4 - 2026-04-22

Fixed

  • Ruby gem build failure — add missing max_images_per_page field to ImageExtractionConfig initializer in Ruby binding (kreuzberg-rb), fixing compilation error E0063 on all platforms.
  • Node binding build failure on Linux — stop removing /usr/local/lib/node_modules in CI disk cleanup script; npm was being deleted before pnpm/action-setup could use it, causing spawn npm ENOENT.
  • Homebrew formula publish failure — grant contents: write permission to the publish-homebrew job so gh release upload can attach bottle artifacts (was contents: read).
  • #783: PaddleOCR now correctly utilises the GPU (CUDA) when AccelerationConfig(provider="cuda") is set. Previously self.acceleration on PaddleOcrBackend was always None (hardcoded at construction time), so the ONNX session builder never received the requested execution provider and silently fell back to CPU. AccelerationConfig is now threaded from ExtractionConfig into the ephemeral OcrConfig at each process_image call site (image extractor and both PDF OCR paths), and PaddleOcrBackend::process_image sets the module-level thread-local before the engine-pool slow path — so ONNX sessions are created with the correct provider on first use.

4.9.3 - 2026-04-22

Added

  • Layout detection regions on PageContent — new layout_regions field exposes detected layout regions (class, confidence, bounding box, area fraction) from the RT-DETR model when layout detection is enabled. Enables programmatic detection of diagrams, figures, tables, and other content types per page. Available across all 10 bindings. (#579)
  • LayoutRegion type files for Java, PHP, and Elixir bindings (were referenced but missing).
  • E2E assertions for layout regionshas_layout_regions and layout_classes_include assertion types in all 12 language generators.

Fixed

  • #779: Fix PaddleOcrConfig not bound in Python API — exposed PaddleOcrConfig as a first-class class in the Python bindings. Updated OcrConfig to accept both PaddleOcrConfig objects and raw dictionaries for backward compatibility. Added paddle_ocr_config property (getter/setter) to OcrConfig.
  • #770: DOCX page extraction (extract_pages=True) now works correctly — result.pages and result.get_page_count() are no longer always None/0. Two bugs fixed: (1) computed PageContent blocks were never stored on InternalDocument.prebuilt_pages, so the derivation pipeline always fell back to None; (2) page-break markers inside table cells were incorrectly added to the top-level element list, creating phantom page boundaries before tables and corrupting table_page_numbers. Page breaks (w:br[w:type="page"] and w:lastRenderedPageBreak) in body text are stored as DocumentElement::PageBreak and mapped to precise character offsets; breaks inside table cells are intentionally ignored at document level (tables spanning multiple pages remain a known limitation).
  • #773: serve and mcp CLI subcommands now correctly apply KREUZBERG_* environment variable overrides. Previously, variables such as KREUZBERG_OCR_LANGUAGE, KREUZBERG_LLM_MODEL, and KREUZBERG_LLM_API_KEY were silently ignored when starting the API or MCP server — only the extract command honoured them. Also fixes the provider env-var fallback in the LLM client: MISTRAL_API_KEY is now picked up for bare mistral-* model names (e.g. mistral-large-latest), not only for the mistral/ prefix form.
  • #774: Tagged-PDF structure tree dropped paragraph body text when a block had both own text and children, and wrapped numbered section headings in an invalid List → Heading AST (panics comrak in debug, emits malformed markdown in release). flatten_blocks now emits parent text alongside children; text-pattern list detection in element_to_paragraph is gated on heading_level.is_none().
  • Semantic chunker fallback path now respects max_characters — previously the non-embedding fallback hardcoded a 4000-char ceiling and silently ignored the caller's max_characters. A warning is also emitted when chunker_type='semantic' is used without an EmbeddingConfig so the fallback mode is discoverable. The ChunkerType::Semantic docstring has been corrected to describe both paths accurately.
  • OCR backend dispatch: OcrConfig(backend=...) with a non-default backend no longer silently falls back to paddleocr when the chosen backend errors — auto-fallback is limited to the default tesseract backend; users who want multi-backend fallback configure it via OcrConfig.pipeline (unchanged).
  • EasyOCR on PDFs: EasyOCRBackend.supports_document_processing() returns False so Rust's PdfRenderer handles page rendering, removing the implicit pdf2image/pymupdf requirement that was never declared in the [easyocr] extra.
  • Cross-format parity test failure — HTML extractor now normalizes setext headings to ATX and strips trailing whitespace from html-to-markdown-rs output.
  • Broken wasm-deno/wasm-workers e2e tasks — removed non-functional deno and workers e2e generate/lint/test tasks that referenced invalid generator lang values.
  • oxlint path in node e2e lintoxlint --fix typescript changed to oxlint --fix . (was looking for nonexistent typescript/ directory).
  • Clippy warnings in benchmark-harnesssort_by replaced with sort_by_key + Reverse.
  • Clippy warnings and compilation errors across workspace — added missing max_images_per_page field to ImageExtractionConfig in node and Python bindings; added missing vlm_prompt argument to VLM OCR test calls; collapsed nested if-let in WASM embeddings; added embeddings and tree-sitter passthrough features to kreuzberg-ffi to silence unexpected_cfgs warnings.
  • Cancellation token not wired in oxide segment structure pipelinecancel_token was passed into SegmentStructureConfig but never checked, meaning cancellation/timeout had no effect during pdf-oxide table extraction or paragraph building. Added cancellation checks at table page prep, heuristic table extraction loops, and a pre-flight guard before parallel paragraph extraction.
  • #771: OcrConfig.vlm_prompt is now correctly honored in VLM OCR requests. Previously, it was documented but never forwarded to the underlying VLM calls, causing the default template to be used regardless of configuration.
  • #762: PDF image links are no longer silently dropped from markdown output. Image extraction now correctly preserves correspondence between pdfium objects and lopdf data, and respects the inject_placeholders configuration.
  • #769: Downgraded pre-commit-shfmt to v3.13.1-1 (fixes broken CI due to non-existent version in main).
  • #766: PDF extraction with large numbers of image fragments no longer hangs indefinitely — added ImageExtractionConfig.max_images_per_page (default None) to cap images processed per page. Batch-level extraction_timeout_secs now interrupts blocking pdfium threads at the next inter-page checkpoint via a CancellationToken, preventing the timeout from being silently bypassed.
  • #764: PST extractor now populates email attachments — attachments was hardcoded to an empty list and never read from the message; now reads attachment name, filename, MIME type, size, and binary data via the attachment table. PST entry IDs are now formatted as proper 48-char MAPI hex strings instead of Rust Debug output.

Added

  • ImageExtractionConfig.max_images_per_page — optional cap on images decoded per page; prevents hangs on PDFs with thousands of inline image fragments.

Changed

  • Removed redundant .task/workflows/e2e.yml — e2e tasks consolidated in top-level Taskfile.yml.

4.9.2 - 2026-04-19

Fixed

  • Fix cancellation token not checked in WASM (non-tokio) path for Excel, DOC, PPT, Pages, Keynote, and Numbers extractors — cancellation was silently ignored in WASM builds
  • Propagate Cancelled error code (9) to all bindings — Go, C FFI, Python, TypeScript, and C API docs now include the new code
  • Fix PHP e2e embed tests calling instance methods statically — use procedural \Kreuzberg\embed() functions
  • Fix TypeScript e2e embed tests using wrong field names (type/namemodelType/value) for embedding model config
  • Fix Elixir e2e embed tests calling non-existent embed_async/2 — use sync embed/2
  • Fix TypeScript e2e generator missing html_output config mapping for styled HTML tests
  • Fix ORT_DYLIB_PATH on Windows CI pointing to lib/ instead of the actual DLL location
  • Fix C# CI build conditional to require successful FFI build
  • Add libuv1-dev to Linux CI system dependencies for R package builds

4.9.1 - 2026-04-19

Fixed

  • #754: Preserve _internal_bindings.pyi type stub during wheel artifact cleanup — published wheels now include inline type information for the core binding module
  • Add missing Default impl for PyCancellationToken to satisfy clippy new_without_default lint
  • Improve download resilience for eng.traineddata in build script — increase retries from 3 to 5, add fallback URL via raw.githubusercontent.com, and increase timeout to 300s
  • Increase Task installer retry resilience in CI — 5 attempts with --retry-all-errors curl flag

4.9.0 - 2026-04-18

Fixed

  • #588: Suppress C23 glibc symbols (__isoc23_strtoll etc.) in manylinux wheels — added CMake flag propagation and CI verification step to prevent incompatible symbols on glibc < 2.38 (Debian 12, Ubuntu 22.04)
  • #748: Remove kreuzberg-cli from Python wheel to fix libonnxruntime.so.1 loading failure — CLI is available as standalone release
  • #749: Add cancellation token support — cancelled extractions no longer block subsequent calls via PDFIUM_OPERATION_LOCK; wired across Python, Node.js, Ruby, WASM, and C FFI bindings
  • #750: Fix kreuzberg[easyocr] extra silently installing nothing on Python 3.14+; clean up stale [paddleocr] references in docs
  • #752: Fix ~1000x slowdown on Ghostscript-produced PDFs with structured output — replace O(N²) Vec::contains with O(1) AHashSet lookup, add minimum dimension filter for tiny inline images
  • #753: Fix llm_usage returning None when using VLM-based OCR — propagate usage through PDF OCR, image OCR, and force_ocr_pages paths

Added

  • Cancellation token API available in all language bindings (CancellationToken in Python/Node/Ruby/WASM/FFI)

Changed

  • Breaking: kreuzberg-cli binary is no longer bundled in the Python wheel — install the standalone CLI from GitHub releases

4.8.6 - 2026-04-17

Added

  • PST message EntryID in extracted metadata — the entry_id field from Outlook PST message entries is now included in the metadata HashMap of EmailExtractionResult, enabling callers to unambiguously link extracted data back to its source message. (#739)
  • AccelerationConfig wired through all ORT model loadingAccelerationConfig (CUDA, CoreML, TensorRT, Auto) is now propagated to all ONNX Runtime sessions: layout detection (RT-DETR, YOLO, SLANeT, TATR, TableClassifier), embeddings, document orientation, and PaddleOCR. Previously, GPU acceleration was silently ignored and all models used CPU. The acceleration field is also added to LayoutDetectionConfig and EmbeddingConfig across all 11 bindings (Python, TypeScript, Ruby, Go, Java, C#, PHP, R, Elixir, FFI, WASM). (#740)

Added

  • Semantic chunker (ChunkerType::Semantic) for topic-aware document splitting
  • topic_threshold configuration field for embedding-based topic detection
  • utils/markdown_utils shared utility for ATX heading detection
  • preset_chunk_size() helper in embeddings module
  • E2e contract fixtures for semantic chunking

Fixed

  • Batch extraction panics with "Lazy instance has previously been poisoned" on ARM64 Linux — OCR backend registry initialization used panic!() on Tesseract/PaddleOCR init failures, poisoning the Lazy static and cascading to all concurrent batch tasks. Replaced with tracing::warn!() + graceful skip. Also converted GLOBAL_RUNTIME, EXTRACTORS_INITIALIZED, and 3 PROCESSOR_INITIALIZED statics from once_cell::sync::Lazy to once_cell::sync::OnceCell (retry on failure instead of permanent poisoning). Migrated ~15 collection/cache Lazy statics to std::sync::LazyLock. (#741)
  • PaddleOCR model_tier from TOML config ignored by API server — the singleton PaddleOcrBackend always used self.config.model_tier (default "mobile") to resolve models, ignoring the per-request paddle_ocr_config.model_tier from the user's TOML/API config. Engine initialization now uses the effective per-request config. (#725)
  • VLM OCR backend ignored when paddle-ocr feature enabled — the auto-constructed OCR pipeline hardcoded vlm_config: None on pipeline stages, silently discarding the user's VLM configuration. Users who configured OcrConfig(backend="vlm", vlm_config=LlmConfig(...)) got tesseract/paddleocr output instead of VLM. The pipeline now propagates vlm_config from the parent OcrConfig. (#738)
  • Doubled OCR content and corrupted page text in image extraction — OCR elements were injected into the rendering pipeline as OcrText internal elements, causing render_plain to append every raw word token after the coherent HOCR string. ExtractionResult.content was effectively duplicated and pages[*].content contained a word-by-word dump instead of the readable text. OCR elements are now stored directly via prebuilt_ocr_elements, bypassing the rendering pipeline. (#706)
  • Image OCR pages[] emptyinclude_elements was not forced true for image extraction, so backends that gate element output (e.g. paddle-ocr) returned None, leaving pages[] empty. (#723)
  • LlmConfig missing Default trait — the documented ..Default::default() struct-update pattern failed to compile with "trait not satisfied". Added Default to the derive macro; all optional fields default to None, model to "". (#716)
  • Incorrect llm Cargo feature name in docsllm-integration.md, api-rust.md, and configuration.md referenced a llm feature that does not exist; the correct name is liter-llm. (#717)
  • LLM embedding provider panics in server modeembed_texts called block_on inside a new runtime, which panics when already inside tokio (HTTP server, MCP). Uses block_in_place with the current runtime handle when available, falls back to a new runtime for standalone sync callers. (#713, #714)
  • Duplicate output_format key in OCR metadata — stale additional HashMap insert caused a duplicate JSON key violating RFC 8259. The value is already on the typed Metadata::output_format field. (#712)
  • OCR table metadata serialized as strings instead of numberstable_count, tables_detected, table_rows, and table_cols were "0" instead of 0, breaking numeric comparisons in all bindings. (#712)
  • Ruby structured_output not exposed on Result — the field was missing from the Ruby binding's Result class and not serialized from the native extension. (#736)
  • Stale hf-hub lock files block embedding model downloads — cleaned up orphaned lock files before downloading. (#721)
  • WASM live demo enableOcr() not called — OCR was silently unavailable in the demo; also throws on missing Rust registry export. (#719, #720)
  • DOCX tables assigned wrong page numbers — tables were numbered by index instead of by their actual document position based on page breaks. (#718)
  • ocr.enabled=false config ignored — OCR ran even when explicitly disabled; also dropped trailing newline in --format text output. (#715)
  • Go module tag push fallback — added git push fallback when tag push fails.
  • Go E2E LlmUsage type mismatch — generated Go test helper used []interface{}{} instead of []kreuzberg.LlmUsage{}.
  • Rust E2E extractMetadata field name — html_options fixture used camelCase extractMetadata instead of snake_case extract_metadata expected by html-to-markdown-rs v3.2.
  • R package documentation stale — 14 exported functions lacked .Rd man pages and extraction_config.Rd was missing 13 parameters added in v4.8.0–4.8.5. Regenerated all roxygen2 documentation.

Changed

  • Updated all dependencies including html-to-markdown-rs 3.1→3.2, pdf_oxide 0.3.30→0.3.32, tokio 1.51→1.52.

4.8.5 - 2026-04-14

Added

  • LLM usage tracking — new llm_usage field on ExtractionResult captures token counts, estimated cost (USD), model identifier, and finish reason for every LLM call (VLM OCR, structured extraction, LLM embeddings). Multiple entries are produced when multiple LLM calls occur in a single extraction. Exposed across all bindings: Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, R, C FFI, and WASM.

Fixed

  • Markdown chunker duplicates heading when prepend_heading_context is enabled — the heading was prepended twice when a chunk boundary aligned with a heading node, producing repeated heading text in the output. (#701)
  • Helm chart icon 404 on Artifact HubChart.yaml referenced logo.png but the file is logo.svg.
  • Python wheel manylinux compliance failure — bumped manylinux from 2_38 to 2_39 to allow GLIBCXX_3.4.31 symbols from the build toolchain, matching the v4.6.x baseline that worked.
  • Python wheel requires glibc ≥ 2.38 (breaks Debian 12, Ubuntu 22.04) — GCC 14 in the manylinux_2_39 build container emitted C23-versioned glibc symbols (__isoc23_strtoll, __isoc23_sscanf, etc.), making the wheel uninstallable on systems with glibc < 2.38. Downgraded to manylinux_2_28 and added -std=gnu11/-std=gnu++17 CFLAGS to suppress C23 symbol emission. (#588)
  • FFI memory leakkreuzberg_free_result was not freeing djot_content_json, structured_output_json, and llm_usage_json pointers.
  • R e2e embed tests fail — generated R embedding config was missing the type discriminator field required by Rust's tagged enum deserialization.
  • Elixir parity test failsExtractionConfig struct was missing the :html_output field.
  • Go LLM e2e tests failEmbeddingModelType struct was missing Llm nested config, ExtractionConfig was missing StructuredExtraction field.
  • WASM tree-sitter build failstree-sitter-language-pack 1.6.0 removed the wasm feature; removed stale feature gate from wasm32 target dependency.

4.8.4 - 2026-04-13

Added

  • Helm chart for Kubernetes deployment — minimal, security-hardened Helm chart with Deployment, Service, Ingress, PVC, HPA, PDB, and ServiceAccount templates. Publishes to GHCR as an OCI artifact. (#695)
  • Helm lint and kubeconform pre-commit hooks — added helm lint --strict and kubeconform (k8s 1.28.0 schema validation) to pre-commit and CI pipeline.
  • Helm chart publish workflow — new publish-helm.yaml GitHub Actions workflow pushes versioned chart to oci://ghcr.io/kreuzberg-dev/charts.

Fixed

  • Helm chart: init container cannot chown as non-root — the init-cache container needs root to chown the PVC mount. Added securityContext.runAsUser: 0 to the init container.
  • Helm chart: unpinned busybox image tags — pinned busybox:latest to busybox:1.37-glibc in init container and test pod for reproducibility.
  • Comrak bridge panics on multi-byte UTF-8 boundaries — annotation byte offsets landing inside multi-byte characters (e.g. Cyrillic, \u00ab\u00bb) caused panics in build_inlines(). Snaps offsets to valid char boundaries using ceil_char_boundary()/floor_char_boundary(). (#696)

4.8.3 - 2026-04-12

Fixed

  • ONNX session creation fails on Linux x86-64 with "graph_optimization_level is not valid"GraphOptimizationLevel::Level3 maps to ORT_ENABLE_LAYOUT (value 3), only valid in ORT >= 1.21. The Linux wheel bundled ORT 1.20.1 due to a hardcoded version override in the publish workflow. Fixed by switching to GraphOptimizationLevel::All (ORT_ENABLE_ALL = 99, valid across all ORT 1.x) and aligning all ORT versions to 1.24.2 (matching ort-sys 2.0.0-rc.12). Also upgraded manylinux target from manylinux_2_28 to manylinux_2_35 to support the newer ORT binaries. (#683)

Documentation

  • Documented AVX/AVX2 CPU requirement for ONNX Runtime features — CPUs without AVX support (e.g. Intel Atom, Celeron N5105/Jasper Lake) cannot use PaddleOCR, layout detection, or embeddings. Added warning and system requirements entry to installation docs. (#691)

4.8.2 - 2026-04-10

Added

  • HtmlOutputConfig typed in all bindingshtml_output config field (themes, CSS classes, embed CSS, custom CSS, class prefix) now fully typed in Python, TypeScript/Node, Go, Ruby, Elixir, PHP, Java, C#, R, and FFI. Previously only available in Rust core.

Fixed

  • PDF: legitimate repeated content stripped during page merging regardless of strip_repeating_text flagdeduplicate_paragraphs() in the PDF merge pipeline runs unconditionally after per-page extraction, removing consecutive identical paragraphs (≥5 chars) and non-consecutive body-text duplicates (≥15 chars) via HashSet dedup. This strips brand names and other legitimately repeated content even when ContentFilterConfig.strip_repeating_text is set to false. Gated both deduplication passes behind the strip_repeating_text flag so they are skipped when content filtering is disabled (#670, #681)
  • R package build failure — R binding Cargo.toml version was stuck at 4.6.3 while core was at 4.8.1, causing tokio version resolution failure. Version sync script now includes the R native extension Cargo.toml.
  • CI: PyPI publish action failure — pinned pypa/gh-action-pypi-publish to v1.13.0 (v1.14.0 has broken Docker image on GHCR)
  • E2E: Elixir generator emitted undefined is_nan/1 function — added helper function definition to the generated Elixir test helpers

4.8.1 - 2026-04-09

Added

  • Styled HTML output — New HtmlOutputConfig on ExtractionConfig with 5 built-in themes (default, github, dark, light, unstyled), semantic kb-* CSS class hooks on every structural element, CSS custom properties (--kb-*), custom CSS injection (inline or file), and configurable class prefix. The existing Html output format is upgraded in-place when html_output is set (#633, #665)
  • 5 new CLI flags: --html-theme, --html-css, --html-css-file, --html-class-prefix, --html-no-embed-css — any flag implicitly sets --content-format html
  • HtmlOutputConfig and HtmlTheme types exposed in Rust public API

Changed

  • Vendored yake-rust 1.0.3 into kreuzberg core, removing external dependency
  • Fixes #676: BacktrackLimitExceeded panic on large files (10+ MB) by replacing regex-based sentence splitting with memchr-based approach
  • Expanded YAKE stopwords from 34 to 64 languages using kreuzberg's unified stopwords module
  • Removed 6 transitive dependencies (yake-rust, segtok, fancy-regex, streaming-stats, hashbrown, levenshtein)
  • Styled HTML renderer included in the html feature (no separate html-styled feature gate)

Fixed

  • PPTX: panic on non-char-boundary during page boundary recomputation — byte offsets could land inside multi-byte UTF-8 characters (e.g. U+2026), causing a panic when slicing content (#674)
  • PDF: include_headers / include_footers flags ignored by layout-model furniture stripping — when a layout-detection model classified paragraphs as PageHeader or PageFooter, they were unconditionally stripped as furniture regardless of ContentFilterConfig flag values. Setting strip_repeating_text=false with include_headers=true now correctly preserves those regions (#670)
  • PDF: heuristic table detector misclassifies body text as tables on slide-like PDFs — PowerPoint-exported PDFs with column-like text gaps produced false-positive 2–3 row "tables" whose bounding boxes covered the entire page, suppressing all body text from the structured extraction pipeline. Tables with ≤3 rows spanning >50% of the page height are now rejected as false positives
  • PPTX: ImageExtractionConfig.inject_placeholders silently ignored — setting inject_placeholders=false now correctly suppresses ![alt](target) image references in PPTX markdown output (#671, #677)
  • DOCX/HTML/DocBook/LaTeX/RST: inject_placeholders config ignored — all extractors now honour ImageExtractionConfig.inject_placeholders to suppress image reference injection when set to false
  • PPTX public API cleanupextract_pptx_from_path and extract_pptx_from_bytes now accept &PptxExtractionOptions instead of 6 positional parameters

4.8.0 - 2026-04-08

Added

  • Cross-extractor content filtering configuration — New ContentFilterConfig on ExtractionConfig with include_headers, include_footers, strip_repeating_text, and include_watermarks flags. Controls header/footer/furniture inclusion across PDF, DOCX, RTF, ODT, HTML, EPUB, and PPT extractors. Typed in all bindings (Python, TypeScript, Ruby, Go, Elixir, PHP, Java, C#, WASM).
  • Local LLM support via liter-llm 1.2 — use Ollama, LM Studio, vLLM, llama.cpp, LocalAI, or llamafile as VLM OCR, embedding, or structured extraction backends with zero API key configuration
  • LLM-powered document intelligence via liter-llm — Integrates with 146 LLM providers (including local inference engines) for three new capabilities:
  • VLM OCR: Vision language models as OCR backend (OpenAI GPT-4o, Anthropic Claude, Google Gemini, etc.). Superior accuracy for low-quality scans, handwriting, Arabic/Farsi, and complex layouts. Configure via ocr.backend = "vlm" with ocr.vlm_config.
  • Structured Extraction: Extract structured JSON data from documents using a JSON schema constraint. Users provide a schema and optional Jinja2 prompt template; the LLM returns conforming data. Supports strict mode (OpenAI) with automatic schema sanitization for cross-provider compatibility.
  • VLM Embeddings: Provider-hosted embedding models (e.g., openai/text-embedding-3-small, mistral/mistral-embed) as alternative to local ONNX models. Works through existing /embed API, embed_text MCP tool, and embed CLI command.
  • New CLI command: kreuzberg extract-structured for schema-guided LLM extraction
  • New API endpoint: POST /extract-structured with multipart file upload
  • New MCP tool: extract_structured for AI assistant integration
  • Minijinja template engine for customizable LLM prompts — structured extraction supports {{ content }}, {{ schema }}, {{ schema_name }}, {{ schema_description }}; VLM OCR supports {{ language }}
  • 5 new environment variables: KREUZBERG_LLM_MODEL, KREUZBERG_LLM_API_KEY, KREUZBERG_LLM_BASE_URL, KREUZBERG_VLM_OCR_MODEL, KREUZBERG_VLM_EMBEDDING_MODEL
  • LlmConfig and StructuredExtractionConfig types exposed in Python, Node.js, and PHP bindings
  • structured_output field on ExtractionResult across all languages
  • structured_output_json field in C FFI CExtractionResult struct
  • EmbeddingModelType::Llm variant for provider-hosted embeddings
  • VLM OCR registered as plugin backend in OCR registry
  • Standalone text embedding API (#599, #614) with /embed endpoint, embed_text MCP tool, and embed CLI command

Changed

  • License changed from MIT to Elastic License 2.0 (ELv2) — copyright holder changed to Kreuzberg, Inc. Forked upstream crates (kreuzberg-paddle-ocr, kreuzberg-tesseract, kreuzberg-pdfium-render) retain their original MIT licenses.
  • All ExtractionResult constructors refactored to use ..Default::default() for forward compatibility
  • Embed CLI command extended with --provider llm and --model flags
  • Embed MCP tool extended with model and api_key parameters
  • Extract CLI overrides extended with --vlm-model, --vlm-api-key, --vlm-prompt
  • API returns 501 Not Implemented (instead of 500) when liter-llm feature is disabled
  • JSON schema additionalProperties automatically stripped for non-OpenAI providers

Fixed

  • FFI error code tests updated for Embedding variant
  • Flaky FFI string_intern tests serialized with serial_test
  • TypeScript NativeBinding interface updated with embedSync/embed declarations
  • E2E generator emits minimal cfg (no any() wrapper for single conditions)
  • PDF: brand names stripped by repeating text detectionContentFilterConfig.strip_repeating_text = false disables cross-page repeating text removal that incorrectly strips brand names from PowerPoint-exported decks (#667)
  • PPTX: slide order scrambled for decks with 10+ slides — Fixed lexicographic sort of slide paths (slide10.xml before slide2.xml) to use numeric ordering (#669)
  • UTF-8 panic in arXiv watermark strippingstrip_arxiv_watermark_noise panics when a multi-byte character spans the 6000-byte search limit. Fixed with floor_char_boundary (#663)
  • DOC: garbled text from old Word files — CP1252 text misread as UTF-16LE when the fCompressed bit is unreliable. Added heuristic to detect and re-decode garbled output (#666)
  • WASM: table extraction returns empty array — TypeScript validation silently drops tables when pageNumber is null. Fixed to default to page 0 (#655)

4.7.4 - 2026-04-06

Added

  • Re-added --layout boolean CLI flag for easy layout detection enablement (use --layout to enable with model defaults, --layout false to explicitly disable)
  • arXiv watermark/sidebar noise filtering for academic PDFs — strips LaTeX sidebar identifiers from extracted text
  • Second-tier cross-page repeating text detection — catches conference headers and journal running titles that repeat on >70% of pages but appear outside the margin zone
  • Figure/picture text suppression — text inside layout-detected Picture regions is now marked as page furniture and excluded from body output

Fixed

  • Figure-internal text leaking into body output — Text from inside figures and diagrams (e.g., diagram labels, axis text) was incorrectly included in the extracted body content, sometimes promoted to headings. The layout detection pipeline now suppresses text paragraphs classified as Picture regions.
  • CLI tests now correctly reference --content-format instead of deprecated --output-format
  • Empty image references in PDF markdown/HTML output — PDFs with embedded images produced empty ![]() references in markdown and <img src="" alt=""> in HTML output. The PDF structure pipeline now extracts actual image pixel data via pdfium and populates document images, producing proper ![](image_N.png) references.
  • Invalid extractFromFile config in documentation — Demo code in the TypeScript API reference included invalid configuration parameters that caused runtime errors.
  • WASM build failure with extern "C-unwind" — The LLVM WASM backend does not support cleanupret instructions generated by extern "C-unwind" FFI blocks. Added ffi_extern! macro that uses extern "C-unwind" on native targets (for C++ exception safety) and extern "C" on WASM.
  • Go module tag format — Go module tags now use the correct packages/go/v4/vX.Y.Z format matching the module path in go.mod, plus the legacy packages/go/vX.Y.Z format for backwards compatibility. Backfilled tags for all stable releases.

Changed

  • CLI documentation updated with all missing extraction override flags (--layout-table-model, --disable-ocr, --cache-namespace, --cache-ttl-secs)

4.7.3 - 2026-04-05

Fixed

  • Archive extraction SIGBUS crash on macOS ARM64 — ZIP, 7Z, TAR, and GZIP archive extraction crashed with SIGBUS (signal 10) in release builds due to miscompilation of unsafe code in sevenz-rust2 and zip crates under opt-level=3. Reduced optimization level to 2 for these crates. This also fixes Elixir, R, Go, and C benchmark crashes when processing archive files.
  • Native-text PDF extraction fails when OCR backend unavailable (#646) — PDFs with extractable native text hard-failed with ParsingError: All OCR pipeline backends failed when no OCR backend (PaddleOCR/Tesseract) was installed, even though pdfium already extracted text successfully. The automatic OCR quality-enhancement pass now gracefully falls back to the native extraction result when OCR backends are unavailable, emitting a warning instead of failing.
  • Elixir Logger pollutes stdout — Elixir benchmark scripts produced [debug] Initialized Kreuzberg.Plugin.Registry on stdout, corrupting JSON output. Logger default handler now configured to write to stderr via config :logger, :default_handler.
  • WASM benchmark module resolution — WASM benchmark script failed to load @kreuzberg/wasm through pnpm virtual store due to import.meta.url resolution issues in tsx. Changed to direct import from local build path.
  • CI: FFI-dependent tests fail when FFI build skipped — Go, Elixir, R, C FFI, and CLI test jobs ran and failed when build-ffi was skipped by paths-filter. Added needs.build-ffi.result == 'success' guard.
  • Rust cannot catch foreign exceptions crash (#606) — C++ exceptions from Tesseract or Leptonica (e.g. on corrupted images or edge-case inputs) propagated across the FFI boundary unhandled, causing fatal runtime error: Rust cannot catch foreign exceptions, aborting. All Tesseract/Leptonica FFI declarations now use extern "C-unwind" to allow foreign exceptions to unwind safely, and OCR processing is wrapped with catch_unwind to convert them to recoverable errors.

4.7.2 - 2026-04-04

Added

  • E2E generator published modecargo run -p kreuzberg-e2e-generator -- generate --mode published --version <V> generates standalone test apps against published registry versions (PyPI, npm, Maven, NuGet, crates.io, Hex, RubyGems). All 12 language generators now also produce their project/dependency files (pyproject.toml, package.json, composer.json, etc.).

Changed

  • Global model cache (#641) — Models now download to platform-appropriate global cache (~/.cache/kreuzberg/ on Linux, ~/Library/Caches/kreuzberg/ on macOS, %LOCALAPPDATA%/kreuzberg/ on Windows) instead of per-directory .kreuzberg/ folders. Override with KREUZBERG_CACHE_DIR env var. Consolidates 7 duplicate cache-dir resolution implementations into a single cache_dir::resolve_cache_dir() function.

Fixed

  • Embedded HTML in PDF text layers — PDFs with raw HTML in their text layer (<p>, <br />, <a href>) produced escaped garbage (\<p\>) in output. Now detected and converted to clean markdown using html-to-markdown-rs, the same crate and config used by the HTML extractor. Comrak-generated <!-- end list --> comments also stripped from output.
  • Code classification false positives — Layout model sometimes classified regular prose as Code blocks. Added a prose guard that rejects Code classification for text with sentence punctuation, low syntax density, and many words.
  • PageBreak rendering as ----- separators — PageBreak elements in InternalDocument were rendered as ThematicBreak (-----) in markdown and <hr> in HTML output. This polluted extraction output with separators that don't exist in the source document. PageBreak is now treated as structural metadata — paragraph breaks between elements provide sufficient page separation, matching the pdfium baseline behavior.
  • Leptonica DPI crash (#606) — Images with resolution 0 DPI caused Leptonica preprocessing (background normalization, unsharp mask, grayscale conversion) to trigger a C++ exception that Rust cannot catch, aborting the process. Now validates and fixes DPI to 72 before preprocessing. Also disabled C++ exception handling on Windows MSVC builds (/EHsc removed).
  • Node.js ExtractionResult.children missing at runtime — The children field was declared in TypeScript definitions but missing from the runtime NAPI object in the published v4.7.1 binary, causing parity test failures.
  • Layout detection fixture stale preset field — E2E fixture layout_detection.json included removed preset field, causing Python test failures. Removed from fixture.
  • Node.js disable_ocr config not respected — Setting disableOcr: true in the Node.js binding still produced OCR content for images instead of returning empty content.
  • C# Serialization class inaccessible — Generated e2e tests referenced Serialization class with insufficient access level in the published NuGet package.
  • Java PdfAnnotation missing gettersgetContent() and getPageNumber() methods were missing from the Java record, causing parity test failures. Added JavaBean-style getters to match getAnnotationType() and getBoundingBox().
  • Java Table missing gettersgetCells(), getMarkdown(), and getPageNumber() methods were missing from the Java record. Added JavaBean-style getters to match existing getBoundingBox().
  • Go test_app module conflict — Generated Go test_apps used the same module name as e2e/go, causing workspace conflicts. Published mode now uses a distinct module path.
  • PaddleOCR angle classification crash (#643) — V2 angle classifier model (PP-LCNet_x1_0_textline_ori) expects [N, 3, 80, 160] input but preprocessing resized to [N, 3, 48, 192] (old mobile cls dimensions). Fixed input dimensions to match the v2 model.
  • Centralized concurrency controls — Fixed 5 places bypassing resolve_thread_budget(): embeddings ONNX session (no thread config at all), image OCR (hardcoded 8 tasks), batch extraction fallback (num_cpus * 1.5), doc orientation (.min(4) cap), PaddleOCR BaseNet (inter_threads set to num_thread instead of 1).
  • Chunk page numbers missing (#636) — Chunks produced with first_page: null, last_page: null when chunking was configured without explicit pages config. Three fixes: (1) auto-enable page tracking when chunking is configured, so the PDF extractor always produces per-page boundaries; (2) improved page boundary recomputation with first-line fallback when exact content match fails due to rendering transformations; (3) allow zero-length boundaries for blank pages instead of failing validation.

4.7.1 - 2026-04-03

Added

  • Tree-sitter grammar management CLI — New kreuzberg tree-sitter subcommand with download, list, cache-dir, and clean sub-commands for managing tree-sitter grammar parsers. Supports downloading by language name, group (--groups web,systems,scripting), or all (--all). Reads [tree_sitter] config from kreuzberg.toml with --from-config.
  • Tree-sitter grammar management API — New REST endpoints: POST /grammars/download, GET /grammars/list, GET /grammars/cache, DELETE /grammars/cache for programmatic grammar management.
  • Tree-sitter grammar management MCP tools — New MCP tools: download_grammars, list_grammars, grammar_cache_info, clean_grammar_cache for AI assistant-driven grammar management.
  • Tree-sitter config startup initialization — API and MCP servers auto-download tree-sitter grammars on startup when [tree_sitter] config specifies languages or groups.

Changed

  • Normalized OCR+layout pipeline — Tesseract+layout path now follows the same architecture as pdfium+layout: hOCR → PdfParagraph → apply_layout_overridesassemble_internal_document → comrak. Replaces the broken custom apply_layout_to_ocr_document path that destroyed paragraph structure and reading order.
  • Elixir NIF crash protection — All extraction and batch NIFs now wrapped with catch_unwind to prevent panics in native C libraries (pdfium, tesseract) from crashing the BEAM VM. Panics are caught and returned as {:error, reason} tuples with error-level tracing including backtraces.

Fixed

  • hOCR parser depth tracking — Fixed paragraph boundary detection in the hOCR parser that used a generic depth counter for <p>, <span>, and <div> tags. Closing tags from inner word spans could prematurely terminate a paragraph, causing content after that point to be silently dropped. Now uses tag-name-specific depth tracking.
  • hOCR multi-page content loss — Per-page hOCR documents from tesseract always report ppageno=0 (page=1), but the paragraph conversion filtered by the actual page index, silently dropping all content on pages 2+. Removed the per-page filter since each hOCR document is independently extracted per page.
  • OCR batch parallelization — OCR page processing was hardcoded to 4 concurrent pages regardless of available CPUs. Now uses resolve_thread_budget() (auto-detects CPUs, capped at 8) for significantly faster multi-page document processing.
  • Benchmark workflow — Removed reference to deleted kreuzberg-extract binary target.
  • Ruby OCR backend — Added missing ocr_internal_document field to ExtractionResult construction.
  • Keyword extraction tests — Updated test assertions to use new extracted_keywords field instead of deprecated metadata.additional["keywords"].
  • PaddleOCR cache dir test — Fixed test failure when KREUZBERG_CACHE_DIR environment variable is set by CI setup actions.
  • API pdf_password handler — Added #[cfg(feature = "pdf")] gate to prevent compile error when api feature is enabled without pdf.
  • Chunking page boundary regression (#636): Page boundaries were computed against raw extractor text but result.content uses rendered text with different byte lengths. Chunks now recompute boundaries from per-page content, fixing first_page/last_page being null and the "Page boundary byte_end exceeds text length" validation warning.
  • HF Hub environment variables (#634): Use ApiBuilder::from_env() instead of ApiBuilder::new() for Hugging Face model downloads, respecting HF_HOME and HF_ENDPOINT environment variables. Fixes permission errors on Kubernetes when running as non-root.
  • PDF bridge tracing panic on multibyte characters (#635): Use .chars().take() instead of byte indexing for text_preview in PDF structure bridge tracing, preventing panics on multibyte UTF-8 characters (e.g., ).
  • Go FFI struct layout — vendored C header was missing children_json field, causing 8-byte offset shift. All FFI fields after chunks_json read wrong memory (e.g., ocr_elements_json read mime_type instead).
  • Java FFI struct layoutCExtractionResult layout was missing code_intelligence_json field, causing success flag to read from wrong offset. All Java extractions returned success=false.
  • PHP __get magic method bypass — six JSON fields (elements, djotContent, document, ocrElements, children, uris) returned raw JSON strings instead of deserialized arrays because #[php(prop)] intercepted property access before __get.
  • Ruby disable_ocr configdisable_ocr keyword was not parsed in Ruby config handler, causing OCR to run even when explicitly disabled.
  • Node.js ExtractionResult paritydocument, djotContent, and ocrElements fields were Option<Value> which NAPI-RS omitted from JS objects when None. Changed to Value defaulting to null.
  • Node.js convertChunk missing chunkType — TypeScript type converter did not forward the chunk_type field from NAPI bindings.
  • ODT caption text extraction — text inside draw:frame > draw:text-box > text:p (e.g., image captions) was not extracted. The ODT extractor now recurses into text-box content.
  • OCR InternalDocument propagationrun_ocr_pipeline discarded the structured InternalDocument built by extract_with_ocr, causing OCR results to fall back to naive \n\n paragraph splitting. Now propagated through the full pipeline.
  • OCR table cells — OCR-detected tables (via TATR) had empty cells vectors, causing comrak to render them as paragraphs instead of proper tables. Now populated from the cell grid, matching the native text path fix.
  • OCR non-layout InternalDocument — When layout detection is not active, the OCR path now builds an InternalDocument from results instead of returning None. Ensures structured output regardless of layout detection availability.
  • Italian/European PDF ligature corruption — Extended contextual ligature repair to handle tt, ti, tti ligatures common in Italian fonts. Fixes garbled text like Dire*oreDirettore, ges:onegestione, progeMprogetti.
  • OCR layout false heading classification — Tesseract+layout pipeline was worse than pure tesseract (33% vs 41% SF1) because layout confidence threshold was too low (0.5). Raised to 0.7 for OCR path where font-size validation is unavailable.
  • OCR table rendering — OCR-detected tables were not linked to InternalDocument elements, causing comrak to skip them entirely. Tables now properly registered via push_table() with corresponding ElementKind::Table elements.
  • Spurious table detection — Multi-column prose with short cells (like nougat_008) bypassed the prose row check due to a 30-char minimum row length. Lowered to 15 chars so short-cell prose tables are correctly rejected.
  • PHP enum registration — PHP enums (ContentLayer, ElementType, etc.) were registered with .class() instead of .enumeration(), causing empty case lists. Virtual properties on ExtractionResult and ArchiveEntry now declared via builder modifiers for reflection visibility.
  • Go macOS FFI linking — monorepo dev build (ffi_dev.go) was missing -framework Foundation in CGO LDFLAGS, causing linker failures on macOS with CoreML-enabled ONNX Runtime.
  • Unified WASM e2e tests — replaced broken separate Deno/Workers e2e generators with a single vitest-based WASM generator. ORT-dependent features (embeddings, layout, paddle-ocr) gracefully skip.
  • WASM Rayon thread pool panic — Rayon's par_iter() / into_par_iter() and ThreadPoolBuilder::build_global() panicked in WASM (RuntimeError: unreachable) because WASM has no threading support. All Rayon usages now fall back to sequential iteration on wasm32 target.
  • PHP virtual property reflectionClassBuilder::property() declarations for __get-backed fields (metadata, chunks, document, etc.) shadowed the magic method, returning null. Replaced with getter methods that don't interfere with __get. Parity test updated to check both hasProperty() and getter methods.

4.7.0 - 2026-03-30

Added

  • Semantic chunk labeling (#600): Chunks now include a chunk_type field identifying the semantic nature of the content (e.g., paragraph, heading, list_item, table_cell, code_block). Supported across all 11 language bindings with updated E2E test parity.
  • Unified InternalDocument architecture: All extractors now return a canonical InternalDocument with typed elements, relationships, images, and tables. Replaces format-specific intermediate representations.
  • Unified rendering layer: New new_markdown.rs renderer produces CommonMark from InternalDocument, supporting headings, lists, tables, code blocks, formulas, footnotes, images, and inline annotations (bold, italic, links).
  • PDF structure pipeline: Full rewrite of PDF extraction using page.text().all() for clean text, char-indexed font metadata for heading/bold detection, segment-based paragraph gap detection, and pdfium segment bounding boxes for precise paragraph regions.
  • Image extraction across 8 formats: Embedded images now extracted as ExtractedImage with binary data, format, dimensions, and alt text. Supported for DOCX, PPTX, PDF, EPUB, ODT, HTML (data URIs), RTF (hex-decoded), and Markdown/MDX/Jupyter. Markdown output renders as ![alt](image_N.ext) with binary data in ExtractionResult.images.
  • Recursive OCR on embedded images: When OCR is configured, extracted images from EPUB, ODT, HTML, and RTF are processed through process_images_with_ocr(), producing nested ExtractionResult in ExtractedImage.ocr_result.
  • PDF watermark artifact filtering: Uses pdfium's /Artifact content marks (PDF tagged content spec) to identify and filter watermark text from output.
  • Vertical table header reconstruction: Detects and fixes rotated column headers in PDF tables where pdfium extracts characters as spaced single characters in reverse order (e.g., "y t i r o h t u A o N" → "NoAuthority").
  • Position-based page furniture detection: Cross-page repeating text detection now uses actual page margins (top/bottom 10%) and page heights instead of word-count heuristics.
  • html-to-markdown v3 migration: Switched to html-to-markdown v3 with unified convert() API returning ConversionResult (content, metadata, tables, images, document structure in a single call). Uses visitor-based table collection. hOCR module vendored as table_core.
  • Markdown ground truth for 336 documents: Pandoc-generated GT across 10 formats (DOCX, HTML, RTF, PPTX, EPUB, ODT, XLSX, XLS, CSV, DOC) for structural quality benchmarking. All 371 markdown GT files cleaned of HTML remnants (415 tables converted to GFM pipe tables, 28 inline tags fixed).
  • Multi-format benchmark support: Pipeline benchmark now scores all document formats (not just PDF), shows file type per document, replaces NaN with "—", and reports ground truth loading errors.
  • Comprehensive PDF pipeline tracing: Trace-level logging across heading lifecycle (layout overrides, demotion passes, furniture detection, render layer) for debugging.
  • Pages API for PDF extraction: Per-page content now properly wired through the extraction pipeline via prebuilt_pages on InternalDocument, making result.pages available for PDF documents.
  • TOON wire format: Token-Oriented Object Notation support across CLI (--format toon), API (Accept: application/toon), MCP (response_format: "toon"), and all 11 language bindings (Python, Node.js, WASM, C FFI, PHP, Ruby, Elixir, Go, Java, C#, R). TOON is a token-efficient alternative to JSON for LLM prompts — losslessly convertible to/from JSON but uses ~30-50% fewer tokens. Core functions serialize_to_toon() and serialize_to_json() exposed as public API.
  • Renderer registry: Trait-based Renderer and RendererRegistry for custom output format plugins. Built-in renderers (markdown, HTML, djot, plain) registered at startup. External crates can register custom renderers (e.g., DOCX output) via register_renderer().
  • comrak-based rendering: Markdown and HTML rendering now uses comrak AST bridge instead of hand-rolled string building. Produces GFM-compliant markdown and semantic HTML5. Paragraph consolidation merges consecutive same-format paragraphs at sentence boundaries (fixes DOCX CV fragmentation where each visual line was a separate *...* italic block).
  • Benchmark quality scoring improvements: Content normalization for HTML blocks in markdown scoring, Image↔Paragraph and Table↔ListItem type compatibility, correct field in QualityMetrics, HTML detection in ground truth validation.
  • Benchmark harness overhaul: Per-format SF1/TF1 aggregation, noise detection (10 heuristics for HTML remnants, garbled text, broken tables, page artifacts), diagnostic diff mode (--diagnose), JSON output (--json-output), ground truth validation subcommand (validate-gt). Comprehensive tracing across all extractors and the rendering layer.
  • Markdown ground truth for 23 formats: 350+ benchmark fixtures across CSV, DOCX, HTML, EPUB, LaTeX, RST, RTF, PPTX, ODT, XLSX, XLS, OPML, ORG, JATS, IPYNB, FictionBook, DocBook, Typst, DOC, PPT, and more. GT generated via pandoc and verified against source documents.
  • OpenWebUI integration: Kreuzberg serves as a document extraction backend for Open WebUI chat interfaces.
  • URI extraction: New Uri type with UriKind classification (Hyperlink, Image, Anchor, Citation, Reference, Email) extracted from 20+ document formats. URIs are always-on, deduplicated by (url, kind) pair, and capped at 100k per document. Available in ExtractionResult.uris.
  • Recursive email attachment extraction: EML/MSG/PST attachments are now recursively extracted as ArchiveEntry children using the same pattern as archive extractors. Nested message/rfc822 parts also extracted as children. Respects max_archive_depth.
  • PDF embedded file extraction: PDF file attachments (portfolios) are now recursively extracted as ArchiveEntry children via lopdf. Includes filename sanitization, decompression size limits, and name tree depth guards.
  • PDF bookmark/outline extraction: Document outlines (bookmarks) extracted as URIs — page destinations as UriKind::Anchor, external links as UriKind::Hyperlink.
  • DOCX/PPTX embedded object extraction: OLE objects and embedded files from word/embeddings/ and ppt/embeddings/ directories are now recursively extracted as children.
  • PPTX hyperlink extraction: Hyperlinks from slide XML (<a:hlinkClick> in run properties) now resolved via relationship files and extracted as URIs.
  • Image path resolution for markup formats: When using extract_file(), relative image paths in Markdown, MDX, LaTeX, RST, OrgMode, Typst, Djot, and DocBook are resolved from the filesystem and extracted as ExtractedImage data. OS-agnostic with path traversal prevention.
  • Unified image OCR pipeline stage: Image OCR moved from per-extractor calls to a single pipeline stage after derivation. All extracted images (including path-resolved markup images) are now OCR'd uniformly when OCR is configured. Concurrency limited to 8 concurrent tasks.
  • FictionBook image and link extraction: Base64-encoded <binary> images and <a> hyperlinks now extracted from FB2 documents.
  • Apple iWork extractor improvements: Numbers outputs tables instead of paragraphs, Keynote has improved slide structure, Pages has heading detection. All three extract metadata from ZIP plist.
  • code_intelligence field on ExtractionResult: Top-level access to tree-sitter ProcessResult with full structure, imports, exports, chunks, symbols, diagnostics, and docstrings. Previously only available inside FormatMetadata::Code metadata.
  • CodeContentMode config: Control code extraction content mode -- chunks (semantic TSLP chunks, default), raw (source as-is), structure (headings + docstrings only). Configured via TreeSitterProcessConfig.content_mode.
  • TSLP semantic chunking for code: Code files bypass the text-splitter entirely. TSLP's CodeChunks (function/class-aware) map directly to kreuzberg Chunks with semantic types and heading context.
  • Cross-format output parity tests: 36 tests verifying Markdown, HTML, Djot, and Plain produce equivalent text content. GFM lint validation, bracket escaping checks, structural block comparison.
  • HTML input markdown passthrough: HTML files extracted as Markdown now use html-to-markdown output directly via pre_rendered_content, bypassing the lossy InternalDocument to comrak round-trip.

Code Intelligence

  • Tree-sitter integration for 248 programming languages via tree-sitter-language-pack
  • Extract functions, classes, imports, exports, symbols, docstrings, diagnostics
  • Syntax-aware code chunking
  • Language detection from file extension and shebang
  • Dynamic grammar download (native) / 30-language static subset (WASM)
  • New tree-sitter and tree-sitter-wasm feature flags (included in full and wasm-target)
  • TreeSitterConfig and TreeSitterProcessConfig in ExtractionConfig
  • Re-exported TSLP types (ProcessResult, StructureItem, FileMetrics, etc.)
  • TSLP documentation

Typed Metadata

  • New FormatMetadata variants: Code, Csv, Bibtex, Citation, FictionBook, Dbf, Jats, Epub, Pst
  • Extended PptxMetadata with image_count and table_count
  • Migrated deprecated metadata.additional writes to typed fields across all extractors
  • Strong types for all new metadata variants across all 11 language bindings

Breaking Changes

  • Layout detection preset removed: The preset field on LayoutDetectionConfig has been removed across all bindings. Layout detection now uses the RT-DETR v2 model unconditionally — no "fast" vs "accurate" distinction. The --layout-preset CLI flag is removed. Old configs with "preset": "..." are silently ignored for backward compatibility.
  • Table model config typed: table_model on LayoutDetectionConfig changed from Option<String> to a TableModel enum (tatr, slanet_wired, slanet_wireless, slanet_plus, slanet_auto, disabled). Defaults to tatr. String values still accepted in JSON/TOML configs.

Fixed

  • PDF table rendering: Populate Table.cells from TATR/SLANeXT grid so comrak renders proper Table nodes instead of wrapping markdown in a Paragraph. Table SF1 improved from 15.5% to 53.7%.
  • Markdown GFM quality: Enable prefer_fenced for code blocks, un-escape brackets/parens (\[ to [), fix code block language spacing in djot.
  • Semantic HTML output: Enable github_pre_lang and full_info_string for code blocks with class="language-X".
  • Djot text normalization: Shared normalize_inline_text() for consistent whitespace handling. MD-to-Djot TF1 now 1.0000.
  • PDF structural extraction quality: Improved heading detection (font-size-ratio H2/H3 differentiation, section numbering patterns, ALL-CAPS detection, paragraph-to-heading rescue pass), table discrimination (reject multi-column prose misclassified as tables via flow-through detection, row-count/column-count ratio, and table quality validation), list detection (multi-token prefix patterns), image scoring (normalize image block matching), and formula detection (math character density heuristic). Layout SF1 improved from 40.7% to 43.7% across 157 verified PDF fixtures.
  • PDF ground truth verified: All 157 PDF benchmark fixtures verified using vision (rendered page images vs GT markdown). 7 broken Mistral OCR GTs with hallucinated content replaced with vision-verified markdown.
  • LaTeX extraction: Convert \href, \emph, \textbf, \textgreater, \verb, \sout, blockquotes, lists, special characters, and typographic ligatures to markdown.
  • XLSX/XLS sheet name headings: Emit ## SheetName heading before each sheet's table, matching pandoc convention.
  • OPML outline headings: All outline nodes now emit headings at appropriate depth, not just parent outlines. Inline HTML in text attributes converted to markdown.
  • IPYNB heading detection: Markdown cells now detect ATX headings and emit proper heading elements. Code cell outputs (stdout, execute_result) included in extraction.
  • JATS abstract and references: Abstract section with sub-headings now included. References rendered as numbered list with structured citation formatting.
  • ODT formula extraction: Embedded MathML formula objects extracted as formula content instead of empty image placeholders. Image alt text and captions now extracted from draw:frame elements.
  • PPTX slide titles: Title placeholders detected via OOXML placeholder type and emitted as H2 headings. Bulleted/numbered lists in slides extracted with proper ListStart/ListEnd wrapping.
  • ORG source blocks: #+BEGIN_SRC blocks converted to fenced code blocks with language annotation. #+BEGIN_EXAMPLE blocks converted to unfenced code blocks. Inline code ~text~ converted to backtick spans. Paragraph line wrapping joined.
  • RST heading levels: Overline+underline document titles assigned H1. Code block language hints preserved from .. highlight:: and .. code:: directives. :: literal block shorthand handled.
  • RTF formatting: Bold/italic/strikethrough formatting now uses exact byte offsets from a unified text+formatting extraction pass, eliminating bold bleeding across paragraphs. Hidden text (\v) suppressed. Hyperlink field parsing fixed. Strikethrough support added. Table row rendering fixed for multi-row tables. Ordered list detection from \listtext markers.
  • HTML preprocessing: Navigation elements, forms, and sidebars now stripped by default. Previously disabled, causing page chrome to appear in extraction output.
  • PDF table detection: Reject false table detections where >70% of cells contain single-word fragments (justified prose incorrectly classified as multi-column table).
  • DocBook root element handling: XML fragments without a root element now wrapped automatically, fixing extraction of multi-element DocBook files.
  • FictionBook poem support: Verse lines (<v>), subtitles, text-author, and date elements within poem blocks now extracted. Heading levels aligned with pandoc conventions.
  • PDF image FlateDecode fallback: When decode_flate_to_png() fails for FlateDecode, CCITT, or JBIG2 streams, images are now re-extracted via pdfium's bitmap rendering pipeline, producing valid PNG output instead of unusable raw bytes (#615).
  • Metadata standardization: Metadata from PPTX, Excel, ODT, RST, OrgMode, Typst, RTF, JATS, DOC, PPT, HTML, Email, BibTeX, and Citation extractors now mapped to standard Metadata struct fields (title, authors, dates, keywords, language) instead of only additional map.
  • MDX link parity with Markdown: Links and annotations in headings and list items now extracted (was silently dropped).
  • RST hyperlink extraction: Inline hyperlinks (`text <url>`_) and reference targets now extracted.
  • LaTeX \url{} extraction: \url{...} commands now extracted as URIs alongside \href.
  • OrgMode image detection: Added .webp, .bmp, .tiff, .avif to recognized image extensions.
  • BibTeX URI classification: URL fields now correctly classified as Hyperlink (was Citation). Entry title used as label instead of BibTeX key.
  • JATS title field: Article title now stored in metadata.title (was only in subject).
  • PDF bookmark stack safety: Sibling traversal converted from recursion to iterative loop preventing stack overflow on wide outlines.
  • PDF embedded file security: Filename sanitization (strip directory components), decompressed size limit (50MB), name tree depth limit (50 levels).

  • Tesseract C++ exception crash (#606): Fixed fatal runtime error where C++ exceptions from Tesseract unwound through Rust FFI frames, triggering std::terminate(). Now compiles Tesseract with -fno-exceptions on macOS, Linux, and MinGW. The Tesseract CLI executable target (which uses try/catch) is patched out of CMakeLists.txt at build time since only the library is needed.

  • ExtractionConfig rejects unknown fields: #[serde(deny_unknown_fields)] added to ExtractionConfig. Previously, typos or invalid fields (e.g., layout_analysis instead of layout) were silently ignored.

  • RTF delimiter space consumption: Fixed space-in-word bug where font encoding directives (\loch, \hich, \dbch) caused spaces mid-word ("H eading" → "Heading"). Root cause: RTF spec requires consuming trailing delimiter space after control words.
  • PPTX markdown mode: Derive plain/markdown mode from output_format config instead of hardcoding plain=true. Tables now render as markdown tables, lists get bullet markers, text elements get newline separation.
  • EPUB test compilation: Added InternalDocument::content() method and fixed epub_spine_semantics_tests to use it instead of removed .content field.
  • HTML extraction rewrite: Replaced ~400-line manual HTML tag parser with html-to-markdown v3's DocumentStructure mapping. Single-pass conversion eliminates CSS/script content leakage and [image: X] placeholder artifacts.
  • Chunking heading context with plain output: Fixed heading_context always returning None when using plain text output format. The markdown chunker now receives the original markdown for heading map building even when content is rendered as plain text.
  • WASM build compatibility: Inlined workspace-inherited fields (version, edition, authors) in kreuzberg-wasm Cargo.toml because wasm-pack 0.14.0 cannot resolve field.workspace = true references.
  • Pre-commit hooks: Fixed rumdl hook config (use rumdl-fmt from official repo), wasm build (feature-gate layout config access), kreuzberg-node build (missing formatted_content field), broken relative links in READMEs and CHANGELOG.
  • Binding compilation: Added missing formatted_content field to kreuzberg-py and kreuzberg-php binding crates.
  • PDF heading body_size_guard: Narrowed guard range from ≤ body+0.5 to body±1.5pt so headings well below body font size (e.g., 8pt in 12pt body) pass through.
  • RTF table extraction: Fixed critical bug where table cell content was written to both result string and TableState, causing cells to appear as individual lines instead of proper markdown tables.
  • DOCX merged cells: Repeat content across gridSpan (horizontal) and vMerge (vertical) spans. Added source_path field to ExtractedImage for DOCX image relationship paths.
  • DOCX formatting: Merge adjacent runs with identical formatting to prevent spurious **** sequences. Strip <u> underline HTML tags.
  • Python wheel __isoc23_strtoll error on older Linux distributions (#588): Downgraded the Linux build environment manylinux target from manylinux_2_39 to manylinux_2_28 for pre-compiled Python wheels to ensure compatibility with systems using glibc versions prior to 2.39 (e.g., Ubuntu 20.04/22.04, Debian 11/12).
  • clear_ocr_backends now fully clears the registry: Calls shutdown_all() instead of reset_to_defaults(), so the backend list is empty after clearing as expected by the API contract.
  • Go macOS link failure: Added missing -framework Foundation to CGO LDFLAGS. ORT's CoreML provider uses Foundation for NSLog/NSFileManager, causing undefined symbol errors on macOS.
  • Tesseract Windows MinGW build (Elixir/Go/C FFI publish): CMake resolved bare g++ to MSVC cl.exe on CI runners with both toolchains. Added resolve_mingw_compiler() to find absolute paths from MSYS2 subsystem dirs. Bumped Tesseract cache key to invalidate stale MSVC-compiled artifacts.
  • Windows GNU ORT linking: bundled strategy on Windows GNU now uses dynamic linking with pre-downloaded Microsoft ORT (pyke.io has no static binaries for x86_64-pc-windows-gnu). Documented ONNX Runtime DLL requirement for Go, Elixir, and C/C++ on Windows.

Changed

  • PDF text extraction: Full rewrite from segment-indexed assembly to page.text().all() + char-indexed font metadata. Produces cleaner text with correct word spacing.
  • hOCR table reconstruction vendored: HocrWord, reconstruct_table, table_to_markdown moved from html-to-markdown-rs::hocr to kreuzberg::table_core module.
  • CLI format flags: --format (-f) now supports text, json, and toon wire formats. --output-format renamed to --content-format (deprecated alias kept with warning). OutputFormat enum gains Custom(String) variant for extensible format plugins.
  • html-to-markdown-rs v3.0.0: Switched from git dependency to crates.io release.
  • License policy: MPL-2.0 and LGPL-2.1 no longer globally allowed — pinned to specific crate exceptions (cbindgen, option-ext, r-efi). Unicode-DFS-2016 allowed for comrak dependency.

Removed

  • max_upload_mb server config field: Use max_multipart_field_bytes (in bytes) instead. The KREUZBERG_MAX_UPLOAD_SIZE_MB environment variable is also removed — use KREUZBERG_MAX_MULTIPART_FIELD_BYTES.
  • metadata.additional legacy insertions: Pipeline features (chunking, embeddings, language detection, keywords) no longer insert error/status keys into metadata.additional. Errors are available via processing_warnings. Keywords are in extracted_keywords. Embedding status is derivable from chunk embeddings.
  • derive_content_string function: Replaced by render_plain() in the rendering module.

4.6.3 - 2026-03-27

Added

  • Tower service layer (service module): Composable ExtractionService implementing tower::Service with configurable middleware layers (tracing, metrics, timeout, concurrency limit). New tower-service feature flag, auto-enabled by api and mcp. ExtractionServiceBuilder provides ergonomic layer composition.
  • Semantic OpenTelemetry conventions (telemetry module): Formal kreuzberg.* attribute namespace with 30+ span attributes, metric names, and operation/stage constants. Documented conventions for document extraction, pipeline stages, OCR, and model inference telemetry.
  • Extraction metrics: 11 OTel metric instruments (counters, histograms, gauge) covering extraction totals, durations, cache hits/misses, pipeline stages, OCR, and concurrent extractions. Feature-gated behind otel.
  • InstrumentedExtractor wrapper: Automatic per-extractor tracing spans and metrics without per-extractor annotations. Injected at registry dispatch when otel feature is enabled.

Improved

  • Deeper instrumentation: Pipeline post-processing stages (Early/Middle/Late), individual processor execution, OCR operations, and RT-DETR layout model inference now have semantic spans and duration metrics.
  • API and MCP servers use ExtractionService: Both consumers now route extractions through the Tower service stack, getting unified tracing, metrics, and middleware for free.
  • Unified config merge: JSON config merge logic deduplicated between CLI and MCP into a shared function.
  • API server hardening: Added response compression (gzip/brotli/zstd), panic recovery, request-ID correlation, and sensitive header redaction via tower-http middleware.

Changed

  • Removed per-extractor #[instrument] annotations: 29 manual #[cfg_attr(feature = "otel", tracing::instrument(...))] annotations replaced by the automatic InstrumentedExtractor wrapper.
  • Span attribute names migrated to kreuzberg.* namespace: extraction.filename -> kreuzberg.document.filename, extraction.mime_type -> kreuzberg.document.mime_type, etc.

Fixed

  • EPUB spine semantics refactor (#594): Richer OPF package model preserves manifest fallback chains, guide references, and non-linear spine items. Navigation chrome stripped from output. Malformed guide references now produce warnings instead of hard failures. Tested for fallback cycles and empty spines.
  • DOCX image extraction for <a:blip> with child elements (#591): Images with high-quality settings (containing <a:extLst> children) were not extracted because only Event::Empty was handled. Now also handles Event::Start for <a:blip>.
  • OCR table extraction returned empty results via pipeline path (#593): Layout detection was gated behind a needs_structured check, skipping it for the default Plain output format. Tables from run_ocr_pipeline were discarded. Both paths now propagate tables correctly.
  • Missing chunker_type field in bindings (#592): Exposed chunker_type, sizing_cache_dir, and prepend_heading_context fields across Python, TypeScript/WASM, Go, C#, PHP bindings.
  • Full API parity across all 10 bindings: Added max_archive_depth to all bindings. Added missing acceleration, email to Ruby/R. Added layout to PHP. Added 7 missing fields to WASM. Fixed parity script regex for Go slice types.
  • test_pipeline_with_all_features assertion without quality feature: quality_score assertion now gated behind #[cfg(feature = "quality")].
  • Node Windows publish failure: Prepare script fallback used bash-specific mkdir -p and echo > which fail on Windows. Replaced with cross-platform node -e fallback.
  • CI Validate path triggers too narrow: Broadened glob patterns to cover docs/**, biome.json, .task/**, and other lintable paths that prek hooks check.
  • Publish pipeline ORT bundling: Added configurable strategy input (system/bundled) to setup-onnx-runtime action. Set strategy: bundled for all publish jobs so ort-bundled cargo feature takes effect, producing self-contained binaries.

4.6.2 - 2026-03-26

Added

  • PDF page rendering API (#583): New render_pdf_page function and PdfPageIterator for rendering individual PDF pages as PNG images. Available across all 11 language bindings with idiomatic patterns (Python context manager, Go Close(), Java AutoCloseable, C# IDisposable, Elixir Stream, etc.). Default 150 DPI, configurable per call.

Fixed

  • Table recognition coordinate mismatch on scanned PDFs (#582): Layout detection bboxes (640x640 model space) are now scaled to OCR render resolution before TATR table recognition. Previously, coordinate space mismatch caused zero tables to be found.
  • OCR elements report page_number: 1 for all pages (#582): Tesseract resets page numbers per single-page render. Page numbers are now correctly stamped after OCR in the batch loop.
  • Rust E2E tests missing PDF feature: Added pdf feature to the e2e-generator Rust template, fixing 41 UnsupportedFormat("application/pdf") failures.
  • HWP styled extraction empty on ARM: Added skip_on_platform support to Python and Java e2e generators, skipping the hwp_styled fixture on aarch64-unknown-linux-gnu.
  • WASM CI build failure: Made kreuzberg-node prepare script resilient to missing native addon, preventing ENOENT: dist/cli.js during pnpm workspace install.
  • Go C header stale at 4.5.0: Synced header and DefaultVersion constant to match current version.
  • Ruby gem missing ONNX Runtime: Added ort-bundled feature to Ruby native Cargo.toml.
  • Elixir doctest failures: Updated ExtractionConfig.to_map/1 doctests for force_ocr_pages field.
  • WASM benchmark timeout: Reduced per-extraction timeout from 600s to 120s and job timeout from 6h to 2h.

Improved

  • version:sync now syncs Go C header, DefaultVersion, and Docker compose tags: Prevents version drift across language bindings.
  • Publish pipeline commits Elixir NIF checksums back to main: Prevents stale checksums after releases.
  • WASM test app migrated to Deno: Replaced Node.js/vitest with Deno test runner, fixing fetch() unavailability.
  • Docs migrated from MkDocs to Zensical: 4-5x faster incremental builds.

4.6.1 - 2026-03-25

Added

  • Per-file batch extraction timeouts (#546): New extraction_timeout_secs on ExtractionConfig (batch-level default) and timeout_secs on FileExtractionConfig (per-file override). Timeouts apply after semaphore acquisition. New KreuzbergError::Timeout variant with elapsed_ms and limit_ms fields. All binding layers updated.
  • Page-level OCR overrides (#432): New force_ocr_pages option (1-indexed) on both ExtractionConfig and FileExtractionConfig. Enables selective OCR on specific pages of mixed-quality PDFs while preserving native text on others.
  • PST extraction support (#502): Extract emails from Microsoft Outlook PST archives via the outlook-pst crate. Iterative depth-first folder traversal with depth cap of 50. Feature-gated under email.
  • JSONL/NDJSON extraction (#575): Native .jsonl/.ndjson extraction via StructuredExtractor. Registered as application/x-ndjson MIME type.

Fixed

  • OCR elements now propagated to ExtractionResult (#566): OCR elements with geometry data are collected during extraction and set on ExtractionResult.ocr_elements. Hierarchy transformer emits body-level blocks as NarrativeText elements with coordinates. OpenAPI schema registers OCR-related types.
  • OOM crash on multi-page scanned PDFs (#570): Replaced pre-rendering all PDF pages into memory with batched rendering. Pages are now rendered and OCR'd in bounded batches, capping peak memory to batch_size * page instead of page_count * page.
  • OCR memory usage reduced 60-78%: Restructured the OCR batch rendering loop to render-and-encode one page at a time instead of holding all decoded RGB buffers simultaneously. A 98-page scanned PDF dropped from 4.6GB to 1.9GB peak RSS (batch_size=4), and from 3.3GB to 713MB (batch_size=1). Batch size now adapts to available system memory on Linux and macOS.
  • PDF control character encoding artifacts: PDFs with broken ToUnicode font mappings that produce U+0002 (STX) and other control characters where hyphens should appear now have these replaced with hyphens when between word characters, or stripped otherwise. Fixes garbled output like re\x02labellingre-labelling.
  • DocumentStructure missing Heading nodes for PDFs: push_heading_group now inserts a Heading child inside each Group node (matching DOCX builder behavior). Fallback add_paragraphs now detects markdown heading markers and creates heading groups instead of flat paragraphs.
  • Layout detection returns empty tables on scanned PDFs (#574): Three independent bugs caused result.tables to always be [] for scanned/image-based PDFs: (1) layout detection was gated behind a needs_structured output-format check, silently skipping detection for Plain (the default); (2) TATR-recognized tables in the OCR path were inlined as markdown text but never converted to Table structs; (3) run_ocr_with_layout returned only text, discarding table data. All three paths now propagate tables correctly.
  • Table recognition coordinate mismatch on scanned PDFs (#582): Layout detection operates at 640×640 pixels but TATR table recognition and layout-hint classification consumed those coordinates verbatim against OCR-rendered images (e.g. 2480×3508 px at 300 DPI). Bounding boxes never overlapped OCR word positions, producing zero recognized tables and incorrect paragraph-class overrides. Bounding boxes are now scaled from layout-model resolution to the actual OCR render resolution before both recognize_page_tables and detection_to_layout_hints are called.
  • OCR elements report page_number: 1 for all pages (#582): The Tesseract backend resets page_number to 1 for every single-page render. The page-number is now stamped with the correct 1-indexed page index after collecting each batch page's OCR elements.
  • PDF layout engine panic on malformed input (#544): Replaced the panicking .expect() inside the thread-local LayoutEngine initializer in layout_runner.rs with proper Result-based error propagation. A failure to initialise the layout engine now returns a descriptive error instead of crashing the host process via FFI (Python, Node, etc.).

4.6.0 - 2026-03-24

Added

  • Recursive archive extraction: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own ExtractionResult including DocumentStructure, annotations, and metadata. New ArchiveEntry type with path, mime type, and nested result. Configurable via max_archive_depth (default: 3, set to 0 for legacy single-text behavior).
  • YAML/JSON section chunker: New ChunkerType::Yaml variant that splits structured files by keys with full hierarchy paths (e.g., database > primary > host). Auto-inferred from extraction metadata — no explicit chunker_type needed for YAML/JSON files.
  • Unified DocumentStructure DTO: Extended the DocumentStructure model with 7 new node types (Slide, DefinitionList, DefinitionItem, Citation, Admonition, RawBlock, MetadataBlock), 4 new annotation kinds (Highlight, Color, FontSize, Custom), and format-specific attributes bag on every node.
  • DocumentStructureBuilder: Ergonomic builder with heading-driven section nesting, container stack (Quote/Admonition/Slide auto-parenting), and annotation helpers. Replaces hand-constructed DocumentNode structs across all extractors.
  • Unified rendering module: render_to_markdown() and render_to_plain() renderers that walk a DocumentStructure tree to produce consistent output with inline annotation rendering, table pipe escaping, and nested list depth support.
  • DocumentStructure support for all extractors: Every extractor (35 formats) now natively produces a DocumentStructure when include_document_structure is enabled:
  • Office: DOCX (with TextAnnotation from Run formatting, Formula from OMML), PPTX (Slide containers), ODT, DOC, PPT
  • Markup: HTML (1,100-line tag parser with inline annotations), LaTeX, RST (admonitions, definition lists), OrgMode, Markdown, MDX, Djot, Typst
  • Books: EPUB (chapter structure from spine), FictionBook (inline formatting annotations)
  • Scientific: JATS (article structure), DocBook (section hierarchy)
  • Data: Excel (sheet headings + tables), CSV, DBF, JSON/YAML/TOML, BibTeX (citations), Jupyter (code + markdown cells)
  • Other: Email (metadata headers), RTF, OPML (outline hierarchy), HWP, iWork (Keynote/Numbers/Pages), XML, Image (OCR text)
  • DocBook/JATS inline annotations: Semantic inline formatting for academic/technical documents — emphasis, bold, code, links, subscript/superscript mapped to AnnotationKind variants.
  • Document-level OCR: OcrBackend trait supports process_document() for whole-file extraction without per-page rasterization. Up to 30% faster on multi-page documents with better context.

Changed

  • CSV extraction for embedding quality: Produces Row N: Header: Value format instead of space-separated when a header row is detected. Programmatic tables field unchanged.
  • XML extraction for embedding quality: Indented hierarchical output preserving element tree with attributes inline, blank lines between top-level siblings, and xmlns:* filtering.

Improved

  • Zero-copy file I/O: Automatic memory-mapping for files >1MB via memmap2 with SIMD-accelerated UTF-8 validation (simdutf8). Measurable speed improvement for large PDFs and archives. WASM falls back to heap allocation.
  • Unified concurrency management: Centralized thread budget for Rayon, ONNX, and PaddleOCR with configurable ConcurrencyConfig. PDF OCR batched in chunks instead of all-at-once, reducing memory footprint on large documents.

Fixed

  • Incorrect page numbers in element-based output (#557): When result_format="element_based" was used without PageConfig(extract_pages=True), all elements received page_number=1. Now auto-enables extract_pages when element-based output is requested.
  • Misleading PageConfig docstring (#558): Updated docstring and type stub to show default constructor first and document interaction with result_format="element_based".
  • MSG extraction misses compressed RTF bodies (#560): Added PR_RTF_COMPRESSED (0x1009) fallback for .msg files that store the body only in compressed RTF format. Implements MS-OXRTFCP decompression and RTF-to-plain-text stripping.
  • Indexed colour PDF images returned as raw (#561): Palette-based PDF images now decode correctly. Extracts the colour palette from the PDF dictionary and applies palette lookup to produce valid PNG output instead of unusable raw bytes.
  • ODT extraction robustness: Replaced unwraps with safe fallbacks in ODT parsing.

4.5.4 - 2026-03-23

Added

  • Document-level OCR optimization: The OcrBackend trait now supports native process_document() for efficient whole-file extraction without rasterizing individual PDFs to images when the backend supports it (e.g., Python's EasyOCR backend).

Changed

  • OCR protocol clarity: Differentiated process_file to process_image_file in OCR backend trait for clearer protocol semantics.
  • Python refactoring: Removed unused loop variable in EasyOCR implementation.
  • Dependency optimization: Dropped redundant tokio multi-thread feature flag.

Tests

  • Backend registry robustness: Hardened backend registry tests with drop guards and comprehensive mock coverage.

Added

  • PST (Outlook Personal Folders) extraction: New PstExtractor backed by the outlook-pst crate. Traverses the full IPM folder hierarchy iteratively, extracts subject, sender, recipients (TO/CC/BCC), body, and date from every message in the archive. Enabled via the existing email feature flag. MIME type: application/vnd.ms-outlook-pst.

Fixed

  • PDF image extraction panic on mismatched buffer lengths (#552): Replaced assert! in pdf/images.rs with graceful error handling. Malformed PDF images with wrong buffer sizes are now skipped instead of panicking. Regression from v4.5.0.
  • pdf feature compilation without layout-detection (#550): config.layout reference in extraction.rs was not behind a #[cfg(feature = "layout-detection")] gate, causing compilation errors when pdf was enabled without layout-detection.
  • Unused table_model variable warning: Fixed cfg-gating in pipeline.rs so table_model parameter is properly handled when layout-detection feature is disabled.
  • Clippy too_many_arguments on recognize_tables_slanet: Added allow attribute for the 8-parameter function in table_recognition.rs.
  • Ruby binding missing table_model field: Added table_model parsing to LayoutDetectionConfig initializer in Ruby native extension.
  • WASM module resolution in Supabase/Deno edge functions (#551): Added explicit package.json exports for pkg/kreuzberg_wasm.js and WASM binary. Extended wasm-loader.ts with Deno detection and clear error messaging for restricted edge runtimes.
  • zip dependency pinned below 7.4: Avoids let-chain build failures on some stable Rust toolchains (#549).
  • Vendored HWP text extraction: Replaced external hwpers crate with vendored subset (~1,650 lines). Eliminates zip 2.x transitive dependency that caused WASM and CI Validate build failures.

Added

  • prepend_heading_context chunking option: When true and chunker_type is Markdown, prepends the heading hierarchy path (e.g. # Title > ## Section) to each chunk's content string. Useful for RAG pipelines where chunks need self-contained structural context. Available across all 10 language bindings, CLI, and WASM. Includes fixture-driven e2e tests and documentation for all languages.

4.5.3 - 2026-03-22

Added

  • Apple iWork Format Support: Native parsing for modern (2013+) .pages, .numbers, and .key files via a new iwork feature flag. Uses zero-allocation protobuf text extraction from Snappy-compressed IWA containers.
  • SLANeXT table structure recognition models: Alternative table structure backends alongside TATR. New table_model field on LayoutDetectionConfig selects the backend. Options: "tatr" (default, 30MB), "slanet_wired" (365MB, bordered tables), "slanet_wireless" (365MB, borderless tables), "slanet_plus" (7.78MB, lightweight), "slanet_auto" (classifier-routed, ~737MB). Available across all 12 language bindings and CLI (--layout-table-model).
  • PP-LCNet table classifier: Automatic wired/wireless table detection for SLANeXT auto mode. Uses center-crop preprocessing with BGR channel order matching PaddleOCR convention.
  • CLI cache warm --all-table-models: Opt-in download of SLANeXT model variants (~730MB). Default warm downloads only RT-DETR + TATR.
  • ISO 21111-10 benchmark fixture: Table-heavy ISO standard document with MinerU ground truth for table extraction benchmarking.

4.5.2 - 2026-03-21

Fixed

  • PDF word splitting in extracted text: Pdfium's text extraction inserted spurious spaces mid-word (e.g. "s hall a b e active" instead of "shall be active"). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (font_size × 0.33 threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact.
  • Markdown underscore escaping: Underscores in extracted text (e.g. CTC_ARP_01) were incorrectly escaped as CTC\_ARP\_01 throughout the markdown output. Underscore escaping has been removed entirely since extracted PDF text contains literal identifiers, not markdown formatting.
  • Page header/footer leakage: Running headers like ISO 21111-10:2021(E) and copyright footers leaked into the document body. Added fuzzy alphanumeric matching to detect repeated header/footer text even when spacing or character extraction varies across pages.
  • R batch function spurious NULL argument: R wrapper batch functions passed an extra NULL positional argument to native Rust functions, causing "unused argument" errors on all batch operations.
  • Elixir Windows ORT DLL staging: ONNX Runtime DLL was only staged in target/release/ but not in priv/native/ where the BEAM VM loads NIFs. OCR/layout/embedding features now work correctly on Windows CI.

Added

  • General extraction result caching: All file types (PDF, Office, HTML, archives, etc.) are now cached — not just OCR results. Repeated extractions of the same file with the same config return instantly from cache.
  • Cache namespace isolation: New cache_namespace field on ExtractionConfig enables multi-tenant cache isolation on shared filesystems. Available via --cache-namespace CLI flag and across all language bindings.
  • Per-request cache TTL: New cache_ttl_secs field on ExtractionConfig overrides the global TTL for individual extractions. Set to 0 to skip cache entirely. Available via --cache-ttl-secs CLI flag.
  • Cache namespace deletion: delete_namespace() removes all cache entries under a namespace. get_stats_filtered() returns per-namespace statistics.
  • Multi-worker cleanup safety: Cache cleanup no longer triggers excessively when multiple worker pods share the same cache directory.
  • Bundled eng.traineddata: English OCR works out of the box with zero runtime configuration (~4MB bundled at build time).
  • Tessdata in cache warm: kreuzberg-cli cache warm now downloads all tessdata_fast language files (~120 languages) to KREUZBERG_CACHE_DIR/tessdata/, giving full Tesseract language support without system packages.
  • Tessdata in cache manifest: kreuzberg-cli cache manifest now includes all tessdata files with source URLs, enabling --sync-cache to download tessdata alongside models.
  • KREUZBERG_CACHE_DIR/tessdata resolution: resolve_tessdata_path() now checks KREUZBERG_CACHE_DIR/tessdata and the bundled build path before falling back to system paths. Resolution order: TESSDATA_PREFIX env → KREUZBERG_CACHE_DIR/tessdata → bundled tessdata → system paths.
  • CLI embed command: Generate vector embeddings from text via kreuzberg embed --text "..." --preset balanced. Supports stdin, multiple texts, JSON/text output. Feature-gated on embeddings.
  • CLI chunk command: Split text into chunks via kreuzberg chunk --text "..." --chunk-size 512. Configurable size, overlap, chunker type, tokenizer model.
  • CLI completions command: Generate shell completions for bash, zsh, fish, powershell via kreuzberg completions <shell>.
  • CLI --log-level global flag: Override RUST_LOG via kreuzberg --log-level debug extract doc.pdf.
  • CLI extraction overrides: 27 flags exposed via ExtractionOverrides struct with #[command(flatten)]. New flags: --layout-preset, --layout-confidence, --acceleration, --extract-pages, --page-markers, --extract-images, --target-dpi, --pdf-extract-images, --pdf-extract-metadata, --token-reduction, --include-structure, --max-concurrent, --max-threads, --msg-codepage, --ocr-auto-rotate.
  • CLI colored output: Text output uses anstyle for colored headers, labels, success values, and dim separators. Respects NO_COLOR env var.
  • API POST /detect: MIME type detection endpoint via multipart file upload.
  • API GET /version: Version info endpoint.
  • API GET /cache/manifest: Model manifest with checksums and sizes.
  • API POST /cache/warm: Eager model download endpoint with embedding preset support.
  • MCP get_version tool: Query server version from MCP clients.
  • MCP cache_manifest tool: Get model manifest via MCP.
  • MCP cache_warm tool: Pre-download models via MCP.
  • MCP embed_text tool: Generate embeddings via MCP (feature-gated).
  • MCP chunk_text tool: Text chunking via MCP.
  • Pipeline table extraction tracing: Added zero-cost tracing::trace! and tracing::debug! logging throughout the layout detection and table extraction pipeline for easier debugging.
  • TATR model availability check: Layout detection now returns an error if table regions are detected but the TATR model is unavailable, instead of silently falling back to degraded extraction.
  • Publish idempotency checks: All publish jobs now have re-check steps using check-registry@v1 before publishing. Added check-elixir-release job for GitHub release asset verification.
  • ARM benchmark runners: Benchmark workflows switched to runner-medium-arm64 for ARM-native performance testing.
  • Registry check tool: python3 scripts/publish/check_all_registries.py <version> checks all 10+ registries and GitHub release assets locally.

Changed

  • CLI batch flags: Batch command now supports all extraction override flags (chunking, layout, acceleration, etc.) via shared ExtractionOverrides struct, matching extract command parity.
  • CLI config architecture: Replaced 13-parameter apply_extraction_overrides function with ExtractionOverrides struct using #[command(flatten)]. Config fields auto-scale as ExtractionConfig evolves.
  • MCP tool architecture: Removed dead tools/ trait-based duplicates; all tools implemented directly in server.rs.

Improved

  • CLI validation: OCR backend values validated (tesseract, paddle-ocr, easyocr). Chunk size/overlap bounds checked. DPI range (36-2400) and layout confidence (0.0-1.0) validated. Zero-value max_concurrent/max_threads rejected. --chunking-tokenizer errors when feature disabled.
  • API validation: Embedding preset names validated in /embed. Chunk max_characters bounds checked (1-1M) in /chunk.
  • MCP validation: Empty paths rejected in batch_extract_files. Chunk max_characters bounds checked in chunk_text. Embedding preset validated in embed_text.
  • Chunk overlap auto-clamping: When --chunk-size is smaller than default overlap, overlap is automatically clamped to size/4 instead of producing a confusing error.

4.5.1 - 2026-03-20

4.5.1 - 2026-03-20

  • Java FFI CBatchResult struct layout mismatch: The count and results fields were swapped in the Java Panama FFM layout, causing all batch extraction operations to fail with memory access errors.
  • Go FFI stale C header: The CExtractionResult struct field order in the Go binding's C header did not match the Rust #[repr(C)] layout (reordered alphabetically in 4.5.0, added djot_content_json). Go read fields at wrong offsets, causing pages_json to deserialize metadata_json instead.
  • FFI LayoutDetectionConfig not feature-gated: The FFI crate unconditionally imported LayoutDetectionConfig and exposed kreuzberg_config_builder_set_layout, causing compilation failures on targets without the layout-detection feature (e.g., x86_64-pc-windows-gnu).
  • Python wheel builds on Linux aarch64: OpenSSL library path was hardcoded to x86_64-linux-gnu in the manylinux build script, failing on aarch64 runners. Now detects architecture via uname -m.
  • R batch function signature mismatch: R wrapper functions were missing the file_configs parameter when calling native Rust functions, causing "Expected Scalar, got Language" errors on all batch operations.
  • R package ORT linking: The R build configuration (config.R) did not link against ONNX Runtime when ORT_LIB_LOCATION was set, causing undefined symbol: OrtGetApiBase at load time.

4.5.0 - 2026-03-20

Added

  • ONNX-based document layout detection: New layout config field enables document layout analysis using RT-DETR v2 with 17 element classes. Supports "fast" and "accurate" presets with auto-downloaded models. Available across all language bindings.
  • SLANet table structure recognition: Detected Table regions are processed by SLANet-plus for neural HTML structure recovery, producing markdown tables with colspan/rowspan support. Now runs on all pages including structure-tree pages (previously skipped).
  • Layout-enhanced heading detection: Layout model SectionHeader and Title regions guide heading detection in both structure tree and heuristic extraction. High-confidence hints (>=0.7) can override font-size-based classification.
  • Multi-backend OCR pipeline: New OcrPipelineConfig enables quality-based fallback across OCR backends (e.g., Tesseract then PaddleOCR) with configurable priority, language, and backend-specific settings.
  • OCR quality thresholds: New OcrQualityThresholds config with 16 tunable parameters for OCR output quality assessment and fallback decisions.
  • OCR auto-rotate: New OcrConfig.auto_rotate flag (default: false) for automatic page rotation detection. Handles 0/90/180/270 degree rotations.
  • PaddleOCR v2 model tier system: New model_tier field with "mobile" (default, ~21MB, fast) and "server" (~172MB, highest accuracy). Both use unified multilingual models (CJK+English in one model). Available across all bindings.
  • AccelerationConfig for GPU/execution provider control: Fine-grained control over ONNX execution providers (CPU, CoreML, CUDA, TensorRT) for layout detection and table recognition. Typed across all bindings.
  • ConcurrencyConfig for thread limiting (#503): New max_threads field caps Rayon, ONNX intra-op threads, and batch concurrency to a single limit. Typed across all bindings.
  • EmailConfig for MSG fallback codepage (#505): Configurable fallback codepage for MSG files lacking a codepage property (default: windows-1252). Set e.g. 1251 for Cyrillic. Typed across all bindings.
  • Per-file extraction configuration (FileExtractionConfig): Per-file config overrides in batch operations. Each file can specify its own OCR, chunking, output format settings. CLI supports --file-configs, MCP supports file_configs parameter.
  • Opt-in single-column pseudo tables (#449): New allow_single_column_tables on PdfConfig (default: false). Allows single-column structured data (glossaries, itemized lists) to be emitted as tables.
  • Experimental: pdf_oxide text extraction backend (pdf-oxide feature): Pure Rust PDF text extraction as an alternative to pdfium. Opt-in only, not included in full feature set.
  • CLI cache warm command: Eagerly downloads all PaddleOCR and layout detection models. Supports --all-embeddings or --embedding-model <preset>. Useful for containerized or offline deployments.
  • CLI cache manifest command: Outputs a JSON manifest of all expected model files with SHA256 checksums, sizes, and source URLs for scripted cache verification.
  • ChunkSizing configuration: sizing_type, sizing_model, and sizing_cache_dir fields exposed in ChunkingConfig across all bindings.
  • Chunk heading context: New HeadingContext type in ChunkMetadata providing heading level and text.
  • ModelManifestEntry type and manifest() / ensure_all_models() methods: Public API for querying and eagerly downloading model cache manifests.
  • SF1 structural quality metrics in benchmark CI: SF1 quality scores now computed alongside TF1, with PDF-specific quality rankings for tracking extraction quality regressions.

Changed

  • Layout preset default: Changed from "fast" to "accurate". The Fast variant has been removed. The "fast" string is still accepted for backwards compatibility.
  • PaddleOCR default model tier: Changed from "server" to "mobile". Mobile models provide equivalent quality on standard documents while being 3-5x faster. Server tier remains available via with_model_tier("server").
  • PaddleOCR v2 models: All models updated to v2 generation (PP-OCRv5 detection, PP-LCNet classification, unified multilingual recognition). V1 models remain available for older versions.
  • Unified multilingual recognition models: PP-OCRv5 unified server (84MB) and mobile (16.5MB) models replace per-script English and Chinese models. Per-script models retained for 9 other script families.
  • Batch API unification: _with_configs batch functions removed; per-file FileExtractionConfig is now an optional parameter on the unified batch functions.
  • Layout pipeline no longer forces heuristic extraction: Structure tree extraction proceeds normally when layout detection is enabled, preserving text quality.
  • Global ONNX model caching: Layout detection and SLANet models are cached globally and reused across extractions, avoiding expensive ONNX session recreation in batch scenarios.
  • Vendored text embedding pipeline: Replaced fastembed dependency with vendored engine using ONNX Runtime directly for tighter integration.
  • Embedding embed() now takes &self instead of &mut self: Enables parallel embedding generation without mutable reference constraints.
  • L2 normalization parallelized: Embedding batches >= 64 vectors now use multi-threaded normalization.
  • padding field in PaddleOcrConfig: Now exposed across Python, TypeScript, Ruby, and Go bindings (previously Rust-only).
  • Language-agnostic section pattern recognition: Headings ending with a period are now allowed when they match structural patterns (section symbol, all-caps, numbered sections). Improves heading detection for legal, academic, and multilingual documents.
  • Layout classification guards: Heading overrides from the layout model now have word count limits, punctuation checks, figure label detection, and body-font-size validation to prevent false heading promotions.
  • Strong typing across bindings: Replaced weak Dictionary/Map/array types with strongly typed config classes in C#, Java, and PHP. Added missing config types to Python stubs, Node.js, Ruby, Elixir, and PHP.

Removed

  • fastembed dependency: Replaced by vendored embedding engine using ONNX Runtime directly.
  • EmbeddingModelType::FastEmbed variant: Use Preset or Custom variants instead.

Fixed

  • C# FFI struct layout mismatch (#538): CExtractionResult struct layout between Rust and C# was mismatched, causing deserialization failures and overflow exceptions that made the C# library completely broken in 4.4.6.
  • PDF force_ocr without explicit OCR config (#495): force_ocr=true was silently ignored when no ocr config block was provided. Now unconditionally triggers the OCR pipeline with default settings.
  • PDF image extraction (#511): Extracted images returned raw compressed data instead of properly decoded image bytes. Now automatically decoded and re-encoded as standard formats (PNG/JPEG).
  • Node.js extractFileInWorker mime_type passthrough (#523): MIME type was silently injected into PDF password config instead of being forwarded to extraction. Now correctly passed through.
  • DOCX parser type inference failure (#519): The zip 8.2.0 dependency introduced type ambiguity in DOCX and XML parsers, causing compilation failures.
  • Python py.typed and .pyi missing from sdist: Type stubs and py.typed marker now included in both wheel and sdist formats.
  • PDF broken CMap word spacing: Geometric validation now vetoes false word boundaries in PDFs with broken font CMaps, fixing "co mputer" -> "computer" style errors.
  • PDF structure tree heading trust: Structure tree heading tags (H1-H6) are now trusted as author-intent metadata. Previously, font-size validation rejected valid headings close to body size.
  • PDF structure tree extraction performance: Text and style maps now built in a single pass, eliminating multi-second extraction times on complex pages.
  • OCR Picture regions suppressing text: Layout-detected Picture regions now preserve embedded text as plain paragraphs instead of silently dropping it.
  • Non-transitive sort comparators: Spatial reading-order sorts now use discrete row buckets instead of tolerance-based grouping, ensuring correct and stable ordering.
  • Page furniture over-stripping: Added bulk and per-paragraph guards to prevent aggressive furniture stripping from removing legitimate content.
  • KREUZBERG_CACHE_DIR not respected by all caches: Embeddings, OCR result cache, and document extraction cache now honor the environment variable.
  • MSG PT_STRING8 encoding: MSG files now correctly decode ANSI string properties using the declared Windows code page instead of UTF-8 lossy conversion.
  • SLANet-Plus ONNX model: Re-exported with shape fix, resolving inference failures that caused all SLANet table extractions to silently fail on macOS CoreML.
  • TATR model panic in batch processing: Model unavailability in parallel closures caused crashes in FFI callers (Java, C#). Now falls back gracefully to heuristic table extraction.
  • Docker musl builds: Alpine/musl Docker images now link against the system ONNX Runtime library, fixing build failures. All features work in musl CLI images.
  • FFI batch functions null handling: C#/Java FFI batch functions now accept NULL for file_config_jsons instead of rejecting it.

Known Issues

  • PHP PIE Windows package temporarily unavailable: The Windows build for the PHP PIE extension is disabled due to a transitive dependency conflict (ort-syslzma-rust2crc version collision on the x86_64-pc-windows-gnu target). Linux and macOS PHP packages are unaffected. Will be resolved when upstream ort updates its lzma-rust2 dependency.
  • WASM: no layout detection, acceleration, or email config: ONNX Runtime does not support WebAssembly, so layout detection (RT-DETR), hardware acceleration config, and concurrency config are unavailable in the WASM binding. OCR via Tesseract WASM and embeddings are supported.

4.4.6

Added

  • dBASE (.dbf) format support: Extract table data from dBASE files as markdown tables with field type support.
  • Hangul Word Processor (.hwp/.hwpx) support: Extract text content from HWP 5.0 documents (standard Korean document format).
  • Office template/macro format variants: Added support for .docm, .dotx, .dotm, .dot (Word), .potx, .potm, .pot (PowerPoint), .xltx, .xlt (Excel) formats.

Fixed

  • DOCX image placeholders missing (#484): Extracting .docx files with extract_images=True no longer produced ![](image) placeholders in the output. The default plain text output path was stripping image references. Image extraction now forces markdown output so placeholders are always included.

Changed

  • Format count updated to 91+: Documentation across all READMEs, docs, and package manifests updated to reflect expanded format support (previously 75+).

4.4.5

Fixed

  • PDF markdown garbles positioned text (#431): PDFs with positioned/tabular text (CVs, addresses, data tables) had their line breaks destroyed during paragraph grouping. Added page-level positioned text detection: when fewer than 30% of lines on a page reach the right margin, short lines are split into separate paragraphs to preserve the document's visual structure.
  • Node worker pool password bug: extractFileInWorker was passing the password argument as mime_type to extract_file_sync, meaning passwords were never applied and MIME detection could break. Password is now correctly injected into config.pdf_options.passwords.
  • Unused import in kreuzberg-node: Removed unused use serde_json::Value import in result.rs that caused clippy warnings.
  • WASM Deno OCR test hang: OCR tests hung indefinitely on WASM Deno because Tesseract synchronous initialization blocks the single-threaded runtime. OCR fixtures are now skipped for the wasm-deno target.
  • WASM camelCase config deserialization: JS consumers send camelCase config keys (e.g. includeDocumentStructure) but serde expects snake_case. Added camel_to_snake transform in parse_config() so config fields are properly deserialized. Fixes document structure extraction returning empty results via WASM.
  • PHP 8.5 array coercion on macOS: On PHP 8.5 + macOS, ext-php-rs coerces #[php_class] return values to arrays instead of objects. Added normalizeExtractionResult() wrapper that transparently converts arrays via ExtractionResult::fromArray().
  • PHP 8.5 support: Upgraded ext-php-rs to 0.15.6 for PHP 8.5 compatibility.
  • Vendoring scripts missing path deps: Ruby and R vendoring scripts failed when workspace dependencies use path instead of version. Added path field handling to format_dependency() and kreuzberg-ffi fixup block to the Ruby vendoring script.
  • pdfium-render clippy lints: Fixed clippy warnings in kreuzberg-pdfium-render crate.

Added

  • CLI --pdf-password flag: New --pdf-password option on extract and batch commands for encrypted PDF support. Can be specified multiple times.
  • MCP pdf_password parameter: Added pdf_password field to extract_file, extract_bytes, and batch_extract_files MCP tool params for better discoverability.
  • API pdf_password multipart field: The HTTP API extract endpoint now accepts a pdf_password multipart field for encrypted PDFs.
  • PdfConfig Default impl: Added Default implementation for PdfConfig to support ergonomic config construction.
  • Binding crate clippy in CI: Added clippy steps to ci-node, ci-python, and ci-wasm workflows (gated to Linux). Added node:clippy, python:clippy, and wasm:clippy task commands.
  • E2E password-protected PDF fixture: Added pdf_password_protected fixture testing copy-protected PDF extraction across all bindings.

Changed

  • All binding crates linted in pre-commit: Removed clippy exclusions for kreuzberg-php, kreuzberg-node, and kreuzberg-wasm from pre-commit config.
  • golangci-lint v2.11.3: Upgraded from v2.9.0 across Taskfile, CI workflows, and install scripts.

4.4.4

Fixed

  • CLI test app fixes: Fixed broken symlinks in CLI test documents, corrected --format to --output-format flag usage, fixed multipart form field name (file=files=) in serve tests, and rewrote MCP test to use JSON-RPC stdin protocol instead of background process detection.
  • Publish idempotency check scripts: Fixed check_nuget.sh and check-nuget-version.sh using bash 4+ ${var,,} syntax incompatible with bash 3.x. Fixed check_pypi.sh and check_packagist.sh writing to $GITHUB_OUTPUT internally instead of stdout (conflicting with workflow-level redirect). Fixed check-rubygems-version.sh false negatives for native gems by switching from gem search to RubyGems JSON API. Fixed check-rubygems-version-python.sh Python operator precedence bug. Fixed check-maven-version.sh using unreliable Solr search API instead of direct repo HEAD request. Fixed stderr redirect missing on diagnostic messages in multiple scripts.
  • Node test app version: Updated Node.js test app to reference v4.4.4 package version.

Changed

  • CLI install with all features: CLI test install script now uses --all-features flag to enable API server and MCP server subcommands.
  • Publish workflow republish support: Added republish input to publish workflow that deletes and re-creates the tag on current HEAD before publishing, enabling clean retag + full republish.

4.4.3

Added

  • PDF image placeholder toggle: New inject_placeholders option on ImageExtractionConfig (default: true). Set to false to extract images as data without injecting ![image](...) references into the markdown content.

Fixed

  • Token reduction not applied (#436): Token reduction config was accepted but never executed during extraction. The pipeline now applies reduce_tokens() when token_reduction.mode is configured.
  • Nested HTML table extraction: Nested HTML tables now extract correctly with proper cell data and markdown rendering, using the visitor-based table extraction API from html-to-markdown-rs.
  • hOCR plain text output: hOCR conversion now correctly produces plain text when OutputFormat::Plain is requested, instead of silently falling back to Markdown.
  • PDF garbled text for positioned/tabular content (#431): PDF text extraction now detects X-position gaps between consecutive characters and inserts spaces when the gap exceeds 0.8 × avg_font_size. Previously, characters placed at specific coordinates without explicit space characters were concatenated without spaces.
  • Chunk page metadata drift with overlap (#439): Chunk byte offsets are now computed via pointer arithmetic from the source text, fixing cumulative drift that caused chunks to report incorrect page numbers when overlap is enabled.
  • Node.js metadata casing: Standardized all Metadata and EmailMetadata fields to camelCase (e.g., pageCount, creationDate, fromEmail) in the Node.js/TypeScript bindings. Also corrected pluralization for authors and keywords.
  • WASM build failure on Windows CI: CMake try-compile checks on Windows used the host MSVC compiler (cl.exe), which rejected GCC/Clang flags like -Wno-implicit-function-declaration. Added CMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY to both build_leptonica_wasm and build_tesseract_wasm to skip linking during cross-compilation checks.
  • WASM OCR build panic when git/patch unavailable: The tesseract WASM patch (tesseract.diff) application panicked when both git apply and patch commands failed. Added programmatic C++ source fixups as a fallback, applying all necessary changes (CPUID guard, pixadebug unique_ptr conversion, source list trimming) via string replacement when the diff patch cannot be applied.

4.4.2

Fixed

  • E2E element type assertions: Fixed element type field name in E2E generator templates for Python, TypeScript, WASM Deno, Elixir, Ruby, PHP, and C#. Each binding uses different casing conventions (Python: dict key element_type, TypeScript/Node: elementType via NAPI camelCase, Elixir: atom-to-string conversion, C#: JSON serialization for snake_case wire value).
  • Ruby PDF annotation extraction: Fixed PdfAnnotation and PdfAnnotationBoundingBox classes not being registered in the autoload list, causing NameError when extracting PDF annotations. Also fixed bounding box field name mismatch between Rust output (x0/y0/x1/y1) and Ruby struct (left/top/right/bottom).
  • Ruby cyclomatic complexity: Refactored build_annotation_bbox in result.rb to extract repeated field lookup pattern, reducing cyclomatic complexity below threshold.
  • WASM OCR blocking event loop: The ocrRecognize() function in the WASM package was running synchronously on the main thread, blocking the Node.js event loop during image decoding and Tesseract OCR processing. This prevented timeouts and other async operations from firing while OCR was in progress. OCR now runs in a worker thread (Node.js worker_threads / browser Web Worker), keeping the main thread responsive.
  • JPEG 2000 OCR decode failure: JPEG 2000 images (jp2, jpx, jpm, mj2) and JBIG2 images failed with "The image format could not be determined" during PaddleOCR and WASM OCR because these code paths used the standard image crate which doesn't support JPEG 2000. A shared load_image_for_ocr() helper now detects JP2/J2K/JBIG2 formats by magic bytes and uses hayro-jpeg2000/hayro-jbig2 decoders across all OCR backends. The ocr-wasm feature now includes these decoders (pure Rust, WASM-compatible).
  • WASM PDF empty content: initWasm() fired off PDFium initialization asynchronously without awaiting it, causing a race condition where PDF extraction could start before PDFium was ready, returning empty content. PDFium initialization is now properly awaited during initWasm().

Added

  • OMML-to-LaTeX math conversion for DOCX: Mathematical equations in DOCX files (Office Math Markup Language) are now converted to LaTeX notation instead of being rendered as concatenated Unicode text. Supports superscripts, subscripts, fractions (\frac), radicals (\sqrt), n-ary operators (\sum, \int), delimiters, function names, accents, equation arrays, limits, bars, border boxes, matrices, and pre-sub-superscripts. Display math uses $$...$$ and inline math uses $...$ in markdown output. Plain text output includes raw LaTeX without delimiters.

  • Plain text output paths for all extractors: When OutputFormat::Plain or OutputFormat::Structured is requested, DOCX, PPTX, ODT, FB2, DocBook, RTF, and Jupyter extractors now produce clean plain text without markdown syntax (#, **, |, ![](image), -, etc.). Previously these extractors always emitted markdown regardless of the requested output format.

  • DOCX: Document::to_plain_text() skips heading prefixes, inline formatting markers, image placeholders, and renders footnotes/endnotes as id: text instead of [^id]: text.
  • PPTX: ContentBuilder respects plain mode — skips # title prefix, image markers, list markers, and uses Notes: instead of ### Notes:.
  • ODT: Heading prefixes (#), list markers (-), and pipe-delimited tables conditionally omitted for plain text.
  • FB2/FictionBook: Inline markers (*, **, `, ~~), heading prefixes, and cite prefixes skipped for plain text.
  • DocBook: Section title prefixes, code fences, list markers, blockquote prefixes, bold figure captions, and pipe tables all conditionally omitted.
  • RTF: Table output in result string uses tab separation instead of pipe-delimited markdown. Image ![image](...) markers omitted for plain text.
  • Jupyter: Skips text/markdown and text/html output types in plain mode, preferring text/plain.

  • cells_to_text() shared utility: Tab-separated plain text table formatter alongside existing cells_to_markdown(). Used by DOCX, PPTX, ODT, RTF, and DocBook extractors for plain text table rendering.

Changed

  • CLI includes all features: kreuzberg-cli now depends on kreuzberg with the full feature set instead of a separate cli subset. The cli feature group has been removed from kreuzberg. This ensures the CLI supports all formats including archives (7z, tar, gz, zip).

Fixed

  • Alpine/musl CLI Docker image: Fixed "Dynamic loading not supported" error when running kreuzberg-cli in Alpine containers. The CLI binary is now dynamically linked against musl libc, enabling runtime library loading for PDF processing.
  • R package Windows installation: Improved Python detection in configure script for Windows environments (added py launcher and RETICULATE_PYTHON support). Symlink extraction errors during source package installation are now handled gracefully.
  • PHP 8.5 precompiled extension binaries: Added PHP 8.5 support alongside existing PHP 8.4 in CI and release workflows.
  • OCR DPI normalization: The normalize_image_dpi() preprocessing logic is now integrated into the OCR pipeline. Images are normalized to the configured target DPI before being passed to Tesseract, and the calculated DPI is set via set_source_resolution(). This eliminates the "Estimating resolution as ..." warning and improves OCR accuracy for images with non-standard DPI.
  • HTML metadata extraction with Plain output: Fixed HTML metadata (headers, links, images, structured data) not being collected when using OutputFormat::Plain (the default). The underlying library's plain text fast path skips metadata extraction; kreuzberg now uses Markdown format internally for metadata collection and converts to plain text separately.
  • PPTX text run spacing: Adjacent text runs within paragraphs are now joined with smart spacing instead of being concatenated directly ("HelloWorld" → "Hello World").
  • CSV Shift-JIS/cp932 encoding detection: encoding_rs is now a non-optional dependency. CSV files with Shift-JIS encoding are correctly decoded instead of producing mojibake. Fallback encoding detection tries common encodings (Shift-JIS, cp932, windows-1252, iso-8859-1, gb18030, big5).
  • EML multipart body extraction: All text/html body parts are now extracted by iterating over all indices instead of only index 0. Nested message/rfc822 parts in multipart/digest are recursively extracted.
  • EPUB media tag leakage: <video>, <audio>, <source>, <track>, <object>, <embed>, <iframe> tags no longer leak into extracted text. Added <br> → newline and <hr> → newline handling.
  • FB2 poem extraction: Added support for <poem>, <stanza>, and <v> (verse) elements. Previously poetry content was silently dropped.
  • FB2 Unicode sub/superscript: Characters inside <sup> and <sub> are converted to Unicode equivalents. Added strikethrough support, horizontal rules for <empty-line>, and footnote extraction from notes body.
  • ODT StarMath-to-Unicode conversion: Mathematical formulas in ODT files are now converted to Unicode equivalents (Greek letters, operators, super/subscripts) instead of raw StarMath syntax.
  • BibTeX output format: Output now uses @type{key, field = {value}} format matching standard BibTeX conventions.
  • LaTeX display math: \[...\] display math environments are converted to $...$ format.
  • RST directive preservation: Field lists, directive markers, and .. code-block:: directives are preserved in extracted text.
  • RTF table cell separators: Plain mode now uses pipe delimiters for table cells instead of tabs.
  • Typst extraction improvements: Layout directives stripped, headings output as plain text, tables extracted with column-aware layout, links output as display text only.
  • DOCX field codes refined: Field instructions (between begin and separate) are now skipped while field results (between separate and end) are preserved. Previously all content between field begin/end was dropped, losing visible text like "Figure 1:" and page numbers.
  • DOCX drawing alt text in plain text: to_plain_text() now emits image alt text from wp:docPr descriptions instead of silently skipping drawings.
  • DOCX/drawing/table XML entity decoding: get_attr() helpers in drawing.rs and table.rs now use quick_xml::escape::unescape() to correctly decode XML entities like &#xA; in attribute values.

4.4.1

Added

  • OCR table inlining into markdown content (#421): When output_format = Markdown and OCR detects tables, the markdown pipe tables are now inlined into result.content at their correct vertical positions instead of only appearing in result.tables. Adds OcrTableBoundingBox to OcrTable for spatial positioning. Sets metadata.output_format = "markdown" to signal pre-formatted content and skip re-conversion.
  • OCR table bounding boxes: OCR-detected tables now include bounding box coordinates (pixel-level) computed from TSV word positions, propagated through all bindings as Table.bounding_box.
  • OCR table test images: Added balance sheet and financial table test images from issue #421 for integration testing.

Fixed

  • OCR test_tsv_row_to_element used wrong Tesseract level: Test specified level: 4 (Line) but asserted Word. Fixed to level: 5 (correct Tesseract word level).

  • MSG recipients missing email addresses: The MSG extractor read PR_DISPLAY_TO which contains only display names (e.g. "John Jennings"), losing email addresses entirely. Now reads recipient substorages (__recip_version1.0_#XXXXXXXX) with PR_EMAIL_ADDRESS and PR_RECIPIENT_TYPE to produce full "Name" <email> output with correct To/CC/BCC separation.

  • MSG date missing or incorrect: Date was parsed from PR_TRANSPORT_MESSAGE_HEADERS which is absent in many MSG files. Now reads PR_CLIENT_SUBMIT_TIME FILETIME directly from the MAPI properties stream, with fallback to transport headers.
  • EML date mangled for non-standard formats: mail_parser parsed ISO 8601 dates (e.g. 2025-07-29T12:42:06.000Z) into garbled output (2000-00-20T00:00:00Z) and replaced invalid dates with 2000-00-00T00:00:00Z. Now extracts the raw Date: header text from the email bytes, preserving the original value.
  • EML/MSG attachments line pollutes text output: build_email_text_output() appended an Attachments: ... line that doesn't represent message content. Removed from text output; attachment names remain in metadata.
  • HTML script/style tags leak in email fallback: The regex-based HTML cleaner for email bodies used .*? which doesn't match across newlines, allowing multiline <script>/<style> content to leak into extracted text. Added (?s) flag for dotall matching.
  • SVG CData content leaks JavaScript/CSS: Event::CData handler in the XML extractor didn't check SVG mode, causing <script> and <style> CDATA blocks to appear in SVG text output.
  • RTF parser leaks metadata noise into text: The RTF extractor did not skip known destination groups (fonttbl, stylesheet, colortbl, info, themedata, etc.) or ignorable destinations ({\*\...}), causing ~17KB of font tables, color definitions, and internal metadata to appear in extracted text.
  • RTF \u control word mishandled: Control words like \ul (underline) and \uc1 were incorrectly interpreted as Unicode escapes (\u + numeric param), producing garbage characters instead of being treated as formatting commands.
  • RTF paragraph breaks collapsed to spaces: \par control words emitted a single space instead of newlines, causing all paragraphs to merge into a single line. Now correctly emits double newlines for paragraph separation.
  • RTF whitespace normalization destroys paragraph structure: normalize_whitespace() treated newlines as whitespace and collapsed them to spaces. Rewritten to preserve newlines while collapsing runs of spaces within lines.

4.4.0

Added

  • R language bindings — Added kreuzberg R package via extendr with full extraction API (sync/async, batch, bytes), typed error conditions, S3 result class with accessors, config discovery, OCR/chunking configuration, plugin system, and 32 documentation snippets.
  • PHP async extraction: Non-blocking extraction via DeferredResult pattern with Tokio thread pool. Includes extractFileAsync(), extractBytesAsync(), batchExtractFilesAsync(), batchExtractBytesAsync() across OOP, procedural, and static APIs. Framework bridges for Amp v3+ (AmpBridge) and ReactPHP (ReactBridge).
  • WASM native OCR (ocr-wasm feature): Tesseract OCR compiled directly into the WASM binary via kreuzberg-tesseract, enabling OCR in all environments (Browser, Node.js, Deno, Bun) without browser-specific APIs. Supports 43 languages with tessdata downloaded from CDN into memory.
  • WASM Node.js/Deno PDFium support: PDFium initialization now works in Node.js and Deno by loading the WASM module from the filesystem. Configurable via KREUZBERG_PDFIUM_PATH environment variable.
  • WASM full-feature build: OCR, Excel, and archive extraction are now enabled by default in the WASM package. All wasm-pack build targets include the ocr-wasm feature.
  • WASM Excel extraction (excel-wasm feature): Calamine-based Excel/spreadsheet extraction available in WASM without requiring Tokio runtime.
  • WASM archive extraction: ZIP, TAR, 7z, and GZIP archive extraction now available in WASM via synchronous extractor implementations.
  • WASM PDF annotations: PDF annotations (text notes, highlights, links, stamps) are now exposed in the WASM TypeScript API via the annotations field on ExtractionResult.
  • C FFI distribution: Official C shared library (libkreuzberg) with cbindgen-generated header, cmake packaging (find_package(kreuzberg)), pkg-config support, and prebuilt binaries for Linux x86_64/aarch64, macOS arm64, and Windows x86_64. Includes 10 test files, benchmark harness integration, and full API reference documentation.
  • Go FFI bindings: Go package (packages/go/v4) consuming the C FFI shared library with prebuilt binaries published as GitHub release assets for all four platforms.
  • C as 12th e2e test language: The e2e-generator now produces C test files exercising the FFI API, with 15 passing test cases.
  • R distribution via r-universe: Switched R package distribution from CRAN to r-universe for faster release cycles and easier native compilation. Includes vendoring script for offline builds.

Fixed

  • DOCX equations not extracted: OMML math content (<m:oMath>, <m:r>, <m:t> elements) was completely ignored by the DOCX parser, causing all equation text (e.g. A=πr², quadratic formula) to be silently dropped. Math runs are now extracted as regular text.
  • DOCX line breaks ignored: <w:br/> elements were not handled, causing adjacent text segments to merge (e.g. timestamps concatenated with following text). Line breaks now insert whitespace.
  • PPTX/PPSX table content lost: Tables were rendered as HTML without whitespace between tags, causing the entire table to tokenize as a single unreadable blob. Tables now render as markdown pipe tables with proper cell separation.
  • PPTX/PPSX/PPTM image markers pollute text: Image references like ![rId2](slide_1_image_rId2.jpg) injected spurious numeric tokens into extracted content. Image markers now use a clean ![image]() format.
  • DOCX image markers pollute text: Drawing references like ![alt](image_3) injected spurious numeric tokens. Changed to ![alt](image).
  • EPUB double-lossy conversion: XHTML content was converted through an XHTML→markdown→plain-text pipeline, losing content at each stage (underscores, asterisks, numeric URLs stripped). Replaced with direct roxmltree traversal that extracts text content from XHTML elements without intermediate markdown.
  • Excel float formatting drops numeric precision: format_cell_to_string() formatted whole-number floats as "1.0" instead of "1", causing numeric token mismatches in quality scoring. Also fixed DateTime handling to use to_ymd_hms_milli() instead of the unavailable as_datetime() API.
  • HTML metadata extraction pollutes content: When using convert_html_to_markdown_with_metadata(), the extract_metadata option was left enabled, causing YAML frontmatter to be prepended to the content string even though metadata was already returned as a struct. Set extract_metadata = false in the metadata extraction path.
  • Markdown extractor loses tokens through AST reconstruction: The markdown extractor parsed content into a pulldown-cmark AST then reconstructed text, losing tokens through transformation. Now returns raw text content directly (after frontmatter extraction) while still parsing the AST for table and image extraction.
  • SVG text extraction includes element prefixes: XML extractor prepended element_name: to all text content, adding spurious tokens. SVG extraction now targets only text-bearing elements (<text>, <tspan>, <title>, <desc>) without prefixes.
  • XML ground truth uses raw source: CSV, XML, and IPYNB ground truth files contained raw source markup (delimiters, tags, JSON structure) instead of expected extracted text, causing quality scores near zero. Regenerated all 20 ground truth files.
  • Elixir benchmark UTF-8 locale: Erlang VM running with latin1 native encoding corrupted UTF-8 strings from Rust NIFs. Added ERL_LIBS path configuration in the benchmark harness.
  • WASM OCR not working (enableOcr() regression): enableOcr() registered the OCR backend only in a JS-side registry, but the Rust extraction pipeline uses a separate Rust-side plugin registry. OCR via extractBytes/extractFile always failed with "OCR backend 'tesseract' not registered". The function now bridges both registries so OCR works end-to-end.
  • WASM tessdata CDN URL returns 404: The NativeWasmOcrBackend tessdata URL pointed to a non-existent path in the tesseract-wasm npm package. Updated to use the official tesseract-ocr/tessdata_fast GitHub repository.
  • XML UTF-16 parsing fails on files with odd byte count: The XML extractor rejected valid UTF-16 encoded files that had a trailing odd byte (e.g. factbook-utf-16.xml) with "Invalid UTF-16: odd byte count". The decoder now truncates to the nearest even byte boundary, matching the lenient approach already used in email extraction.
  • R bindings crash on strings with embedded NUL bytes: Extraction results containing NUL (\0) characters (e.g. from RTF files) caused the R FFI layer to error with "embedded nul in string" since R strings are C-based. NUL bytes are now stripped before passing strings to R.
  • R bindings %||% operator incompatible with R < 4.4: The R package used the %||% null-coalescing operator which is only available in base R >= 4.4, but the package declares R >= 4.2. Added a package-local polyfill for backwards compatibility.
  • API returns HTTP 500 for unsupported file formats (#414): Uploading files with unsupported or undetectable MIME types (e.g. DOCX via curl -F) returned HTTP 500 Internal Server Error instead of HTTP 400 Bad Request. The /extract endpoint now falls back to extension-based MIME detection from the filename when the client sends application/octet-stream, and UnsupportedFormat errors are mapped to HTTP 400 with a clear UnsupportedFormatError response.
  • PDF markdown extraction missing headings/bold for flat structure trees (#391): PDFs where the structure tree tags everything as <P> (common with Adobe InDesign) now produce proper headings and bold text. The structure tree path previously bypassed font-size-based heading classification entirely. Pages with font size variation but no heading tags are now enriched via K-means font-size clustering. Additionally, bold detection now recognizes fonts with "Bold" in the name (e.g. MyriadPro-Bold) even when the PDF doesn't set the font weight descriptor.
  • PaddleOCR backend not found when using backend="paddleocr" (#403): The PaddleOCR backend registered itself as "paddle-ocr" but users and documentation use "paddleocr". The OCR backend registry now resolves the "paddleocr" alias to the canonical "paddle-ocr" name.
  • WASM metadata serialization: Fixed #[serde(flatten)] with internally-tagged enums dropping format_type and format-specific metadata fields. Switched from serde_wasm_bindgen to serde_json + JSON.parse() for output serialization.
  • WASM config deserialization: Fixed camelCase TypeScript config keys (e.g. outputFormat, extractAnnotations) not being recognized by Rust serde. Config keys are now converted to snake_case before passing to the WASM boundary.
  • WASM PDFium module loading: Fixed copy-pkg.js overwriting the real PDFium Emscripten module with a stub init helper. The build script now locates and copies the actual PDFium ESM module (pdfium.esm.js + pdfium.esm.wasm) from the Cargo build output, with a Deno compatibility fix for bare import("module").
  • Email header extraction loses display names: EML and MSG parsers extracted only bare email addresses, discarding sender/recipient display names. From, To, CC, and BCC fields now use "Display Name" <email@example.com> format when a display name is available.
  • Email date header normalized to RFC 3339: The EML parser always converted dates to RFC 3339 format, losing the original date string. Now preserves the raw Date header value and only falls back to RFC 3339 normalization when the raw header is unavailable.
  • Docker builds fail due to missing snippet-runner exclusion: The sed command in Dockerfile.cli, Dockerfile.core, and Dockerfile.full did not remove the snippet-runner workspace member, causing build failures when the crate directory was not COPY'd into the build context.
  • WASM Deno e2e tests skip OCR fixtures: Generated Deno test files called initWasm() but never called enableOcr(), so the Tesseract OCR backend was never registered and all OCR tests silently skipped. The e2e generator now calls enableOcr() after initWasm() in every generated test file.
  • WASM Deno e2e tests ignore pages config: The buildConfig() helper in generated Deno tests did not map the pages extraction config (page markers, page extraction), causing tests with page-related assertions to use defaults. Added mapPageConfig() to the test helper template.

Removed

  • polars dependency: Removed unused polars crate and table_from_arrow_to_markdown dead code from the excel feature. Excel extraction uses calamine directly.

4.3.8

Added

  • MDX format support (mdx feature): Extract text from .mdx files, stripping JSX/import/export syntax while preserving markdown content, frontmatter, tables, and code fences
  • List supported formats API (#404): Query all supported file extensions and MIME types via list_supported_formats() in Rust, GET /formats REST endpoint, list_formats MCP tool, or kreuzberg formats CLI subcommand

Fixed

  • PDF ligature corruption in CM/Type1 fonts: Added contextual ligature repair for PDFs with broken ToUnicode CMaps where pdfium doesn't flag encoding errors. Fixes corrupted text like di!erentdifferent, o"cesoffices, #nancialfinancial in LaTeX-generated PDFs. Uses vowel/consonant heuristic to disambiguate ambiguous ligature mappings. Applied to both structure tree and heuristic extraction paths.
  • PDF dehyphenation across line boundaries: Added paragraph-level dehyphenation that rejoins words broken across PDF line breaks (e.g. soft waresoftware, recog nitionrecognition). Handles both explicit trailing hyphens (Case 1) and implicit breaks where pdfium strips the hyphen (Case 2, using full-line detection). Applied to both structure tree and heuristic extraction paths.
  • PDF page markers missing in Markdown and OCR output (#412): Page markers (insert_page_markers / marker_format) were not inserted when using Markdown output format or OCR extraction since the 4.3.5 pipeline rewrite. Fixed by threading the marker format through the markdown assembly pipeline and OCR page joining. Djot output inherits markers automatically.
  • PDF Djot/HTML output quality parity: Djot and HTML output formats now use the same high-quality structural extraction pipeline as Markdown (headings, tables, bold/italic, dehyphenation). Previously these formats fell back to plain text split into paragraphs, losing all document structure.
  • PDF sidebar text pollution: Widened the margin band for sidebar character filtering from 5% to 6.5% of page width, fixing cases where rotated sidebar text (e.g. arXiv identifiers) leaked into extracted content.
  • Node.js PDF config options not passed to native binding: Fixed extractAnnotations, hierarchy, topMarginFraction, and bottomMarginFraction PDF config fields being silently dropped by the TypeScript config normalizer, causing PDF annotation extraction to always return undefined in the Node.js binding.

4.3.7

Added

  • NFC unicode normalization applied to all extraction outputs, ensuring consistent representation of composed characters across all backends (gated behind quality feature)
  • Configurable PDF page margin fractions (top_margin_fraction, bottom_margin_fraction) in PdfConfig
  • PDF annotation extraction with new PdfAnnotation type supporting Text, Highlight, Link, Stamp, Underline, StrikeOut, and Other annotation types
  • extract_annotations configuration option in PdfConfig
  • annotations field on ExtractionResult across all language bindings (Rust, Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, WASM)

Fixed

  • PDF markdown extraction quality at parity with docling (91.0% avg F1 vs docling's 91.4% across 16 test PDFs, while being 10-50x faster): Replaced PdfiumParagraph::from_objects() with per-character text extraction using pdfium's PdfPageText::chars() API, which correctly handles font matrices, CMap lookups, and text positioning. Adaptive line-break detection uses measured Y-position changes rather than font-size-relative thresholds, fixing PDFs where pdfium reports incorrect unscaled font sizes.
  • PDF markdown extraction no longer drops all content on PDFs with broken font metrics: Added font-size filter fallback — when the MIN_FONT_SIZE filter (4pt) removes all text segments (e.g. PDFs where pdfium reports font_size=1 due to font matrix scaling), the filter is skipped and unfiltered segments are used instead.
  • PDF margin filter no longer drops all content on edge-case PDFs: Added margin filter fallback — when margin filtering removes all text segments (e.g. PDFs where pdfium reports baseline_y values outside expected margin bands), the filter is skipped for that page.
  • PDF ligature repair integrated into per-character extraction: Ligature corruption (fi!, fl#, ff") is now repaired inline during character iteration rather than as a separate post-processing pass, improving both accuracy and performance.
  • PDF multi-column text extraction improved: Federal Register-style multi-column PDFs went from 69.9% to 90.7% F1 by using pdfium's text API which naturally handles reading order.
  • PDF table detection now requires ≥3 aligned columns, eliminating false positives from two-column text layouts (academic papers, newsletters)
  • PDF table post-processing rejects tables with ≤2 columns, >50% long cells, or average cell length >50 chars
  • PDF markdown rendering no longer drops content when pdfium returns zero-value baseline coordinates (fixes missing titles/authors in some LaTeX-generated PDFs)
  • PaddleOCR backend validation now dynamically checks the plugin registry instead of hardcoding, preventing false "backend not registered" errors when the plugin is available (#403)
  • WASM bindings now export detectMimeFromBytes and getExtensionsForMime MIME utility functions
  • Node.js NAPI-RS binding correctly exposes annotations field on ExtractionResult
  • Python output format validation tests updated to reflect json as a valid format (alias for structured)
  • XLSX extraction with output_format="markdown" now produces markdown tables instead of plain text (#405)
  • MCP tools with no parameters (cache_stats, cache_clear) now emit valid inputSchema with {"type": "object", "properties": {}} instead of {"const": null}, fixing Claude Code and other MCP clients that validate schema type (#406)
  • Python get_valid_ocr_backends() now unconditionally includes paddleocr in the returned list, matching all other language bindings
  • TypeScript E2E test generator now maps extract_annotations to extractAnnotations in mapPdfConfig(), fixing annotation assertion failures
  • PHP PdfConfig now includes extractAnnotations, topMarginFraction, and bottomMarginFraction fields, restoring parity with the Rust core config

4.3.6

Added

  • Pdfium PdfParagraph object-based extraction: New markdown extraction path using pdfium's PdfParagraph::from_objects() for spatial text grouping, replacing raw page-object iteration. Provides accurate per-line baseline positions via into_lines() and styled text fragments with bold/italic/monospace detection.
  • Structure tree and content marks API in pdfium-render: New ExtractedBlock, ContentRole, and PdfParagraph types for tagged PDF semantic extraction. Structure tree headings are validated against font size and word count to prevent broken structure trees from misclassifying body text.
  • Modular markdown pipeline: Refactored PDF markdown rendering into focused modules — bridge.rs (pdfium API bridge), lines.rs (baseline grouping), paragraphs.rs (paragraph detection), classify.rs (heading/code classification), render.rs (inline markup), assembly.rs (table/image interleaving), pipeline.rs (orchestration).
  • Text encoding normalization: normalize_text_encoding() in bridge.rs converts trailing soft hyphens (\u{00AD}) to regular hyphens for word-rejoining, strips mid-word soft hyphens, and removes stray C0 control characters from PDF text.
  • Table post-processing validation: Ported post_process_table() from html-to-markdown-rs with 10-stage validation — empty row removal, long cell rejection, data row detection, header extraction, column merging, dimension checks, column sparsity, overall density, content asymmetry, and cell normalization. Eliminates false positive table detections in non-table PDFs.
  • Font quality detection for OCR triggering: Added has_unicode_map_error() to pdfium-render's PdfPageTextChar, wrapping FPDFText_HasUnicodeMapError. During extraction, characters are sampled per page; if >30% have broken unicode mappings (tofu/garbage), OCR fallback is triggered automatically.
  • Extended list prefix detection: Paragraph list detection now recognizes en dashes (), em dashes (), single-letter alphabetic prefixes (a., b), A., B)), and roman numerals (i. through xii.).

Fixed

  • UTF-8 panic in PDF list detection (#398): detect_list_items() assumed all newlines are 1 byte, causing panics on multi-byte UTF-8 content with CRLF line endings. Fixed with proper CRLF-aware newline advancement and char boundary guards in process_content().
  • PaddleOCR backend not respected in Python bindings (#399): _ensure_ocr_backend_registered() silently returned without registering for paddleocr/paddle-ocr backends. These are now correctly skipped like tesseract, letting the Rust core handle them.
  • Ruby gem missing sorbet-runtime at runtime (#400): sorbet-runtime was listed as a development dependency in the gemspec but is required at runtime for T::Struct types. Promoted to a runtime dependency.
  • E2e generator Ruby rubocop warnings: The Ruby e2e generator emitted redundant RSpec/DescribeClass and RSpec/ExampleLength inline disable directives that rubocop autocorrect mangled into invalid syntax. Simplified to only disable Metrics/BlockLength.
  • E2e generator TypeScript npm warnings: Replaced npx with pnpm exec for running biome in the e2e generator, eliminating spurious warnings from pnpm-specific .npmrc settings.
  • Tesseract TSV level mapping off-by-one: OCR element hierarchy levels were incorrectly mapped — levels are 1=Page, 2=Block, 3=Paragraph, 4=Line, 5=Word. Fixed parse_tsv_to_elements to include word-level entries.
  • OCR elements dropped in image OCR path: image_ocr.rs hardcoded ocr_elements to None instead of passing through the elements parsed from Tesseract TSV output.
  • DOCX extractor panic on multi-byte UTF-8 page boundaries (#401): Page break insertion used byte-index slicing on multi-byte UTF-8 content, causing panics. Fixed with char-boundary-safe insertion.
  • Node.js djot_content field missing: JsExtractionResult in kreuzberg-node was not mapping the djot_content field from Rust results, always returning undefined.
  • E2e generator missing mapPageConfig and mapHtmlOptions: TypeScript e2e test generator did not map page extraction or HTML formatting options from fixture configs, causing tests with those options to use defaults.
  • Pipeline test race conditions: Replaced manual REGISTRY_TEST_GUARD mutex with #[serial] from serial_test, fixing flaky failures in test_pipeline_with_quality_processing, test_pipeline_with_all_features, and test_postprocessor_runs_before_validator caused by global registry state pollution between parallel tests.
  • test_pipeline_with_keyword_extraction permanently ignored: Test was marked #[ignore] due to test isolation issues. Fixed the underlying problem — Lazy static prevented re-registration after shutdown_all() — by clearing the processor cache after re-registration.
  • OCR cache deserialization failure: Added #[serde(default)] to OcrConfidence.detection field so cached OCR data from before the field was added can still deserialize.
  • CI validate, Rust e2e, Java e2e, and C# e2e failures: Fixed ChunkerType serde casing, populated djot_content in pipeline for Djot output format, fixed Java/C# e2e test helper APIs.
  • PDF table detection false positives: Table detection precision improved from 50% to 100% by applying post_process_table() validation to both the pdfium and OCR table detection paths. Non-table PDFs (simple.pdf, fake_memo.pdf, searchable.pdf, google_doc_document.pdf) no longer produce spurious table detections.
  • Baseline tolerance drift in PDF line grouping: Line grouping tolerance was computed from the minimum font size across all segments in a line, causing it to shrink when subscripts/superscripts were added. Now anchored to the first segment's font size per line.
  • Paragraph gap detection using minimum spacing: The paragraph break threshold used the minimum inter-line spacing, which was fragile to outlier-tight spacings from superscripts/subscripts. Changed to 25th percentile (Q1) for robustness.

4.3.5

Added

  • PDF markdown output format: Native PDF text extraction now supports output_format: Markdown, producing structured markdown with headings (via font-size clustering), paragraphs, inline bold/italic markup, and list detection — instead of flat text with visual line breaks.
  • Multi-column PDF layout detection: Histogram-based column gutter detection identifies 2+ column layouts (academic papers, magazines) and processes each column independently, preventing text interleaving across columns.
  • Bold/italic detection via font name fallback: When PDF font descriptor flags don't indicate bold/italic, the extractor checks font names for "Bold"/"Italic"/"Oblique" substrings and font weight >= 700 as secondary signals.
  • musl/Alpine Linux native builds for Elixir, Java, and C#: New Docker-based CI jobs build native libraries (libkreuzberg_rustler.so, libkreuzberg_ffi.so) targeting x86_64-unknown-linux-musl and aarch64-unknown-linux-musl. Enables instant install on Alpine Linux and musl-based distributions without compiling from source.
  • Pre-compiled platform-specific Ruby gems: The publish workflow now ships pre-compiled native gems for x86_64-linux, aarch64-linux, arm64-darwin, and x64-mingw-ucrt, eliminating the 30+ minute compile-from-source on gem install kreuzberg. A fallback source gem is still published for unsupported platforms.
  • bounding_box: Option<BoundingBox> field on Table struct: Added spatial positioning data for table extraction, enabling precise table layout reconstruction. Computed from character positions during PDF table detection.
  • bounding_box: Option<BoundingBox> field on ExtractedImage struct: Added spatial positioning data for extracted images, enabling image layout reconstruction in document pipelines.
  • Inline table embedding in PDF markdown output: Tables are inserted at correct vertical position within markdown content instead of being appended at the end. Position determined by bounding box y0 coordinate.
  • Image placeholder injection in PDF markdown output: Image references are inserted with OCR text as blockquotes at correct vertical position matching the image's bounding box.
  • render_document_as_markdown_with_tables() function: New public function for table-aware markdown rendering that embeds tables inline at correct positions and injects image placeholders. Used internally by render_document_as_markdown().
  • inject_image_placeholders() function: New post-processing function for markdown that injects ![Image description]() placeholders and OCR text blockquotes at correct vertical positions in the content.
  • bounding_box field in all language bindings: Added bounding_box (optional BoundingBox) to Table and ExtractedImage types across all 10 language bindings: Python, TypeScript (Node/Core/WASM), Ruby, PHP, Go, Java, C#, and Elixir.

Fixed

  • Pipeline test flakiness: Disabled post-processing in pipeline tests that don't test post-processing, fixing test_pipeline_without_chunking and related tests that failed due to global processor cache poisoning in parallel execution.
  • PHP FFI bridge missing bounding_box: The PHP Rust bridge (kreuzberg-php) was not passing bounding_box through for Table or ExtractedImage, causing the field to always be null despite being defined in the PHP user-facing types.

  • PaddleOCR dict index offset causing wrong character recognition (#395): read_keys_from_file() was missing the CTC blank token (#) at index 0 and the space token at the end, causing off-by-one character mapping errors. Now matches the get_keys() layout used for embedded models.

  • PaddleOCR angle classifier misfiring on short text (#395): Changed use_angle_cls default from true to false. The angle classifier can misfire on short text regions (e.g., 2-3 character table cells), rotating crops incorrectly before recognition. Users can re-enable via PaddleOcrConfig::with_angle_cls(true) for rotated documents.
  • PaddleOCR excessive padding including table gridlines (#395): Reduced default detection padding from 50px to 10px and made it configurable via PaddleOcrConfig::with_padding(). Large padding on small images caused table gridlines to be included in text crops.
  • Ruby CI Bundler gems destroyed by vendoring script: The vendor-kreuzberg-core.py script was deleting the entire vendor/ directory including vendor/bundle/ (Bundler's gem installation). Now only cleans crate subdirectories, preserving Bundler state.
  • PDF document loaded twice for markdown rendering: Eliminated redundant Pdfium initialization and document parsing by rendering markdown speculatively during the first document load, saving 25-40ms per PDF.
  • NaN panics in PDF text clustering and block merging: Replaced expect() calls on partial_cmp with unwrap_or(Ordering::Equal) across clustering, extraction, and markdown modules to handle corrupt PDF coordinates gracefully.
  • PDF heading detection false positives: Added distance threshold to font-size centroid matching — decorative elements with extreme font sizes no longer receive heading levels.
  • PDF list item false positives: Long paragraphs starting with "1." or "-" no longer misclassified as list items (added line count constraint).
  • Silent markdown fallback: tracing::warn messages for markdown rendering failures are no longer gated behind the otel feature flag.
  • PDF font-size clustering float imprecision: Changed exact dedup() to tolerance-based dedup (0.05pt) and added NaN/Inf filtering for font sizes from corrupt PDFs.

  • ExtractionResult typed keyword and quality fields: ExtractionResult now includes typed fields extracted_keywords: Option<Vec<ExtractedKeyword>> and quality_score: Option<f64> instead of untyped metadata.additional entries. Keywords now carry algorithm, score, and position information for better keyword analysis.

  • ProcessingWarning type for extraction pipeline: New ProcessingWarning { source: String, message: String } type added to ExtractionResult.processing_warnings to explicitly surface non-fatal warnings during document processing (e.g., recoverable decoding issues, missing optional features).
  • Metadata typed fields: Metadata struct now includes typed fields category, tags, document_version, abstract_text, and output_format for better structured metadata handling across all language bindings.
  • output_format always populated: The metadata.output_format field is now set for all output formats (plain, markdown, djot, html, structured), not just structured. Previously only the structured format populated this field.
  • Language binding updates for typed fields: All language bindings (Python, TypeScript/Node.js, Ruby, PHP, Go, Java, C#, Elixir) updated with corresponding typed properties matching the Rust API (e.g., extractedKeywords, qualityScore in TypeScript; extracted_keywords, quality_score in Python/Ruby).

Fixed

  • PaddleOCR recognition height mismatch (#390): Changed CRNN_DST_HEIGHT from 32 to 48 pixels to match PP-OCRv4/v5 model input shape [batch, 3, 48, width]. The previous value caused ONNX Runtime dimension errors on all platforms.
  • Go binding: ChunkingConfig missing Embedding field: Added Embedding *EmbeddingConfig to Go's ChunkingConfig struct to match the Rust canonical type. Previously, embedding configuration nested inside chunking was silently dropped during JSON round-trip, causing embedding-enabled extractions to run without embeddings.
  • Go binding: extracted_keywords, quality_score, processing_warnings always nil: The vendored C header (packages/go/v4/internal/ffi/kreuzberg.h) was missing the three new CExtractionResult fields, and convertCResult() never decoded them. Updated the header and added the missing decodeJSONCString calls.
  • extraction_duration_ms missing from Go, Java, PHP, C# bindings: The Metadata.extraction_duration_ms field was present in Rust, TypeScript, and Elixir but absent from four bindings. Added the field with proper serialization/deserialization to all four.
  • C# Metadata.Additional not marked obsolete: The deprecated additional map (superseded by typed fields) was not marked [Obsolete] in C#. Added [Obsolete] attribute matching the Rust deprecation. Also added @Deprecated in Java and // Deprecated: doc comment in Go.
  • Ruby RBS type signatures incomplete: packages/ruby/sig/kreuzberg.rbs lacked struct definitions for all T::Struct types (ExtractedKeyword, ProcessingWarning, BoundingBox, DocumentNode, etc.) and inner result classes (Table, Chunk, OcrElement, etc.). Rewrote with comprehensive type definitions matching types.rb and result.rb.
  • Python .pyi stub missing extraction_duration_ms: Added extraction_duration_ms: int | None to the Metadata TypedDict in _internal_bindings.pyi.

Changed

  • PDF table extraction now computes bounding boxes from character positions: Table bounding box is calculated as the aggregate bounds of all constituent character positions, enabling precise spatial positioning in downstream rendering pipelines.
  • render_document_as_markdown() now delegates to render_document_as_markdown_with_tables() with empty tables: The original function is now a thin wrapper for backward compatibility, with all table-aware rendering logic centralized in the new _with_tables() variant.

  • PaddleOCR recognition models upgraded to PP-OCRv5: Upgraded arabic, devanagari, tamil, and telugu recognition models from PP-OCRv3 to PP-OCRv5 for improved accuracy. All 11 script families now use PP-OCRv5 models.

  • PDFium upgraded to chromium/7678: Upgraded PDFium binary version from 7578 to the latest release (chromium/7678, Feb 2026) across all CI workflows, Docker images, and task configuration. C API is fully backward-compatible with existing bindings.
  • kreuzberg-pdfium-render trimmed to single version: Removed support for 22 legacy PDFium API versions (5961-7350 + future), deleting ~328k lines of dead code including bindgen files, C headers, and ~4,256 version-conditional compilation blocks. Removed XFA, V8, Skia, and Win32 feature-gated code paths.
  • Workspace dependency consolidation: Moved wasm-bindgen, wasm-bindgen-futures, js-sys, web-sys, console_error_panic_hook, and log to workspace-level dependency management, deduplicating versions across kreuzberg-pdfium-render, kreuzberg-wasm, and kreuzberg-ffi.
  • Docker full image: pre-download all PaddleOCR models: Replaced broken single-language model download with all 11 recognition script families (english, chinese, latin, korean, eslav, thai, greek, arabic, devanagari, tamil, telugu) plus dictionaries. Fixed incorrect HuggingFace URLs and cache paths. Added retry logic with backoff for transient HuggingFace 502 errors.
  • Docker test suite: PaddleOCR verification: Added test_paddle_ocr_extraction to the full variant Docker tests to verify pre-loaded models work end-to-end.
  • E2E tests updated for typed extraction fields: End-to-end tests now validate typed extracted_keywords, quality_score, and processing_warnings fields instead of reading from metadata.additional dictionary.

4.3.4 - 2026-02-16

Fixed

  • Node.js keyword extraction fields missing: The TypeScript convertResult() type converter was silently dropping extractedKeywords, qualityScore, and processingWarnings from NAPI results because it only copied explicitly listed fields. Added the missing field conversions. Also renamed the mismatched keywords property to extractedKeywords in the TypeScript types to match the NAPI binding definition.
  • Windows PHP CI build failure (crc::Table not found): Downgraded lzma-rust2 from 0.16.1 to 0.15.7 to avoid pulling crc 3.4.0, which removed the Table type used by downstream dependencies.
  • CLI installer resolving benchmark tags as latest release: The install.sh script used GitHub's /releases/latest API which returned benchmark run releases instead of actual versioned releases. Changed to filter for v-prefixed tags. Also marked benchmark releases as prerelease in the workflow so they no longer interfere.

4.3.3 - 2026-02-14

Added

Centralized Image OCR Processing

  • Shared process_images_with_ocr function: Extracted duplicated OCR processing logic from DOCX and PPTX extractors into extraction::image_ocr module, providing a single shared implementation for all document extractors.

Jupyter Notebook Image Extraction

  • Base64 image decoding: Jupyter extractor now decodes embedded base64 image data (PNG, JPEG, GIF, WebP) from notebook cell outputs into ExtractedImage structs instead of emitting placeholder text.
  • OCR on notebook images: Extracted images are processed with OCR when configured, using the centralized process_images_with_ocr function.
  • SVG handling: SVG images in notebook outputs are handled as text content (not sent to raster OCR).

Markdown Data URI Image Extraction

  • Data URI image decoding: Markdown extractor now decodes data:image/...;base64,... URIs into ExtractedImage structs with proper format detection (PNG, JPEG, GIF, WebP).
  • OCR on embedded images: Decoded data URI images are processed with OCR when configured.
  • HTTP URLs preserved as text: Non-data URIs (HTTP/HTTPS) are kept as [Image: url] text markers without attempting network access or filesystem traversal.

PaddleOCR Multi-Language Support (#388)

  • 80+ language support via 11 script families: PaddleOCR recognition models now cover english, chinese (simplified+traditional+japanese), latin, korean, east slavic (cyrillic), thai, greek, arabic, devanagari, tamil, and telugu script families.
  • Per-family recognition model architecture: Shared detection/classification models with per-family recognition models and dictionaries, downloaded on demand from HuggingFace (Kreuzberg/paddleocr-onnx-models).
  • Engine pool for concurrent multi-language OCR: Replaced single-engine architecture with a per-family engine pool (HashMap<String, Arc<Mutex<OcrLite>>>), enabling concurrent OCR across different languages.
  • Backend-agnostic --ocr-language CLI flag: Works with all OCR backends (tesseract, paddle-ocr, easyocr). Tesseract expects ISO 639-3 codes (eng, fra, deu); PaddleOCR accepts flexible codes (en, ch, french, korean) via map_language_code().
  • SHA256 checksum verification: All model downloads verified against embedded checksums for integrity.

Changed

PaddleOCR Engine Internals

  • CrnnNet recognition height: Changed to 32 pixels (later found to be incorrect for PP-OCRv4/v5 models; fixed in next release).
  • Model manager split: MODELS constant replaced with SHARED_MODELS (det+cls) and REC_MODELS (11 families) with new cache layout rec/{family}/model.onnx.
  • Language code mapping expanded: map_language_code() now handles Thai, Greek, East Slavic, and additional Latin-script languages.

DOCX Full Extraction Pipeline (#387)

  • DocumentStructure generation: Builds hierarchical document tree with heading-based sections, paragraphs, lists, tables, images, headers/footers, and footnotes/endnotes when include_document_structure = true.
  • Pages field population: Splits extracted text into per-page PageContent entries using detected page break boundaries, with tables and images assigned to correct pages.
  • OCR on embedded images: Runs secondary OCR on extracted DOCX images when OCR is configured, following the PPTX pattern.
  • Image extraction with page assignment: Drawing image placeholders in markdown output enable byte-position-based page number assignment for extracted images.
  • Typed metadata fields: title, subject, authors, created_by, modified_by, created_at, modified_at, language, and keywords are now populated as first-class Metadata fields instead of only appearing in the additional map.
  • FormatMetadata::Docx: Structured format metadata with core_properties, app_properties, and custom_properties available via metadata.format.
  • Style-based heading detection: Uses StyleCatalog with outline_level and inheritance chain walking for accurate heading level resolution, with string-matching fallback.
  • Headers, footers, and footnote references: Headers/footers included in markdown with --- separators; [^N] inline footnote/endnote references rendered in text.
  • Markdown formatting: Bold (**), italic (*), underline (<u>), strikethrough (~~), and hyperlinks rendered as markdown.
  • Table formatting metadata: Vertical merge (v_merge) handled correctly, grid_span for horizontal merging, is_header row detection.
  • Drawing image placeholders: ![alt](image_N) placeholders in markdown output for embedded images.

DOCX Extractor Performance & Code Quality

  • Eliminated 3x code duplication: Extracted parse_docx_core() helper to deduplicate parsing logic across tokio/non-tokio cfg branches.
  • Removed unnecessary clones: Metadata structs (core/app/custom properties) borrowed then moved instead of cloned; drawings and image relationships only cloned when image extraction is enabled.
  • Optimized Run::to_markdown(): Single-pass string builder with pre-calculated capacity replaces clone + repeated format! calls on the hot path.
  • In-place output trimming: to_markdown() trims in-place instead of allocating a new String via trim().to_string().
  • Removed into_owned() on XML text decode: Uses Cow directly from e.decode() instead of forcing heap allocation.
  • write!/writeln! for string building: Footnote definitions and image placeholders use write! to avoid intermediate String allocations.
  • Safe element indexing: to_markdown() uses .get() with else { continue } instead of direct indexing to prevent potential panics.
  • Deduplicated document structure code: Header/footer loops and footnote/endnote loops consolidated using iterators.

Fixed

Extraction Quality Improvements

  • LaTeX zero-arg command handling: Added explicit skip list for 35 zero-argument commands (\par, \noindent, \centering, size commands, etc.). The catch-all handler no longer consumes the next {...} group as an argument, preventing silent text loss for unknown zero-arg commands.
  • Structured data is_text_field false positives: Changed from .contains() substring matching to exact equality on the leaf field name. Previously, "width" matched because it contains "id"; "valid" matched because it contains "id". Now only exact leaf name matches are considered.
  • XML dead code in Event::End handler: Removed unused variable allocation and discarded comparison (let _ = popped == name_owned), replaced with simple element_stack.pop().

Removed

  • Dead code cleanup: Removed unused Document.lists field, ListItem struct, process_lists() method, and HeaderFooter::extract_text() method.

4.3.2 - 2026-02-13

Fixed

PHP 8.4 Requirement Update

  • Updated PHP requirement to 8.4+: All PHP composer.json files, CI workflows, and documentation now require PHP 8.4+ to support PHPUnit 13.0. This fixes CI validation and PHP workflow failures caused by PHPUnit 13.0 requiring PHP 8.4.1+.

Elixir Publishing Workflow

  • Fixed macOS ARM64 build timeout: Increased timeout from 180 to 300 minutes (5 hours) for macOS ARM64 Elixir native library builds. The previous timeout caused incomplete builds and prevented Elixir v4.3.1 from being published to Hex.pm.

4.3.1 - 2026-02-12

Fixed

Elixir Package Checksums (#383)

  • Fixed checksum mismatch for Elixir 4.3.0 Hex package: Updated checksum-Elixir.Kreuzberg.Native.exs with correct SHA256 checksums for all 8 precompiled NIF binaries (NIF 2.16/2.17 across aarch64-apple-darwin, aarch64-unknown-linux-gnu, x86_64-unknown-linux-gnu, x86_64-pc-windows-gnu). The 4.3.0 release shipped with outdated 4.2.10 checksums, causing installation failures.

Dependency Updates

  • Updated all dependencies across 10 language ecosystems: Rust, Python, Node/TypeScript, Ruby, PHP, Go, Java, C#, Elixir, WASM, and pre-commit hooks all updated to latest compatible versions.
  • Enhanced dependency update tasks: All language-specific task update commands now upgrade to latest major versions (not just respecting version constraints). PHP, Ruby, C#, Elixir, and Python update tasks enhanced with major version upgrade support.

WASM Compatibility

  • Fixed WASM build failures: Added explicit getrandom 0.3.4 dependency with wasm_js feature to kreuzberg-wasm crate to ensure transitive dependencies (ahash, lopdf, rand_core) have WebAssembly support enabled.

Dependency Pins

  • Pinned lzma-rust2 to 0.15.7: The 0.16.1 upgrade is incompatible with crc 3.4.0. Keeping 0.15.7 until upstream compatibility is restored.

4.3.0 - 2026-02-11

Added

Blank Page Detection

  • is_blank field on PageInfo and PageContent: Pages with fewer than 3 non-whitespace characters and no tables or images are flagged as blank. Detection uses a two-phase approach: text-only analysis during extraction, then refinement after table/image assignment. Available across all 9 language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir, WASM). Closes #378.

PaddleOCR Backend

  • PaddleOCR backend via ONNX Runtime: New OCR backend (kreuzberg-paddle-ocr) using PaddlePaddle's PP-OCRv4 models converted to ONNX format, run via ONNX Runtime. Supports 6 languages (English, Chinese, Japanese, Korean, German, French) with automatic model downloading and caching. Provides superior CJK recognition compared to Tesseract.
  • PaddleOCR support in all bindings: Available across Python, Rust, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, and Elixir bindings via the paddle-ocr feature flag.
  • PaddleOCR CLI support: The kreuzberg-cli binary supports --ocr-backend paddle-ocr for PaddleOCR extraction.

Unified OCR Element Output

  • Structured OCR element data: Extraction results now include OcrElement data with bounding geometry (rectangles and quadrilaterals), per-element confidence scores, rotation information, and hierarchical levels (word, line, block, page). Available from both PaddleOCR and Tesseract backends.

Shared ONNX Runtime Discovery

  • ort_discovery module: Finds ONNX Runtime shared libraries across platforms, shared between PaddleOCR and future ONNX-based backends.

Document Structure Output

  • DocumentStructure support across all bindings: Added structured document output with include_document_structure configuration option across Python, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, Elixir, and WASM bindings.

Native DOC/PPT Extraction

  • OLE/CFB-based extraction: Added native DOC and PPT extraction via OLE/CFB binary parsing. Legacy Office formats no longer require any external tools.

musl Linux Support

  • Re-enabled musl targets: Added x86_64-unknown-linux-musl and aarch64-unknown-linux-musl targets for CLI binaries, Python wheels (musllinux), and Node.js native bindings. Resolves glibc 2.38+ requirement for prebuilt CLI binaries on older distros like Ubuntu 22.04 (#364).

Fixed

MSG Extraction Hang on Large Attachments (#372)

  • Fixed .msg (Outlook) extraction hanging indefinitely on files with large attachments. Replaced the msg_parser crate with direct OLE/CFB parsing using the cfb crate — attachment binary data is now read directly without hex-encoding overhead.
  • Added lenient FAT padding for MSG files with truncated sector tables produced by some Outlook versions.

Rotated PDF Text Extraction

  • Fixed text extraction returning empty content for PDFs with 90° or 270° page rotation. Kreuzberg now strips /Rotate entries from page dictionaries before loading, restoring correct text extraction for all rotation angles.

CSV and Excel Extraction Quality

  • Fixed CSV extraction producing near-zero quality scores (0.024) by outputting proper delimited text instead of debug format.
  • Fixed Excel extraction producing low quality scores (0.22) by outputting clean tab/newline-delimited cell text.

XML Extraction Quality

  • Improved XML text extraction to better handle namespaced elements, CDATA sections, and mixed content, improving quality scores.

WASM Table Extraction

  • Fixed WASM adapter not recognizing page_number field (snake_case) from Rust FFI, causing table data to be silently dropped in Deno and Cloudflare Workers tests.

DOCX Formatting Output (#376)

  • Fixed DOCX extraction producing plain text instead of formatted markdown. Bold, italic, underline, strikethrough, and hyperlinks are now rendered with proper markdown markers (**bold**, *italic*, ~~strikethrough~~, [text](url)).
  • Fixed heading hierarchy: Title style maps to #, Heading1 to ##, through Heading5+ clamped at ######.
  • Fixed bullet lists (-), numbered lists (1.), and nested list indentation (2-space per level).
  • Fixed tables missing from markdown output. Tables are now interleaved with paragraphs in document order and rendered as markdown pipe tables.
  • Fixed table cell formatting being stripped — bold/italic inside table cells is now preserved.
  • Added 16 integration tests covering formatting, headings, lists, tables, and document structure.

Typst Table Content Extraction

  • Fixed Typst extract_table_content double-counting opening parenthesis, which caused the table parser to consume all remaining document content after a #table() call.

PaddleOCR Recognition Model

  • Fixed PaddleOCR recognition model (en_PP-OCRv4_rec_infer.onnx) failing to load with ShapeInferenceError on ONNX Runtime 1.23.x.
  • Fixed incorrect detection model filename in Docker and CI action (en_PP-OCRv4_det_infer.onnxch_PP-OCRv4_det_infer.onnx).

Python Bindings

  • Fixed OcrConfig constructor silently ignoring paddle_ocr_config and element_config keyword arguments.
  • Fixed keyword extraction results (and all metadata.additional entries from post-processors) being silently dropped in Python bindings. The ExtractionResult.from_rust() method now propagates flattened additional metadata fields, matching all other bindings. Closes #379.

TypeScript/Node.js Bindings

  • Fixed PaddleOCR config (paddle_ocr_config) and element config (element_config) being silently dropped by the NAPI-RS binding layer.
  • Fixed ocr_elements missing from extraction result conversion in TypeScript wrapper.

Ruby Bindings

  • Fixed kreuzberg-pdfium-render vendored crate not included in gemspec, causing gem build failures.
  • Fixed PaddleOCR config and element config not being parsed in Ruby binding config layer.
  • Fixed ocr_elements missing from Ruby extraction result conversion.

Go Bindings

  • Fixed PdfMetadata deserialization failing when keyword extraction produces object arrays instead of simple strings. Added lenient UnmarshalJSON fallback with field-by-field recovery.

C# Bindings

  • Fixed keyword extraction data inaccessible in C# — ExtractedKeywords was marked [JsonIgnore] and excluded from metadata serialization. Added lenient metadata extraction fallback for mixed-type keyword fields.

PHP Bindings

  • Fixed document, elements, and ocrElements properties inaccessible on ExtractionResult — these fields were not exposed through the __get handler.
  • Fixed ExtractionConfig::toArray() not serializing include_document_structure, causing document structure extraction to be silently ignored.
  • Fixed wrapper function names for document extractor management (kreuzberg_*_document_extractorskreuzberg_*_extractors).
  • Added missing OCR backend management functions (kreuzberg_list_ocr_backends, kreuzberg_clear_ocr_backends, kreuzberg_unregister_ocr_backend).
  • Fixed page_count metadata key mismatch between serialization (pageCount) and deserialization (page_count).

Elixir Bindings

  • Fixed NIF config parser not forwarding include_document_structure, result_format, output_format, html_options, max_concurrent_extractions, and security_limits options.
  • Added missing document extractor management NIFs (list_document_extractors, unregister_document_extractor, clear_document_extractors).

CI

  • Fixed PHP E2E tests not actually running in CI — the task was configured to run package unit tests instead of E2E tests.

Changed

Build System

  • Bumped ONNX Runtime from 1.23.2 to 1.24.1 across CI, Docker images, and documentation.
  • Bumped vendored Tesseract from 5.5.1 to 5.5.2.
  • Bumped vendored Leptonica from 1.86.0 to 1.87.0.

Removed

LibreOffice Dependency

  • LibreOffice is no longer required: Legacy .doc and .ppt files are now extracted natively via OLE/CFB parsing. LibreOffice has been removed from Docker images, CI pipelines, and system dependency requirements, reducing the full Docker image size by ~500-800MB. Users on Kreuzberg <4.3 still need LibreOffice for these formats.

msg_parser Dependency

  • Replaced msg_parser crate with direct CFB parsing for MSG extraction. Eliminates hex-encoding overhead and reduces dependency count.

Guten OCR Backend

  • Removed all references to the unused Guten OCR backend from Node.js and PHP bindings. Renamed KREUZBERG_DEBUG_GUTEN env var to KREUZBERG_DEBUG_OCR.

4.2.15 - 2026-02-08

Added

Agent Skill for AI Coding Assistants

  • Agent Skill for document extraction: Added skills/kreuzberg/SKILL.md following the Agent Skills open standard, with comprehensive instructions for Python, Node.js, Rust, and CLI usage. Includes 8 detailed reference files covering API signatures, configuration, supported formats, plugins, and all language bindings. Works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any compatible tool.

MIME Type Mappings

  • Added .docbook (application/docbook+xml) and .jats (application/x-jats+xml) file extension mappings.

Fixed

ODT List and Section Extraction

  • Fixed ODT extractor not handling text:list and text:section elements. Documents containing bulleted/numbered lists or sections returned empty content.

UTF-16 EML Parsing

  • Fixed EML files encoded in UTF-16 (LE/BE, with or without BOM) returning empty content. Detects UTF-16 encoding via BOM markers and heuristic byte-pattern analysis, transcoding to UTF-8 before parsing.

Email Attachment Metadata Serialization

  • Fixed email extraction inserting a comma-joined string "attachments" into the additional metadata HashMap, which via #[serde(flatten)] overwrote the structured EmailMetadata.attachments array. This caused deserialization failures in Go, C#, and other typed bindings when processing emails with attachments.

WASM Office Document Support (DOCX, PPTX, ODT)

  • DOCX, PPTX, and ODT extractors were gated on #[cfg(all(feature = "tokio-runtime", feature = "office"))] but wasm-target does not enable tokio-runtime. Changed cfg gates to #[cfg(feature = "office")] with conditional spawn_blocking only when tokio-runtime is available. Office documents now extract correctly in WASM builds.

WASM PDF Support in Non-Browser Runtimes

  • PDFium initialization was guarded by isBrowser(), preventing PDF extraction in Node.js, Bun, and Deno. Removed the browser-only restriction so PDFium auto-initializes in all WASM runtimes.

Elixir PageBoundary JSON Serialization

  • Added missing @derive Jason.Encoder to PageBoundary, PageInfo, and PageStructure structs in the Elixir bindings. Without this, encoding page structure metadata to JSON would fail with a protocol error.

Pre-built CLI Binary Missing MCP Command

  • Pre-built standalone CLI binaries were built without the mcp feature flag, causing the kreuzberg mcp command to be unavailable. The build script now enables all features (--features all) to match the Python, Node, and Homebrew builds. Fixes #369.

PDF Error Handling Regression

  • Reverted incorrect change from v4.2.14 that silently returned empty results for corrupted/malformed PDFs instead of propagating errors. Corrupted PDFs now correctly return PdfError::InvalidPdf and password-protected PDFs return PdfError::PasswordRequired as expected.

Changed

API Parity

  • Added security_limits field to all 9 language bindings (TypeScript, Go, Python, Ruby, PHP, Java, C#, WASM, Elixir) for API parity with Rust core ExtractionConfig.

4.2.14 - 2026-02-07

Fixed

Excel File-Path Extraction

  • Fixed .xla (legacy add-in) and .xlsb (binary spreadsheet) graceful fallback only applied to byte-based extraction; file-path-based extraction still propagated parse errors.

PDF Test Flakiness

  • Fixed flaky PDF tests caused by concurrent pdfium access during parallel test execution. Added #[serial] to all pdfium-using tests to prevent global state conflicts.

Benchmark Fixtures

  • Replaced auto-generated fixture discovery (generate.rs) with curated, validated fixture set.
  • Added comprehensive fixture validation test suite (8 tests: JSON parsing, document existence, file sizes, ground truth, duplicate detection, format coverage).
  • Removed 5 duplicate fixture entries pointing to the same test documents.
  • Swapped encrypted EPUB fixture (epub2_no_cover.epub with IDPF font encryption) for clean features.epub.
  • Fixed 272 stale file_size declarations in fixture JSON files to match actual files on disk.
  • Fixed validate_ground_truth.py only checking root-level fixtures; now uses rglob for recursive validation.

Removed

  • Removed generate.rs auto-generation system from benchmark harness (caused recurring breakage from malformed vendored files).

4.2.13 - 2026-02-07

Added

WASM Office Format Support

  • Added office document extraction to the WASM target: DOCX, PPTX, RTF, reStructuredText, Org-mode, FictionBook, Typst, BibTeX, and Markdown are now available in the browser/WASM build.
  • Added WASM integration tests for all new office formats (office_extraction.rs).
  • Added e2e fixture definitions for RTF, RST, Org, FB2, Typst, BibTeX, and Markdown formats.
  • Regenerated e2e test suites across all language bindings to include new office format fixtures.

Citation Extraction

  • Added structured citation extraction for RIS (.ris), PubMed/MEDLINE (.nbib), and EndNote XML (.enw) formats via biblib crate with rich metadata including authors, DOI, year, keywords, and abstract.
  • Added CitationExtractor with priority 60 for application/x-research-info-systems, application/x-pubmed, and application/x-endnote+xml MIME types.

JPEG 2000 OCR Support

  • Added full JPEG 2000 image decoding for OCR via hayro-jpeg2000 (pure Rust, memory-safe decoder). JP2 container and J2K codestream images are now decoded to RGB pixels for Tesseract OCR processing.
  • Added pure Rust JP2 metadata parsing (dimensions, format detection) without external dependencies.

JBIG2 Image Support

  • Added JBIG2 bi-level image decoding for OCR via hayro-jbig2 (pure Rust, memory-safe decoder). JBIG2 is commonly used in scanned PDF documents.
  • Added image/x-jbig2 MIME type with .jbig2 and .jb2 file extension mappings.

Gzip Archive Extraction

  • Added GzipExtractor for extracting text content from gzip-compressed files (.gz) via flate2, with decompression size limits to prevent gzip bomb attacks.

Extractor Registration

  • Registered JatsExtractor and DocbookExtractor in the default extractor registry (extractors existed but were never registered).

MIME Type & Extension Mappings

  • Added missing MIME types to SUPPORTED_MIME_TYPES: text/x-fictionbook, application/x-fictionbook, text/x-bibtex, text/docbook, application/x-pubmed.
  • Added MIME type aliases for broader compatibility: text/djot, text/jats, application/x-epub+zip, application/vnd.epub+zip, text/rtf, text/prs.fallenstein.rst, text/x-tex, text/org, application/x-org, application/xhtml+xml, text/x-typst, image/jpg.
  • Added missing file extension mappings: .fb2, .opml, .dbk, .j2k, .j2c, .ris, .nbib, .enw, .typ, .djot.

Security

  • Wired SecurityLimits into the archive extraction pipeline: ZIP, TAR, 7z, and GZIP extractors now enforce configurable limits for max archive size, file count, compression ratio, and content size.
  • Added security_limits field to ExtractionConfig for user-configurable archive security thresholds.
  • ZIP archives are now validated with ZipBombValidator before extraction.
  • Replaced hardcoded 256 MB gzip decompression limit with configurable max_archive_size (default 500 MB).

Fixed

WASM Build

  • Fixed zstd-sys build failure for wasm32-unknown-unknown by disabling default features on the zip crate and using deflate-flate2 (pure Rust) instead of zstd (C code incompatible with WASM).
  • Fixed tokio/mio compilation failure on WASM by removing tokio-runtime from the office feature (only needed for LibreOffice subprocess conversion, not in-memory parsers).
  • Gated LibreOffice conversion paths (libreoffice.rs, legacy DOC/PPT handlers) behind not(target_arch = "wasm32") to prevent WASM builds from pulling in tokio filesystem and process APIs.

MIME Type Detection

  • Fixed .typ files not recognized as Typst format; added .typ as an alias for application/x-typst.
  • Fixed .djot files not recognized; added .djot extension mapping to text/x-djot.
  • Fixed application/gzip rejected by MIME validation; added to SUPPORTED_MIME_TYPES.
  • Fixed case-sensitive MIME type validation rejecting valid types with different casing (e.g., macroEnabled vs macroenabled); added RFC 2045 case-insensitive fallback.
  • Synced SUPPORTED_MIME_TYPES with extractor registry to prevent valid formats being rejected before reaching their extractor.

Image Extraction

  • Fixed JPEG 2000 images (.jp2) not handled by ImageExtractor; added image/jp2, image/jpx, image/jpm, and image/mj2 to supported types.

Extraction

  • Fixed YAML files rejected with "Unsupported format: application/yaml"; now accepts all four YAML MIME type variants including the standard application/yaml (RFC 9512).

CLI

  • Fixed .yml config files rejected by --config flag; now accepts both .yml and .yaml.

.tgz Archive Extraction

  • Fixed .tgz files parsed as raw TAR instead of gzip-compressed TAR. The MIME mapping now correctly routes .tgz to the GzipExtractor, which detects inner TAR archives via ustar magic bytes and delegates to TAR extraction.

Excel Exotic Formats

  • Fixed .xlam (Excel add-in), .xla (legacy add-in), and .xlsb (binary spreadsheet) files causing extraction errors when they lack standard workbook data. These formats now gracefully return an empty workbook instead of propagating parse errors.

PDF Error Handling

  • Fixed password-protected and malformed PDFs causing extraction errors. The PDF extractor now gracefully returns an empty ExtractionResult instead of propagating PdfError::PasswordRequired and PdfError::InvalidPdf.

Benchmark Harness

  • Fixed framework initialization check running before external adapters (Tika, pdfplumber, etc.) were registered, causing false "failed to initialize" errors.
  • Fixed missing composer install step in PHP benchmark CI job.
  • Fixed C# benchmark wrapper using wrong MIME type casing for macro-enabled Office formats and incorrect djot MIME type.
  • Fixed WASM benchmark wrapper missing MIME mappings for several supported formats.
  • Added error counts (framework_errors, harness_errors) and error detail breakdown to benchmark aggregation output.
  • Added distinct ErrorKind::Timeout tracking in benchmark results, propagated through aggregation and per-extension stats.
  • Removed 12 malformed/password-protected/broken fixture files from benchmark corpus.

4.2.12 - 2026-02-06

Fixed

DOCX Extraction

  • Fixed DOCX list items missing whitespace between text runs, causing words to merge together. (#359)

4.2.11 - 2026-02-06

Fixed

Python Bindings

  • Fixed CLI binary missing from all platform wheels in the publish workflow. (#349)

Fixed

OCR Heuristic

  • Pass actual page count to OCR fallback evaluator: evaluate_native_text_for_ocr was called with None for page count, defaulting to 1. This inflated per-page averages for multi-page documents, causing scanned PDFs to skip OCR.
  • Per-page OCR evaluation for mixed-content PDFs: Added evaluate_per_page_ocr which evaluates each page independently using page boundaries. If any single page triggers OCR fallback, the entire document is OCR'd. Previously, good pages masked scanned pages in the aggregate evaluation.

4.2.10 - 2026-02-05

Fixed

MIME Type Detection

  • Fixed DOCX/XLSX/PPTX files incorrectly detected as application/zip when using bytes-based MIME detection. (#350)

Java Bindings

  • Fixed format-specific metadata (e.g., sheet_count, sheet_names) missing from getMetadataMap().
  • Fixed ClassCastException when deserializing nested generic collections in model classes. (#355)

Python Bindings

  • Fixed Windows CLI binary still missing from wheel due to wrong filename in CI copy step. (#349)

4.2.9 - 2026-02-03

Fixed

MCP Server

  • Fixed "Cannot start a runtime from within a runtime" panic when using MCP server in Docker.
  • Removed unused async parameter from MCP tools.

Python Bindings

  • Fixed "embedded binary not found" error on Windows due to missing .exe extension handling. (#349)

OCR Heuristic

  • Fixed OCR fallback evaluator receiving None for page count, causing scanned PDFs to incorrectly skip OCR.
  • Added per-page OCR evaluation so that mixed-content PDFs with some scanned pages are properly OCR'd.

4.2.8 - 2026-02-02

Fixed

Python Bindings

  • Fixed ChunkingConfig serialization outputting wrong field names (max_characters/overlap instead of max_chars/max_overlap).

Java Bindings

  • Fixed ARM64 SIGBUS crash in kreuzberg_get_error_details by returning a heap-allocated pointer instead of struct-by-value.

Ruby Bindings

  • Fixed rb_sys missing as runtime dependency, causing LoadError during native extension compilation.

FFI

  • Added kreuzberg_free_error_details() to properly free heap-allocated CErrorDetails structs.

4.2.7 - 2026-02-01

Added

API

  • Added OpenAPI schema for /extract endpoint with full type documentation.
  • Added unified ChunkingConfig with canonical field names and serde aliases for backwards compatibility.

OCR

  • Added KREUZBERG_OCR_LANGUAGE="all" support to auto-detect and use all installed Tesseract languages. (#344)

Fixed

Ruby Bindings

  • Fixed Cow<'static, str> type conversions in Magnus bindings.
  • Fixed missing bytes workspace dependency in vendor Cargo.toml.

Python Bindings

  • Fixed runtime ExtractedImage import; defined as Python-level runtime types instead of importing from compiled Rust bindings.

C# Bindings

  • Fixed Attributes deserialization on ARM64 to handle both array-of-arrays and object JSON formats.

Java Bindings

  • Fixed test timeouts causing CI hangs by adding @Timeout(60) to concurrency and async tests.

Elixir Bindings

  • Overhauled all struct types to match Rust source: fixed Metadata, Table, Image, Chunk, Page, ExtractionResult field names and types.
  • Added new struct modules matching Rust types: ChunkMetadata, Keyword, PageHierarchy, DjotContent, PageStructure, ErrorMetadata, ImagePreprocessingMetadata, and more.

TypeScript Bindings

  • Overhauled type definitions to match NAPI-RS Rust source; fixed ChunkingConfig, ExtractionResult, ExtractionConfig, and FormattedBlock fields.

PHP Bindings

  • Overhauled type definitions to match Rust source; fixed Keyword, Metadata, ExtractionResult, and FormattedBlock fields.

Ruby Bindings

  • Overhauled RBS type stubs to match Ruby source and Rust Magnus bindings.

Python Bindings

  • Overhauled _internal_bindings.pyi type stubs to match Rust source; fixed Chunk, PptxMetadata, PdfMetadata, HtmlMetadata, and optionality on multiple fields.
  • Removed duplicate types.py containing 43 conflicting type definitions.

Java Bindings

  • Overhauled type definitions to match Rust source; fixed Metadata, PptxMetadata, PageInfo, ImageMetadata, LinkMetadata, and added missing enums and types.

C# Bindings

  • Overhauled type definitions to match Rust source; fixed Metadata, PptxMetadata, PageBoundary, ImageMetadata, and added missing types.
  • Fixed keyword deserialization to discriminate between simple string keywords and extracted keyword objects.

Go Bindings

  • Overhauled type definitions to match Rust source; fixed Metadata, PptxMetadata, ImageMetadata, PageBoundary, PageInfo, and added missing enums and types.

Changed

  • Bumped html-to-markdown-rs from 2.24.1 to 2.24.3.

Performance

  • Converted static string fields to Cow<'static, str> to eliminate heap allocations for string literals.
  • Reduced allocations in RST parser, fictionbook extractor, and email extractor.
  • Replaced HashMap with Vec for small metadata maps and AHashMap for hot-path maps.
  • Switched Metadata.additional keys to Cow<'static, str> for interning.
  • Replaced Vec<u8> with bytes::Bytes for ExtractedImage.data, enabling zero-copy cloning.

4.2.6 - 2026-01-31

Fixed

Python Bindings

  • Fixed missing output_format/result_format fields on ExtractionResult.
  • Fixed missing elements and djot_content fields on ExtractionResult.
  • Fixed chunks returned as dicts instead of objects; created proper PyChunk class with attribute access.

4.2.5 - 2026-01-30

Fixed

Python Bindings

  • Fixed missing OutputFormat/ResultFormat exports causing ImportError.
  • Fixed .pyi stub alignment for ExtractionResult, Element, and related types.
  • Fixed Python 3.10 compatibility for StrEnum (native StrEnum is 3.11+).

PHP Bindings

  • Fixed config alignment with Rust core for ImageExtractionConfig, PdfConfig, ImagePreprocessingConfig, and ExtractionConfig.
  • Removed phantom parameters not present in Rust core.

TypeScript/Node Bindings

  • Fixed missing elements field; added JsElement, JsElementMetadata, JsBoundingBox to NAPI-RS bindings.

C# Bindings

  • Fixed enum serialization using JsonStringEnumMemberName for .NET 9+.

Elixir Bindings

  • Fixed test failures and cleaned up warnings on Windows.

Node Bindings

  • Added Bun runtime support.

Changed

All Bindings

  • Achieved PageContent field parity across all language bindings.

4.2.4 - 2026-01-29

Fixed

TypeScript/Node Bindings

  • Fixed missing elements field; added Element, ElementType, BoundingBox, and ElementMetadata types.

Rust Core

  • Fixed KeywordConfig deserialization failing on partial configs by adding #[serde(default)].

C# Bindings

  • Fixed Element serialization for element_based result format deserialization.

Elixir Bindings

  • Derived Jason.Encoder for ExtractionConfig struct.

4.2.3 - 2026-01-28

Fixed

API

  • Fixed JSON array rejection; /embed, /chunk, and other endpoints now properly reject arrays in request bodies with 400 status.

CLI

  • Fixed --format json to serialize the complete ExtractionResult including chunks, embeddings, images, pages, and elements.

MCP

  • Fixed MCP tool responses to return full JSON-serialized ExtractionResult, matching API and CLI output.

Elixir Bindings

  • Added ExtractionConfig.new/0 and new/1 constructors.
  • Changed text field to content on Chunk for API parity with Rust core.

C# Bindings

  • Fixed file-not-found errors to throw KreuzbergIOException instead of KreuzbergValidationException.

WASM / Cloudflare Workers

  • Fixed initWasm() failing in Cloudflare Workers and Vercel Edge with "Invalid URL string" error; added initWasm({ wasmModule }) option for explicit WASM module injection.

Go Bindings

  • Removed references to deprecated WithEmbedding() API and Chunking.Embedding field.

Java Bindings

  • Removed non-canonical embedding and imagePreprocessing top-level fields from ExtractionConfig.

MCP

  • Fixed boolean merge logic bug causing configuration corruption when using config parameter.

4.2.2 - 2026-01-28

Changed

PHP Bindings

  • Removed 5 non-canonical fields from ExtractionConfig and fixed defaults; all 16 fields now match Rust canonical source.

Go Bindings

  • Removed non-canonical Success, Visible, and ContentType fields from result types.

Ruby Bindings

  • Fixed enable_quality_processing default from false to true to match Rust.

Java Bindings

  • Fixed enableQualityProcessing default from false to true to match Rust.

TypeScript Bindings

  • Removed non-existent type exports (EmbeddingConfig, HierarchyConfig, etc.) from index.ts.

Fixed

Elixir Bindings

  • Fixed force_build: true causing production installs to fail; now only builds from source in development. (#333)

Docker Images

  • Fixed "OCR backend 'tesseract' not registered" error by adding dynamic tessdata discovery for multiple tesseract versions.
  • Fixed "Failed to initialize embedding model" error by adding persistent Hugging Face model cache directory.

API

  • Fixed JSON error responses to return proper JSON ErrorResponse instead of plain text.
  • Added validation constraints for chunking config and embed texts array.
  • Added validation that overlap must be less than max_characters.
  • EmbeddingConfig.model now defaults to "balanced" preset when not specified.

Rust Core

  • Fixed XLSX out-of-memory with Excel Solver files that declare extreme cell dimensions. (#331)

4.2.1 - 2026-01-27

Fixed

Rust Core

  • Fixed PPTX image page numbers being reversed due to unsorted slide paths. (#329)
  • Added comprehensive error logging for silent plugin failures. (#328)
  • Extended VALID_OUTPUT_FORMATS to include all valid aliases (plain, text, markdown, md, djot, html).
  • Fixed validate_file_exists() to return Io error instead of Validation error for file-not-found.

Go Bindings

  • Added OutputFormatText and OutputFormatMd format constant aliases.

Elixir Bindings

  • Added text and md aliases to validate_output_format.

Ruby Bindings

  • Fixed extract and detect methods to accept both positional and keyword arguments.
  • Renamed image_extraction to images (canonical name) with backward-compatible alias.

PHP Bindings

  • Renamed fields to canonical names (images, pages, pdfOptions, postprocessor, tokenReduction).
  • Added missing postprocessor and tokenReduction fields.

Java Bindings

  • Added getImages() and images() builder methods as aliases for getImageExtraction().

WASM Bindings

  • Added outputFormat, resultFormat, and htmlOptions to ExtractionConfig interface.

Documentation

  • Added Kubernetes deployment guide with health check configuration and troubleshooting. (#328)

4.2.0 - 2026-01-26

Added

MCP Interface

  • Full config parameter support on all MCP tools, enabling complete configuration pass-through from AI agents.

CLI

  • Added --output-format flag (canonical replacement for --content-format).
  • Added --result-format flag for controlling result structure (unified, element_based).
  • Added --config-json flag for inline JSON configuration.
  • Added --config-json-base64 flag for base64-encoded JSON configuration.

API - All Language Bindings

  • Added outputFormat / output_format field (Plain, Markdown, Djot, HTML) to all bindings.
  • Added resultFormat / result_format field (Unified, ElementBased) to all bindings.

Go Bindings

  • Added OutputFormat and ResultFormat types with WithOutputFormat() and WithResultFormat() functional options.

Java Bindings

  • Added outputFormat and resultFormat to Builder pattern.

PHP Bindings

  • Added 6 missing configuration fields: useCache, enableQualityProcessing, forceOcr, maxConcurrentExtractions, resultFormat, outputFormat.

Changed

Configuration Precedence

  • CLI flag > inline JSON > config file > defaults.

MCP Schema Evolution

  • enable_ocr and force_ocr now under config object instead of top-level parameters.

Fixed

Ruby Bindings

  • Fixed batch chunking operations.

MCP

  • Fixed boolean merge logic bug in nested config objects.

BREAKING CHANGES

MCP Interface Only (AI-only, no user impact)

  • Removed enable_ocr and force_ocr top-level parameters from MCP tools; use config.ocr.enable_ocr and config.force_ocr instead.
  • MCP tools now require config object parameter; old names accepted in v4.2 with deprecation warnings.

Deprecated

CLI (backward compatible)

  • --content-format flag deprecated in favor of --output-format.

Environment Variables (backward compatible)

  • KREUZBERG_CONTENT_FORMAT deprecated in favor of KREUZBERG_OUTPUT_FORMAT.

4.1.2 - 2026-01-25

Added

Ruby Bindings

  • Added Ruby 4.0 support (tested with Ruby 4.0.1).

Fixed

Ruby Bindings

  • Fixed gem native extension build failure due to incorrect Cargo.toml path rewriting.

Go Bindings

  • Fixed Windows timeout caused by FFI mutex deadlock; now uses lazy initialization via sync.Once.

4.1.1 - 2026-01-23

Fixed

PPTX/PPSX Extraction

  • Fixed PPTX extraction failing on shapes without text (e.g., image placeholders). (#321)
  • Added PPSX (PowerPoint Show) file support.
  • Added PPTM (PowerPoint Macro-Enabled) file support.

4.1.0 - 2026-01-21

Added

API

  • Added POST /chunk endpoint for text chunking with configurable max_characters, overlap, and trim.

Core

  • Added Djot markup format support (.djot) with full parser, structured representation, and YAML frontmatter extraction.
  • Added content output format configuration (ContentFormat enum: Plain, Markdown, Djot, HTML) with CLI --content-format flag.
  • Added Djot output format support for HTML and OCR conversions.
  • Added element-based output format (ResultFormat::ElementBased) providing Unstructured.io-compatible semantic element extraction.

Language Bindings

  • All bindings (Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, WASM) updated with content format and result format configuration, plus Element, ElementType, ElementMetadata, BoundingBox, and DjotContent types.

Changed

  • Split 22 large monolithic files into 110+ focused modules for improved maintainability; no breaking changes to public APIs.

Fixed

Python

  • Fixed missing type exports (Element, ElementMetadata, ElementType, BoundingBox, HtmlImageMetadata) in kreuzberg.types.__all__.

Elixir

  • Fixed FunctionClauseError when extracting DOCX files with keywords metadata. (#309)

4.0.8 - 2026-01-17

Changed

Docker

  • Migrated from Docker Hub to GitHub Container Registry (ghcr.io/kreuzberg-dev/kreuzberg).

Fixed

C

  • Fixed HtmlConversionOptions serializing as null instead of {} when empty, causing Rust FFI errors.

Python

  • Fixed missing _internal_bindings.pyi type stub file in Python wheels. (#298)

Homebrew

  • Fixed bottle checksum mismatches by computing checksums from actual uploaded release files.

4.0.6 - 2026-01-14

Fixed

Elixir

  • Fixed checksum file generation for precompiled NIFs during Hex.pm publishing.

PHP

  • Fixed runtime panic from unregistered ChunkMetadata and Keyword classes in ext-php-rs.

4.0.5 - 2026-01-14

Added

Go Module

  • Added automated FFI library installer that downloads the correct platform-specific library from GitHub releases. (#281)

Fixed

Elixir

  • Fixed precompiled NIF checksums missing from Hex package.

4.0.4 - 2026-01-13

Fixed

Docker

  • Fixed MissingDependencyError when extracting legacy MS Office formats in Docker; added LibreOffice symlinks and missing runtime dependencies. (#288)

4.0.3 - 2026-01-12

Added

HTML Configuration Support

  • Full html_options configuration now available from config files and all language bindings. (#282)

Fixed

Go Module

  • Fixed header include path so go get users no longer get compilation errors about missing headers. (#280)

C# SDK

  • Fixed JsonException when using keyword extraction; keywords now properly deserialized as ExtractedKeyword objects. (#285)

Distribution

  • Made Homebrew tap repository public to enable brew install kreuzberg-dev/tap/kreuzberg. (#283)

4.0.2 - 2026-01-12

Fixed

Go Module

  • Fixed Go module tag format so go get works correctly. (#264)

Elixir

  • Fixed macOS native library extension (.dylib instead of .so).

4.0.1 - 2026-01-11

Fixed

Elixir

  • Fixed NIF binaries not uploaded to GitHub releases, breaking rustler_precompiled. (#279)

Python

  • Fixed kreuzberg-tesseract missing from PyPI source distributions, causing builds from source to fail. (#277)

Homebrew

  • Fixed bottle publishing workflow to publish releases from draft state.

Ruby

  • Updated RBS type definitions to match keyword argument signatures.

WASM

  • Fixed Svelte 5 variable naming and removed call to non-existent detectMimeType() API.

4.0.0 - 2026-01-10

Highlights

First stable release of Kreuzberg v4, a complete rewrite with a Rust core and polyglot bindings for Python, TypeScript, Ruby, PHP, Java, Go, C#, Elixir, and WebAssembly.

Added

FFI & Language Bindings

  • Python FFI error handling via get_last_error_code() and get_last_panic_context().
  • PHP custom extractor support with metadata and tables flowing through to results.
  • Dynamic Tesseract language discovery from installation.

Removed

Legacy Support

  • Completely removed v3 legacy Python package and infrastructure.

4.0.0-rc.29 - 2026-01-08

Added

Documentation

  • Added comprehensive platform support documentation to all READMEs.

4.0.0-rc.28 - 2026-01-07

Added

API Server

  • Added POST /embed endpoint for generating embeddings from text. (#266)
  • Added ServerConfig type for file-based server configuration (TOML/YAML/JSON) with environment variable overrides.

Observability

  • Added OpenTelemetry tracing instrumentation to all API endpoints.

Fixed

API Server & CLI

  • Fixed CLI to properly use ServerConfig from config files (CORS origins, upload size limits).

Configuration Examples

  • Fixed YAKE/RAKE parameter examples to match actual source code.
  • Changed default host from 0.0.0.0 to 127.0.0.1 for safer defaults.

PHP

  • Fixed extract_tables config flag to properly filter table results.

4.0.0-rc.27 - 2026-01-04

Fixed

  • Fixed WASM npm package initialization failure caused by incorrect import paths in minified output.

4.0.0-rc.26 - 2026-01-03

Fixed

  • Fixed Node.js macOS ARM64 builds missing from publish workflow.
  • Fixed WASM npm package missing WASM binaries.
  • Fixed Elixir hex.pm publishing with correct public configuration.
  • Fixed Homebrew bottle upload pattern.

4.0.0-rc.25 - 2026-01-03

Fixed

  • Added comprehensive chunking config validation to Go binding (negative values, excessive sizes, overlap constraints).
  • Fixed Java FFI to use Arena.global() for thread-safe C string reads.

Changed

  • Updated C# target framework to .NET 10.0.

4.0.0-rc.24 - 2026-01-01

Fixed

  • Fixed Go Windows CGO directives to bypass pkg-config.
  • Fixed Ruby Windows build with proper platform handling and embeddings feature.
  • Fixed Node Windows tests with proper symlink resolution.
  • Fixed Homebrew formula bottle naming and source sha256 fetching.

4.0.0-rc.23 - 2026-01-01

Added

Java

  • Added EmbeddingConfig class with builder pattern for embedding generation.

C

  • Added EmbeddingConfig sealed class as type-safe replacement for Dictionary-based configuration.

Node.js (NAPI-RS)

  • Added Worker Thread Pool APIs: createWorkerPool, extractFileInWorker, batchExtractFilesInWorker, closeWorkerPool.

Fixed

  • Fixed page markers to include page 1 (previously only inserted for page > 1).
  • Fixed Go concurrency crashes (segfaults/SIGTRAP) by adding mutex for thread-safe FFI calls.

4.0.0-rc.22 - 2025-12-27

Added

  • PHP bindings with comprehensive FFI bindings and E2E test suite.
  • Root composer.json for Packagist publishing.
  • HTML metadata extraction: headers, links, images, structured data (JSON-LD, Microdata, RDFa), language, text_direction, meta_tags.

Fixed

  • Fixed C# target framework from net10.0 (preview) to net8.0 LTS.
  • Fixed Ruby vendor script missing workspace dependency inlining for lzma-rust2 and parking_lot.

Changed

  • BREAKING: HTML metadata structure - Replaced YAML frontmatter parsing with single-pass metadata extraction. See docs/migration/v4.0-html-metadata.md.

4.0.0-rc.21 - 2025-12-26

Fixed

  • Fixed PDF initialization race conditions causing segfaults and concurrency errors across all language bindings.
  • Fixed EPUB metadata extraction with incorrect field mapping (created_at mapped to subject).

Added

  • CLI test app for validating kreuzberg-cli published to crates.io.

4.0.0-rc.20 - 2025-12-25

Added

  • Font configuration API with configurable font provider, custom directory support, and automatic path expansion.

4.0.0-rc.19 - 2025-12-24

Added

  • Homebrew bottle support for faster macOS installation.
  • Environment variable configuration for API size limits (KREUZBERG_MAX_REQUEST_BODY_BYTES, KREUZBERG_MAX_MULTIPART_FIELD_BYTES).
  • Config file caching for TOML/YAML/JSON loading.

Fixed

  • Fixed large file uploads rejected above 2MB; now configurable up to 100MB. (#248)
  • Fixed browser package Vite compatibility with missing pdfium.js. (#249)
  • Fixed Node.js missing binaries in Docker and pnpm monorepo environments. (#241)
  • Fixed Ruby gem native extension build with proper linker path resolution.
  • Fixed font provider thread safety race condition.
  • Added custom font path validation with symlink resolution and canonicalization.

Changed

  • BREAKING: Custom font provider now enabled by default.
  • Default API size limit increased to 100MB.
  • TypeScript serialization replaced MessagePack + Base64 with direct JSON.

Performance

  • 15-25% overall execution improvement, 30-45% memory reduction.
  • Memory pool improvements (35-50% reduction).

Removed

  • Removed deprecated backward compatibility: TypeScript KREUZBERG_LEGACY_SERIALIZATION, Go legacy error codes, Ruby Ocr = OCR alias, Rust Metadata.date field, Cargo legacy feature aliases.

Security

  • Custom font directories validated with canonicalization and symlink resolution.

4.0.0-rc.18 - 2025-12-23

Fixed

  • Fixed Ruby gem missing kreuzberg-ffi crate in vendored dependencies.
  • Fixed Ruby gem macOS linker errors.
  • Fixed Python wheels macOS ImportError from hardcoded dylib paths.

4.0.0-rc.17 - 2025-12-22

Added

  • Docker ARM64 support with multi-architecture images.

Fixed

  • Fixed Python wheels macOS ImportError.
  • Fixed Ruby gems macOS linker errors.
  • Fixed TypeScript plugin registration TypeError with JavaScript-style plugins.

Performance

  • Improved Go ConfigMerge performance with native field copying.

4.0.0-rc.16 - 2025-12-21

Added

  • Batch processing APIs with 4-6x throughput improvement for high-volume extraction.

Fixed

  • Fixed Python IDE support; type stub files now included in wheel distributions.
  • Fixed Go Windows linking with duplicate linker flags.
  • Fixed Ruby gem compilation with missing link search paths.
  • Fixed Ruby gem publishing with artifact corruption.

Performance

  • 2-3x batch throughput gains with FFI streaming.
  • C# JSON serialization with source generation.

4.0.0-rc.15 - 2025-12-20

Fixed

  • Fixed Node.js Windows x64 platform packages not publishing to npm.

4.0.0-rc.14 - 2025-12-20

Fixed

  • Fixed LibreOffice in Docker (updated to version 25.8.4).
  • Fixed Python IDE type hints; type stub files now included in wheels.
  • Fixed Ruby gem compilation with Rust crate vendoring.
  • Fixed Python ExtractionResult missing pages field in IDE autocomplete.

4.0.0-rc.13 - 2025-12-19

Fixed

  • Fixed PDF bundled feature flag (corrected to bundled-pdfium).
  • Fixed Go Windows linking with missing system libraries.
  • Fixed Ruby gem packaging with missing TOML dependency.
  • Fixed WASM distribution with compiled binaries for npm publishing.

4.0.0-rc.12 - 2025-12-19

Fixed

  • Fixed Python wheels PDFium bundling with correct feature flag.
  • Fixed C# MSBuild target for native assets.
  • Fixed Ruby bindings unsafe keyword for Rust 2024 edition.
  • Fixed Docker ONNX Runtime package name for Debian Trixie.

4.0.0-rc.11 - 2025-12-18

Fixed

  • Fixed PDFium bundling now correctly included in all language bindings.
  • Fixed C# native libraries build target for platform-specific copies.
  • Fixed Ruby gem publishing with double-compression validation errors.
  • Fixed Go Windows linking with duplicate CGO linker flags.
  • Added WASM PDF extraction support for browser and Node.js.

4.0.0-rc.10 - 2025-12-16

Breaking Changes

  • PDFium feature names changed: pdf-static -> static-pdfium, pdf-bundled -> bundled-pdfium, pdf-system -> system-pdfium.
  • Default PDFium linking changed to bundled-pdfium.
  • Go module path moved to github.com/kreuzberg-dev/kreuzberg/packages/go/v4.

Fixed

  • Fixed Windows CLI to include bundled PDFium runtime.
  • Added Go ExtractFileWithContext() and batch variants.
  • Replaced TypeScript any types with proper definitions.

4.0.0-rc.9 - 2025-12-15

Added

  • PDFIUM_STATIC_LIB_PATH environment variable for custom static PDFium paths in Docker builds.

Fixed

  • Fixed Python wheels to include typing metadata (.pyi stubs).
  • Fixed Java Maven packages to bundle platform-specific native libraries.
  • Fixed Node npm platform packages to contain compiled .node binaries.
  • Fixed WASM Node.js runtime crash with self is not defined.
  • Fixed PDFium static linking to correctly search for libpdfium.a.

4.0.0-rc.8 - 2025-12-14

Added

  • MCP HTTP Stream transport with SSE support.

Fixed

  • Fixed Go CGO library path configuration for Linux and macOS.
  • Fixed Python wheels manylinux compatibility.
  • Fixed Ruby gems to remove embedding model cache from distribution.
  • Fixed Maven Central publishing to use modern Sonatype Central API.

4.0.0-rc.7 - 2025-12-12

Added

  • Configurable PDFium linking: pdf-static, pdf-bundled, pdf-system Cargo features.
  • WebAssembly bindings with full TypeScript API for browser, Cloudflare Workers, and Deno.
  • RTF extractor improvements with structured table extraction and metadata support.
  • Page tracking redesign with byte-accurate page boundaries and per-page metadata.

Changed

  • BREAKING: ChunkMetadata fields renamed: char_start -> byte_start, char_end -> byte_end. (#226)

Fixed

  • Fixed Ruby gem corruption from embedding model cache in distribution.
  • Fixed Java FFM SIGSEGV from struct alignment on macOS ARM64.
  • Fixed C# variable shadowing compilation errors.

4.0.0-rc.6 - 2025-12-10

Added

  • core feature for lightweight FFI build without ONNX Runtime.

Fixed

  • Fixed ODT table extraction with duplicate content.
  • Fixed ODT metadata extraction to match Office Open XML capabilities.
  • Fixed Go Windows MinGW builds by disabling embeddings feature.
  • Fixed Ruby rb-sys conflict by removing vendoring.
  • Fixed Python text extraction missing format_type metadata field.

4.0.0-rc.5 - 2025-12-01

Breaking Changes

  • Removed all Pandoc dependencies; native Rust extractors now handle all 12 previously Pandoc-supported formats (LaTeX, EPUB, BibTeX, Typst, Jupyter, FictionBook, DocBook, JATS, OPML, Org-mode, reStructuredText, RTF).

Fixed

  • Fixed macOS CLI binary missing libpdfium.dylib at runtime.
  • Fixed Windows Go builds with GNU toolchain detection.
  • Fixed Ruby Bundler 4.0 gem installation failures.

4.0.0-rc.4 - 2025-12-01

Fixed

  • Fixed crates.io and Maven Central publishing authentication.
  • Fixed ONNX Runtime mutex errors and deadlocks.

4.0.0-rc.3 - 2025-12-01

Fixed

  • Fixed NuGet publishing authentication.
  • Fixed CLI binary packages to include libpdfium shared library.

4.0.0-rc.2 - 2025-11-30

Breaking Changes

  • TypeScript/Node.js package renamed from kreuzberg to @kreuzberg/node.

Added

  • C#/.NET bindings using .NET 9+ FFM API.
  • MkDocs documentation site with multi-language examples and API reference.

Fixed

  • Fixed Tesseract OCR API call ordering.
  • Fixed Go Windows CGO MinGW linking.
  • Fixed embeddings model cache lock poisoning recovery.

[4.0.0-rc.1] - 2025-11-23

Major Release - Complete Rewrite

Complete architectural rewrite from Python-only to Rust-core with polyglot bindings.

Architecture

  • Rust core with all extraction logic for performance.
  • Polyglot bindings: Python (PyO3), TypeScript/Node.js (NAPI-RS), Ruby (Magnus), Java (FFM API), Go (CGO).
  • 10-50x performance improvements with streaming parsers for multi-GB files.

Added

  • Plugin system: PostProcessor, Validator, Custom OCR, Custom Document Extractors.
  • Language detection with automatic multi-language support.
  • RAG and embeddings with 4 presets (fast/balanced/quality/multilingual).
  • Image extraction from PDFs and PowerPoint with metadata.
  • Stopwords system for 64 languages.
  • Comprehensive format-specific metadata for PDF, Office, Email, Images, XML, HTML.
  • MCP server for Claude integration.
  • Docker support with multi-variant images and OCR backends.

Changed

  • Async-first API; sync variants have _sync suffix.
  • Strongly-typed config and metadata.
  • New API: extract() -> extract_file(), added extract_bytes(), batch_extract_files().

Removed

  • Pure-Python API, Pandoc dependency, GMFT, spaCy entity extraction, KeyBERT, document classification.

Breaking Changes

  • Python 3.10+, Node.js 18+, Rust 1.75+ required.
  • Binary wheels only.
  • TypeScript/Node.js package renamed to @kreuzberg/node.
  • char_start/char_end -> byte_start/byte_end.

See the API Reference for details.


3.22.0 - 2025-11-27

Fixed

  • Fixed EasyOCR import error handling.
  • Hardened HTML regexes for script/style stripping.

3.21.0 - 2025-11-05

Added

  • Complete Rust core library with document extraction pipeline, plugin system, PDF/Office/HTML/XML extraction, OCR subsystem, image processing, text processing, cache, embeddings, MCP server, and CLI.
  • Language bindings: Python (PyO3), TypeScript (NAPI-RS), Ruby (Magnus), Java (FFM API), Go (CGO), C# (FFI).
  • REST API server and MCP server for Claude integration.

Changed

  • Architecture restructured around Rust core with thin language-specific wrappers.
  • Build system upgraded to Rust Edition 2024 with Cargo workspace.

Removed

  • Old v3 codebase superseded by v4.

Security

  • All dependencies audited, sandboxed subprocess execution, input validation, memory safety via Rust.

Performance

  • Streaming PDF extraction, zero-copy patterns, SIMD optimizations, ONNX Runtime for embeddings, async-first design.

3.20.2 - 2025-10-11

Fixed

  • Fixed missing optional dependency errors in GMFT extractor.

3.20.1 - 2025-10-11

Changed

  • Optimized sdist size by excluding unnecessary files.

3.20.0 - 2025-10-11

Added

  • Python 3.14 support.

Changed

  • Migrated HTML extractor to html-to-markdown v2.

3.19.1 - 2025-09-30

Fixed

  • Fixed Windows Tesseract 5.5.0 HOCR output compatibility.
  • Fixed TypedDict configs with type narrowing and cast.

3.19.0 - 2025-09-29

Added

  • Context-aware exception handling with critical system error policy.

Fixed

  • Aligned sync/async OCR pipelines and fixed Tesseract PSM enum handling.
  • Removed magic library dependency.
  • Added Windows-safe fallbacks for CLI progress.
  • Fixed ValidationError handling in batch processing.

3.18.0 - 2025-09-27

Added

  • API server configuration via environment variables.
  • Auto-download missing spaCy models for entity extraction.
  • Regression tests for German image PDF extraction. (#149)

Changed

  • Updated html-to-markdown to latest version.

Fixed

  • Fixed HOCR parsing issues.

3.17.0 - 2025-09-17

Added

  • Token reduction for text optimization with streaming support.

Fixed

  • Fixed excessive markdown escaping in OCR output. (#133)

3.16.0 - 2025-09-16

Added

  • Enhanced JSON extraction with schema analysis and custom field detection.

Fixed

  • Fixed HTMLToMarkdownConfig not exported in public API.
  • Fixed EasyOCR module-level variable issues.
  • Fixed Windows-specific path issues.

3.15.0 - 2025-09-14

Added

  • Comprehensive image extraction support.
  • Polars DataFrame and PIL Image serialization for API responses.

Fixed

  • Fixed TypeError with unhashable dict in API config merging.

3.14.0 - 2025-09-13

Added

  • DPI configuration system for OCR processing.

Changed

  • Enhanced API with 1GB upload limit and comprehensive OpenAPI documentation.
  • Completed pandas to polars migration.

3.13.0 - 2025-09-04

Added

  • Runtime configuration API with query parameters and header support.
  • OCR caching system for EasyOCR and PaddleOCR backends.

Changed

  • Replaced pandas with polars for table extraction.

Fixed

  • Fixed Tesseract TSV output format and table extraction.
  • Fixed UTF-8 encoding handling across document processing.
  • Fixed HTML-to-Markdown configuration externalization.
  • Fixed regression in PDF extraction and XLS file handling.

3.12.0 - 2025-08-30

Added

  • Multilingual OCR support in Docker images with flexible backend selection.

Changed

  • Simplified Docker images to base and core variants.

Fixed

  • Fixed naming conflict in CLI config command.

3.11.1 - 2025-08-13

Fixed

  • Fixed EasyOCR device-related parameters passed to readtext() calls.
  • Optimized numpy import to only load inside process_image_sync for faster startup.

3.11.0 - 2025-08-01

Changed

  • Implemented Python 3.10+ syntax optimizations.

Fixed

  • Fixed image extractor async delegation.
  • Fixed timezone assertion in spreadsheet metadata.
  • Fixed ExceptionGroup import for Python 3.10+ compatibility.
  • Fixed _parse_date_string bug.

3.10.0 - 2025-07-29

Added

  • PDF password support through new crypto extra feature.

3.9.0 - 2025-07-17

Added

  • Initial release of v3.9.0 series.

[3.8.0] - 2025-07-16

Added

  • Foundation for v3.8.0 release.

3.7.0 - 2025-07-11

Added

  • MCP server for AI integration enabling Claude integration with document extraction.

Fixed

  • Fixed chunk parameters to prevent overlap validation errors.
  • Fixed HTML test compatibility with html-to-markdown v1.6.0.

3.6.0 - 2025-07-04

Added

  • Language detection integrated into extraction pipeline.

Fixed

  • Completed entity extraction migration from gliner to spaCy.

Changed

  • spaCy now used for entity extraction replacing gliner.

3.5.0 - 2025-07-04

Added

  • Language detection with configurable backends.
  • Full synchronous support for PaddleOCR and EasyOCR backends.

Fixed

  • Fixed chunking default configuration.
  • Fixed PaddleOCR sync implementation for v3.x API.

Changed

  • Python 3.10+ now required (3.9 support dropped).

3.4.0 - 2025-07-03

Added

  • API support with Litestar framework for web-based document extraction.
  • EasyOCR and GMFT Docker build variants.

Fixed

  • Fixed race condition in GMFT caching.
  • Fixed race condition in Tesseract caching.

3.3.0 - 2025-06-23

Added

  • Isolated process wrapper for GMFT table extraction.
  • CLI support with Click framework.
  • Pure synchronous extractors without anyio dependencies.
  • Document-level caching with per-file locks and parallel batch processing.
  • Thread lock for pypdfium2 to prevent macOS segfaults.

Fixed

  • Fixed Windows-specific multiprocessing and utils failures.
  • Fixed file existence validation in extraction functions.

Changed

  • Replaced msgspec JSON with msgpack for 5x faster cache serialization.

3.2.0 - 2025-06-23

Added

  • GPU acceleration support for OCR and ML operations.

Fixed

  • Fixed EasyOCR byte string issues.
  • Fixed Pandoc version issues.
  • Added multiple language support to EasyOCR.

3.1.0 - 2025-03-28

Added

  • GMFT (Give Me Formatted Tables) support for vision-based table extraction.

Changed

  • Image extraction now non-optional in results.

3.0.0 - 2025-03-23

Added

  • Chunking functionality for document segmentation.
  • Extractor registry for managing format-specific extractors.
  • Hooks system for pre/post-processing.
  • OCR backend abstraction with EasyOCR and PaddleOCR support.
  • Multiple language support in OCR backends.

Fixed

  • Fixed Windows error message handling.
  • Fixed PaddleOCR integration issues.

Changed

  • Refactored structure for improved organization.
  • OCR integration with configurable backends.

See Also

Edit this page on GitHub