Changelog¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
4.0.0 - 2026-01-10¶
Highlights¶
This is the first stable release of Kreuzberg v4, a complete rewrite of the document intelligence library with a Rust core and polyglot bindings for Python, TypeScript, Ruby, PHP, Java, Go, C#, Elixir, and WebAssembly.
Added¶
FFI & Language Bindings¶
- Python FFI Error Handling:
get_last_error_code()andget_last_panic_context()now properly call kreuzberg-ffi panic shield instead of returning stubs - PHP Custom Extractor Support: Metadata and tables from custom PHP extractors now flow through to extraction results via new
php_zval_to_json_value()andphp_array_to_table()helpers - Dynamic Tesseract Language Discovery: OCR backend now queries Tesseract installation for available languages instead of using hardcoded list, with lazy caching via
OnceLockand fallback to default language set
CI/CD¶
- ARM64 npm Indexing: Enabled npm indexing wait step for linux-arm64-gnu packages in publish workflow
Fixed¶
Test Alignment¶
- PHP Batch Error Handling: Aligned PHP tests with other language bindings - batch operations now return error results in metadata instead of throwing exceptions (matching Ruby, TypeScript, Python behavior)
- PHP Embedding Tests: Made embedding tests skip gracefully when ONNX runtime unavailable on ARM platforms
- PHP Image Extraction Tests: Made image extraction tests skip when pdfium unavailable
- TypeScript OCR Tests: Made OCR configuration tests skip gracefully when Tesseract initialization fails on Windows
Removed¶
Legacy Support¶
- V3 Code Removal: Completely removed v3 legacy Python package and infrastructure
- Deleted
/v3/directory containing standalone v3.22.0 LTS package - Removed
.github/workflows/publish-v3.yamlworkflow - Removed v3 documentation build scripts and tasks
- Cleaned up workspace configuration references
- V3 users should migrate to v4 using the migration guide
- V3 package remains available on PyPI for legacy installations
Fixed¶
CI/Build¶
- WASM TypeScript Typecheck: Fixed TypeScript typecheck failing in CI when WASM module not yet built by using dynamic imports
- Go Windows CI: Added Windows platform support for Go task commands using PowerShell equivalents
- Documentation Build: Removed benchmark visualizer from docs deployment (temporarily disabled)
4.0.0-rc.29 - 2026-01-08¶
Added¶
Documentation¶
- Platform Support Documentation: Added comprehensive platform support sections to all READMEs
- Main README now includes platform support table showing Linux x86_64, Linux aarch64, macOS ARM64, and Windows x64 coverage for all language bindings
- Python, Node.js, TypeScript, Ruby, Elixir, Go, Java, C#, and PHP READMEs now document precompiled binary availability
- Installation guide includes architecture support section highlighting aarch64 availability
- CONTRIBUTING.md includes new cross-architecture testing section explaining ubuntu-24.04-arm CI runners
- All documentation correctly reflects macOS support is Apple Silicon only (no Intel)
- Corrected Ruby platform documentation (no Windows support currently)
4.0.0-rc.28 - 2026-01-07¶
Added¶
API Server¶
- Embedding Endpoint: Added new
POST /embedendpoint for generating embeddings from text without document extraction (#266) - Accepts JSON body with
textsarray and optionalconfig(embedding model, batch size, cache directory) - Returns embeddings with model info, dimensions, and embedding count
- Supports all embedding presets (fast, balanced, quality, multilingual)
- Includes comprehensive test coverage and Docker integration tests
- Server Configuration: Added comprehensive
ServerConfigtype for API server settings - File-based configuration (TOML/YAML/JSON) for host, port, CORS origins, and upload limits
- Environment variable overrides for all server settings (KREUZBERG_HOST, KREUZBERG_PORT, KREUZBERG_CORS_ORIGINS, etc.)
- Configuration precedence: CLI args > Environment variables > Config file > Defaults
- Separate server config examples (server.toml/yaml/json)
Observability¶
- OpenTelemetry: Added tracing instrumentation to all API endpoints (
api.extract,api.embed,api.health,api.info,api.cache_stats,api.cache_clear) - Tracks request counts, model usage, and file counts
- Compatible with OpenTelemetry collectors and distributed tracing systems
Fixed¶
API Server & CLI¶
- CLI ServerConfig Integration: Fixed CLI to properly use ServerConfig from config files
- CLI now respects CORS origins from configuration files (previously ignored)
- CLI now respects upload size limits from configuration files (previously re-parsed from env only)
- Added new
serve_with_server_config()function for proper config handling - Clippy Warnings: Fixed manual ceiling division by using stdlib
div_ceil()method - Integration Tests: Added
#[cfg(feature = "api")]gate to prevent import errors when api feature disabled - Linting: Resolved all clippy
field_reassign_with_defaultwarnings in test code - ShellCheck: Fixed all shell script warnings in Docker test script (variable declarations, quoting)
Docker & CI¶
- Docker Test Script: Fixed critical infinite recursion bug in
get_image_name()function - Docker Entrypoint: Fixed wrong binary path in custom config test (/app/kreuzberg → /usr/local/bin/kreuzberg)
- Docker Command: Added missing 'serve' subcommand to custom config test
- CI Artifact Upload: Made test results artifact upload conditional on file existence
Configuration Examples¶
- Server Section: Removed misplaced
[server]section from extraction config examples - Keyword Params: Fixed YAKE/RAKE parameter examples to match actual source code
- YAKE: Only
window_sizeparameter exists (removed non-existent deduplication fields) - RAKE: Only
min_word_lengthandmax_words_per_phraseexist (removed non-existent stopwords/delimiters) - Security: Changed default host from
0.0.0.0to127.0.0.1for safer defaults
Documentation¶
- ServerConfig: Removed incorrect
show_download_progressfield from ServerConfig example (belongs to EmbeddingConfig) - PdfConfig: Added missing
hierarchyfield to PdfConfig documentation table - TypeScript Snippets: Converted all WASM snippets from JavaScript to TypeScript with proper type annotations
- Go Installation: Updated README with correct
go getinstructions for monorepo structure (#264) - Clarified that
@latestdoesn't work due to Go module discovery limitations - Added explicit version tag examples and version discovery command
- Recommended automated installer script as primary installation method
PHP¶
- Table Extraction: Fixed
extract_tablesconfig flag to properly filter table results
4.0.0-rc.27 - 2026-01-04¶
Fixed¶
Publishing & Distribution¶
- WASM npm Package: Fixed module initialization failure caused by incorrect import paths in minified output - updated copy-pkg.js script to handle multi-line import statements generated by tsup
4.0.0-rc.26 - 2026-01-03¶
Fixed¶
Publishing & Distribution¶
- Node.js macOS Bindings: Re-enabled macOS ARM64 builds in publish workflow (fixes npm package installation on macOS)
- WASM npm Package: Fixed missing WASM binaries by removing .gitignore from dist/pkg directory during build
- Elixir hex.pm: Removed private organization configuration and added version validation to enable public publishing
- Homebrew Bottles: Fixed bottle upload pattern from double-dash to single-dash to match build filename format
CI/CD¶
- PHP CI: Fixed core extension loading by removing -n flag from test commands
- Ruby Windows CI: Added bindgen blocklists for intrinsic headers to resolve GCC/Clang conflicts
- Node.js macOS CI: Increased test timeout from 30s to 60s for embedding tests
- Elixir CI: Fixed try-catch-rescue clause ordering in async operations tests
- C# Windows CI: Skipped flaky timing-based concurrent test
- Docker Build: Added kreuzberg_rustler to sed exclusion list in Dockerfile
- Benchmark Harness: Fixed execute permissions on downloaded artifacts
- Rust Linting: Resolved Rust 2024 edition compatibility with pre-commit hooks
- PHP PIE Builds: Fixed macOS linker flags and Windows sha256sum command compatibility
Version Management¶
- Elixir Validation: Added packages/elixir/mix.exs to version consistency validation script
4.0.0-rc.25 - 2026-01-03¶
Fixed¶
CI/CD¶
- PHP macOS CI: Added dynamic lookup linker flags to test-php job to resolve undefined symbol errors
- PHP Fingerprint Cache: Removed macOS-only condition from cargo cache cleanup to prevent fingerprint conflicts on all platforms
- Go Binding: Added comprehensive chunking config validation (negative values, excessive sizes, overlap constraints)
- Go Binding: Removed overly strict empty data validation to allow empty strings
- Go Error Tests: Corrected error path test expectations for batch operations
- Ruby Windows Build: Changed to PowerShell and added explicit C:\t directory cleanup to resolve path mismatches
- C# E2E Tests: Fixed inverted logic in legacy Office format skip detection (KREUZBERG_SKIP_LEGACY_OFFICE)
- Python CI: Corrected binary name from kreuzberg-cli to kreuzberg to match Cargo.toml configuration
- Elixir Tests: Fixed image assertion logic to correctly handle nil results
- Elixir Tests: Changed table struct lookups from string keys to atom keys (:cells, :rows, :headers)
- Elixir Tests: Added timeout handling to async operations test with 100ms fast-fail
Benchmarks¶
- sccache Resilience: Added DNS error detection and automatic fallback to direct compilation
- Pandoc Timeout: Added 60-second timeout to prevent indefinite hangs during extraction
- Workflow Optimization: Removed 60 unnecessary steps from 12 third-party framework jobs (saves 48-72 minutes per run, 12GB bandwidth)
- Artifact Management: Split benchmark harness binary (5MB) from full target directory (500MB) for faster downloads
Publish Workflow¶
- NuGet Artifacts: Upgraded upload-artifact action from v6 to v7 to match download version
- Job Cleanup: Removed redundant upload-csharp-release job (C# toolchain only requires NuGet.org publication)
Language Bindings¶
- Java FFI: Use Arena.global() for thread-safe C string reads across all FFI functions
- Ruby Safety: Removed unnecessary unsafe blocks and fixed type conversions for safer code
Changed¶
- C# Target Framework: Updated all projects and documentation to .NET 10.0
4.0.0-rc.24 - 2026-01-01¶
Fixed¶
- Go Windows CI: Added explicit CGO directives to bypass pkg-config on Windows
- Ruby Windows Build: Added Windows platform handling in build.rs, enabled embeddings feature on Windows GNU
- Node Windows Tests: Fixed symlink resolution using realpathSync for Windows compatibility
- C# Tests: Fixed null reference warnings in config tests
- WASM CI: Fixed artifact download path to include pkg/ subdirectory
- Homebrew Formula: Fixed bottle naming convention, added source sha256 fetching
- PHP PIE Build: Corrected task name and extension filename
- C# NuGet Upload: Added proper conditional check for artifact existence
- Python CI: Fixed test failures and compatibility issues
- Elixir CI: Fixed build and compilation warnings
- WASM Deno: Fixed type definitions and Ruby Windows build
4.0.0-rc.23 - 2026-01-01¶
Added¶
Java¶
- EmbeddingConfig class: New type-safe configuration class with builder pattern for embedding generation
- 7 configurable fields: model, normalize, batchSize, dimensions, useCache, showDownloadProgress, cacheDir
- Comprehensive test suite with 27 test methods (321 lines)
- Full integration with ExtractionConfig
- toMap/fromMap serialization support
C¶
- EmbeddingConfig sealed class: Type-safe replacement for Dictionary-based embedding configuration
- 5 properties with init-only accessors
- JSON serialization with snake_case mapping
- Comprehensive test suite with 50 test methods
- Updated ChunkingConfig to use EmbeddingConfig instead of Dictionary
Node.js (NAPI-RS)¶
- Worker Thread Pool APIs: Complete concurrent extraction system
createWorkerPool(size?): Create worker pool with configurable sizegetWorkerPoolStats(pool): Monitor pool utilizationextractFileInWorker(pool, ...): Extract single file in worker threadbatchExtractFilesInWorker(pool, ...): Extract multiple files concurrentlycloseWorkerPool(pool): Graceful pool shutdown- 17 test methods (468 lines) covering all APIs
- Auto-generated TypeScript type definitions via NAPI-RS
Test Coverage¶
- Node.js: 54 new tests (batch operations, worker pool, 15 config types)
- WASM: 122 new tests (batch operations, embeddings, keywords, tables, 8 config suites)
- TypeScript: 62 new tests (async operations, batch, 19 config types)
- Java: 27 new EmbeddingConfig tests, 13 new config type tests
- C#: 50 new EmbeddingConfig tests, 14 new config type tests
- Python: 14 new config type tests, batch operations, embeddings advanced tests
- Ruby: 14 new config type tests, async operations, batch operations
- Go: Comprehensive config tests, mutex safety tests
- Total: 200+ new tests across all bindings
Documentation¶
- README Template System: Template-based generation for all binding READMEs
- Created
scripts/readme_templates/with Jinja2 templates - Created
scripts/readme_config.yamlfor language-specific configurations - Added snippet system in
docs/snippets/for code examples - Template partials for badges, installation, features, quick start
- Worker Pool Documentation: Complete examples and best practices
- Code snippet:
docs/snippets/typescript/advanced/worker_pool.md - Performance benefits and usage patterns documented
- Config Discovery Documentation: Automatic config file loading examples
- Code snippet:
docs/snippets/typescript/config/config_discovery.md - NAPI-RS Implementation Details: Technical documentation for Node.js binding
- Template partial:
scripts/readme_templates/partials/napi_implementation.md.jinja - Threading model, memory management, performance characteristics
Fixed¶
- Page Marker Bug: Fixed page markers to include page 1 (previously only inserted for page > 1)
- Modified
crates/kreuzberg/src/pdf/text.rs:292to fix insertion logic - Fixed C# default marker format in
packages/csharp/Kreuzberg/Serialization.cs:109 - Fixed C# config serialization in
packages/csharp/Kreuzberg/KreuzbergClient.cs:1274 -
Added comprehensive test suite:
crates/kreuzberg/tests/page_markers.rs(13 tests) -
Go Concurrency Crashes: Fixed segfaults and SIGTRAP errors in concurrent operations
- Added
ffiMutex sync.Mutexinpackages/go/v4/binding.gofor thread-safe FFI calls - PDFium is not thread-safe; all FFI calls now protected by mutex
- Verified with
-raceflag: zero race conditions - All 410 tests now pass consistently without crashes
Changed¶
- Code Formatting: Standardized formatting across all 10 language bindings
- Rust: Applied
rustfmtto all crates - Java: Applied Spotless (Google Java Format)
- C#: Applied
dotnet format - PHP: Applied PHP CS Fixer
- Shell: Applied
shfmtformatting -
All pre-commit hooks now passing
-
README Updates: Regenerated all binding READMEs from templates
- Node.js: Added worker pool section, LibreOffice notes, NAPI-RS details
- TypeScript: Updated with all new config types
- All bindings: Consistent structure and formatting
Performance¶
- Node.js Worker Pools:
- Parallel document processing across CPU cores
- Configurable pool size (defaults to CPU count)
- Queue management for efficient task distribution
- Prevents thread exhaustion with bounded concurrency
4.0.0-rc.22 - 2025-12-27¶
Added¶
- PHP bindings - New PHP extension with comprehensive FFI bindings
- PHP E2E test suite - Generated 65 comprehensive E2E tests from fixtures
- Email extraction tests
- HTML processing tests
- Image extraction tests
- OCR functionality tests (5 scenarios)
- Office document tests (16 formats)
- PDF extraction tests (16 scenarios)
- Plugin API tests (14 API functions)
- Smoke tests (7 formats)
- Structured data tests (JSON/YAML)
- XML extraction tests
- Root composer.json - Added composer.json at repository root for Packagist publishing
- HTML metadata extraction - Rich structured metadata from HTML documents
- Headers extraction with hierarchy (level, text, id, depth, html_offset)
- Links extraction with type classification (anchor, internal, external, email, phone)
- Images extraction with dimensions and type detection (data-uri, inline-svg, external, relative)
- Structured data extraction (JSON-LD, Microdata, RDFa)
- New fields:
language,text_direction,meta_tags
Fixed¶
- C# target framework - Changed from net10.0 (preview) to net8.0 LTS
- .NET 10 preview caused NuGet restore hangs
- .NET 8 is latest stable LTS version with FFM API support
- Homebrew check timeout - Added timeouts to prevent 55+ minute hangs
- Job timeout: 5 minutes
- Step timeout: 3 minutes
- Command timeout: 120 seconds
- Documentation - Standardized all README badges and removed AI-generated content
- Consistent blue badge colors across all language bindings
- Added Packagist badge to PHP README
- Removed emojis and marketing language
- Converted all relative links to absolute GitHub URLs
- Ruby vendor script - Added missing workspace dependency inlining for
lzma-rust2andparking_lot
Changed¶
- Version sync - Updated scripts/sync_versions.py to include root composer.json
- BREAKING: HTML metadata structure - Replaced YAML frontmatter parsing with single-pass metadata extraction
- keywords: Changed from
Option<String>(comma-separated) toVec<String>(array) - canonical: Renamed to
canonical_urlfor clarity - Open Graph fields: Consolidated
og_*fields intoopen_graph: BTreeMap<String, String> - Twitter Card fields: Consolidated
twitter_*fields intotwitter_card: BTreeMap<String, String> - New structured types:
headers: Vec<HeaderMetadata>,links: Vec<LinkMetadata>,images: Vec<ImageMetadataType>,structured_data: Vec<StructuredData> - Migration guide: See
docs/migration/v4.0-html-metadata.mdfor upgrade instructions
4.0.0-rc.21 - 2025-12-26¶
Fixed¶
- PDF initialization race conditions - Resolved segfaults, JVM crashes, and concurrency errors across all language bindings
- Fixed Pdfium singleton initialization pattern that caused test failures
- Rust E2E tests: Fixed segfaults by returning Pdfium instance from bind_pdfium()
- Java tests: Removed incompatible JVM flags (
-XX:-ClassUnloading) causing crashes on macOS ARM64 with Java 25 FFI - C# tests: Fixed concurrency errors (improved from 0/29 to 17/29 passing tests)
- Go tests: Applied consistent error message formatting to all PDF error paths
- Test app infrastructure - Fixed published package validation test apps
- Go: Removed invalid local path replace directive from go.mod
- Ruby: Added proper gem installation step before running tests
- Rust: Added workspace exclusion to prevent cargo workspace errors
- EPUB metadata extraction - Fixed incorrect field mapping (created_at → subject) in EPUB parser
Added¶
- CLI test app - Comprehensive test suite for validating kreuzberg-cli published to crates.io
- Installation verification from crates.io
- Extraction tests (PDF, DOCX, XLSX with JSON/markdown output)
- HTTP API server tests (health endpoint, POST /extract)
- MCP server tests (startup and responsiveness)
Changed¶
- CI workflows - All language CI workflows now only run on pull requests to main
- Removed push event triggers from all ci-* workflows
- Preserved workflow_dispatch triggers where needed
- Reduced unnecessary CI runs while maintaining PR quality checks
4.0.0-rc.20 - 2025-12-25¶
- Font configuration API - Configurable font provider with custom directory support and automatic path expansion
4.0.0-rc.19 - 2025-12-24¶
Added¶
- Homebrew bottle support - Pre-built macOS bottles for faster installation
- Environment variable configuration -
KREUZBERG_MAX_REQUEST_BODY_BYTESandKREUZBERG_MAX_MULTIPART_FIELD_BYTESfor API size limits - Config file caching - Improved performance for TOML/YAML/JSON config file loading
Fixed¶
- Large file uploads (issue #248) - Files larger than 2MB are now accepted (configurable up to 100MB)
- Browser package Vite compatibility (issue #249) - Fixed missing
pdfium.jsin dist bundle - Node.js missing binaries (issue #241) - Fixed resolution in Docker and pnpm monorepo environments
- Ruby gem native extension build - Simplified build system with proper linker path resolution
- Java E2E test compatibility - Regenerated tests for Java 25
- Docker ONNX Runtime - Pinned to version 1.23 for compatibility
- Font provider thread safety - Fixed race condition and graceful lock poisoning handling
- Custom font path validation - Added security hardening with symlink resolution and canonicalization
Changed¶
- BREAKING: Custom font provider now enabled by default (set
enabled = falseto disable) - Default API size limit - Increased to 100MB (configurable via
KREUZBERG_MAX_UPLOAD_SIZE_MB) - TypeScript serialization - Replaced MessagePack + Base64 with direct JSON serialization
- PDF dependency - Temporarily using
kreuzberg-pdfium-renderfork while awaiting upstream PR merge
Performance¶
- 15-25% overall execution improvement, 30-45% memory reduction
- String allocation optimization (-2.57% CPU, -0.81% memory)
- Memory pool improvements (35-50% reduction: 60-135 MB → 30-80 MB)
- Post-processing pipeline optimizations (10-17% CPU improvement)
- TypeScript thread pool dynamically sized by CPU count
Removed¶
- Legacy code cleanup - Removed deprecated backward compatibility
- TypeScript:
KREUZBERG_LEGACY_SERIALIZATIONenvironment variable - Go: 7 legacy error codes
- Ruby:
Ocr = OCRalias - Rust: Deprecated
Metadata.datefield (replaced bycreated_at- see Migration Guide) - Cargo: 3 legacy feature aliases
Security¶
- Custom font directories validated with canonicalization
- Symlinks resolved to prevent path traversal attacks
- All custom paths validated before use
4.0.0-rc.18 - 2025-12-23¶
Fixed¶
- Ruby gem packaging - Fixed missing kreuzberg-ffi crate in vendored dependencies
- Ruby gem macOS - Fixed linker errors during gem build and installation
- Python wheels macOS - Fixed ImportError from hardcoded dylib paths
4.0.0-rc.17 - 2025-12-22¶
Added¶
- Docker ARM64 support - Multi-architecture Docker images now support linux/arm64
Fixed¶
- Python wheels macOS - Fixed ImportError when installing from wheel
- Ruby gems macOS - Fixed linker errors during gem installation
- TypeScript plugin registration - Fixed TypeError with JavaScript-style plugins
Performance¶
- Go bindings - Improved ConfigMerge performance with native field copying
4.0.0-rc.16 - 2025-12-21¶
Added¶
- Batch processing APIs - 4-6x throughput improvement for high-volume document extraction
Fixed¶
- Python IDE support - Type stub files (
.pyi) now included in wheel distributions - Java Maven compatibility - Fixed CI builds with Maven 4.0.0-rc-4+ support
- Go Windows linking - Resolved duplicate linker flags causing compilation failures
- Ruby gem compilation - Fixed missing link search paths on Linux and Windows
- Ruby gem publishing - Fixed artifact corruption during publication process
Performance¶
- Batch operations - 2-3x batch throughput gains with FFI streaming
- C# optimizations - JSON serialization with source generation
- TypeScript/Node.js - Config validation and batch operation integration
4.0.0-rc.15 - 2025-12-20¶
Fixed¶
- Node.js Windows publishing - Windows x64 platform packages now publish correctly to npm
4.0.0-rc.14 - 2025-12-20¶
Added¶
- Comprehensive test suites - End-to-end tests for all language bindings
Fixed¶
- C# NuGet publishing - Switched to direct API key authentication
- LibreOffice in Docker - Updated to version 25.8.4
- Python IDE type hints - Type stub files now included in wheels
- Ruby gem compilation - Fixed Rust crate vendoring
- Python ExtractionResult - Fixed missing
pagesfield in IDE autocomplete
4.0.0-rc.13 - 2025-12-19¶
Fixed¶
- PDF bundled feature flag - Corrected flag to
bundled-pdfium - Go Windows linking - Fixed missing system libraries
- Ruby gem packaging - Added missing TOML dependency
- WASM distribution - Added compiled binaries for proper NPM publishing
4.0.0-rc.12 - 2025-12-19¶
Fixed¶
- Python wheels PDFium bundling - Corrected conditional compilation feature flag
- C# bindings - Fixed MSBuild target for CI-downloaded native assets
- Ruby bindings - Added missing
unsafekeyword for Rust 2024 edition - Docker - Corrected ONNX Runtime package name for Debian Trixie
4.0.0-rc.11 - 2025-12-18¶
Fixed¶
- PDFium bundling - Now correctly included in all language bindings
- C# native libraries - Build target properly copies platform-specific libraries
- Ruby gem publishing - Fixed validation errors from double-compression
- Go Windows linking - Removed duplicate CGO linker flags
- WASM - Added PDF extraction support for browser and Node.js
4.0.0-rc.10 - 2025-12-16¶
Breaking Changes¶
- PDFium feature names -
pdf-static→static-pdfium,pdf-bundled→bundled-pdfium,pdf-system→system-pdfium - Default PDFium linking -
pdffeature now defaults tobundled-pdfium - Go module path - Moved to
github.com/kreuzberg-dev/kreuzberg/packages/go/v4(update imports and rungo mod tidy)
Fixed¶
- Windows CLI - Now includes bundled PDFium runtime
- WASM - PDFium support for native targets
- Go bindings - Added
ExtractFileWithContext()and batch variants - TypeScript - Replaced
anytypes with proper definitions - Ruby bindings - Complete YARD documentation
- C# bindings - Complete XML documentation
4.0.0-rc.9 - 2025-12-15¶
Added¶
PDFIUM_STATIC_LIB_PATHenvironment variable - Enables custom static PDFium paths for Docker builds
Fixed¶
- Python - Wheels now include typing metadata (
.pyistubs) for IDE support - Java - Maven packages now bundle platform-specific native libraries
- Node - npm platform packages now contain compiled
.nodebinaries - WASM - Node.js runtime no longer crashes with
self is not defined - PDFium static linking - Fixed to correctly search for
libpdfium.a
4.0.0-rc.8 - 2025-12-14¶
Added¶
- MCP HTTP Stream transport - HTTP Stream transport for MCP server with SSE support
Fixed¶
- Go bindings - Fixed CGO library path configuration for Linux and macOS
- Python wheels - Now built with correct manylinux compatibility
- Ruby gems - Removed embedding model cache from distribution
- Maven Central - Updated publishing to use modern Sonatype Central API
4.0.0-rc.7 - 2025-12-12¶
Added¶
- Configurable PDFium linking -
pdf-static,pdf-bundled,pdf-systemCargo features - WebAssembly bindings - Full TypeScript API with sync/async extraction for browser, Cloudflare Workers, Deno
- RTF extractor improvements - Structured table extraction and metadata support
- Page tracking redesign - Byte-accurate page boundaries and per-page metadata
Changed¶
- BREAKING:
ChunkMetadatafield renames (#226) char_start→byte_startchar_end→byte_end- See migration guide for details
Fixed¶
- Ruby gem corruption - Excluded embedding model cache from distribution
- Java FFM SIGSEGV - Fixed struct alignment on macOS ARM64
- C# compilation errors - Fixed variable shadowing
- Python CI timeouts - Marked slow office document tests for proper test selection
4.0.0-rc.6 - 2025-12-10¶
Added¶
corefeature - Lightweight FFI build without ONNX Runtime (enables Windows MinGW compatibility)
Fixed¶
- ODT table extraction - Fixed duplicate content extraction
- ODT metadata extraction - Enhanced to match Office Open XML capabilities
- Go Windows MinGW builds - Disabled embeddings feature for Windows compatibility
- Ruby rb-sys conflict - Removed vendoring, now uses crates.io version
- Python text extraction metadata - Fixed missing
format_typefield - C# E2E tests - Fixed OCR tests with empty content
4.0.0-rc.5 - 2025-12-01¶
Breaking Changes¶
- Removed all Pandoc dependencies - Native Rust extractors now handle all 12 previously Pandoc-supported formats
- LaTeX, EPUB, BibTeX, Typst, Jupyter, FictionBook, DocBook, JATS, OPML, Org-mode, reStructuredText, RTF
- Benefits: No system dependencies, smaller Docker images, pure Rust codebase
Fixed¶
- macOS CLI binary - Fixed missing libpdfium.dylib at runtime
- Windows Go builds - Fixed GNU toolchain detection issues
- Ruby Bundler 4.0 - Fixed gem installation failures
4.0.0-rc.4 - 2025-12-01¶
Fixed¶
- Publishing workflow - Fixed crates.io and Maven Central authentication
- Language bindings - Resolved test failures across Ruby, Java, C#, and Node
- ONNX Runtime - Fixed mutex errors and deadlocks
4.0.0-rc.3 - 2025-12-01¶
Fixed¶
- NuGet publishing - Switched to API key authentication
- CLI binary packages - Included libpdfium shared library
4.0.0-rc.2 - 2025-11-30¶
Breaking Changes¶
- TypeScript/Node.js package - Renamed from
kreuzbergto@kreuzberg/node(scoped package) - TypeScript migration - Replace
import { ... } from 'kreuzberg'withimport { ... } from '@kreuzberg/node'
Added¶
- C#/.NET bindings - Complete C# bindings using .NET 9+ FFM API
- MkDocs documentation site - Multi-language examples and API reference
Fixed¶
- Tesseract OCR - Corrected API call ordering
- Go Windows CGO linking - Fixed MinGW compatibility
- Python async tests - Fixed pytest-asyncio configuration
- Embeddings model cache - Added lock poisoning recovery
4.0.0-rc.1 - 2025-11-23¶
Major Release - Complete Rewrite¶
Complete architectural rewrite from Python-only to Rust-core with polyglot bindings.
Architecture¶
- Rust core - All extraction logic implemented in Rust for performance
- Polyglot bindings - Python (PyO3), TypeScript/Node.js (NAPI-RS), Ruby (Magnus), Java (FFM API), Go (CGO)
- 10-50x performance improvements - Streaming parsers for multi-GB files with constant memory
Added¶
- Plugin system - PostProcessor, Validator, Custom OCR, Custom Document Extractors
- Language detection - Automatic multi-language detection
- RAG & embeddings - Automatic embedding generation with 4 presets (fast/balanced/quality/multilingual)
- Image extraction - Native extraction from PDFs and PowerPoint with metadata
- Stopwords system - 64 language support, compile-time embedded
- Comprehensive metadata - Format-specific extraction for PDF, Office, Email, Images, XML, HTML
- MCP server - Model Context Protocol integration for Claude
- Docker support - Multi-variant images with OCR backends
- CLI improvements - Rust-based CLI with better performance
Changed¶
- Async-first API - Primary extraction functions are async (sync variants have
_syncsuffix) - Strongly-typed config - All configuration uses typed structs/classes
- Strongly-typed metadata - Format-specific TypedDict/struct metadata
- New API -
extract()→extract_file(), addedextract_bytes(),batch_extract_files()
Removed¶
- Pure-Python API - No longer available
- Pandoc - Native Rust extractors for all formats
- GMFT - Vision-based table extraction (replaced with Tesseract-based)
- spaCy entity extraction - Use external NER with postprocessors
- KeyBERT - Use external keyword extraction with postprocessors
- Document classification - Use external classifiers with postprocessors
Breaking Changes¶
- Python 3.10+, Node.js 18+, Rust 1.75+ required
- Binary wheels only (no pure-Python installation)
- TypeScript/Node.js package renamed to
@kreuzberg/node char_start/char_end→byte_start/byte_end
See Migration Guide for details.
3.22.0 - 2025-11-27¶
Fixed¶
- Always attempt EasyOCR import before raising
MissingDependencyErrorto improve error handling - Hardened HTML regexes for script/style stripping to prevent edge cases
Added¶
- Test coverage for EasyOCR import path edge cases
Changed¶
- Updated v3 dependencies to current versions
3.21.0 - 2025-11-05¶
Major Release: Kreuzberg v4 Migration¶
This release introduces Kreuzberg v4, a complete rewrite with Rust core, polyglot bindings (Python/TypeScript/Ruby/Java/Go), and enhanced architecture. v3 remains available for legacy projects.
Added¶
Core (Rust)¶
- Complete Rust core library with comprehensive document extraction pipeline
- Plugin system architecture for extensible extractors, OCR backends, postprocessors, and validators
- PDF extraction and rendering with advanced table detection
- Format extraction for Office documents (DOCX, XLSX, PPTX), HTML, XML, and plain text
- OCR subsystem with pluggable backend support (Tesseract, EasyOCR, PaddleOCR)
- Image processing utilities with EXIF extraction and format conversion
- Text processing with token reduction, chunking, keyword extraction, and language detection
- Stopwords utilities for 50+ languages
- Cache system with in-memory and persistent storage
- Embeddings support via FastEmbed
- MCP (Model Context Protocol) server integration
- CLI binary for command-line document extraction
Language Bindings¶
- Python: PyO3 FFI bindings with Python-idiomatic API and async support
- TypeScript/Node.js: NAPI-RS bindings with full type definitions
- Ruby: Magnus FFI bindings with RBS type definitions
- Java: Java 25 Foreign Function & Memory API (FFM/Panama) bindings
- Go: CGO bindings with Go 1.25+ support
- C#: FFI bindings (baseline implementation)
API Server & MCP¶
- REST API server for document extraction
- MCP server for Claude integration
- Comprehensive configuration system with builder patterns
Testing & Infrastructure¶
- 95%+ test coverage across all components
- End-to-end test suites for all language bindings (auto-generated from fixtures)
- Comprehensive Rust test suite with unit/integration/doc tests
- Multi-language CI/CD with GitHub Actions
- Docker builds for containerized deployment
- Benchmark harness for performance analysis
Documentation¶
- New documentation site: https://docs.kreuzberg.dev
- Language-specific guides for all 6+ bindings
- Plugin development documentation
- API reference with code examples
- Architecture documentation
- Migration guide from v3 to v4
Changed¶
- Architecture restructured around Rust core with thin language-specific wrappers
- Build system upgraded to Rust Edition 2024 with Cargo workspace
- Dependency management via Cargo, npm/pnpm, PyPI, Maven, and Go modules
- Task automation via Taskfile.yaml
- Enhanced error handling with typed exceptions across all languages
- Configuration system redesigned for consistency across languages
Removed¶
- Old v3 codebase superseded by v4 (v3 remains available in separate branch)
- Legacy Python implementation details replaced by PyO3 bindings
- Previous Node.js implementation replaced by NAPI-RS
Security¶
- All dependencies audited for vulnerabilities
- Sandboxed subprocess execution (LibreOffice, Tesseract)
- Input validation on all user-provided data
- Memory safety via Rust
Performance¶
- Streaming PDF extraction for memory efficiency
- Zero-copy patterns throughout Rust core
- SIMD optimizations where applicable
- ONNX Runtime for embeddings
- Async-first design for I/O operations
3.20.2 - 2025-10-11¶
Fixed¶
- Surface missing optional dependency errors in GMFT extractor
- Stabilize aggregation data loading in benchmarks
- Make Docker E2E API checks resilient to transient failures
Changed¶
- Make comparative benchmarks manual-only to reduce resource consumption
Dependencies¶
- Updated dependencies to latest versions
3.20.1 - 2025-10-11¶
Fixed¶
- Correct publish docker workflow include path
Changed¶
- Optimize sdist size by excluding unnecessary files
3.20.0 - 2025-10-11¶
Added¶
- Python 3.14 support
Changed¶
- Migrate HTML extractor to html-to-markdown v2 for improved compatibility and performance
Fixed¶
- Comparative benchmarks: fix frozen dataclass mutation and aggregation workflow
- CI improvements and coverage adjustments
- Add pytest-benchmark to dev dependencies for benchmark workflow
Dependencies¶
- Bump astral-sh/setup-uv from 6 to 7
- Bump peter-evans/dockerhub-description from 4 to 5
- Bump actions/download-artifact from 4 to 5
- Bump actions/setup-python from 5 to 6
- Bump actions/checkout from 4 to 5
Documentation¶
- Document Python 3.14 support limitations
3.19.1 - 2025-09-30¶
Fixed¶
- Replace mocked Tesseract test with real file-based test
- Remove Rust build step from kreuzberg benchmarks workflow
- Resolve prek and mypy issues in comparative-benchmarks
- Add ruff config to comparative-benchmarks to allow benchmark patterns
- Ensure Windows Tesseract 5.5.0 HOCR output compatibility
- Properly type TypedDict configs with type narrowing and cast
Changed¶
- Optimize extractor configurations for fair comparison in benchmarks
- Complete comparative-benchmarks workspace with visualization and docs generation
Dependencies¶
- Updated dependencies to latest versions
Testing¶
- Add regression test for Issue #149 Windows Tesseract HOCR compatibility
3.19.0 - 2025-09-29¶
Added¶
- Enforce critical system error policy with context-aware exception handling
- Implement systematic exception handling audit across all modules
- Add German language pack support for Windows CI
Fixed¶
- Align sync/async OCR pipelines and fix Tesseract PSM enum handling
- Remove magic library dependency completely
- Prevent magic import crashes in Windows tests
- Properly mock magic import in CLI tests
- Correct asyncio.gather result assertion in profiler test
- Resolve benchmark test failures and naming cleanup
- Add Windows-safe fallbacks for CLI progress and magic
- Make OCR test less brittle for character recognition
- Eliminate all Rich Console instantiations at import time
- Prevent Windows access violation in benchmarks CLI
- Update OCR test assertion to match actual error message format
- Handle ValidationError gracefully in batch processing contexts
- Correct test fixture file paths and remove hardcoded paths
- Ensure coverage job only runs when all tests pass
Changed¶
- Remove verbose AI-pattern naming from components
- Enable coverage job for branches and update changelog
Dependencies¶
- Add missing needs declaration to python-tests job in workflows
Documentation¶
- Align documentation with v4-dev improvements
- Improve error handling policy documentation
Performance¶
- Parallelize comparative benchmarks with 6-hour timeout
3.18.0 - 2025-09-27¶
Added¶
- Make API server configuration configurable via environment variables
- Improve spaCy model auto-download with uv fallback
- Add regression tests for German image PDF extraction (issue #149)
Changed¶
- Use uv to install prek for latest version
- Replace pre-commit commands with Prek in Taskfile.yml
- Update pre-commit instructions to use Prek
- Update html-to-markdown to latest version
- Auto-download missing spaCy models for entity extraction
- Fix HOCR parsing issues
Fixed¶
- Resolve mypy type checking issues
- Prevent DeepSource upload when tests fail
- Make DeepSource upload optional when DEEPSOURCE_DSN is missing
- Prevent coverage-pr from running when test-pr fails
Dependencies¶
- Remove pre-commit dependency from dev requirements
- Updated dependencies to latest versions
CI/CD¶
- Use prek instead of pre-commit in validation workflow
Documentation¶
- Update contributing guide to use prek instead of pre-commit
- Update uv.lock to reflect dependency changes and remove upload-time attributes
3.17.0 - 2025-09-17¶
Added¶
- Add token reduction for text optimization
- Add concurrency settings to cancel in-progress workflow runs
- Optimize token reduction and update dependencies
- Optimize token reduction performance and add streaming support
Fixed¶
- Resolve excessive markdown escaping in OCR output (fixes #133)
- Remove invalid table extraction tests for non-existent functions
- Ensure comprehensive test coverage in CI
- Resolve Windows path compatibility and improve test coverage
- Critical issues from second review of token reduction
Changed¶
- Complete token reduction implementation overhaul
Testing¶
- Improve coverage and add pragma no cover annotations
3.16.0 - 2025-09-16¶
Added¶
- Enhance JSON extraction with schema analysis and custom field detection
- Add internal streaming optimization for html-to-markdown conversions
- Comprehensive test coverage improvements
Fixed¶
- Export HTMLToMarkdownConfig in public API
- Resolve EasyOCR module-level variable issues and adjust coverage requirement
- Resolve mypy type errors and test failures
- Add xfail markers to chunking, ML-dependent, cache, and language detection tests
- Add xfail markers for EasyOCR device validation tests
- Resolve CI test failures with targeted fixes
- Add missing Any import for type annotations
- Update docker e2e workflow to use correct test filename
- Prevent docker_e2e.py from being discovered by pytest
- Resolve Windows-specific path issues in tests
Changed¶
- Remove coverage fail_under threshold completely
- Remove unnecessary xfail markers
Dependencies¶
- Bump actions/download-artifact from 4 to 5
Testing¶
- Add comprehensive tests for API configuration caching
- Increase test coverage to meet CI requirements
3.15.0 - 2025-09-14¶
Added¶
- Add comprehensive image extraction support
- Add polars DataFrame and PIL Image serialization for API responses
- Improve test coverage across core modules
Fixed¶
- Resolve mypy type errors in CI
- Remove unused flame_config parameter from profile_benchmark
- Resolve CI formatting issues by ignoring generated AGENTS.md
- Resolve pre-commit formatting issues and ruff violations
- Resolve all test failures and achieve full test suite compliance
- Add polars DataFrame and PIL Image serialization for API responses
- Resolve TypeError unhashable type dict in API config merging
- Address linting and type issues from PR #130 review
Changed¶
- Apply ruff formatting across the codebase
Documentation¶
- Add comprehensive image extraction documentation
- Update documentation to use ImageOCRConfig instead of deprecated fields
Cleanup¶
- Remove INTERFACE_PARITY.md causing mdformat issues
3.14.0 - 2025-09-13¶
Added¶
- Comprehensive DPI configuration system for OCR processing with fine-grained control over resolution settings
Changed¶
- Enhanced API with 1GB upload limit and comprehensive OpenAPI documentation
- Completed pandas to polars migration across entire codebase for improved performance and memory efficiency
Fixed¶
- Improved lcov coverage combining robustness in CI pipeline
- DPI configuration tests moved to proper test directory for correct CI coverage calculation
- Pre-commit formatting issues resolved
- PDF content handling differences between CI and local tests
- PaddleOCR test isolation and fixture path corrections
3.13.0 - 2025-09-04¶
Added¶
- Runtime configuration API with query parameters and header support for flexible request handling
- Comprehensive runtime configuration documentation with practical examples
- OCR caching system for EasyOCR and PaddleOCR backends to improve performance
Changed¶
- Replaced pandas with polars for table extraction (significant performance improvement)
- Consolidated benchmarks CLI into unified interface with improved structure
- Converted all class-based tests to function-based tests for consistency
- Restructured benchmarks package layout for better organization
- Removed all module-level docstrings from Python files for cleaner codebase
Fixed¶
- MyPy errors and type annotations throughout codebase
- Tesseract TSV output format and table extraction implementation
- UTF-8 encoding handling across document processing pipeline
- Config loading with proper error handling and validation
- HTML-to-Markdown configuration now externalized for flexibility
- All ruff and mypy linting errors resolved
- Test failures in CI environment
- Regression in PDF extraction and XLS file handling
- Subprocess error analysis for CalledProcessError handling
- Empty DataFrames in GMFT table extraction (pandas.errors.EmptyDataError prevention)
Performance¶
- Optimized PDF processing and HTML conversion performance
- Improved TSV integration test quality with concrete assertions
3.12.0 - 2025-08-30¶
Added¶
- Docker E2E testing infrastructure with multi-image matrix strategy
- Multilingual OCR support to Docker images with flexible backend selection
- Docker documentation with clarity on image contents, sizes, and OCR backend selection
Changed¶
- Simplified Docker images to base and core variants (removed intermediate images)
- Docker strategy updated to 2-image approach for optimized distribution
- Docker E2E tests aligned with CI patterns using matrix strategy for parallel execution
- Improved Docker workflow with manual trigger options for flexibility
Fixed¶
- Docker image naming conventions (kreuzberg-core:latest)
- Boolean condition syntax in Docker workflow
- Docker permissions and documentation accuracy
- Docker E2E test failure detection (ensures proper test exit code propagation)
- Docker workflow disk space management and optimization
- Grep exit code handling in Docker test workflows
- Naming conflict in CLI config command
- grep exit code failures in shell scripts
- EasyOCR test image selection (use text-containing image instead of flower)
- Test fixture image file naming for clarity
Infrastructure¶
- Enhanced disk space cleanup in Docker E2E workflow
- Improved Docker workflow disk space management
- Updated test-docker-builds.yml for new image naming scheme
3.11.1 - 2025-08-13¶
Fixed¶
- EasyOCR device-related parameters removed from readtext() calls
- Numpy import optimization: only import numpy inside process_image_sync rather than at module level to improve startup time
- Table attribute access in documentation
- Pre-commit formatting for table documentation
Infrastructure¶
- Updated uv.lock after version bump
- AI-rulez pre-commit hooks compatibility with Go setup in CI workflow
- Temporary ai-rulez hooks disabled to unblock CI
- Reverted to ai-ruyles v1.4.3 with Go setup for CI stability
- Dependency updates including amannn/action-semantic-pull-request v5→v6 and actions/checkout v4→v5
3.11.0 - 2025-08-01¶
Added¶
- Comprehensive tests for Tesseract OCR backend with edge case coverage
- Comprehensive tests for API module with full functionality coverage
- Comprehensive tests for config and image extractor modules
- Comprehensive Click-based tests for CLI module with proper command testing
- Comprehensive tests for structured data extractor module
- Comprehensive tests for extraction and document classification modules
- Comprehensive tests for pandoc extractor module
- Comprehensive tests for email extractor module
- Comprehensive tests for _config.py module
- Comprehensive tests for _utils._errors module with error handling validation
- Comprehensive tests for spreadsheet metadata extraction
- Retry mechanisms for CI test reliability to handle transient failures
Changed¶
- Implemented Python 3.10+ syntax optimizations (match/case, union types where applicable)
- Merged comprehensive test files with regular test files for consolidated coverage
- Optimized pragma no cover usage to reduce false negatives
- Converted test classes to functional tests for better pytest compatibility
- Updated coverage requirements documentation from 95% to 85%
Fixed¶
- Linting issues in merged comprehensive test files
- Image extractor async delegation tests
- CI test failures with improved retry mechanisms
- Timezone assertion in spreadsheet metadata test
- MyPy errors in test files and config calls
- ExceptionGroup import errors for Python 3.10+ compatibility
- Syntax errors in test files
- Bug in _parse_date_string with proper test mocks
- Pre-commit formatting issues
- DeepSource coverage reporting with restructured CI workflow
Infrastructure¶
- Updated uv.lock to revision 3 to fix CI validation issues
- Fixed CI workflow structure for improved DeepSource coverage reporting
- Coverage upload now only triggers on push events (not pull requests)
- Version bump to 3.10.1 with lock file updates
3.10.0 - 2025-07-29¶
Added¶
- PDF password support through new crypto extra feature
Documentation¶
- Updated quick start section to use available configuration options
3.9.0 - 2025-07-17¶
Added¶
- Initial release of v3.9.0 series with foundational features
3.8.0 - 2025-07-16¶
Added¶
- Foundation for v3.8.0 release
3.7.0 - 2025-07-11¶
Added¶
- MCP server for AI integration enabling Claude integration with document extraction capabilities
- MCP server configuration and optional dependencies support
Fixed¶
- Chunk parameters adjusted to prevent overlap validation errors
- MCP server linting issues resolved
- HTML test compatibility with html-to-markdown v1.6.0 behavior
Changed¶
- MCP server tests converted to function-based format
- MCP server configuration documentation clarified
3.6.0 - 2025-07-04¶
Added¶
- Language detection functionality integrated into extraction pipeline
Fixed¶
- Entity extraction migration from gliner to spaCy completed
- Docker workflow reliability improved with updated ai-rulez configuration
- Optional imports refactored to use classes for better type handling
- Validation and post-process logic abstracted into helper functions for sync and async paths
Changed¶
- spaCy now used for entity extraction replacing gliner
- Optional imports moved inside functions for proper error handling
- Entity and keyword extraction validation/post-processing refactored
3.5.0 - 2025-07-04¶
Added¶
- Language detection functionality with configurable backends
- Performance optimization guidelines and documentation
- Full synchronous support for PaddleOCR and EasyOCR backends
- Docker documentation and Python 3.10+ compatibility
- Benchmarks submodule integration
Fixed¶
- Flaky OCR tests marked as xfail in CI environments
- Typing errors resolved across codebase
- Chunking default configuration reverted to stable settings
- PaddleOCR sync implementation updated to correct v3.x API format
- Python 3.9 support dropped and version bumped to v3.4.2
- Docker workflow version detection and dependencies fixed
- Docs workflow configuration and Docker manual dispatch enabled
- CI documentation dependencies resolved
Changed¶
- Default configurations optimized for modern document extraction
- Tesseract OCR configuration optimized based on benchmark analysis
- Documentation link moved to top of README
- Python 3.10+ now required (3.9 support dropped)
3.4.0 - 2025-07-03¶
Added¶
- API support with Litestar framework for web-based document extraction
- API tests aligned with best practices
- EasyOCR and GMFT Docker build variants
- Docker support with comprehensive CI integration
- API and Docker documentation
Fixed¶
- 'all' extra dependency configuration simplified
- Docker workflow extras configuration corrected
- Race condition in GMFT caching tests resolved
- Race condition in Tesseract caching tests resolved
- Docker Hub repository configuration fixed
- CLI integration test using tmp_path fixture for isolation
- gmft module mypy type errors resolved
Changed¶
- Docker module migrated to api module with Litestar framework
- Docker workflow matrix simplified
- Performance documentation enhanced with competitive benchmarks
3.3.0 - 2025-06-23¶
Added¶
- Isolated process wrapper for GMFT table extraction
- Comprehensive statistical benchmarks proving msgpack superiority
- Comprehensive CLI support with Click framework
- Comprehensive benchmarking suite for sync vs async performance
- Pure synchronous extractors without anyio dependencies
- Document-level caching with simplified concurrency handling
- Per-file locks and parallel batch processing for sync performance
- Python version matrix CI testing (3.9-3.13)
- Thread lock for pypdfium2 to prevent macOS segfaults
Fixed¶
- All remaining mypy and pydoclint issues ensuring CI passes
- Python 3.13 mock compatibility issues in process pool tests
- ExceptionGroup compatibility in sync tests
- GMFT config updated and test failures resolved
- All Windows-specific test failures in multiprocessing and utils modules
- Windows compatibility issues resolved
- Test coverage improvements from 76% to 83%
- Rust formatting issues with edition 2024 let-chain syntax
- File existence validation added to extraction functions
- Python linting issues (ruff complexity, mypy type parameters) resolved
- All clippy warnings resolved
- C# generated file formatting fixed
Changed¶
- msgspec JSON replaced with msgpack for 5x faster cache serialization
- Caching migrated from msgspec JSON to msgpack for improved performance
- Async performance optimized with semaphore-based concurrency control
- Sync performance optimized with per-file locks and parallel batch processing
- Code formatting cleanup and non-essential comments removed
- Benchmark workflows removed from CI
- Python version cache key updated to include full version
3.2.0 - 2025-06-23¶
Added¶
- GPU acceleration support for enhanced OCR and ML operations
Fixed¶
- EasyOCR byte string issues resolved
- Pandoc version issues fixed
- PaddleOCR configuration updated for optimal performance
- Multiple language support added to EasyOCR
Changed¶
- playa-pdf pinned to version 0.4.3 for stability
- PaddleOCR configuration optimized
- Dependencies updated to latest compatible versions
- Pandoc version string resolution improved
3.1.0 - 2025-03-28¶
Added¶
- GMFT (Give Me Formatted Tables) support for vision-based table extraction
Fixed¶
- Bundling issue corrected
- Wrong link in README fixed
- Dependencies updated and GMFT testing issues resolved
Changed¶
- Image extraction now non-optional in results
- Test imports and structure updated
- Concurrency test added for validation
3.0.0 - 2025-03-23¶
Added¶
- Chunking functionality for document segmentation
- Extractor registry for managing format-specific extractors
- Hooks system for pre/post-processing
- OCR backend abstraction with EasyOCR and PaddleOCR support
- OCR configuration system
- Multiple language support in OCR backends
- Comprehensive documentation
Fixed¶
- Pre-commit hooks configuration
- Windows error message handling
- PaddleOCR integration issues
- Dependencies updated to stable versions
Changed¶
- Structure refactored for improved organization
- Metadata extraction approach updated
- OCR integration with configurable backends
- Documentation added and improved
See Also¶
- Configuration Reference - Detailed configuration options
- Migration Guide - v3 to v4 migration instructions
- Format Support - Supported file formats
- Extraction Guide - Extraction examples