Audio and Video Transcription

The transcription Cargo feature, available in kreuzberg v5.0, adds speech-to-text extraction for audio and video MIME types via Whisper ONNX models. Enable the feature and set a TranscriptionConfig block in your ExtractionConfig to produce transcripts from these files.

Supported MIME types

MIME type	Extensions	Container
`audio/mpeg`	`.mp3`, `.mpga`	MP3
`audio/mp4`	`.m4a`	M4A / AAC in MP4
`audio/wav`	`.wav`	WAV / RIFF
`audio/webm`	`.webm`	WebM audio
`video/mp4`	`.mp4`, `.mpeg`	MP4 video (audio track only)
`video/webm`	`.webm`	WebM video (audio track only)
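
Because the extraction functions take the MIME type as an argument, callers that start from a file path need a lookup like the table above. The following is a minimal sketch of such a lookup; `mime_for_path` is a hypothetical helper, not part of kreuzberg, and its handling of `.webm` is an assumption, since the table lists that extension for both `audio/webm` and `video/webm`.

```rust
use std::path::Path;

/// Map a file extension to one of the MIME types in the table above.
/// Hypothetical helper for illustration; not a kreuzberg API.
fn mime_for_path(path: &Path) -> Option<&'static str> {
    // Compare extensions case-insensitively ("talk.MP3" and "talk.mp3" match).
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "mp3" | "mpga" => Some("audio/mpeg"),
        "m4a" => Some("audio/mp4"),
        "wav" => Some("audio/wav"),
        // `.webm` appears for both audio/webm and video/webm. This sketch
        // arbitrarily picks video/webm; pass audio/webm yourself when you
        // know the file is audio-only.
        "webm" => Some("video/webm"),
        "mp4" | "mpeg" => Some("video/mp4"),
        // Anything else is outside the table.
        _ => None,
    }
}
```

For example, `mime_for_path(Path::new("recording.wav"))` yields `Some("audio/wav")`, which can be passed straight to the extraction call.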

Model sizes

Variant	Cache footprint	RAM at inference	Mel bins
`Tiny`	Smallest	Lowest memory	80
`Base`	Small	Low memory	80
`Small`	Medium	Medium memory	80
`Medium`	Large	High memory	80
`LargeV3`	Largest	Highest memory	128

Models are downloaded from onnx-community/whisper-{size} on HuggingFace Hub on first use and cached under {KREUZBERG_CACHE_DIR}/whisper/{size}/ when KREUZBERG_CACHE_DIR is set, or under the platform cache directory such as ~/.cache/kreuzberg/whisper/{size}/ on Linux.
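
To pre-populate or inspect the cache, it helps to compute the directory a given model would use. The sketch below mirrors the two layouts described above. `whisper_cache_dir` is a hypothetical helper, the lowercase size strings are an assumption about how `{size}` is spelled, and treating an empty override as unset is also an assumption.

```rust
use std::path::{Path, PathBuf};

/// Resolve the model cache directory following the layout described above.
/// `cache_override` stands in for the value of KREUZBERG_CACHE_DIR and
/// `platform_cache` for the platform cache root (e.g. ~/.cache on Linux).
/// Hypothetical helper for illustration; not a kreuzberg API.
fn whisper_cache_dir(cache_override: Option<&str>, platform_cache: &Path, size: &str) -> PathBuf {
    match cache_override {
        // {KREUZBERG_CACHE_DIR}/whisper/{size}/
        // Assumption: an empty variable is treated the same as an unset one.
        Some(dir) if !dir.is_empty() => Path::new(dir).join("whisper").join(size),
        // {platform cache}/kreuzberg/whisper/{size}/
        _ => platform_cache.join("kreuzberg").join("whisper").join(size),
    }
}
```

In a real program the override would typically come from `std::env::var("KREUZBERG_CACHE_DIR").ok()`, passed in with `.as_deref()`.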

Configuration knobs

Field	Type	Default	Description
`enabled`	`bool`	`true`	The extractor activates only when the `transcription` block is present and `enabled` is true.
`model`	`WhisperModel`	`Tiny`	Size variant to use.
`language`	`Option<String>`	`None`	ISO 639-1 code (e.g. `"en"`, `"de"`). The current engine falls back to English when unset; set this explicitly for deterministic output.
`timestamps`	`bool`	`false`	Accepted for forward-compatibility; segment timestamps are not yet emitted.
`max_bytes`	`Option<u64>`	`512 MiB`	Reject input larger than this many bytes before decoding.
`max_duration_ms`	`Option<u64>`	`30 min`	Reject audio longer than this many milliseconds after decode.
`timeout_ms`	`Option<u64>`	`10 min`	Reserved wall-clock timeout for the full inference call. The current extractor does not enforce it yet.
`model_cache_dir`	`Option<PathBuf>`	`None`	Override the default cache location.
`allow_network`	`bool`	`true`	Set to `false` to disable automatic downloads; returns `ModelMissing` if the model is not already cached.
`verify_hash`	`bool`	`true`	Hash verification is reserved for a future work item; currently a no-op with a warning.
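
The two enforced limits are expressed in bytes and milliseconds, so the defaults in the table translate to concrete numbers. The sketch below works through that arithmetic and shows the order of the checks (size before decoding, duration after). The `LimitError` type and both functions are hypothetical, and treating `None` as "no limit" is an assumption rather than documented behaviour.

```rust
/// Defaults from the table above, in the units the fields use.
const DEFAULT_MAX_BYTES: u64 = 512 * 1024 * 1024; // 512 MiB = 536_870_912 bytes
const DEFAULT_MAX_DURATION_MS: u64 = 30 * 60 * 1000; // 30 min = 1_800_000 ms

/// Hypothetical rejection reasons; kreuzberg's real error types may differ.
#[derive(Debug, PartialEq)]
enum LimitError {
    TooManyBytes { got: u64, max: u64 },
    TooLong { got_ms: u64, max_ms: u64 },
}

/// Size check, run before decoding.
/// Assumption: `None` disables the limit.
fn check_size(len: u64, max_bytes: Option<u64>) -> Result<(), LimitError> {
    match max_bytes {
        Some(max) if len > max => Err(LimitError::TooManyBytes { got: len, max }),
        _ => Ok(()),
    }
}

/// Duration check, run after decoding, computed from the decoded sample
/// count (per channel) and the sample rate.
fn check_duration(samples: u64, sample_rate_hz: u32, max_duration_ms: Option<u64>) -> Result<(), LimitError> {
    // Guard against a zero sample rate so the division cannot panic.
    let rate = u64::from(sample_rate_hz.max(1));
    let got_ms = samples.saturating_mul(1000) / rate;
    match max_duration_ms {
        Some(max_ms) if got_ms > max_ms => Err(LimitError::TooLong { got_ms, max_ms }),
        _ => Ok(()),
    }
}
```

Both checks reject only inputs strictly above the limit, matching the "larger than" and "longer than" wording in the table.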

First-run download

On the first call with allow_network = true, the extractor downloads the required ONNX files and tokenizer from HuggingFace Hub. A cross-process advisory file lock serialises the download, so concurrent first-time callers do not race. Subsequent calls use the local cache.

For air-gapped deployments, set allow_network = false and pre-populate the cache directory. When the model is absent and allow_network = false, extraction returns a KreuzbergError::Transcription with the message "network access disabled and model not cached".
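
The interaction between the cache and allow_network reduces to a small decision table. The sketch below spells it out. `ModelAction` and `resolve_model` are hypothetical names used only for illustration, and the `cached` flag stands in for whatever check the extractor performs on the cache directory.

```rust
/// What happens for a given cache and network state.
/// Hypothetical type for illustration; not a kreuzberg API.
#[derive(Debug, PartialEq)]
enum ModelAction {
    /// The model files are already on disk; no network access is needed.
    UseCache,
    /// The model is missing and downloads are allowed.
    Download,
}

/// Decide how to obtain the model. `cached` would come from checking the
/// cache directory for the ONNX files and tokenizer.
fn resolve_model(cached: bool, allow_network: bool) -> Result<ModelAction, &'static str> {
    match (cached, allow_network) {
        // A populated cache wins regardless of the network setting.
        (true, _) => Ok(ModelAction::UseCache),
        (false, true) => Ok(ModelAction::Download),
        // Error message quoted from the paragraph above.
        (false, false) => Err("network access disabled and model not cached"),
    }
}
```

The only failing combination is an empty cache with downloads disabled, which is why air-gapped deployments must ship the model files ahead of time.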

Usage

Add the feature to Cargo.toml:

kreuzberg = { version = "5", features = ["transcription"] }

Async

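use kreuzberg::extract_bytes;
use kreuzberg::core::config::ExtractionConfig;
use kreuzberg::core::config::transcription::{TranscriptionConfig, WhisperModel};

// Assumes a Tokio runtime with the `macros` feature; any async runtime works.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Enable transcription with the smallest model and an explicit language.
    let config = ExtractionConfig {
        transcription: Some(TranscriptionConfig {
            enabled: true,
            model: WhisperModel::Tiny,
            language: Some("en".to_string()),
            ..Default::default()
        }),
        ..Default::default()
    };

    // The MIME type must match the file contents (see the table above).
    let bytes = std::fs::read("recording.wav")?;
    let result = extract_bytes(&bytes, "audio/wav", &config).await?;
    println!("{}", result.content); // transcript
    Ok(())
}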

Sync

use kreuzberg::extract_bytes_sync;
use kreuzberg::core::config::ExtractionConfig;
use kreuzberg::core::config::transcription::{TranscriptionConfig, WhisperModel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // `language` is left unset here, so the engine falls back to English.
    let config = ExtractionConfig {
        transcription: Some(TranscriptionConfig {
            enabled: true,
            model: WhisperModel::Tiny,
            ..Default::default()
        }),
        ..Default::default()
    };

    // Inference runs on the calling thread.
    let bytes = std::fs::read("recording.mp3")?;
    let result = extract_bytes_sync(&bytes, "audio/mpeg", &config)?;
    println!("{}", result.content); // transcript
    Ok(())
}

Notes

Audio longer than 30 seconds is split into 30-second chunks; each chunk is transcribed independently and the results are joined with a space.
The extractor always resamples to 16 kHz mono before inference; source sample rate and channel layout are handled automatically.
Engine instances are cached per process keyed by model paths, so the ONNX sessions are loaded once and reused across calls.
Async inference calls are bounded by a semaphore sized to resolve_thread_budget, the same limit used by the embedding and reranking pipelines. Sync calls run on the caller thread.
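
The chunking behaviour in the first two notes can be sketched as follows. At 16 kHz mono, a 30-second chunk is 480,000 samples. `transcribe_in_chunks` is a hypothetical function, and the `transcribe_chunk` closure is a stand-in for the real Whisper inference call. Any padding or trimming the engine applies to the final, shorter chunk is not shown.

```rust
const SAMPLE_RATE_HZ: usize = 16_000; // audio is resampled to 16 kHz mono
const CHUNK_SECONDS: usize = 30;
const CHUNK_SAMPLES: usize = SAMPLE_RATE_HZ * CHUNK_SECONDS; // 480_000 samples

/// Transcribe 16 kHz mono samples chunk by chunk and join the results with
/// a single space. `transcribe_chunk` stands in for the inference call.
/// Hypothetical function for illustration; not a kreuzberg API.
fn transcribe_in_chunks<F>(samples: &[f32], transcribe_chunk: F) -> String
where
    F: FnMut(&[f32]) -> String,
{
    samples
        // Every chunk has CHUNK_SAMPLES samples except possibly the last.
        .chunks(CHUNK_SAMPLES)
        // Each chunk is transcribed independently of its neighbours.
        .map(transcribe_chunk)
        .collect::<Vec<_>>()
        .join(" ")
}
```

Because chunks are transcribed independently, a word that straddles a 30-second boundary may be split or dropped, which is worth keeping in mind when comparing transcripts of long recordings.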
