Audio and Video Transcription
The transcription Cargo feature adds speech-to-text extraction for audio and
video MIME types via Whisper ONNX models v5.0. Enable the feature and set a
TranscriptionConfig block in your ExtractionConfig to produce transcripts
from audio and video files.
Supported MIME types
| MIME type | Extensions | Container |
|---|---|---|
| `audio/mpeg` | `.mp3`, `.mpga` | MP3 |
| `audio/mp4` | `.m4a` | M4A / AAC in MP4 |
| `audio/wav` | `.wav` | WAV / RIFF |
| `audio/webm` | `.webm` | WebM audio |
| `video/mp4` | `.mp4`, `.mpeg` | MP4 video (audio track only) |
| `video/webm` | `.webm` | WebM video (audio track only) |
Model sizes
| Variant | Cache footprint | RAM at inference | Mel bins |
|---|---|---|---|
| `Tiny` | Smallest | Lowest memory | 80 |
| `Base` | Small | Low memory | 80 |
| `Small` | Medium | Medium memory | 80 |
| `Medium` | Large | High memory | 80 |
| `LargeV3` | Largest | Highest memory | 128 |
Models are downloaded from onnx-community/whisper-{size} on HuggingFace Hub
on first use and cached under {KREUZBERG_CACHE_DIR}/whisper/{size}/ when
KREUZBERG_CACHE_DIR is set, or under the platform cache directory such as
~/.cache/kreuzberg/whisper/{size}/ on Linux.
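
The lookup order above can be sketched as a small pure function. This is only an illustration, not kreuzberg's actual code: `whisper_cache_dir` and its parameters are hypothetical, the fallback models only the Linux `~/.cache` case mentioned above, and the `size` strings used with it are placeholders.

```rust
use std::path::PathBuf;

/// Hypothetical helper (not part of kreuzberg): resolves the per-model
/// cache directory from the rules described above.
///
/// `cache_dir_env` stands in for the value of KREUZBERG_CACHE_DIR, if set;
/// `home` stands in for the user's home directory on Linux.
fn whisper_cache_dir(cache_dir_env: Option<&str>, home: &str, size: &str) -> PathBuf {
    let base = match cache_dir_env {
        // An explicit KREUZBERG_CACHE_DIR takes precedence.
        Some(dir) => PathBuf::from(dir),
        // Otherwise fall back to the Linux-style platform cache directory.
        None => PathBuf::from(home).join(".cache").join("kreuzberg"),
    };
    base.join("whisper").join(size)
}
```

Taking the environment value as a parameter, rather than reading it inside the function, keeps the sketch deterministic and easy to test.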
Configuration knobs
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | `bool` | `true` | The extractor activates only when the `transcription` block is present and `enabled` is `true`. |
| `model` | `WhisperModel` | `Tiny` | Size variant to use. |
| `language` | `Option<String>` | `None` | ISO-639-1 code (e.g. `"en"`, `"de"`). The current engine falls back to English when unset; set this explicitly for deterministic output. |
| `timestamps` | `bool` | `false` | Accepted for forward-compatibility; segment timestamps are not yet emitted. |
| `max_bytes` | `Option<u64>` | 512 MiB | Reject input larger than this many bytes before decoding. |
| `max_duration_ms` | `Option<u64>` | 30 min | Reject audio longer than this many milliseconds after decode. |
| `timeout_ms` | `Option<u64>` | 10 min | Reserved wall-clock timeout for the full inference call. The current extractor does not enforce it yet. |
| `model_cache_dir` | `Option<PathBuf>` | `None` | Override the default cache location. |
| `allow_network` | `bool` | `true` | Set to `false` to disable automatic downloads; returns `ModelMissing` if the model is not already cached. |
| `verify_hash` | `bool` | `true` | Hash verification is reserved for a future work item; currently a no-op with a warning. |
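
To make the two size guards concrete, here is a minimal sketch of how a pre-decode byte limit and a post-decode duration limit could be checked. `LimitError`, `check_size`, and `check_duration` are hypothetical names for illustration, and treating `None` as "no limit" is an assumption; the table does not say how an unset limit behaves.

```rust
/// Hypothetical error type for this sketch; not kreuzberg's real error enum.
#[derive(Debug, PartialEq)]
enum LimitError {
    TooManyBytes { len: u64, max: u64 },
    TooLong { duration_ms: u64, max: u64 },
}

/// Runs before decoding: rejects input larger than `max_bytes`.
/// Assumption: `None` disables the check.
fn check_size(len: u64, max_bytes: Option<u64>) -> Result<(), LimitError> {
    match max_bytes {
        Some(max) if len > max => Err(LimitError::TooManyBytes { len, max }),
        _ => Ok(()),
    }
}

/// Runs after decoding: derives the duration from the sample count and
/// sample rate, then rejects audio longer than `max_duration_ms`.
fn check_duration(
    samples: u64,
    sample_rate: u64,
    max_duration_ms: Option<u64>,
) -> Result<(), LimitError> {
    let duration_ms = samples * 1000 / sample_rate;
    match max_duration_ms {
        Some(max) if duration_ms > max => Err(LimitError::TooLong { duration_ms, max }),
        _ => Ok(()),
    }
}
```

With the defaults from the table, the limits are 512 MiB (536,870,912 bytes) and 30 minutes (1,800,000 ms). The byte check can run on the raw input, while the duration check needs the decoded sample count.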
First-run download
On the first call with allow_network = true, the extractor downloads the
required ONNX files and tokenizer from HuggingFace Hub. The download is
serialised across processes by an advisory file lock, so concurrent
first-time callers do not race. Subsequent calls use the local cache.
Set allow_network = false and pre-populate the cache directory if you need
air-gapped deployments. When the model is absent and allow_network = false,
extraction returns a KreuzbergError::Transcription with the message
"network access disabled and model not cached".
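
The decision described in this section can be sketched as a small function. `ModelSource` and `resolve_model` are hypothetical names, and a plain `String` stands in for the `KreuzbergError::Transcription` value the real extractor returns; the file locking and the download itself are omitted.

```rust
/// Hypothetical outcome of a model lookup; illustrative only.
#[derive(Debug, PartialEq)]
enum ModelSource {
    /// The model files are already in the local cache.
    Cached,
    /// The model must be fetched from HuggingFace Hub first.
    Download,
}

/// Decides where the model comes from, following the rules above.
fn resolve_model(is_cached: bool, allow_network: bool) -> Result<ModelSource, String> {
    if is_cached {
        // A cached model is used whether or not network access is allowed.
        Ok(ModelSource::Cached)
    } else if allow_network {
        Ok(ModelSource::Download)
    } else {
        // Air-gapped configuration without a pre-populated cache.
        Err("network access disabled and model not cached".to_string())
    }
}
```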
Usage
Add the feature to Cargo.toml:
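
A dependency entry along these lines should enable it. The feature name `transcription` comes from this page; the version requirement is an assumption based on the v5.0 mention above, so pin whichever release you actually use.

```toml
[dependencies]
# Version is an assumption; replace it with the release you depend on.
kreuzberg = { version = "5", features = ["transcription"] }
```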
Async

```rust
use kreuzberg::extract_bytes;
use kreuzberg::core::config::ExtractionConfig;
use kreuzberg::core::config::transcription::{TranscriptionConfig, WhisperModel};

// Run this inside an async function that returns a Result,
// since the snippet uses both `.await` and `?`.
let config = ExtractionConfig {
    transcription: Some(TranscriptionConfig {
        enabled: true,
        model: WhisperModel::Tiny,
        // Set the language explicitly for deterministic output.
        language: Some("en".to_string()),
        ..Default::default()
    }),
    ..Default::default()
};

let bytes = std::fs::read("recording.wav")?;
let result = extract_bytes(&bytes, "audio/wav", &config).await?;
println!("{}", result.content); // transcript
```
Sync

```rust
use kreuzberg::extract_bytes_sync;
use kreuzberg::core::config::ExtractionConfig;
use kreuzberg::core::config::transcription::{TranscriptionConfig, WhisperModel};

// Run this inside a function that returns a Result, since the snippet uses `?`.
let config = ExtractionConfig {
    transcription: Some(TranscriptionConfig {
        enabled: true,
        model: WhisperModel::Tiny,
        // `language` is left unset, so the current engine falls back to English.
        ..Default::default()
    }),
    ..Default::default()
};

let bytes = std::fs::read("recording.mp3")?;
let result = extract_bytes_sync(&bytes, "audio/mpeg", &config)?;
println!("{}", result.content); // transcript
```
Notes
- Audio longer than 30 seconds is split into 30-second chunks; each chunk is transcribed independently and the results are joined with a space.
- The extractor always resamples to 16 kHz mono before inference; source sample rate and channel layout are handled automatically.
- Engine instances are cached per process keyed by model paths, so the ONNX sessions are loaded once and reused across calls.
- Async inference calls are bounded by a semaphore sized to resolve_thread_budget, the same limit used by the embedding and reranking pipelines. Sync calls run on the caller thread.
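
The chunking behaviour from the first note can be sketched as follows. This is a minimal illustration, not the extractor's implementation: `transcribe_chunked` is a hypothetical helper, the input is assumed to be already resampled to 16 kHz mono, and a closure stands in for the Whisper inference call. How the real engine handles a short final chunk (for example by padding it) is not covered here.

```rust
/// Sample rate and chunk length taken from the notes above.
const SAMPLE_RATE: usize = 16_000;
const CHUNK_SECONDS: usize = 30;
/// 30 seconds at 16 kHz is 480,000 samples.
const CHUNK_SAMPLES: usize = SAMPLE_RATE * CHUNK_SECONDS;

/// Hypothetical helper: splits mono 16 kHz samples into 30-second chunks,
/// transcribes each chunk independently with `transcribe`, and joins the
/// per-chunk results with a single space.
fn transcribe_chunked<F>(samples: &[f32], mut transcribe: F) -> String
where
    F: FnMut(&[f32]) -> String,
{
    samples
        .chunks(CHUNK_SAMPLES) // the last chunk may be shorter than 30 s
        .map(|chunk| transcribe(chunk))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Because each chunk is transcribed on its own, a word that straddles a 30-second boundary may be cut in two, which is worth keeping in mind when checking transcripts of long recordings.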