Reranking¶

New in v5.0

Cross-encoder reranking is a new feature for query-time document reordering.

Reranking takes a query and a list of candidate documents, scores each (query, document) pair with a single model, and reorders the candidates by that score. Unlike vector similarity, which embeds the query and the documents independently, a reranking model reads the query and the document together. This typically yields better ranking quality at the cost of higher latency.

Bi-encoders vs cross-encoders¶

Bi-encoders (the embedding models used for vector similarity) encode the query and each document independently, then compare the resulting vectors via dot product or cosine similarity. They are fast and embarrassingly parallel, and document embeddings can be computed ahead of time, which makes them well suited to first-pass retrieval over millions of documents.

Cross-encoders feed each (query, document) pair through a transformer that scores the two texts together. The query and document attend to each other across every layer, which generally produces more accurate relevance scores than comparing two independent embeddings. The trade-off is computational cost: every candidate document requires a separate forward pass at query time.
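The structural difference between the two approaches can be sketched in a few lines of Rust. This is an illustration only: `cosine` is the standard cosine similarity, while the `PairScorer` trait and the `OverlapScorer` toy are hypothetical stand-ins and not part of this library's API. A real cross-encoder would run a transformer forward pass inside `score`.

```rust
/// Bi-encoder style: the two vectors were produced independently,
/// so comparing them is a cheap arithmetic operation.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // avoid dividing by zero for empty or all-zero vectors
    }
    dot / (norm_a * norm_b)
}

/// Cross-encoder style: the scorer receives the query and the document
/// together. Hypothetical trait, for illustration only.
trait PairScorer {
    fn score(&self, query: &str, document: &str) -> f32;
}

/// Toy stand-in for a model: the fraction of query tokens that also
/// appear in the document. It only demonstrates the calling shape.
struct OverlapScorer;

impl PairScorer for OverlapScorer {
    fn score(&self, query: &str, document: &str) -> f32 {
        let query_lower = query.to_lowercase();
        let doc_lower = document.to_lowercase();
        let query_tokens: Vec<&str> = query_lower.split_whitespace().collect();
        let doc_tokens: Vec<&str> = doc_lower.split_whitespace().collect();
        if query_tokens.is_empty() {
            return 0.0;
        }
        let hits = query_tokens
            .iter()
            .filter(|token| doc_tokens.contains(*token))
            .count();
        hits as f32 / query_tokens.len() as f32
    }
}
```

Note where the work happens: `cosine` never sees text, so document vectors can be stored and reused, whereas `score` needs both texts and must run once per candidate for every query.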

When to use it¶

Use reranking as the second pass in a retrieval pipeline:

1. Retrieve a candidate set (e.g. top-100) cheaply via vector similarity or BM25.
2. Rerank that set with a cross-encoder.
3. Pass the top-k reranked documents into your LLM context.

This pattern preserves the recall of the cheap first-pass retriever while sharpening the precision of what reaches the model.
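The three steps can be sketched as one function. This is a minimal sketch under stated assumptions: `retrieve_then_rerank`, `top_n`, and the closure-based `rerank_score` parameter are hypothetical names chosen for illustration, not this library's API, and the first-pass scores are assumed to have been computed already by vector similarity or BM25.

```rust
/// Sort scored items from highest to lowest score and keep the first `n`.
fn top_n(mut items: Vec<(String, f32)>, n: usize) -> Vec<(String, f32)> {
    items.sort_by(|a, b| b.1.total_cmp(&a.1));
    items.truncate(n);
    items
}

/// Two-stage retrieval: cut the corpus down with cheap first-pass scores,
/// rescore the survivors with a pairwise scorer, and return the best `top_k`.
fn retrieve_then_rerank<F>(
    query: &str,
    first_pass: Vec<(String, f32)>, // (document, first-pass score)
    retrieve_n: usize,
    top_k: usize,
    rerank_score: F, // hypothetical stand-in for a cross-encoder
) -> Vec<String>
where
    F: Fn(&str, &str) -> f32,
{
    // Step 1: keep only the best `retrieve_n` candidates from the cheap pass.
    let candidates = top_n(first_pass, retrieve_n);

    // Step 2: one scorer call per surviving candidate.
    let reranked: Vec<(String, f32)> = candidates
        .into_iter()
        .map(|(doc, _)| {
            let score = rerank_score(query, &doc);
            (doc, score)
        })
        .collect();

    // Step 3: the top-k documents that would go into the LLM context.
    top_n(reranked, top_k).into_iter().map(|(doc, _)| doc).collect()
}
```

The reranker is called at most `retrieve_n` times per query, so that parameter bounds the latency cost. A document dropped in step 1 cannot be recovered in step 2, which is why the candidate set is usually much larger than `top_k`.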

Backend variants¶

Four backend variants are supported, mirroring the embedding API:

| Variant | Source | Best for |
| --- | --- | --- |
| Preset | Bundled ONNX cross-encoders, downloaded from HuggingFace on first use | Production RAG, standard cross-encoder needs |
| Custom | Any ONNX cross-encoder from HuggingFace | Tuned models, niche domains |
| Llm | Provider-hosted rerankers (Cohere, Jina, Voyage) via `liter-llm` | Managed APIs, no local model |
| Plugin | Caller-supplied backend registered via `register_reranker_backend` | sentence-transformers, in-process tuned models |

The Preset and Custom variants require the `reranker` Cargo feature, which depends on ONNX Runtime. The Llm variant requires the `liter-llm` feature. Plugin works on every target, including WASM.
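The feature requirements above can be summarized as a small lookup. The enum below is a hypothetical mirror of the four variants written for illustration; it is not the library's actual configuration type. It also assumes that Plugin needs no extra Cargo feature, which the text implies but does not state outright.

```rust
/// Hypothetical mirror of the four backend variants (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RerankerVariant {
    Preset,
    Custom,
    Llm,
    Plugin,
}

impl RerankerVariant {
    /// Cargo feature the variant depends on, following the paragraph above.
    /// `None` for Plugin is an assumption based on "works on every target".
    fn required_feature(self) -> Option<&'static str> {
        match self {
            Self::Preset | Self::Custom => Some("reranker"),
            Self::Llm => Some("liter-llm"),
            Self::Plugin => None,
        }
    }

    /// Whether the variant can be used given the features a build enabled.
    fn is_available(self, enabled_features: &[&str]) -> bool {
        match self.required_feature() {
            Some(feature) => enabled_features.contains(&feature),
            None => true,
        }
    }
}
```

For example, a WASM build that cannot link ONNX Runtime would enable neither feature, and under these assumptions only the Plugin variant would report as available.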

Presets¶

Four cross-encoder presets are bundled:

| Name | Model | Size | Languages | Max length (tokens) |
| --- | --- | --- | --- | --- |
| `fast` | `Xenova/ms-marco-MiniLM-L-6-v2` | 22M params (quantized) | English | 512 |
| `balanced` | `Xenova/bge-reranker-base` | 278M params | English, Chinese | 512 |
| `quality` | `Xenova/bge-reranker-large` | 560M params | English, Chinese | 512 |
| `multilingual` | `BAAI/bge-reranker-v2-m3` | 568M params | 100+ languages | 8192 |

Pick the smallest preset that meets your quality bar, because larger models add latency to every candidate scored.
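As a starting point, the hard constraints in the table (language coverage and maximum input length) can be checked mechanically before any quality evaluation. The sketch below encodes the table as data and returns the smallest preset that fits. The `Language` enum, the `Preset` struct, and `smallest_preset` are hypothetical helpers, not library API; collapsing "100+ languages" into a single `Other` case is a simplification, and parameter count is used only as a rough proxy for latency.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    English,
    Chinese,
    Other, // simplification: any language beyond English and Chinese
}

/// One row of the preset table (hypothetical struct for illustration).
struct Preset {
    name: &'static str,
    params_millions: u32,
    max_length: usize,
    languages: &'static [Language],
}

/// The four bundled presets, copied from the table above.
static PRESETS: [Preset; 4] = [
    Preset { name: "fast", params_millions: 22, max_length: 512,
             languages: &[Language::English] },
    Preset { name: "balanced", params_millions: 278, max_length: 512,
             languages: &[Language::English, Language::Chinese] },
    Preset { name: "quality", params_millions: 560, max_length: 512,
             languages: &[Language::English, Language::Chinese] },
    Preset { name: "multilingual", params_millions: 568, max_length: 8192,
             languages: &[Language::English, Language::Chinese, Language::Other] },
];

/// Smallest preset (by parameter count) that covers the language and
/// accepts inputs of `input_tokens` tokens, if any preset does.
fn smallest_preset(language: Language, input_tokens: usize) -> Option<&'static str> {
    PRESETS
        .iter()
        .filter(|p| p.languages.contains(&language) && p.max_length >= input_tokens)
        .min_by_key(|p| p.params_millions)
        .map(|p| p.name)
}
```

This only filters on what the table states. Whether the chosen preset actually meets your quality bar still has to be measured on your own queries and documents.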

Architecture — Where reranking sits in the broader retrieval flow.
Reranking guide — Code examples per language.
Plugin system — Registering a custom reranker backend.
