Named-Entity Recognition
Detect named entities (people, organisations, locations, dates, money amounts, emails, phones, URLs, plus caller-supplied custom labels) in extracted text. The result populates ExtractionResult.entities.
Feature gate
The result types ship in the ner Cargo feature (included in no-ort-target, wasm-target, android-target, and full). Choose a backend: ner-onnx (kreuzberg-gliner-rs ONNX) or ner-llm (liter-llm).
Backends
| Backend | Cargo feature | When to use | Status |
|---|---|---|---|
| Onnx (kreuzberg-gliner-rs) | ner-onnx | High throughput, local inference, deterministic | Available. |
| Llm (liter-llm) | ner-llm | Domain-specific zero-shot labels, any of 143 providers | Available today. |
When to Use
- You need entity tags attached to extracted text for retrieval, faceting, or compliance review.
- You need PII categories surfaced for downstream redaction (NER pairs with the redaction post-processor — see Redaction & Anonymisation).
- You need zero-shot labelling against caller-supplied categories ("Treatment", "Vessel", "Product") that fall outside the GLiNER taxonomy.
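For the retrieval and faceting use case, the entity list can be folded into a per-category index. The following is a minimal sketch that works on plain dictionaries in the JSON shape documented under Output Shape; `category_key`, `build_facets`, and the `custom:` key prefix are hypothetical names chosen for illustration, not part of the kreuzberg API.

```python
from collections import defaultdict


def category_key(category):
    """Return a flat facet key for a built-in or custom category."""
    if isinstance(category, dict):
        # Custom labels arrive as {"custom": "<label>"}; the caller's casing is kept.
        return f"custom:{category['custom']}"
    return category


def build_facets(entities):
    """Map each category key to its distinct entity texts, in first-seen order."""
    facets = defaultdict(list)
    for entity in entities:
        key = category_key(entity["category"])
        if entity["text"] not in facets[key]:
            facets[key].append(entity["text"])
    return dict(facets)


# Sample data invented for this sketch.
entities = [
    {"category": "person", "text": "Ada Lovelace", "start": 42, "end": 54, "confidence": 0.93},
    {"category": {"custom": "Treatment"}, "text": "metformin", "start": 120, "end": 129, "confidence": 0.81},
    {"category": "person", "text": "Ada Lovelace", "start": 200, "end": 212, "confidence": 0.90},
]
facets = build_facets(entities)
```

Repeated mentions collapse into a single facet value, which is usually what a search sidebar or compliance summary needs.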
When Not to Use
- You only need regex-detectable PII (emails, phones, IBANs, SSNs). The redaction pattern engine is 1000× cheaper. See Redaction & Anonymisation.
- You need sub-100 ms latency on a hot path, which a large LLM is unlikely to deliver. Prefer the ONNX backend (ner-onnx) for deterministic local inference.
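To see why pattern matching is the cheaper choice for regex-detectable PII, consider how little work an email scan requires. The sketch below uses only Python's standard `re` module; it is not the redaction pattern engine, and the deliberately simple `EMAIL_RE` pattern and the `find_emails` helper are assumptions made for illustration.

```python
import re

# Simplified pattern for illustration; real-world email matching needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return (start, end, match) tuples for every email-like span in text."""
    return [(m.start(), m.end(), m.group()) for m in EMAIL_RE.finditer(text)]


sample = "Contact ada@example.com or charles@example.org."
hits = find_emails(sample)
```

No model is loaded and no network call is made, so a single pass over the text is all it costs.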
Configuration
```python
import asyncio

from kreuzberg import extract_file, ExtractionConfig, NerConfig, LlmConfig


async def main() -> None:
    config = ExtractionConfig(
        ner=NerConfig(
            backend="llm",
            llm=LlmConfig(model="openai/gpt-4o-mini"),
        ),
    )
    result = await extract_file("contract.pdf", config=config)
    for entity in result.entities or []:
        print(f"{entity.category}: {entity.text} (confidence={entity.confidence})")


asyncio.run(main())
```
```rust
use kreuzberg::{extract_file, ExtractionConfig, NerConfig, NerBackendKind, LlmConfig};

let config = ExtractionConfig {
    ner: Some(NerConfig {
        backend: NerBackendKind::Llm,
        llm: Some(LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    }),
    ..Default::default()
};

let result = extract_file("contract.pdf", None, &config).await?;
for entity in result.entities.unwrap_or_default() {
    println!("{:?}: {} (confidence={:?})", entity.category, entity.text, entity.confidence);
}
```
Custom Labels (Zero-Shot)
Pass arbitrary labels via NerConfig.custom_labels. The LLM backend folds each label into the structured-output schema; the ONNX backend (when available) uses GLiNER's native zero-shot inference.
```rust
use kreuzberg::{ExtractionConfig, NerConfig, NerBackendKind, LlmConfig};

let config = ExtractionConfig {
    ner: Some(NerConfig {
        backend: NerBackendKind::Llm,
        llm: Some(LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            ..Default::default()
        }),
        custom_labels: vec!["Treatment".into(), "Vessel".into(), "Product".into()],
        ..Default::default()
    }),
    ..Default::default()
};
```
Custom hits surface as EntityCategory::Custom(label) in the resulting Entity stream. Casing of the supplied label is preserved.
Output Shape
ExtractionResult.entities is Option<Vec<Entity>>, populated when NER ran and produced at least one detection. JSON shape:
```json
{
  "entities": [
    { "category": "person", "text": "Ada Lovelace", "start": 42, "end": 54, "confidence": 0.93 },
    { "category": { "custom": "Treatment" }, "text": "metformin", "start": 120, "end": 129, "confidence": 0.81 }
  ]
}
```
Byte offsets refer to result.content. When the redaction post-processor rewrites the document, NER offsets are recomputed against the redacted text — use the audit trail in result.redaction_report to reconstruct positions in the original.
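Because the offsets count bytes rather than characters, slicing a Python `str` directly with `start` and `end` goes wrong as soon as the content contains non-ASCII text. The sketch below assumes the offsets are UTF-8 byte offsets (an assumption based on the Rust core, not something this page states) and uses a hypothetical helper, `slice_by_byte_offsets`, with sample content invented for illustration.

```python
def slice_by_byte_offsets(content: str, start: int, end: int) -> str:
    """Return the substring covered by the UTF-8 byte range [start, end)."""
    return content.encode("utf-8")[start:end].decode("utf-8")


content = "Zoë met Ada Lovelace in London."
# "ë" occupies two bytes in UTF-8, so byte offsets run one ahead of character indexes here.
entity = {"category": "person", "text": "Ada Lovelace", "start": 9, "end": 21}

span = slice_by_byte_offsets(content, entity["start"], entity["end"])
naive = content[entity["start"]:entity["end"]]  # character-index slice, off by one
```

Here `span` matches `entity["text"]`, while `naive` is shifted by one character. For large documents, encoding once and reusing the bytes avoids repeated work.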
Categories
| EntityCategory | Description |
|---|---|
| Person | Person names. |
| Organization | Organisations, companies, institutions. |
| Location | Geographic locations. |
| Date | Date mentions. |
| Time | Time-of-day mentions. |
| Money | Monetary amounts with currency. |
| Percent | Percentages. |
| Email | Email addresses. |
| Phone | Phone numbers. |
| Url | URLs. |
| Custom(label) | Caller-supplied zero-shot label. |
LLM Backend Setup
When backend = "llm", configure the model via NerConfig.llm. The API-key precedence chain matches LLM Integration:
- NerConfig.llm.api_key
- KREUZBERG_LLM_API_KEY
- Per-provider env var (OPENAI_API_KEY, ANTHROPIC_API_KEY, …)
Local engines (Ollama, LM Studio, vLLM) need no key.
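The precedence chain above can be mimicked in a few lines. This is a minimal sketch, not kreuzberg's actual resolver: `resolve_api_key` is a hypothetical helper, it assumes the chain is listed from highest to lowest precedence, and it assumes the per-provider variable name is the upper-cased model prefix followed by `_API_KEY`.

```python
import os


def resolve_api_key(explicit_key, model, env=None):
    """Pick an API key using the documented order; return None if nothing is set."""
    env = os.environ if env is None else env
    # 1. A key set directly on the config wins (assumed highest precedence).
    if explicit_key:
        return explicit_key
    # 2. The kreuzberg-wide environment variable.
    if env.get("KREUZBERG_LLM_API_KEY"):
        return env["KREUZBERG_LLM_API_KEY"]
    # 3. The per-provider variable, derived from the "provider/model" prefix (assumption).
    provider = model.split("/", 1)[0].upper()
    return env.get(f"{provider}_API_KEY")


fake_env = {"KREUZBERG_LLM_API_KEY": "kz-key", "OPENAI_API_KEY": "oa-key"}
chosen = resolve_api_key(None, "openai/gpt-4o-mini", env=fake_env)
```

Passing the environment as a parameter keeps the helper easy to test without touching real credentials.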
Known Limitations
- The LLM backend's accuracy depends on the chosen model. Use gpt-4o-mini or larger for production NER.
Related
- Redaction & Anonymisation — uses NER for PERSON / ORGANIZATION / LOCATION categories
- LLM Integration — full LLM provider matrix, local engine setup, API-key precedence
- Configuration Reference — full field reference