Named-Entity Recognition

Detect named entities (people, organisations, locations, dates, money amounts, emails, phones, URLs, plus caller-supplied custom labels) in extracted text. Detected entities populate ExtractionResult.entities.

Feature gate

The result types ship in the ner Cargo feature (included in no-ort-target, wasm-target, android-target, and full). Choose a backend: ner-onnx (kreuzberg-gliner-rs ONNX) or ner-llm (liter-llm).

Backends

| Backend | Cargo feature | When to use | Status |
| --- | --- | --- | --- |
| Onnx (kreuzberg-gliner-rs) | ner-onnx | High throughput, local inference, deterministic | Available |
| Llm (liter-llm) | ner-llm | Domain-specific zero-shot labels, any of 143 providers | Available |

When to Use

  • You need entity tags attached to extracted text for retrieval, faceting, or compliance review.
  • You need PII categories surfaced for downstream redaction (NER pairs with the redaction post-processor — see Redaction & Anonymisation).
  • You need zero-shot labelling against caller-supplied categories ("Treatment", "Vessel", "Product") that fall outside the GLiNER taxonomy.

When Not to Use

  • You only need regex-detectable PII (emails, phones, IBANs, SSNs). The redaction pattern engine is 1000× cheaper. See Redaction & Anonymisation.
  • You need sub-100 ms latency on a hot path. A large LLM is unlikely to meet that budget; prefer the ONNX backend (ner-onnx) for deterministic local inference.
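
To illustrate why pattern-detectable PII does not need a model, here is a minimal Python sketch of a regex pass for email addresses. The pattern and the find_emails helper are deliberately simplified and hypothetical; they are not the redaction pattern engine's actual rules.

```python
import re

# Simplified email pattern for illustration only; real-world address
# grammar is broader than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, match) for every email-like span in `text`."""
    return [(m.start(), m.end(), m.group()) for m in EMAIL_RE.finditer(text)]

hits = find_emails("Contact ada@example.com or babbage@example.org.")
for start, end, value in hits:
    print(f"email: {value} [{start}, {end})")
```

A single compiled pattern scans the text in one pass with no model load and no network call, which is the cost gap the bullet above refers to.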

Configuration

Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig, NerConfig, LlmConfig

async def main() -> None:
    config = ExtractionConfig(
        ner=NerConfig(
            backend="llm",
            llm=LlmConfig(model="openai/gpt-4o-mini"),
        ),
    )
    result = await extract_file("contract.pdf", config=config)
    for entity in result.entities or []:
        print(f"{entity.category}: {entity.text} (confidence={entity.confidence})")

asyncio.run(main())
TypeScript
import { extractFile } from '@kreuzberg/node';

const result = await extractFile("contract.pdf", {
    ner: {
        backend: "llm",
        llm: { model: "openai/gpt-4o-mini" },
    },
});

for (const entity of result.entities ?? []) {
    console.log(`${entity.category}: ${entity.text}`);
}
Rust
use kreuzberg::{extract_file, ExtractionConfig, NerConfig, NerBackendKind, LlmConfig};

let config = ExtractionConfig {
    ner: Some(NerConfig {
        backend: NerBackendKind::Llm,
        llm: Some(LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    }),
    ..Default::default()
};
let result = extract_file("contract.pdf", None, &config).await?;
for entity in result.entities.unwrap_or_default() {
    println!("{:?}: {} (confidence={:?})", entity.category, entity.text, entity.confidence);
}
Terminal
kreuzberg extract contract.pdf \
  --config kreuzberg.toml \
  --api-key "$KREUZBERG_LLM_API_KEY"
kreuzberg.toml
[ner]
backend = "llm"
custom_labels = ["Treatment", "Vessel", "Product"]

[ner.llm]
model = "openai/gpt-4o-mini"

Custom Labels (Zero-Shot)

Pass arbitrary labels via NerConfig.custom_labels. The LLM backend folds each label into the structured-output schema; the ONNX backend (when the ner-onnx feature is enabled) uses GLiNER's native zero-shot inference.

Python
from kreuzberg import ExtractionConfig, NerConfig, LlmConfig

config = ExtractionConfig(
    ner=NerConfig(
        backend="llm",
        llm=LlmConfig(model="openai/gpt-4o-mini"),
        custom_labels=["Treatment", "Vessel", "Product"],
    ),
)
TypeScript
import { extractFile } from '@kreuzberg/node';

const result = await extractFile("contract.pdf", {
    ner: {
        backend: "llm",
        llm: { model: "openai/gpt-4o-mini" },
        customLabels: ["Treatment", "Vessel", "Product"],
    },
});
Rust
use kreuzberg::{ExtractionConfig, NerConfig, NerBackendKind, LlmConfig};

let config = ExtractionConfig {
    ner: Some(NerConfig {
        backend: NerBackendKind::Llm,
        llm: Some(LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            ..Default::default()
        }),
        custom_labels: vec!["Treatment".into(), "Vessel".into(), "Product".into()],
        ..Default::default()
    }),
    ..Default::default()
};

Custom hits surface as EntityCategory::Custom(label) in the resulting Entity stream. Casing of the supplied label is preserved.

Output Shape

ExtractionResult.entities is Option<Vec<Entity>>, populated when NER ran and produced at least one detection. JSON shape:

{
  "entities": [
    { "category": "person", "text": "Ada Lovelace", "start": 42, "end": 54, "confidence": 0.93 },
    { "category": { "custom": "Treatment" }, "text": "metformin", "start": 120, "end": 129, "confidence": 0.81 }
  ]
}
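
In the JSON above, a built-in category serializes as a plain string while a custom one serializes as an object with a "custom" key, so consumers have to handle both shapes. The Python sketch below groups entity texts by a flat label. The category_name helper and the "custom:" prefix are hypothetical conventions chosen for this example, not part of the library.

```python
import json
from collections import defaultdict

def category_name(category) -> str:
    """Return a flat label for either serialized category shape."""
    if isinstance(category, str):
        return category  # built-in category, e.g. "person"
    if isinstance(category, dict) and "custom" in category:
        # Caller-supplied label; casing is kept exactly as received.
        return f"custom:{category['custom']}"
    raise ValueError(f"unexpected category shape: {category!r}")

payload = json.loads("""
{
  "entities": [
    {"category": "person", "text": "Ada Lovelace", "start": 42, "end": 54, "confidence": 0.93},
    {"category": {"custom": "Treatment"}, "text": "metformin", "start": 120, "end": 129, "confidence": 0.81}
  ]
}
""")

by_category = defaultdict(list)
# `entities` may be absent or null when NER produced no detections.
for entity in payload.get("entities") or []:
    by_category[category_name(entity["category"])].append(entity["text"])
```

Raising on an unknown shape, rather than skipping it silently, makes a future schema change visible early.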

Byte offsets refer to result.content. When the redaction post-processor rewrites the document, NER offsets are recomputed against the redacted text — use the audit trail in result.redaction_report to reconstruct positions in the original.
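
Because the offsets are byte positions, indexing a Python str directly gives the wrong span as soon as the text contains multi-byte characters. The sketch below assumes the offsets count UTF-8 bytes and that end is exclusive. Neither is stated outright, though the sample JSON is consistent with an exclusive end (54 - 42 equals the length of "Ada Lovelace"). The slice_by_bytes helper is hypothetical.

```python
def slice_by_bytes(content: str, start: int, end: int) -> str:
    """Slice `content` by UTF-8 byte offsets (end exclusive, assumed)."""
    return content.encode("utf-8")[start:end].decode("utf-8")

# "é" occupies two bytes in UTF-8, so byte and code-point offsets diverge after it.
content = "Café owner Ada Lovelace signed."
needle = "Ada Lovelace".encode("utf-8")
start = content.encode("utf-8").index(needle)
end = start + len(needle)

by_bytes = slice_by_bytes(content, start, end)  # correct span
by_chars = content[start:end]                   # shifted by one character
```

When many entities are sliced from the same document, encoding the content once and reusing the bytes object avoids re-encoding per entity.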

Categories

| EntityCategory | Description |
| --- | --- |
| Person | Person names. |
| Organization | Organisations, companies, institutions. |
| Location | Geographic locations. |
| Date | Date mentions. |
| Time | Time-of-day mentions. |
| Money | Monetary amounts with currency. |
| Percent | Percentages. |
| Email | Email addresses. |
| Phone | Phone numbers. |
| Url | URLs. |
| Custom(label) | Caller-supplied zero-shot label. |

LLM Backend Setup

When backend = "llm", configure the model via NerConfig.llm. The API-key precedence chain matches LLM Integration:

  1. NerConfig.llm.api_key
  2. KREUZBERG_LLM_API_KEY
  3. Per-provider env var (OPENAI_API_KEY, ANTHROPIC_API_KEY, …)

Local engines (Ollama, LM Studio, vLLM) need no key.
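
The precedence chain above can be sketched in Python as follows. This illustrates the documented order only and is not the library's implementation. The resolve_api_key function, the PROVIDER_ENV_VARS mapping, and the rule of taking the provider from the text before the first "/" in the model name are all assumptions made for the example.

```python
import os
from typing import Mapping, Optional

# Hypothetical provider-to-variable mapping; only these two variable names
# appear in the list above.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def resolve_api_key(
    model: str,
    explicit_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Pick an API key following the documented precedence order."""
    # 1. A key set directly on the config (NerConfig.llm.api_key) wins.
    if explicit_key:
        return explicit_key
    # 2. The shared Kreuzberg variable comes next.
    shared = env.get("KREUZBERG_LLM_API_KEY")
    if shared:
        return shared
    # 3. Fall back to the per-provider variable, if one is known.
    provider = model.split("/", 1)[0]
    var_name = PROVIDER_ENV_VARS.get(provider)
    # None means no key was found, which is fine for local engines.
    return env.get(var_name) if var_name else None
```

Passing env explicitly keeps the function easy to test without mutating the process environment.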

Known Limitations

  • The LLM backend's accuracy depends on the chosen model. Use gpt-4o-mini or larger for production NER.
