Named-Entity Recognition

Detect named entities (people, organisations, locations, dates, money amounts, emails, phones, URLs, plus caller-supplied custom labels) in extracted text. Detections populate ExtractionResult.entities.

Feature Gate

The result types ship in the ner Cargo feature (included in no-ort-target, wasm-target, android-target, and full). To run detection, also enable a backend feature: ner-onnx (kreuzberg-gliner-rs ONNX) or ner-llm (liter-llm).

Backends

Backend	Cargo feature	When to use	Status
`Onnx` (kreuzberg-gliner-rs)	`ner-onnx`	High throughput, local inference, deterministic	Available.
`Llm` (liter-llm)	`ner-llm`	Domain-specific zero-shot labels, any of 143 providers	Available.

When to Use

You need entity tags attached to extracted text for retrieval, faceting, or compliance review.
You need PII categories surfaced for downstream redaction (NER pairs with the redaction post-processor — see Redaction & Anonymisation).
You need zero-shot labelling against caller-supplied categories ("Treatment", "Vessel", "Product") that fall outside the GLiNER taxonomy.

When Not to Use

You only need regex-detectable PII (emails, phones, IBANs, SSNs). The redaction pattern engine is 1000× cheaper. See Redaction & Anonymisation.
You need sub-100 ms latency on a hot path. A large LLM is unlikely to meet that budget; prefer the ONNX backend (ner-onnx) for deterministic local inference.

Configuration

Python

import asyncio
from kreuzberg import extract_file, ExtractionConfig, NerConfig, LlmConfig

async def main() -> None:
    config = ExtractionConfig(
        ner=NerConfig(
            backend="llm",
            llm=LlmConfig(model="openai/gpt-4o-mini"),
        ),
    )
    result = await extract_file("contract.pdf", config=config)
    for entity in result.entities or []:
        print(f"{entity.category}: {entity.text} (confidence={entity.confidence})")

asyncio.run(main())

TypeScript

import { extractFile } from '@kreuzberg/node';

const result = await extractFile("contract.pdf", {
    ner: {
        backend: "llm",
        llm: { model: "openai/gpt-4o-mini" },
    },
});

for (const entity of result.entities ?? []) {
    console.log(`${entity.category}: ${entity.text}`);
}

Rust

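use kreuzberg::{extract_file, ExtractionConfig, NerConfig, NerBackendKind, LlmConfig};

// `.await?` needs an async context that returns a Result.
// A Tokio runtime is assumed here; any async executor works.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ExtractionConfig {
        ner: Some(NerConfig {
            backend: NerBackendKind::Llm,
            llm: Some(LlmConfig {
                model: "openai/gpt-4o-mini".to_string(),
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };
    let result = extract_file("contract.pdf", None, &config).await?;
    // `entities` is an Option; treat a missing value as an empty list.
    for entity in result.entities.unwrap_or_default() {
        println!("{:?}: {} (confidence={:?})", entity.category, entity.text, entity.confidence);
    }
    Ok(())
}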

Terminal

kreuzberg extract contract.pdf \
  --config kreuzberg.toml \
  --api-key "$KREUZBERG_LLM_API_KEY"

kreuzberg.toml

[ner]
backend = "llm"
custom_labels = ["Treatment", "Vessel", "Product"]

[ner.llm]
model = "openai/gpt-4o-mini"

Custom Labels (Zero-Shot)

Pass arbitrary labels via NerConfig.custom_labels. The LLM backend folds each label into the structured-output schema; the ONNX backend (when the ner-onnx feature is enabled) uses GLiNER's native zero-shot inference.

Python

from kreuzberg import ExtractionConfig, NerConfig, LlmConfig

config = ExtractionConfig(
    ner=NerConfig(
        backend="llm",
        llm=LlmConfig(model="openai/gpt-4o-mini"),
        custom_labels=["Treatment", "Vessel", "Product"],
    ),
)

TypeScript

const result = await extractFile("contract.pdf", {
    ner: {
        backend: "llm",
        llm: { model: "openai/gpt-4o-mini" },
        customLabels: ["Treatment", "Vessel", "Product"],
    },
});

Rust

use kreuzberg::{ExtractionConfig, NerConfig, NerBackendKind, LlmConfig};

let config = ExtractionConfig {
    ner: Some(NerConfig {
        backend: NerBackendKind::Llm,
        llm: Some(LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            ..Default::default()
        }),
        custom_labels: vec!["Treatment".into(), "Vessel".into(), "Product".into()],
        ..Default::default()
    }),
    ..Default::default()
};

Custom hits surface as EntityCategory::Custom(label) in the resulting Entity stream. Casing of the supplied label is preserved.

Output Shape

ExtractionResult.entities is an Option<Vec<Entity>>. It is populated when NER ran and produced at least one detection. The JSON shape is:

{
  "entities": [
    { "category": "person", "text": "Ada Lovelace", "start": 42, "end": 54, "confidence": 0.93 },
    { "category": { "custom": "Treatment" }, "text": "metformin", "start": 120, "end": 129, "confidence": 0.81 }
  ]
}
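
The payload above carries two category forms: a plain string for built-in categories and a one-key object for custom labels. As a minimal sketch of how a consumer might handle both, the Python below groups entity texts by category using only the standard library. The helper names (category_key, group_entities) and the "custom:" prefix are hypothetical conventions of this example, not part of the Kreuzberg API.

```python
import json
from collections import defaultdict

# Sample payload copied from the JSON shape shown above.
payload = json.loads("""
{
  "entities": [
    { "category": "person", "text": "Ada Lovelace", "start": 42, "end": 54, "confidence": 0.93 },
    { "category": { "custom": "Treatment" }, "text": "metformin", "start": 120, "end": 129, "confidence": 0.81 }
  ]
}
""")


def category_key(category):
    """Flatten either category form into one string label.

    Built-in categories arrive as a plain string ("person"); custom
    labels arrive as a one-key object ({"custom": "Treatment"}).
    The "custom:" prefix is a convention of this sketch only.
    """
    if isinstance(category, dict):
        return f"custom:{category['custom']}"
    return category


def group_entities(entities, min_confidence=0.0):
    """Group entity texts by category, skipping low-confidence hits."""
    grouped = defaultdict(list)
    for entity in entities or []:  # "entities" may be null or absent
        confidence = entity.get("confidence")
        if confidence is not None and confidence < min_confidence:
            continue
        grouped[category_key(entity["category"])].append(entity["text"])
    return dict(grouped)


grouped = group_entities(payload.get("entities"))
print(grouped)
# {'person': ['Ada Lovelace'], 'custom:Treatment': ['metformin']}
```

Raising min_confidence to 0.9 would keep only the person entity in this sample, which is one simple way to trade recall for precision before faceting or review.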

The start and end fields are byte offsets into result.content. When the redaction post-processor rewrites the document, NER offsets are recomputed against the redacted text; use the audit trail in result.redaction_report to reconstruct positions in the original.
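
Because the offsets count bytes rather than characters, slicing a Python str directly can drift as soon as the content contains non-ASCII text. The sketch below shows one way to slice safely. It assumes the offsets are UTF-8 byte positions and that end is exclusive (consistent with the sample above, where 54 - 42 equals the 12 bytes of "Ada Lovelace"). The helper slice_by_byte_offsets and the sample sentence are hypothetical.

```python
def slice_by_byte_offsets(content: str, start: int, end: int) -> str:
    """Return the text covered by byte offsets [start, end) of content.

    Assumes UTF-8 byte offsets with an exclusive end.
    """
    raw = content.encode("utf-8")
    return raw[start:end].decode("utf-8")


content = "Café owner Ada Lovelace signed."
needle = "Ada Lovelace".encode("utf-8")

# "é" occupies two bytes in UTF-8, so every byte offset after it is
# one larger than the matching character index.
start = content.encode("utf-8").index(needle)
end = start + len(needle)

by_bytes = slice_by_byte_offsets(content, start, end)
by_chars = content[start:end]  # naive character slicing drifts by one

print(by_bytes)        # Ada Lovelace
print(repr(by_chars))  # 'da Lovelace '
```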

Categories

`EntityCategory`	Description
`Person`	Person names.
`Organization`	Organisations, companies, institutions.
`Location`	Geographic locations.
`Date`	Date mentions.
`Time`	Time-of-day mentions.
`Money`	Monetary amounts with currency.
`Percent`	Percentages.
`Email`	Email addresses.
`Phone`	Phone numbers.
`Url`	URLs.
`Custom(label)`	Caller-supplied zero-shot label.

LLM Backend Setup

When backend = "llm", configure the model via NerConfig.llm. The API-key precedence chain matches LLM Integration:

NerConfig.llm.api_key
KREUZBERG_LLM_API_KEY
Per-provider env var (OPENAI_API_KEY, ANTHROPIC_API_KEY, …)

Local engines (Ollama, LM Studio, vLLM) need no key.
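
The precedence chain can be pictured as a simple fallback lookup. The Python below is an illustrative sketch, not the library's implementation. It assumes the three sources are listed highest priority first, that an empty value counts as unset, and that the per-provider variable name can be derived from the model prefix (for example "openai/..." maps to OPENAI_API_KEY). The function resolve_api_key is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    model: str,
    config_api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Pick an API key using the documented order (assumed highest first)."""
    env = os.environ if env is None else env

    # 1. Explicit key on the config (NerConfig.llm.api_key).
    if config_api_key:
        return config_api_key

    # 2. Shared Kreuzberg variable.
    shared = env.get("KREUZBERG_LLM_API_KEY")
    if shared:
        return shared

    # 3. Per-provider variable. Deriving its name from the model prefix
    #    is an assumption of this sketch.
    provider = model.split("/", 1)[0]
    return env.get(f"{provider.upper()}_API_KEY")


env = {"KREUZBERG_LLM_API_KEY": "shared", "OPENAI_API_KEY": "provider"}
print(resolve_api_key("openai/gpt-4o-mini", "explicit", env))  # explicit
print(resolve_api_key("openai/gpt-4o-mini", None, env))        # shared
```

A return value of None corresponds to the keyless case, which is what a local engine would see when no variable is set.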

Known Limitations

The LLM backend's accuracy depends on the chosen model. Use gpt-4o-mini or larger for production NER.

Redaction & Anonymisation — uses NER for PERSON / ORGANIZATION / LOCATION categories
LLM Integration — full LLM provider matrix, local engine setup, API-key precedence
Configuration Reference — full field reference
