
Redaction & Anonymisation

Redaction rewrites every textual field of ExtractionResult to remove PII before the result leaves Kreuzberg. The pattern engine covers regex-detectable categories (emails, phones, SSNs, credit cards, IBANs, IP addresses, dates of birth, SWIFT/BIC, postal codes); the optional NER backend adds PERSON / ORGANIZATION / LOCATION. The audit trail lands on ExtractionResult.redaction_report.

Feature gate

Requires the redaction Cargo feature (pattern engine only; ships in no-ort-target, wasm-target, android-target, full). Enable redaction-ml to add the NER backend for name/organisation/location categories.

Original text never leaves the pipeline

Redaction runs as the Late stage. After it runs, the original content is dropped. Only result.redaction_report carries byte offsets back into the original — use it to build audit logs, not to recover the source.

When to Use

  • You ship extracted content to a service that should never see PII.
  • You need a deterministic, local pattern engine (no network calls) for regex-detectable PII.
  • You need tenant-specific tokens (employee IDs, project codenames, internal product names) removed alongside built-in categories.

When Not to Use

  • You need to keep PII in the result for downstream NER or analytics. Run NER first and store the entities; redact in a second pass.
  • You need to redact free-form names and your build doesn't include redaction-ml. The pattern engine cannot find arbitrary names — it covers only regex-detectable categories.

Configuration

Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig, RedactionConfig

async def main() -> None:
    config = ExtractionConfig(
        redaction=RedactionConfig(
            categories=["email", "phone", "ssn", "credit_card", "iban"],
            strategy="mask",
        ),
    )
    result = await extract_file("contract.pdf", config=config)
    print(result.content)
    print(f"Redacted {result.redaction_report.total_redacted} spans")

asyncio.run(main())
TypeScript
import { extractFile } from '@kreuzberg/node';

const result = await extractFile("contract.pdf", {
    redaction: {
        categories: ["email", "phone", "ssn", "credit_card", "iban"],
        strategy: "mask",
    },
});
console.log(result.content);
console.log(`Redacted ${result.redactionReport?.totalRedacted ?? 0} spans`);
Rust
use std::collections::HashSet;
use kreuzberg::{
    extract_file, ExtractionConfig, RedactionConfig, RedactionStrategy,
    types::redaction::PiiCategory,
};

let mut categories = HashSet::new();
categories.insert(PiiCategory::Email);
categories.insert(PiiCategory::Phone);
categories.insert(PiiCategory::Ssn);
categories.insert(PiiCategory::CreditCard);
categories.insert(PiiCategory::Iban);

let config = ExtractionConfig {
    redaction: Some(RedactionConfig {
        categories,
        strategy: RedactionStrategy::Mask,
        ..Default::default()
    }),
    ..Default::default()
};
let result = extract_file("contract.pdf", None, &config).await?;
kreuzberg.toml
[redaction]
categories = ["email", "phone", "ssn", "credit_card", "iban"]
strategy = "mask"

[[redaction.custom_terms]]
label = "Project"
value = "Project Polaris"

[[redaction.custom_patterns]]
label = "InternalId"
pattern = "INT-\\d{6}"

PII Categories

PiiCategory    Detection      Notes
Email          Pattern        RFC-5322-ish.
Phone          Pattern        E.164 + national formats.
Ssn            Pattern        US SSN with 000/666/9xx exclusions.
CreditCard     Pattern        13–19 digits + Luhn check.
PostalCode     Pattern        Multi-locale.
IpAddress      Pattern        IPv4 + IPv6.
Iban           Pattern        ISO country code + length + checksum.
SwiftBic       Pattern        See "Known Limitations": the current regex over-matches plain English words.
DateOfBirth    Pattern        DOB heuristics.
Person         NER            Requires RedactionConfig.ner = Some(NerConfig).
Organization   NER            Same.
Location       NER            Same.
Custom(label)  User-supplied  custom_terms or custom_patterns.
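
The CreditCard and Iban rows mention checksum validation on top of the regex. As an illustration of why that step reduces false positives, here is a minimal Python sketch of the two standard algorithms (Luhn and the IBAN mod-97 check). It is not Kreuzberg's own code; the function names are hypothetical, and the IBAN check below skips the per-country length table.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits pass the Luhn checksum (hypothetical helper)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:  # length range taken from the table above
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if the IBAN passes the mod-97 check (hypothetical helper)."""
    compact = iban.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34 and compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111 1111 1111 1111"))
print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

A digit string that merely has the right length fails these checks in most cases, which is the reason a checksum-backed category is less noisy than a regex-only one.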

Strategies

RedactionStrategy  Output                                  Use when
Mask (default)     [REDACTED]                              You only need PII gone.
Hash               SHA-256 truncated to 16 hex chars       You need equality joins downstream without recovering the source.
TokenReplace       [PERSON_1], [PERSON_2], … per category  You need to preserve co-reference inside the document.
Drop               empty string                            You need the span gone with no marker.
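
To make the four outputs concrete, the following Python sketch applies each strategy to the same set of spans. It is an illustration under assumptions, not Kreuzberg's implementation: `apply_strategy` is a hypothetical helper, the hash is assumed to be an unsalted SHA-256 of the span's UTF-8 bytes, repeated values are assumed to reuse the same numbered token, and spans are character offsets rather than byte offsets.

```python
import hashlib
from collections import defaultdict


def apply_strategy(text, findings, strategy):
    """Replace each (start, end, category) span in `text` per `strategy`.

    Hypothetical helper; spans must not overlap.
    """
    counters = defaultdict(int)  # next index per category for token_replace
    assigned = {}                # (category, original value) -> token
    pieces = []
    cursor = 0
    for start, end, category in sorted(findings):
        original = text[start:end]
        if strategy == "mask":
            token = "[REDACTED]"
        elif strategy == "hash":
            digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
            token = digest[:16]  # truncated to 16 hex chars
        elif strategy == "token_replace":
            key = (category, original)
            if key not in assigned:
                counters[category] += 1
                assigned[key] = f"[{category.upper()}_{counters[category]}]"
            token = assigned[key]
        elif strategy == "drop":
            token = ""
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        pieces.append(text[cursor:start])
        pieces.append(token)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Ann met Bob. Ann left."
findings = [(0, 3, "person"), (8, 11, "person"), (13, 16, "person")]
for name in ("mask", "hash", "token_replace", "drop"):
    print(name, "->", apply_strategy(text, findings, name))
```

Under these assumptions, hash and token_replace both map equal inputs to equal outputs; the difference is that a hash stays comparable across documents, while numbered tokens are only meaningful inside one document.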

User-Supplied Terms and Patterns

This is the most-used surface in production: pass the literal strings or regex patterns that the caller already knows are sensitive.

Python
from kreuzberg import (
    ExtractionConfig, RedactionConfig, RedactionTerm, RedactionPattern,
)

config = ExtractionConfig(
    redaction=RedactionConfig(
        strategy="token_replace",
        custom_terms=[
            RedactionTerm(label="Project", value="Project Polaris"),
            RedactionTerm(label="Employee", value="EMP-7421", case_sensitive=True),
        ],
        custom_patterns=[
            RedactionPattern(label="InternalId", pattern=r"INT-\d{6}"),
        ],
    ),
)
TypeScript
import { extractFile } from '@kreuzberg/node';

const result = await extractFile("contract.pdf", {
    redaction: {
        strategy: "token_replace",
        customTerms: [
            { label: "Project", value: "Project Polaris" },
            { label: "Employee", value: "EMP-7421", caseSensitive: true },
        ],
        customPatterns: [
            { label: "InternalId", pattern: "INT-\\d{6}" },
        ],
    },
});
Rust
use kreuzberg::{
    ExtractionConfig, RedactionConfig, RedactionStrategy, RedactionTerm, RedactionPattern,
};

let config = ExtractionConfig {
    redaction: Some(RedactionConfig {
        strategy: RedactionStrategy::TokenReplace,
        custom_terms: vec![
            RedactionTerm::labeled("Project", "Project Polaris"),
            RedactionTerm { label: "Employee".into(), value: "EMP-7421".into(), case_sensitive: true },
        ],
        custom_patterns: vec![
            RedactionPattern::labeled("InternalId", r"INT-\d{6}"),
        ],
        ..Default::default()
    }),
    ..Default::default()
};

RedactionTerm.value is regex-escaped before matching, so pass literal text without escaping. RedactionPattern.pattern uses the Rust regex crate dialect, which has no look-around. Matching is case-insensitive by default; set case_sensitive = true for an exact-byte match. Patterns are validated at config-construction time via RedactionConfig::validate().

User hits always surface as PiiCategory::Custom(label) and are retained even when RedactionConfig.categories filters out the built-in detectors.
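
The difference between an escaped literal term and a raw pattern can be shown with Python's standard `re` module. This is a sketch of the behaviour described above, not the library's code: `compile_term` is a hypothetical helper, and Python's regex dialect differs from the Rust regex crate (the simple pattern used here happens to be valid in both).

```python
import re


def compile_term(value: str, case_sensitive: bool = False) -> re.Pattern:
    """Compile a literal term: escape metacharacters, ignore case by default."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(re.escape(value), flags)


# Literal terms: the "." in "v1.2" matches only a real dot after escaping.
version = compile_term("v1.2")
print(bool(version.search("release V1.2 shipped")))  # case-insensitive hit
print(bool(version.search("release v1x2 shipped")))  # no hit: "." is literal

# Case-sensitive term: only the exact casing matches.
employee = compile_term("EMP-7421", case_sensitive=True)
print(bool(employee.search("badge emp-7421")))

# Raw pattern: passed through as a regex, so \d{6} keeps its meaning.
internal_id = re.compile(r"INT-\d{6}")
print(internal_id.findall("INT-123456 and INT-12"))
```

Escaping is what lets callers pass values such as product names containing dots or parentheses without having to think about regex syntax.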

Pairing with NER

To redact names, organisations, and locations, attach a NerConfig:

Python
from kreuzberg import (
    ExtractionConfig, RedactionConfig, NerConfig, LlmConfig,
)

config = ExtractionConfig(
    redaction=RedactionConfig(
        categories=["person", "organization", "location", "email"],
        strategy="token_replace",
        ner=NerConfig(
            backend="llm",
            llm=LlmConfig(model="openai/gpt-4o-mini"),
        ),
    ),
)

Choose the NER backend per the NER guide. The LLM backend works today; the gline-rs ONNX backend is pending an upstream ort bump.

Output Shape

{
  "content": "Contact [REDACTED] at [REDACTED]. Reference [PROJECT_1].",
  "redaction_report": {
    "total_redacted": 3,
    "findings": [
      { "start": 8, "end": 24, "category": "person", "strategy": "mask", "replacement_token": "[REDACTED]" },
      { "start": 28, "end": 50, "category": "email", "strategy": "mask", "replacement_token": "[REDACTED]" },
      { "start": 61, "end": 75, "category": { "custom": "Project" }, "strategy": "token_replace", "replacement_token": "[PROJECT_1]" }
    ]
  }
}

Offsets refer to the ORIGINAL pre-redaction content. Use them only for audit-trail reconstruction — the original bytes are gone by the time the result reaches the caller.
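
As a sketch of the audit-log use case, the Python below turns the report shown above into flat rows and a per-category count. It relies only on the JSON shape in this section; the helper names and the `custom:<label>` rendering are illustrative choices, not part of the Kreuzberg API.

```python
import json
from collections import Counter

# Report copied from the Output Shape example above.
REPORT_JSON = """
{
  "total_redacted": 3,
  "findings": [
    {"start": 8, "end": 24, "category": "person", "strategy": "mask", "replacement_token": "[REDACTED]"},
    {"start": 28, "end": 50, "category": "email", "strategy": "mask", "replacement_token": "[REDACTED]"},
    {"start": 61, "end": 75, "category": {"custom": "Project"}, "strategy": "token_replace", "replacement_token": "[PROJECT_1]"}
  ]
}
"""


def category_label(category) -> str:
    """Built-in categories are strings; custom ones are {"custom": label}."""
    if isinstance(category, dict):
        return f"custom:{category['custom']}"
    return category


def audit_rows(report: dict) -> list:
    """One flat row per finding; contains offsets only, never source text."""
    return [
        {
            "category": category_label(finding["category"]),
            "start": finding["start"],
            "end": finding["end"],
            "length": finding["end"] - finding["start"],
            "token": finding["replacement_token"],
        }
        for finding in report["findings"]
    ]


def count_by_category(report: dict) -> dict:
    return dict(Counter(category_label(f["category"]) for f in report["findings"]))


report = json.loads(REPORT_JSON)
for row in audit_rows(report):
    print(row)
print(count_by_category(report))
```

Because the rows carry only offsets, lengths, and tokens, they can be written to a log without reintroducing the redacted values.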

Data Handling

The redaction post-processor:

  • Runs locally. The pattern engine makes no network calls.
  • Drops the original text. Only redaction_report carries spans back to the original — and only as numeric offsets, never as the original characters.
  • Adjusts chunk byte ranges in place when preserve_offsets = true (default). Set false to keep chunk offsets pointing at the original document.

When NER is enabled, data handling follows the backend you configure; see the NER guide for the network-call surface of ner-llm.
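
The preserve_offsets bullet above implies some mapping from original byte offsets to offsets in the redacted text. One plausible way to compute that mapping is sketched below in Python. This is an assumption about the general technique, not a description of Kreuzberg's internals: `shift_offset` is a hypothetical helper, and clamping an offset that falls inside a redacted span to the start of its replacement is a choice made for this sketch.

```python
from __future__ import annotations


def shift_offset(offset: int, edits: list[tuple[int, int, int]]) -> int:
    """Map a byte offset in the original text to the redacted text.

    `edits` holds (start, end, replacement_length) in original coordinates,
    sorted by start and non-overlapping.
    """
    delta = 0
    for start, end, replacement_len in edits:
        if offset >= end:
            # The whole span lies before the offset: apply its size change.
            delta += replacement_len - (end - start)
        elif offset > start:
            # Offset falls inside a redacted span: clamp to the replacement start.
            return start + delta
        else:
            break
    return offset + delta


original = b"Mail ann@example.com or bob@example.org today."
redacted = b"Mail [REDACTED] or [REDACTED] today."
token_len = len(b"[REDACTED]")
edits = [(5, 20, token_len), (24, 39, token_len)]

# A chunk that covered the second half of the original document.
chunk_start, chunk_end = 24, len(original)
new_start = shift_offset(chunk_start, edits)
new_end = shift_offset(chunk_end, edits)
print(original[chunk_start:chunk_end])
print(redacted[new_start:new_end])
```

With the adjustment applied, the chunk range still selects the same logical region of the document after redaction; without it, the range would point past the shortened text.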

Known Limitations

  • SWIFT/BIC over-matches plain English words. The current regex ([A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?) accepts arbitrary 8/11-letter all-caps tokens after the engine uppercases the input. Until a country-allowlist lands, scope RedactionConfig.categories to the subset you actually need rather than redacting everything.
  • PERSON / ORGANIZATION / LOCATION require NER. Without RedactionConfig.ner, those categories are silently skipped.
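
The SWIFT/BIC over-match is easy to reproduce with the regex quoted above. The Python sketch below uppercases each token and tests it against that expression; matching whole tokens with `fullmatch` is an assumption made for the demonstration, and `looks_like_bic` is a hypothetical helper rather than a Kreuzberg function.

```python
import re

# Regex copied from the limitation described above.
SWIFT_BIC = re.compile(r"[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?")


def looks_like_bic(token: str) -> bool:
    """Uppercase the token, then test it against the SWIFT/BIC regex."""
    return SWIFT_BIC.fullmatch(token.upper()) is not None


for token in ["DEUTDEFF", "document", "information", "paragraph", "hello"]:
    print(f"{token!r}: {looks_like_bic(token)}")
```

Any 8-letter or 11-letter word passes, so ordinary words such as "document" and "information" are indistinguishable from a real code like DEUTDEFF at the regex level.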
