Page Classification¶

Classify each page of a document against a caller-supplied label set. Single-label (exactly one) or multi-label (any subset). Result populates ExtractionResult.page_classifications.

Feature gate

Requires the classification Cargo feature. Included in full. Requires liter-llm for the underlying provider.

When to Use¶

Routing: assign each page to a downstream queue ("invoice", "contract", "id_document", "receipt").
Filtering: drop or down-rank pages that match a "irrelevant" or "boilerplate" label.
Document triage: bucket multi-page PDFs into per-page categories without writing a custom classifier.

When Not to Use¶

You need whole-document classification, not per-page. Use Structured Extraction with a single string field.
You have a custom-trained classifier already. Wrap it as a post-processor plugin instead.

Configuration¶

PythonTypeScriptRustTOML

Python

from kreuzberg import extract_file, ExtractionConfig, PageClassificationConfig, LlmConfig

config = ExtractionConfig(
    page_classification=PageClassificationConfig(
        labels=["invoice", "contract", "id_document", "receipt"],
        llm=LlmConfig(model="openai/gpt-4o-mini"),
    ),
)
result = await extract_file("packet.pdf", config=config)
for page in result.page_classifications or []:
    chosen = page.labels[0].label
    print(f"page {page.page_number}: {chosen}")

TypeScript

import { extractFile } from '@kreuzberg/node';

const result = await extractFile("packet.pdf", {
    pageClassification: {
        labels: ["invoice", "contract", "id_document", "receipt"],
        llm: { model: "openai/gpt-4o-mini" },
    },
});

for (const page of result.pageClassifications ?? []) {
    console.log(`page ${page.pageNumber}: ${page.labels[0]?.label}`);
}

Rust

use kreuzberg::{extract_file, ExtractionConfig, PageClassificationConfig, LlmConfig};

let config = ExtractionConfig {
    page_classification: Some(PageClassificationConfig {
        labels: vec!["invoice".into(), "contract".into(), "id_document".into(), "receipt".into()],
        multi_label: false,
        prompt_template: None,
        llm: LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            ..Default::default()
        },
    }),
    ..Default::default()
};
let result = extract_file("packet.pdf", None, &config).await?;

kreuzberg.toml

[page_classification]
labels = ["invoice", "contract", "id_document", "receipt"]
multi_label = false

[page_classification.llm]
model = "openai/gpt-4o-mini"

Single-Label vs Multi-Label¶

multi_label = false (default) forces the model to return exactly one label per page. multi_label = true lets the model return any subset. Pick the latter when pages can legitimately match more than one category ("invoice" + "purchase_order" on the same page).

Python

from kreuzberg import ExtractionConfig, PageClassificationConfig, LlmConfig

config = ExtractionConfig(
    page_classification=PageClassificationConfig(
        labels=["invoice", "purchase_order", "delivery_note"],
        multi_label=True,
        llm=LlmConfig(model="openai/gpt-4o-mini"),
    ),
)

Custom Prompt (Minijinja)¶

Override the default classification prompt with a Minijinja template:

Python

from kreuzberg import ExtractionConfig, PageClassificationConfig, LlmConfig

config = ExtractionConfig(
    page_classification=PageClassificationConfig(
        labels=["invoice", "contract", "id_document", "receipt"],
        prompt_template=(
            "You are a document triage assistant.\n"
            "Classify the page below using these labels: {{ labels }}.\n"
            "Multi-label: {{ multi_label }}.\n\n"
            "Page text:\n{{ page_text }}"
        ),
        llm=LlmConfig(model="openai/gpt-4o-mini"),
    ),
)

Variable	Description
`{{ labels }}`	The configured label list.
`{{ page_text }}`	The page's extracted text.
`{{ multi_label }}`	Boolean — `true` when multi-label.

The output is JSON-schema-enforced: the response must be a JSON array of strings drawn from the configured labels.

Output Shape¶

ExtractionResult.page_classifications is Option<Vec<PageClassification>>. JSON shape:

{
  "page_classifications": [
    { "page_number": 1, "labels": [{ "label": "invoice", "confidence": 0.94 }] },
    { "page_number": 2, "labels": [{ "label": "purchase_order", "confidence": 0.88 }, { "label": "invoice", "confidence": 0.71 }] }
  ]
}

labels always carries at least one entry in single-label mode. In multi-label mode it may be empty if the model declines to pick anything.

Provider Setup¶

Pick any liter-llm provider. The provider matrix from LLM Integration applies here. For high-volume classification, gpt-4o-mini, claude-3-5-haiku, and google/gemini-2.0-flash give good cost/accuracy trade-offs.

API-key precedence chain:

PageClassificationConfig.llm.api_key
KREUZBERG_LLM_API_KEY
Per-provider env var (OPENAI_API_KEY, ANTHROPIC_API_KEY, …)

LLM Integration — provider matrix, local engines, API-key precedence
Structured Extraction — full-schema LLM extraction
Configuration Reference

Edit this page on GitHub

Page Classification¶

When to Use¶

When Not to Use¶

Configuration¶

Single-Label vs Multi-Label¶

Custom Prompt (Minijinja)¶

Output Shape¶

Provider Setup¶

Related¶