Page Classification¶
Classify each page of a document against a caller-supplied label set, in single-label mode (exactly one label per page) or multi-label mode (any subset of the labels). Results populate ExtractionResult.page_classifications.
Feature gate
Requires the classification Cargo feature (included in the full feature) and liter-llm for the underlying provider.
When to Use¶
- Routing: assign each page to a downstream queue ("invoice", "contract", "id_document", "receipt").
- Filtering: drop or down-rank pages that match an "irrelevant" or "boilerplate" label.
- Document triage: bucket multi-page PDFs into per-page categories without writing a custom classifier.
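The routing and filtering cases above can be sketched in a few lines of plain Python. This is only an illustration: it assumes each page has already been converted to a dict shaped like the JSON under Output Shape, and the queue names, `DROP_LABELS`, and `route_pages` are hypothetical names that are not part of the library.

```python
# Hypothetical queue names; the label strings match the examples in this page.
QUEUES = {
    "invoice": "accounts-payable",
    "contract": "legal-review",
    "id_document": "kyc",
    "receipt": "expenses",
}
# Labels whose pages are dropped instead of routed (filtering use case).
DROP_LABELS = {"irrelevant", "boilerplate"}


def route_pages(page_classifications, default_queue="manual-review"):
    """Map page numbers to queues, skipping pages whose first label is in DROP_LABELS."""
    routed = {}
    for page in page_classifications:
        labels = page["labels"]
        # An empty label list (possible in multi-label mode) falls back to the default queue.
        top = labels[0]["label"] if labels else None
        if top in DROP_LABELS:
            continue
        routed[page["page_number"]] = QUEUES.get(top, default_queue)
    return routed


pages = [
    {"page_number": 1, "labels": [{"label": "invoice", "confidence": 0.94}]},
    {"page_number": 2, "labels": [{"label": "boilerplate", "confidence": 0.90}]},
    {"page_number": 3, "labels": []},
]
print(route_pages(pages))  # {1: 'accounts-payable', 3: 'manual-review'}
```

Sending unlabeled or unmapped pages to a fallback queue keeps them visible instead of silently discarding them.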
When Not to Use¶
- You need whole-document classification, not per-page. Use Structured Extraction with a single string field.
- You have a custom-trained classifier already. Wrap it as a post-processor plugin instead.
Configuration¶
import asyncio

from kreuzberg import extract_file, ExtractionConfig, PageClassificationConfig, LlmConfig

config = ExtractionConfig(
    page_classification=PageClassificationConfig(
        labels=["invoice", "contract", "id_document", "receipt"],
        llm=LlmConfig(model="openai/gpt-4o-mini"),
    ),
)

async def main():
    # extract_file is awaited, so it has to run inside a coroutine
    result = await extract_file("packet.pdf", config=config)
    for page in result.page_classifications or []:
        # Single-label mode (the default): labels holds exactly one entry
        chosen = page.labels[0].label
        print(f"page {page.page_number}: {chosen}")

asyncio.run(main())
import { extractFile } from '@kreuzberg/node';

const result = await extractFile("packet.pdf", {
  pageClassification: {
    labels: ["invoice", "contract", "id_document", "receipt"],
    llm: { model: "openai/gpt-4o-mini" },
  },
});

for (const page of result.pageClassifications ?? []) {
  console.log(`page ${page.pageNumber}: ${page.labels[0]?.label}`);
}
use kreuzberg::{extract_file, ExtractionConfig, PageClassificationConfig, LlmConfig};

let config = ExtractionConfig {
    page_classification: Some(PageClassificationConfig {
        labels: vec!["invoice".into(), "contract".into(), "id_document".into(), "receipt".into()],
        multi_label: false,
        prompt_template: None,
        llm: LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            ..Default::default()
        },
    }),
    ..Default::default()
};

let result = extract_file("packet.pdf", None, &config).await?;
Single-Label vs Multi-Label¶
multi_label = false (default) forces the model to return exactly one label per page. multi_label = true lets the model return any subset. Pick the latter when pages can legitimately match more than one category ("invoice" + "purchase_order" on the same page).
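Downstream code usually has to treat the two modes differently. The sketch below shows one possible way to do that on label entries shaped like the JSON under Output Shape. `select_labels` and its `min_confidence` cutoff are hypothetical application-side helpers, not library API.

```python
def select_labels(labels, multi_label, min_confidence=0.5):
    """Return the label names to act on for one page.

    `labels` is a list of {"label": str, "confidence": float} dicts.
    `min_confidence` is a hypothetical application-side cutoff.
    """
    if not multi_label:
        # Single-label mode: exactly one entry is expected per page.
        return [labels[0]["label"]]
    # Multi-label mode: any subset may come back, including an empty one.
    return [entry["label"] for entry in labels if entry["confidence"] >= min_confidence]


single = [{"label": "invoice", "confidence": 0.94}]
multi = [
    {"label": "purchase_order", "confidence": 0.88},
    {"label": "invoice", "confidence": 0.71},
]
print(select_labels(single, multi_label=False))                     # ['invoice']
print(select_labels(multi, multi_label=True))                       # ['purchase_order', 'invoice']
print(select_labels(multi, multi_label=True, min_confidence=0.8))   # ['purchase_order']
```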
Custom Prompt (Minijinja)¶
Override the default classification prompt with a Minijinja template:
from kreuzberg import ExtractionConfig, PageClassificationConfig, LlmConfig

config = ExtractionConfig(
    page_classification=PageClassificationConfig(
        labels=["invoice", "contract", "id_document", "receipt"],
        prompt_template=(
            "You are a document triage assistant.\n"
            "Classify the page below using these labels: {{ labels }}.\n"
            "Multi-label: {{ multi_label }}.\n\n"
            "Page text:\n{{ page_text }}"
        ),
        llm=LlmConfig(model="openai/gpt-4o-mini"),
    ),
)
| Variable | Description |
|---|---|
| {{ labels }} | The configured label list. |
| {{ page_text }} | The page's extracted text. |
| {{ multi_label }} | Boolean; true when multi-label. |
The output is JSON-schema-enforced: the response must be a JSON array of strings drawn from the configured labels.
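To make the constraint concrete, here is a rough sketch of a JSON Schema that would express "an array of strings drawn from the configured labels", plus a small standard-library check of a raw response. The schema is an assumption for illustration and is not necessarily what the library sends to the provider; `build_response_schema` and `parse_response` are hypothetical helper names.

```python
import json


def build_response_schema(labels, multi_label):
    """Illustrative JSON Schema for the model response (an assumption, not the library's schema)."""
    schema = {
        "type": "array",
        "items": {"type": "string", "enum": list(labels)},
        "uniqueItems": True,
    }
    if not multi_label:
        # Single-label mode: exactly one label per page.
        schema["minItems"] = 1
        schema["maxItems"] = 1
    return schema


def parse_response(raw, labels, multi_label):
    """Parse a raw JSON response and check it against the constraints described above."""
    data = json.loads(raw)
    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        raise ValueError("response must be a JSON array of strings")
    unknown = [x for x in data if x not in labels]
    if unknown:
        raise ValueError(f"labels outside the configured set: {unknown}")
    if not multi_label and len(data) != 1:
        raise ValueError("single-label mode expects exactly one label")
    return data


labels = ["invoice", "contract", "id_document", "receipt"]
print(parse_response('["invoice"]', labels, multi_label=False))  # ['invoice']
```

A custom prompt_template changes how the model is asked, but under this reading it cannot widen the set of labels the response may contain.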
Output Shape¶
ExtractionResult.page_classifications is Option<Vec<PageClassification>>. JSON shape:
{
"page_classifications": [
{ "page_number": 1, "labels": [{ "label": "invoice", "confidence": 0.94 }] },
{ "page_number": 2, "labels": [{ "label": "purchase_order", "confidence": 0.88 }, { "label": "invoice", "confidence": 0.71 }] }
]
}
labels always carries at least one entry in single-label mode. In multi-label mode it may be empty if the model declines to pick anything.
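If the result is consumed as serialized JSON, the shape above can be loaded into small typed records. The following is a minimal sketch using only the standard library. The `Label` and `PageResult` dataclasses and `parse_page_classifications` are hypothetical names that mirror the JSON fields, not the library's own types.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Label:
    label: str
    confidence: float


@dataclass
class PageResult:
    page_number: int
    labels: List[Label]


def parse_page_classifications(raw: str) -> List[PageResult]:
    """Parse the JSON shape above; a missing or null field yields an empty list."""
    payload = json.loads(raw)
    pages = payload.get("page_classifications") or []
    return [
        PageResult(p["page_number"], [Label(**entry) for entry in p["labels"]])
        for p in pages
    ]


raw = json.dumps({
    "page_classifications": [
        {"page_number": 1, "labels": [{"label": "invoice", "confidence": 0.94}]},
        {"page_number": 2, "labels": []},  # possible in multi-label mode
    ]
})
for page in parse_page_classifications(raw):
    # Guard against the empty-list case before indexing.
    top = page.labels[0].label if page.labels else "(no label)"
    print(f"page {page.page_number}: {top}")
```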
Provider Setup¶
Pick any liter-llm provider. The provider matrix from LLM Integration applies here. For high-volume classification, gpt-4o-mini, claude-3-5-haiku, and google/gemini-2.0-flash give good cost/accuracy trade-offs.
API-key precedence chain:
1. PageClassificationConfig.llm.api_key
2. KREUZBERG_LLM_API_KEY
3. Per-provider env var (OPENAI_API_KEY, ANTHROPIC_API_KEY, …)
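The lookup order can be pictured as a simple fallback chain. The sketch below is not the library's implementation. It assumes the chain is listed from highest to lowest precedence and that the provider is the prefix before the slash in the model string. `resolve_api_key` and `PROVIDER_ENV_VARS` are hypothetical names.

```python
import os

# Hypothetical mapping from provider prefix to its environment variable.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def resolve_api_key(model, explicit_key=None, env=None):
    """Return the first key found: explicit config, global env var, then provider env var."""
    env = os.environ if env is None else env
    if explicit_key:
        return explicit_key
    if env.get("KREUZBERG_LLM_API_KEY"):
        return env["KREUZBERG_LLM_API_KEY"]
    # Assumption: "openai/gpt-4o-mini" -> provider "openai".
    provider = model.split("/", 1)[0]
    var = PROVIDER_ENV_VARS.get(provider)
    return (env.get(var) or None) if var else None


fake_env = {"KREUZBERG_LLM_API_KEY": "global-key", "OPENAI_API_KEY": "openai-key"}
print(resolve_api_key("openai/gpt-4o-mini", "config-key", fake_env))  # config-key
print(resolve_api_key("openai/gpt-4o-mini", None, fake_env))          # global-key
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.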
Related¶
- LLM Integration — provider matrix, local engines, API-key precedence
- Structured Extraction — full-schema LLM extraction
- Configuration Reference