Type Reference¶
Complete type definitions and documentation for Kreuzberg across all language bindings.
ExtractionResult¶
Primary extraction result containing document content, metadata, and structured data elements. All extraction operations return this unified type with format-agnostic content and format-specific metadata.
Rust¶
pub struct ExtractionResult {
pub content: String,
pub mime_type: String,
pub metadata: Metadata,
pub tables: Vec<Table>,
pub detected_languages: Option<Vec<String>>,
pub chunks: Option<Vec<Chunk>>,
pub images: Option<Vec<ExtractedImage>>,
#[serde(skip_serializing_if = "Option::is_none")]
pub pages: Option<Vec<PageContent>>,
}
Python¶
class ExtractionResult(TypedDict):
"""Main result containing extracted content, metadata, and structured data."""
content: str
mime_type: str
metadata: Metadata
tables: list[Table]
detected_languages: list[str] | None
chunks: list[Chunk] | None
images: list[ExtractedImage] | None
pages: list[PageContent] | None
TypeScript¶
export interface ExtractionResult {
content: string;
mimeType: string;
metadata: Metadata;
tables: Table[];
detectedLanguages: string[] | null;
chunks: Chunk[] | null;
images: ExtractedImage[] | null;
pages?: PageContent[];
}
Ruby¶
class Kreuzberg::Result
attr_reader :content, :mime_type, :metadata, :tables
attr_reader :detected_languages, :chunks, :images, :pages
end
Java¶
public record ExtractionResult(
String content,
String mimeType,
Metadata metadata,
List<Table> tables,
List<String> detectedLanguages,
List<Chunk> chunks,
List<ExtractedImage> images,
List<PageContent> pages
) {}
Go¶
type ExtractionResult struct {
Content string `json:"content"`
MimeType string `json:"mime_type"`
Metadata Metadata `json:"metadata"`
Tables []Table `json:"tables"`
DetectedLanguages []string `json:"detected_languages,omitempty"`
Chunks []Chunk `json:"chunks,omitempty"`
Images []ExtractedImage `json:"images,omitempty"`
Pages []PageContent `json:"pages,omitempty"`
}
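A minimal consumption sketch in Python, assuming an extract_file_sync entry point (the exact import path and function name are not defined in this reference) and dictionary-style access per the TypedDict definition above:
from kreuzberg import extract_file_sync  # entry point name is an assumption

result = extract_file_sync("report.pdf")
print(result["mime_type"])              # e.g. "application/pdf"
print(result["content"][:200])          # format-agnostic extracted text
for table in result["tables"]:          # structured tables, if any
    print(table["markdown"])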
Metadata¶
Document metadata following a discriminated union pattern. The format_type field determines which format-specific fields are populated, enabling type-safe access to PDF, Excel, Email, and other format-specific metadata; in the Rust and Go bindings the same information is carried by the format field's FormatMetadata variant.
Rust¶
pub struct Metadata {
pub title: Option<String>,
pub subject: Option<String>,
pub authors: Option<Vec<String>>,
pub keywords: Option<Vec<String>>,
pub language: Option<String>,
pub created_at: Option<String>,
pub modified_at: Option<String>,
pub created_by: Option<String>,
pub modified_by: Option<String>,
pub pages: Option<PageStructure>,
pub format: Option<FormatMetadata>,
pub image_preprocessing: Option<ImagePreprocessingMetadata>,
pub json_schema: Option<serde_json::Value>,
pub error: Option<ErrorMetadata>,
pub additional: HashMap<String, serde_json::Value>,
}
pub enum FormatMetadata {
#[cfg(feature = "pdf")]
Pdf(PdfMetadata),
Excel(ExcelMetadata),
Email(EmailMetadata),
Pptx(PptxMetadata),
Archive(ArchiveMetadata),
Image(ImageMetadata),
Xml(XmlMetadata),
Text(TextMetadata),
Html(Box<HtmlMetadata>),
Ocr(OcrMetadata),
}
Python¶
class Metadata(TypedDict, total=False):
"""Document metadata with format-specific fields and processing info."""
title: str | None
subject: str | None
authors: list[str] | None
keywords: list[str] | None
language: str | None
created_at: str | None
modified_at: str | None
created_by: str | None
modified_by: str | None
pages: PageStructure | None
format_type: Literal["pdf", "excel", "email", "pptx", "archive", "image", "xml", "text", "html", "ocr"]
# Format-specific fields are included at root level based on format_type
image_preprocessing: ImagePreprocessingMetadata | None
json_schema: dict[str, Any] | None
error: ErrorMetadata | None
TypeScript¶
export interface Metadata {
title?: string | null;
subject?: string | null;
authors?: string[] | null;
keywords?: string[] | null;
language?: string | null;
createdAt?: string | null;
modifiedAt?: string | null;
createdBy?: string | null;
modifiedBy?: string | null;
pages?: PageStructure | null;
format_type?: "pdf" | "excel" | "email" | "pptx" | "archive" | "image" | "xml" | "text" | "html" | "ocr";
// Format-specific fields are included at root level based on format_type
image_preprocessing?: ImagePreprocessingMetadata | null;
json_schema?: Record<string, unknown> | null;
error?: ErrorMetadata | null;
[key: string]: any;
}
Ruby¶
# Metadata is returned as a Hash from the native extension
result.metadata # Hash with string keys and mixed values
# Check format_type to determine which format-specific fields are available
result.metadata["format_type"] # "pdf", "excel", "email", etc.
Java¶
public final class Metadata {
private final Optional<String> title;
private final Optional<String> subject;
private final Optional<List<String>> authors;
private final Optional<List<String>> keywords;
private final Optional<String> language;
private final Optional<String> createdAt;
private final Optional<String> modifiedAt;
private final Optional<String> createdBy;
private final Optional<String> modifiedBy;
private final Optional<PageStructure> pages;
private final Optional<FormatMetadata> format;
private final Optional<ImagePreprocessingMetadata> imagePreprocessing;
private final Optional<Map<String, Object>> jsonSchema;
private final Optional<ErrorMetadata> error;
}
public final class FormatMetadata {
private final FormatType type;
private final Optional<PdfMetadata> pdf;
private final Optional<ExcelMetadata> excel;
private final Optional<EmailMetadata> email;
// Additional Optional fields for each supported format type
}
Go¶
type Metadata struct {
Title *string `json:"title,omitempty"`
Subject *string `json:"subject,omitempty"`
Authors []string `json:"authors,omitempty"`
Keywords []string `json:"keywords,omitempty"`
Language *string `json:"language,omitempty"`
CreatedAt *string `json:"created_at,omitempty"`
ModifiedAt *string `json:"modified_at,omitempty"`
CreatedBy *string `json:"created_by,omitempty"`
ModifiedBy *string `json:"modified_by,omitempty"`
Pages *PageStructure `json:"pages,omitempty"`
Format FormatMetadata `json:"-"`
ImagePreprocessing *ImagePreprocessingMetadata `json:"image_preprocessing,omitempty"`
JSONSchema json.RawMessage `json:"json_schema,omitempty"`
Error *ErrorMetadata `json:"error,omitempty"`
Additional map[string]json.RawMessage `json:"-"`
}
type FormatMetadata struct {
Type FormatType
Pdf *PdfMetadata
Excel *ExcelMetadata
Email *EmailMetadata
// Additional pointer fields for each supported format type
}
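A sketch of format-discriminated access in Python. As noted above, the Python and TypeScript bindings merge format-specific fields into the metadata object at the root level, keyed off format_type; the field names used here come from the format-specific types documented later in this reference:
meta = result["metadata"]
fmt = meta.get("format_type")
if fmt == "pdf":
    print(meta.get("page_count"))                          # PdfMetadata field
elif fmt == "email":
    print(meta.get("from_email"), meta.get("to_emails"))   # EmailMetadata fields
elif fmt == "excel":
    print(meta.get("sheet_names"))                         # ExcelMetadata field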
Metadata.pages Field¶
Contains page structure information when page tracking is available. This field provides detailed boundaries and metadata for individual pages/slides/sheets within multi-page documents.
Type: Option<PageStructure> (Rust), PageStructure | None (Python), PageStructure | null (TypeScript), Optional<PageStructure> (Java), *PageStructure (Go)
When populated: Only when the document format supports page tracking (PDF, PPTX, DOCX, XLSX) and extraction is successful.
Available fields: - total_count: Total number of pages/slides/sheets in the document - unit_type: Type of paginated unit ("page", "slide", or "sheet") - boundaries: Byte offset boundaries for each page (enables O(1) lookups from byte positions to page numbers) - pages: Detailed per-page metadata including dimensions, titles, and content counts
Example usage (Rust, Python, TypeScript, Java, and Go, in that order):
if let Some(page_structure) = metadata.pages {
println!("Document has {} pages", page_structure.total_count);
if let Some(boundaries) = page_structure.boundaries {
for boundary in boundaries {
println!("Page {}: bytes {} to {}", boundary.page_number, boundary.byte_start, boundary.byte_end);
}
}
}
if metadata.get("pages"):
page_structure = metadata["pages"]
print(f"Document has {page_structure['total_count']} pages")
if page_structure.get("boundaries"):
for boundary in page_structure["boundaries"]:
print(f"Page {boundary['page_number']}: bytes {boundary['byte_start']}-{boundary['byte_end']}")
if (metadata.pages) {
console.log(`Document has ${metadata.pages.totalCount} pages`);
if (metadata.pages.boundaries) {
for (const boundary of metadata.pages.boundaries) {
console.log(`Page ${boundary.pageNumber}: bytes ${boundary.byteStart}-${boundary.byteEnd}`);
}
}
}
metadata.pages().ifPresent(pageStructure -> {
System.out.println("Document has " + pageStructure.getTotalCount() + " pages");
pageStructure.getBoundaries().ifPresent(boundaries -> {
for (PageBoundary boundary : boundaries) {
System.out.println("Page " + boundary.pageNumber() + ": bytes " +
boundary.byteStart() + "-" + boundary.byteEnd());
}
});
});
if metadata.Pages != nil {
fmt.Printf("Document has %d pages\n", metadata.Pages.TotalCount)
if metadata.Pages.Boundaries != nil {
for _, boundary := range metadata.Pages.Boundaries {
fmt.Printf("Page %d: bytes %d-%d\n", boundary.PageNumber, boundary.ByteStart, boundary.ByteEnd)
}
}
}
PageStructure¶
Unified representation of page/slide/sheet structure with byte-accurate boundaries. Tracks the logical structure of multi-page documents, enabling precise page-to-content mapping and efficient chunk-to-page lookups.
Rust¶
pub struct PageStructure {
pub total_count: usize,
pub unit_type: PageUnitType,
pub boundaries: Option<Vec<PageBoundary>>,
pub pages: Option<Vec<PageInfo>>,
}
Python¶
class PageStructure(TypedDict, total=False):
total_count: int
unit_type: str # "page", "slide", "sheet"
boundaries: list[PageBoundary] | None
pages: list[PageInfo] | None
TypeScript¶
interface PageStructure {
totalCount: number;
unitType: "page" | "slide" | "sheet";
boundaries?: PageBoundary[];
pages?: PageInfo[];
}
Ruby¶
class PageStructure < Dry::Struct
attribute :total_count, Types::Integer
attribute :unit_type, Types::String.enum("page", "slide", "sheet")
attribute :boundaries, Types::Array.of(PageBoundary).optional
attribute :pages, Types::Array.of(PageInfo).optional
end
Java¶
public final class PageStructure {
private final long totalCount;
private final PageUnitType unitType;
private final List<PageBoundary> boundaries;
private final List<PageInfo> pages;
public Optional<List<PageBoundary>> getBoundaries() { }
public Optional<List<PageInfo>> getPages() { }
}
Go¶
type PageStructure struct {
TotalCount int `json:"total_count"`
UnitType string `json:"unit_type"`
Boundaries []PageBoundary `json:"boundaries,omitempty"`
Pages []PageInfo `json:"pages,omitempty"`
}
C#¶
public record PageStructure
{
public required int TotalCount { get; init; }
public required string UnitType { get; init; }
public List<PageBoundary>? Boundaries { get; init; }
public List<PageInfo>? Pages { get; init; }
}
Fields: - total_count: Total number of pages/slides/sheets - unit_type: Distinction between Page/Slide/Sheet - boundaries: Byte offset ranges for each page (enables O(1) lookups) - pages: Per-page metadata (dimensions, counts, visibility)
PageBoundary¶
Byte offset range for a single page/slide/sheet. Enables O(1) page lookups and precise chunk-to-page mapping.
WARNING: Byte offsets are aligned to UTF-8 character boundaries but count bytes, not characters. Do not use them as character indices.
Rust¶
pub struct PageBoundary {
pub byte_start: usize,
pub byte_end: usize,
pub page_number: usize,
}
Ruby¶
class PageBoundary < Dry::Struct
attribute :byte_start, Types::Integer
attribute :byte_end, Types::Integer
attribute :page_number, Types::Integer
end
Go¶
type PageBoundary struct {
ByteStart int `json:"byte_start"`
ByteEnd int `json:"byte_end"`
PageNumber int `json:"page_number"`
}
C#¶
public record PageBoundary
{
public required int ByteStart { get; init; }
public required int ByteEnd { get; init; }
public required int PageNumber { get; init; }
}
Fields: - byte_start: UTF-8 byte offset (inclusive) - byte_end: UTF-8 byte offset (exclusive) - page_number: 1-indexed page number
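A lookup sketch in Python mapping a UTF-8 byte offset in the extracted content to its page number via the boundaries list; the bisect-based search is one straightforward way to use these offsets (this reference does not prescribe a particular lookup strategy):
from bisect import bisect_right

def page_for_offset(boundaries: list[dict], byte_offset: int) -> int | None:
    """Return the 1-indexed page number containing the given UTF-8 byte offset."""
    starts = [b["byte_start"] for b in boundaries]
    i = bisect_right(starts, byte_offset) - 1
    if i >= 0 and byte_offset < boundaries[i]["byte_end"]:
        return boundaries[i]["page_number"]
    return None

# Offsets refer to the UTF-8 encoding of the content string, not character indices.
offset = len(result["content"][:500].encode("utf-8"))
page = page_for_offset(result["metadata"]["pages"]["boundaries"], offset)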
PageInfo¶
Detailed per-page metadata. Contains format-specific metadata for individual pages/slides/sheets.
Rust¶
pub struct PageInfo {
pub number: usize,
pub title: Option<String>,
pub dimensions: Option<(f64, f64)>,
pub image_count: Option<usize>,
pub table_count: Option<usize>,
pub hidden: Option<bool>,
}
Python¶
class PageInfo(TypedDict, total=False):
number: int
title: str | None
dimensions: tuple[float, float] | None
image_count: int | None
table_count: int | None
hidden: bool | None
TypeScript¶
interface PageInfo {
number: number;
title?: string;
dimensions?: [number, number];
imageCount?: number;
tableCount?: number;
hidden?: boolean;
}
Ruby¶
class PageInfo < Dry::Struct
attribute :number, Types::Integer
attribute :title, Types::String.optional
attribute :dimensions, Types::Array.of(Types::Float).optional
attribute :image_count, Types::Integer.optional
attribute :table_count, Types::Integer.optional
attribute :hidden, Types::Bool.optional
end
Java¶
public record PageInfo(
int number,
Optional<String> title,
Optional<double[]> dimensions,
Optional<Integer> imageCount,
Optional<Integer> tableCount,
Optional<Boolean> hidden
) {}
Go¶
type PageInfo struct {
Number int `json:"number"`
Title *string `json:"title,omitempty"`
Dimensions []float64 `json:"dimensions,omitempty"`
ImageCount *int `json:"image_count,omitempty"`
TableCount *int `json:"table_count,omitempty"`
Hidden *bool `json:"hidden,omitempty"`
}
C#¶
public record PageInfo
{
public required int Number { get; init; }
public string? Title { get; init; }
public (double Width, double Height)? Dimensions { get; init; }
public int? ImageCount { get; init; }
public int? TableCount { get; init; }
public bool? Hidden { get; init; }
}
Fields: - number: 1-indexed page number - title: Page/slide title (PPTX) - dimensions: Width and height in points (PDF, PPTX) - image_count: Number of images on page - table_count: Number of tables on page - hidden: Whether page/slide is hidden (PPTX)
PageUnitType¶
Enum distinguishing page types across document formats.
Go¶
type PageUnitType string
const (
PageUnitTypePage PageUnitType = "page"
PageUnitTypeSlide PageUnitType = "slide"
PageUnitTypeSheet PageUnitType = "sheet"
)
Values: - Page: Standard document pages (PDF, DOCX) - Slide: Presentation slides (PPTX) - Sheet: Spreadsheet sheets (XLSX)
Format-Specific Metadata¶
PDF Metadata¶
Document properties extracted from PDF files including title, author, creation dates, and page count. Available when format_type == "pdf".
Rust¶
pub struct PdfMetadata {
pub title: Option<String>,
pub author: Option<String>,
pub subject: Option<String>,
pub keywords: Option<String>,
pub creator: Option<String>,
pub producer: Option<String>,
pub creation_date: Option<String>,
pub modification_date: Option<String>,
pub page_count: Option<usize>,
}
Python¶
class PdfMetadata(TypedDict, total=False):
title: str | None
author: str | None
subject: str | None
keywords: str | None
creator: str | None
producer: str | None
creation_date: str | None
modification_date: str | None
page_count: int
TypeScript¶
export interface PdfMetadata {
title?: string | null;
author?: string | null;
subject?: string | null;
keywords?: string | null;
creator?: string | null;
producer?: string | null;
creationDate?: string | null;
modificationDate?: string | null;
pageCount?: number;
}
Java¶
public record PdfMetadata(
Optional<String> title,
Optional<String> author,
Optional<String> subject,
Optional<String> keywords,
Optional<String> creator,
Optional<String> producer,
Optional<String> creationDate,
Optional<String> modificationDate,
Optional<Integer> pageCount
) {}
Go¶
type PdfMetadata struct {
Title *string `json:"title,omitempty"`
Author *string `json:"author,omitempty"`
Subject *string `json:"subject,omitempty"`
Keywords []string `json:"keywords,omitempty"`
Creator *string `json:"creator,omitempty"`
Producer *string `json:"producer,omitempty"`
CreatedAt *string `json:"created_at,omitempty"`
ModifiedAt *string `json:"modified_at,omitempty"`
PageCount *int `json:"page_count,omitempty"`
}
Excel Metadata¶
Spreadsheet workbook information including sheet count and sheet names. Available when format_type == "excel".
Python¶
class ExcelMetadata(TypedDict, total=False):
sheet_count: int
sheet_names: list[str]
Go¶
type ExcelMetadata struct {
SheetCount int `json:"sheet_count"`
SheetNames []string `json:"sheet_names"`
}
Email Metadata¶
Email message headers and recipient information including sender, recipients, message ID, and attachment lists. Available when format_type == "email".
Rust¶
pub struct EmailMetadata {
pub from_email: Option<String>,
pub from_name: Option<String>,
pub to_emails: Vec<String>,
pub cc_emails: Vec<String>,
pub bcc_emails: Vec<String>,
pub message_id: Option<String>,
pub attachments: Vec<String>,
}
Python¶
class EmailMetadata(TypedDict, total=False):
from_email: str | None
from_name: str | None
to_emails: list[str]
cc_emails: list[str]
bcc_emails: list[str]
message_id: str | None
attachments: list[str]
TypeScript¶
export interface EmailMetadata {
fromEmail?: string | null;
fromName?: string | null;
toEmails?: string[];
ccEmails?: string[];
bccEmails?: string[];
messageId?: string | null;
attachments?: string[];
}
Java¶
public record EmailMetadata(
Optional<String> fromEmail,
Optional<String> fromName,
List<String> toEmails,
List<String> ccEmails,
List<String> bccEmails,
Optional<String> messageId,
List<String> attachments
) {}
Go¶
type EmailMetadata struct {
FromEmail *string `json:"from_email,omitempty"`
FromName *string `json:"from_name,omitempty"`
ToEmails []string `json:"to_emails"`
CcEmails []string `json:"cc_emails"`
BccEmails []string `json:"bcc_emails"`
MessageID *string `json:"message_id,omitempty"`
Attachments []string `json:"attachments"`
}
Archive Metadata¶
Archive file properties including format type, file count, file list, and compression information. Available when format_type == "archive".
Rust¶
pub struct ArchiveMetadata {
pub format: String,
pub file_count: usize,
pub file_list: Vec<String>,
pub total_size: usize,
pub compressed_size: Option<usize>,
}
Python¶
class ArchiveMetadata(TypedDict, total=False):
format: str
file_count: int
file_list: list[str]
total_size: int
compressed_size: int | None
TypeScript¶
export interface ArchiveMetadata {
format?: string;
fileCount?: number;
fileList?: string[];
totalSize?: number;
compressedSize?: number | null;
}
Java¶
public record ArchiveMetadata(
String format,
int fileCount,
List<String> fileList,
int totalSize,
Optional<Integer> compressedSize
) {}
Go¶
type ArchiveMetadata struct {
Format string `json:"format"`
FileCount int `json:"file_count"`
FileList []string `json:"file_list"`
TotalSize int `json:"total_size"`
CompressedSize *int `json:"compressed_size,omitempty"`
}
Image Metadata¶
Image properties including dimensions, format type, and EXIF metadata extracted from image files. Available when format_type == "image".
Rust¶
pub struct ImageMetadata {
pub width: u32,
pub height: u32,
pub format: String,
pub exif: HashMap<String, String>,
}
Python¶
class ImageMetadata(TypedDict, total=False):
width: int
height: int
format: str
exif: dict[str, str]
TypeScript¶
export interface ImageMetadata {
width?: number;
height?: number;
format?: string;
exif?: Record<string, string>;
}
Java¶
public record ImageMetadata(
int width,
int height,
String format,
Map<String, String> exif
) {}
Go¶
type ImageMetadata struct {
Width uint32 `json:"width"`
Height uint32 `json:"height"`
Format string `json:"format"`
EXIF map[string]string `json:"exif"`
}
HTML Metadata¶
Rich web page metadata including SEO tags, Open Graph fields, Twitter Card properties, structured data, and complex resource links. Available when format_type == "html". Structured fields like headers, links, and images are represented as complex typed objects, not simple arrays of strings.
Rust¶
pub struct HtmlMetadata {
pub title: Option<String>,
pub description: Option<String>,
pub keywords: Vec<String>,
pub author: Option<String>,
pub canonical_url: Option<String>,
pub base_href: Option<String>,
pub language: Option<String>,
pub text_direction: Option<TextDirection>,
pub open_graph: BTreeMap<String, String>,
pub twitter_card: BTreeMap<String, String>,
pub meta_tags: BTreeMap<String, String>,
pub headers: Vec<HeaderMetadata>,
pub links: Vec<LinkMetadata>,
pub images: Vec<ImageMetadataType>,
pub structured_data: Vec<StructuredData>,
}
Python¶
class HtmlMetadata(TypedDict, total=False):
title: str | None
description: str | None
keywords: list[str]
author: str | None
canonical_url: str | None
base_href: str | None
language: str | None
text_direction: str | None
open_graph: dict[str, str]
twitter_card: dict[str, str]
meta_tags: dict[str, str]
headers: list[HeaderMetadata]
links: list[LinkMetadata]
images: list[ImageMetadataType]
structured_data: list[StructuredData]
TypeScript¶
export interface HtmlMetadata {
title?: string | null;
description?: string | null;
keywords: string[];
author?: string | null;
canonicalUrl?: string | null;
baseHref?: string | null;
language?: string | null;
textDirection?: string | null;
openGraph: Record<string, string>;
twitterCard: Record<string, string>;
metaTags: Record<string, string>;
headers: HeaderMetadata[];
links: LinkMetadata[];
images: ImageMetadataType[];
structuredData: StructuredData[];
}
Java¶
public record HtmlMetadata(
Optional<String> title,
Optional<String> description,
List<String> keywords,
Optional<String> author,
Optional<String> canonicalUrl,
Optional<String> baseHref,
Optional<String> language,
Optional<TextDirection> textDirection,
Map<String, String> openGraph,
Map<String, String> twitterCard,
Map<String, String> metaTags,
List<HeaderMetadata> headers,
List<LinkMetadata> links,
List<ImageMetadataType> images,
List<StructuredData> structuredData
) {}
Go¶
type HtmlMetadata struct {
Title *string `json:"title,omitempty"`
Description *string `json:"description,omitempty"`
Keywords []string `json:"keywords"`
Author *string `json:"author,omitempty"`
CanonicalURL *string `json:"canonical_url,omitempty"`
BaseHref *string `json:"base_href,omitempty"`
Language *string `json:"language,omitempty"`
TextDirection *string `json:"text_direction,omitempty"`
OpenGraph map[string]string `json:"open_graph"`
TwitterCard map[string]string `json:"twitter_card"`
MetaTags map[string]string `json:"meta_tags"`
Headers []HeaderMetadata `json:"headers"`
Links []LinkMetadata `json:"links"`
Images []ImageMetadataType `json:"images"`
StructuredData []StructuredData `json:"structured_data"`
}
C#¶
public record HtmlMetadata
{
public string? Title { get; init; }
public string? Description { get; init; }
public List<string> Keywords { get; init; } = new();
public string? Author { get; init; }
public string? CanonicalUrl { get; init; }
public string? BaseHref { get; init; }
public string? Language { get; init; }
public string? TextDirection { get; init; }
public Dictionary<string, string> OpenGraph { get; init; } = new();
public Dictionary<string, string> TwitterCard { get; init; } = new();
public Dictionary<string, string> MetaTags { get; init; } = new();
public List<HeaderMetadata> Headers { get; init; } = new();
public List<LinkMetadata> Links { get; init; } = new();
public List<ImageMetadataType> Images { get; init; } = new();
public List<StructuredData> StructuredData { get; init; } = new();
}
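A sketch in Python reading the structured HTML fields, assuming (as noted under Metadata) that they appear at the metadata root when format_type is "html"; the HeaderMetadata and LinkMetadata field names are documented in the subsections that follow:
meta = result["metadata"]
if meta.get("format_type") == "html":
    for header in meta.get("headers", []):
        print(f"h{header['level']}: {header['text']}")
    external = [link["href"] for link in meta.get("links", []) if link["link_type"] == "external"]
    print(meta.get("open_graph", {}).get("og:title"), external[:5])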
HeaderMetadata¶
Metadata for header elements (h1-h6) with hierarchy and positioning information.
Rust¶
pub struct HeaderMetadata {
pub level: u8,
pub text: String,
pub id: Option<String>,
pub depth: usize,
pub html_offset: usize,
}
Python¶
class HeaderMetadata(TypedDict, total=False):
level: int
text: str
id: str | None
depth: int
html_offset: int
TypeScript¶
export interface HeaderMetadata {
level: number;
text: string;
id?: string | null;
depth: number;
htmlOffset: number;
}
Java¶
public record HeaderMetadata(
int level,
String text,
Optional<String> id,
int depth,
int htmlOffset
) {}
Go¶
type HeaderMetadata struct {
Level int `json:"level"`
Text string `json:"text"`
ID *string `json:"id,omitempty"`
Depth int `json:"depth"`
HtmlOffset int `json:"html_offset"`
}
C#¶
public record HeaderMetadata
{
public required int Level { get; init; }
public required string Text { get; init; }
public string? Id { get; init; }
public required int Depth { get; init; }
public required int HtmlOffset { get; init; }
}
Fields: - level: Header level 1-6 (h1 through h6) - text: Normalized text content of the header - id: Optional HTML id attribute - depth: Document tree depth at the header element - html_offset: Byte offset in original HTML document
LinkMetadata¶
Metadata for hyperlink elements with classification and relationship information.
Rust¶
pub struct LinkMetadata {
pub href: String,
pub text: String,
pub title: Option<String>,
pub link_type: LinkType,
pub rel: Vec<String>,
pub attributes: HashMap<String, String>,
}
Python¶
class LinkMetadata(TypedDict, total=False):
href: str
text: str
title: str | None
link_type: LinkType
rel: list[str]
attributes: dict[str, str]
TypeScript¶
export interface LinkMetadata {
href: string;
text: string;
title?: string | null;
linkType: LinkType;
rel: string[];
attributes: Record<string, string>;
}
Java¶
public record LinkMetadata(
String href,
String text,
Optional<String> title,
LinkType linkType,
List<String> rel,
Map<String, String> attributes
) {}
Go¶
type LinkMetadata struct {
Href string `json:"href"`
Text string `json:"text"`
Title *string `json:"title,omitempty"`
LinkType LinkType `json:"link_type"`
Rel []string `json:"rel"`
Attributes map[string]string `json:"attributes"`
}
C#¶
public record LinkMetadata
{
public required string Href { get; init; }
public required string Text { get; init; }
public string? Title { get; init; }
public required LinkType LinkType { get; init; }
public required List<string> Rel { get; init; }
public required Dictionary<string, string> Attributes { get; init; }
}
Fields: - href: The href URL value - text: Link text content (normalized) - title: Optional title attribute - link_type: Classification of link type - rel: Values from rel attribute - attributes: Additional attributes as key-value pairs
LinkType¶
Link type classification enum.
TypeScript¶
export type LinkType = "anchor" | "internal" | "external" | "email" | "phone" | "other";
Go¶
type LinkType string
const (
LinkTypeAnchor LinkType = "anchor"
LinkTypeInternal LinkType = "internal"
LinkTypeExternal LinkType = "external"
LinkTypeEmail LinkType = "email"
LinkTypePhone LinkType = "phone"
LinkTypeOther LinkType = "other"
)
Values: - Anchor: Anchor link (#section) - Internal: Internal link (same domain) - External: External link (different domain) - Email: Email link (mailto:) - Phone: Phone link (tel:) - Other: Other link type
ImageMetadataType¶
Metadata for image elements with source, dimensions, and type classification.
Rust¶
pub struct ImageMetadataType {
pub src: String,
pub alt: Option<String>,
pub title: Option<String>,
pub dimensions: Option<(u32, u32)>,
pub image_type: ImageType,
pub attributes: HashMap<String, String>,
}
Python¶
class ImageMetadataType(TypedDict, total=False):
src: str
alt: str | None
title: str | None
dimensions: tuple[int, int] | None
image_type: ImageType
attributes: dict[str, str]
TypeScript¶
export interface ImageMetadataType {
src: string;
alt?: string | null;
title?: string | null;
dimensions?: [number, number] | null;
imageType: ImageType;
attributes: Record<string, string>;
}
Java¶
public record ImageMetadataType(
String src,
Optional<String> alt,
Optional<String> title,
Optional<int[]> dimensions,
ImageType imageType,
Map<String, String> attributes
) {}
Go¶
type ImageMetadataType struct {
Src string `json:"src"`
Alt *string `json:"alt,omitempty"`
Title *string `json:"title,omitempty"`
Dimensions *[2]int `json:"dimensions,omitempty"`
ImageType ImageType `json:"image_type"`
Attributes map[string]string `json:"attributes"`
}
C#¶
public record ImageMetadataType
{
public required string Src { get; init; }
public string? Alt { get; init; }
public string? Title { get; init; }
public (int Width, int Height)? Dimensions { get; init; }
public required ImageType ImageType { get; init; }
public required Dictionary<string, string> Attributes { get; init; }
}
Fields: - src: Image source (URL, data URI, or SVG content) - alt: Alternative text from alt attribute - title: Title attribute - dimensions: Image dimensions as (width, height) if available - image_type: Classification of image source type - attributes: Additional attributes as key-value pairs
ImageType¶
Image type classification enum.
Go¶
type ImageType string
const (
ImageTypeDataUri ImageType = "data-uri"
ImageTypeInlineSvg ImageType = "inline-svg"
ImageTypeExternal ImageType = "external"
ImageTypeRelative ImageType = "relative"
)
Values: - DataUri: Data URI image - InlineSvg: Inline SVG - External: External image URL - Relative: Relative path image
StructuredData¶
Structured data block metadata (Schema.org, microdata, RDFa) with type classification.
Rust¶
pub struct StructuredData {
pub data_type: StructuredDataType,
pub raw_json: String,
pub schema_type: Option<String>,
}
Python¶
class StructuredData(TypedDict, total=False):
data_type: StructuredDataType
raw_json: str
schema_type: str | None
TypeScript¶
export interface StructuredData {
dataType: StructuredDataType;
rawJson: string;
schemaType?: string | null;
}
Java¶
public record StructuredData(
StructuredDataType dataType,
String rawJson,
Optional<String> schemaType
) {}
Go¶
type StructuredData struct {
DataType StructuredDataType `json:"data_type"`
RawJson string `json:"raw_json"`
SchemaType *string `json:"schema_type,omitempty"`
}
C#¶
public record StructuredData
{
public required StructuredDataType DataType { get; init; }
public required string RawJson { get; init; }
public string? SchemaType { get; init; }
}
Fields: - data_type: Type of structured data (JSON-LD, Microdata, RDFa) - raw_json: Raw JSON string representation - schema_type: Schema type if detectable (e.g., "Article", "Event", "Product")
StructuredDataType¶
Structured data type classification enum.
Go¶
type StructuredDataType string
const (
StructuredDataTypeJsonLd StructuredDataType = "json-ld"
StructuredDataTypeMicrodata StructuredDataType = "microdata"
StructuredDataTypeRDFa StructuredDataType = "rdfa"
)
Values: - JsonLd: JSON-LD structured data - Microdata: Microdata structured data - RDFa: RDFa structured data
Text/Markdown Metadata¶
Text document statistics and structure including line/word/character counts, headers, links, and code blocks. Available when format_type == "text".
Rust¶
pub struct TextMetadata {
pub line_count: usize,
pub word_count: usize,
pub character_count: usize,
pub headers: Option<Vec<String>>,
pub links: Option<Vec<(String, String)>>,
pub code_blocks: Option<Vec<(String, String)>>,
}
Python¶
class TextMetadata(TypedDict, total=False):
line_count: int
word_count: int
character_count: int
headers: list[str] | None
links: list[tuple[str, str]] | None
code_blocks: list[tuple[str, str]] | None
TypeScript¶
export interface TextMetadata {
lineCount?: number;
wordCount?: number;
characterCount?: number;
headers?: string[] | null;
links?: [string, string][] | null;
codeBlocks?: [string, string][] | null;
}
Java¶
public record TextMetadata(
int lineCount,
int wordCount,
int characterCount,
Optional<List<String>> headers,
Optional<List<String[]>> links,
Optional<List<String[]>> codeBlocks
) {}
Go¶
type TextMetadata struct {
LineCount int `json:"line_count"`
WordCount int `json:"word_count"`
CharacterCount int `json:"character_count"`
Headers []string `json:"headers,omitempty"`
Links [][2]string `json:"links,omitempty"`
CodeBlocks [][2]string `json:"code_blocks,omitempty"`
}
PowerPoint Metadata¶
Presentation metadata including title, author, description, summary, and font information. Available when format_type == "pptx".
Rust¶
pub struct PptxMetadata {
pub title: Option<String>,
pub author: Option<String>,
pub description: Option<String>,
pub summary: Option<String>,
pub fonts: Vec<String>,
}
Python¶
class PptxMetadata(TypedDict, total=False):
title: str | None
author: str | None
description: str | None
summary: str | None
fonts: list[str]
TypeScript¶
export interface PptxMetadata {
title?: string | null;
author?: string | null;
description?: string | null;
summary?: string | null;
fonts?: string[];
}
Java¶
public record PptxMetadata(
Optional<String> title,
Optional<String> author,
Optional<String> description,
Optional<String> summary,
List<String> fonts
) {}
Go¶
type PptxMetadata struct {
Title *string `json:"title,omitempty"`
Author *string `json:"author,omitempty"`
Description *string `json:"description,omitempty"`
Summary *string `json:"summary,omitempty"`
Fonts []string `json:"fonts"`
}
OCR Metadata¶
Optical Character Recognition processing metadata including language, page segmentation mode, output format, and table detection results. Available when format_type == "ocr".
Rust¶
pub struct OcrMetadata {
pub language: String,
pub psm: i32,
pub output_format: String,
pub table_count: usize,
pub table_rows: Option<usize>,
pub table_cols: Option<usize>,
}
Python¶
class OcrMetadata(TypedDict, total=False):
language: str
psm: int
output_format: str
table_count: int
table_rows: int | None
table_cols: int | None
TypeScript¶
export interface OcrMetadata {
language?: string;
psm?: number;
outputFormat?: string;
tableCount?: number;
tableRows?: number | null;
tableCols?: number | null;
}
Java¶
public record OcrMetadata(
String language,
int psm,
String outputFormat,
int tableCount,
Optional<Integer> tableRows,
Optional<Integer> tableCols
) {}
Go¶
type OcrMetadata struct {
Language string `json:"language"`
PSM int `json:"psm"`
OutputFormat string `json:"output_format"`
TableCount int `json:"table_count"`
TableRows *int `json:"table_rows,omitempty"`
TableCols *int `json:"table_cols,omitempty"`
}
Table¶
Structured table data extracted from documents with cell contents in 2D array format, markdown representation, and source page number.
Rust¶
pub struct Table {
pub cells: Vec<Vec<String>>,
pub markdown: String,
pub page_number: usize,
}
Go¶
type Table struct {
Cells [][]string `json:"cells"`
Markdown string `json:"markdown"`
PageNumber int `json:"page_number"`
}
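A short Python sketch showing the two representations carried by each table: the raw 2D cells array and the prerendered markdown string:
for table in result["tables"]:
    print(f"Table on page {table['page_number']}")
    for row in table["cells"]:          # list of rows, each a list of cell strings
        print(" | ".join(row))
    print(table["markdown"])            # ready-to-use markdown rendering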
Chunk¶
Text chunk for RAG and vector search applications, containing content segment, optional embedding vector, and position metadata for precise document referencing.
Rust¶
pub struct Chunk {
pub content: String,
pub embedding: Option<Vec<f32>>,
pub metadata: ChunkMetadata,
}
pub struct ChunkMetadata {
pub char_start: usize,
pub char_end: usize,
pub token_count: Option<usize>,
pub chunk_index: usize,
pub total_chunks: usize,
}
Python¶
class ChunkMetadata(TypedDict):
char_start: int
char_end: int
token_count: int | None
chunk_index: int
total_chunks: int
class Chunk(TypedDict, total=False):
content: str
embedding: list[float] | None
metadata: ChunkMetadata
TypeScript¶
export interface ChunkMetadata {
charStart: number;
charEnd: number;
tokenCount?: number | null;
chunkIndex: number;
totalChunks: number;
}
export interface Chunk {
content: string;
embedding?: number[] | null;
metadata: ChunkMetadata;
}
Ruby¶
Kreuzberg::Result::Chunk = Struct.new(
:content, :char_start, :char_end, :token_count,
:chunk_index, :total_chunks, :embedding,
keyword_init: true
)
Java¶
public record ChunkMetadata(
int charStart,
int charEnd,
Optional<Integer> tokenCount,
int chunkIndex,
int totalChunks
) {}
public record Chunk(
String content,
Optional<List<Float>> embedding,
ChunkMetadata metadata
) {}
Go¶
type ChunkMetadata struct {
CharStart int `json:"char_start"`
CharEnd int `json:"char_end"`
TokenCount *int `json:"token_count,omitempty"`
ChunkIndex int `json:"chunk_index"`
TotalChunks int `json:"total_chunks"`
}
type Chunk struct {
Content string `json:"content"`
Embedding []float32 `json:"embedding,omitempty"`
Metadata ChunkMetadata `json:"metadata"`
}
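A Python sketch relating chunks back to the source text, assuming char_start/char_end are character offsets into result["content"] (as opposed to the byte offsets used by page boundaries):
for chunk in result["chunks"] or []:
    cm = chunk["metadata"]
    source_slice = result["content"][cm["char_start"]:cm["char_end"]]
    print(f"chunk {cm['chunk_index']}/{cm['total_chunks']}, "
          f"tokens={cm.get('token_count')}, embedded={chunk.get('embedding') is not None}")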
ExtractedImage¶
Binary image data extracted from documents with format metadata, dimensions, colorspace information, and optional nested OCR extraction results.
Rust¶
pub struct ExtractedImage {
pub data: Vec<u8>,
pub format: String,
pub image_index: usize,
pub page_number: Option<usize>,
pub width: Option<u32>,
pub height: Option<u32>,
pub colorspace: Option<String>,
pub bits_per_component: Option<u32>,
pub is_mask: bool,
pub description: Option<String>,
pub ocr_result: Option<Box<ExtractionResult>>,
}
Python¶
class ExtractedImage(TypedDict, total=False):
data: bytes
format: str
image_index: int
page_number: int | None
width: int | None
height: int | None
colorspace: str | None
bits_per_component: int | None
is_mask: bool
description: str | None
ocr_result: ExtractionResult | None
TypeScript¶
export interface ExtractedImage {
data: Uint8Array;
format: string;
imageIndex: number;
pageNumber?: number | null;
width?: number | null;
height?: number | null;
colorspace?: string | null;
bitsPerComponent?: number | null;
isMask: boolean;
description?: string | null;
ocrResult?: ExtractionResult | null;
}
Ruby¶
Kreuzberg::Result::Image = Struct.new(
:data, :format, :image_index, :page_number, :width, :height,
:colorspace, :bits_per_component, :is_mask, :description, :ocr_result,
keyword_init: true
)
Java¶
public record ExtractedImage(
byte[] data,
String format,
int imageIndex,
Optional<Integer> pageNumber,
Optional<Integer> width,
Optional<Integer> height,
Optional<String> colorspace,
Optional<Integer> bitsPerComponent,
boolean isMask,
Optional<String> description,
Optional<ExtractionResult> ocrResult
) {}
Go¶
type ExtractedImage struct {
Data []byte `json:"data"`
Format string `json:"format"`
ImageIndex int `json:"image_index"`
PageNumber *int `json:"page_number,omitempty"`
Width *uint32 `json:"width,omitempty"`
Height *uint32 `json:"height,omitempty"`
Colorspace *string `json:"colorspace,omitempty"`
BitsPerComponent *uint32 `json:"bits_per_component,omitempty"`
IsMask bool `json:"is_mask"`
Description *string `json:"description,omitempty"`
OCRResult *ExtractionResult `json:"ocr_result,omitempty"`
}
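A Python sketch persisting extracted images and reading nested OCR output; using the format string as a file extension is an assumption about its values (e.g. "png", "jpeg"):
from pathlib import Path

for image in result.get("images") or []:
    out = Path(f"page{image.get('page_number')}_img{image['image_index']}.{image['format']}")
    out.write_bytes(image["data"])
    if image.get("ocr_result"):
        print(out, "->", image["ocr_result"]["content"][:80])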
Configuration Types¶
ExtractionConfig¶
Comprehensive extraction pipeline configuration controlling OCR, chunking, image processing, language detection, and all processing features.
Rust¶
pub struct ExtractionConfig {
pub use_cache: bool,
pub enable_quality_processing: bool,
pub ocr: Option<OcrConfig>,
pub force_ocr: bool,
pub chunking: Option<ChunkingConfig>,
pub images: Option<ImageExtractionConfig>,
pub pdf_options: Option<PdfConfig>,
pub token_reduction: Option<TokenReductionConfig>,
pub language_detection: Option<LanguageDetectionConfig>,
pub keywords: Option<KeywordConfig>,
pub postprocessor: Option<PostProcessorConfig>,
pub max_concurrent_extractions: Option<usize>,
}
Python¶
@dataclass
class ExtractionConfig:
use_cache: bool = True
enable_quality_processing: bool = True
ocr: OcrConfig | None = None
force_ocr: bool = False
chunking: ChunkingConfig | None = None
images: ImageExtractionConfig | None = None
pdf_options: PdfConfig | None = None
token_reduction: TokenReductionConfig | None = None
language_detection: LanguageDetectionConfig | None = None
keywords: KeywordConfig | None = None
postprocessor: PostProcessorConfig | None = None
max_concurrent_extractions: int | None = None
TypeScript¶
export interface ExtractionConfig {
useCache?: boolean;
enableQualityProcessing?: boolean;
ocr?: OcrConfig;
forceOcr?: boolean;
chunking?: ChunkingConfig;
images?: ImageExtractionConfig;
pdfOptions?: PdfConfig;
tokenReduction?: TokenReductionConfig;
languageDetection?: LanguageDetectionConfig;
keywords?: KeywordConfig;
postprocessor?: PostProcessorConfig;
maxConcurrentExtractions?: number;
}
Java¶
public record ExtractionConfig(
boolean useCache,
boolean enableQualityProcessing,
Optional<OcrConfig> ocr,
boolean forceOcr,
Optional<ChunkingConfig> chunking,
Optional<ImageExtractionConfig> images,
Optional<PdfConfig> pdfOptions,
Optional<TokenReductionConfig> tokenReduction,
Optional<LanguageDetectionConfig> languageDetection,
Optional<KeywordConfig> keywords,
Optional<PostProcessorConfig> postprocessor,
Optional<Integer> maxConcurrentExtractions
) {}
Go¶
type ExtractionConfig struct {
UseCache bool
EnableQualityProcessing bool
OCR *OcrConfig
ForceOCR bool
Chunking *ChunkingConfig
Images *ImageExtractionConfig
PDFOptions *PdfConfig
TokenReduction *TokenReductionConfig
LanguageDetection *LanguageDetectionConfig
Keywords *KeywordConfig
PostProcessor *PostProcessorConfig
MaxConcurrentExtractions *int
}
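A composition sketch in Python using the dataclasses above; the import path and the way the config is passed to an extraction call are assumptions, since this reference documents types only:
from kreuzberg import ExtractionConfig, OcrConfig, ChunkingConfig  # import path is an assumption

config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="deu"),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    force_ocr=False,
)
result = extract_file_sync("scan.pdf", config=config)  # call signature is an assumption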
OcrConfig¶
OCR engine selection and language configuration for Tesseract, EasyOCR, and PaddleOCR backends.
Rust¶
pub struct OcrConfig {
pub backend: String, // "tesseract", "easyocr", "paddleocr"
pub language: String, // e.g., "eng", "deu", "fra"
pub tesseract_config: Option<TesseractConfig>,
}
Python¶
@dataclass
class OcrConfig:
backend: str = "tesseract"
language: str = "eng"
tesseract_config: TesseractConfig | None = None
TypeScript¶
export interface OcrConfig {
backend: string;
language?: string;
tesseractConfig?: TesseractConfig;
}
Java¶
public record OcrConfig(
String backend,
String language,
Optional<TesseractConfig> tesseractConfig
) {}
Go¶
type OcrConfig struct {
Backend string
Language string
TesseractConfig *TesseractConfig
}
TesseractConfig¶
Advanced Tesseract OCR engine parameters including page segmentation mode, preprocessing, table detection, and character whitelisting/blacklisting.
Rust¶
pub struct TesseractConfig {
pub language: String,
pub psm: i32, // Page Segmentation Mode (0-13)
pub output_format: String, // "text" or "markdown"
pub oem: i32, // OCR Engine Mode (0-3)
pub min_confidence: f64,
pub preprocessing: Option<ImagePreprocessingConfig>,
pub enable_table_detection: bool,
pub table_min_confidence: f64,
pub table_column_threshold: i32,
pub table_row_threshold_ratio: f64,
pub use_cache: bool,
pub classify_use_pre_adapted_templates: bool,
pub language_model_ngram_on: bool,
pub tessedit_dont_blkrej_good_wds: bool,
pub tessedit_dont_rowrej_good_wds: bool,
pub tessedit_enable_dict_correction: bool,
pub tessedit_char_whitelist: String,
pub tessedit_char_blacklist: String,
pub tessedit_use_primary_params_model: bool,
pub textord_space_size_is_variable: bool,
pub thresholding_method: bool,
}
ChunkingConfig¶
Text chunking configuration for RAG pipelines with character limits, overlap control, and optional embedding generation.
Rust¶
pub struct ChunkingConfig {
pub max_chars: usize,
pub max_overlap: usize,
pub embedding: Option<EmbeddingConfig>,
pub preset: Option<String>,
}
Python¶
@dataclass
class ChunkingConfig:
max_chars: int = 1000
max_overlap: int = 200
embedding: EmbeddingConfig | None = None
preset: str | None = None
TypeScript¶
export interface ChunkingConfig {
maxChars?: number;
maxOverlap?: number;
embedding?: EmbeddingConfig;
preset?: string;
}
Java¶
public record ChunkingConfig(
int maxChars,
int maxOverlap,
Optional<EmbeddingConfig> embedding,
Optional<String> preset
) {}
Go¶
type ChunkingConfig struct {
MaxChars int
MaxOverlap int
Embedding *EmbeddingConfig
Preset *string
}
EmbeddingConfig¶
Vector embedding configuration supporting FastEmbed models with normalization, batch processing, and custom model selection.
Rust¶
pub struct EmbeddingConfig {
pub model: EmbeddingModelType,
pub normalize: bool,
pub batch_size: usize,
pub show_download_progress: bool,
pub cache_dir: Option<PathBuf>,
}
pub enum EmbeddingModelType {
Preset { name: String },
FastEmbed { model: String, dimensions: usize },
Custom { model_id: String, dimensions: usize },
}
Python¶
@dataclass
class EmbeddingConfig:
model: EmbeddingModelType = field(default_factory=lambda: Preset("balanced"))
normalize: bool = True
batch_size: int = 32
show_download_progress: bool = False
cache_dir: Path | None = None
@dataclass
class EmbeddingModelType:
# Discriminated union: Preset, FastEmbed, or Custom model types
pass
TypeScript¶
export interface EmbeddingConfig {
model: EmbeddingModelType;
normalize?: boolean;
batchSize?: number;
showDownloadProgress?: boolean;
cacheDir?: string;
}
export type EmbeddingModelType =
| { type: "preset"; name: string }
| { type: "fastembed"; model: string; dimensions: number }
| { type: "custom"; modelId: string; dimensions: number };
ImageExtractionConfig¶
Image extraction and preprocessing settings including DPI targeting, dimension limits, and automatic DPI adjustment for OCR quality.
Rust¶
pub struct ImageExtractionConfig {
pub extract_images: bool,
pub target_dpi: i32,
pub max_image_dimension: i32,
pub auto_adjust_dpi: bool,
pub min_dpi: i32,
pub max_dpi: i32,
}
Python¶
@dataclass
class ImageExtractionConfig:
extract_images: bool = True
target_dpi: int = 300
max_image_dimension: int = 4096
auto_adjust_dpi: bool = True
min_dpi: int = 72
max_dpi: int = 600
TypeScript¶
export interface ImageExtractionConfig {
extractImages?: boolean;
targetDpi?: number;
maxImageDimension?: number;
autoAdjustDpi?: boolean;
minDpi?: number;
maxDpi?: number;
}
Java¶
public record ImageExtractionConfig(
boolean extractImages,
int targetDpi,
int maxImageDimension,
boolean autoAdjustDpi,
int minDpi,
int maxDpi
) {}
Go¶
type ImageExtractionConfig struct {
ExtractImages bool
TargetDPI int32
MaxImageDimension int32
AutoAdjustDPI bool
MinDPI int32
MaxDPI int32
}
PdfConfig¶
PDF-specific extraction options including image extraction control, password support for encrypted PDFs, and metadata extraction flags.
Rust¶
pub struct PdfConfig {
pub extract_images: bool,
pub passwords: Option<Vec<String>>,
pub extract_metadata: bool,
}
Python¶
@dataclass
class PdfConfig:
extract_images: bool = False
passwords: list[str] | None = None
extract_metadata: bool = True
TypeScript¶
export interface PdfConfig {
extractImages?: boolean;
passwords?: string[];
extractMetadata?: boolean;
}
Ruby¶
class Kreuzberg::Config::PdfConfig
attr_accessor :extract_images, :passwords, :extract_metadata
end
Java¶
public final class PdfConfig {
private final boolean extractImages;
private final List<String> passwords;
private final boolean extractMetadata;
public static Builder builder() { }
}
Go¶
type PdfConfig struct {
ExtractImages *bool `json:"extract_images,omitempty"`
Passwords []string `json:"passwords,omitempty"`
ExtractMetadata *bool `json:"extract_metadata,omitempty"`
}
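A Python sketch for password-protected PDFs; the passwords shown are hypothetical, and whether each entry is tried in turn is not specified in this reference:
config = ExtractionConfig(
    pdf_options=PdfConfig(
        passwords=["first-guess", "second-guess"],  # hypothetical passwords
        extract_images=True,
        extract_metadata=True,
    )
)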
TokenReductionConfig¶
Token reduction settings for output optimization while preserving semantically important words and phrases.
Rust¶
pub struct TokenReductionConfig {
pub mode: String,
pub preserve_important_words: bool,
}
Python¶
@dataclass
class TokenReductionConfig:
mode: str = "off"
preserve_important_words: bool = True
TypeScript¶
export interface TokenReductionConfig {
mode?: string;
preserveImportantWords?: boolean;
}
Ruby¶
class Kreuzberg::Config::TokenReductionConfig
attr_accessor :mode, :preserve_important_words
end
Java¶
public final class TokenReductionConfig {
private final String mode;
private final boolean preserveImportantWords;
public static Builder builder() { }
}
Go¶
type TokenReductionConfig struct {
Mode string `json:"mode,omitempty"`
PreserveImportantWords *bool `json:"preserve_important_words,omitempty"`
}
LanguageDetectionConfig¶
Automatic language identification configuration with confidence thresholds and multi-language detection support.
Rust¶
pub struct LanguageDetectionConfig {
pub enabled: bool,
pub min_confidence: f64,
pub detect_multiple: bool,
}
Python¶
@dataclass
class LanguageDetectionConfig:
enabled: bool = True
min_confidence: float = 0.8
detect_multiple: bool = False
TypeScript¶
export interface LanguageDetectionConfig {
enabled?: boolean;
minConfidence?: number;
detectMultiple?: boolean;
}
Ruby¶
class Kreuzberg::Config::LanguageDetectionConfig
attr_accessor :enabled, :min_confidence, :detect_multiple
end
Java¶
public final class LanguageDetectionConfig {
private final boolean enabled;
private final double minConfidence;
private final boolean detectMultiple;
public static Builder builder() { }
}
Go¶
type LanguageDetectionConfig struct {
Enabled *bool `json:"enabled,omitempty"`
MinConfidence *float64 `json:"min_confidence,omitempty"`
DetectMultiple *bool `json:"detect_multiple,omitempty"`
}
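A Python sketch turning on multi-language detection and reading the result; the exact language-code format returned in detected_languages is not specified here:
config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(
        enabled=True,
        min_confidence=0.7,
        detect_multiple=True,
    )
)
result = extract_file_sync("letter.docx", config=config)  # call is an assumption
print(result["detected_languages"])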
KeywordConfig¶
Automatic keyword and keyphrase extraction using YAKE or RAKE algorithms with configurable scoring, n-gram ranges, and language support.
Rust¶
pub struct KeywordConfig {
pub algorithm: KeywordAlgorithm,
pub max_keywords: usize,
pub min_score: f32,
pub ngram_range: (usize, usize),
pub language: Option<String>,
pub yake_params: Option<YakeParams>,
pub rake_params: Option<RakeParams>,
}
Python¶
@dataclass
class YakeParams:
window_size: int = 2
@dataclass
class RakeParams:
min_word_length: int = 1
max_words_per_phrase: int = 3
@dataclass
class KeywordConfig:
algorithm: str = "yake"
max_keywords: int = 10
min_score: float = 0.0
ngram_range: tuple[int, int] = (1, 3)
language: str | None = "en"
yake_params: YakeParams | None = None
rake_params: RakeParams | None = None
TypeScript¶
export interface YakeParams {
windowSize?: number;
}
export interface RakeParams {
minWordLength?: number;
maxWordsPerPhrase?: number;
}
export interface KeywordConfig {
algorithm?: KeywordAlgorithm;
maxKeywords?: number;
minScore?: number;
ngramRange?: [number, number];
language?: string;
yakeParams?: YakeParams;
rakeParams?: RakeParams;
}
Ruby¶
class Kreuzberg::Config::KeywordConfig
attr_accessor :algorithm, :max_keywords, :min_score,
:ngram_range, :language, :yake_params, :rake_params
end
Java¶
public final class KeywordConfig {
private final String algorithm;
private final Integer maxKeywords;
private final Double minScore;
private final int[] ngramRange;
private final String language;
private final YakeParams yakeParams;
private final RakeParams rakeParams;
public static Builder builder() { }
public static final class YakeParams { }
public static final class RakeParams { }
}
Go¶
type YakeParams struct {
WindowSize *int `json:"window_size,omitempty"`
}
type RakeParams struct {
MinWordLength *int `json:"min_word_length,omitempty"`
MaxWordsPerPhrase *int `json:"max_words_per_phrase,omitempty"`
}
type KeywordConfig struct {
Algorithm string `json:"algorithm,omitempty"`
MaxKeywords *int `json:"max_keywords,omitempty"`
MinScore *float64 `json:"min_score,omitempty"`
NgramRange *[2]int `json:"ngram_range,omitempty"`
Language *string `json:"language,omitempty"`
Yake *YakeParams `json:"yake_params,omitempty"`
Rake *RakeParams `json:"rake_params,omitempty"`
}
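A Python configuration sketch for YAKE keyword extraction; where the extracted keywords surface in ExtractionResult is not documented in this reference, so only the configuration side is shown:
config = ExtractionConfig(
    keywords=KeywordConfig(
        algorithm="yake",
        max_keywords=15,
        min_score=0.0,
        ngram_range=(1, 2),
        language="en",
        yake_params=YakeParams(window_size=2),
    )
)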
ImagePreprocessingMetadata¶
Image preprocessing transformation log tracking original and final DPI, scaling factors, dimension changes, and any processing errors.
Rust¶
pub struct ImagePreprocessingMetadata {
pub original_dimensions: (usize, usize),
pub original_dpi: (f64, f64),
pub target_dpi: i32,
pub scale_factor: f64,
pub auto_adjusted: bool,
pub final_dpi: i32,
pub new_dimensions: Option<(usize, usize)>,
pub resample_method: String,
pub dimension_clamped: bool,
pub calculated_dpi: Option<i32>,
pub skipped_resize: bool,
pub resize_error: Option<String>,
}
Python¶
class ImagePreprocessingMetadata(TypedDict, total=False):
original_dimensions: tuple[int, int]
original_dpi: tuple[float, float]
target_dpi: int
scale_factor: float
auto_adjusted: bool
final_dpi: int
new_dimensions: tuple[int, int] | None
resample_method: str
dimension_clamped: bool
calculated_dpi: int | None
skipped_resize: bool
resize_error: str | None
TypeScript¶
export interface ImagePreprocessingMetadata {
originalDimensions?: [number, number];
originalDpi?: [number, number];
targetDpi?: number;
scaleFactor?: number;
autoAdjusted?: boolean;
finalDpi?: number;
newDimensions?: [number, number] | null;
resampleMethod?: string;
dimensionClamped?: boolean;
calculatedDpi?: number | null;
skippedResize?: boolean;
resizeError?: string | null;
}
Ruby¶
class Kreuzberg::Result::ImagePreprocessingMetadata
attr_reader :original_dimensions, :original_dpi, :target_dpi, :scale_factor,
:auto_adjusted, :final_dpi, :new_dimensions, :resample_method,
:dimension_clamped, :calculated_dpi, :skipped_resize, :resize_error
end
Java¶
public record ImagePreprocessingMetadata(
int[] originalDimensions,
double[] originalDpi,
int targetDpi,
double scaleFactor,
boolean autoAdjusted,
int finalDpi,
Optional<int[]> newDimensions,
String resampleMethod,
boolean dimensionClamped,
Optional<Integer> calculatedDpi,
boolean skippedResize,
Optional<String> resizeError
) {}
Go¶
type ImagePreprocessingMetadata struct {
OriginalDimensions [2]int `json:"original_dimensions"`
OriginalDPI [2]float64 `json:"original_dpi"`
TargetDPI int `json:"target_dpi"`
ScaleFactor float64 `json:"scale_factor"`
AutoAdjusted bool `json:"auto_adjusted"`
FinalDPI int `json:"final_dpi"`
NewDimensions *[2]int `json:"new_dimensions,omitempty"`
ResampleMethod string `json:"resample_method"`
DimensionClamped bool `json:"dimension_clamped"`
CalculatedDPI *int `json:"calculated_dpi,omitempty"`
SkippedResize bool `json:"skipped_resize"`
ResizeError *string `json:"resize_error,omitempty"`
}
ImagePreprocessingConfig¶
Image preprocessing configuration for OCR quality enhancement including rotation, deskewing, denoising, contrast adjustment, and binarization methods.
Rust¶
pub struct ImagePreprocessingConfig {
pub target_dpi: i32,
pub auto_rotate: bool,
pub deskew: bool,
pub denoise: bool,
pub contrast_enhance: bool,
pub binarization_method: String,
pub invert_colors: bool,
}
Python¶
@dataclass
class ImagePreprocessingConfig:
target_dpi: int = 300
auto_rotate: bool = True
deskew: bool = True
denoise: bool = False
contrast_enhance: bool = False
binarization_method: str = "otsu"
invert_colors: bool = False
TypeScript¶
export interface ImagePreprocessingConfig {
targetDpi?: number;
autoRotate?: boolean;
deskew?: boolean;
denoise?: boolean;
contrastEnhance?: boolean;
binarizationMethod?: string;
invertColors?: boolean;
}
Ruby¶
class Kreuzberg::Config::ImagePreprocessingConfig
attr_accessor :target_dpi, :auto_rotate, :deskew, :denoise,
:contrast_enhance, :binarization_method, :invert_colors
end
Java¶
public final class ImagePreprocessingConfig {
private final int targetDpi;
private final boolean autoRotate;
private final boolean deskew;
private final boolean denoise;
private final boolean contrastEnhance;
private final String binarizationMethod;
private final boolean invertColors;
public static Builder builder() { }
}
Go¶
type ImagePreprocessingConfig struct {
TargetDPI *int `json:"target_dpi,omitempty"`
AutoRotate *bool `json:"auto_rotate,omitempty"`
Deskew *bool `json:"deskew,omitempty"`
Denoise *bool `json:"denoise,omitempty"`
ContrastEnhance *bool `json:"contrast_enhance,omitempty"`
BinarizationMode string `json:"binarization_method,omitempty"`
InvertColors *bool `json:"invert_colors,omitempty"`
}
ErrorMetadata¶
Error information captured during batch operations providing error type classification and detailed error messages.
Go¶
type ErrorMetadata struct {
ErrorType string `json:"error_type"`
Message string `json:"message"`
}
XmlMetadata¶
XML document structure statistics including total element count and unique element type inventory.
Rust¶
pub struct XmlMetadata {
pub element_count: usize,
pub unique_elements: Vec<String>,
}
Python¶
class XmlMetadata(TypedDict, total=False):
element_count: int
unique_elements: list[str]
Ruby¶
class Kreuzberg::Result::XmlMetadata
attr_reader :element_count, :unique_elements
end
Go¶
type XmlMetadata struct {
ElementCount int `json:"element_count"`
UniqueElements []string `json:"unique_elements"`
}
PostProcessorConfig¶
Post-processing pipeline control allowing selective enabling or disabling of individual text processors.
Rust¶
pub struct PostProcessorConfig {
pub enabled: bool,
pub enabled_processors: Option<Vec<String>>,
pub disabled_processors: Option<Vec<String>>,
}
Python¶
@dataclass
class PostProcessorConfig:
enabled: bool = True
enabled_processors: list[str] | None = None
disabled_processors: list[str] | None = None
TypeScript¶
export interface PostProcessorConfig {
enabled?: boolean;
enabledProcessors?: string[];
disabledProcessors?: string[];
}
Ruby¶
class Kreuzberg::Config::PostProcessorConfig
attr_accessor :enabled, :enabled_processors, :disabled_processors
end
Java¶
public final class PostProcessorConfig {
private final boolean enabled;
private final List<String> enabledProcessors;
private final List<String> disabledProcessors;
public static Builder builder() { }
}
Go¶
type PostProcessorConfig struct {
Enabled *bool `json:"enabled,omitempty"`
EnabledProcessors []string `json:"enabled_processors,omitempty"`
DisabledProcessors []string `json:"disabled_processors,omitempty"`
}
HierarchyConfig¶
Document hierarchy detection configuration controlling font size clustering and hierarchy level assignment. Extracts document structure (H1-H6 headings and body text) by analyzing font sizes and spatial positioning of text blocks.
Rust¶
pub struct HierarchyConfig {
/// Enable hierarchy extraction
pub enabled: bool,
/// Number of font size clusters to use for hierarchy levels (1-7)
/// Default: 6 (provides H1-H6 heading levels with body text)
pub k_clusters: usize,
/// Include bounding box information in hierarchy blocks
pub include_bbox: bool,
/// OCR coverage threshold for smart OCR triggering (0.0-1.0)
/// Default: 0.5 (trigger OCR if less than 50% of page has text)
pub ocr_coverage_threshold: Option<f32>,
}
Python¶
class HierarchyConfig:
"""Hierarchy detection configuration for document structure analysis."""
def __init__(
self,
enabled: bool = True,
k_clusters: int = 6,
include_bbox: bool = True,
ocr_coverage_threshold: float | None = None
):
self.enabled = enabled
self.k_clusters = k_clusters
self.include_bbox = include_bbox
self.ocr_coverage_threshold = ocr_coverage_threshold
TypeScript¶
export interface HierarchyConfig {
/** Enable hierarchy extraction. Default: true. */
enabled?: boolean;
/** Number of font size clusters (2-10). Default: 6. */
kClusters?: number;
/** Include bounding box information. Default: true. */
includeBbox?: boolean;
/** OCR coverage threshold (0.0-1.0). Default: null. */
ocrCoverageThreshold?: number | null;
}
Ruby¶
class Kreuzberg::Config::Hierarchy
attr_reader :enabled, :k_clusters, :include_bbox, :ocr_coverage_threshold
def initialize(
enabled: true,
k_clusters: 6,
include_bbox: true,
ocr_coverage_threshold: nil
)
@enabled = enabled
@k_clusters = k_clusters
@include_bbox = include_bbox
@ocr_coverage_threshold = ocr_coverage_threshold
end
end
Java¶
public final class HierarchyConfig {
private final boolean enabled;
private final int kClusters;
private final boolean includeBbox;
private final Double ocrCoverageThreshold;
public static Builder builder() {
return new Builder();
}
public boolean isEnabled() { return enabled; }
public int getKClusters() { return kClusters; }
public boolean isIncludeBbox() { return includeBbox; }
public Double getOcrCoverageThreshold() { return ocrCoverageThreshold; }
public static final class Builder {
private boolean enabled = true;
private int kClusters = 6;
private boolean includeBbox = true;
private Double ocrCoverageThreshold;
public Builder enabled(boolean enabled) { ... }
public Builder kClusters(int kClusters) { ... }
public Builder includeBbox(boolean includeBbox) { ... }
public Builder ocrCoverageThreshold(Double threshold) { ... }
public HierarchyConfig build() { ... }
}
}
Go¶
// HierarchyConfig controls PDF hierarchy extraction based on font sizes.
type HierarchyConfig struct {
// Enable hierarchy extraction. Default: true.
Enabled *bool `json:"enabled,omitempty"`
// Number of font size clusters (2-10). Default: 6.
KClusters *int `json:"k_clusters,omitempty"`
// Include bounding box information. Default: true.
IncludeBbox *bool `json:"include_bbox,omitempty"`
// OCR coverage threshold (0.0-1.0). Default: null.
OcrCoverageThreshold *float64 `json:"ocr_coverage_threshold,omitempty"`
}
C#¶
public sealed class HierarchyConfig
{
/// <summary>
/// Whether hierarchy detection is enabled.
/// </summary>
[JsonPropertyName("enabled")]
public bool? Enabled { get; set; }
/// <summary>
/// Number of k clusters for hierarchy detection.
/// </summary>
[JsonPropertyName("k_clusters")]
public int? KClusters { get; set; }
/// <summary>
/// Whether to include bounding box information in hierarchy output.
/// </summary>
[JsonPropertyName("include_bbox")]
public bool? IncludeBbox { get; set; }
/// <summary>
/// OCR coverage threshold for hierarchy detection (0.0-1.0).
/// </summary>
[JsonPropertyName("ocr_coverage_threshold")]
public float? OcrCoverageThreshold { get; set; }
}
Fields:
enabled: Enable or disable hierarchy extraction (Default: true)
k_clusters: Number of font size clusters for hierarchy classification (Range: 2-10, Default: 6)
- 6 clusters map to H1-H6 heading levels plus body text
- Larger values create finer-grained hierarchy distinctions
- Smaller values group more font sizes together
include_bbox: Include bounding box coordinates in output (Default: true)
- When true, each block includes left, top, right, bottom coordinates in PDF units
- When false, reduces output size but loses spatial positioning information
ocr_coverage_threshold: Trigger OCR when text coverage falls below the threshold (Range: 0.0-1.0, Default: null)
- 0.5 = OCR triggers if less than 50% of the page has extractable text
- null = OCR triggering is controlled by other config settings
- Useful for detecting scanned or image-heavy documents
Example Usage:
// Rust
use kreuzberg::core::config::HierarchyConfig;
let hierarchy = HierarchyConfig {
enabled: true,
k_clusters: 6,
include_bbox: true,
ocr_coverage_threshold: Some(0.5),
};
# Python
from kreuzberg import HierarchyConfig, ExtractionConfig, PdfConfig
hierarchy = HierarchyConfig(
enabled=True,
k_clusters=6,
include_bbox=True,
ocr_coverage_threshold=0.5
)
pdf_config = PdfConfig(hierarchy=hierarchy)
config = ExtractionConfig(pdf_options=pdf_config)
// TypeScript
const hierarchyConfig: HierarchyConfig = {
enabled: true,
kClusters: 6,
includeBbox: true,
ocrCoverageThreshold: 0.5
};
const pdfConfig: PdfConfig = {
hierarchy: hierarchyConfig
};
// Java
HierarchyConfig hierarchyConfig = HierarchyConfig.builder()
.enabled(true)
.kClusters(6)
.includeBbox(true)
.ocrCoverageThreshold(0.5)
.build();
PdfConfig pdfConfig = PdfConfig.builder()
.hierarchy(hierarchyConfig)
.build();
// Go
hierarchyConfig := &kreuzberg.HierarchyConfig{
Enabled: kreuzberg.BoolPtr(true),
KClusters: kreuzberg.IntPtr(6),
IncludeBbox: kreuzberg.BoolPtr(true),
OcrCoverageThreshold: kreuzberg.FloatPtr(0.5),
}
pdfConfig := &kreuzberg.PdfConfig{
Hierarchy: hierarchyConfig,
}
PageHierarchy¶
Output structure containing extracted document hierarchy with text blocks and their hierarchy levels. Returned in extraction results when hierarchy extraction is enabled.
Rust¶
pub struct PageHierarchy {
/// Total number of hierarchy blocks extracted from the page
pub block_count: usize,
/// Array of hierarchical text blocks ordered by document position
pub blocks: Vec<HierarchicalBlock>,
}
Python¶
class PageHierarchy(TypedDict):
"""Document hierarchy structure with text blocks and levels."""
block_count: int
blocks: list[HierarchicalBlock]
TypeScript¶
export interface PageHierarchy {
/** Total number of hierarchy blocks extracted from the page */
blockCount: number;
/** Array of hierarchical text blocks ordered by document position */
blocks: HierarchicalBlock[];
}
Ruby¶
Java¶
Go¶
type PageHierarchy struct {
BlockCount int `json:"block_count"`
Blocks []HierarchicalBlock `json:"blocks"`
}
C#¶
public record PageHierarchy
{
[JsonPropertyName("block_count")]
public required int BlockCount { get; init; }
[JsonPropertyName("blocks")]
public required List<HierarchicalBlock> Blocks { get; init; }
}
Fields:
block_count: Total number of text blocks in the hierarchy (useful for batch processing)
blocks: Array of HierarchicalBlock objects in document order (top-to-bottom, left-to-right)
HierarchicalBlock¶
A single text block with assigned hierarchy level and spatial information. Represents a unit of text (heading or body paragraph) within the document structure.
Rust¶
#[derive(Debug, Clone)]
pub struct HierarchicalBlock {
/// The text content of this block
pub text: String,
/// Hierarchy level: "h1", "h2", "h3", "h4", "h5", "h6", or "body"
pub level: HierarchyLevel,
/// Font size in points (derived from PDF or OCR)
pub font_size: f32,
/// Bounding box coordinates in PDF units (if include_bbox=true)
/// Format: (left, top, right, bottom)
pub bbox: Option<BoundingBox>,
/// Index position of this block in the blocks array
pub block_index: usize,
}
pub enum HierarchyLevel {
H1 = 1,
H2 = 2,
H3 = 3,
H4 = 4,
H5 = 5,
H6 = 6,
Body = 0,
}
pub struct BoundingBox {
pub left: f32,
pub top: f32,
pub right: f32,
pub bottom: f32,
}
Python¶
class HierarchicalBlock(TypedDict, total=False):
"""A text block with hierarchy level assignment."""
text: str
level: Literal["h1", "h2", "h3", "h4", "h5", "h6", "body"]
font_size: float
bbox: tuple[float, float, float, float] | None
block_index: int
TypeScript¶
export interface HierarchicalBlock {
/** The text content of this block */
text: string;
/** Hierarchy level: "h1" through "h6" or "body" */
level: "h1" | "h2" | "h3" | "h4" | "h5" | "h6" | "body";
/** Font size in points */
fontSize: number;
/** Bounding box [left, top, right, bottom] in PDF coordinates, or null */
bbox?: [number, number, number, number] | null;
/** Index position of this block in the blocks array */
blockIndex: number;
}
Ruby¶
class Kreuzberg::Result::HierarchicalBlock
attr_reader :text, :level, :font_size, :bbox, :block_index
end
Java¶
public record HierarchicalBlock(
String text,
String level, // "h1", "h2", ..., "h6", "body"
float fontSize,
Optional<BoundingBox> bbox,
int blockIndex
) {}
public record BoundingBox(
float left,
float top,
float right,
float bottom
) {}
Go¶
type HierarchicalBlock struct {
Text string `json:"text"`
Level string `json:"level"` // "h1", "h2", ..., "h6", "body"
FontSize float32 `json:"font_size"`
Bbox *BoundingBox `json:"bbox,omitempty"`
BlockIndex int `json:"block_index"`
}
type BoundingBox struct {
Left float32 `json:"left"`
Top float32 `json:"top"`
Right float32 `json:"right"`
Bottom float32 `json:"bottom"`
}
C#¶
public record HierarchicalBlock
{
[JsonPropertyName("text")]
public required string Text { get; init; }
[JsonPropertyName("level")]
public required string Level { get; init; } // "h1", "h2", ..., "h6", "body"
[JsonPropertyName("font_size")]
public required float FontSize { get; init; }
[JsonPropertyName("bbox")]
public BoundingBox? Bbox { get; init; }
[JsonPropertyName("block_index")]
public required int BlockIndex { get; init; }
}
public record BoundingBox
{
[JsonPropertyName("left")]
public required float Left { get; init; }
[JsonPropertyName("top")]
public required float Top { get; init; }
[JsonPropertyName("right")]
public required float Right { get; init; }
[JsonPropertyName("bottom")]
public required float Bottom { get; init; }
}
Fields:
text: Complete text content of the block (normalized and trimmed)
- Whitespace is collapsed for consistency
- Preserves original character content
- Empty strings are included in output
level: Hierarchy level classification
- "h1" through "h6": Heading levels assigned by font size clustering
- "body": Body text or smaller headings (cluster 6+)
- Assignment is based on font size centroid similarity
font_size: Average font size of text in this block (in points)
- Derived from PDF font metrics or OCR output
- Used internally for hierarchy level assignment
- Useful for downstream styling or filtering
bbox: Bounding box in PDF coordinate system (optional); see the sketch below
- Format: [left, top, right, bottom] in PDF units
- Top-left origin (0,0), Y increases downward
- null when include_bbox=false in config
- Enables precise text positioning, highlighting, or spatial queries
- Coordinates are in points (1/72 inch)
block_index: Zero-indexed position in the blocks array
- Useful for document position tracking
- Enables block-to-source mapping
- Matches order in extraction output
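To make the coordinate semantics concrete, the Python sketch below keeps only blocks whose top edge falls in the upper half of a US Letter page. The page-height constant is an assumption for illustration; the real height comes from the PDF itself.
PAGE_HEIGHT_PT = 792.0  # assumed US Letter height in points (1 pt = 1/72 inch)

def blocks_in_top_half(blocks: list[dict]) -> list[dict]:
    """Keep blocks whose bbox top coordinate lies in the upper half of the page."""
    return [
        b for b in blocks
        if b.get("bbox") and b["bbox"][1] < PAGE_HEIGHT_PT / 2  # bbox = [left, top, right, bottom]
    ]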
Hierarchy Level Assignment Algorithm:
1. Extract font sizes from all text blocks
2. Apply K-means clustering with k = k_clusters
3. Sort clusters by centroid size (descending)
4. Map clusters to hierarchy levels:
- Cluster 0 (largest font) → H1
- Cluster 1 → H2
- Cluster 2 → H3
- Cluster 3 → H4
- Cluster 4 → H5
- Cluster 5 → H6
- Cluster 6+ (smallest font) → Body
5. Assign each block the level of the cluster whose centroid is closest to its font size (a sketch of this scheme follows below)
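The Python sketch below illustrates this scheme with a plain 1-D k-means; it is a simplified stand-in for the library's internal clustering, not its actual implementation.
def assign_levels(font_sizes: list[float], k: int = 6, iters: int = 20) -> list[str]:
    """Cluster font sizes with a simple 1-D k-means and map clusters to h1..h6 or body."""
    lo, hi = min(font_sizes), max(font_sizes)
    # Spread initial centroids evenly across the observed font-size range.
    centroids = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]

    def nearest(size: float) -> int:
        return min(range(k), key=lambda i: abs(size - centroids[i]))

    for _ in range(iters):
        clusters: list[list[float]] = [[] for _ in range(k)]
        for size in font_sizes:
            clusters[nearest(size)].append(size)
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]

    # Rank clusters by centroid size, descending: rank 0 -> H1, ..., rank 5 -> H6, rank 6+ -> Body.
    rank = {idx: pos for pos, idx in enumerate(sorted(range(k), key=lambda i: -centroids[i]))}
    return ["h" + str(rank[nearest(s)] + 1) if rank[nearest(s)] < 6 else "body" for s in font_sizes]

print(assign_levels([36.0, 24.0, 18.0, 12.0, 12.0, 11.5]))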
Example Usage:
// Rust
use kreuzberg::types::ExtractionResult;
if let Some(pages) = result.pages {
for page in pages {
if let Some(hierarchy) = &page.hierarchy {
println!("Found {} blocks:", hierarchy.block_count);
for block in &hierarchy.blocks {
println!(" [{:?}] {}", block.level, block.text);
if let Some(bbox) = &block.bbox {
println!(" Position: ({}, {}) to ({}, {})",
bbox.left, bbox.top, bbox.right, bbox.bottom);
}
}
}
}
}
# Python
from kreuzberg import extract_file
result = extract_file('document.pdf')
if result.get('pages'):
for page in result['pages']:
if 'hierarchy' in page:
hierarchy = page['hierarchy']
print(f"Found {hierarchy['block_count']} blocks:")
for block in hierarchy['blocks']:
print(f" [{block['level']}] {block['text']}")
if block.get('bbox'):
left, top, right, bottom = block['bbox']
print(f" Position: ({left}, {top}) to ({right}, {bottom})")
// TypeScript
import { extract } from 'kreuzberg';
const result = await extract('document.pdf');
if (result.pages) {
for (const page of result.pages) {
if (page.hierarchy) {
const { blockCount, blocks } = page.hierarchy;
console.log(`Found ${blockCount} blocks:`);
for (const block of blocks) {
console.log(` [${block.level}] ${block.text}`);
if (block.bbox) {
const [left, top, right, bottom] = block.bbox;
console.log(` Position: (${left}, ${top}) to (${right}, ${bottom})`);
}
}
}
}
}
// Java
ExtractionResult result = kreuzberg.extract(new File("document.pdf"));
if (result.pages() != null) {
for (PageContent page : result.pages()) {
page.hierarchy().ifPresent(hierarchy -> {
System.out.println("Found " + hierarchy.blockCount() + " blocks:");
for (HierarchicalBlock block : hierarchy.blocks()) {
System.out.println(" [" + block.level() + "] " + block.text());
block.bbox().ifPresent(bbox -> {
System.out.printf(" Position: (%.1f, %.1f) to (%.1f, %.1f)%n",
bbox.left(), bbox.top(), bbox.right(), bbox.bottom());
});
}
});
}
}
// Go
result, _ := kreuzberg.Extract("document.pdf", nil)
if result.Pages != nil {
for _, page := range result.Pages {
if page.Hierarchy != nil {
hierarchy := page.Hierarchy
fmt.Printf("Found %d blocks:\n", hierarchy.BlockCount)
for _, block := range hierarchy.Blocks {
fmt.Printf(" [%s] %s\n", block.Level, block.Text)
if block.Bbox != nil {
bbox := block.Bbox
fmt.Printf(" Position: (%.1f, %.1f) to (%.1f, %.1f)\n",
bbox.Left, bbox.Top, bbox.Right, bbox.Bottom)
}
}
}
}
}
Common Use Cases:
- Document Structure Extraction: Build a table of contents from H1-H6 blocks (see the sketch after this list)
- Content Filtering: Extract only body text or headings at specific levels
- Spatial Highlighting: Use bbox coordinates for PDF annotation and visual markup
- Semantic Chunking: Group blocks by hierarchy level for AI processing
- Accessibility: Generate proper HTML semantic structure from hierarchy levels
- Document Analysis: Calculate reading complexity and structure metrics
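As a sketch of the first use case above, the snippet below builds a Markdown-style table of contents from heading blocks, reusing the result shape shown in the Python example earlier in this section.
def build_toc(result: dict) -> str:
    """Render h1-h6 blocks as an indented Markdown list."""
    lines = []
    for page in result.get("pages") or []:
        hierarchy = page.get("hierarchy")
        if not hierarchy:
            continue
        for block in hierarchy["blocks"]:
            level = block["level"]
            if level.startswith("h"):
                indent = "  " * (int(level[1:]) - 1)
                lines.append(f"{indent}- {block['text']}")
    return "\n".join(lines)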
Type Mappings¶
Cross-language type equivalents showing how Kreuzberg types map across Rust, Python, TypeScript, Ruby, Java, and Go:
| Purpose | Rust | Python | TypeScript | Ruby | Java | Go |
|---|---|---|---|---|---|---|
| String | String | str | string | String | String | string |
| Optional/Nullable | Option<T> | T \| None | T \| null | T or nil | Optional<T> | *T |
| Array/List | Vec<T> | list[T] | T[] | Array | List<T> | []T |
| Tuple/Pair | (T, U) | tuple[T, U] | [T, U] | Array | Pair<T,U> | [2]T |
| Dictionary/Map | HashMap<K,V> | dict[K, V] | Record<K, V> | Hash | Map<K, V> | map[K]V |
| Integer | i32, i64, usize | int | number | Integer | int, long | int, int64 |
| Float | f32, f64 | float | number | Float | float, double | float32, float64 |
| Boolean | bool | bool | boolean | Boolean | boolean | bool |
| Bytes | Vec<u8> | bytes | Uint8Array | String (binary) | byte[] | []byte |
| Union/Enum | enum | Literal | union | case statement | sealed class | custom struct |
Nullability and Optionals¶
Language-Specific Optional Field Handling¶
Each language binding uses its idiomatic approach for representing optional and nullable values:
Rust: Uses Option<T> explicitly; None represents absence, and optionality is enforced at compile time.
Python: Uses T | None type hints; values can be None at runtime. TypedDict with total=False makes all fields optional.
TypeScript: Uses T | null or T | undefined; properties marked with ? are optional.
Ruby: Everything is nullable; nil represents absence, with no type-system enforcement.
Java: Uses Optional<T> for explicit optionality, so absence is visible in record signatures at compile time.
Go: Uses pointers (*T) for optional values; nil represents absence. Primitive types cannot be nil, so optional primitives use pointers.
Practical Examples: Accessing Optional Metadata Fields¶
Demonstrating idiomatic null-safe field access patterns across all supported languages:
// Rust: Pattern matching for safe optional field access
if let Some(title) = &metadata.title {
println!("Title: {}", title);
}
# Python: Dictionary-based metadata access with safe get method
if metadata.get("title"):
print(f"Title: {metadata['title']}")
// TypeScript: Nullish coalescing for default values
console.log(metadata.title ?? "No title");
# Ruby: Conditional output with truthy check
puts "Title: #{result.metadata["title"]}" if result.metadata["title"]