HTML Metadata Structure Changes (v4.0)
Summary
HTML metadata has been restructured for better organization and type safety. The changes consolidate individual Open Graph and Twitter Card fields into maps, and convert keywords from a single string to an array.
Breaking Changes
1. Keywords: String to Array
Before (v3.x):
// Option<String> - comma-separated or space-separated
html_meta.keywords // "seo, metadata, html"
After (v4.0):
// Vec<String> - structured array
html_meta.keywords // vec!["seo", "metadata", "html"]
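If you have keyword strings persisted in the v3.x shape (for example in a cache or database), a small converter can bring them into the v4.0 list form. The sketch below is illustrative only: `split_legacy_keywords` is a hypothetical helper, not part of kreuzberg, and it assumes commas are the delimiter whenever one is present, falling back to whitespace otherwise.

```rust
/// Hypothetical helper (not part of kreuzberg): convert a v3.x keywords
/// string into the v4.0 `Vec<String>` shape.
/// Assumption: commas win as the delimiter when present; otherwise the
/// string is split on whitespace.
fn split_legacy_keywords(raw: &str) -> Vec<String> {
    let parts: Vec<&str> = if raw.contains(',') {
        raw.split(',').collect()
    } else {
        raw.split_whitespace().collect()
    };

    parts
        .into_iter()
        .map(str::trim)              // drop padding around each keyword
        .filter(|s| !s.is_empty())   // skip empty entries such as "a,,b"
        .map(String::from)
        .collect()
}
```

Calling `split_legacy_keywords("seo, metadata, html")` yields the same three entries shown in the v4.0 example above.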
2. Canonical URL: Field Rename
Before (v3.x):
html_meta.canonical // Option<String>
After (v4.0):
html_meta.canonical_url // Option<String>
3. Open Graph: Individual Fields to Map
Before (v3.x):
html_meta.og_title // Option<String>
html_meta.og_description // Option<String>
html_meta.og_image // Option<String>
html_meta.og_url // Option<String>
html_meta.og_type // Option<String>
html_meta.og_site_name // Option<String>
After (v4.0):
html_meta.open_graph // BTreeMap<String, String>
html_meta.open_graph.get("title") // Option<&String>
html_meta.open_graph.get("description") // Option<&String>
html_meta.open_graph.get("image") // Option<&String>
html_meta.open_graph.get("url") // Option<&String>
html_meta.open_graph.get("type") // Option<&String>
html_meta.open_graph.get("site_name") // Option<&String>
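For code that still holds data in the v3.x shape, the mapping from individual fields to map keys can be sketched as follows. `LegacyOpenGraph` and `to_open_graph_map` are hypothetical names introduced for illustration; the struct is a local stand-in for the old fields, not a kreuzberg type, and the keys are the ones listed above.

```rust
use std::collections::BTreeMap;

/// Local stand-in for the v3.x Open Graph fields (hypothetical, for illustration).
#[derive(Default)]
struct LegacyOpenGraph {
    og_title: Option<String>,
    og_description: Option<String>,
    og_image: Option<String>,
    og_url: Option<String>,
    og_type: Option<String>,
    og_site_name: Option<String>,
}

/// Build a v4.0-style map, keeping only the fields that were set.
fn to_open_graph_map(legacy: LegacyOpenGraph) -> BTreeMap<String, String> {
    let fields = vec![
        ("title", legacy.og_title),
        ("description", legacy.og_description),
        ("image", legacy.og_image),
        ("url", legacy.og_url),
        ("type", legacy.og_type),
        ("site_name", legacy.og_site_name),
    ];

    fields
        .into_iter()
        .filter_map(|(key, value)| value.map(|v| (key.to_string(), v)))
        .collect()
}
```

Unset fields simply have no entry in the map, so `map.get("image")` returns `None` rather than an empty string.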
4. Twitter Card: Individual Fields to Map
Before (v3.x):
html_meta.twitter_card // Option<String>
html_meta.twitter_title // Option<String>
html_meta.twitter_description // Option<String>
html_meta.twitter_image // Option<String>
html_meta.twitter_site // Option<String>
html_meta.twitter_creator // Option<String>
After (v4.0):
html_meta.twitter_card // BTreeMap<String, String>
html_meta.twitter_card.get("card") // Option<&String>
html_meta.twitter_card.get("title") // Option<&String>
html_meta.twitter_card.get("description") // Option<&String>
html_meta.twitter_card.get("image") // Option<&String>
html_meta.twitter_card.get("site") // Option<&String>
html_meta.twitter_card.get("creator") // Option<&String>
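Because both groups are now plain maps, cross-map lookups become straightforward. As one possible pattern (a sketch, not something kreuzberg does for you), the hypothetical helper below prefers the Twitter Card title and falls back to the Open Graph title when it is absent.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: pick a display title for social previews.
/// Prefers `twitter_card["title"]`, then falls back to `open_graph["title"]`.
fn social_title<'a>(
    twitter_card: &'a BTreeMap<String, String>,
    open_graph: &'a BTreeMap<String, String>,
) -> Option<&'a str> {
    twitter_card
        .get("title")
        .or_else(|| open_graph.get("title"))
        .map(String::as_str)
}
```

The same shape works for `description` and `image`; only the key changes.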
5. Removed Fields
The following link-related fields have been removed:
- link_author
- link_license
- link_alternate
Use the new links field instead for comprehensive link extraction.
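One way to recover the removed values is to filter the links list by `rel`. This is a sketch under a loud assumption: it presumes the entries that used to feed `link_author`, `link_license`, and `link_alternate` appear in `links` with the matching `rel` value, which this guide does not confirm. `LinkMetadata` here is a trimmed local mirror of the struct documented later, and `hrefs_with_rel` is a hypothetical helper.

```rust
/// Trimmed local mirror of the documented LinkMetadata (only the fields used here).
struct LinkMetadata {
    href: String,
    rel: Vec<String>,
}

/// Hypothetical helper: collect the hrefs of all links whose `rel` list
/// contains the given value (compared case-insensitively).
fn hrefs_with_rel<'a>(links: &'a [LinkMetadata], rel: &str) -> Vec<&'a str> {
    links
        .iter()
        .filter(|link| link.rel.iter().any(|r| r.eq_ignore_ascii_case(rel)))
        .map(|link| link.href.as_str())
        .collect()
}
```

Under that assumption, `hrefs_with_rel(&links, "license")` would stand in for the old `link_license` field, and likewise for `"author"` and `"alternate"`.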
6. New Fields
HTML metadata now includes rich metadata about page content:
- language: Document language (e.g., "en", "fr")
- text_direction: Text direction ("ltr", "rtl")
- headers: List of page headers/headings with structured metadata
- links: List of links with detailed metadata and type classification
- images: List of images with alt text, dimensions, and type classification
- structured_data: Parsed JSON-LD, microdata, and RDFa data
- meta_tags: All meta tags as a map
Migration Guide
Rust
Before (v3.x):
use kreuzberg::{extract_file_sync, ExtractionConfig};

let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
if let Some(html_meta) = result.metadata.html {
    // Keywords as single string
    if let Some(keywords) = html_meta.keywords {
        let keyword_vec: Vec<&str> = keywords.split(',').map(|s| s.trim()).collect();
        println!("Keywords: {:?}", keyword_vec);
    }
    // Canonical as separate field
    if let Some(canonical) = html_meta.canonical {
        println!("Canonical: {}", canonical);
    }
    // Open Graph as individual fields
    if let Some(og_title) = html_meta.og_title {
        println!("OG Title: {}", og_title);
    }
    if let Some(og_image) = html_meta.og_image {
        println!("OG Image: {}", og_image);
    }
    // Twitter as individual fields
    if let Some(twitter_card) = html_meta.twitter_card {
        println!("Twitter Card: {}", twitter_card);
    }
}
After (v4.0):
use kreuzberg::{extract_file_sync, ExtractionConfig};

let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
if let Some(html_meta) = result.metadata.html {
    // Keywords as array
    if !html_meta.keywords.is_empty() {
        println!("Keywords: {:?}", html_meta.keywords);
    }
    // Canonical renamed
    if let Some(canonical_url) = html_meta.canonical_url {
        println!("Canonical URL: {}", canonical_url);
    }
    // Open Graph from map
    if let Some(og_title) = html_meta.open_graph.get("title") {
        println!("OG Title: {}", og_title);
    }
    if let Some(og_image) = html_meta.open_graph.get("image") {
        println!("OG Image: {}", og_image);
    }
    // Twitter from map
    if let Some(twitter_card) = html_meta.twitter_card.get("card") {
        println!("Twitter Card: {}", twitter_card);
    }
    // New fields
    if let Some(lang) = &html_meta.language {
        println!("Language: {}", lang);
    }
    // headers is a Vec<HeaderMetadata>, so iterate instead of matching on Option
    for header in &html_meta.headers {
        println!("Header (h{}): {}", header.level, header.text);
    }
    // links is a Vec<LinkMetadata>; each entry exposes href and text
    for link in &html_meta.links {
        println!("Link: {} ({})", link.href, link.text);
    }
}
Python
Before (v3.x):
from kreuzberg import extract_file_sync, ExtractionConfig

result = extract_file_sync("page.html", config=ExtractionConfig())
html_meta = result.metadata.get("html", {})

# Keywords as single string
if html_meta.get('keywords'):
    keyword_list = [k.strip() for k in html_meta['keywords'].split(',')]
    print(f"Keywords: {keyword_list}")

# Canonical as separate field
if html_meta.get('canonical'):
    print(f"Canonical: {html_meta['canonical']}")

# Open Graph as individual fields
if html_meta.get('og_title'):
    print(f"OG Title: {html_meta['og_title']}")
if html_meta.get('og_image'):
    print(f"OG Image: {html_meta['og_image']}")

# Twitter as individual fields
if html_meta.get('twitter_card'):
    print(f"Twitter Card: {html_meta['twitter_card']}")
After (v4.0):
from kreuzberg import extract_file_sync, ExtractionConfig

result = extract_file_sync("page.html", config=ExtractionConfig())
html_meta = result.metadata.get("html", {})

# Keywords as array
if html_meta.get('keywords'):
    print(f"Keywords: {html_meta['keywords']}")

# Canonical renamed
if html_meta.get('canonical_url'):
    print(f"Canonical URL: {html_meta['canonical_url']}")

# Open Graph from map
open_graph = html_meta.get('open_graph', {})
if open_graph.get('title'):
    print(f"OG Title: {open_graph['title']}")
if open_graph.get('image'):
    print(f"OG Image: {open_graph['image']}")

# Twitter from map
twitter_card = html_meta.get('twitter_card', {})
if twitter_card.get('card'):
    print(f"Twitter Card: {twitter_card['card']}")

# New fields
if html_meta.get('language'):
    print(f"Language: {html_meta['language']}")
if html_meta.get('headers'):
    print(f"Headers: {html_meta['headers']}")
# Each link is a LinkMetadata entry with href and text keys, not a (url, text) pair
for link in html_meta.get('links', []):
    print(f"Link: {link['href']} ({link['text']})")
TypeScript
Before (v3.x):
import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('page.html');
const htmlMeta = result.metadata;

// Keywords as single string
if (htmlMeta.keywords) {
  const keywordArray = htmlMeta.keywords.split(',');
  console.log('Keywords:', keywordArray);
}

// Canonical as separate field
if (htmlMeta.canonical) {
  console.log('Canonical:', htmlMeta.canonical);
}

// Open Graph as individual fields
if (htmlMeta.ogTitle) {
  console.log('OG Title:', htmlMeta.ogTitle);
}
if (htmlMeta.ogImage) {
  console.log('OG Image:', htmlMeta.ogImage);
}

// Twitter as individual fields
if (htmlMeta.twitterCard) {
  console.log('Twitter Card:', htmlMeta.twitterCard);
}
After (v4.0):
import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('page.html');
const htmlMeta = result.metadata;

// Keywords as array (a truthy length check also covers undefined)
if (htmlMeta.keywords?.length) {
  console.log('Keywords:', htmlMeta.keywords);
}

// Canonical renamed
if (htmlMeta.canonicalUrl) {
  console.log('Canonical URL:', htmlMeta.canonicalUrl);
}

// Open Graph from map
if (htmlMeta.openGraph) {
  if (htmlMeta.openGraph['title']) {
    console.log('OG Title:', htmlMeta.openGraph['title']);
  }
  if (htmlMeta.openGraph['image']) {
    console.log('OG Image:', htmlMeta.openGraph['image']);
  }
}

// Twitter from map
if (htmlMeta.twitterCard) {
  if (htmlMeta.twitterCard['card']) {
    console.log('Twitter Card:', htmlMeta.twitterCard['card']);
  }
}

// New fields
if (htmlMeta.language) {
  console.log('Language:', htmlMeta.language);
}
if (htmlMeta.headers?.length) {
  console.log('Headers:', htmlMeta.headers);
}
if (htmlMeta.links?.length) {
  // Each link is a LinkMetadata object with href and text, not a [url, text] tuple
  htmlMeta.links.forEach((link) => {
    console.log(`Link: ${link.href} (${link.text})`);
  });
}
Java
Before (v3.x):
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import java.util.Arrays;
import java.util.Map;

ExtractionResult result = Kreuzberg.extractFileSync("page.html");
Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

// Keywords as single string
String keywords = (String) htmlMeta.get("keywords");
if (keywords != null) {
    String[] keywordArray = keywords.split(",");
    System.out.println("Keywords: " + Arrays.toString(keywordArray));
}

// Canonical as separate field
String canonical = (String) htmlMeta.get("canonical");
if (canonical != null) {
    System.out.println("Canonical: " + canonical);
}

// Open Graph as individual fields
String ogTitle = (String) htmlMeta.get("og_title");
if (ogTitle != null) {
    System.out.println("OG Title: " + ogTitle);
}

// Twitter as individual fields
String twitterCard = (String) htmlMeta.get("twitter_card");
if (twitterCard != null) {
    System.out.println("Twitter Card: " + twitterCard);
}
After (v4.0):
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import java.util.Map;
import java.util.List;

ExtractionResult result = Kreuzberg.extractFileSync("page.html");
Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

// Keywords as array
@SuppressWarnings("unchecked")
List<String> keywords = (List<String>) htmlMeta.get("keywords");
if (keywords != null && !keywords.isEmpty()) {
    System.out.println("Keywords: " + keywords);
}

// Canonical renamed
String canonicalUrl = (String) htmlMeta.get("canonical_url");
if (canonicalUrl != null) {
    System.out.println("Canonical URL: " + canonicalUrl);
}

// Open Graph from map
@SuppressWarnings("unchecked")
Map<String, String> openGraph = (Map<String, String>) htmlMeta.get("open_graph");
if (openGraph != null) {
    String ogTitle = openGraph.get("title");
    if (ogTitle != null) {
        System.out.println("OG Title: " + ogTitle);
    }
}

// Twitter from map
@SuppressWarnings("unchecked")
Map<String, String> twitterCard = (Map<String, String>) htmlMeta.get("twitter_card");
if (twitterCard != null) {
    String card = twitterCard.get("card");
    if (card != null) {
        System.out.println("Twitter Card: " + card);
    }
}

// New fields
String language = (String) htmlMeta.get("language");
if (language != null) {
    System.out.println("Language: " + language);
}
// Headers are structured entries (level, text, id, ...), not plain strings
@SuppressWarnings("unchecked")
List<Map<String, Object>> headers = (List<Map<String, Object>>) htmlMeta.get("headers");
if (headers != null && !headers.isEmpty()) {
    System.out.println("Headers: " + headers);
}
Go
Before (v3.x):
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
	result, err := kreuzberg.ExtractFileSync("page.html", nil)
	if err != nil {
		log.Fatalf("extract: %v", err)
	}
	if html, ok := result.Metadata.HTMLMetadata(); ok {
		// Keywords as single string
		if html.Keywords != nil {
			keywordSlice := strings.Split(*html.Keywords, ",")
			fmt.Println("Keywords:", keywordSlice)
		}
		// Canonical as separate field
		if html.Canonical != nil {
			fmt.Println("Canonical:", *html.Canonical)
		}
		// Open Graph as individual fields
		if html.OGTitle != nil {
			fmt.Println("OG Title:", *html.OGTitle)
		}
		if html.OGImage != nil {
			fmt.Println("OG Image:", *html.OGImage)
		}
		// Twitter as individual fields
		if html.TwitterCard != nil {
			fmt.Println("Twitter Card:", *html.TwitterCard)
		}
	}
}
After (v4.0):
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
	result, err := kreuzberg.ExtractFileSync("page.html", nil)
	if err != nil {
		log.Fatalf("extract: %v", err)
	}
	if html, ok := result.Metadata.HTMLMetadata(); ok {
		// Keywords as array
		if len(html.Keywords) > 0 {
			fmt.Println("Keywords:", strings.Join(html.Keywords, ", "))
		}
		// Canonical renamed
		if html.CanonicalURL != nil {
			fmt.Println("Canonical URL:", *html.CanonicalURL)
		}
		// Open Graph from map
		if len(html.OpenGraph) > 0 {
			if ogTitle, ok := html.OpenGraph["title"]; ok {
				fmt.Println("OG Title:", ogTitle)
			}
			if ogImage, ok := html.OpenGraph["image"]; ok {
				fmt.Println("OG Image:", ogImage)
			}
		}
		// Twitter from map
		if len(html.TwitterCard) > 0 {
			if card, ok := html.TwitterCard["card"]; ok {
				fmt.Println("Twitter Card:", card)
			}
		}
		// New fields
		if html.Language != nil {
			fmt.Println("Language:", *html.Language)
		}
		// Headers and links are structured entries, not strings or pairs,
		// so print them with %+v rather than strings.Join or indexing.
		if len(html.Headers) > 0 {
			fmt.Printf("Headers: %+v\n", html.Headers)
		}
		for _, link := range html.Links {
			fmt.Printf("Link: %+v\n", link)
		}
	}
}
Ruby
Before (v3.x):
require 'kreuzberg'

result = Kreuzberg.extract_file_sync('page.html')
html_meta = result.metadata['html']

# Keywords as single string
if html_meta['keywords']
  keyword_array = html_meta['keywords'].split(',').map(&:strip)
  puts "Keywords: #{keyword_array}"
end

# Canonical as separate field
if html_meta['canonical']
  puts "Canonical: #{html_meta['canonical']}"
end

# Open Graph as individual fields
if html_meta['og_title']
  puts "OG Title: #{html_meta['og_title']}"
end
if html_meta['og_image']
  puts "OG Image: #{html_meta['og_image']}"
end

# Twitter as individual fields
if html_meta['twitter_card']
  puts "Twitter Card: #{html_meta['twitter_card']}"
end
After (v4.0):
require 'kreuzberg'

result = Kreuzberg.extract_file_sync('page.html')
html_meta = result.metadata['html']

# Keywords as array
if html_meta['keywords'] && !html_meta['keywords'].empty?
  puts "Keywords: #{html_meta['keywords']}"
end

# Canonical renamed
if html_meta['canonical_url']
  puts "Canonical URL: #{html_meta['canonical_url']}"
end

# Open Graph from map
open_graph = html_meta['open_graph'] || {}
if open_graph['title']
  puts "OG Title: #{open_graph['title']}"
end
if open_graph['image']
  puts "OG Image: #{open_graph['image']}"
end

# Twitter from map
twitter_card = html_meta['twitter_card'] || {}
if twitter_card['card']
  puts "Twitter Card: #{twitter_card['card']}"
end

# New fields
if html_meta['language']
  puts "Language: #{html_meta['language']}"
end
# Headers are structured entries, so join their text rather than the entries themselves
if html_meta['headers'] && !html_meta['headers'].empty?
  puts "Headers: #{html_meta['headers'].map { |h| h['text'] }.join(', ')}"
end
# Each link is a structured entry with href and text keys, not a [url, text] pair
if html_meta['links'] && !html_meta['links'].empty?
  html_meta['links'].each do |link|
    puts "Link: #{link['href']} (#{link['text']})"
  end
end
API Reference
For complete details on all HTML metadata fields and types, see:
- HTML Metadata Type Reference
Structured Types Reference
HeaderMetadata
Header elements extracted from the HTML document with hierarchy information.
pub struct HeaderMetadata {
    pub level: u8,           // 1-6 (h1-h6)
    pub text: String,        // Normalized text content
    pub id: Option<String>,  // HTML id attribute
    pub depth: usize,        // Document tree depth
    pub html_offset: usize,  // Byte offset in original HTML
}
Example:
{
  "level": 1,
  "text": "Welcome to Our Site",
  "id": "welcome-section",
  "depth": 2,
  "html_offset": 512
}
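As an illustration of what the `level` and `id` fields make possible, the following sketch renders a header list as an indented outline. The struct is a local mirror of the definition above so the example stands alone, and `render_outline` is a hypothetical helper, not a kreuzberg function.

```rust
/// Local mirror of the documented HeaderMetadata so the sketch is self-contained.
struct HeaderMetadata {
    level: u8,
    text: String,
    id: Option<String>,
    depth: usize,
    html_offset: usize,
}

/// Hypothetical helper: render headers as an indented outline,
/// two spaces per heading level below h1, with the id as an anchor when present.
fn render_outline(headers: &[HeaderMetadata]) -> String {
    headers
        .iter()
        .map(|h| {
            let indent = "  ".repeat(usize::from(h.level.saturating_sub(1)));
            match &h.id {
                Some(id) => format!("{}- {} (#{})", indent, h.text, id),
                None => format!("{}- {}", indent, h.text),
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

The outline relies only on `level`; `depth` and `html_offset` are carried along unused here, but could be used to sort headers or to locate them in the source HTML.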
LinkMetadata
Link elements with type classification and detailed attributes.
pub struct LinkMetadata {
    pub href: String,                         // The href URL value
    pub text: String,                         // Link text content
    pub title: Option<String>,                // Title attribute
    pub link_type: LinkType,                  // Classification enum
    pub rel: Vec<String>,                     // Rel attribute values
    pub attributes: HashMap<String, String>,  // Additional attributes
}

pub enum LinkType {
    Anchor,    // #section anchors
    Internal,  // Same domain links
    External,  // Different domain links
    Email,     // mailto: links
    Phone,     // tel: links
    Other,     // Other link types
}
Example:
{
  "href": "https://example.com",
  "text": "Visit Example",
  "title": "Example Website",
  "link_type": "external",
  "rel": ["nofollow"],
  "attributes": {
    "data-tracking": "yes"
  }
}
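The `link_type` classification combines naturally with `rel`. As a sketch of one possible use, the hypothetical helper below collects external links that are not marked `nofollow`. The types are trimmed local mirrors of the definitions above (with derives added for comparison), not imports from kreuzberg.

```rust
/// Local mirror of the documented LinkType enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LinkType {
    Anchor,
    Internal,
    External,
    Email,
    Phone,
    Other,
}

/// Trimmed local mirror of LinkMetadata (only the fields used here).
struct LinkMetadata {
    href: String,
    link_type: LinkType,
    rel: Vec<String>,
}

/// Hypothetical helper: hrefs of external links that do not carry rel="nofollow".
fn followed_external_links(links: &[LinkMetadata]) -> Vec<&str> {
    links
        .iter()
        .filter(|link| link.link_type == LinkType::External)
        .filter(|link| !link.rel.iter().any(|r| r == "nofollow"))
        .map(|link| link.href.as_str())
        .collect()
}
```

The example link shown above would be excluded by this filter, since its `rel` list contains `nofollow`.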
ImageMetadataType
Image elements with type classification and dimensions.
pub struct ImageMetadataType {
    pub src: String,                          // Image source (URL, data URI, or SVG)
    pub alt: Option<String>,                  // Alt text
    pub title: Option<String>,                // Title attribute
    pub dimensions: Option<(u32, u32)>,       // Width x Height
    pub image_type: ImageType,                // Classification enum
    pub attributes: HashMap<String, String>,  // Additional attributes
}

pub enum ImageType {
    DataUri,    // data: URI
    InlineSvg,  // Inline <svg> content
    External,   // External URL
    Relative,   // Relative path
}
Example:
{
  "src": "https://cdn.example.com/image.jpg",
  "alt": "Product photo",
  "title": "Featured product",
  "dimensions": [400, 300],
  "image_type": "external",
  "attributes": {
    "loading": "lazy"
  }
}
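Because `alt` and `dimensions` are both optional, consumers need to handle their absence. The sketch below shows two hypothetical helpers, one for a simple accessibility check and one for computing an aspect ratio. The struct is a trimmed local mirror of the definition above, not a kreuzberg import.

```rust
/// Trimmed local mirror of ImageMetadataType (only the fields used here).
struct ImageMetadataType {
    src: String,
    alt: Option<String>,
    dimensions: Option<(u32, u32)>,
}

/// Hypothetical helper: sources of images whose alt text is missing or blank.
fn images_missing_alt(images: &[ImageMetadataType]) -> Vec<&str> {
    images
        .iter()
        .filter(|img| img.alt.as_deref().map_or(true, |alt| alt.trim().is_empty()))
        .map(|img| img.src.as_str())
        .collect()
}

/// Hypothetical helper: width / height, or None when the dimensions are
/// unknown or the height is zero.
fn aspect_ratio(image: &ImageMetadataType) -> Option<f64> {
    match image.dimensions {
        Some((width, height)) if height > 0 => Some(f64::from(width) / f64::from(height)),
        _ => None,
    }
}
```

For the example image above (400 by 300), `aspect_ratio` returns a value of roughly 1.33.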
StructuredData
Extracted structured data blocks (JSON-LD, microdata, RDFa).
pub struct StructuredData {
    pub data_type: StructuredDataType,  // Classification enum
    pub raw_json: String,               // Raw JSON string
    pub schema_type: Option<String>,    // Schema type (e.g., "Article")
}

pub enum StructuredDataType {
    JsonLd,     // JSON-LD
    Microdata,  // microdata
    RDFa,       // RDFa
}
Example:
{
  "data_type": "json-ld",
  "raw_json": "{\"@context\": \"https://schema.org\", \"@type\": \"Article\", ...}",
  "schema_type": "Article"
}
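Since `raw_json` is an unparsed string, `schema_type` is a cheap way to select the blocks worth parsing. The following sketch uses local mirrors of the types above and a hypothetical helper, `json_ld_of_type`; parsing the returned strings is left to whichever JSON library you already use.

```rust
/// Local mirror of the documented StructuredDataType enum.
#[derive(Debug, PartialEq)]
enum StructuredDataType {
    JsonLd,
    Microdata,
    RDFa,
}

/// Local mirror of the documented StructuredData struct.
struct StructuredData {
    data_type: StructuredDataType,
    raw_json: String,
    schema_type: Option<String>,
}

/// Hypothetical helper: raw JSON of every JSON-LD block with the given schema type.
fn json_ld_of_type<'a>(blocks: &'a [StructuredData], schema_type: &str) -> Vec<&'a str> {
    blocks
        .iter()
        .filter(|block| block.data_type == StructuredDataType::JsonLd)
        .filter(|block| block.schema_type.as_deref() == Some(schema_type))
        .map(|block| block.raw_json.as_str())
        .collect()
}
```

Blocks with no `schema_type` never match, so untyped structured data is skipped rather than guessed at.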
Summary of Changes
| Field | v3.x | v4.0 |
|---|---|---|
| keywords | Option<String> | Vec<String> with #[serde(default)] |
| canonical | Option<String> | Renamed to canonical_url |
| og_* fields (7 fields) | Individual Option<String> fields | open_graph: BTreeMap<String, String> |
| twitter_* fields (6 fields) | Individual Option<String> fields | twitter_card: BTreeMap<String, String> |
| link_author, link_license, link_alternate | Individual fields | Removed (use links field) |
| New: language | N/A | Option<String> |
| New: text_direction | N/A | Option<TextDirection> |
| New: headers | N/A | Vec<HeaderMetadata> with #[serde(default)] |
| New: links | N/A | Vec<LinkMetadata> with #[serde(default)] |
| New: images | N/A | Vec<ImageMetadataType> with #[serde(default)] |
| New: structured_data | N/A | Vec<StructuredData> with #[serde(default)] |
| New: meta_tags | N/A | BTreeMap<String, String> with #[serde(default)] |
Questions?
- See the Types Reference for complete API details
- Check Working with Metadata for examples
- Open an issue on GitHub