HTML Metadata Structure Changes (v4.0)

Summary

HTML metadata has been restructured for better organization and type safety. The changes consolidate individual Open Graph and Twitter Card fields into maps, and convert keywords from a single string to an array.

Breaking Changes

1. Keywords: String to Array

Before (v3.x):

Keywords as Comma-Separated String
// Option<String> - comma-separated or space-separated
html_meta.keywords  // "seo, metadata, html"

After (v4.0):

Keywords as Structured Array
// Vec<String> - structured array
html_meta.keywords  // vec!["seo", "metadata", "html"]
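
Code that has to read metadata produced by both versions can normalize the field once, up front. The following is a minimal sketch rather than part of the Kreuzberg API: `normalize_keywords` is a hypothetical helper that accepts either the v3.x string or the v4.0 list and always returns a list.

```python
import re


def normalize_keywords(value):
    """Return keywords as a list, whatever shape the input has.

    Hypothetical helper (not part of Kreuzberg): accepts the v3.x string
    form or the v4.0 list form.
    """
    if value is None:
        return []
    if isinstance(value, str):
        # v3.x: split on commas when present, otherwise on whitespace.
        separator = r"," if "," in value else r"\s+"
        parts = re.split(separator, value)
    else:
        # v4.0: already a sequence of strings.
        parts = list(value)
    # Trim each entry and drop empty ones.
    return [part.strip() for part in parts if part.strip()]


print(normalize_keywords("seo, metadata, html"))
print(normalize_keywords(["seo", "metadata", "html"]))
```

Both calls yield the same list, so downstream code only needs to handle the v4.0 shape.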

2. Canonical URL: Field Rename

Before (v3.x):

Canonical Field (v3.x)
html_meta.canonical  // Option<String>

After (v4.0):

Canonical URL Field (v4.0)
html_meta.canonical_url  // Option<String>

3. Open Graph: Individual Fields to Map

Before (v3.x):

Open Graph as Individual Fields
html_meta.og_title          // Option<String>
html_meta.og_description    // Option<String>
html_meta.og_image          // Option<String>
html_meta.og_url            // Option<String>
html_meta.og_type           // Option<String>
html_meta.og_site_name      // Option<String>

After (v4.0):

Open Graph as Map Structure
html_meta.open_graph        // BTreeMap<String, String>
html_meta.open_graph.get("title")         // Option<&String>
html_meta.open_graph.get("description")   // Option<&String>
html_meta.open_graph.get("image")         // Option<&String>
html_meta.open_graph.get("url")           // Option<&String>
html_meta.open_graph.get("type")          // Option<&String>
html_meta.open_graph.get("site_name")     // Option<&String>
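
If you have v3.x metadata stored as flat dictionaries (for example in a cache), it can be reshaped into the map form. This is a sketch under that assumption: `collect_prefixed` and the sample `legacy_meta` dictionary are hypothetical and not part of Kreuzberg.

```python
def collect_prefixed(legacy, prefix):
    """Gather flat `<prefix>*` fields into one key-sorted dict.

    Hypothetical helper: `legacy` is assumed to be a v3.x-style flat dict.
    """
    collected = {
        key[len(prefix):]: value
        for key, value in legacy.items()
        if key.startswith(prefix) and value is not None
    }
    # Sort by key to mirror the ordering of a BTreeMap.
    return dict(sorted(collected.items()))


legacy_meta = {
    "og_title": "Example Page",
    "og_image": "https://example.com/cover.png",
    "og_site_name": "Example",
    "og_url": None,  # absent fields are skipped
    "title": "Not an Open Graph field",
}

open_graph = collect_prefixed(legacy_meta, "og_")
print(open_graph)
```

Fields that were `None` in the flat form are simply missing from the map, which matches how `get` behaves for absent keys.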

4. Twitter Card: Individual Fields to Map

Before (v3.x):

Twitter Card as Individual Fields
html_meta.twitter_card          // Option<String>
html_meta.twitter_title         // Option<String>
html_meta.twitter_description   // Option<String>
html_meta.twitter_image         // Option<String>
html_meta.twitter_site          // Option<String>
html_meta.twitter_creator       // Option<String>

After (v4.0):

Twitter Card as Map Structure
html_meta.twitter_card          // BTreeMap<String, String>
html_meta.twitter_card.get("card")          // Option<&String>
html_meta.twitter_card.get("title")         // Option<&String>
html_meta.twitter_card.get("description")   // Option<&String>
html_meta.twitter_card.get("image")         // Option<&String>
html_meta.twitter_card.get("site")          // Option<&String>
html_meta.twitter_card.get("creator")       // Option<&String>
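
Note that `twitter_card` keeps its name but changes type: in v3.x it held the card type as a string, while in v4.0 it is the whole map and the card type lives under the `"card"` key. A minimal sketch of a reader that tolerates both shapes follows; `twitter_card_type` and the sample dictionaries are hypothetical.

```python
def twitter_card_type(html_meta):
    """Return the Twitter card type from v3.x or v4.0 style metadata.

    Hypothetical helper, not part of Kreuzberg.
    """
    value = html_meta.get("twitter_card")
    if isinstance(value, dict):
        # v4.0: the field is a map; the card type is one of its entries.
        return value.get("card")
    # v3.x: the field is the card type itself (or None when absent).
    return value


v3_meta = {"twitter_card": "summary", "twitter_title": "Example"}
v4_meta = {"twitter_card": {"card": "summary", "title": "Example"}}

print(twitter_card_type(v3_meta))
print(twitter_card_type(v4_meta))
```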

5. Removed Fields

The following link-related fields have been removed:

- link_author
- link_license
- link_alternate

Use the new links field instead for comprehensive link extraction.
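
One way to recover the old values is to filter `links` by their `rel` attribute. This sketch assumes that author, license, and alternate links appear in `links` with the corresponding `rel` value; the source only says to use `links`, so treat that mapping as an assumption. `links_with_rel` and the sample data are hypothetical.

```python
def links_with_rel(links, rel_value):
    """Return the href of every link whose rel list contains rel_value."""
    return [link["href"] for link in links if rel_value in link.get("rel", [])]


# Sample data shaped like the LinkMetadata JSON example (values invented).
links = [
    {"href": "https://example.com/about", "text": "About", "rel": ["author"]},
    {"href": "https://example.com/license", "text": "License", "rel": ["license"]},
    {"href": "https://example.com/fr/", "text": "French", "rel": ["alternate"]},
    {"href": "https://example.com/blog", "text": "Blog", "rel": []},
]

print(links_with_rel(links, "author"))
print(links_with_rel(links, "license"))
print(links_with_rel(links, "alternate"))
```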

6. New Fields

HTML metadata now includes rich metadata about page content:

- language: Document language (e.g., "en", "fr")
- text_direction: Text direction ("ltr", "rtl")
- headers: List of page headers/headings with structured metadata
- links: List of links with detailed metadata and type classification
- images: List of images with alt text, dimensions, and type classification
- structured_data: Parsed JSON-LD, microdata, and RDFa data
- meta_tags: All meta tags as a map

Migration Guide

Rust

Before (v3.x):

use kreuzberg::{extract_file_sync, ExtractionConfig};

let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
if let Some(html_meta) = result.metadata.html {
    // Keywords as single string
    if let Some(keywords) = html_meta.keywords {
        let keyword_vec: Vec<&str> = keywords.split(',').map(|s| s.trim()).collect();
        println!("Keywords: {:?}", keyword_vec);
    }

    // Canonical as separate field
    if let Some(canonical) = html_meta.canonical {
        println!("Canonical: {}", canonical);
    }

    // Open Graph as individual fields
    if let Some(og_title) = html_meta.og_title {
        println!("OG Title: {}", og_title);
    }
    if let Some(og_image) = html_meta.og_image {
        println!("OG Image: {}", og_image);
    }

    // Twitter as individual fields
    if let Some(twitter_card) = html_meta.twitter_card {
        println!("Twitter Card: {}", twitter_card);
    }
}

After (v4.0):

use kreuzberg::{extract_file_sync, ExtractionConfig};

let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
if let Some(html_meta) = result.metadata.html {
    // Keywords as array
    if !html_meta.keywords.is_empty() {
        println!("Keywords: {:?}", html_meta.keywords);
    }

    // Canonical renamed
    if let Some(canonical_url) = html_meta.canonical_url {
        println!("Canonical URL: {}", canonical_url);
    }

    // Open Graph from map
    if let Some(og_title) = html_meta.open_graph.get("title") {
        println!("OG Title: {}", og_title);
    }
    if let Some(og_image) = html_meta.open_graph.get("image") {
        println!("OG Image: {}", og_image);
    }

    // Twitter from map
    if let Some(twitter_card) = html_meta.twitter_card.get("card") {
        println!("Twitter Card: {}", twitter_card);
    }

    // New fields
    if let Some(lang) = html_meta.language {
        println!("Language: {}", lang);
    }
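    // headers and links are plain Vecs (empty when absent), not Options
    if !html_meta.headers.is_empty() {
        println!("Headers: {:?}", html_meta.headers);
    }
    // Each link is a LinkMetadata struct, not a (url, text) tuple
    for link in &html_meta.links {
        println!("Link: {} ({})", link.href, link.text);
    }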
}

Python

Before (v3.x):

from kreuzberg import extract_file_sync, ExtractionConfig

result = extract_file_sync("page.html", config=ExtractionConfig())
html_meta = result.metadata.get("html", {})

# Keywords as single string
if html_meta.get('keywords'):
    keyword_list = html_meta['keywords'].split(',')
    print(f"Keywords: {keyword_list}")

# Canonical as separate field
if html_meta.get('canonical'):
    print(f"Canonical: {html_meta['canonical']}")

# Open Graph as individual fields
if html_meta.get('og_title'):
    print(f"OG Title: {html_meta['og_title']}")
if html_meta.get('og_image'):
    print(f"OG Image: {html_meta['og_image']}")

# Twitter as individual fields
if html_meta.get('twitter_card'):
    print(f"Twitter Card: {html_meta['twitter_card']}")

After (v4.0):

from kreuzberg import extract_file_sync, ExtractionConfig

result = extract_file_sync("page.html", config=ExtractionConfig())
html_meta = result.metadata.get("html", {})

# Keywords as array
if html_meta.get('keywords'):
    print(f"Keywords: {html_meta['keywords']}")

# Canonical renamed
if html_meta.get('canonical_url'):
    print(f"Canonical URL: {html_meta['canonical_url']}")

# Open Graph from map
open_graph = html_meta.get('open_graph', {})
if open_graph.get('title'):
    print(f"OG Title: {open_graph['title']}")
if open_graph.get('image'):
    print(f"OG Image: {open_graph['image']}")

# Twitter from map
twitter_card = html_meta.get('twitter_card', {})
if twitter_card.get('card'):
    print(f"Twitter Card: {twitter_card['card']}")

# New fields
if html_meta.get('language'):
    print(f"Language: {html_meta['language']}")

if html_meta.get('headers'):
    print(f"Headers: {html_meta['headers']}")

if html_meta.get('links'):
    # Each link is a mapping with 'href' and 'text' keys, not a (url, text) pair
    for link in html_meta['links']:
        print(f"Link: {link['href']} ({link['text']})")

TypeScript

Before (v3.x):

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('page.html');
const htmlMeta = result.metadata;

// Keywords as single string
if (htmlMeta.keywords) {
    const keywordArray = htmlMeta.keywords.split(',');
    console.log('Keywords:', keywordArray);
}

// Canonical as separate field
if (htmlMeta.canonical) {
    console.log('Canonical:', htmlMeta.canonical);
}

// Open Graph as individual fields
if (htmlMeta.ogTitle) {
    console.log('OG Title:', htmlMeta.ogTitle);
}
if (htmlMeta.ogImage) {
    console.log('OG Image:', htmlMeta.ogImage);
}

// Twitter as individual fields
if (htmlMeta.twitterCard) {
    console.log('Twitter Card:', htmlMeta.twitterCard);
}

After (v4.0):

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('page.html');
const htmlMeta = result.metadata;

// Keywords as array
if ((htmlMeta.keywords?.length ?? 0) > 0) {
    console.log('Keywords:', htmlMeta.keywords);
}

// Canonical renamed
if (htmlMeta.canonicalUrl) {
    console.log('Canonical URL:', htmlMeta.canonicalUrl);
}

// Open Graph from map
if (htmlMeta.openGraph) {
    if (htmlMeta.openGraph['title']) {
        console.log('OG Title:', htmlMeta.openGraph['title']);
    }
    if (htmlMeta.openGraph['image']) {
        console.log('OG Image:', htmlMeta.openGraph['image']);
    }
}

// Twitter from map
if (htmlMeta.twitterCard) {
    if (htmlMeta.twitterCard['card']) {
        console.log('Twitter Card:', htmlMeta.twitterCard['card']);
    }
}

// New fields
if (htmlMeta.language) {
    console.log('Language:', htmlMeta.language);
}

if ((htmlMeta.headers?.length ?? 0) > 0) {
    console.log('Headers:', htmlMeta.headers);
}

// Each link is an object with href and text, not a [url, text] tuple
htmlMeta.links?.forEach((link) => {
    console.log(`Link: ${link.href} (${link.text})`);
});

Java

Before (v3.x):

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import java.util.Arrays;
import java.util.Map;

ExtractionResult result = Kreuzberg.extractFileSync("page.html");
Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

// Keywords as single string
String keywords = (String) htmlMeta.get("keywords");
if (keywords != null) {
    String[] keywordArray = keywords.split(",");
    System.out.println("Keywords: " + Arrays.toString(keywordArray));
}

// Canonical as separate field
String canonical = (String) htmlMeta.get("canonical");
if (canonical != null) {
    System.out.println("Canonical: " + canonical);
}

// Open Graph as individual fields
String ogTitle = (String) htmlMeta.get("og_title");
if (ogTitle != null) {
    System.out.println("OG Title: " + ogTitle);
}

// Twitter as individual fields
String twitterCard = (String) htmlMeta.get("twitter_card");
if (twitterCard != null) {
    System.out.println("Twitter Card: " + twitterCard);
}

After (v4.0):

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import java.util.Map;
import java.util.List;

ExtractionResult result = Kreuzberg.extractFileSync("page.html");
Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

// Keywords as array
@SuppressWarnings("unchecked")
List<String> keywords = (List<String>) htmlMeta.get("keywords");
if (keywords != null && !keywords.isEmpty()) {
    System.out.println("Keywords: " + keywords);
}

// Canonical renamed
String canonicalUrl = (String) htmlMeta.get("canonical_url");
if (canonicalUrl != null) {
    System.out.println("Canonical URL: " + canonicalUrl);
}

// Open Graph from map
@SuppressWarnings("unchecked")
Map<String, String> openGraph = (Map<String, String>) htmlMeta.get("open_graph");
if (openGraph != null) {
    String ogTitle = openGraph.get("title");
    if (ogTitle != null) {
        System.out.println("OG Title: " + ogTitle);
    }
}

// Twitter from map
@SuppressWarnings("unchecked")
Map<String, String> twitterCard = (Map<String, String>) htmlMeta.get("twitter_card");
if (twitterCard != null) {
    String card = twitterCard.get("card");
    if (card != null) {
        System.out.println("Twitter Card: " + card);
    }
}

// New fields
String language = (String) htmlMeta.get("language");
if (language != null) {
    System.out.println("Language: " + language);
}

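// Headers are structured entries (level, text, id, ...), not plain strings
@SuppressWarnings("unchecked")
List<Map<String, Object>> headers = (List<Map<String, Object>>) htmlMeta.get("headers");
if (headers != null && !headers.isEmpty()) {
    System.out.println("Headers: " + headers);
}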

Go

Before (v3.x):

package main

import (
    "fmt"
    "log"
    "strings"
    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    result, err := kreuzberg.ExtractFileSync("page.html", nil)
    if err != nil {
        log.Fatalf("extract: %v", err)
    }

    if html, ok := result.Metadata.HTMLMetadata(); ok {
        // Keywords as single string
        if html.Keywords != nil {
            keywordSlice := strings.Split(*html.Keywords, ",")
            fmt.Println("Keywords:", keywordSlice)
        }

        // Canonical as separate field
        if html.Canonical != nil {
            fmt.Println("Canonical:", *html.Canonical)
        }

        // Open Graph as individual fields
        if html.OGTitle != nil {
            fmt.Println("OG Title:", *html.OGTitle)
        }
        if html.OGImage != nil {
            fmt.Println("OG Image:", *html.OGImage)
        }

        // Twitter as individual fields
        if html.TwitterCard != nil {
            fmt.Println("Twitter Card:", *html.TwitterCard)
        }
    }
}

After (v4.0):

package main

import (
    "fmt"
    "log"
    "strings"
    "github.com/kreuzberg-dev/kreuzberg/packages/go/v4"
)

func main() {
    result, err := kreuzberg.ExtractFileSync("page.html", nil)
    if err != nil {
        log.Fatalf("extract: %v", err)
    }

    if html, ok := result.Metadata.HTMLMetadata(); ok {
        // Keywords as array
        if len(html.Keywords) > 0 {
            fmt.Println("Keywords:", strings.Join(html.Keywords, ", "))
        }

        // Canonical renamed
        if html.CanonicalURL != nil {
            fmt.Println("Canonical URL:", *html.CanonicalURL)
        }

        // Open Graph from map
        if len(html.OpenGraph) > 0 {
            if ogTitle, ok := html.OpenGraph["title"]; ok {
                fmt.Println("OG Title:", ogTitle)
            }
            if ogImage, ok := html.OpenGraph["image"]; ok {
                fmt.Println("OG Image:", ogImage)
            }
        }

        // Twitter from map
        if len(html.TwitterCard) > 0 {
            if card, ok := html.TwitterCard["card"]; ok {
                fmt.Println("Twitter Card:", card)
            }
        }

        // New fields
        if html.Language != nil {
            fmt.Println("Language:", *html.Language)
        }

        if len(html.Headers) > 0 {
            // Headers are structured values, so print them with %+v
            fmt.Printf("Headers: %+v\n", html.Headers)
        }

        if len(html.Links) > 0 {
            for _, link := range html.Links {
                // Each link is a structured value, not a [url, text] pair
                fmt.Printf("Link: %+v\n", link)
            }
        }
    }
}

Ruby

Before (v3.x):

require 'kreuzberg'

result = Kreuzberg.extract_file_sync('page.html')
html_meta = result.metadata['html']

# Keywords as single string
if html_meta['keywords']
    keyword_array = html_meta['keywords'].split(',').map(&:strip)
    puts "Keywords: #{keyword_array}"
end

# Canonical as separate field
if html_meta['canonical']
    puts "Canonical: #{html_meta['canonical']}"
end

# Open Graph as individual fields
if html_meta['og_title']
    puts "OG Title: #{html_meta['og_title']}"
end
if html_meta['og_image']
    puts "OG Image: #{html_meta['og_image']}"
end

# Twitter as individual fields
if html_meta['twitter_card']
    puts "Twitter Card: #{html_meta['twitter_card']}"
end

After (v4.0):

require 'kreuzberg'

result = Kreuzberg.extract_file_sync('page.html')
html_meta = result.metadata['html']

# Keywords as array
if html_meta['keywords'] && !html_meta['keywords'].empty?
    puts "Keywords: #{html_meta['keywords']}"
end

# Canonical renamed
if html_meta['canonical_url']
    puts "Canonical URL: #{html_meta['canonical_url']}"
end

# Open Graph from map
open_graph = html_meta['open_graph'] || {}
if open_graph['title']
    puts "OG Title: #{open_graph['title']}"
end
if open_graph['image']
    puts "OG Image: #{open_graph['image']}"
end

# Twitter from map
twitter_card = html_meta['twitter_card'] || {}
if twitter_card['card']
    puts "Twitter Card: #{twitter_card['card']}"
end

# New fields
if html_meta['language']
    puts "Language: #{html_meta['language']}"
end

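if html_meta['headers'] && !html_meta['headers'].empty?
    # Each header is a hash; print its text rather than the raw hash
    puts "Headers: #{html_meta['headers'].map { |h| h['text'] }.join(', ')}"
end

if html_meta['links'] && !html_meta['links'].empty?
    # Each link is a hash with 'href' and 'text' keys, not a [url, text] pair
    html_meta['links'].each do |link|
        puts "Link: #{link['href']} (#{link['text']})"
    end
end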

API Reference

For complete details on all HTML metadata fields and types, see:

- HTML Metadata Type Reference

Structured Types Reference

HeaderMetadata

Header elements extracted from the HTML document with hierarchy information.

HeaderMetadata Struct Definition
pub struct HeaderMetadata {
    pub level: u8,                    // 1-6 (h1-h6)
    pub text: String,                 // Normalized text content
    pub id: Option<String>,           // HTML id attribute
    pub depth: usize,                 // Document tree depth
    pub html_offset: usize,           // Byte offset in original HTML
}

Example:

HeaderMetadata JSON Example
{
  "level": 1,
  "text": "Welcome to Our Site",
  "id": "welcome-section",
  "depth": 2,
  "html_offset": 512
}
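
The `level` field is enough to rebuild a simple table of contents. The sketch below works on dictionaries shaped like the JSON example above; `render_outline` and the second sample header are hypothetical, added only for illustration.

```python
def render_outline(headers):
    """Render headers as an indented bullet outline, one line per header."""
    lines = []
    for header in headers:
        # Indent two spaces per level below h1.
        indent = "  " * (header["level"] - 1)
        # Append the anchor only when the heading has an id attribute.
        anchor = f" (#{header['id']})" if header.get("id") else ""
        lines.append(f"{indent}- {header['text']}{anchor}")
    return "\n".join(lines)


headers = [
    {"level": 1, "text": "Welcome to Our Site", "id": "welcome-section",
     "depth": 2, "html_offset": 512},
    {"level": 2, "text": "Getting Started", "id": None,
     "depth": 3, "html_offset": 1024},
]

outline = render_outline(headers)
print(outline)
```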

LinkMetadata

Link elements with type classification and detailed attributes.

LinkMetadata Struct and LinkType Enum
pub struct LinkMetadata {
    pub href: String,                        // The href URL value
    pub text: String,                        // Link text content
    pub title: Option<String>,               // Title attribute
    pub link_type: LinkType,                 // Classification enum
    pub rel: Vec<String>,                    // Rel attribute values
    pub attributes: HashMap<String, String>, // Additional attributes
}

pub enum LinkType {
    Anchor,    // #section anchors
    Internal,  // Same domain links
    External,  // Different domain links
    Email,     // mailto: links
    Phone,     // tel: links
    Other,     // Other link types
}

Example:

LinkMetadata JSON Example
{
  "href": "https://example.com",
  "text": "Visit Example",
  "title": "Example Website",
  "link_type": "external",
  "rel": ["nofollow"],
  "attributes": {
    "data-tracking": "yes"
  }
}
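
A common use of `link_type` is to bucket links by classification. The following sketch groups link dictionaries shaped like the JSON example above. The `"external"` value comes from that example; the `"anchor"` and `"email"` strings are assumed serializations of the other enum variants, and `group_by_type` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_type(links):
    """Map each link_type value to the list of hrefs classified under it."""
    grouped = defaultdict(list)
    for link in links:
        grouped[link["link_type"]].append(link["href"])
    return dict(grouped)


links = [
    {"href": "https://example.com", "text": "Visit Example", "link_type": "external"},
    {"href": "#pricing", "text": "Pricing", "link_type": "anchor"},  # assumed value
    {"href": "mailto:team@example.com", "text": "Email us", "link_type": "email"},  # assumed value
    {"href": "https://example.org", "text": "Partner", "link_type": "external"},
]

grouped = group_by_type(links)
print(grouped["external"])
```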

ImageMetadataType

Image elements with type classification and dimensions.

ImageMetadataType Struct and ImageType Enum
pub struct ImageMetadataType {
    pub src: String,                         // Image source (URL, data URI, or SVG)
    pub alt: Option<String>,                 // Alt text
    pub title: Option<String>,               // Title attribute
    pub dimensions: Option<(u32, u32)>,      // Width x Height
    pub image_type: ImageType,               // Classification enum
    pub attributes: HashMap<String, String>, // Additional attributes
}

pub enum ImageType {
    DataUri,    // data: URI
    InlineSvg,  // Inline <svg> content
    External,   // External URL
    Relative,   // Relative path
}

Example:

ImageMetadataType JSON Example
{
  "src": "https://cdn.example.com/image.jpg",
  "alt": "Product photo",
  "title": "Featured product",
  "dimensions": [400, 300],
  "image_type": "external",
  "attributes": {
    "loading": "lazy"
  }
}
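
Because `alt` and `dimensions` are optional, consumers should handle their absence explicitly. This sketch, on dictionaries shaped like the JSON example above, flags images without alt text and computes a pixel area when dimensions are known. Both helpers and the second sample image are hypothetical.

```python
def images_missing_alt(images):
    """Return the src of every image whose alt text is absent or blank."""
    return [
        image["src"]
        for image in images
        if not (image.get("alt") or "").strip()
    ]


def pixel_area(image):
    """Return width * height, or None when dimensions are unknown."""
    dimensions = image.get("dimensions")
    if dimensions is None:
        return None
    width, height = dimensions
    return width * height


images = [
    {"src": "https://cdn.example.com/image.jpg", "alt": "Product photo",
     "dimensions": [400, 300], "image_type": "external"},
    {"src": "/static/spacer.gif", "alt": None,
     "dimensions": None, "image_type": "relative"},  # assumed enum value
]

print(images_missing_alt(images))
print(pixel_area(images[0]))
```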

StructuredData

Extracted structured data blocks (JSON-LD, microdata, RDFa).

StructuredData Struct and StructuredDataType Enum
pub struct StructuredData {
    pub data_type: StructuredDataType,  // Classification enum
    pub raw_json: String,               // Raw JSON string
    pub schema_type: Option<String>,    // Schema type (e.g., "Article")
}

pub enum StructuredDataType {
    JsonLd,     // JSON-LD
    Microdata,  // microdata
    RDFa,       // RDFa
}

Example:

StructuredData JSON Example
{
  "data_type": "json-ld",
  "raw_json": "{\"@context\": \"https://schema.org\", \"@type\": \"Article\", ...}",
  "schema_type": "Article"
}
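
Since `raw_json` is a string, it has to be parsed before its contents can be inspected. A minimal sketch follows, using a complete JSON payload in place of the elided one above; `parse_structured` and the sample `headline` value are hypothetical.

```python
import json


def parse_structured(entry):
    """Parse an entry's raw_json, returning None if it is not valid JSON."""
    try:
        return json.loads(entry["raw_json"])
    except json.JSONDecodeError:
        return None


entry = {
    "data_type": "json-ld",
    "raw_json": '{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}',
    "schema_type": "Article",
}

payload = parse_structured(entry)
print(payload["@type"])
```

Returning `None` on malformed input is a design choice for this sketch; raising may be preferable when invalid structured data should stop processing.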

Summary of Changes

| Field | v3.x | v4.0 |
| --- | --- | --- |
| keywords | Option<String> | Vec<String> with #[serde(default)] |
| canonical | Option<String> | Renamed to canonical_url |
| og_* fields (7 fields) | Individual Option<String> fields | open_graph: BTreeMap<String, String> |
| twitter_* fields (6 fields) | Individual Option<String> fields | twitter_card: BTreeMap<String, String> |
| link_author, link_license, link_alternate | Individual fields | Removed (use links field) |
| New: language | N/A | Option<String> |
| New: text_direction | N/A | Option<TextDirection> |
| New: headers | N/A | Vec<HeaderMetadata> with #[serde(default)] |
| New: links | N/A | Vec<LinkMetadata> with #[serde(default)] |
| New: images | N/A | Vec<ImageMetadataType> with #[serde(default)] |
| New: structured_data | N/A | Vec<StructuredData> with #[serde(default)] |
| New: meta_tags | N/A | BTreeMap<String, String> with #[serde(default)] |

Questions?