Code Intelligence

Kreuzberg integrates tree-sitter-language-pack (TSLP) to parse and analyze source code files. When you extract a source code file, Kreuzberg automatically detects the programming language and produces structured analysis alongside the raw text content.

What You Get

When extracting source code, the metadata.format field contains a ProcessResult (format type "code") with:

  • Structure -- functions, classes, structs, methods, modules, and their nesting hierarchy
  • Imports -- import/include/require statements with source paths and imported items
  • Exports -- exported symbols with their kinds (function, class, variable, type, default)
  • Comments -- inline and block comments with their positions
  • Docstrings -- documentation comments with parsed sections (params, returns, etc.)
  • Symbols -- variable, constant, and type alias definitions
  • Diagnostics -- parse errors and warnings from tree-sitter
  • Chunks -- semantically meaningful code chunks for RAG and embedding pipelines
  • Metrics -- file-level statistics (lines of code, comment lines, empty lines, node count)
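As a rough illustration of how these sections might be consumed, the sketch below counts the entries in each list-valued section of a code-format metadata dictionary. It assumes the Python dictionary layout shown in the examples on this page (a "format_type" key plus one key per section); the `summarize_code_metadata` helper, the `sample` data, and the shape of the individual items are hypothetical and only stand in for a real extraction result.

```python
# List-valued sections named in the documentation above.
SECTIONS = (
    "structure", "imports", "exports", "comments",
    "docstrings", "symbols", "diagnostics", "chunks",
)


def summarize_code_metadata(fmt):
    """Return per-section entry counts, or None if fmt is not code metadata."""
    if not fmt or fmt.get("format_type") != "code":
        return None
    summary = {"language": fmt.get("language")}
    for name in SECTIONS:
        # Sections that were not requested may be missing or None.
        summary[name] = len(fmt.get(name) or [])
    return summary


# Hand-written stand-in for result.metadata["format"]; item shapes are made up.
sample = {
    "format_type": "code",
    "language": "python",
    "structure": [{"kind": "function", "name": "main"}],
    "imports": [{"source": "os"}, {"source": "sys"}],
    "exports": [],
    "chunks": [{"content": "def main(): ..."}],
}

summary = summarize_code_metadata(sample)
print(summary)
```

Treating a missing section as an empty list keeps the summary usable whether or not the optional analyses were enabled.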

Language support covers 300+ programming languages via tree-sitter grammars. See the TSLP documentation for the full language list.

Getting Started

Code intelligence is enabled by default when the tree-sitter feature flag is active. Simply extract a source code file:

basic.rs
use kreuzberg::{extract_file_sync, ExtractionConfig};

let config = ExtractionConfig::default();
let result = extract_file_sync("app.py", None, &config)?;

// The content field has the raw source text
println!("{}", result.content);

// Code intelligence is in metadata.format
if let Some(kreuzberg::types::FormatMetadata::Code(ref code)) = result.metadata.format {
    println!("Language: {}", code.language);
    println!("Structures: {}", code.structure.len());
    println!("Imports: {}", code.imports.len());
}
basic.py
import kreuzberg

config = kreuzberg.ExtractionConfig()
result = kreuzberg.extract_file_sync("app.py", config=config)

# The content field has the raw source text
print(result.content)

# Code intelligence is in metadata["format"]
fmt = result.metadata.get("format")
if fmt and fmt.get("format_type") == "code":
    print(f"Language: {fmt['language']}")
    print(f"Structures: {len(fmt['structure'])}")
    print(f"Imports: {len(fmt['imports'])}")
basic.ts
import { extractFileSync } from "@kreuzberg/node";

const result = extractFileSync("app.ts");

console.log(result.content);

const fmt = result.metadata?.format;
if (fmt?.formatType === "code") {
  console.log(`Language: ${fmt.language}`);
  console.log(`Structures: ${fmt.structure.length}`);
  console.log(`Imports: ${fmt.imports.length}`);
}
basic.go
result, err := kreuzberg.ExtractFileSync("app.py", nil)
if err != nil {
    log.Fatal(err)
}

fmt.Println(result.Content)
// Code intelligence is available in result.Metadata.Format
// when Format.Type == "code"

Configuration

Use TreeSitterConfig to control which analysis features are enabled. Set enabled: false to disable code intelligence entirely. By default, structure, imports, and exports are enabled; comments, docstrings, symbols, and diagnostics are disabled.

config.rs
use kreuzberg::{ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

let config = ExtractionConfig {
    tree_sitter: Some(TreeSitterConfig {
        process: TreeSitterProcessConfig {
            structure: true,      // functions, classes, etc. (default: true)
            imports: true,        // import statements (default: true)
            exports: true,        // export statements (default: true)
            comments: true,       // comments (default: false)
            docstrings: true,     // docstrings (default: false)
            symbols: true,        // variables, constants (default: false)
            diagnostics: true,    // parse errors/warnings (default: false)
            chunk_max_size: Some(4096),  // max chunk size in bytes
            ..Default::default()
        },
        ..Default::default()
    }),
    ..Default::default()
};
config.py
import kreuzberg

config = kreuzberg.ExtractionConfig(
    tree_sitter={
        "process": {
            "structure": True,
            "imports": True,
            "exports": True,
            "comments": True,
            "docstrings": True,
            "symbols": True,
            "diagnostics": True,
            "chunk_max_size": 4096,
        }
    }
)
config.ts
import { ExtractionConfig } from "@kreuzberg/node";

const config: ExtractionConfig = {
  treeSitter: {
    process: {
      structure: true,
      imports: true,
      exports: true,
      comments: true,
      docstrings: true,
      symbols: true,
      diagnostics: true,
      chunkMaxSize: 4096,
    },
  },
};
kreuzberg.toml
[tree_sitter.process]
structure = true
imports = true
exports = true
comments = true
docstrings = true
symbols = true
diagnostics = true
chunk_max_size = 4096
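When only one or two flags differ from the defaults, it can be convenient to start from the documented defaults and override selectively. The helper below is a minimal sketch of that idea: `build_tree_sitter_config` and `DEFAULT_PROCESS` are hypothetical names, not part of Kreuzberg, and the default values are taken from the description above (structure, imports, and exports on; comments, docstrings, symbols, and diagnostics off).

```python
# Defaults as described in the Configuration section (assumed, not read from the library).
DEFAULT_PROCESS = {
    "structure": True,
    "imports": True,
    "exports": True,
    "comments": False,
    "docstrings": False,
    "symbols": False,
    "diagnostics": False,
}

# Keys accepted by this sketch: the boolean flags plus chunk_max_size.
ALLOWED_KEYS = set(DEFAULT_PROCESS) | {"chunk_max_size"}


def build_tree_sitter_config(**overrides):
    """Merge overrides into the documented defaults and reject unknown keys."""
    unknown = set(overrides) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown process options: {sorted(unknown)}")
    return {"process": {**DEFAULT_PROCESS, **overrides}}


tree_sitter = build_tree_sitter_config(docstrings=True, chunk_max_size=2048)
print(tree_sitter)
```

The resulting dictionary has the same layout as the `tree_sitter` argument in config.py, so it could be passed to `kreuzberg.ExtractionConfig(tree_sitter=...)` in the same way.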

Configuration Fields

See TreeSitterConfig and TreeSitterProcessConfig for all fields.

ProcessResult Fields

Code intelligence results are returned as a ProcessResult from the upstream tree-sitter-language-pack crate. Top-level fields: language, metrics, structure, imports, exports, chunks, plus comments / docstrings / symbols / diagnostics (populated only when their TreeSitterProcessConfig flag is on). See the upstream crate docs for full field shapes.

Semantic Chunking for RAG

Code chunks produced by tree-sitter are semantically aware -- they split at function, class, and module boundaries rather than fixed line counts. This makes them ideal for retrieval-augmented generation (RAG) pipelines:

rag_chunking.py
import kreuzberg

config = kreuzberg.ExtractionConfig(
    tree_sitter={"process": {"chunk_max_size": 2048}}
)

result = kreuzberg.extract_file_sync("large_module.py", config=config)

fmt = result.metadata.get("format")
if fmt and fmt.get("format_type") == "code":
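    for chunk in fmt.get("chunks", []):
        # Each chunk is a semantically coherent piece of code.
        # your_embedding_model and store_in_vector_db are placeholders
        # for your own embedding function and vector store client.
        embedding = your_embedding_model(chunk["content"])
        store_in_vector_db(
            text=chunk["content"],
            embedding=embedding,
            metadata={
                "language": chunk["language"],
                "start_line": chunk["span"]["start_line"],
                # "context" may be missing or None for top-level chunks
                "parent": (chunk.get("context") or {}).get("parent_name"),
            },
        )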
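To make the idea of splitting at definition boundaries concrete, here is a small stand-alone illustration. It does not use tree-sitter or Kreuzberg: it swaps in Python's built-in `ast` module and only handles Python source, and the `semantic_chunks` function and the chunk dictionary keys are hypothetical. The point is only that each chunk covers a whole function or class rather than an arbitrary window of lines.

```python
import ast

SOURCE = '''\
import os

def load(path):
    return open(path).read()

class Parser:
    def parse(self, text):
        return text.split()
'''


def semantic_chunks(source):
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,      # 1-based, inclusive
                "end_line": node.end_lineno,    # 1-based, inclusive
                "content": ast.get_source_segment(source, node),
            })
    return chunks


chunks = semantic_chunks(SOURCE)
for chunk in chunks:
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

A fixed-size splitter applied to the same source could cut the `Parser` class in the middle of `parse`; boundary-based splitting keeps the class together, which is the property the chunks described above are meant to provide.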

Language Detection

Kreuzberg detects the programming language in two ways:

  1. File extension (fast path) -- when using extract_file, the extension is matched against 248 known language extensions.
  2. Shebang line (fallback) -- when using extract_bytes or when the extension is ambiguous, the first line is checked for an interpreter directive such as #!/usr/bin/env python or #!/bin/bash.

If neither method identifies the language, extraction returns an UnsupportedFormat error.
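The two-step lookup described above can be sketched as follows. This is not Kreuzberg's implementation: the `detect_language` function and both lookup tables are hypothetical and deliberately tiny, and the sketch returns None where the library would report an UnsupportedFormat error.

```python
import os

# Tiny illustrative tables; the real extension list is far larger.
EXTENSIONS = {".py": "python", ".rs": "rust", ".ts": "typescript", ".go": "go", ".sh": "bash"}
INTERPRETERS = {"python": "python", "python3": "python", "bash": "bash", "sh": "bash"}


def detect_language(filename, content):
    """Try the file extension first, then fall back to the shebang line."""
    if filename:
        ext = os.path.splitext(filename)[1].lower()
        if ext in EXTENSIONS:
            return EXTENSIONS[ext]

    lines = content.splitlines()
    first_line = lines[0] if lines else ""
    if first_line.startswith("#!"):
        parts = first_line[2:].split()
        if parts:
            program = os.path.basename(parts[0])
            # "#!/usr/bin/env python" names the interpreter as the next word.
            if program == "env" and len(parts) > 1:
                program = os.path.basename(parts[1])
            return INTERPRETERS.get(program)
    return None


print(detect_language("app.py", ""))
print(detect_language(None, "#!/usr/bin/env python\nprint('hi')\n"))
print(detect_language("run", "#!/bin/bash\necho hi\n"))
print(detect_language("notes", "plain text"))
```

Checking the extension first avoids reading the content at all in the common case, which is why it is described as the fast path.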

Language Support

Tree-sitter-language-pack supports 300+ programming languages. For the full list, see the TSLP language reference.

Common languages with full structural analysis:

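| Language   | Structure | Imports | Exports | Docstrings |
|------------|-----------|---------|---------|------------|
| Python     | Yes       | Yes     | Yes     | Yes        |
| Rust       | Yes       | Yes     | Yes     | Yes        |
| TypeScript | Yes       | Yes     | Yes     | Yes        |
| JavaScript | Yes       | Yes     | Yes     | Yes        |
| Go         | Yes       | Yes     | Yes     | Yes        |
| Java       | Yes       | Yes     | Yes     | Yes        |
| C/C++      | Yes       | Yes     | Yes     | Yes        |
| Ruby       | Yes       | Yes     | Yes     | Yes        |
| PHP        | Yes       | Yes     | Yes     | Yes        |
| C#         | Yes       | Yes     | Yes     | Yes        |
| Swift      | Yes       | Yes     | Yes     | Yes        |
| Kotlin     | Yes       | Yes     | Yes     | Yes        |
| Elixir     | Yes       | Yes     | Yes     | Yes        |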