Chunk a document for retrieval

Goal: split a document into chunks before you embed or search it — without cutting sentences in half or losing the link back to the source. kaos-nlp-core offers several strategies; pick the one that preserves the structure that matters.

uv run examples/chunk-a-document.py

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = ["kaos-nlp-core>=0.1.6,<0.2"]
# ///
"""Chunk a document for retrieval — by sentence or by section.

Before you embed or search a document, you split it into chunks. `kaos-nlp-core`
offers several strategies; the right one preserves meaning. `SentenceChunker`
packs whole sentences up to a token budget (never splitting mid-sentence);
`SectionChunker` respects the document's heading structure. Each chunk carries a
char span back to the source, so retrieval results stay traceable. Deterministic,
offline, no API key.

Run it:

    uv run examples/chunk-a-document.py
"""

from __future__ import annotations

from kaos_nlp_core.chunking import SectionChunker, SentenceChunker

LEASE = (
    "ARTICLE I. TERM. The lease term is five years commencing on January 1. "
    "Rent is due monthly on the first business day. "
    "ARTICLE II. MAINTENANCE. The tenant shall maintain the premises in good repair. "
    "No pets are allowed without written consent of the landlord."
)


def main() -> tuple[int, int]:
    # Sentence chunks: whole sentences packed to a token budget.
    sentence_chunks = SentenceChunker(max_tokens=16).chunk(LEASE)
    print(f"SentenceChunker -> {len(sentence_chunks)} chunk(s):")
    for c in sentence_chunks:
        print(f"  [{c.char_span[0]:>3}:{c.char_span[1]:<3}] {c.text}")

    # Section chunks: split on the document's structure (ARTICLE headings).
    section_chunks = SectionChunker().chunk(LEASE)
    print(f"\nSectionChunker -> {len(section_chunks)} chunk(s):")
    for c in section_chunks:
        print(f"  {c.text[:64]}...")

    return len(sentence_chunks), len(section_chunks)


if __name__ == "__main__":
    n_sentence, n_section = main()
    # Multiple sentence chunks (the doc exceeds one 16-token budget)...
    assert n_sentence >= 2
    # ...and every chunk traces back to the source via its char span.
    for c in SentenceChunker(max_tokens=16).chunk(LEASE):
        assert LEASE[c.char_span[0] : c.char_span[1]] == c.text

Notes

SentenceChunker(max_tokens=...) packs whole sentences up to a token budget (never splitting mid-sentence); SectionChunker splits on heading structure. There are also fixed-token, paragraph, and hierarchical chunkers.
Every Chunk carries a char_span into the source text — so a retrieved chunk traces back to exactly where it came from, the basis for grounded results.
overlap_sentences=N adds sentence overlap between chunks when you want context to bleed across boundaries.
Chunking is the step before embeddings and retrieval — see how they compose in the research agent.