Cluster a document corpus

Goal: organize a pile of documents into topics without labels — separate lease clauses from NDA clauses, group similar contracts, triage a corpus before review.

kaos-ml-core provides classical ML over your content. Embed the documents (with the vendored static model, no download), then cluster the vectors with mini-batch k-means. The run is deterministic via a fixed seed and works fully offline.

uv run examples/cluster-documents.py

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = [
#   "kaos-ml-core>=0.1.2,<0.2",
#   "kaos-nlp-transformers[model2vec]>=0.1.5,<0.2",
#   "numpy",
# ]
# ///
"""Cluster documents by topic — embeddings + k-means, offline.

`kaos-ml-core` provides classical ML over your content. Here we embed a handful
of documents (with the vendored static model, no download) and cluster them with
mini-batch k-means — automatically separating lease clauses from NDA clauses.
This is how you organize a corpus without labels. Deterministic via a fixed seed.

Run it:

    uv run examples/cluster-documents.py
"""

from __future__ import annotations

import os

os.environ.setdefault("KAOS_NLP_TRANSFORMERS_OFFLINE", "1")

import numpy as np  # noqa: E402
import kaos_nlp_transformers as knt  # noqa: E402
from kaos_ml_core.cluster import minibatch_kmeans  # noqa: E402

DOCS = [
    "The lease term is five years with rent due monthly.",  # lease
    "Tenant pays rent each month under the lease agreement.",  # lease
    "Confidential information must be protected for three years.",  # nda
    "The receiving party shall keep all confidential data secret.",  # nda
]


def main() -> list[int]:
    model = knt.EmbeddingModel.load("minishlab/potion-base-8M")
    features = np.asarray(model.embed(DOCS), dtype=np.float32)

    result = minibatch_kmeans(features, n_clusters=2, random_state=0)
    labels = result.labels.tolist()

    print("document -> cluster:\n")
    # strict=True raises if the clusterer returns a different number of labels.
    for doc, label in zip(DOCS, labels, strict=True):
        print(f"  [{label}]  {doc[:48]}...")
    return labels


if __name__ == "__main__":
    labels = main()
    # The grouping is stable: the two lease docs land together, the two NDA docs
    # land together, and the topics separate. (Cluster *ids* may vary; grouping
    # doesn't.)
    assert labels[0] == labels[1], "lease docs should cluster together"
    assert labels[2] == labels[3], "NDA docs should cluster together"
    assert labels[0] != labels[2], "lease and NDA should be different clusters"
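
For readers curious about what the clustering step does with the feature matrix, here is a minimal NumPy sketch of mini-batch k-means. It is an illustration under stated assumptions, not the kaos-ml-core implementation: the function name sketch_minibatch_kmeans, its parameters, and the toy data are all hypothetical. The idea is to update the centers from small random batches, using a per-center running mean, instead of rescanning every document on each iteration.

```python
import numpy as np


def sketch_minibatch_kmeans(
    features: np.ndarray,
    n_clusters: int,
    batch_size: int = 32,
    n_iters: int = 100,
    seed: int = 0,
) -> tuple[np.ndarray, np.ndarray]:
    """Return (centers, labels) from a simple mini-batch k-means run.

    Hypothetical helper for illustration; not part of kaos-ml-core.
    """
    rng = np.random.default_rng(seed)
    n_samples = features.shape[0]

    # Initialise the centers from distinct, randomly chosen samples.
    init_idx = rng.choice(n_samples, size=n_clusters, replace=False)
    centers = features[init_idx].astype(np.float64)
    counts = np.zeros(n_clusters, dtype=np.int64)

    for _ in range(n_iters):
        # Draw a small random batch instead of scanning the whole corpus.
        batch = features[rng.integers(0, n_samples, size=batch_size)]

        # Assign each batch point to its nearest center (squared Euclidean).
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)

        # Move each center toward its assigned points with a step size of
        # 1 / (points seen so far by that center), i.e. a running mean.
        for point, k in zip(batch, nearest):
            counts[k] += 1
            centers[k] += (point - centers[k]) / counts[k]

    # Final pass: label every sample by its nearest center.
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)


# Toy data (made up for this sketch): two tight, well-separated blobs.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])
centers, labels = sketch_minibatch_kmeans(points, n_clusters=2)
print(labels[:20], labels[20:])
```

The fixed seed plays the same role as random_state=0 in the script: the same input and seed give the same centers and labels on every run.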

What to notice

It composes two packages: embeddings (kaos-nlp-transformers) produce the feature matrix; kaos-ml-core clusters it. That’s the KAOS pattern — small packages that snap together.
With random_state=0 the grouping is stable across runs. Cluster ids are arbitrary and may swap, so check which documents share a cluster, not which id they receive.
kaos-ml-core also does feature extraction over the content AST, LLM-assisted labeling, and logistic-regression classification — the supervised path once you have labels.
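
Because cluster ids are arbitrary, comparing two label lists element by element can report a difference even when the grouping is identical. One way to compare groupings instead is sketched below, using only the standard library. The helper names grouping and same_grouping are hypothetical and not part of any KAOS package.

```python
from collections import defaultdict
from collections.abc import Sequence


def grouping(labels: Sequence[int]) -> frozenset[frozenset[int]]:
    """Turn a label list into a set of index groups, discarding the ids."""
    members: dict[int, set[int]] = defaultdict(set)
    for index, label in enumerate(labels):
        members[label].add(index)
    return frozenset(frozenset(group) for group in members.values())


def same_grouping(a: Sequence[int], b: Sequence[int]) -> bool:
    """True when two labelings split the documents identically."""
    return len(a) == len(b) and grouping(a) == grouping(b)


# Swapped ids, same grouping.
print(same_grouping([0, 0, 1, 1], [1, 1, 0, 0]))  # True
# Same ids in use, but the documents are grouped differently.
print(same_grouping([0, 0, 1, 1], [0, 1, 0, 1]))  # False
```

This is the same idea as the asserts at the end of the script, which test whether documents share a cluster rather than which id they were given.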