
Cluster a document corpus

Goal: organize a pile of documents into topics without labels — separate lease clauses from NDA clauses, group similar contracts, triage a corpus before review.

kaos-ml-core provides classical ML over your content. Embed the documents with the vendored static model (no download needed), then cluster the resulting vectors with mini-batch k-means. The run is fully offline and, with a fixed seed, deterministic.

Run it:

uv run examples/cluster-documents.py

examples/cluster-documents.py:
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "kaos-ml-core>=0.1.2,<0.2",
# "kaos-nlp-transformers[model2vec]>=0.1.5,<0.2",
# "numpy",
# ]
# ///
"""Cluster documents by topic — embeddings + k-means, offline.
`kaos-ml-core` provides classical ML over your content. Here we embed a handful
of documents (with the vendored static model, no download) and cluster them with
mini-batch k-means — automatically separating lease clauses from NDA clauses.
This is how you organize a corpus without labels. Deterministic via a fixed seed.
Run it:
uv run examples/cluster-documents.py
"""
from __future__ import annotations
import os
os.environ.setdefault("KAOS_NLP_TRANSFORMERS_OFFLINE", "1")
import numpy as np # noqa: E402
import kaos_nlp_transformers as knt # noqa: E402
from kaos_ml_core.cluster import minibatch_kmeans # noqa: E402
DOCS = [
    "The lease term is five years with rent due monthly.",  # lease
    "Tenant pays rent each month under the lease agreement.",  # lease
    "Confidential information must be protected for three years.",  # nda
    "The receiving party shall keep all confidential data secret.",  # nda
]


def main() -> list[int]:
    model = knt.EmbeddingModel.load("minishlab/potion-base-8M")
    features = np.asarray(model.embed(DOCS), dtype=np.float32)
    result = minibatch_kmeans(features, n_clusters=2, random_state=0)
    labels = result.labels.tolist()
    print("document -> cluster:\n")
    for doc, label in zip(DOCS, labels):
        print(f"  [{label}] {doc[:48]}...")
    return labels


if __name__ == "__main__":
    labels = main()
    # The grouping is stable: the two lease docs land together, the two NDA docs
    # land together, and the topics separate. (Cluster *ids* may vary; grouping
    # doesn't.)
    assert labels[0] == labels[1], "lease docs should cluster together"
    assert labels[2] == labels[3], "NDA docs should cluster together"
    assert labels[0] != labels[2], "lease and NDA should be different clusters"
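
For intuition about the clustering step in the script above, here is a minimal NumPy sketch of the general mini-batch k-means idea: sample a small batch, assign each row to its nearest center, and nudge that center toward the row with a per-center learning rate of 1/count. This is an illustration only, not kaos-ml-core's implementation; the function name `minibatch_kmeans_sketch` and its `batch_size`, `n_iter`, and `seed` parameters are hypothetical, and the toy 2-D points stand in for real embeddings.

```python
import numpy as np


def minibatch_kmeans_sketch(X, n_clusters, batch_size=3, n_iter=50, seed=0):
    """Illustrative mini-batch k-means (hypothetical helper, not the kaos-ml-core API)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=np.float64)
    # Initialise the centers from distinct rows chosen at random.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    counts = np.zeros(n_clusters)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch row to its nearest center (squared Euclidean distance).
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        for row, c in zip(batch, nearest):
            counts[c] += 1
            # Learning rate 1/count keeps each center a running mean of its rows.
            centers[c] += (row - centers[c]) / counts[c]
    # Final pass: label every row by its nearest center.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centers


# Toy stand-in for embeddings: two well-separated pairs of points.
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = minibatch_kmeans_sketch(X, n_clusters=2)
print(labels.tolist())
```

As in the script, the meaningful result is which rows share a label; which pair receives id 0 depends on the random initialisation.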

What to notice

  • It composes two packages: embeddings (kaos-nlp-transformers) produce the feature matrix; kaos-ml-core clusters it. That’s the KAOS pattern — small packages that snap together.
  • The grouping is stable across runs (random_state=0). The cluster ids themselves are arbitrary and may swap, so assert on which documents share a cluster, not on the id values.
  • kaos-ml-core also does feature extraction over the content AST, LLM-assisted labeling, and logistic-regression classification — the supervised path once you have labels.
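
Because cluster ids can swap, comparing two label lists element by element can fail even when the grouping is identical. One way to compare groupings, sketched below under the assumption that labels are plain lists of ints, is to renumber ids by order of first appearance before comparing. The helper names `canonical_labels` and `same_grouping` are hypothetical and not part of kaos-ml-core.

```python
from __future__ import annotations


def canonical_labels(labels: list[int]) -> list[int]:
    """Renumber cluster ids by order of first appearance (hypothetical helper)."""
    mapping: dict[int, int] = {}
    renumbered = []
    for label in labels:
        if label not in mapping:
            # The first unseen id becomes 0, the next becomes 1, and so on.
            mapping[label] = len(mapping)
        renumbered.append(mapping[label])
    return renumbered


def same_grouping(a: list[int], b: list[int]) -> bool:
    """True when both label lists put the same items together, whatever the ids."""
    return canonical_labels(a) == canonical_labels(b)


print(canonical_labels([1, 1, 0, 0]))             # [0, 0, 1, 1]
print(same_grouping([1, 1, 0, 0], [0, 0, 1, 1]))  # True
print(same_grouping([0, 1, 0, 1], [0, 0, 1, 1]))  # False
```

This makes it possible to compare a whole run against a reference grouping, rather than writing pairwise assertions as the four-document example does.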