Embed text for semantic search

Goal: turn text into dense vectors so you can measure semantic similarity (meaning, not just word overlap) — the basis of semantic search and clustering.

kaos-nlp-transformers produces embeddings. Most models download on first use, but the vendored static model minishlab/potion-base-8M (the [model2vec] extra) loads with no download, so this example runs offline and deterministically.

uv run examples/embeddings.py

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = ["kaos-nlp-transformers[model2vec]>=0.1.5,<0.2", "numpy"]
# ///
"""Embed text and measure semantic similarity — offline, no download.

`kaos-nlp-transformers` produces dense embeddings for semantic search and
clustering. Most models download on first use, but the vendored static model
`minishlab/potion-base-8M` (the `[model2vec]` extra) loads with **no download**,
so this example runs offline and deterministically.

Run it:

    uv run examples/embeddings.py
"""

from __future__ import annotations

import os

# Force offline so no network model fetch is attempted.
os.environ.setdefault("KAOS_NLP_TRANSFORMERS_OFFLINE", "1")

import numpy as np  # noqa: E402
import kaos_nlp_transformers as knt  # noqa: E402

SENTENCES = [
    "Rent is due monthly on the first.",          # 0
    "The tenant pays rent every month.",          # 1  (similar to 0)
    "The patent covers a novel circuit design.",  # 2  (unrelated)
]


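def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))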


def main() -> tuple[float, float]:
    model = knt.EmbeddingModel.load("minishlab/potion-base-8M")
    vectors = model.embed(SENTENCES)
    print(f"embedded {len(SENTENCES)} sentences -> {vectors.shape[1]}-dim vectors\n")

    sim_related = cosine(vectors[0], vectors[1])
    sim_unrelated = cosine(vectors[0], vectors[2])
    print(f"  sim('rent monthly', 'pays rent every month') = {sim_related:.3f}")
    print(f"  sim('rent monthly', 'patent circuit design')  = {sim_unrelated:.3f}")
    return sim_related, sim_unrelated


if __name__ == "__main__":
    related, unrelated = main()
    # Robust semantic check: the related pair is more similar than the unrelated one.
    assert related > unrelated, f"expected related > unrelated, got {related} vs {unrelated}"

Notes

The related pair (sentences 0 and 1) scores high (~0.74), while the unrelated pair (sentences 0 and 2) scores near zero. That gap is the semantic signal that lexical BM25 can miss.
For larger, more accurate models (BAAI/bge-small-en-v1.5, potion-base-32M), pre-warm the cache once with kaos-nlp-transformers prefetch while you are online, then set KAOS_NLP_TRANSFORMERS_OFFLINE=1 for later runs.
kaos-nlp-transformers also ships cross-encoder reranking, NLI, zero-shot NER, and PII detection — all over the same offline-capable ONNX runtime.
Models are license-vetted and SHA-pinned in a registry; bypass it only with KAOS_NLP_TRANSFORMERS_ALLOW_UNREGISTERED=true (and your own license review).
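
From pairwise similarity to search

The script above compares two pairs of sentences. For semantic search you usually score one query against every document vector and keep the best matches. The sketch below shows that ranking step using NumPy only. The function name top_k and the hand-written 3-dim vectors are hypothetical stand-ins for illustration; they are not part of kaos-nlp-transformers and are not real model output.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the k rows closest to the query."""
    # L2-normalize both sides so a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # argsort sorts ascending, so reverse it and keep the first k indices.
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical 3-dim stand-ins for real embeddings (assumption: chosen by hand).
docs = np.array([
    [0.9, 0.1, 0.0],  # 0: close to the query
    [0.8, 0.3, 0.1],  # 1: also close, slightly less so
    [0.0, 0.1, 0.9],  # 2: points in a different direction
])
query = np.array([1.0, 0.2, 0.0])

for idx, score in top_k(query, docs, k=2):
    print(f"doc {idx}: cosine = {score:.3f}")
```

With real embeddings, doc_vecs would be the array returned by model.embed(...) for your corpus and query_vec the single row produced by embedding the query text. This assumes embed returns a 2-D NumPy array, which the .shape[1] access and row indexing in the script suggest but do not guarantee. Normalizing once and using a matrix product avoids a Python loop over documents, which matters as the corpus grows.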