Embed text for semantic search
Goal: turn text into dense vectors so you can measure semantic similarity (meaning, not just word overlap) — the basis of semantic search and clustering.
kaos-nlp-transformers produces embeddings. Most models download on first use, but the
vendored static model minishlab/potion-base-8M (the [model2vec] extra) loads with no
download, so this example runs offline and deterministically.
uv run examples/embeddings.py

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = ["kaos-nlp-transformers[model2vec]>=0.1.5,<0.2", "numpy"]
# ///
"""Embed text and measure semantic similarity: offline, no download.

`kaos-nlp-transformers` produces dense embeddings for semantic search and
clustering. Most models download on first use, but the vendored static model
`minishlab/potion-base-8M` (the `[model2vec]` extra) loads with **no download**,
so this example runs offline and deterministically.

Run it:

    uv run examples/embeddings.py
"""

from __future__ import annotations

import os

# Force offline so no network model fetch is attempted.
os.environ.setdefault("KAOS_NLP_TRANSFORMERS_OFFLINE", "1")

import numpy as np  # noqa: E402
import kaos_nlp_transformers as knt  # noqa: E402

SENTENCES = [
    "Rent is due monthly on the first.",  # 0
    "The tenant pays rent every month.",  # 1 (similar to 0)
    "The patent covers a novel circuit design.",  # 2 (unrelated)
]


def cosine(a, b) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def main() -> tuple[float, float]:
    model = knt.EmbeddingModel.load("minishlab/potion-base-8M")
    vectors = model.embed(SENTENCES)
    print(f"embedded {len(SENTENCES)} sentences -> {vectors.shape[1]}-dim vectors\n")

    sim_related = cosine(vectors[0], vectors[1])
    sim_unrelated = cosine(vectors[0], vectors[2])
    print(f" sim('rent monthly', 'pays rent every month') = {sim_related:.3f}")
    print(f" sim('rent monthly', 'patent circuit design') = {sim_unrelated:.3f}")
    return sim_related, sim_unrelated


if __name__ == "__main__":
    related, unrelated = main()
    # Robust semantic check: the related pair is more similar than the unrelated one.
    assert related > unrelated, f"expected related > unrelated, got {related} vs {unrelated}"

Notes
- The related sentences score high (~0.74); the unrelated one scores near zero — that’s the semantic signal lexical BM25 can miss.
- For larger / more accurate models (BAAI/bge-small-en-v1.5, potion-base-32M), pre-warm the cache once with kaos-nlp-transformers prefetch, then set KAOS_NLP_TRANSFORMERS_OFFLINE=1.
- kaos-nlp-transformers also ships cross-encoder reranking, NLI, zero-shot NER, and PII detection, all over the same offline-capable ONNX runtime.
- Models are license-vetted and SHA-pinned in a registry; bypass it only with KAOS_NLP_TRANSFORMERS_ALLOW_UNREGISTERED=true (and your own license review).
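
The script above compares only two pairs of sentences. Semantic search applies the same cosine score to every stored vector and keeps the best matches. The following is a minimal sketch of that ranking step, not part of the library: rank, DOC_VECS, and QUERY_VEC are hypothetical names, and the three-dimensional lists are made-up stand-ins for real embeddings so the sketch runs without any model or numpy. In practice the vectors would presumably come from model.embed(...) as in the script.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank(query_vec: list[float], doc_vecs: list[list[float]], top_k: int = 2):
    """Return (index, score) pairs for the top_k documents, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy vectors standing in for real embeddings (assumption:
# real ones would be the rows returned by the embedding model).
DOC_VECS = [
    [0.9, 0.1, 0.0],  # doc 0: points almost the same way as the query
    [0.8, 0.3, 0.1],  # doc 1: close, but slightly further off
    [0.0, 0.2, 0.9],  # doc 2: nearly orthogonal to the query
]
QUERY_VEC = [1.0, 0.2, 0.0]

for idx, score in rank(QUERY_VEC, DOC_VECS):
    print(f"doc {idx}: {score:.3f}")
```

A linear scan like this is fine for a few thousand vectors. Larger collections usually normalize the vectors once and use a matrix product or an approximate nearest-neighbor index instead.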