Skip to content

Search text with BM25

Goal: find the most relevant sentences in a block of text for a query — the same lexical retrieval agents use to assemble context. No model, no key, deterministic.

Use kaos-nlp-core’s search_sentences(text, query), which segments the text and ranks the sentences with BM25.

Terminal window
uv run examples/bm25-search.py
examples/bm25-search.py
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = ["kaos-nlp-core>=0.1.6,<0.2"]
# ///
"""Rank sentences by relevance with BM25 — fast, offline retrieval.
`kaos-nlp-core` is a Rust-backed NLP engine. `search_sentences` segments a
block of text into sentences and ranks them against a query with BM25 — the
classic lexical retrieval algorithm agents use to pull relevant context out
of a corpus. No model, no key, fully deterministic.
Run it:
uv run examples/bm25-search.py
"""
from __future__ import annotations
from kaos_nlp_core.search import search_sentences
TEXT = (
"The lease term is five years. "
"Rent is due monthly on the first. "
"The tenant may renew for an additional term. "
"Late rent incurs a five percent fee."
)
def main() -> list:
hits = search_sentences(TEXT, "rent", top_k=3)
print(f'Top matches for "rent":\n')
for h in hits:
# Each hit carries the matched sentence, its BM25 score, and the
# character span it occupies in the source text.
print(f" {h.score:.3f} {h.text!r} (chars {h.start}-{h.end})")
return hits
if __name__ == "__main__":
hits = main()
assert hits, "expected at least one hit"
assert "Rent" in hits[0].text, f"unexpected top hit: {hits[0].text!r}"

Notes

  • search_sentences returns SegmentHits with the matched text, a BM25 score, and the start/end character span in the source.
  • For multi-document corpora, use Searcher.from_documents(...) instead.
  • BM25 is KAOS’s default retrieval strategy in agents — see the concept page on why plain BM25 beats fancier schemes cross-domain (landing soon).