Extract entities with a local NER model

Goal: turn unstructured text into structured records (the people, organizations, amounts, and dates in a document) without an LLM or API key. kaos-nlp-transformers ships a zero-shot NER extractor (GLiNER) that runs locally on a small ONNX model. There is nothing to train: you name the labels you want and it returns the matching spans.

uv run examples/extract-entities.py

           date: 'January 5, 2026'  (0.97)
   organization: 'Acme Corporation'  (0.99)
          money: '$2,500,000'  (0.95)
         person: 'Jane Doe'  (0.99)

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = ["kaos-nlp-transformers>=0.1.5,<0.2"]
# ///
"""Extract entities from text with a local NER model — people, orgs, money, dates.

`kaos-nlp-transformers` ships a zero-shot NER extractor (GLiNER) that pulls typed
entities out of text *without an LLM or API key* — you just name the labels you
want. It runs locally on a small ONNX model. This is the offline
information-extraction backbone for building databases from documents.

Model note: the first run downloads the ONNX model (~tens of MB) from Hugging
Face and caches it; subsequent runs are offline. (To pre-warm a cache for CI or
air-gapped use, see how-to/prefetch-models.)

Run it:

    uv run examples/extract-entities.py
"""

from __future__ import annotations

import kaos_nlp_transformers as knt

TEXT = (
    "On January 5, 2026, Acme Corporation paid $2,500,000 to Jane Doe "
    "to settle the matter under the Master Services Agreement."
)
LABELS = ["person", "organization", "money", "date"]


def main() -> dict[str, str]:
    extractor = knt.GLiNERExtractor.load()
    # extract() takes a batch of texts and returns a list of entity lists.
    entities = extractor.extract([TEXT], labels=LABELS)[0]

    print(f"entities in:\n  {TEXT!r}\n")
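    # Keep one span per label for the checks below; a later entity with the
    # same label overwrites an earlier one.
    found: dict[str, str] = {}
    for e in entities:
        print(f"  {e.label:>13}: {e.text!r}  ({e.score:.2f})")
        found[e.label] = e.text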
    return found


if __name__ == "__main__":
    found = main()
    # The model reliably pulls the org, the amount, the person, and the date.
    assert found.get("organization") == "Acme Corporation"
    assert "2,500,000" in found.get("money", "")
    assert found.get("person") == "Jane Doe"
    assert "2026" in found.get("date", "")

What to notice

Zero-shot. You pass labels=[...] — any labels, no fine-tuning. Need court, statute, product? Add them to the list.
Local and private. It runs on a local ONNX model; the text never leaves the machine. The first run downloads the model (~tens of MB) and caches it — pre-warm it for CI or air-gapped use with prefetch-models.
Deterministic enough to build on. The extractor returns high-confidence spans with offsets, so you can feed them straight into a complaint database, a knowledge graph, or a structured-extraction step.
This is the offline complement to LLM extraction: use NER for the entities a model recognizes out of the box, and a typed Call for bespoke, schema-shaped fields.
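When the spans feed a database, a document will often contain several entities with the same label, and the script's found dict keeps only one of them. The sketch below shows one possible way to keep them all while dropping low-confidence spans. It is an illustration, not part of the library: Entity is a hypothetical stand-in that exposes only the three fields the script reads (label, text, score), group_by_label is a made-up helper name, and the 0.5 cutoff is an arbitrary assumption to tune against your own data.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in with only the fields the example script reads."""

    label: str
    text: str
    score: float


def group_by_label(entities: list[Entity], min_score: float = 0.5) -> dict[str, list[str]]:
    """Collect every span per label, skipping spans scored below min_score."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for e in entities:
        if e.score >= min_score:
            grouped[e.label].append(e.text)
    return dict(grouped)


# Made-up sample data standing in for one document's extraction result.
sample = [
    Entity("person", "Jane Doe", 0.99),
    Entity("person", "John Roe", 0.42),  # below the cutoff, so it is dropped
    Entity("organization", "Acme Corporation", 0.99),
    Entity("organization", "Globex", 0.91),
]
records = group_by_label(sample, min_score=0.5)
print(records)
# {'person': ['Jane Doe'], 'organization': ['Acme Corporation', 'Globex']}
```

In the script above, the same helper could presumably take the entities list returned by extract() in place of sample, since it relies only on the label, text, and score attributes.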