Use real data from kl3m

The use-case examples run on small inline samples so they stay offline and deterministic. To run them on real documents at scale, point them at the public kl3m datasets — a large, openly-licensed corpus of legal and financial text (case law, contracts, regulations, SEC filings, and more) published by the ALEA Institute on Hugging Face.

pip install datasets  # or: uv add datasets

from datasets import load_dataset

# Stream the dataset so you don't download it all at once.
ds = load_dataset("alea-institute/kl3m-data-snapshot", split="train", streaming=True)
for row in ds.take(100):
    text = row["text"]  # the document text
    # ...feed `text` into any use-case workflow below.

(Browse the available datasets at huggingface.co/alea-institute and pick the collection that matches your workflow — filings, contracts, regulations, etc.)
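
As a minimal sketch of the streaming pattern above, the helper below pulls a bounded number of usable texts from any iterable of rows. It assumes each row is a dict with a "text" field, as in the snippet above; `iter_texts`, `limit`, and `min_chars` are illustrative names, not part of `datasets` or kl3m. The demo uses in-memory rows so it runs offline.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def iter_texts(
    rows: Iterable[dict[str, Any]],
    limit: int = 100,
    min_chars: int = 1,
) -> Iterator[str]:
    """Yield up to `limit` texts, skipping rows with no usable "text" field."""

    def usable() -> Iterator[str]:
        for row in rows:
            text = row.get("text")
            # Skip rows where the field is missing, not a string, or too short.
            if isinstance(text, str) and len(text.strip()) >= min_chars:
                yield text

    # islice stops pulling from the source once `limit` texts were produced.
    return islice(usable(), limit)


# Hypothetical in-memory rows standing in for streamed dataset rows.
sample_rows = [
    {"text": "This Agreement is made between the parties."},
    {"text": "   "},
    {"id": 3},
    {"text": "SECTION 1. Definitions."},
]
texts = list(iter_texts(sample_rows, limit=2))
```

Because a streamed dataset is itself an iterable of row dicts, something like `iter_texts(ds, limit=100)` should be able to stand in for `ds.take(100)` plus the manual field lookup, assuming the chosen collection exposes a "text" column.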

Every use-case example takes raw document text. Swap the inline sample for a kl3m row and the same typed contract holds:

# e.g. classify real court documents (the litigation-triage use case)
import asyncio
from kaos_llm_core import Call
# ... build the TriageDoc Call as in examples/uc-litigation-triage.py ...
async def main():
    # `await` is only valid inside a coroutine, so the loop lives in one.
    for row in ds.take(100):
        result = await call(text=row["text"])
        print(result.doc_type)

asyncio.run(main())
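
To check the loop's shape offline before wiring in a model, a stand-in call can play the role of the real one. This is a sketch under assumptions: `fake_triage`, `TriageResult`, `classify_rows`, and the keyword rule are hypothetical placeholders, not the KAOS API. Only the `await call(text=...)` shape and the `doc_type` attribute are taken from the snippet above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TriageResult:
    """Hypothetical typed result; only `doc_type` mirrors the snippet above."""
    doc_type: str


async def fake_triage(text: str) -> TriageResult:
    """Stand-in for a real Call: a trivial keyword rule, no model involved."""
    await asyncio.sleep(0)  # yield to the event loop, as a real call would
    label = "complaint" if "plaintiff" in text.lower() else "other"
    return TriageResult(doc_type=label)


async def classify_rows(rows, call):
    """Await `call` once per row, in order, and collect each doc_type."""
    doc_types = []
    for row in rows:
        result = await call(text=row["text"])
        doc_types.append(result.doc_type)
    return doc_types


# Inline rows with the same {"text": ...} shape as a streamed row.
rows = [
    {"text": "Plaintiff alleges breach of contract."},
    {"text": "Quarterly report for the period ended March 31."},
]
doc_types = asyncio.run(classify_rows(rows, fake_triage))
print(doc_types)
```

Swapping `rows` for a streamed slice and `fake_triage` for the real call should leave `classify_rows` unchanged, which is the point of keeping the input contract to raw text.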
  • License-aware by design. kl3m is built from openly-licensed sources, which is why it’s safe to use for training and evaluation — the same care KAOS takes with its vetted model registry.
  • For real-time documents instead of a snapshot, use the kaos-source connectors — SEC EDGAR, Federal Register, and others.
  • Pair a real model (set KAOS_LEARN_LIVE=1 and supply a provider key) with real kl3m documents to run any use case end to end; the offline examples prove the workflow first.