Skip to content

Research over a corpus

Goal: answer a question over a set of documents with a verifiable citation — and refuse when the corpus doesn’t support an answer. This is the skeleton of what a KAOS research agent does, built from primitives you’ve already seen: BM25 retrieval and grounded verification.

It runs fully offline and deterministic — a production agent swaps the deterministic “answer” step for an LLM (offline via FunctionClient), but the retrieve → ground → refuse contract is identical.

Terminal window
uv run examples/research-over-corpus.py
GROUNDED: 'Rent is due monthly' (from doc 0)
REFUSED: the supporting quote does not appear in the retrieved source.
examples/research-over-corpus.py
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = ["kaos-nlp-core>=0.1.6,<0.2", "kaos-llm-core>=0.1.12,<0.2"]
# ///
"""The skeleton of a grounded research agent: retrieve, answer, verify — or refuse.
A research agent over a document corpus does three things: it RETRIEVES the
relevant source for a question (BM25), ANSWERS from it, and VERIFIES the answer's
citation against the source — and if no source is relevant, it REFUSES instead of
guessing. This example wires those verified primitives together over a tiny
corpus, fully offline and deterministic.
A production agent uses an LLM for the answer step (offline via FunctionClient);
here we keep it deterministic to show the retrieve -> ground -> refuse contract
without any model at all.
Run it:
uv run examples/research-over-corpus.py
"""
from __future__ import annotations
from kaos_llm_core.signatures.grounding import Span
from kaos_nlp_core.search import Searcher
# A tiny synthetic corpus (no licensing risk).
CORPUS = {
0: "Master Lease. The lease term is five years. Rent is due monthly on the first.",
1: "Mutual NDA. Confidential Information must be protected for three years.",
2: "Services Agreement. The vendor shall deliver the software by the milestone dates.",
}
RECORDS = [{"id": i, "text": t} for i, t in CORPUS.items()]
def answer(question: str, searcher: Searcher, quote: str) -> str:
"""Retrieve the best source, then verify the supporting quote against it.
Returns a grounded answer, or a refusal when nothing relevant is found."""
hits = searcher.search(question, top_k=1)
if not hits:
return "REFUSED: no source in the corpus supports that question."
source = CORPUS[hits[0].doc_id]
start = max(source.find(quote), 0)
span = Span(source_uri=str(hits[0].doc_id), quote=quote, char_span=(start, start + len(quote)))
if not span.verify(source):
return "REFUSED: the supporting quote does not appear in the retrieved source."
return f"GROUNDED: {quote!r} (from doc {hits[0].doc_id})"
def main() -> list[str]:
searcher = Searcher.from_documents(RECORDS)
results = [
# A question the corpus supports, with a real quote -> grounded.
answer("when is rent due", searcher, quote="Rent is due monthly"),
# A question nothing in the corpus addresses -> refuse.
answer("what are the patent infringement damages", searcher, quote="patent damages"),
]
for r in results:
print(f" {r}")
return results
if __name__ == "__main__":
results = main()
assert results[0].startswith("GROUNDED"), results[0]
assert results[1].startswith("REFUSED"), results[1]

What to notice

  • Retrieve, then ground. The agent finds the most relevant source with BM25, then verifies the answer’s quote against that source. Both steps are separate, so a wrong citation can’t slip through.
  • Refuse, don’t guess. When nothing supports the answer — no relevant source, or a quote that doesn’t verify — the result is a typed refusal, not a fabricated answer. See the refusal contract.
  • This is the agent, minus the LLM. The full ResearchAgent adds memory, an LLM answer step, and multi-turn planning on top of exactly this contract — which is why the contract, not the model, is what makes the answers trustworthy.