Research over a corpus
Goal: answer a question over a set of documents with a verifiable citation — and refuse when the corpus doesn’t support an answer. This is the skeleton of what a KAOS research agent does, built from primitives you’ve already seen: BM25 retrieval and grounded verification.
It runs fully offline and deterministic — a production agent swaps the deterministic “answer” step for an LLM (offline via FunctionClient), but the retrieve → ground → refuse contract is identical.
uv run examples/research-over-corpus.py GROUNDED: 'Rent is due monthly' (from doc 0) REFUSED: the supporting quote does not appear in the retrieved source.#!/usr/bin/env -S uv run --script# /// script# requires-python = ">=3.13"# dependencies = ["kaos-nlp-core>=0.1.6,<0.2", "kaos-llm-core>=0.1.12,<0.2"]# ///"""The skeleton of a grounded research agent: retrieve, answer, verify — or refuse.
A research agent over a document corpus does three things: it RETRIEVES therelevant source for a question (BM25), ANSWERS from it, and VERIFIES the answer'scitation against the source — and if no source is relevant, it REFUSES instead ofguessing. This example wires those verified primitives together over a tinycorpus, fully offline and deterministic.
A production agent uses an LLM for the answer step (offline via FunctionClient);here we keep it deterministic to show the retrieve -> ground -> refuse contractwithout any model at all.
Run it:
uv run examples/research-over-corpus.py"""
from __future__ import annotations
from kaos_llm_core.signatures.grounding import Spanfrom kaos_nlp_core.search import Searcher
# A tiny synthetic corpus (no licensing risk).CORPUS = { 0: "Master Lease. The lease term is five years. Rent is due monthly on the first.", 1: "Mutual NDA. Confidential Information must be protected for three years.", 2: "Services Agreement. The vendor shall deliver the software by the milestone dates.",}RECORDS = [{"id": i, "text": t} for i, t in CORPUS.items()]
def answer(question: str, searcher: Searcher, quote: str) -> str: """Retrieve the best source, then verify the supporting quote against it. Returns a grounded answer, or a refusal when nothing relevant is found.""" hits = searcher.search(question, top_k=1) if not hits: return "REFUSED: no source in the corpus supports that question."
source = CORPUS[hits[0].doc_id] start = max(source.find(quote), 0) span = Span(source_uri=str(hits[0].doc_id), quote=quote, char_span=(start, start + len(quote))) if not span.verify(source): return "REFUSED: the supporting quote does not appear in the retrieved source." return f"GROUNDED: {quote!r} (from doc {hits[0].doc_id})"
def main() -> list[str]: searcher = Searcher.from_documents(RECORDS) results = [ # A question the corpus supports, with a real quote -> grounded. answer("when is rent due", searcher, quote="Rent is due monthly"), # A question nothing in the corpus addresses -> refuse. answer("what are the patent infringement damages", searcher, quote="patent damages"), ] for r in results: print(f" {r}") return results
if __name__ == "__main__": results = main() assert results[0].startswith("GROUNDED"), results[0] assert results[1].startswith("REFUSED"), results[1]What to notice
- Retrieve, then ground. The agent finds the most relevant source with BM25, then verifies the answer’s quote against that source. Both steps are separate, so a wrong citation can’t slip through.
- Refuse, don’t guess. When nothing supports the answer — no relevant source, or a quote that doesn’t verify — the result is a typed refusal, not a fabricated answer. See the refusal contract.
- This is the agent, minus the LLM. The full
ResearchAgentadds memory, an LLM answer step, and multi-turn planning on top of exactly this contract — which is why the contract, not the model, is what makes the answers trustworthy.