Provenance & the producer contract
KAOS’s ingestion packages (kaos-pdf, kaos-office, kaos-web, kaos-tabular, kaos-source) share one contract: they are producers of the document model. Each turns its format into a ContentDocument (or TabularDocument) carrying provenance.
Provenance: where each piece came from
Extracted nodes record their origin: page number, bounding box, confidence, source URI. So when an answer later cites a fact, the citation can point not just at a document but at the exact place in it (a block ref), and that citation can be verified.
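As a rough illustration of the idea, here is a minimal Python sketch of a node that carries its origin and a check that a citation really points at text in the cited block. The names `Provenance`, `Block`, `ref`, and `verify_citation` are hypothetical and chosen for this example only; they are not taken from the KAOS API, and the real document model may look quite different.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical types for illustration; not the actual KAOS classes.
@dataclass(frozen=True)
class Provenance:
    source_uri: str                       # where the source document lives
    page: Optional[int] = None            # page number, if the format has pages
    bbox: Optional[tuple[float, float, float, float]] = None  # x0, y0, x1, y1
    confidence: Optional[float] = None    # extractor confidence in [0, 1]


@dataclass(frozen=True)
class Block:
    ref: str                # stable block reference a citation can point at
    text: str               # extracted text of this block
    provenance: Provenance  # origin of the block


def verify_citation(blocks: dict[str, Block], ref: str, quote: str) -> bool:
    """Return True if the cited block exists and contains the quoted text."""
    block = blocks.get(ref)
    return block is not None and quote in block.text


blocks = {
    "b1": Block(
        ref="b1",
        text="The term of this agreement is five years.",
        provenance=Provenance(
            source_uri="file:///contracts/example.pdf",
            page=3,
            bbox=(72.0, 540.0, 523.0, 556.0),
            confidence=0.98,
        ),
    )
}

ok = verify_citation(blocks, "b1", "five years")        # quote is in block b1
missing = verify_citation(blocks, "b2", "five years")   # no such block
wrong = verify_citation(blocks, "b1", "ten years")      # block exists, quote does not
```

A substring check is the simplest possible verification; a real verifier would likely normalize whitespace and handle quotes that span several blocks.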
The producer contract
Because every extractor emits the same shape, downstream code is written once. Search, chunking, dedup, LLM programs, agents, and citation verification don’t branch on “was this a PDF or a web page”; they operate on the AST. Adding a new format means writing a new producer; nothing downstream changes.
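The contract can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the `Producer` protocol, the two toy producers, and this stripped-down `ContentDocument` are hypothetical stand-ins, not the interfaces the KAOS packages actually expose. The point is only that the downstream function never inspects which producer built the document.

```python
from dataclasses import dataclass, field
from typing import Protocol


# Hypothetical, simplified document model for illustration only.
@dataclass
class Node:
    text: str
    source_uri: str  # minimal provenance: where the node came from


@dataclass
class ContentDocument:
    nodes: list[Node] = field(default_factory=list)


class Producer(Protocol):
    """Anything that turns raw input into a ContentDocument."""

    def produce(self, raw: str, source_uri: str) -> ContentDocument: ...


class ParagraphProducer:
    """Toy producer: one node per blank-line-separated paragraph."""

    def produce(self, raw: str, source_uri: str) -> ContentDocument:
        parts = [p.strip() for p in raw.split("\n\n") if p.strip()]
        return ContentDocument([Node(p, source_uri) for p in parts])


class LineProducer:
    """Toy producer: one node per non-empty line."""

    def produce(self, raw: str, source_uri: str) -> ContentDocument:
        lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
        return ContentDocument([Node(ln, source_uri) for ln in lines])


def word_count(doc: ContentDocument) -> int:
    """Downstream code: written once, with no branching on source format."""
    return sum(len(node.text.split()) for node in doc.nodes)


doc_a = ParagraphProducer().produce("one two\n\nthree", "file:///a.txt")
doc_b = LineProducer().produce("one two\nthree\n", "https://example.com/b")
counts = (word_count(doc_a), word_count(doc_b))
```

Supporting a third format in this sketch would mean adding one more class with a `produce` method; `word_count` would stay untouched.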
Why it’s foundational
Provenance plus a uniform shape is what makes KAOS trustworthy for legal and financial work: you can always trace an output back to its source, byte for byte, regardless of what format that source arrived in.