Provenance & the producer contract

KAOS’s ingestion packages — kaos-pdf, kaos-office, kaos-web, kaos-tabular, kaos-source — share one contract: they are producers of the document model. Each turns its format into a ContentDocument (or TabularDocument) carrying provenance.

Extracted nodes record their origin — page number, bounding box, confidence, source URI. So when an answer later cites a fact, the citation can point not just at a document but at the exact place in it (a block ref) — and that citation can be verified.
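As a rough illustration of how a block ref makes a citation checkable, here is a minimal Python sketch. The names `Provenance`, `Block`, `Citation`, and `verify_citation` are hypothetical and are not taken from the KAOS API; only the recorded fields (page number, bounding box, confidence, source URI) come from the description above, and the sample document values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Where an extracted node came from (hypothetical shape)."""
    source_uri: str
    page: Optional[int] = None  # page number, if the format has pages
    bbox: Optional[tuple[float, float, float, float]] = None  # (x0, y0, x1, y1)
    confidence: Optional[float] = None  # extractor confidence in [0, 1]


@dataclass(frozen=True)
class Block:
    """One extracted node, addressable by a block ref."""
    ref: str
    text: str
    provenance: Provenance


@dataclass(frozen=True)
class Citation:
    """A claim that `quote` appears in the block named by `block_ref`."""
    block_ref: str
    quote: str


def verify_citation(blocks: dict[str, Block], citation: Citation) -> Optional[Provenance]:
    """Return the cited block's provenance if the quote is really there, else None."""
    block = blocks.get(citation.block_ref)
    if block is None or citation.quote not in block.text:
        return None
    return block.provenance


# Made-up sample data, for illustration only.
blocks = {
    "b7": Block(
        ref="b7",
        text="Net revenue rose 12% in fiscal 2023.",
        provenance=Provenance(
            source_uri="file:///reports/annual.pdf",
            page=14,
            bbox=(72.0, 540.0, 523.0, 556.0),
            confidence=0.98,
        ),
    )
}

origin = verify_citation(blocks, Citation(block_ref="b7", quote="rose 12%"))
bad_quote = verify_citation(blocks, Citation(block_ref="b7", quote="fell 12%"))
bad_ref = verify_citation(blocks, Citation(block_ref="b99", quote="rose 12%"))
```

In this sketch a citation is accepted only when the block ref resolves and the quoted text is a literal substring of that block; a real verifier would likely need to tolerate whitespace and normalization differences.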

Because every extractor emits the same shape, downstream code is written once. Search, chunking, dedup, LLM programs, agents, and citation verification don’t branch on “was this a PDF or a web page” — they operate on the AST. Adding a new format means writing a new producer; nothing downstream changes.
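The "written once" property can be sketched with a structural interface. In the Python sketch below, `Producer`, `PlainTextProducer`, `CsvRowProducer`, `Node`, and `word_count` are hypothetical names chosen for illustration, and the `ContentDocument` here is a toy stand-in rather than the real KAOS class. The point is only that the downstream function never asks which format it was given.

```python
import csv
import io
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Node:
    text: str
    source_uri: str  # minimal provenance: where the text came from


@dataclass
class ContentDocument:
    """Toy stand-in for the shared document model."""
    nodes: list[Node]


class Producer(Protocol):
    """Anything that turns raw input of some format into a ContentDocument."""
    def produce(self, uri: str, raw: str) -> ContentDocument: ...


class PlainTextProducer:
    """Toy producer: one node per non-empty line."""
    def produce(self, uri: str, raw: str) -> ContentDocument:
        lines = (line.strip() for line in raw.splitlines())
        return ContentDocument([Node(line, uri) for line in lines if line])


class CsvRowProducer:
    """Toy producer: one node per non-empty CSV row, cells joined by spaces."""
    def produce(self, uri: str, raw: str) -> ContentDocument:
        rows = csv.reader(io.StringIO(raw))
        return ContentDocument([Node(" ".join(row), uri) for row in rows if row])


def word_count(doc: ContentDocument) -> int:
    """Downstream code: written once, with no branch on the source format."""
    return sum(len(node.text.split()) for node in doc.nodes)


producers: list[tuple[Producer, str, str]] = [
    (PlainTextProducer(), "mem://a.txt", "alpha beta\n\ngamma"),
    (CsvRowProducer(), "mem://b.csv", "x,y\n1,2\n"),
]
counts = [word_count(p.produce(uri, raw)) for p, uri, raw in producers]
```

Supporting a further format in this sketch means adding one more class with a `produce` method; `word_count` stays untouched.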

Provenance combined with a uniform shape is what makes KAOS trustworthy for legal and financial work: you can always trace an output back to its source, byte for byte, regardless of the format that source arrived in.