One document model
A legal/financial workflow ingests wildly different inputs: PDFs, Word documents, web pages, spreadsheets, email. The naive approach writes separate code for each format — separate parsing, separate search, separate citation logic. It doesn’t scale, and it makes provenance (knowing where a fact came from) nearly impossible.
KAOS makes a different bet: every extractor produces the same document model.
The Block/Inline AST
kaos-content defines a single AST. Block nodes are structural (headings,
paragraphs, lists, tables, figures); Inline nodes are content within a block (text,
bold, links, citations):
flowchart TD
doc["ContentDocument"] --> body["body"]
body --> h["Heading<br/><small>#/body/0</small>"]
body --> p["Paragraph<br/><small>#/body/1</small>"]
body --> tbl["Table<br/><small>#/body/2</small>"]
p --> t1["Text<br/><small>“The rate is ”</small>"]
p --> b1["Bold<br/><small>“5%”</small>"]
p --> c1["Citation<br/><small>→ source span</small>"]
classDef block fill:#eef2ff,stroke:#6366f1,color:#1e1b4b;
classDef inline fill:#fef9c3,stroke:#ca8a04,color:#713f12;
class doc,body,h,p,tbl block;
class t1,b1,c1 inline;
Block nodes (indigo) nest structure; Inline nodes (amber) carry content. Every node has a stable ref.
Every node carries:
- Provenance: where it came from (page, bounding box, confidence), recorded when the node was extracted from a source.
- A stable block reference (like #/body/2): a precise, addressable location.
You saw this in "build a document": a ContentDocument
serializes to Markdown, HTML, text, or JSON, and exposes an outline with each block’s
ref.
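To make the shape concrete, here is a minimal sketch of a Block/Inline tree with provenance and ref lookup. It does not use the kaos-content API: the class names (`Document`, `Heading`, `Paragraph`, `Text`, `Bold`, `Provenance`), the helpers (`plain_text`, `resolve`, `outline`), and the provenance field names are all assumptions chosen for illustration, and the sketch only handles top-level refs of the form #/body/N.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Provenance:
    """Where a node came from in the source (hypothetical field names)."""
    page: int
    bbox: tuple[float, float, float, float]
    confidence: float


@dataclass
class Text:
    value: str


@dataclass
class Bold:
    value: str


Inline = Union[Text, Bold]


@dataclass
class Heading:
    level: int
    inlines: list[Inline]
    provenance: Optional[Provenance] = None


@dataclass
class Paragraph:
    inlines: list[Inline]
    provenance: Optional[Provenance] = None


Block = Union[Heading, Paragraph]


@dataclass
class Document:
    """Illustrative stand-in for a document model, not the real ContentDocument."""
    body: list[Block] = field(default_factory=list)


def plain_text(block: Block) -> str:
    """Concatenate the text of a block's inline children."""
    return "".join(inline.value for inline in block.inlines)


def resolve(doc: Document, ref: str) -> Block:
    """Resolve a ref of the form '#/body/<index>' to a top-level block."""
    prefix = "#/body/"
    if not ref.startswith(prefix):
        raise ValueError(f"unsupported ref: {ref!r}")
    return doc.body[int(ref[len(prefix):])]


def outline(doc: Document) -> list[tuple[str, str, str]]:
    """List (ref, block type, text) for every top-level block."""
    return [
        (f"#/body/{i}", type(block).__name__, plain_text(block))
        for i, block in enumerate(doc.body)
    ]


doc = Document(body=[
    Heading(level=1, inlines=[Text("Terms")]),
    Paragraph(
        inlines=[Text("The rate is "), Bold("5%")],
        provenance=Provenance(page=1, bbox=(72.0, 640.0, 300.0, 655.0), confidence=0.98),
    ),
])

print(outline(doc))
print(plain_text(resolve(doc, "#/body/1")))
```

The point of the sketch is that a ref is just a path into the tree, so anything that holds a ref can get back to the exact block, its text, and its provenance.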
What it buys you
- Write once, run on everything. Search, chunking, dedup, LLM programs, and agents operate on the AST, so they work identically whether the source was a PDF or a web page. kaos-pdf, kaos-office, kaos-web, and kaos-tabular are all just producers of this one shape.
- Citations that point at something real. Because every block has a stable reference, an answer can cite #/body/2 of a specific document, and that citation can be verified by checking the quoted text against the source span. This is the foundation of KAOS's grounded findings.
- Format is a choice, not a constraint. The AST is the source of truth; Markdown, HTML, CSV, and JSON are serializations you pick at the end.
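The citation check described above can be sketched in a few lines. This is an illustration under stated assumptions, not how KAOS implements grounded findings: the `Citation` class, the `verify` and `normalize` functions, and the corpus layout (document id mapped to ref mapped to block text) are all hypothetical, and the whitespace-and-case normalization is one possible matching policy among many.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    doc_id: str  # which document the answer cites
    ref: str     # block reference, e.g. "#/body/2"
    quote: str   # text the answer claims appears in that block


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial differences do not fail a check."""
    return " ".join(text.split()).casefold()


def verify(citation: Citation, corpus: dict[str, dict[str, str]]) -> bool:
    """Return True only if the quoted text appears in the cited block."""
    blocks = corpus.get(citation.doc_id, {})
    block_text = blocks.get(citation.ref)
    if block_text is None:
        return False  # unknown document or unknown ref
    return normalize(citation.quote) in normalize(block_text)


# Hypothetical corpus: document id -> {ref -> plain text of that block}.
corpus = {
    "loan-agreement": {
        "#/body/0": "Terms",
        "#/body/1": "The rate is 5%",
    }
}

good = Citation("loan-agreement", "#/body/1", "the rate is 5%")
bad = Citation("loan-agreement", "#/body/1", "The rate is 7%")
missing = Citation("loan-agreement", "#/body/9", "The rate is 5%")

print(verify(good, corpus), verify(bad, corpus), verify(missing, corpus))
# prints: True False False
```

A citation that names a block that does not exist fails the same way as a misquote, which is what makes the stable reference useful: the claim is checkable without re-reading the whole document.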
The mental model
    real files / sources                     one AST                 output
    ──────────────────────                   ──────────────────      ──────────────────
    PDF, DOCX, PPTX, XLSX,  ──extractors──>  ContentDocument   ──>   search / LLM /
    HTML, CSV, archives...                   (Block + Inline +       agents / citations
                                              provenance + refs)     / Markdown / JSON

Learn the model once and the rest of the stack stops caring what format your input was.
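As a toy version of that picture, the sketch below has two extractors that emit the same block shape and one search function that never learns which format it is reading. Everything here is an assumption for illustration: `from_html`, `from_csv`, `search`, and the `(kind, text)` block tuple are hypothetical stand-ins built on the Python standard library, not the kaos-web or kaos-tabular extractors.

```python
import csv
import io
from html.parser import HTMLParser
from typing import Optional

# A block is a (kind, text) pair: a deliberately tiny stand-in for the AST.
Block = tuple[str, str]


class _BlockParser(HTMLParser):
    """Collect <h1> and <p> elements as heading and paragraph blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[Block] = []
        self._kind: Optional[str] = None
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._kind = "heading" if tag == "h1" else "paragraph"
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> lands in the same buffer.
        if self._kind is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._kind is not None and tag in ("h1", "p"):
            self.blocks.append((self._kind, "".join(self._buffer).strip()))
            self._kind = None


def from_html(html: str) -> list[Block]:
    """Toy HTML extractor."""
    parser = _BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def from_csv(data: str) -> list[Block]:
    """Toy CSV extractor: each non-empty row becomes one paragraph block."""
    rows = csv.reader(io.StringIO(data))
    return [("paragraph", " ".join(row)) for row in rows if row]


def search(blocks: list[Block], term: str) -> list[str]:
    """Return the refs of blocks containing the term, whatever the source was."""
    needle = term.lower()
    return [
        f"#/body/{i}"
        for i, (_, text) in enumerate(blocks)
        if needle in text.lower()
    ]


html_blocks = from_html("<h1>Terms</h1><p>The rate is <b>5%</b>.</p>")
csv_blocks = from_csv("item,rate\nloan,5%\n")

print(search(html_blocks, "5%"))
print(search(csv_blocks, "5%"))
```

Adding a third format in this sketch means writing one more extractor; `search` stays unchanged, which is the "write once, run on everything" claim in miniature.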