One document model

A legal/financial workflow ingests wildly different inputs: PDFs, Word documents, web pages, spreadsheets, email. The naive approach writes separate code for each format: separate parsing, separate search, separate citation logic. Every new format multiplies that work, and provenance (knowing where a fact came from) becomes nearly impossible to track consistently.

KAOS makes a different bet: every extractor produces the same document model.

The Block/Inline AST

kaos-content defines a single AST. Block nodes are structural (headings, paragraphs, lists, tables, figures); Inline nodes are content within a block (text, bold, links, citations):

flowchart TD
    doc["ContentDocument"] --> body["body"]
    body --> h["Heading<br/><small>#/body/0</small>"]
    body --> p["Paragraph<br/><small>#/body/1</small>"]
    body --> tbl["Table<br/><small>#/body/2</small>"]
    p --> t1["Text<br/><small>“The rate is ”</small>"]
    p --> b1["Bold<br/><small>“5%”</small>"]
    p --> c1["Citation<br/><small>→ source span</small>"]

    classDef block fill:#eef2ff,stroke:#6366f1,color:#1e1b4b;
    classDef inline fill:#fef9c3,stroke:#ca8a04,color:#713f12;
    class doc,body,h,p,tbl block;
    class t1,b1,c1 inline;

Block nodes (indigo) nest structure; Inline nodes (amber) carry content. Every node has a stable ref.
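As a rough illustration of the tree in the diagram, the sketch below models Block and Inline nodes with plain Python dataclasses. The class and field names, the heading text, and the table contents are assumptions made for this example; they are not kaos-content's actual classes or data.

```python
from dataclasses import dataclass, field
from typing import List, Union


# Hypothetical inline node types (content inside a block).
@dataclass
class Text:
    value: str


@dataclass
class Bold:
    value: str


@dataclass
class Citation:
    target: str  # placeholder for "the source span this points at"


Inline = Union[Text, Bold, Citation]


# Hypothetical block node types (document structure).
@dataclass
class Heading:
    level: int
    children: List[Inline]


@dataclass
class Paragraph:
    children: List[Inline]


@dataclass
class Table:
    rows: List[List[str]]


Block = Union[Heading, Paragraph, Table]


@dataclass
class ContentDocument:
    body: List[Block] = field(default_factory=list)


# The same shape as the diagram: a heading, a paragraph, and a table.
doc = ContentDocument(body=[
    Heading(level=1, children=[Text("Rates")]),
    Paragraph(children=[Text("The rate is "), Bold("5%"), Citation("source span")]),
    Table(rows=[["Term", "Rate"], ["12 months", "5%"]]),
])

# In this sketch a block's ref is simply its position in the body.
refs = [f"#/body/{i}" for i, _ in enumerate(doc.body)]
print(refs)
```

The point of the split is that code walking `doc.body` only needs to know about a handful of node types, regardless of which file format produced them.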

Every node carries:

Provenance — where it came from (page, bounding box, confidence) when extracted.
A stable block reference (like #/body/2) — a precise, addressable location.
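To make these two properties concrete, here is a minimal sketch of a block that carries provenance and of a function that resolves a ref such as #/body/2. It assumes the ref can be read as a slash-separated path into the document, similar to a JSON Pointer; the `Provenance`, `Block`, and `resolve` names and all sample values are hypothetical, not the kaos-content API.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class Provenance:
    page: int
    bbox: Tuple[float, float, float, float]  # assumed order: x0, y0, x1, y1
    confidence: float


@dataclass
class Block:
    kind: str
    text: str
    provenance: Optional[Provenance] = None  # None when the block was not extracted


def resolve(document: dict, ref: str) -> Any:
    """Walk a ref such as '#/body/2' through nested dicts and lists."""
    if not ref.startswith("#/"):
        raise ValueError(f"not a block reference: {ref!r}")
    node: Any = document
    for part in ref[2:].split("/"):
        # List segments are numeric indexes; dict segments are keys.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Made-up sample document: only the table has extraction provenance.
document = {
    "body": [
        Block("heading", "Rates"),
        Block("paragraph", "The rate is 5%"),
        Block(
            "table",
            "Term | Rate",
            Provenance(page=3, bbox=(72.0, 540.0, 523.0, 610.0), confidence=0.97),
        ),
    ]
}

table = resolve(document, "#/body/2")
print(table.kind, table.provenance.page)
```

Because the ref is just a path, it can be stored in a search index or a citation and resolved later without keeping the original node object around.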

You saw this in "Build a document": a ContentDocument serializes to Markdown, HTML, text, or JSON, and exposes an outline with each block’s ref.
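The idea of one tree with several serializations can be sketched in a few lines. The `Block` type and the `to_markdown`, `to_text`, and `outline` functions below are hypothetical stand-ins written for this example; the real ContentDocument methods may be named and shaped differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    kind: str       # "heading" or "paragraph" in this sketch
    text: str
    level: int = 0  # heading depth; unused for paragraphs


def to_markdown(body: List[Block]) -> str:
    """Render headings with '#' markers and paragraphs as plain lines."""
    lines = []
    for block in body:
        if block.kind == "heading":
            lines.append("#" * block.level + " " + block.text)
        else:
            lines.append(block.text)
    return "\n\n".join(lines)


def to_text(body: List[Block]) -> str:
    """Render every block as one plain-text line."""
    return "\n".join(block.text for block in body)


def outline(body: List[Block]) -> List[Tuple[str, str, str]]:
    """List (ref, kind, preview) for each block, in document order."""
    return [
        (f"#/body/{i}", block.kind, block.text[:40])
        for i, block in enumerate(body)
    ]


body = [
    Block("heading", "Rates", level=1),
    Block("paragraph", "The rate is 5%."),
]

print(to_markdown(body))
print(outline(body))
```

Both renderers read the same `body`, so adding another output format means adding one function rather than touching any extractor.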

What it buys you

Write once, run on everything. Search, chunking, dedup, LLM programs, and agents operate on the AST — so they work identically whether the source was a PDF or a web page. kaos-pdf, kaos-office, kaos-web, and kaos-tabular are all just producers of this one shape.
Citations that point at something real. Because every block has a stable reference, an answer can cite #/body/2 of a specific document — and that citation can be verified by checking the quoted text against the source span. This is the foundation of KAOS’s grounded findings.
Format is a choice, not a constraint. The AST is the source of truth; Markdown, HTML, CSV, and JSON are serializations you pick at the end.
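The citation check described above can be approximated as follows. This is only a sketch: it assumes a citation holds a document id, a block ref, and a quoted string, and it uses a whitespace- and case-insensitive substring test as a stand-in for whatever matching KAOS’s grounded findings actually perform. All names and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Citation:
    document_id: str
    ref: str    # block reference, e.g. "#/body/2"
    quote: str  # text the answer claims appears in that block


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so layout differences do not matter."""
    return " ".join(text.split()).casefold()


def verify(citation: Citation, blocks_by_doc: Dict[str, Dict[str, str]]) -> bool:
    """Return True only if the quote is non-empty and found in the cited block."""
    source = blocks_by_doc.get(citation.document_id, {}).get(citation.ref)
    if source is None:
        return False  # unknown document or ref: nothing to check against
    quote = normalize(citation.quote)
    return bool(quote) and quote in normalize(source)


# Made-up index of block text, keyed by document id and then by ref.
blocks_by_doc = {
    "loan-agreement": {
        "#/body/1": "The rate is 5% per annum,\npayable monthly.",
        "#/body/2": "Term | Rate",
    }
}

good = Citation("loan-agreement", "#/body/1", "the rate is 5% per annum, payable monthly")
bad = Citation("loan-agreement", "#/body/1", "The rate is 7%")
print(verify(good, blocks_by_doc), verify(bad, blocks_by_doc))
```

A check like this is only possible because the ref names one specific block; a citation that points at "the PDF" as a whole gives the verifier nothing precise to compare against.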

The mental model

real files / sources                          one AST                output
─────────────────────                     ─────────────────       ────────────
PDF, DOCX, PPTX, XLSX,   ──extractors──>   ContentDocument   ──>   search / LLM /
HTML, CSV, archives...                     (Block + Inline +        agents / citations
                                            provenance + refs)      / Markdown / JSON
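The flow in the diagram can be imitated with the standard library alone. In the sketch below, two toy extractors (one for HTML, one for CSV) emit the same list-of-blocks shape, and a single `search` function works on either result. The block shape and every function name are assumptions for illustration; they do not reproduce kaos-web or kaos-tabular.

```python
import csv
import io
from html.parser import HTMLParser
from typing import Dict, List, Optional

Block = Dict[str, str]  # hypothetical minimal block: {"kind": ..., "text": ...}


class _BlockParser(HTMLParser):
    """Collect <h1> and <p> elements as blocks; nested tags contribute text only."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[Block] = []
        self._kind: Optional[str] = None
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._kind = "heading" if tag == "h1" else "paragraph"
            self._buf = []

    def handle_data(self, data):
        if self._kind:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("h1", "p") and self._kind:
            self.blocks.append({"kind": self._kind, "text": "".join(self._buf).strip()})
            self._kind = None


def from_html(html: str) -> List[Block]:
    parser = _BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def from_csv(text: str) -> List[Block]:
    return [{"kind": "row", "text": " | ".join(row)} for row in csv.reader(io.StringIO(text))]


def search(body: List[Block], term: str) -> List[str]:
    """Format-agnostic consumer: refs of the blocks whose text contains term."""
    needle = term.casefold()
    return [f"#/body/{i}" for i, b in enumerate(body) if needle in b["text"].casefold()]


web = from_html("<h1>Rates</h1><p>The rate is <b>5%</b>.</p>")
sheet = from_csv("Term,Rate\n12 months,5%\n")
print(search(web, "5%"), search(sheet, "5%"))
```

Note that `search` never inspects where its input came from; supporting another format would mean writing one more extractor and leaving the consumer untouched.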

Learn the model once and the rest of the stack stops caring what format your input was.