
One document model

A legal/financial workflow ingests wildly different inputs: PDFs, Word documents, web pages, spreadsheets, email. The naive approach writes separate code for each format — separate parsing, separate search, separate citation logic. It doesn’t scale, and it makes provenance (knowing where a fact came from) nearly impossible.

KAOS makes a different bet: every extractor produces the same document model.

kaos-content defines a single AST. Block nodes are structural (headings, paragraphs, lists, tables, figures); Inline nodes are content within a block (text, bold, links, citations):

flowchart TD
    doc["ContentDocument"] --> body["body"]
    body --> h["Heading<br/><small>#/body/0</small>"]
    body --> p["Paragraph<br/><small>#/body/1</small>"]
    body --> tbl["Table<br/><small>#/body/2</small>"]
    p --> t1["Text<br/><small>“The rate is ”</small>"]
    p --> b1["Bold<br/><small>“5%”</small>"]
    p --> c1["Citation<br/><small>→ source span</small>"]

    classDef block fill:#eef2ff,stroke:#6366f1,color:#1e1b4b;
    classDef inline fill:#fef9c3,stroke:#ca8a04,color:#713f12;
    class doc,body,h,p,tbl block;
    class t1,b1,c1 inline;

Block nodes (indigo) nest structure; Inline nodes (amber) carry content. Every node has a stable ref.
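
To make the tree above concrete, here is a minimal sketch of the same shape in Python. The class names, fields, and the outline() helper are hypothetical stand-ins chosen for illustration; they are not the kaos-content API. The refs are derived from position in the body, which is an assumption of this sketch, since the page does not say how kaos-content keeps refs stable.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Hypothetical stand-ins for illustration only; not the kaos-content classes.

@dataclass
class Text:
    value: str

@dataclass
class Bold:
    value: str

@dataclass
class Citation:
    target: str  # identifier of the source span being cited

Inline = Union[Text, Bold, Citation]

@dataclass
class Heading:
    level: int
    children: List[Inline] = field(default_factory=list)

@dataclass
class Paragraph:
    children: List[Inline] = field(default_factory=list)

@dataclass
class Table:
    rows: List[List[str]] = field(default_factory=list)

Block = Union[Heading, Paragraph, Table]

@dataclass
class Document:
    body: List[Block] = field(default_factory=list)

    def outline(self) -> List[Tuple[str, str]]:
        """Pair each top-level block with a positional ref such as '#/body/1'."""
        return [(f"#/body/{i}", type(b).__name__) for i, b in enumerate(self.body)]

# The same three blocks as in the diagram.
doc = Document(body=[
    Heading(level=1, children=[Text("Rates")]),
    Paragraph(children=[Text("The rate is "), Bold("5%"), Citation("source-span-1")]),
    Table(rows=[["term", "rate"], ["12 months", "5%"]]),
])

for ref, kind in doc.outline():
    print(ref, kind)
# #/body/0 Heading
# #/body/1 Paragraph
# #/body/2 Table
```

Blocks hold inlines and never the reverse, which is what keeps the structure (block) and content (inline) layers separate.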

Every node carries:

  • Provenance — where it came from (page, bounding box, confidence) when extracted.
  • A stable block reference (like #/body/2) — a precise, addressable location.
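
As a rough illustration of those two properties, the sketch below attaches a provenance record to each block and resolves a ref of the form #/body/<index> back to its block. The Provenance and Block classes, their field names, and the resolve() function are assumptions made for this example, not the library's own types.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical types for illustration only; not the kaos-content API.

@dataclass
class Provenance:
    page: int
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1 on the page
    confidence: float

@dataclass
class Block:
    kind: str
    text: str
    provenance: Optional[Provenance] = None  # None when the block was not extracted

def resolve(body: List[Block], ref: str) -> Block:
    """Resolve a ref of the form '#/body/<index>' to the block it names."""
    prefix = "#/body/"
    if not ref.startswith(prefix):
        raise ValueError(f"unsupported ref: {ref!r}")
    index = int(ref[len(prefix):])  # raises ValueError on a non-numeric index
    if index < 0 or index >= len(body):
        raise IndexError(f"no block at {ref!r}")
    return body[index]

body = [
    Block("Heading", "Rates", Provenance(1, (72.0, 700.0, 300.0, 720.0), 0.99)),
    Block("Paragraph", "The rate is 5%.", Provenance(1, (72.0, 650.0, 540.0, 690.0), 0.97)),
    Block("Table", "term | rate", Provenance(3, (72.0, 400.0, 540.0, 600.0), 0.92)),
]

block = resolve(body, "#/body/2")
print(block.kind, block.provenance.page, block.provenance.confidence)
# Table 3 0.92
```

Given only the ref, a caller can recover both the content and the place on the page it was read from.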

You saw this in “Build a document”: a ContentDocument serializes to Markdown, HTML, text, or JSON, and exposes an outline that lists each block’s ref.

  • Write once, run on everything. Search, chunking, dedup, LLM programs, and agents operate on the AST — so they work identically whether the source was a PDF or a web page. kaos-pdf, kaos-office, kaos-web, and kaos-tabular are all just producers of this one shape.
  • Citations that point at something real. Because every block has a stable reference, an answer can cite #/body/2 of a specific document — and that citation can be verified by checking the quoted text against the source span. This is the foundation of KAOS’s grounded findings.
  • Format is a choice, not a constraint. The AST is the source of truth; Markdown, HTML, CSV, and JSON are serializations you pick at the end.
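
The citation check described in the list above can be sketched as a lookup followed by a text comparison. Everything here is an assumption for illustration: the Citation class, the corpus layout (document id, then ref, then block text), and the whitespace- and case-insensitive matching rule. How KAOS actually verifies grounded findings is not specified on this page.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical sketch; not the KAOS verification API.

@dataclass
class Citation:
    doc_id: str
    ref: str    # e.g. "#/body/2"
    quote: str  # the text the answer claims appears at that ref

def normalize(s: str) -> str:
    """Collapse whitespace and ignore case so trivial differences do not fail the check."""
    return " ".join(s.split()).casefold()

def verify(citation: Citation, corpus: Dict[str, Dict[str, str]]) -> bool:
    """Return True only if the quoted text occurs in the cited block."""
    if not citation.quote.strip():
        return False  # an empty quote proves nothing
    source = corpus.get(citation.doc_id, {}).get(citation.ref)
    if source is None:
        return False  # unknown document or ref
    return normalize(citation.quote) in normalize(source)

corpus = {
    "loan-agreement": {
        "#/body/1": "The rate is 5% for the first 12 months.",
        "#/body/2": "Late payments incur a fee of $25.",
    }
}

good = Citation("loan-agreement", "#/body/1", "the rate is  5%")
wrong_block = Citation("loan-agreement", "#/body/2", "The rate is 5%")
missing = Citation("loan-agreement", "#/body/7", "The rate is 5%")

print(verify(good, corpus), verify(wrong_block, corpus), verify(missing, corpus))
# True False False
```

A citation that names the wrong block or a ref that does not exist fails the same way as a fabricated quote, so the check catches both kinds of error.
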

real files / sources                     one AST                  output
─────────────────────                    ─────────────────        ────────────
PDF, DOCX, PPTX, XLSX,  ──extractors──>  ContentDocument    ──>   search / LLM /
HTML, CSV, archives...                   (Block + Inline +        agents / citations
                                         provenance + refs)       / Markdown / JSON
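
The pipeline above can be mimicked in a few lines: two toy extractors that emit the same block shape, and one consumer that never learns which extractor ran. The function names (from_text, from_csv, search) and the Block class are invented for this sketch and are far simpler than kaos-pdf, kaos-office, kaos-web, or kaos-tabular.

```python
import csv
import io
from dataclasses import dataclass
from typing import List

# Toy stand-ins for illustration only; not the KAOS extractors.

@dataclass
class Block:
    kind: str
    text: str

def from_text(raw: str) -> List[Block]:
    """Toy extractor: each blank-line-separated chunk becomes a paragraph block."""
    return [Block("Paragraph", " ".join(p.split())) for p in raw.split("\n\n") if p.strip()]

def from_csv(raw: str) -> List[Block]:
    """Toy extractor: the whole CSV becomes a single table block."""
    rows = list(csv.reader(io.StringIO(raw)))
    return [Block("Table", "\n".join(" | ".join(row) for row in rows))]

def search(body: List[Block], term: str) -> List[str]:
    """Consumer written once against the shared shape; returns refs of matching blocks."""
    needle = term.casefold()
    return [f"#/body/{i}" for i, b in enumerate(body) if needle in b.text.casefold()]

memo = from_text("Loan summary\n\nThe rate is 5% for 12 months.")
sheet = from_csv("term,rate\n12 months,5%\n")

print(search(memo, "5%"))   # ['#/body/1']
print(search(sheet, "5%"))  # ['#/body/0']
```

Adding a new input format means writing one more extractor; search() and every other consumer stay unchanged.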

Learn the model once and the rest of the stack stops caring what format your input was.