Extract a PDF

Goal: turn a PDF into the document AST — with page/position provenance — so search, LLM programs, agents, and citations can work on it.

kaos-pdf does this via parse_pdf(path). To stay self-contained and offline, the example generates a small PDF (pure-Python fpdf2), then extracts it — no committed binary, no network. With a real document, point parse_pdf at its path.

uv run examples/pdf-extract.py

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = ["kaos-pdf>=0.1.4,<0.2", "kaos-content>=0.1.6,<0.2", "fpdf2>=2.7"]
# ///
"""Extract a PDF to the document AST.

`kaos-pdf` turns a PDF into a `ContentDocument` — the same AST every other
extractor produces — with page/position provenance. So a PDF, a Word doc, and a
web page all become one shape the rest of the stack works on.

To stay self-contained and offline, this example *generates* a small PDF with a
pure-Python writer (fpdf2), then extracts it with kaos-pdf — no committed binary,
no network. With a real document, just point `parse_pdf` at its path.

Run it:

    uv run examples/pdf-extract.py
"""

from __future__ import annotations

import tempfile
from pathlib import Path

import kaos_content as kc
import kaos_pdf as kp
from fpdf import FPDF


def make_pdf(path: Path) -> None:
    """Write a one-page sample memo (title plus one paragraph) to `path` with fpdf2."""
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", style="B", size=16)
    pdf.cell(0, 10, "Engagement Memo")
    pdf.ln(14)
    pdf.set_font("Helvetica", size=11)
    pdf.multi_cell(
        0,
        8,
        "The retainer is twenty thousand dollars. Fees are billed monthly "
        "against the retainer. Unused amounts are refundable on termination.",
    )
    pdf.output(str(path))


def main() -> str:
    """Generate the sample PDF, extract it, print the text, and return it."""
    with tempfile.TemporaryDirectory() as d:
        path = Path(d) / "memo.pdf"
        make_pdf(path)
        print(f"generated {path.name} ({path.stat().st_size} bytes)")

        # Extract it to the content AST.
        doc = kp.parse_pdf(path)

        # Serialize before the temporary directory is removed.
        text = kc.serialize_text(doc)
    print("--- extracted text ---")
    print(text.strip())
    return text


if __name__ == "__main__":
    text = main()
    assert "retainer is twenty thousand dollars" in text
    assert "billed monthly" in text

Notes

parse_pdf returns a ContentDocument — the same shape kaos-office, kaos-web, and kaos-tabular produce.
parse_pdf accepts several options: detect_headings, extract_tables, pages=[...] to extract a subset of pages, and ocr= modes for scanned documents (with [nlp]/OCR engines).
Other kaos-pdf surfaces: search, outline, render (page → PNG), info/metadata, and page classification (text vs. scanned).
PDFium runs under a global lock for thread-safety; extraction is offloaded to a thread pool for async callers.
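
The locking note above can be illustrated with a small, generic sketch. It does not use kaos-pdf at all: `extract_blocking`, `extract_many`, `run_demo`, and `_ENGINE_LOCK` are hypothetical stand-ins, and the real library's internals may differ. It only shows the general pattern of serializing a non-thread-safe engine behind one lock while async callers hand the blocking work to a thread pool.

```python
import asyncio
import threading
import time

# Hypothetical stand-in for a native engine that is not thread-safe.
# None of these names come from kaos-pdf.
_ENGINE_LOCK = threading.Lock()
_active = 0      # calls currently inside the engine
_max_active = 0  # highest concurrency observed inside the engine


def extract_blocking(name: str) -> str:
    """Pretend to extract one document while holding the global lock."""
    global _active, _max_active
    with _ENGINE_LOCK:
        _active += 1
        _max_active = max(_max_active, _active)
        time.sleep(0.01)  # simulated native work
        _active -= 1
    return f"text of {name}"


async def extract_many(names: list[str]) -> list[str]:
    """Run the blocking calls in worker threads so the event loop stays free."""
    # asyncio.to_thread submits each call to the default thread pool;
    # asyncio.gather returns results in the same order as `names`.
    tasks = (asyncio.to_thread(extract_blocking, n) for n in names)
    return list(await asyncio.gather(*tasks))


def run_demo() -> tuple[list[str], int]:
    """Extract three fake documents and report the peak engine concurrency."""
    results = asyncio.run(extract_many(["a.pdf", "b.pdf", "c.pdf"]))
    return results, _max_active
```

Calling `run_demo()` returns the three results in input order, with a peak of one call inside the engine at any time. The trade-off in this pattern is that the event loop stays responsive, but the calls themselves do not run in parallel.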