Triage a file before ingesting
Goal: at the front of an ingestion pipeline, decide two things fast — what format a file is (so you pick the right parser) and, for a PDF, whether it has a text layer (so you know whether to run OCR). Both checks are deterministic and offline.
    uv run examples/triage-before-ingest.py

    #!/usr/bin/env -S uv run --script
    # /// script
    # requires-python = ">=3.13"
    # dependencies = ["kaos-nlp-core>=0.1.6,<0.2", "kaos-pdf>=0.1.0,<0.2", "fpdf2"]
    # ///
    """Triage a file before ingesting it: what is it, and does it need OCR?

    Before an ingestion pipeline parses a file, it should know two things: the file's
    *format* (so it picks the right parser) and, for PDFs, whether there's a *text
    layer* (so it knows whether to run OCR). `kaos-nlp-core` sniffs the format from
    the bytes; `kaos-pdf` classifies a PDF as text vs. scanned. Both are
    deterministic and offline.

    This example generates a small PDF, then triages it. (A scanned PDF would
    classify differently and route to OCR.)

    Run it:

        uv run examples/triage-before-ingest.py
    """

    from __future__ import annotations

    import tempfile
    from pathlib import Path

    import kaos_pdf
    from fpdf import FPDF
    from kaos_nlp_core import content_type


    def make_text_pdf() -> Path:
        pdf = FPDF()
        pdf.add_page()
        pdf.set_font("Helvetica", size=12)
        pdf.multi_cell(
            0,
            8,
            "MASTER SERVICES AGREEMENT. This Agreement is governed by the "
            "laws of Delaware. The initial term is three years.",
        )
        path = Path(tempfile.mkdtemp()) / "contract.pdf"
        pdf.output(str(path))
        return path


    def main() -> tuple[str, str]:
        path = make_text_pdf()
        data = path.read_bytes()

        # 1. What format is it? (sniffed from the bytes, not the extension)
        fmt = content_type.detect(data)
        print(f" format: {fmt.mime_type} (group={fmt.group})")

        # 2. For a PDF, is there a text layer or does it need OCR?
        kind = kaos_pdf.classify_document(path)
        needs_ocr = kind != "text"
        print(f" pdf kind: {kind} -> {'route to OCR' if needs_ocr else 'extract text directly'}")

        return fmt.group, kind


    if __name__ == "__main__":
        group, kind = main()
        assert group == "pdf"
        # The generated PDF has a real text layer, so no OCR is needed.
        assert kind == "text"

Notes
- `content_type.detect(bytes)` sniffs the format from the content, not the file extension, and returns `mime_type`, `group` (pdf/office/html/…), and `extension`. Route to `kaos-pdf`, `kaos-office`, or `kaos-web` accordingly.
- `kaos_pdf.classify_document(path)` returns `text` for a PDF with an extractable text layer, or a scanned/image kind that should go through OCR (kaos-pdf's optional `[ocr]` extra). This saves you from running expensive OCR on PDFs that don't need it. For hybrid PDFs (some pages text, some scanned), classify page by page and OCR only the pages that need it.
- Both are pure and offline, so they're cheap to run as a first pass over a large drop folder before committing to the heavier extraction.
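
The routing and drop-folder pass described in these notes could be sketched roughly as follows. This is an illustration, not part of either library: `plan_route`, `triage_folder`, the `ROUTES` table, and the `"skip"` and `"kaos-pdf[ocr]"` labels are hypothetical names chosen here. The detector and classifier are passed in as plain callables so the sketch runs without the packages installed, and any PDF kind other than `"text"` is assumed to need OCR, as in the example script.

```python
from __future__ import annotations

from collections import Counter
from collections.abc import Callable, Iterable
from pathlib import Path

# Hypothetical routing table: group names come from the notes above,
# the values are just labels for the package that would handle the file.
ROUTES = {"office": "kaos-office", "html": "kaos-web"}


def plan_route(group: str, pdf_kind: str | None = None) -> str:
    """Map a sniffed format group (plus the PDF kind, if any) to a next step."""
    if group == "pdf":
        # Assumption: only the "text" kind can skip OCR; everything else,
        # including a missing classification, is routed to OCR.
        return "kaos-pdf" if pdf_kind == "text" else "kaos-pdf[ocr]"
    # Unknown groups are set aside rather than guessed at.
    return ROUTES.get(group, "skip")


def triage_folder(
    paths: Iterable[Path],
    detect_group: Callable[[bytes], str],
    classify_pdf: Callable[[Path], str],
) -> Counter[str]:
    """Cheap first pass: count how many files would go down each route."""
    plan: Counter[str] = Counter()
    for path in paths:
        group = detect_group(path.read_bytes())
        # Only PDFs need the text-layer check.
        kind = classify_pdf(path) if group == "pdf" else None
        plan[plan_route(group, kind)] += 1
    return plan
```

With the real packages, `detect_group` would presumably be something like `lambda data: content_type.detect(data).group` and `classify_pdf` would be `kaos_pdf.classify_document`, both as used in the script above. Tallying routes first shows how much OCR work a drop folder implies before any heavy extraction starts.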