Skip to content

Generate synthetic data

Goal: demo and test a workflow when you can’t touch real client data. Generate it instead — realistic, labeled, and reproducible. This follows the pattern KAOS uses to build training corpora (kaos-embeddings): seed a per-row RNG from a stable hash so every row is independently reproducible — same seed in, identical dataset out, regardless of order or parallelism.

Here we generate ~180 law-firm billing line items across synthetic matters; the billing-analytics how-to consumes them.

Terminal window
uv run examples/generate-billing-data.py
examples/generate-billing-data.py
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = ["kaos-names>=0.1.0a5,<0.2"]
# ///
"""Generate synthetic billing data — realistic, labeled, and reproducible.
You can't demo a billing workflow on a real firm's invoices, and you can't test
one without data. So generate it. Following the pattern KAOS uses to build
training corpora (`kaos-embeddings`), this seeds a per-row RNG from a stable hash
(BLAKE2b) so every row is *independently reproducible* — same seed in, identical
dataset out, regardless of order or parallelism. Matter codenames come from
`kaos-names` (its noun pool is legal terms).
This is the synthetic-data-generation use case; `analyze-billing.py` consumes the
output. Fully offline, no model.
Run it:
uv run examples/generate-billing-data.py
"""
from __future__ import annotations
import hashlib
import random
from datetime import date, timedelta
import kaos_names as kn
# Fixed pools — a real generator would draw these from config.
TIMEKEEPERS = [
("Okafor", "Partner", 925), ("Nguyen", "Partner", 880),
("Alvarez", "Associate", 520), ("Brandt", "Associate", 470), ("Cohen", "Associate", 440),
("Devi", "Paralegal", 240), ("Eriksson", "Paralegal", 215),
]
TASKS = [ # UTBMS litigation task codes + narrative templates
("L120", ["Analyze case strategy and key issues", "Develop litigation strategy"]),
("L160", ["Confer with client regarding settlement posture", "Draft settlement demand letter"]),
("L210", ["Draft motion to dismiss", "Revise answer and affirmative defenses"]),
("L240", ["Draft motion for summary judgment", "Research summary judgment standard"]),
("L320", ["Review documents produced in discovery", "Prepare privilege log entries"]),
("L330", ["Prepare for deposition of fact witness", "Attend and defend deposition"]),
]
PRACTICE_AREAS = ["Litigation", "M&A", "Employment", "Intellectual Property", "Regulatory"]
CLIENTS = ["Acme Corporation", "Globex Industries", "Initech LLC", "Wayne Enterprises"]
def _matters(seed: int, n: int = 6) -> list[dict]:
rng = random.Random(seed)
out = []
for i in range(n):
codename = kn.generate_session_name(rng=rng, number_min=2026, number_max=2026)
out.append({
"matter": f"{codename}",
"practice_area": rng.choice(PRACTICE_AREAS),
"client": rng.choice(CLIENTS),
})
return out
def generate_billing_rows(seed: int = 42, n: int = 180) -> list[dict]:
"""Generate `n` billing line items. Each row is seeded independently from a
BLAKE2b hash of (seed, index), so the dataset is fully reproducible."""
matters = _matters(seed)
base = date(2026, 1, 1)
rows = []
for i in range(n):
digest = hashlib.blake2b(f"{seed}:{i}".encode(), digest_size=8).digest()
rng = random.Random(int.from_bytes(digest, "big"))
matter = rng.choice(matters)
name, role, rate = rng.choice(TIMEKEEPERS)
code, narratives = rng.choice(TASKS)
hours = round(rng.uniform(0.3, 8.0), 1)
rows.append({
"entry_id": f"E{i:04d}",
"date": (base + timedelta(days=rng.randint(0, 89))).isoformat(),
"matter": matter["matter"],
"practice_area": matter["practice_area"],
"client": matter["client"],
"timekeeper": name,
"role": role,
"task_code": code,
"narrative": rng.choice(narratives),
"hours": hours,
"rate": rate,
"amount": round(hours * rate, 2),
})
return rows
def main() -> list[dict]:
rows = generate_billing_rows()
total = sum(r["amount"] for r in rows)
matters = {r["matter"] for r in rows}
print(f"generated {len(rows)} billing entries across {len(matters)} matters")
print(f" date range: {min(r['date'] for r in rows)} .. {max(r['date'] for r in rows)}")
print(f" total fees: ${total:,.2f}")
print(f" total hours: {sum(r['hours'] for r in rows):,.1f}")
print("\nsample rows:")
for r in rows[:3]:
print(f" {r['date']} {r['timekeeper']:9} {r['task_code']} {r['hours']:>4}h ${r['amount']:>9,.2f} {r['matter']}")
return rows
if __name__ == "__main__":
rows = main()
assert 100 <= len(rows) <= 200
# Reproducible: regenerating with the same seed yields an identical dataset.
assert generate_billing_rows() == rows
# ...and a different seed yields a different one.
assert generate_billing_rows(seed=7) != rows

What to notice

  • Per-row hashed seed. random.Random(blake2b(f"{seed}:{i}")) makes row i reproducible on its own — you can regenerate any subset, in any order, on any machine, and get the same rows. That’s why the asserts at the bottom hold. It’s the same trick kaos-embeddings uses to make training-corpus generation deterministic.
  • Pools + structure. Realistic data comes from realistic pools — UTBMS task codes, role-based rates, kaos-names for matter codenames (its noun pool is legal terms). Swap the pools to retarget the generator.
  • Labeled by construction. Because you generated it, every row’s true matter, timekeeper, and task code are known — ideal for testing extraction/classification, or for benchmarking a workflow before you have real data.
  • For text generation (paraphrases, Q&A pairs) rather than structured rows, the same seed-then-fill idea applies, with a FunctionClient offline or a real model live.