Find near-duplicate text

Goal: decide whether two texts are nearly the same — two versions of a clause, a boilerplate paragraph reused across contracts, or duplicates to drop from a corpus. Exact hashing can’t; MinHash estimates how much their content overlaps.

kaos-nlp-core shingles the text into overlapping token n-grams and compares compact signatures. Deterministic and offline.

uv run examples/near-duplicates.py

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.13"
# dependencies = ["kaos-nlp-core>=0.1.6,<0.2"]
# ///
"""Detect near-duplicate text with MinHash.

Exact hashing tells you if two documents are *identical*; MinHash tells you if
they're *nearly* identical — the question that matters for contract versions,
boilerplate clauses, and deduplicating a corpus. `kaos-nlp-core` shingles the
text into overlapping token n-grams and estimates their Jaccard similarity from
compact MinHash signatures.

Fully offline and deterministic.

Run it:

    uv run examples/near-duplicates.py
"""

from __future__ import annotations

from kaos_nlp_core.hashing import MinHasher

# Two versions of the same clause (one word changed) + an unrelated clause.
CLAUSE_V1 = (
    "The Tenant shall pay rent monthly on the first business day of each month "
    "and any late payment incurs a five percent fee on the outstanding balance"
)
CLAUSE_V2 = (
    "The Tenant shall pay rent monthly on the first business day of every month "
    "and any late payment incurs a five percent fee on the outstanding balance"
)
UNRELATED = (
    "Confidential Information must be protected by the Receiving Party for a "
    "period of three years following the date of disclosure"
)


def main() -> tuple[float, float]:
    hasher = MinHasher()
    # 3-token shingles over the lowercased words.
    sig_v1 = hasher.hash_token_shingles(CLAUSE_V1.lower().split(), 3)
    sig_v2 = hasher.hash_token_shingles(CLAUSE_V2.lower().split(), 3)
    sig_u = hasher.hash_token_shingles(UNRELATED.lower().split(), 3)

    near_dup = sig_v1.jaccard(sig_v2)
    unrelated = sig_v1.jaccard(sig_u)
    print(f"  similarity(clause v1, clause v2 — one word changed) = {near_dup:.3f}")
    print(f"  similarity(clause v1, unrelated NDA clause)         = {unrelated:.3f}")
    return near_dup, unrelated


if __name__ == "__main__":
    near_dup, unrelated = main()
    # Near-duplicate scores high; the unrelated clause scores zero.
    assert near_dup > 0.5, f"expected high near-dup similarity, got {near_dup}"
    assert unrelated == 0.0

Notes

The one-word-different clauses score ~0.81; the unrelated clause scores 0 — exactly the signal you want for “is this a revision of that?”.
hash_char_shingles works at the character level (robust to tokenization differences); MinHashIndex scales this to finding duplicates across a whole corpus instead of comparing pairs.
For semantic near-duplicates (same meaning, different words), use embeddings instead — MinHash is lexical.