Finding near-duplicates on the web matters for several unrelated reasons, including archiving and versioning, plagiarism and copyright, SEO, and optimizing storage capacity. It may be instructive to revisit the original groundbreaking research in the space: an eleven-year-old, highly cited study of 1.6 billion distinct web pages.
What’s interesting about this work is that it takes two relatively faulty heuristics and combines them to produce a third one that is superior to either.
Specifically, let A and B be the two heuristics in question, and let (p1, p2) be a random pair of web pages being evaluated for near-duplication (in one of its domain-specific interpretations). Each method effectively computes a similarity between the two pages; denote these sim-A(p1, p2) and sim-B(p1, p2), respectively.
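To make this concrete, here is a minimal Python sketch of what two such similarity functions might look like. The shingle-based Jaccard stand-in for sim-A and the simhash-style bit agreement for sim-B are illustrative choices of mine (shingle size, hash function and bit width included), not the exact heuristics from the study:

```python
import hashlib
import re


def shingles(text: str, k: int = 4) -> set:
    """k-word shingles of a page's text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def sim_a(p1: str, p2: str) -> float:
    """Jaccard similarity over word shingles (what a minhash would estimate)."""
    s1, s2 = shingles(p1), shingles(p2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 1.0


def simhash(text: str, bits: int = 64) -> int:
    """Simhash-style fingerprint over word features."""
    v = [0] * bits
    for word in re.findall(r"\w+", text.lower()):
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)


def sim_b(p1: str, p2: str, bits: int = 64) -> float:
    """Fraction of simhash bits on which the two pages agree."""
    diff = bin(simhash(p1, bits) ^ simhash(p2, bits)).count("1")
    return 1.0 - diff / bits
```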
Next, based on their own configured tunables, A and B would separately determine whether p1 and p2 are near-duplicates:
- duplicate-A(p1, p2) = sim-A(p1, p2) > threshold-A
- duplicate-B(p1, p2) = sim-B(p1, p2) > threshold-B
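Continuing the sketch above, each decision is just the corresponding similarity compared against its own tunable; the threshold values below are placeholders, not the ones tuned in the study:

```python
# Placeholder tunables; the study's actual thresholds are not reproduced here.
THRESHOLD_A = 0.9
THRESHOLD_B = 0.95


def duplicate_a(p1: str, p2: str) -> bool:
    return sim_a(p1, p2) > THRESHOLD_A


def duplicate_b(p1: str, p2: str) -> bool:
    return sim_b(p1, p2) > THRESHOLD_B
```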
The combined algorithm – let’s call it C – evaluates as follows, left to right:
- duplicate-C(p1, p2) = duplicate-A(p1, p2) && sim-B(p1, p2) > threshold-C
where threshold-C is yet another constant, carefully tuned to minimize both false negatives and false positives (the latter being the cases where a pair of different web pages is falsely reported as a near-duplicate).
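In the same sketch, the combined check reads as a short-circuiting conjunction; again, the value of threshold-C is a placeholder:

```python
# A's verdict acts as a filter, and B's raw similarity is then re-thresholded
# against threshold-C. Short-circuit evaluation means sim_b is only computed
# for pairs that A already considers near-duplicates.

THRESHOLD_C = 0.92  # placeholder; in practice tuned on labeled pairs


def duplicate_c(p1: str, p2: str) -> bool:
    return duplicate_a(p1, p2) and sim_b(p1, p2) > THRESHOLD_C
```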
It is easy to notice that, given enough time to experiment and large enough datasets, this approach can work for more than two unrelated heuristics, utilizing minhash, simhash, and various other popular near-duplicate detection techniques.
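A hedged generalization of the same "filter, then re-threshold" idea: treat the combination as a left-to-right chain of (similarity, threshold) pairs, so that cheaper heuristics can prune pairs before more expensive ones run. The chaining scheme below is my own illustration, not something prescribed by the study:

```python
from typing import Callable, Sequence, Tuple

# One (similarity function, threshold) pair per heuristic; all() short-circuits,
# so evaluation proceeds left to right and stops at the first failed check.
Heuristic = Tuple[Callable[[str, str], float], float]


def duplicate_chain(p1: str, p2: str, heuristics: Sequence[Heuristic]) -> bool:
    return all(sim(p1, p2) > threshold for sim, threshold in heuristics)


# duplicate_chain(p1, p2, [(sim_a, THRESHOLD_A), (sim_b, THRESHOLD_C)])
# reproduces duplicate_c above.
```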
Maybe less obvious is another observation: given real-life time and space limitations, any near-duplicate detection algorithm will have to remain information-lossy and irreversible. And since the information that is lost cannot be compared for duplication (or non-duplication), we cannot rule out scenarios where the respective sets of false positives are non-identical (and maybe even disjoint).
Therefore, given algorithms A and B and their respective false-positive sets FP-A and FP-B, we should always be able to come up with a combined heuristic C(A, B) that optimizes its own inevitable FP-C as a function of (FP-A, FP-B).
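A small evaluation sketch, assuming a labeled sample of page pairs (hypothetical here; nothing of the sort comes from the study itself) on which the false-positive sets can be measured and compared:

```python
from typing import Callable, Iterable, Tuple


def false_positives(pairs: Iterable[Tuple[str, str]],
                    labels: Iterable[bool],
                    decide: Callable[[str, str], bool]) -> set:
    """Pairs flagged as near-duplicates that the labels say are distinct."""
    return {pair for pair, is_dup in zip(pairs, labels)
            if decide(*pair) and not is_dup}


# fp_a = false_positives(pairs, labels, duplicate_a)
# fp_b = false_positives(pairs, labels, duplicate_b)
# fp_c = false_positives(pairs, labels, duplicate_c)
#
# By construction FP-C is a subset of FP-A in this sketch, and threshold-C is
# what trades the size of FP-C against the false negatives that the extra
# filtering introduces.
```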
PS. A blog post that (at the time of this writing) comes up on top when searching for “near-duplicate detection”. Easy to follow and illustrated.