Nothing to do with Perl per se (at least not yet - I'm hoping somebody will turn this into a nifty little CPAN module), but Greg Linden points out a clever method of near-duplicate detection in a paper by Martin Theobald, Jonathan Siddharth, and Andreas Paepcke from Stanford University (PDF).
The problem they're trying to solve is finding near-duplicate web pages: pages that carry the same core content but may differ in layout, branding and interspersed adverts.
They do this by using the next few words after a stop word as a signature. For instance:
From Greg Linden's blog
The paper gives an example of generating a spot signature, where a piece of text like "at a rally to kick off a weeklong campaign" produces two spots: "a:rally:kick" and "a:weeklong:campaign".
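To make the idea concrete, here is a minimal Python sketch of spot extraction that reproduces the example above. It is an illustration under assumptions, not the authors' implementation: the function names, the tiny stopword list, the choice of "a" as the only antecedent, and the chain length of 2 are all made up for this example. The set-based Jaccard comparison at the end is just a simple stand-in for comparing two documents; it is not the partitioning and index-pruning matcher the paper describes.

```python
import re

# Assumed, illustrative word lists; a real system would pick its own.
STOPWORDS = {"a", "an", "the", "is", "at", "to", "of", "off", "in", "on", "and"}
ANTECEDENTS = {"a"}  # stopwords that are allowed to start a spot


def spot_signatures(text, antecedents=ANTECEDENTS, stopwords=STOPWORDS,
                    chain_length=2):
    """Return one spot per antecedent occurrence: the antecedent joined to
    the next `chain_length` non-stopword tokens that follow it."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    spots = []
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        # Skip over stopwords and keep only the following content words.
        chain = [t for t in tokens[i + 1:] if t not in stopwords][:chain_length]
        if len(chain) == chain_length:  # drop incomplete chains near the end
            spots.append(":".join([token] + chain))
    return spots


def jaccard(spots_a, spots_b):
    """Plain set Jaccard similarity between two lists of spots
    (a simplification used here only for illustration)."""
    a, b = set(spots_a), set(spots_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(spot_signatures("at a rally to kick off a weeklong campaign"))
```

Because only the words around stopwords are used, boilerplate such as navigation links or advert captions (which tend to contain few stopwords) contributes few or no spots, while running prose contributes many.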
From the paper: SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections (PDF)
The contributions of SpotSigs are twofold:
- by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling;
- we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search.
Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative "Gold Set" of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
Makes for an interesting read.