PerlMonks  

[OT] SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

by clinton (Priest)
on Oct 04, 2008 at 10:31 UTC (#715339=perlmeditation)

Nothing to do with Perl per se (at least not yet - I'm hoping somebody will turn this into a nifty little CPAN module), but Greg Linden points out a clever method of near duplicate detection, in a paper by Martin Theobald, Jonathan Siddharth, and Andreas Paepcke from Stanford University. (PDF)

The problem they're trying to solve is finding near-duplicate web pages: pages with essentially the same content that may differ in layout, branding, and interspersed adverts.

They do this by using the next few words after a stop word as a signature. For instance:

From Greg Linden's blog

The paper gives an example of generating a spot signature where a piece of text like "at a rally to kick off a weeklong campaign" produces two spots: "a:rally:kick" and "a:weeklong:campaign".
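
A minimal Python sketch of how the two spots in that example could be derived. The antecedent set, the stop word list, and the function name `spot_signatures` are assumptions made for illustration and are not taken from the paper; the chain length of two content words simply matches the example.

```python
import re

# Assumed for illustration: which stop words act as antecedents (start a spot)
# and which words count as stop words to be skipped inside a chain.
ANTECEDENTS = {"a", "the", "is"}
STOPWORDS = ANTECEDENTS | {"at", "to", "off", "for", "of", "in", "on"}


def spot_signatures(text, chain_length=2):
    """Return one 'antecedent:word:word' signature per antecedent occurrence."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    spots = []
    for i, token in enumerate(tokens):
        if token not in ANTECEDENTS:
            continue
        # Collect the next `chain_length` non-stop-words after the antecedent.
        chain = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain_length]
        if len(chain) == chain_length:
            spots.append(":".join([token] + chain))
    return spots


print(spot_signatures("at a rally to kick off a weeklong campaign"))
# → ['a:rally:kick', 'a:weeklong:campaign']
```

Note that "at" and "to" are treated here as stop words that get skipped but do not start a spot of their own; with a different antecedent set the output would differ.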

From the paper: SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections (PDF)

The contributions of SpotSigs are twofold:
  1. by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling;
  2. we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative "Gold Set" of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
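
To make the matching side a little more concrete: one common way to compare two documents' signature sets is Jaccard similarity, and a cheap size check can rule out pairs before any intersection is computed. The sketch below illustrates only that general idea. It is not the paper's partitioning or inverted-index algorithm, and the names `jaccard`, `may_match`, `near_duplicates`, the threshold `tau`, and the sample signatures are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def may_match(a, b, tau):
    """Cheap pre-check: Jaccard can never exceed min(|A|,|B|) / max(|A|,|B|)."""
    small, large = sorted((len(set(a)), len(set(b))))
    return large == 0 or small / large >= tau


def near_duplicates(a, b, tau=0.5):
    """Only compute the full similarity when the size bound allows a match."""
    return may_match(a, b, tau) and jaccard(a, b) >= tau


# Made-up signature sets, purely for demonstration.
doc1 = {"a:rally:kick", "a:weeklong:campaign", "the:first:example"}
doc2 = {"a:rally:kick", "a:weeklong:campaign", "the:second:example"}
print(jaccard(doc1, doc2))          # 2 shared out of 4 distinct → 0.5
print(near_duplicates(doc1, doc2))  # → True
```

The size check is what makes grouping documents by signature-set size attractive: two sets whose sizes differ by more than the threshold allows never need to be compared at all.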

Makes for an interesting read.

Re: [OT] SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections
by roho (Monsignor) on Oct 08, 2008 at 07:55 UTC
    Sounds very interesting -- unfortunately the link to the article does not work. :-(

    "Its not how hard you work, its how much you get done."

      All of them work for me. Check again?

        All the links work for me too!

        In both IE7 and Firefox.

        Might well be a firewall problem.

        Just checked at $work: trying to access that URL, http://dbpubs.stanford.edu/pub/...&name=2008-10.pdf, redirects me to

        HTTP/1.x 302 Found
        Date: Thu, 09 Oct 2008 10:54:28 GMT
        Server: Apache/2.0.49 (Fedora)
        Location: http://DBPubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=2008-10&format=pdf&compression=&name=2008-10.pdf
        Content-Length: 400
        Connection: close
        Content-Type: text/html; charset=iso-8859-1

        And that server (notice the port number in there) is reported to never answer…

        I'll have to try that later @home…
