http://www.perlmonks.org?node_id=683137

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I have a series of databases with articles in them. These are front-ended by a Wordpress blog interface.

I've been importing the articles into Wordpress with a little spider I wrote that pulls them from a remote article resource. This part all works.

The relevant insert code looks something like this:

my $users_sth = $dbh->prepare(qq{
    INSERT IGNORE INTO `wp_users`
        (`user_login`, `user_pass`, `user_nicename`, `user_email`,
         `user_registered`, `user_status`, `display_name`)
    VALUES (?, ?, ?, ?, ?, 0, ?)
});
# Bind the password as a placeholder rather than interpolating it
# into the SQL, where it would need quoting and escaping.
$users_sth->execute($authors[$i], $random_password, $authors[$i],
                    'no-email@localhost', $now, $authors[$i]);

my $author_sth = $dbh->prepare(qq{
    SELECT id FROM wp_users WHERE user_login = ?
});
my $post_author;                        # filled in by bind_col/fetch
$author_sth->execute($authors[$i]);
$author_sth->bind_col(1, \$post_author);

my ($body) = fetch_body($urls[$i]);     # the article text

while ($author_sth->fetch) {
    my $post_sth = $dbh->prepare(qq{
        INSERT IGNORE INTO `wp_posts`
            (`post_author`, `post_date`, `post_content`, `post_title`)
        VALUES (?, ?, ?, ?)
    });
    $post_sth->execute($post_author, $now, $body, $titles[$i]);
}

I had to alter some of the default tables in Wordpress to facilitate this, making post_title a VARCHAR(255) and setting it UNIQUE, as well as making user_login UNIQUE:

ALTER TABLE `wp_posts` MODIFY `post_title` VARCHAR(255);
ALTER TABLE `wp_posts` ADD UNIQUE (`post_title`);
ALTER TABLE `wp_users` ADD UNIQUE (`user_login`);

The inserts work, and so far there have been no issues and no duplicate users in the database, but here's where it gets tricky...

When I'm inserting articles, sometimes there are articles which have "similar" titles and very similar content, written by two different people (plagiarism, no doubt). Here are some examples:

| 597 | 102 | New Home Builders in Houston       |
| 595 | 102 | New Home Builders In Houston Texas |
...
| 772 | 136 | Buy to Let Rental Properties       |
| 754 | 136 | Buy to Let Rental Property         |

There are probably more examples, but those are the two that I picked out visually.

I asked the MySQL folks, and they pointed me to Soundex, but it didn't look that easy to implement, and I've heard bad things about the pattern matching.

So I started looking around and found Text::Levenshtein, Text::WagnerFischer, Text::PhraseDistance and String::Approx.

What I'm trying to figure out is the best way to achieve this (preferably at article import time), so I can eliminate the potential for dupes.

An alternate solution is to import the "dupes", and mark them as "Unpublished", so I can manually review them later.
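
Since stock Wordpress already has a post_status column in wp_posts, that half is easy; for example, the post insert above could be changed to land suspected dupes as drafts (the 'draft' status value is standard Wordpress, not something I've added):

my $post_sth = $dbh->prepare(qq{
    INSERT IGNORE INTO `wp_posts`
        (`post_author`, `post_date`, `post_content`, `post_title`, `post_status`)
    VALUES (?, ?, ?, ?, 'draft')
});
$post_sth->execute($post_author, $now, $body, $titles[$i]);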

I don't want to add more tables to the db to facilitate this, I need to stay as close to "stock" as possible.

Ideas? Suggestions? Pseudocode?

Re: Document similarity in large collections
by tachyon-II (Chaplain) on Apr 27, 2008 at 19:03 UTC

    This is a fascinating problem and there is a very effective solution. I implemented a version of it in C (code since lost) some years ago, based on a research paper I can't find anymore! Sorry.

    I will describe the problem and the technique, and leave you to find the paper. I have done a very quick Perl implementation to demonstrate the basics of how it works.

    The problem with finding similar, but not exact, document duplicates in a large document collection is that any naive implementation is O(n^2): you have to compare each new document with every old document, so as the number of documents climbs this rapidly becomes intractable.

    Mission impossible. Not so. Here is the algorithm.

    1. For each document, first convert to lower-case plain text and then split it into "phrase" tokens. What you want is a tokeniser method that returns chunks of text a few words long. You want these tokens to overlap (code presented later): effectively you walk an N-word-wide frame down your word stream, shifting it one word at a time. From memory, the width should be somewhere from 3-5 words.
    2. Now you apply a hashing algorithm to the tokens. You need a fast hash because you will have one token for every word in your document. The odd collision does not matter, so I chose Digest::JHash, which is twice as fast as MD5 and returns an unsigned 4-byte int, giving 4+ billion possible hash values. Google use this hash FWIW.
    3. Now you sort the hash tokens and select, say, the first 20 as your representative sample of the text. Because we hashed the tokens and sorted the hashes, we have no idea what we selected, but it will effectively be 20 N-word tokens chosen completely at random based on what they hashed to.
    4. Next you add the tokens and doc_id to your database (a DBI sketch follows below). Because we selected only 20 random tokens per document, we don't need much space for each document added, and the space requirement grows linearly with the number of documents in the collection.

    Now for any new document we want to test against our set, we tokenise and get our top 20 randomly selected tokens. For dissimilar documents there will be, say, 0-2 matches. The probability of a near-identical document increases geometrically with each match found. The odd collision is immaterial to the logic.
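
    In DBI terms, step 4 and the matching test might look something like this (the doc_tokens table and its two-column layout are my own illustration, not from the original):

    my $ins = $dbh->prepare(qq{
        INSERT INTO doc_tokens (token, doc_id) VALUES (?, ?)
    });
    $ins->execute($_, $doc_id) for @signature;    # the 20 selected hashes

    # To test a new document, count how many of its 20 signature
    # tokens are already stored for each existing document.
    my $in  = join ',', ('?') x @signature;
    my $sth = $dbh->prepare(qq{
        SELECT doc_id, COUNT(*) AS hits
        FROM doc_tokens
        WHERE token IN ($in)
        GROUP BY doc_id
        ORDER BY hits DESC
    });
    $sth->execute(@signature);
    while (my ($doc_id, $hits) = $sth->fetchrow_array) {
        print "doc $doc_id shares $hits of 20 tokens\n";   # 0-2 is noise
    }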

    Case and punctuation are removed from the equation on entry. Changing paragraph order only affects the few tokens that overlap the beginning and end of the paragraph. Changing sentence order is similar but has a greater effect, because the number of identical tokens we generate depends on how many full frames fit in that sentence. Deleting, inserting or changing a single word only affects the frames that include it, i.e. if our frame width is 4, it will affect only 4 tokens.

    Here is some demo code. Note there is no strict need to hash; we simply do it to pack a given token into a fixed, unique(ish) space and also to make sure that when we sort and pick, the selection is completely random. Note how the sort pulls up the numerical items first, so it is not as random as we want, but good enough for a demo.
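
    (tachyon-II's demo code is not reproduced here; the following is a minimal sketch of the steps above, assuming Digest::JHash's jhash() function and a numeric sort of the hash values, which sidesteps the string-sort quirk just mentioned.)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::JHash qw(jhash);

    # Walk an N-word frame down the word stream one word at a time,
    # hash each frame, sort the hashes numerically, and keep the
    # smallest $keep values as the document's signature.
    sub signature {
        my ($text, $width, $keep) = @_;
        $width ||= 4;
        $keep  ||= 20;
        $text = lc $text;
        $text =~ s/[^a-z0-9\s]+/ /g;      # case and punctuation gone
        my @words = split ' ', $text;
        return [] if @words < $width;
        my @hashes;
        for my $i (0 .. @words - $width) {
            push @hashes, jhash(join ' ', @words[$i .. $i + $width - 1]);
        }
        @hashes = sort { $a <=> $b } @hashes;
        $keep = @hashes if $keep > @hashes;
        return [ @hashes[0 .. $keep - 1] ];
    }

    # Count shared signature tokens: 0-2 is noise, high counts
    # indicate near-duplicates.
    sub shared {
        my ($s1, $s2) = @_;
        my %in_s1 = map { $_ => 1 } @$s1;
        return scalar grep { $in_s1{$_} } @$s2;
    }

    my $doc1 = "New home builders in Houston build quality new homes.";
    my $doc2 = "New home builders in Houston Texas build quality new homes.";
    printf "%d signature tokens shared\n",
        shared(signature($doc1), signature($doc2));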

Re: Comparing inputs before inserting into MySQL for "sounds-like" similarity
by jhourcle (Prior) on Apr 27, 2008 at 17:14 UTC

    Soundex is a standard for "sounds alike" matching, of the type where the name 'chris' sounds like 'kris' or 'joseph' sounds like 'josef' -- I've typically seen it used for telephone-book type applications, so that you can match the various americanized spellings of names from other languages.

    The advantage of soundex is that it essentially provides a hash of the value, and when you get a new input, you compute the hash and look to see if anything in the system already had a hash that matches ... the other items you mentioned require comparing the string against all of the other existing strings in the system, which just doesn't scale well.
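
    For instance (a sketch assuming Text::Soundex from CPAN; the soundex_key column is my own illustration, not part of any stock schema), you compute the code once on the way in and then match new inputs with an indexed equality lookup instead of a scan:

    use strict;
    use warnings;
    use Text::Soundex;                 # exports soundex()

    print soundex('Robert'), "\n";     # R163
    print soundex('Rupert'), "\n";     # also R163 -- a match

    # Lookup sketch against a hypothetical indexed soundex_key column:
    # my $sth = $dbh->prepare(q{SELECT id FROM authors WHERE soundex_key = ?});
    # $sth->execute(soundex($new_name));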

    If you were doing this with just single words, I'd probably look at the concept of stemming, where for each word, you compute the root of the word, so you can then check to ensure you're not storing 'boat' and 'boats' and 'boating', etc. (all stemming routines are different, and are language dependent ... some just handle plurals, others do more). It still might be a useful concept to look into to see what sort of processes are done, as you may wish to make use of it in your solution.
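
    As a quick illustration (assuming Lingua::Stem's exported stem() interface; any Porter-style stemmer from CPAN would do):

    use strict;
    use warnings;
    use Lingua::Stem qw(stem);

    # 'boat', 'boats' and 'boating' all reduce to the same root,
    # so comparing stems catches these trivial variations.
    my $stems = stem('boat', 'boats', 'boating');
    print "@$stems\n";                 # boat boat boat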

    For the examples you gave, the second one could be solved by just stemming each word -- you might be able to strip some adjectives, adverbs or other less significant modifiers, but you're going to start getting into issues of word order when computing your hashes -- in your first example, someone might've entered 'Houston New Home Builders' ... but there might be a valid reason for duplication, as language is imprecise ... is this an article about builders who make new homes (which is kinda redundant, I would think), or about home builders who didn't exist previously? If the latter, then you might have the article recur every year or two with different information.

    If you're just going for an attempt to identify plagiarism, you might compute some form of statistics on individual sentences / paragraphs (eg, the number of times each word appears), and then see if anything already matches that item ... but you'll have to balance the rate of false positives / false negatives ... and I haven't even considered your request to not add / modify tables -- I just don't think it's realistic for what you're trying to do.

Re: Comparing inputs before inserting into MySQL for "sounds-like" similarity
by pc88mxer (Vicar) on Apr 27, 2008 at 17:44 UTC
    Here are some suggestions...

    Decouple the article import from detecting duplicates - the latter process could be quite involved, and decoupling will let you play with the detection algorithm without interrupting your import process.

    Have a look at this article: Building a Vector Space Search Engine in Perl.

    What you need to do could be seen as an instance of the document classification problem. A search engine would help find documents that could be duplicates of a target document, and then you could use something like Text::Similarity to perform a more robust comparison (or flag them for manual review).
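
    A sketch of that second stage, assuming Text::Similarity::Overlaps and its getSimilarityStrings() method (the threshold is arbitrary):

    use strict;
    use warnings;
    use Text::Similarity::Overlaps;

    # normalize => 1 scales the raw word-overlap score into 0..1.
    my $sim = Text::Similarity::Overlaps->new({ normalize => 1 })
        or die "could not construct Text::Similarity::Overlaps";

    my $score = $sim->getSimilarityStrings(
        'New Home Builders in Houston',
        'New Home Builders In Houston Texas',
    );
    print "possible dupe ($score)\n" if $score > 0.75;   # arbitrary threshold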

Re: Comparing inputs before inserting into MySQL for "sounds-like" similarity
by FunkyMonk (Chancellor) on Apr 27, 2008 at 22:37 UTC
    I asked the MySQL folks, and they pointed me to Soundex, but it didn't look that easy to implement, and I've heard bad things about the pattern matching.
    Soundex is easy to implement (or get it from CPAN: Text::Soundex), but you're right that it's a long way from being perfect.
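
    A quick check with Text::Soundex shows both sides of that: it catches some "sounds like" pairs but misses others, because the first letter is always kept verbatim:

    use strict;
    use warnings;
    use Text::Soundex;

    print soundex('Joseph'), ' ', soundex('Josef'), "\n";  # J210 J210 -- match
    print soundex('Chris'),  ' ', soundex('Kris'),  "\n";  # C620 K620 -- no match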

    Unless I state otherwise, my code all runs with strict and warnings