This is a fascinating problem and there is a very effective solution. I implemented a version of it in C (code since lost) some years ago, based on a research paper I can't find anymore! Sorry. I will describe the problem and the technique, and leave you to find the paper. I have done a very quick Perl implementation to demonstrate the basics of how it works.
The problem with finding similar, but not exact, document duplicates in a large document collection is that any naive implementation is O(n^2): you have to compare each new document with every existing document, so as the number of documents climbs this becomes impossible to do in finite time. Mission impossible? Not so. Here is the algorithm.
- For each document, first convert to lower-case plain text and then split it into "phrase" tokens. What you want is a tokeniser that returns overlapping chunks of text a few words long (code presented later): effectively you walk an N-word-wide frame down your word stream, shifting by one word at a time. The width should be somewhere from 3-5 words, from memory.
- Now apply a hashing algorithm to the tokens. You need a fast hash because you will have one token for every word in your document, and the odd collision does not matter, so I chose Digest::JHash, which is twice as fast as MD5 and returns an unsigned 4-byte int, giving over 4 billion possible hash values. Google use this hash, FWIW.
- Now sort the hashed tokens and select, say, the first 20 as your representative sample of the text. Because we hashed the tokens before sorting, we have no idea what we selected, but it will effectively be 20 N-word tokens chosen completely at random, based purely on what they hashed to.
- Next, add those tokens and the doc_id to your database. Because we selected only 20 random tokens per document, we don't need much space for each document added, and the space requirement is linear in the number of documents in the collection.
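The steps above can be sketched briefly in Python (the technique itself is language-agnostic). Here `zlib.crc32` stands in for Digest::JHash as a fast 32-bit hash, and the frame width and sample size are the values suggested above; this is a minimal illustration, not the original C or Perl implementation.

```python
import re
import zlib

FRAME = 4    # frame width in words
SAMPLE = 20  # number of hashes kept per document

def fingerprint(text):
    """Return a small, pseudo-random sample of shingle hashes for a document."""
    # Lower-case and strip punctuation, as described above.
    words = re.sub(r'[^a-z0-9]+', ' ', text.lower()).split()
    # Walk an N-word frame down the word stream, one word at a time.
    shingles = [' '.join(words[i:i + FRAME])
                for i in range(len(words) - FRAME + 1)]
    # Hash, sort, keep the first SAMPLE values as the document's sample.
    return sorted(zlib.crc32(s.encode()) for s in shingles)[:SAMPLE]
```

Storing `fingerprint(doc)` alongside each doc_id is all the database needs.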
Now for any new document we want to test against our set, we tokenise it and get its top 20 randomly selected tokens. For dissimilar documents there will be, say, 0-2 matches. The probability that a document is a near-duplicate increases geometrically with each match found. The odd hash collision is immaterial to the logic.
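The comparison step is then just an overlap count between the two samples. A sketch, with the caveat that the threshold of 3 here is an illustrative reading of the "0-2 matches means dissimilar" observation above, not a value from the original:

```python
def match_score(sample_a, sample_b):
    """Number of sampled hashes the two documents share."""
    return len(set(sample_a) & set(sample_b))

def looks_similar(sample_a, sample_b, threshold=3):
    """0-2 shared hashes is noise; each extra match raises the odds
    of near-duplication sharply, so flag at `threshold` or more."""
    return match_score(sample_a, sample_b) >= threshold
```

In practice you would index each stored hash, so a new document's 20 hashes pull back candidate doc_ids directly, avoiding the O(n^2) scan entirely.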
Case and punctuation are removed from the equation on entry. Changing paragraph order only affects the few tokens that overlap the beginning and end of each paragraph. Changing sentence order is similar but has a greater effect, because the number of identical tokens we generate depends on how many full frames fit in that sentence. Deleting, inserting, or changing a single word only affects the frames that include it, i.e. if our frame width is 4 it will affect only 4 tokens.
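That last claim is easy to sanity-check in a few lines of Python (the example sentences are made up for illustration):

```python
FRAME = 4  # frame width in words

def shingles(text):
    """Overlapping FRAME-word chunks of the text."""
    words = text.lower().split()
    return [' '.join(words[i:i + FRAME])
            for i in range(len(words) - FRAME + 1)]

a = shingles("the quick brown fox jumps over the lazy sleeping dog")
b = shingles("the quick brown cat jumps over the lazy sleeping dog")
# Only the frames that include the edited word differ: with a frame
# width of 4, a single interior word sits in exactly 4 frames.
changed = [s for s in a if s not in b]
```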
Here is some demo code. Note that strictly there is no need to hash; we do it to pack a given token into a fixed, unique(ish) space and to make sure that when we sort and pick, the selection is effectively random. Note how the sort pulls the numeric-leading tokens to the front, so it is not as random as we want, but it is good enough for a demo.
#!/usr/bin/perl -w
use strict;
my $text1 = <<TEXT;
Once upon a time, in a land far away, there lived a giant. .. .
He was big. Really big. He had 5.5 toes on 2.3 feet.
TEXT
my $text2 = <<TEXT;
Once upon a time, in a land far away, there lived a GIANT.
He was big!!!!!!!!!!!!!
YOU GOTTA BELIEVE ME.
Really big! He was HUGE I tell you. He had 5.5 toes on 2.3 feet.
TEXT
my $FRAME = 4;
print "Here are the texts:\n\n$text1\n\n$text2\n\n";
my $res1 = tokenise($text1);
my $res2 = tokenise($text2);
print "Here is the walking token frame for text1:\n\n";
print "$_\n" for @$res1;
print "\n\nAnd here are the matches:\n\n";
my $top_twenty1 = get_top($res1);
my $top_twenty2 = get_top($res2);
my $same;
$same->{$_}++ for @$top_twenty1, @$top_twenty2;
my $matches = grep{ $same->{$_} == 2 } keys %$same;
printf "%20s %1s | %1s %-20s\n",
$top_twenty1->[$_],
$same->{$top_twenty1->[$_]} == 2 ? "*" : " ",
$same->{$top_twenty2->[$_]} == 2 ? "*" : " ",
$top_twenty2->[$_] for 0..19;
print "\nFound $matches matches!\n";
sub tokenise {
my $text = shift;
$text = lc($text);
$text =~ tr/a-z0-9/ /c;
my @text = split " ", $text;
my @tokens;
push @tokens, join " ", @text[$_..($_+$FRAME-1)] for 0..(@text-$FRAME);
return \@tokens;
}
sub get_top {
my $ary = shift;
[ (sort @$ary)[0..19] ];
}
__DATA__
Here are the texts:
Once upon a time, in a land far away, there lived a giant. .. .
He was big. Really big. He had 5.5 toes on 2.3 feet.
Once upon a time, in a land far away, there lived a giant. .. .
He was big!!!!!!!!!!!!!
You gotta believe me.
Really big! He was huge I tell you. He had 5.5 toes on 2.3 feet.
Here is the walking token frame for text1:
once upon a time
upon a time in
a time in a
time in a land
in a land far
a land far away
land far away there
far away there lived
away there lived a
there lived a giant
lived a giant he
a giant he was
giant he was big
he was big really
was big really big
big really big he
really big he had
big he had 5
he had 5 5
had 5 5 toes
5 5 toes on
5 toes on 2
toes on 2 3
on 2 3 feet
And here are the matches:
5 5 toes on * | * 5 5 toes on
5 toes on 2 * | * 5 toes on 2
a giant he was * | * a giant he was
a land far away * | * a land far away
a time in a * | * a time in a
away there lived a * | * away there lived a
big he had 5 | believe me really big
big really big he | big he was huge
far away there lived * | big you gotta believe
giant he was big * | * far away there lived
had 5 5 toes * | * giant he was big
he had 5 5 * | gotta believe me really
he was big really | * had 5 5 toes
in a land far * | * he had 5 5
land far away there * | he was big you
lived a giant he | he was huge i
on 2 3 feet | huge i tell you
once upon a time | i tell you he
really big he had | * in a land far
there lived a giant | * land far away there
Found 12 matches!
So that is the basic concept. Now you can tune this in various ways. A wider frame makes the algorithm more sensitive to sentence-order changes and to insertions/deletions. The frame should be wide enough to miss common phrases, but narrower than the average sentence. The algorithm works best at detecting similar documents of similar lengths. If you have a 20k-word document and someone lifts 1k words out of it, we would only expect a single match, because we will probably have sampled 19 of our 20 tokens from the 19k words that were not lifted. If we instead scale the number of tokens with document length, we can still detect this, as more than 2 matches will rarely happen between completely dissimilar docs. Our storage requirement then becomes a function of total document size in the collection rather than document count, but it is still linear.
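The length-proportional variant can be sketched like this. The `words_per_hash` rate and the floor of 20 are illustrative parameters I've picked to show the idea, not values from the original; `zlib.crc32` again stands in for a fast 32-bit hash:

```python
import re
import zlib

def fingerprint_scaled(text, frame=4, words_per_hash=100, floor=20):
    """Sample roughly one hash per `words_per_hash` words instead of a
    flat 20, so a 1k excerpt lifted from a 20k document still
    contributes several shared hashes to the comparison."""
    words = re.sub(r'[^a-z0-9]+', ' ', text.lower()).split()
    shingles = [' '.join(words[i:i + frame])
                for i in range(len(words) - frame + 1)]
    n = max(floor, len(words) // words_per_hash)
    return sorted(zlib.crc32(s.encode()) for s in shingles)[:n]
```

Storage then grows with total words in the collection rather than document count, which is the linear trade-off described above.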
Anyway HTH
Cheers
tachyon