Re: similar texts !?

in reply to similar texts !?

Here's a fairly simple way to measure the similarity of strings. Given $one and $two, compress each one separately and also compress their concatenation. The ratio of the compressed size of the concatenation to the sum of the two separately compressed sizes measures their similarity: the smaller the ratio, the closer the strings.
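In symbols, the ratio described above can be sketched as follows. The notation is introduced here for illustration and is not from the post: C(s) stands for the compressed form of s, |.| for its length in bytes, and ab for the concatenation of a and b.

```latex
r(a, b) = \frac{|C(ab)|}{|C(a)| + |C(b)|}
```

If the two strings share nothing, compressing them together saves nothing and r stays close to 1. If they are identical, an ideal compressor stores the second copy almost for free, so r approaches 1/2.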

#!/usr/bin/perl

use strict;
use warnings;
use Compress::Zlib 'compress';

# Usage:   $arrayref = similarity( LIST)
# Returns: AoA reference to string similarity table for LIST
sub similarity {
    # Compressed length of each string on its own
    my %single = map {$_ => length compress $_} @_;
    my @ret;
    for my $this (@_) {
        push @ret, [
            map {
               (length compress $this . $_)
                 / ($single{$this} + $single{$_})
            } @_ 
        ];
    }
    \@ret;
}

my @titles = (
    q(The Last Public Hanging In Old West Virginia - Flatt and Scruggs),
    q(Flatt_and_Scruggs__The_Last_Public_Hanging_In_Old_West_Virginia),
    q(Rainy Day Woman Number 12 and 35 - Flatt and Scruggs),
    q(Rainy Day Woman Number Twelve and Thirty-five - Bob Dylan),
);

my $results = similarity @titles;

for my $this (@$results) {
    print pack('A6' x @$this, map {sprintf '%4.3f', $_} @$this), $/;
}

__END__

0.529 0.715 0.784 0.841 
0.708 0.529 0.887 0.870 
0.784 0.863 0.536 0.748 
0.848 0.863 0.739 0.532

Note that 0.500 is the ideal minimum for this ratio, so subtracting 0.5 from each value would give more impressive differences.
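A minimal sketch of that rescaling, for illustration only. The helper names ratio() and rescale() are hypothetical and not part of the script above, and doubling the result after subtracting 0.5 is an added assumption so that the nominal range runs from 0 to 1.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use Compress::Zlib 'compress';

# Hypothetical helper: the same ratio that similarity() computes,
# for a single pair of strings.
sub ratio {
    my ($x, $y) = @_;
    return length(compress($x . $y))
         / (length(compress($x)) + length(compress($y)));
}

# Hypothetical helper: shift the ideal minimum of 0.5 down to 0.
# The factor of 2 is an assumption added here; it stretches the
# nominal range [0.5, 1] onto [0, 1].
sub rescale {
    my ($r) = @_;
    return 2 * ($r - 0.5);
}

my $x = q(Rainy Day Woman Number 12 and 35 - Flatt and Scruggs);
my $y = q(Rainy Day Woman Number Twelve and Thirty-five - Bob Dylan);

# A string against itself, then against the other title
printf "%.3f %.3f\n", rescale(ratio($x, $x)), rescale(ratio($x, $y));
```

With real compressors the self-comparison will not reach exactly 0, because the compressed stream carries some fixed overhead, which matters most for short strings like these.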

I saw this technique described in SciAm recently. I will update if I can find out which issue.

After Compline,
Zaxo


Replies are listed 'Best First'.
Re: Re: similar texts !? by allolex (Curate) on Jul 13, 2003 at 08:47 UTC
Great idea, but one small caveat: the accuracy of this method increases greatly on texts that are a bit longer than MP3 filenames. :) (The keyword for any googling on the subject is "maximum entropy". You can have a look here as well.) -- Allolex
Re: Re: similar texts !? by bugsbunny (Scribe) on Jul 12, 2003 at 17:18 UTC
cool waiting ... for update

In Section Seekers of Perl Wisdom