PerlMonks  

Re: similar texts !?

by Zaxo (Archbishop)
on Jul 12, 2003 at 15:47 UTC (#273638)


in reply to similar texts !?

Here's a fairly simple method to measure similarity of strings. Given $one and $two, apply text compression to each and to their concatenation. The ratio of the compressed size of the concatenation to the sum of the separately compressed sizes measures their similarity: the smaller the ratio, the closer the strings.

#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib 'compress';

# Usage: $arrayref = similarity(LIST)
# Returns: AoA reference to string similarity table for LIST
sub similarity {
    # Compressed length of each string on its own
    my %single = map { $_ => length compress($_) } @_;
    my @ret;
    for my $this (@_) {
        # One row per string: joint compressed size over the sum of the separate sizes
        push @ret, [
            map {
                length(compress($this . $_)) / ($single{$this} + $single{$_})
            } @_
        ];
    }
    return \@ret;
}

my @titles = (
    q(The Last Public Hanging In Old West Virginia - Flatt and Scruggs),
    q(Flatt_and_Scruggs__The_Last_Public_Hanging_In_Old_West_Virginia),
    q(Rainy Day Woman Number 12 and 35 - Flatt and Scruggs),
    q(Rainy Day Woman Number Twelve and Thirty-five - Bob Dylan),
);

my $results = similarity(@titles);

for my $this (@$results) {
    print pack('A6' x @$this, map { sprintf '%4.3f', $_ } @$this), $/;
}

__END__
0.529 0.715 0.784 0.841
0.708 0.529 0.887 0.870
0.784 0.863 0.536 0.748
0.848 0.863 0.739 0.532
Note that 0.500 is the ideal minimum for that ratio (a string compared with itself), so subtracting 0.5 from each value would give more impressive differences.
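A minimal sketch of that subtraction, for anyone who wants to see the shifted table. The helper name shift_scores is made up for illustration and is not part of the script above; the numbers are simply the table printed after __END__.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): subtract the ideal
# minimum of 0.5 from every ratio, keeping the table shape unchanged.
sub shift_scores {
    my ($table) = @_;
    return [ map { [ map { $_ - 0.5 } @$_ ] } @$table ];
}

# The similarity table printed by the script above.
my $results = [
    [ 0.529, 0.715, 0.784, 0.841 ],
    [ 0.708, 0.529, 0.887, 0.870 ],
    [ 0.784, 0.863, 0.536, 0.748 ],
    [ 0.848, 0.863, 0.739, 0.532 ],
];

my $shifted = shift_scores($results);

for my $row (@$shifted) {
    print join(' ', map { sprintf '%5.3f', $_ } @$row), "\n";
}
```

The shift only moves the baseline; the ordering of the pairs is unchanged, so it is purely a presentation choice.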

I saw this technique described in a SciAm recently. Will update if I can find out which.

After Compline,
Zaxo


Re: Re: similar texts !?
by bugsbunny (Scribe) on Jul 12, 2003 at 17:18 UTC
    cool waiting ... for update
Re: Re: similar texts !?
by allolex (Curate) on Jul 13, 2003 at 08:47 UTC

    Great idea, but a small caveat anyway: the accuracy of this method increases greatly on texts that are a bit longer than MP3 filenames. :)

    (The keyword for any googling on the subject is "maximum entropy". You can have a look here as well.)

    --
    Allolex
