similar texts !?

by bugsbunny (Scribe)
on Jul 12, 2003 at 11:16 UTC (#273609)
bugsbunny has asked for the wisdom of the Perl Monks concerning the following question:

hi,
I have had a quick look at the Lingua modules, but I still can't figure out whether any of them will do what I want or whether I have to do it some other way...
What I want to do is compare two or more short texts and get, as a result, an index of how alike they are.
What I want to compare are filenames, more specifically mp3 filenames, movie filenames, etc.

Re: similar texts !?
by Abigail-II (Bishop) on Jul 12, 2003 at 11:21 UTC
    Will Text::Soundex do what you want to do?

    Abigail

Re: similar texts !?
by BrowserUk (Pope) on Jul 12, 2003 at 13:10 UTC

    The problem with the filenames I've seen for mp3s and the like is that everyone tends to classify them differently. The words used may be the same, but the order tends to get switched around. Some classify by the musician's surname/first name/album/track, others by any number of permutations of those plus other stuff.

    You might get somewhere if you stripped non-alphas and spaces, and then used String::Approx, String::Similarity, Text::Levenshtein or, if speed is a concern, Text::LevenshteinXS, though I've had trouble getting the latter to compile.
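
    The strip-then-compare idea above could be sketched roughly as follows. This is only an illustration: the edit distance is written out by hand in pure Perl so the snippet has no CPAN dependency (Text::Levenshtein would normally supply it), and the names normalize, edit_distance and name_similarity, as well as the extension-stripping rule, are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Lowercase, drop a trailing extension, and strip everything that is not a letter.
# (The extension pattern is an assumption chosen for this sketch.)
sub normalize {
    my ($name) = @_;
    $name = lc $name;
    $name =~ s/\.[a-z0-9]{2,4}$//;   # drop something like ".mp3"
    $name =~ s/[^a-z]//g;            # strip digits, spaces, punctuation
    return $name;
}

# Dynamic-programming edit distance (insert, delete, substitute each cost 1).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Similarity between 0 and 1; 1 means identical after normalization.
sub name_similarity {
    my ($x, $y) = map { normalize($_) } @_;
    my $longest = length($x) > length($y) ? length($x) : length($y);
    return 1 if $longest == 0;
    return 1 - edit_distance($x, $y) / $longest;
}

printf "%.3f\n", name_similarity('Bob Dylan - Rainy Day Woman.mp3',
                                 'bob_dylan__rainy_day_woman.MP3');
```

    Dividing by the longer length keeps the score comparable between short and long names. It does nothing about reordered words, which is the problem described in the first paragraph.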


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: similar texts !?
by Corion (Pope) on Jul 12, 2003 at 13:18 UTC

    I'm not sure whether the module will be suitable for really short texts like filenames, but another option for string similarity is String::Trigram.
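
    For a feel of what a trigram comparison does, here is a hand-rolled sketch. It is not String::Trigram's own interface or scoring formula (check the module's documentation for those); it simply collects the three-character substrings of each string and computes a Dice coefficient over the two sets. The function names are made up for the example.

```perl
use strict;
use warnings;

# Return a hash ref whose keys are the 3-character substrings of the
# lowercased input. Strings shorter than 3 characters yield no trigrams.
sub trigrams {
    my ($text) = @_;
    $text = lc $text;
    my %seen;
    $seen{ substr($text, $_, 3) } = 1 for 0 .. length($text) - 3;
    return \%seen;
}

# Dice coefficient over the two trigram sets: 2 * shared / (sizeA + sizeB).
# Returns a value between 0 (nothing shared) and 1 (same trigram set).
sub trigram_similarity {
    my ($x, $y) = map { trigrams($_) } @_;
    my $total = keys(%$x) + keys(%$y);
    return 0 unless $total;
    my $shared = grep { exists $y->{$_} } keys %$x;
    return 2 * $shared / $total;
}

printf "%.3f\n", trigram_similarity('Rainy Day Woman', 'Rainy Day Women');
```

    Because it counts shared fragments rather than positions, a measure like this is somewhat more tolerant of reordered words than a plain edit distance, though very short names give it few trigrams to work with.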

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: similar texts !?
by Albannach (Prior) on Jul 12, 2003 at 13:35 UTC
    Here's another vote for Text::Levenshtein which I have found very handy for comparing strings (mostly detecting data entry errors), especially those with mixed letters and numbers, though I too wish I could get the XS version working.

    I'd also like to point out Text::Metaphone as a soundex on steroids, as I've found soundex to be too insensitive at times. Note however that all but letters are ignored by Metaphone, which may limit its usefulness to you.

    I think BrowserUk points out a serious problem in the case of MP3 files, but as most cases I've seen use some sort of fairly standard separators between "fields" in the filename, you could split each name into fields, then do the comparisons between two MP3 names on all possible pairings, selecting the best match as the most likely set of pairings. This will of course be much slower than comparing the entire name, but there are probably only 3 or 4 fields per name so you shouldn't be looking at run times greater than the lifetime of the universe either.
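
    The field-pairing approach above might look something like this. It is a sketch under assumptions: the separators (" - " and runs of two or more underscores), the helper names, and the padding of the shorter field list with empty strings are all choices made for the example, and the edit distance is hand-written so the snippet runs without Text::Levenshtein installed.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Dynamic-programming edit distance (insert, delete, substitute each cost 1).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Split a filename into lowercased fields on the assumed separators.
sub fields {
    my ($name) = @_;
    $name = lc $name;
    $name =~ s/\.\w{2,4}$//;                    # drop the extension
    my @parts = split /\s+-\s+|__+/, $name;     # " - " or "__" between fields
    for (@parts) { tr/_/ /; s/^\s+|\s+$//g; }   # tidy each field in place
    return grep { length } @parts;
}

# Try every one-to-one pairing of the remaining fields; return the lowest total cost.
sub _best {
    my ($l, $r) = @_;
    return 0 unless @$l;
    my ($first, @rest) = @$l;
    my $best;
    for my $i (0 .. $#$r) {
        my @others = @$r;
        my ($picked) = splice @others, $i, 1;
        my $cost = edit_distance($first, $picked) + _best(\@rest, \@others);
        $best = $cost if !defined $best || $cost < $best;
    }
    return $best;
}

# Smallest total edit distance over all pairings of the two names' fields.
sub best_pairing_cost {
    my @l = fields($_[0]);
    my @r = fields($_[1]);
    push @l, '' while @l < @r;   # pad so both lists have the same length
    push @r, '' while @r < @l;
    return _best(\@l, \@r);
}

print best_pairing_cost('Flatt and Scruggs - Rainy Day Woman.mp3',
                        'Rainy_Day_Woman__Flatt_and_Scruggs.mp3'), "\n";
```

    The search tries n! pairings for n fields, which is fine for the 3 or 4 fields mentioned above but would need a proper assignment algorithm for anything much larger.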

    --
    I'd like to be able to assign to an luser

Re: similar texts !?
by Zaxo (Archbishop) on Jul 12, 2003 at 15:47 UTC

    Here's a fairly simple method to measure similarity of strings. Given $one and $two, apply text compression to each and to their concatenation. The ratio of the size of them compressed together to the sum of the separately compressed sizes measures their similarity. The smaller, the closer.

    #!/usr/bin/perl
    use Compress::Zlib 'compress';

    # Usage: $arrayref = similarity( LIST)
    # Returns: AoA reference to string similarity table for LIST
    sub similarity {
        my (%single, @ret) = map {$_ => length compress $_} @_;
        for my $this (@_) {
            push @ret, [
                map {
                    (length compress $this . $_) /
                    ($single{$this} + $single{$_})
                } @_
            ];
        }
        \@ret;
    }

    my @titles = (
        q(The Last Public Hanging In Old West Virginia - Flatt and Scruggs),
        q(Flatt_and_Scruggs__The_Last_Public_Hanging_In_Old_West_Virginia),
        q(Rainy Day Woman Number 12 and 35 - Flatt and Scruggs),
        q(Rainy Day Woman Number Twelve and Thirty-five - Bob Dylan),
    );

    my $results = similarity @titles;
    for my $this (@$results) {
        print pack('A6' x @$this, map {sprintf '%4.3f', $_} @$this), $/;
    }
    __END__
    0.529 0.715 0.784 0.841
    0.708 0.529 0.887 0.870
    0.784 0.863 0.536 0.748
    0.848 0.863 0.739 0.532

    Note that 0.500 is the ideal minimum for that, so subtracting .5 from those would give more impressive differences.

    I saw this technique described in a SciAm recently. Will update if I can find out which.
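
    The remark about 0.500 being the ideal minimum can be turned into a small rescaling step. The sketch below is only one way to do it: the function names are invented for the example, and the linear mapping (subtract 0.5, then double) is an assumption about what "more impressive differences" might mean in practice.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress);

# Ratio from the post above: compressed size of the concatenation divided by
# the sum of the separately compressed sizes. Close to 0.5 for near-identical
# strings, larger for unrelated ones.
sub compression_ratio {
    my ($one, $two) = @_;
    return length(compress($one . $two))
         / (length(compress($one)) + length(compress($two)));
}

# Shift and stretch the ratio so the ideal minimum of 0.5 maps to 0
# and a ratio of 1.0 maps to 1.
sub rescaled_distance {
    my ($one, $two) = @_;
    return 2 * (compression_ratio($one, $two) - 0.5);
}

my $flatt = q(Rainy Day Woman Number 12 and 35 - Flatt and Scruggs);
my $dylan = q(Rainy Day Woman Number Twelve and Thirty-five - Bob Dylan);
printf "%.3f\n", rescaled_distance($flatt, $flatt);
printf "%.3f\n", rescaled_distance($flatt, $dylan);
```

    Even for identical strings the result stays a little above zero, since the compressed concatenation is never quite as small as one compressed copy.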

    After Compline,
    Zaxo

      Cool, waiting for the update...

      Great idea, but a small caveat anyway---the accuracy of this method increases greatly on texts that are a bit longer than MP3 filenames. :)

      (The keyword for any googling on the subject is "maximum entropy". You can have a look here as well.)

      --
      Allolex
