Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Fuzzy Strings

by orthanc (Monk)
on Sep 05, 2000 at 13:11 UTC ( #31105=perlquestion: print w/ replies, xml ) Need Help??
orthanc has asked for the wisdom of the Perl Monks concerning the following question:

Hi Fellow Monks

I have a problem where I have two lists of strings, I need to find the best match for each string in the first list with a string from the second list. I hope that makes sense.

I have tried using String::Approx 'amatch' but this returns all of the best matches, I need to receive only the best match.

Does anyone know of another module or method I can use, or should I be hacking amatch apart :).
Hope someone can help

Regards, Orthanc

Comment on Fuzzy Strings
(jeffa) Re: Fuzzy Strings
by jeffa (Chancellor) on Sep 05, 2000 at 17:55 UTC
    You can try Text::Soundex, but like String::Approx, finding the 'best' match is not an exact science.
    use Text::Soundex; $similar = soundex($string); @sims = soundex(@strings);

    Jeff

Re: Fuzzy Strings
by orthanc (Monk) on Sep 05, 2000 at 17:58 UTC

    Hi Again

    In response to Jeff:

    It wasn't that I thought String::Approx was not working it was just that it returned all the matches where as I just want the function to return the highest scoring match.

    Orthanc

      You could reinvoke String::Approx using successively more and more relaxed constraints until it gets some hits. Those hits would then be all equally "the best", unless you have some other criterion in mind. For example, if given "abcd" and the best matches are "abce" or "bbcd", there's no one match that's "best". Perhaps you need to rethink your problem.

      -- Randal L. Schwartz, Perl hacker

(Ovid) Re: Fuzzy Strings
by Ovid (Cardinal) on Sep 05, 2000 at 18:23 UTC
    Well, that depends upon how you define "best match." The first step in tackling any problem is defining it clearly. By best match, do you mean the most letters in common? If so, you could potentially takes all strings and break their letters into a hash with the value being the occurence of each letter and do a foreach loop over the keys and keep a count of the differences. Of course, this doesn't take into account the "order" of the letters. It would find "notnilC" as matching "Clinton."

    If you do take the hash approach, you might want to consider letter frequency. Since the letter 's' is more frequent than the letter 'q', does that mean that 'said' is a closer match to 'laid' than 'qaid'? (yes, that's a word)

    Also, do you know that with String::Approx that you can adjust the number of "edits"? For example, for a word with only two characters of difference, you can specify:

    my @catches = amatch("plugh", ['2'], @inputs);
    You could set the number of edits to 1 and if that doesn't return a list to examine, just keep increasing the number of edits until you get something.

    You may also want to check out Text::Soundex which will encode words into four character strings that represents what they "sound" like. Then, you can compare the shorter strings. I don't know how reliable this is and it's only for the English language.

    A final option to consider is Text::Metaphone, which does phonetic encoding of words. You could then check to see if words sound the same (yeah, I know, this is a longshot). I do not know if this is for languages other than English.

    Since you have a "fuzzy" problem, there is going to be no simple solution to this problem and you will have quite a time working with this, I'm sure. However, it might make a nifty module for CPAN, when finished.

    Cheers,
    Ovid

Re: Fuzzy Strings
by orthanc (Monk) on Sep 05, 2000 at 18:26 UTC

    Thanks All

    I've successfully used String::Similarity to get back a score for each string. I'm then storing the highest scoring match.

    Maybe this will be useful to others with similar problems.

    Regards, Orthanc

Re: Fuzzy Strings
by orthanc (Monk) on Sep 05, 2000 at 18:42 UTC

    Well That Has Made My Day !!

    I helped Randal :)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://31105]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (11)
As of 2014-11-22 16:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My preferred Perl binaries come from:














    Results (123 votes), past polls