Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

compare strings intuitively

by rsiedl (Friar)
on May 07, 2007 at 02:34 UTC ( [id://613878]=perlquestion: print w/replies, xml ) Need Help??

rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

hi monks,

i'm trying to compare two strings which are almost identical using the module String::Compare.
i've found out though that its not very intuitive. for example these two strings only give a score of 0.39871...
my $str1 = "dn despaigne"; my $str2 = "dan despaigne";
can anyone suggest a better way or another module to compare strings?

cheers,
rsiedl

Replies are listed 'Best First'.
Re: compare strings intuitively
by Zaxo (Archbishop) on May 07, 2007 at 02:46 UTC
Re: compare strings intuitively
by almut (Canon) on May 07, 2007 at 07:04 UTC

    There's also String::Approx, whose measure of similarity/approximateness is based on the Levenshtein edit distance. The default approximateness for a match is 10%, but you specify other values (see the docs).

    use String::Approx qw(amatch adist); my $str1 = "dn despaigne"; my $str2 = "dan despaigne"; print "matched\n" if amatch($str1, $str2); print "matched\n" if amatch($str1, ["5%"], $str2); printf "distance = %d\n", adist($str1, $str2); # number of edits requi +red
Re: compare strings intuitively
by graff (Chancellor) on May 07, 2007 at 06:36 UTC
    Nice replies so far. I would add that any single score returned by String::Compare is probably not intended to be "intuitively" meaningful in itself; instead, the intent seems to be that when looking for "fuzzy matches" among a set of strings, the relative scores might serve as a means for ranking them in terms of their degree of similarity.

    For example, if you had a $str3 set to "don dispeigne", comparing this one to the other two would yield scores that were "worse" (lower, meaning "less similar") than the 0.39871 that you got from comparing your first two strings.

    I think the issue depends on what you intend to do with the "similarity score" (or "difference score") once you have it: what is your purpose in assigning a numeric value to the difference between two strings?

    Having looked at the source code for String::Compare, it's hard to say how its scoring would compare to that of Text::Levenshtein (or to any sort of arithmetic result you might be able to derive from using Algorithm::Diff, which is another fine and powerful module).

    I like String::Compare's notion of combining results from a variety of fairly simple, linguistically-motivated tests (score just the consonants, then just the vowels, then all letters, then all characters, then weighted in some way according to "word count", etc). But the implementation in that module seems awfully simple... I've used Algorithm::Diff and I've read its manual; because of that, I'm inclined to think that the problem is not as simple as String::Compare's methods would suggest.

      [...] you might be able to derive from using Algorithm::Diff
      A personal remark from me about anything based on: diff: it performs very badly (or "unintuitively" ;-)) when two adjectant substrings are swapped, which is a very common typo, at least for me, I do it all the time — usually with single letters.
Re: compare strings intuitively
by dewey (Pilgrim) on May 07, 2007 at 05:01 UTC
    Interesting question-- whether or not the module Zaxo suggests does it for you, can we clarify what your intuitions are? From your example, I would guess that what matters to you is the number of operations required to turn one string into another (where character substitution and insertion are allowed as operations) as compared to the length of the strings (one-character strings 'a' and 'b' are, to me, less identical than 'aaaaaa' and 'aaaaba'). Any ideas?

    ~dewey
      i'm sorting through a bunch of data with an author field. sometimes an author could be represented as 'a name' and other times 'am name', things like that. my goal is to try and work out which authors are the same people.
      "what matters to you is the number of operations required to turn one string into another" - yeah thats pretty much right but with a few exceptions.
      'a name' is obviously different from 'b name'
      but
      'am name' could be 'a name'
      the substitution method wouldnt work on its own here.

        Good luck ... this is a tough nut to crack with any non-trivial set of data - the number of false positives is going to be high. Uncommon names will work well but the m. smiths of the world are not going to be happy campers.

        -derby
Re: compare strings intuitively
by Moron (Curate) on May 07, 2007 at 11:58 UTC
    Unless we know what it's for or failing that what you think should happen functionally, it would be rash to suggest any particular module or algorithm, If it's to correct typing errors, the "number of edits needed" is the most useful information. But if you want more statistical information, or if other factors play a part, you need to make very clear decisions (*) such as:-

    - is order of the letters important, of weighted importance (and with what weighting), or irrelevant?

    - are repeated letters important?

    - should it be case-sensitive?

    - are common dictionary variations or alternative parts of speech a close match?

    - what part if any does soundex play in determining a match?

    - is the match quotient intended to drive a hashing or other storage algorithm?

    - etc. etc. etc.

    I would not feel comfortable making any guess as to all those undeclared possibilities!

    Update: * and of course post such clearer decisions in this thread.

    __________________________________________________________________________________

    ^M Free your mind!

Re: compare strings intuitively
by CountZero (Bishop) on May 07, 2007 at 16:38 UTC
    I almost dare not suggest it, but what about the Text::Soundex module? Or the more modern variant Text::Metaphone?

    By itself it will not be useful as it will hash into a too small space, but as an additional criterium it is to be considered, I feel. It would not account for casual mis-typings of course, but would assist with variants of the name, typed by someone who played it "by ear".

    Wikipedia has some interesting links to more sophisticated algorithms.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://613878]
Approved by BrowserUk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (7)
As of 2024-04-23 11:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found