compare strings intuitively

rsiedl has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: compare strings intuitively by Zaxo (Archbishop) on May 07, 2007 at 02:46 UTC
Text::Levenshtein or Text::LevenshteinXS will give a metric that is very good logically. Can't say how intuitive you'll find it. After Compline, Zaxo	[reply]
Re: compare strings intuitively by almut (Canon) on May 07, 2007 at 07:04 UTC
There's also `String::Approx`, whose measure of similarity/approximateness is based on the Levenshtein edit distance. The default approximateness for a match is 10%, but you specify other values (see the docs). `use String::Approx qw(amatch adist); my $str1 = "dn despaigne"; my $str2 = "dan despaigne"; print "matched\n" if amatch($str1, $str2); print "matched\n" if amatch($str1, ["5%"], $str2); printf "distance = %d\n", adist($str1, $str2); # number of edits requi +red` [download]	[reply] [d/l] [select]
Re: compare strings intuitively by graff (Chancellor) on May 07, 2007 at 06:36 UTC
Nice replies so far. I would add that any single score returned by String::Compare is probably not intended to be "intuitively" meaningful in itself; instead, the intent seems to be that when looking for "fuzzy matches" among a set of strings, the relative scores might serve as a means for ranking them in terms of their degree of similarity. For example, if you had a $str3 set to "don dispeigne", comparing this one to the other two would yield scores that were "worse" (lower, meaning "less similar") than the 0.39871 that you got from comparing your first two strings. I think the issue depends on what you intend to do with the "similarity score" (or "difference score") once you have it: what is your purpose in assigning a numeric value to the difference between two strings? Having looked at the source code for String::Compare, it's hard to say how its scoring would compare to that of Text::Levenshtein (or to any sort of arithmetic result you might be able to derive from using Algorithm::Diff, which is another fine and powerful module). I like String::Compare's notion of combining results from a variety of fairly simple, linguistically-motivated tests (score just the consonants, then just the vowels, then all letters, then all characters, then weighted in some way according to "word count", etc). But the implementation in that module seems awfully simple... I've used Algorithm::Diff and I've read its manual; because of that, I'm inclined to think that the problem is not as simple as String::Compare's methods would suggest.	[reply]
Re^2: compare strings intuitively by bart (Canon) on May 07, 2007 at 12:05 UTC
[...] you might be able to derive from using Algorithm::Diff A personal remark from me about anything based on: diff: it performs very badly (or "unintuitively" ;-)) when two adjectant substrings are swapped, which is a very common typo, at least for me, I do it all the time — usually with single letters.	[reply]
Re: compare strings intuitively by dewey (Pilgrim) on May 07, 2007 at 05:01 UTC
Interesting question-- whether or not the module Zaxo suggests does it for you, can we clarify what your intuitions are? From your example, I would guess that what matters to you is the number of operations required to turn one string into another (where character substitution and insertion are allowed as operations) as compared to the length of the strings (one-character strings 'a' and 'b' are, to me, less identical than 'aaaaaa' and 'aaaaba'). Any ideas? ~dewey	[reply]
Re^2: compare strings intuitively by rsiedl (Friar) on May 07, 2007 at 07:34 UTC
i'm sorting through a bunch of data with an author field. sometimes an author could be represented as 'a name' and other times 'am name', things like that. my goal is to try and work out which authors are the same people. "what matters to you is the number of operations required to turn one string into another" - yeah thats pretty much right but with a few exceptions. 'a name' is obviously different from 'b name' but 'am name' could be 'a name' the substitution method wouldnt work on its own here.	[reply]
Re^3: compare strings intuitively by derby (Abbot) on May 07, 2007 at 12:18 UTC
Good luck ... this is a tough nut to crack with any non-trivial set of data - the number of false positives is going to be high. Uncommon names will work well but the m. smiths of the world are not going to be happy campers. -derby	[reply]
Re: compare strings intuitively by Moron (Curate) on May 07, 2007 at 11:58 UTC
Unless we know what it's for or failing that what you think should happen functionally, it would be rash to suggest any particular module or algorithm, If it's to correct typing errors, the "number of edits needed" is the most useful information. But if you want more statistical information, or if other factors play a part, you need to make very clear decisions () such as:- - is order of the letters important, of weighted importance (and with what weighting), or irrelevant? - are repeated letters important? - should it be case-sensitive? - are common dictionary variations or alternative parts of speech a close match? - what part if any does soundex play in determining a match? - is the match quotient intended to drive a hashing or other storage algorithm? - etc. etc. etc. I would not feel comfortable making any guess as to all those undeclared possibilities! Update: and of course post such clearer decisions in this thread. __________________________________________________________________________________ ^M Free your mind!	[reply]
Re: compare strings intuitively by CountZero (Bishop) on May 07, 2007 at 16:38 UTC
I almost dare not suggest it, but what about the Text::Soundex module? Or the more modern variant Text::Metaphone? By itself it will not be useful as it will hash into a too small space, but as an additional criterium it is to be considered, I feel. It would not account for casual mis-typings of course, but would assist with variants of the name, typed by someone who played it "by ear". Wikipedia has some interesting links to more sophisticated algorithms. CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law	[reply]


Perl Monk, Perl Meditation
	PerlMonks