Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Levenshtein distance: calculating similarity of strings

by spurperl (Priest)
on Mar 24, 2003 at 12:58 UTC ( #245428=CUFP: print w/ replies, xml ) Need Help??

Friend monks,

Sometimes there is a need to find out how similar two strings are. I hereby present one of the most useful and intuitive metrics for string similarity - the Levenshtein distance.

There is a Text::Levenshtein module on CPAN, but it is not well documented. My implementation, on the other hand, explains the algorithm step by step. As they say - "sometimes the best way to understand how the wheel works is to reinvent the wheel".

Take a look, the algorithm is very interesting - a neat hack, I'd say.

#!/usr/local/bin/perl -w use strict; ($#ARGV == 1) or die "Usage: $0 <string1> <string2>\n"; my ($s1, $s2) = (@ARGV); print "The Levenshtein distance between $s1 and $s2 is: " . levenshtei +n($s1, $s2) . "\n"; # Return the Levenshtein distance (also called Edit distance) # between two strings # # The Levenshtein distance (LD) is a measure of similarity between two # strings, denoted here by s1 and s2. The distance is the number of # deletions, insertions or substitutions required to transform s1 into # s2. The greater the distance, the more different the strings are. # # The algorithm employs a promimity matrix, which denotes the # distances between substrings of the two given strings. Read the # embedded comments for more info. If you want a deep understanding # of the algorithm, printthe matrix for some test strings # and study it # # The beauty of this system is that nothing is magical - the distance # is intuitively understandable by humans # # The distance is named after the Russian scientist Vladimir # Levenshtein, who devised the algorithm in 1965 # sub levenshtein { # $s1 and $s2 are the two strings # $len1 and $len2 are their respective lengths # my ($s1, $s2) = @_; my ($len1, $len2) = (length $s1, length $s2); # If one of the strings is empty, the distance is the length # of the other string # return $len2 if ($len1 == 0); return $len1 if ($len2 == 0); my %mat; # Init the distance matrix # # The first row to 0..$len1 # The first column to 0..$len2 # The rest to 0 # # The first row and column are initialized so to denote distance # from the empty string # for (my $i = 0; $i <= $len1; ++$i) { for (my $j = 0; $j <= $len2; ++$j) { $mat{$i}{$j} = 0; $mat{0}{$j} = $j; } $mat{$i}{0} = $i; } # Some char-by-char processing is ahead, so prepare # array of chars from the strings # my @ar1 = split(//, $s1); my @ar2 = split(//, $s2); for (my $i = 1; $i <= $len1; ++$i) { for (my $j = 1; $j <= $len2; ++$j) { # Set the cost to 1 iff the ith char of $s1 # equals the jth of $s2 # # Denotes a substitution cost. When the char are equal # there is no need to substitute, so the cost is 0 # my $cost = ($ar1[$i-1] eq $ar2[$j-1]) ? 0 : 1; # Cell $mat{$i}{$j} equals the minimum of: # # - The cell immediately above plus 1 # - The cell immediately to the left plus 1 # - The cell diagonally above and to the left + the cost # # We can either insert a new char, delete a char of # substitute an existing char (with an associated cost) # $mat{$i}{$j} = min([$mat{$i-1}{$j} + 1, $mat{$i}{$j-1} + 1, $mat{$i-1}{$j-1} + $cost]); } } # Finally, the distance equals the rightmost bottom cell # of the matrix # # Note that $mat{$x}{$y} denotes the distance between the # substrings 1..$x and 1..$y # return $mat{$len1}{$len2}; } # minimal element of a list # sub min { my @list = @{$_[0]}; my $min = $list[0]; foreach my $i (@list) { $min = $i if ($i < $min); } return $min; }

Comment on Levenshtein distance: calculating similarity of strings
Download Code
Re: Levenshtein distance: calculating similarity of strings
by omnibus (Scribe) on Mar 24, 2003 at 16:34 UTC
    The next step is to work your way backwards through %mat to determine the best alignment of the two sequences. This is the Needleman/Wunsch algorithm. There is a good writeup on it, including diagrams, here
Re: Levenshtein distance: calculating similarity of strings
by PodMaster (Abbot) on Mar 24, 2003 at 18:11 UTC
      • When I said "what else is new" I didn't refer to CPAN modules, but the general practice of not commenting. I see how I could've been misunderstood
      • http://www.merriampark.com/ld.htm#FLAVORS indeed gives a step-by-step recipe to the algorithm, but if you read carefully you see that it doesn't go a bit into explaining why things are the way they are. My code does
      • To avoid further mis-understanding, I removed the offending phrase
Re: Levenshtein distance: calculating similarity of strings
by hossman (Prior) on Mar 25, 2003 at 01:57 UTC

    In addition to Text::Levenshtein (whose docs look short, but sufficient to me) Jarkko has a great module that does the same thing, but with a lot more fine control: String::Approx.

    It provides an LRU cache for what he calls the "matching engines" of a pattern (so that testing the distance between $string1 and $string3 doesn't incur a bunch of overhead you allready spent on calculating hte distance between $string1 and $string2), it provides lots of usefull methods that utilize the Levenshtein Distance to do splicing, slicing, or finding the location of substring ... and he provides modifiers that let you say things like "allow only inserts or substitutions of characters, but no deleters" or tell me if these strings are 90% similar"

Re: Levenshtein distance: calculating similarity of strings
by goldbug (Initiate) on Mar 27, 2003 at 16:55 UTC
    There are probably similar posts on this topic but I'd just like to mention String::Trigram. It's also a good algorithm and I found the documentation to be more than adequate. I used this to do fuzzy matching for a translation program (similar to what commercial translation programs do). I would be very interested to see if anyone has any benchmarks on these different algorithms.

      hello everyone,

      I am trying to get the separate results from the Levenshtein distance, namely the number of deleted, inserted and replaced characters. Do you have any idea?

      Thanks in advance.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: CUFP [id://245428]
Front-paged by diotalevi
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (10)
As of 2014-11-28 12:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My preferred Perl binaries come from:














    Results (196 votes), past polls