Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask

Levenshtein distance: calculating similarity of strings

by spurperl (Priest)
on Mar 24, 2003 at 12:58 UTC ( #245428=CUFP: print w/replies, xml ) Need Help??

Friend monks,

Sometimes there is a need to find out how similar two strings are. I hereby present one of the most useful and intuitive metrics for string similarity - the Levenshtein distance.

There is a Text::Levenshtein module on CPAN, but it is not well documented. My implementation, on the other hand, explains the algorithm step by step. As they say - "sometimes the best way to understand how the wheel works is to reinvent the wheel".

Take a look, the algorithm is very interesting - a neat hack, I'd say.

#!/usr/local/bin/perl -w use strict; ($#ARGV == 1) or die "Usage: $0 <string1> <string2>\n"; my ($s1, $s2) = (@ARGV); print "The Levenshtein distance between $s1 and $s2 is: " . levenshtei +n($s1, $s2) . "\n"; # Return the Levenshtein distance (also called Edit distance) # between two strings # # The Levenshtein distance (LD) is a measure of similarity between two # strings, denoted here by s1 and s2. The distance is the number of # deletions, insertions or substitutions required to transform s1 into # s2. The greater the distance, the more different the strings are. # # The algorithm employs a promimity matrix, which denotes the # distances between substrings of the two given strings. Read the # embedded comments for more info. If you want a deep understanding # of the algorithm, printthe matrix for some test strings # and study it # # The beauty of this system is that nothing is magical - the distance # is intuitively understandable by humans # # The distance is named after the Russian scientist Vladimir # Levenshtein, who devised the algorithm in 1965 # sub levenshtein { # $s1 and $s2 are the two strings # $len1 and $len2 are their respective lengths # my ($s1, $s2) = @_; my ($len1, $len2) = (length $s1, length $s2); # If one of the strings is empty, the distance is the length # of the other string # return $len2 if ($len1 == 0); return $len1 if ($len2 == 0); my %mat; # Init the distance matrix # # The first row to 0..$len1 # The first column to 0..$len2 # The rest to 0 # # The first row and column are initialized so to denote distance # from the empty string # for (my $i = 0; $i <= $len1; ++$i) { for (my $j = 0; $j <= $len2; ++$j) { $mat{$i}{$j} = 0; $mat{0}{$j} = $j; } $mat{$i}{0} = $i; } # Some char-by-char processing is ahead, so prepare # array of chars from the strings # my @ar1 = split(//, $s1); my @ar2 = split(//, $s2); for (my $i = 1; $i <= $len1; ++$i) { for (my $j = 1; $j <= $len2; ++$j) { # Set the cost to 1 iff the ith char of $s1 # equals the jth of $s2 # # Denotes a substitution cost. When the char are equal # there is no need to substitute, so the cost is 0 # my $cost = ($ar1[$i-1] eq $ar2[$j-1]) ? 0 : 1; # Cell $mat{$i}{$j} equals the minimum of: # # - The cell immediately above plus 1 # - The cell immediately to the left plus 1 # - The cell diagonally above and to the left + the cost # # We can either insert a new char, delete a char of # substitute an existing char (with an associated cost) # $mat{$i}{$j} = min([$mat{$i-1}{$j} + 1, $mat{$i}{$j-1} + 1, $mat{$i-1}{$j-1} + $cost]); } } # Finally, the distance equals the rightmost bottom cell # of the matrix # # Note that $mat{$x}{$y} denotes the distance between the # substrings 1..$x and 1..$y # return $mat{$len1}{$len2}; } # minimal element of a list # sub min { my @list = @{$_[0]}; my $min = $list[0]; foreach my $i (@list) { $min = $i if ($i < $min); } return $min; }

Replies are listed 'Best First'.
Re: Levenshtein distance: calculating similarity of strings
by omnibus (Scribe) on Mar 24, 2003 at 16:34 UTC
    The next step is to work your way backwards through %mat to determine the best alignment of the two sequences. This is the Needleman/Wunsch algorithm. There is a good writeup on it, including diagrams, here
Re: Levenshtein distance: calculating similarity of strings
by PodMaster (Abbot) on Mar 24, 2003 at 18:11 UTC
      • When I said "what else is new" I didn't refer to CPAN modules, but the general practice of not commenting. I see how I could've been misunderstood
      • indeed gives a step-by-step recipe to the algorithm, but if you read carefully you see that it doesn't go a bit into explaining why things are the way they are. My code does
      • To avoid further mis-understanding, I removed the offending phrase
Re: Levenshtein distance: calculating similarity of strings
by hossman (Prior) on Mar 25, 2003 at 01:57 UTC

    In addition to Text::Levenshtein (whose docs look short, but sufficient to me) Jarkko has a great module that does the same thing, but with a lot more fine control: String::Approx.

    It provides an LRU cache for what he calls the "matching engines" of a pattern (so that testing the distance between $string1 and $string3 doesn't incur a bunch of overhead you allready spent on calculating hte distance between $string1 and $string2), it provides lots of usefull methods that utilize the Levenshtein Distance to do splicing, slicing, or finding the location of substring ... and he provides modifiers that let you say things like "allow only inserts or substitutions of characters, but no deleters" or tell me if these strings are 90% similar"

Re: Levenshtein distance: calculating similarity of strings
by goldbug (Initiate) on Mar 27, 2003 at 16:55 UTC
    There are probably similar posts on this topic but I'd just like to mention String::Trigram. It's also a good algorithm and I found the documentation to be more than adequate. I used this to do fuzzy matching for a translation program (similar to what commercial translation programs do). I would be very interested to see if anyone has any benchmarks on these different algorithms.

      hello everyone,

      I am trying to get the separate results from the Levenshtein distance, namely the number of deleted, inserted and replaced characters. Do you have any idea?

      Thanks in advance.

Re: Levenshtein distance: calculating similarity of strings
by Anonymous Monk on Feb 21, 2018 at 22:04 UTC
    This code uses a hash instead of a simple array for mat. I changed it to a regular array, and it reduced the execution speed by 19.3%.
      ... it reduced the execution speed ...

      So just to be clear, are you saying your change made the code slower by about 20%?

      Give a man a fish:  <%-{-{-{-<

        Oops, sorry. It reduced execution time by about 20%. I'm not allowed to post my actual code, but it was just the few obvious changes: declare @mat rather than %mat, and use [] rather than {}.

        I made a loop that called levenshtein with 2 arrays of strings of 2-character words. Between one and 25 words per string, comparing all possible combinations of 51 different such "sentences". Hence 51 * 51 = 2061 string comparisons of between one and 25 words per sentence. I executed it on an old MacBook Pro 200 times, and took the average "user" time from the "time" command.

        Average "user" time was 19.31% shorter.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://245428]
Front-paged by diotalevi
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (11)
As of 2023-03-29 14:13 GMT
Find Nodes?
    Voting Booth?
    Which type of climate do you prefer to live in?

    Results (72 votes). Check out past polls.