Friend monks,

Sometimes there is a need to find out how similar two strings are. I hereby present one of the most useful and intuitive metrics for string similarity - the Levenshtein distance.

There is a Text::Levenshtein module on CPAN, but it is not well documented. My implementation, on the other hand, explains the algorithm step by step. As they say - "sometimes the best way to understand how the wheel works is to reinvent the wheel".

Take a look, the algorithm is very interesting - a neat hack, I'd say.

`#!/usr/local/bin/perl -w
use strict;
($#ARGV == 1) or die "Usage: $0 <string1> <string2>\n";
my ($s1, $s2) = (@ARGV);
print "The Levenshtein distance between $s1 and $s2 is: " . levenshtei
+n($s1, $s2) . "\n";
# Return the Levenshtein distance (also called Edit distance)
# between two strings
#
# The Levenshtein distance (LD) is a measure of similarity between two
# strings, denoted here by s1 and s2. The distance is the number of
# deletions, insertions or substitutions required to transform s1 into
# s2. The greater the distance, the more different the strings are.
#
# The algorithm employs a promimity matrix, which denotes the
# distances between substrings of the two given strings. Read the
# embedded comments for more info. If you want a deep understanding
# of the algorithm, printthe matrix for some test strings
# and study it
#
# The beauty of this system is that nothing is magical - the distance
# is intuitively understandable by humans
#
# The distance is named after the Russian scientist Vladimir
# Levenshtein, who devised the algorithm in 1965
#
sub levenshtein
{
# $s1 and $s2 are the two strings
# $len1 and $len2 are their respective lengths
#
my ($s1, $s2) = @_;
my ($len1, $len2) = (length $s1, length $s2);
# If one of the strings is empty, the distance is the length
# of the other string
#
return $len2 if ($len1 == 0);
return $len1 if ($len2 == 0);
my %mat;
# Init the distance matrix
#
# The first row to 0..$len1
# The first column to 0..$len2
# The rest to 0
#
# The first row and column are initialized so to denote distance
# from the empty string
#
for (my $i = 0; $i <= $len1; ++$i)
{
for (my $j = 0; $j <= $len2; ++$j)
{
$mat{$i}{$j} = 0;
$mat{0}{$j} = $j;
}
$mat{$i}{0} = $i;
}
# Some char-by-char processing is ahead, so prepare
# array of chars from the strings
#
my @ar1 = split(//, $s1);
my @ar2 = split(//, $s2);
for (my $i = 1; $i <= $len1; ++$i)
{
for (my $j = 1; $j <= $len2; ++$j)
{
# Set the cost to 1 iff the ith char of $s1
# equals the jth of $s2
#
# Denotes a substitution cost. When the char are equal
# there is no need to substitute, so the cost is 0
#
my $cost = ($ar1[$i-1] eq $ar2[$j-1]) ? 0 : 1;
# Cell $mat{$i}{$j} equals the minimum of:
#
# - The cell immediately above plus 1
# - The cell immediately to the left plus 1
# - The cell diagonally above and to the left + the cost
#
# We can either insert a new char, delete a char of
# substitute an existing char (with an associated cost)
#
$mat{$i}{$j} = min([$mat{$i-1}{$j} + 1,
$mat{$i}{$j-1} + 1,
$mat{$i-1}{$j-1} + $cost]);
}
}
# Finally, the distance equals the rightmost bottom cell
# of the matrix
#
# Note that $mat{$x}{$y} denotes the distance between the
# substrings 1..$x and 1..$y
#
return $mat{$len1}{$len2};
}
# minimal element of a list
#
sub min
{
my @list = @{$_[0]};
my $min = $list[0];
foreach my $i (@list)
{
$min = $i if ($i < $min);
}
return $min;
}
`

Comment onLevenshtein distance: calculating similarity of stringsDownloadCode