Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

improved levenshtein

by bageler (Hermit)
on Mar 03, 2004 at 17:31 UTC ( #333616=CUFP: print w/ replies, xml ) Need Help??

I wanted to cut cycles on one of my programs that needs string approximation and the Text::LevenshteinXS module wouldn't install (no 5.8 on the server) so I wrote my own copy of the levenshtein algorithm. benchmarking against Text::Levenshtein showed 15-35% speed gain. I was thinking of uploading this to cpan, since i've seen many something/Fastsomething modules up there. The biggest difference in the algorithm itself (not the short circuiting) is the min function...why people want to loop over an array when they KNOW there are only going to be 3 paramaters really confounds me.
Think this fast module is worthy of cpan?
package Text::FastLevenshtein; use strict; use Exporter; use vars qw ($VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS); $VERSION = '0.02'; @ISA = qw(Exporter); @EXPORT = (); @EXPORT_OK = qw(&distance); %EXPORT_TAGS = (); sub _min { my $min = $_[0]; $min = $_[1] if $_[1] < $min; $min = $_[2] if $_[2] < $min; return $min; } sub distance($$) { my $word1 = shift; my $word2 = shift; return 0 if $word1 eq $word2; my @d; my $len1 = length $word1; my $len2 = length $word2; $d[0][0] = 0; for (1 .. $len1) { $d[$_][0] = $_; return $_ if $_!=$len1 && substr($word1,$_) eq substr( +$word2,$_); } for (1 .. $len2) { $d[0][$_] = $_; return $_ if $_!=$len2 && substr($word1,$_) eq substr( +$word2,$_); } for my $i (1 .. $len1) { my $w1 = substr($word1,$i-1,1); for (1 .. $len2) { $d[$i][$_] = _min($d[$i-1][$_]+1, $d[$i][$_-1] ++1, $d[$i-1][$_-1]+($w1 eq substr($word2,$_-1,1) ? 0 : 1)); } } return $d[$len1][$len2]; } 1; __END__

Comment on improved levenshtein
Download Code
Re: improved levenshtein
by kvale (Monsignor) on Mar 03, 2004 at 17:59 UTC
    That is a nice speedup of the pure Perl version of the algorithm.

    But if I were you, I would benchmark your code against the XS code before you call your stuff 'fast'. The XS version was written explicitly for speed.

    How about submitting a patch for the Text::Levenshtein instead? That way CPAN has one optimized pure Perl module and one optimized XS module for the task, simple.

    -Mark

      I did also benchmark against the XS module, and my module is quite slow in comparison (duh) unless the strings are equal, where my short circuit (return 0 if $word1 eq $word2) lays the smack down, in the parlance of our times.

      for your information, here are my benchmark results:
      joshs-Computer:~/leven josh$ perl levenshtein.pl foo fee The distance between foo and fee is 2 Rate cpan mine cpanxs cpan 80.6/s -- -16% -97% mine 95.5/s 18% -- -97% cpanxs 2844/s 3430% 2879% -- joshs-Computer:~/leven josh$ perl levenshtein.pl foo foo The distance between foo and foo is 0 Rate cpan cpanxs mine cpan 83.2/s -- -97% -98% cpanxs 2849/s 3325% -- -31% mine 4109/s 4840% 44% -- joshs-Computer:~/leven josh$ perl levenshtein.pl foo bar The distance between foo and bar is 3 Rate cpan mine cpanxs cpan 81.2/s -- -15% -97% mine 95.4/s 18% -- -97% cpanxs 2829/s 3385% 2865% -- oshs-Computer:~/leven josh$ perl levenshtein.pl 1234567890 0987654321 The distance between 1234567890 and 0987654321 is 10 Rate cpan mine cpanxs cpan 9.35/s -- -24% -99% mine 12.3/s 31% -- -99% cpanxs 1782/s 18966% 14410% --

        The author of both modules is known here as dree, send your patch to him.

        Ciao, Valerio

        does rolling out _min inline make much difference?


        -Waswas
        in the parlance of our times.

        "New shit has come to light, man!"

        Three thousand years of beautiful tradition, from Moses to Sandy Koufax, you're god damn right I'm living in the fucking past

Re: improved levenshtein
by hossman (Prior) on Mar 03, 2004 at 21:28 UTC

    I concur with kvale submit a patch for the pure perl version ... and while you're at it, since both of the existing modulesare by the same person, you might want to suggest incorperating the short-circut equality test into the XS version as well.

Re: improved levenshtein
by diotalevi (Canon) on Mar 03, 2004 at 23:29 UTC
    String::Approx also implements this algorithm and should be taken into account when deciding whether another CPAN module makes sense.
Re: improved levenshtein
by QM (Vicar) on Mar 04, 2004 at 00:01 UTC
    You can speed up that _min by 20% by using this instead:
    sub _min { return $_[0] < $_[1] ?( $_[0] < $_[2] ? $_[0] : $_[2] ) :( $_[1] < $_[2] ? $_[1] : $_[2] ); }

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: improved levenshtein
by tachyon (Chancellor) on Mar 04, 2004 at 14:45 UTC

    Nice implementation. But I would also suggest that this should patch the existing perl implementation of Text::Levenshtein. Text::LevensteinXS is vanilla C and compiles and installs under 5.6.x on Win32 or Linux. You don't need 5.8 but it comiles with that too. See below. You may find Algorithm::HowSimilar does a very similar job but faster in pure perl.

Re: improved levenshtein
by Bill.Costa (Novice) on Oct 16, 2013 at 01:21 UTC

    I copied this module and put the following test code in the body. I didn't get expected results when I made simple changes within the string. For example, why isn't "set" => "sit" a distance of 1 (this code returns 2). I created a larger test set that also generates substr outside of string warnings as well. But these warnings went away and all the tests work as expected, when I commented out the early exits (return) in the first two for loops.

    my @tSet = ( [ qw(a a 0) ], [ qw(batman batman 0) ], [ qw(a b 1) ], [ qw(here there 1) ], [ qw(cat cats 1) ], [ qw(set sit 1) ], # returns 2 [ qw(robin Robin 1) ], [ qw(robin Bobin 1) ], [ qw(robin roBin 1) ], # returns 2 [ qw(set tes 2) ], [ qw(postal postage 2) ], [ qw(kitten sitting 3) ], [ qw(aaaaaaaaaaaaaaa bbbbbbbbbbbbbbb 15) ], ); use Test::More; foreach my $set (@tSet) { my($s, $t, $d) = @{$set}; my $got = distance($s, $t); if ($got != $d) { ok(0, "$s x $t => $d but got $got") } else { ok(1, "$s x $t => $d") } } done_testing(scalar(@tSet));

      What is the code for distance()?

      Update: Nevermind, the OP was changed while I was posting this.

      The answer to the question "Can we do this?" is always an emphatic "Yes!" Just give me enough time and money.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: CUFP [id://333616]
Approved by valdez
Front-paged by grinder
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (11)
As of 2014-09-30 15:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (375 votes), past polls