Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: finding longest common substring

by BrowserUk (Pope)
on Nov 20, 2003 at 00:00 UTC ( #308451=note: print w/replies, xml ) Need Help??


in reply to finding longest common substring

Building on japhy's regex, this seems to work and it's fairly simple, though I haven't tested it's efficiency. It returns undef if there is no common substring.

Caveat: The strings mustn't contain nulls.

#! perl -slw use strict; sub lcs{ my $strings = join "\0", @_; my $lcs; for my $n ( 1 .. length $strings ) { my $re = "(.{$n})" . '.*\0.*\1' x ( @_ - 1 ); last unless $strings =~ $re; $lcs = $1 } return $lcs; } my @a = <DATA>; chomp @a; print "lcs: ", lcs( @a ); __DATA__ The quick brown fox jump over the lazy dog The quick brown fox jumps over the lazy jumps over the lazy dog The quick brown fox quick brown fox jumps over the lazy dog

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!
Wanted!

Replies are listed 'Best First'.
Re: Re: finding longest common substring
by sauoq (Abbot) on Nov 20, 2003 at 00:44 UTC

    While significantly faster than the OP's, I believe it suffers from some of the same scalability issues. That's one thing I hate about regular expressions: determining the complexity of an algorithm built on them tends to be very difficult.

    Anyway, that also suffers from another problem. Try it with

    my @a = ('a' . 'X' x 32768, 'a' . 'O' x 32768);
    and you'll get an error:
    Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/ +(.{ <-- HERE 32767}).*\0.*\1/

    -sauoq
    "My two cents aren't worth a dime.";
    
      While significantly faster than the OP's

      Actually, simply using index instead of m// in the grep makes my algorithm a bit faster than BrowserUk's regex. Granted, there's still a scalability problem here, but with the small tweak suggested by CombatSquirrel, it's much faster than the original, and works fine on data that is representative of what I'm actually using this for.

      Here is the benchmark code I used:

      And here are the results I got:

      I guess a length limit of 32k for the longest common substring could be a concern for some applications -- but nothing I regularly have to deal with.

      Do you have an alternative?


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      Hooray!
      Wanted!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://308451]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (9)
As of 2020-11-24 10:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?