Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

finding longest common substring

by revdiablo (Prior)
on Nov 19, 2003 at 21:47 UTC ( #308417=perlquestion: print w/ replies, xml ) Need Help??
revdiablo has asked for the wisdom of the Perl Monks concerning the following question:

Hello good Monks. First let me start out with the problem: given an arbitrary list of strings, find the longest common substring. My approach to this problem is to grab one of the strings from the list, scan through it with successively decreasing substring lengths, and check the list for matches. This is simple, and seems quite effective, but I'm wondering if there are any problems with this approach? Could it be done simpler and more efficiently? Is there a well-known algorithm to do this, and I am too daft to find it?

Here is the working code I came up with. I do not have any particular problem with this code, I just wanted to run it by the Monastery to see if anyone could give me suggestions, or find any lurking problems:

#!/usr/bin/perl use warnings; use strict; use Data::Dumper; for ([ qw(fooabc123 fooabc321 foobca232) ], [ qw(abcfoo123 bcafoo321 foo123abc) ], [ qw(foo bor boz bzo) ]) { print Dumper($_); print findlcs(@{ $_ }), "\n"; print "---\n"; } sub findlcs { my $substr = $_[0]; my $len = length $_[0]; my $off = 0; while ($substr) { my @matches = grep /\Q$substr/, @_; #printf "%s%-".(length($_[0])-$off)."s matches %d\n", # " " x $off, $substr, scalar @matches; last if @matches == @_; $off++; $len-- and $off=0 if $off+$len > length $_[0]; $substr = substr $_[0], $off, $len; } return $substr; }

Comment on finding longest common substring
Download Code
Re: finding longest common substring
by etcshadow (Priest) on Nov 19, 2003 at 22:08 UTC
    FYI: you might want to have a look at Algorithm::Diff, which is centered around an LCS implementation.

    ------------
    :Wq
    Not an editor command: Wq

      Can anyone tell me how to use Algoritm::Diff::LCS to find the longest common substring of an array of strings?


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      Hooray!
      Wanted!

        you don't -- LCS != LCSS
Re: finding longest common substring
by duff (Vicar) on Nov 19, 2003 at 22:12 UTC

    One method that might work well enough would be to concatenate the strings and then look for the longest non-overlapping repetition. You'll have to worry about string boundaries though. Here's some code for finding the longest repeated substring in case you take to the idea:

    sub repeated_substring { my ($ssl,$pos) = (1,-1); my $len = length($_[0]); my $i = 0; while ($i < $len-2*$ssl) { if (index($_[0],substr($_[0],$i,$ssl),$i+$ssl) == -1) { $i++; ne +xt } $pos = $i; $ssl++; } return $pos == -1 ? "" : substr($_[0], $pos, $ssl-1); }

    If you joined the strings with some character that won't appear in the strings (a colon say), then you could modify the above such that as soon as you hit a colon, stop

    Update: I just noticed that you're blindly grabbing the first element in the list. An optimization would be to sort the list of strings by length and always start with the shortest one (assuming you continue using your method).

    DOH! I just realized that my method won't work at all!

    Update: Okay ... I'm stubborn. I know it. Here's how to *make* it work with the repeated_substring() routine:

    sub findlcs { my @ret; for my $i (0..$#_-1) { for my $j ($i..$#_) { my $str = join ":", @_[$i,$j]; my $ans = repeated_substring($str); push @ret, $ans; } } return (sort { length($a) <=> length($b) } @ret)[0]; }
    Whew! Boy is that inefficient and ugly! :-)

    revdiablo, you did it well. Don't knock your implementation.

    Another! update: I realized on my way to pick up the kids that even this method fails. I sure hope revdiablo is watching this to see what could have happened to him :-)

Re: finding longest common substring
by CombatSquirrel (Hermit) on Nov 19, 2003 at 22:38 UTC
    I personally like your idea and implementation. They are simple and (appearantly) efficient. You might want to have a look at index instead of the RegEx you are using, especially because you are escaping metacharacters, but that's just a minor issue. Anyways, I tried and benchmarked it; if you are interested in the results, here they are: ++ for you
    Cheers,
    CombatSquirrel.
    Entropy is the tendency of everything going to hell.
Re: finding longest common substring
by revdiablo (Prior) on Nov 19, 2003 at 22:54 UTC

    It seems I was a bit hasty in posting this question, and probably should have done a little more research first. gmax has pointed out Longest Common Substring, and etcshadow has pointed out Algorithm::Diff (which I have seen before, and is undoubtedly the reason the phrase "longest common substring" was floating around in my brain to begin with). Also, I updated my node mentioning japhy's Longest common substring, but apparently the update got eaten. (Is it just my imagination, or didn't we recently gain the ability to update root nodes?)

    While Algorithm::Diff doesn't do exactly the same thing I am doing, it would have probably been useful at least. Sorry for wasting any time, and many thanks for the replies that did come! :-)

      It's not a waste of time (at least, I don't think it is). Personally, I think it's interesting that diff is based on LCS, and there are probably a lot of people who don't know that/might think it's interesting, too.

      Maybe this is more of a meditation, but I don't really consider it to be horribly obnoxious to ask a question that might be answerable via documentation search. Espescially when you consider how frequently people post ludicrously simple questions like "how do I write a regexp that does this: ...?" The other monks don't jump on those folks' cases, with "perldoc perlre... RTFM!" I don't understand why some of them are upset when someone asks a far more intelligent and informed question that might be answered by doc-diving.

      The truth of the matter is that doc-diving is often very difficult for finding where to start. You tend to run into the all-to-frequent problem of "I can find out what $module does, but I can't find which module does $thing".

      Anyway, don't appologize, it was a perfectly reasonable question. I (and the other folks) were just pointing you towards resources that might help in your search for more info.


      ------------
      :Wq
      Not an editor command: Wq
Re: finding longest common substring
by QM (Vicar) on Nov 19, 2003 at 23:06 UTC
    See QOTW Expert #14 for a similar discussion, which may prove helpful (or not).

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      In fact, my unsubmitted entry to QOTW #14 is where my repeated_substring() routine came from

      QOTW 14 solves a slightly different problem, the longest repeated substring in a single string which is somewhat related

      Here is my suffix trees based QOTW entry updated for LCS. In a sense this is the fastest type of algorithm possible, since it's guaranteed linear in both time and space relative to the combined string length (add one termination character to each string, and actually there is also O(number_of_strings) due to the way I currently mark the strings, though that can be improved upon) Unfortunately the needed complex datastructures make it use so much memory and need so much setup that many of the clever solutions in QOTW 14 can be converted to effectively faster solutions.

        Hi,
        I was referred to your node above from my earlier request about Generalized Suffix Tree.
        Since I am very interested to the use of it for bioinformatics purpose.

        I just want to clarify:
        • Does you implementation above covers the "generalized" suffix tree?
          Or is it the same with S.Yona's module?
        • If so, do you have a CPAN module for it?
        • Otherwise, I would really appreciate if you can kindly point me where can I find the actual Perl implementation for it, if you happen to know one.

        Thanks so much beforehand.
        Hope to hear from you again.

        Regards,
        Edward
Re: finding longest common substring
by pizza_milkshake (Monk) on Nov 19, 2003 at 23:49 UTC
    #!perl -wl my @L1 = qw/A b c de fgh ijk lmno pqrst uvwxyz/; my @L2 = qw/a b c de fgh ijk lmnO pqrsT uvwxyZ/; my @MATCHES = (); my $LMATCH = 0; for $I1 (sort{length($b)<=>length($a)}@L1){ last if length($I1) < $LMATCH; for $I2 (grep{$_ eq $I1}sort{length($b)<=>length($a)}@L2){ $LMATCH = length($I2); push @MATCHES, $I1; } } print "LIST1: " . scalar @L1 . " items LIST2: " . scalar @L2 . " items MATCHES: @MATCHES";

    perl -e"map print(chr(hex(( q{6f634070617a6d692e7273650a}=~/../g)[hex]))), (q{375542349abb99098106c}=~/./g)"

Re: finding longest common substring
by BrowserUk (Pope) on Nov 20, 2003 at 00:00 UTC

    Building on japhy's regex, this seems to work and it's fairly simple, though I haven't tested it's efficiency. It returns undef if there is no common substring.

    Caveat: The strings mustn't contain nulls.

    #! perl -slw use strict; sub lcs{ my $strings = join "\0", @_; my $lcs; for my $n ( 1 .. length $strings ) { my $re = "(.{$n})" . '.*\0.*\1' x ( @_ - 1 ); last unless $strings =~ $re; $lcs = $1 } return $lcs; } my @a = <DATA>; chomp @a; print "lcs: ", lcs( @a ); __DATA__ The quick brown fox jump over the lazy dog The quick brown fox jumps over the lazy jumps over the lazy dog The quick brown fox quick brown fox jumps over the lazy dog

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!
    Wanted!

      While significantly faster than the OP's, I believe it suffers from some of the same scalability issues. That's one thing I hate about regular expressions: determining the complexity of an algorithm built on them tends to be very difficult.

      Anyway, that also suffers from another problem. Try it with

      my @a = ('a' . 'X' x 32768, 'a' . 'O' x 32768);
      and you'll get an error:
      Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/ +(.{ <-- HERE 32767}).*\0.*\1/

      -sauoq
      "My two cents aren't worth a dime.";
      

        I guess a length limit of 32k for the longest common substring could be a concern for some applications -- but nothing I regularly have to deal with.

        Do you have an alternative?


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        Hooray!
        Wanted!

        While significantly faster than the OP's

        Actually, simply using index instead of m// in the grep makes my algorithm a bit faster than BrowserUk's regex. Granted, there's still a scalability problem here, but with the small tweak suggested by CombatSquirrel, it's much faster than the original, and works fine on data that is representative of what I'm actually using this for.

        Here is the benchmark code I used:

        And here are the results I got:

Re: finding longest common substring
by sauoq (Abbot) on Nov 20, 2003 at 00:25 UTC

    Without looking at it too closely, that looks fairly clean. It isn't terribly efficient though. It won't scale well to very large strings, especially if the common substrings end up being relatively small. In the worst case, you try (N^2+N)/2 substrings. (Try that with ['a' . 'X' x 10000, 'a' . 'O' x 10000] for instance...)

    If you were doing this in C, you'd probably use Suffix Trees (offsite). Someone once asked about those at (the poorly named) suffix arrays but nothing much came of it except a suggestion by Zaxo that substr() could be used to create a data structure similar to a suffix tree and a note by BrowserUk that Zaxo's suggestions wouldn't work with versions less than 5.8.1.

    Update: Looking at Dominus's site, it turns out that the discussion on Expert QOTW #14 referenced by others also mentions suffix trees. Dominus points to this explanation of them. Warning: that link is to a directory with a bunch of stuff in it. This might be a better starting point. Or, if you prefer, the PDF version. There's also a powerpoint version in there.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: finding longest common substring
by tilly (Archbishop) on Nov 20, 2003 at 03:09 UTC
    The following solution has a lot of overhead, but the algorithm should be more scalable than yours. The pathological case that it sucks at is multiple large strings with one character repeated many, many times. But on real text it should fairly quickly narrow down to just working on text that is repeated at least once per string.

    I'm sure that this approach could be made much faster and clearer.

      This is an interesting problem and that's an intruiging algorithm. It would be really nice to see these all implemented in C and compared. I suspect that yours would fair much better.

      Which set of data you consider to be more realistic, will determine which algorithm/implementation is better suited to your application I guess. It's not very often that the choice of best algorithm varies so wildly with the input data.

      fooabc123 fooabc321 foobca232 Rate Tilly revdiablo CombatSqu BrowserUk Tilly 89.6/s -- -45% -85% -85% revdiablo 163/s 81% -- -72% -72% CombatSqu 581/s 549% 257% -- -1% BrowserUk 587/s 554% 261% 1% -- abcfoo123 bcafoo321 foo123abc Rate Tilly revdiablo CombatSqu BrowserUk Tilly 91.7/s -- -37% -82% -84% revdiablo 145/s 58% -- -72% -74% CombatSqu 514/s 460% 255% -- -9% BrowserUk 563/s 514% 289% 10% -- foo bor boz bzo Rate Tilly revdiablo BrowserUk CombatSqu Tilly 408/s -- -43% -59% -82% revdiablo 719/s 76% -- -28% -68% BrowserUk 995/s 144% 38% -- -56% CombatSqu 2236/s 449% 211% 125% -- The quick brown fox jump over the lazy dog The quick brown fox jumps over the lazy jumps over the lazy dog The quick brown fox quick brown fox jumps over the lazy dog Rate Tilly revdiablo CombatSqu BrowserUk Tilly 3.59/s -- -59% -89% -95% revdiablo 8.75/s 143% -- -74% -87% CombatSqu 33.4/s 829% 282% -- -51% BrowserUk 68.6/s 1807% 683% 105% -- The quick brown fox jump over the lazy dog The quick brown fox jumps over the lazy jumps over the lazy dog The quick brown fox quick brown fox jumps over the lazy dog x Rate revdiablo CombatSqu BrowserUk Tilly revdiablo 3.79/s -- -71% -91% -95% CombatSqu 12.9/s 240% -- -69% -85% BrowserUk 41.0/s 984% 219% -- -51% Tilly 83.2/s 2098% 546% 103% -- The quick brown fox jump over the lazy dog xThe quick brown fox jump over the lazy dog xxThe quick brown fox jump over the lazy dog xxxThe quick brown fox jump over the lazy dog xxxxThe quick brown fox jump over the lazy dog Rate Tilly BrowserUk revdiablo CombatSqu Tilly 0.761/s -- -98% -100% -100% BrowserUk 44.2/s 5714% -- -99% -99% revdiablo 4728/s 621405% 10589% -- -25% CombatSqu 6305/s 828641% 14153% 33% --

      Benchmark


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      Hooray!
      Wanted!

        I guess that I had somewhat more overhead than I realized. I'll think about whether I can improve on that. :-(

        Incidentally my solution compares much better than others if you make the input strings much longer than the common substrings. For instance if the common match is "The quick brown fox" and the strings are each a few hundred characters, I win by a wide margin.

Re: finding longest common substring
by davido (Archbishop) on Nov 20, 2003 at 05:23 UTC
    I haven't seen this method posted yet. It doesn't get fancy with regexps, but is fairly clear and simple to understand. Like the rest, scalability is an issue.

    use strict; use warnings; my @first = qw/ short bigger longest superbig /; my @second = qw/ short longer even longer longest /; my $string = ""; foreach my $item ( @first ) { $string = $item if length $item > length $string and grep { $item eq $_ } @second; } print $string, "\n";

    One way to improve the algorithm to scale better may be to keep the arrays ordered in decending order of length so that you could just stop searching on the first match. That would require more overhead at "insertion" time, but much less at searching time.


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein

      I haven't looked at your solution in depth, but upon initial inspection it seems to do the same thing as pizza_milkshake's at Re: finding longest common substring. I must admit that -- at least to my eyes -- your version looks cleaner and is more understandable than his, but it doesn't seem like either of these solve the problem I was asking about. Perhaps my quick summary of the problem was unclear, or perhaps I am simply missing something. I was looking for the longest substring that is common to all elements of the list. In my example, I have 3 lists, but each one is analyzed independently. Both of your solutions seem to be searching two lists in parallel for the longest matching element. Maybe I'm just not looking at them from the right perspective?

        :) Ok, now that I better understand what you're asking, here's my regex solution. I think that this improves upon some other methods by starting with the longest and working down to the shortest, (quitting as soon as a match is found).

        use strict; use warnings; my @array = ( "this is a string is", "a string this is", "tie a string together" ); my $string = join "|", @array; my $repeat = '\|.*?\1.*?' x ( scalar(@array) - 1 ); for my $n ( reverse ( 1 .. length( $string ) ) ) { next unless $string =~ /(.{$n}).*?$repeat/; print $1, "\n"; last; }

        Ob-Update: This assumes the substrings don't overlap. And I've tinkered with the code to make it keep track of what the original substrings looked like. In so doing, I used the | character as a delimiter, which means it shouldn't appear in the original substrings. Thanks danger for the polite nudge. ;)


        Dave


        "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: finding longest common substring (ALL common substrings)
by BrowserUk (Pope) on Nov 20, 2003 at 06:20 UTC

    Broken code Warning

    The code below is broken! Please see Re: Re: Re: finding longest common substring (ALL common substrings) for details, and the update at the bottom for a couple of 'fixed' versions.


    This will never win the "fastest longest common substring" accolade, but it is interesting in that in a list context, it returns a list of all common substring sorted by length (longest first).

    I was also surprised how simple it was to code, and fairly surprised by how efficient it was given what it does.

    sub lcs{ our %subs = (); my $n = @_; shift =~ m[^.*(.+)(?{ $subs{ $^N }++ })(?!)] while @_; my @subs = sort{ length $b <=> length $a } grep{ $subs{ $_ } == $n } keys %subs; return wantarray ? @subs : $subs[ 0 ]; }

    Update: The following two versions work, in as much as they will return the longest common substring if called in a scalar context. They will also return all common substrings (ordered longest to shortest) when called in a list context. As lcs routines, they are both slow, with lcs3() being marginally quicker than lcs2(). I'm not sure how they compare performance-wise with other mechanism for generating all common substrings.

    As implemented, they also do not preserve the value of two (unavoidable?) globals %subs & $n. This could be fixed by judicious use of local if it is of concern.

    sub lcs2{ our %subs = (); my $selector = ''; for our $n ( 0 .. $#_ ) { vec( $selector, $n, 1 ) = 1; $_[ $n ] =~ m[^.*?(.+?)(?{ $subs{ $^N } = '' unless exists $subs{ $^N }; vec( $subs{ $^N }, $n, 1 ) = 1 })(?!)]; } return wantarray ? sort{ length $b <=> length $a } grep{ $subs{ $_ } eq $selector } keys %subs : reduce{ length $a > length $b ? $a : $b } grep{ $subs{ $_ } eq $selector } keys %subs; } sub lcs3{ our %subs = (); my $selector = ' ' x @_; for our $n ( 0 .. $#_ ) { substr( $selector, $n, 1, '1' ); $_[ $n ] =~ m[^.*?(.+?)(?{ $subs{ $^N } = ' ' x @_ unless exists $subs{ $^N }; substr( $subs{ $^N }, $n, 1, '1' ); })(?!)]; } return wantarray ? sort{ length $b <=> length $a } grep{ $subs{ $_ } eq $selector } keys %subs : reduce{ length $a > length $b ? $a : $b } grep{ $subs{ $_ } eq $selector } keys %subs; }

    Whether the above code has any merits I'm not sure, but it's here should anyone find a good use for it.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!
    Wanted!

      This is indeed interesting. With my test data, it's actually a bit faster than my original version (though we've seen how much different data will affect the various algorithms). Pretty impressive, considering what it does. There appears to be a problem, however. It returns undef if you feed it qw(foo bor boz bzo), but works fine with qw(foo boor booz bzoo) and qw(fo bor boz bzo). So if there are any mismatching number of o's, it returns undef. I don't see why offhand; maybe you have some ideas?

        Sorry. The code is flawed. It does produce all the common substrings, but it will often select the wrong "longest".

        The problem occurs because if a substring occurs twice in one of the input strings, and not at all in one of the others, it's count will be the same as if it had appeared once in both, The selection mechanism, the longest key who's count is equal to the number of input strings is bogus, but suffuciently convincing that it worked for all 5 sets of test data I tried it on!

        I'm trying to think of an efficient way of counting how many of the original strings each substring is found in, but the only one I've come up with so far would limit the number of input strings to 32. A couple of other ideas I tried worked, but carry enough overhead to make the method less interesting.

        I'll keep looking at it, but maybe my "surprise at the simplicity and efficiency" was the red flag that should have told me that I was missing something! Still, nothing ventured, nothing gained.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        Hooray!
        Wanted!

      Hmmm...I tried your code and I am confused by at least the outputs for some inputs. Maybe you have an explanation for the following result? (I must confess, I really tried the code out since I have never used ^N before, so I apologize if I'm missing something here).

      sub lcs{ our %subs = (); my $n = @_; shift =~ m[^.*(.+)(?{ $subs{ $^N }++ })(?!)] while @_; my @subs = sort{ length $b <=> length $a } grep{ $subs{ $_ } == $n } keys %subs; return wantarray ? @subs : $subs[ 0 ]; } $" = ", "; my @inputs = ( ["ABCDEF", "BCD", "QXDAF"], #okay result ["AAAA", "AA"] #spurious result(???) ); for (@inputs) { @subs = lcs @$_; print "@$_=>@subs\n"; } __END__ __OUTPUT___ ABCDEF, BCD, QXDAF=>D AAAA, AA=>AAA

      ,welchavw

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://308417]
Approved by Itatsumaki
Front-paged by Itatsumaki
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (5)
As of 2014-10-23 03:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (123 votes), past polls