Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re: finding longest common substring (ALL common substrings)

by BrowserUk (Pope)
on Nov 20, 2003 at 06:20 UTC ( #308521=note: print w/replies, xml ) Need Help??


in reply to finding longest common substring

Broken code Warning

The code below is broken! Please see Re: Re: Re: finding longest common substring (ALL common substrings) for details, and the update at the bottom for a couple of 'fixed' versions.


This will never win the "fastest longest common substring" accolade, but it is interesting in that in a list context, it returns a list of all common substring sorted by length (longest first).

I was also surprised how simple it was to code, and fairly surprised by how efficient it was given what it does.

sub lcs{ our %subs = (); my $n = @_; shift =~ m[^.*(.+)(?{ $subs{ $^N }++ })(?!)] while @_; my @subs = sort{ length $b <=> length $a } grep{ $subs{ $_ } == $n } keys %subs; return wantarray ? @subs : $subs[ 0 ]; }

Update: The following two versions work, in as much as they will return the longest common substring if called in a scalar context. They will also return all common substrings (ordered longest to shortest) when called in a list context. As lcs routines, they are both slow, with lcs3() being marginally quicker than lcs2(). I'm not sure how they compare performance-wise with other mechanism for generating all common substrings.

As implemented, they also do not preserve the value of two (unavoidable?) globals %subs & $n. This could be fixed by judicious use of local if it is of concern.

sub lcs2{ our %subs = (); my $selector = ''; for our $n ( 0 .. $#_ ) { vec( $selector, $n, 1 ) = 1; $_[ $n ] =~ m[^.*?(.+?)(?{ $subs{ $^N } = '' unless exists $subs{ $^N }; vec( $subs{ $^N }, $n, 1 ) = 1 })(?!)]; } return wantarray ? sort{ length $b <=> length $a } grep{ $subs{ $_ } eq $selector } keys %subs : reduce{ length $a > length $b ? $a : $b } grep{ $subs{ $_ } eq $selector } keys %subs; } sub lcs3{ our %subs = (); my $selector = ' ' x @_; for our $n ( 0 .. $#_ ) { substr( $selector, $n, 1, '1' ); $_[ $n ] =~ m[^.*?(.+?)(?{ $subs{ $^N } = ' ' x @_ unless exists $subs{ $^N }; substr( $subs{ $^N }, $n, 1, '1' ); })(?!)]; } return wantarray ? sort{ length $b <=> length $a } grep{ $subs{ $_ } eq $selector } keys %subs : reduce{ length $a > length $b ? $a : $b } grep{ $subs{ $_ } eq $selector } keys %subs; }

Whether the above code has any merits I'm not sure, but it's here should anyone find a good use for it.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!
Wanted!

Replies are listed 'Best First'.
Re: Re: finding longest common substring (ALL common substrings)
by revdiablo (Prior) on Nov 20, 2003 at 17:55 UTC

    This is indeed interesting. With my test data, it's actually a bit faster than my original version (though we've seen how much different data will affect the various algorithms). Pretty impressive, considering what it does. There appears to be a problem, however. It returns undef if you feed it qw(foo bor boz bzo), but works fine with qw(foo boor booz bzoo) and qw(fo bor boz bzo). So if there are any mismatching number of o's, it returns undef. I don't see why offhand; maybe you have some ideas?

      Sorry. The code is flawed. It does produce all the common substrings, but it will often select the wrong "longest".

      The problem occurs because if a substring occurs twice in one of the input strings, and not at all in one of the others, it's count will be the same as if it had appeared once in both, The selection mechanism, the longest key who's count is equal to the number of input strings is bogus, but suffuciently convincing that it worked for all 5 sets of test data I tried it on!

      I'm trying to think of an efficient way of counting how many of the original strings each substring is found in, but the only one I've come up with so far would limit the number of input strings to 32. A couple of other ideas I tried worked, but carry enough overhead to make the method less interesting.

      I'll keep looking at it, but maybe my "surprise at the simplicity and efficiency" was the red flag that should have told me that I was missing something! Still, nothing ventured, nothing gained.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      Hooray!
      Wanted!

Re: Re: finding longest common substring (ALL common substrings)
by welchavw (Pilgrim) on Nov 20, 2003 at 20:43 UTC

    Hmmm...I tried your code and I am confused by at least the outputs for some inputs. Maybe you have an explanation for the following result? (I must confess, I really tried the code out since I have never used ^N before, so I apologize if I'm missing something here).

    sub lcs{ our %subs = (); my $n = @_; shift =~ m[^.*(.+)(?{ $subs{ $^N }++ })(?!)] while @_; my @subs = sort{ length $b <=> length $a } grep{ $subs{ $_ } == $n } keys %subs; return wantarray ? @subs : $subs[ 0 ]; } $" = ", "; my @inputs = ( ["ABCDEF", "BCD", "QXDAF"], #okay result ["AAAA", "AA"] #spurious result(???) ); for (@inputs) { @subs = lcs @$_; print "@$_=>@subs\n"; } __END__ __OUTPUT___ ABCDEF, BCD, QXDAF=>D AAAA, AA=>AAA

    ,welchavw

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://308521]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (2)
As of 2020-11-30 05:07 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?