Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things

longest common substring (with needed tweaks)

by R56 (Sexton)
on Oct 26, 2013 at 16:26 UTC ( #1059823=perlquestion: print w/replies, xml ) Need Help??
R56 has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys!

I have to make a script for the longest common substring. Since it's a 'known' problem, I searched around for a bit and stumbled across the algorithm:

my $str1 = 'stringone'; my $str2 = 'stringtwo'; my $l_length = 0; my $len1 = length $str1; my $len2 = length $str2; my @char1 = (undef, split(//, $str1)); my @char2 = (undef, split(//, $str2)); my @lc_suffix; my @substrings; for my $n1 ( 1 .. $len1 ) { for my $n2 ( 1 .. $len2 ) { if ($char1[$n1] eq $char2[$n2]) { $lc_suffix[$n1-1][$n2-1] ||= 0; $lc_suffix[$n1][$n2] = $lc_suffix[$n1-1][$n2-1] + 1; if ($lc_suffix[$n1][$n2] > $l_length) { $l_length = $lc_suffix[$n1][$n2]; @substrings = (); } if ($lc_suffix[$n1][$n2] == $l_length) { push @substrings, substr($str1, ($n1-$l_length), $l_length); } } } }

Works like a charm for a simple comparison between two predetermined strings, but my implementation requires a few tweaks:

Instead of only two predetermined strings, it has to read and compare from a previously input-fed array.

Use two limits that are already on the input. The first line of the input contains two numbers. First one tells us how many strings we should consider, second one tells us the minimum number of original strings where the substring should appear. The rest of the lines are the strings for comparison.

I'm handling the beginning like so:

open (IN, 'text.txt'); while (<IN>){ if ($_ =~ /^(\d)\s(\d)/){ $size = $1; $minmatch = $2; } else{ push(@array, $_); } }

So how could I make the other changes?

Thanks for helping!

Replies are listed 'Best First'.
Re: longest common substring (with needed tweaks)
by Athanasius (Chancellor) on Oct 27, 2013 at 04:41 UTC

    Hello R56,

    Three points:

    (1) Unless you’re sure that the first line of the data file begins with a single digit, followed by a single whitespace character, followed by a single digit, your regex will not work. You need to add quantifiers: /^(\d+)\s+(\d+)/.

    (2) I do not know what this means:

    The first line of the input contains two numbers. First one tells us how many strings we should consider, second one tells us the minimum number of original strings where the substring should appear.

    However, I suggest you postpone consideration of these constraints until you’ve tackled the more important problem of finding the longest common substring for 3 or more strings.

    (3) Your post implies that the general problem can be solved by “tweaking” the algorithm for finding the longest common substring between 2 strings. However, this may be harder than you think. Consider the following 3 strings:

    adam***betty ^^charlie^^^^^adam^^betty^^^^^ #####adam######charlie##

    By inspection, the longest substring common to this data set is adam. But application of your 2-string algorithm to the first 2 strings gives betty only. You need an algorithm which finds:
        the set S1 of all substrings common to strings 1 and 2,
        the set S2 of all substrings common to strings 1 and 3, and
        the set S3 of all substrings common to strings 2 and 3,
    and then finds the solution as the longest string in the intersection of S1, S2, and S3.

    A glance at Longest_common_substring_problem suggests that the algorithm you need is far from trivial. But the dynamic programming pseudocode given there should get you started in the right direction.

    Hope that helps (a little),

    Update: Improved the paragraph beginning “By inspection”.

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Hey Athanasius!

      1) Yes, it's exactly digit-space-digit, so my regex will work.

      2) I can see that's a bit hard to understand. Let me try again. First number tells us how many strings will be considered for comparison. Second number tells us the minimum number of strings where the substring has to appear. For instance:

      3 3 strrringggg ssttrrringggg stttrrringgg

      output will be 'rrr'

      3 2 strrringggg ssttrrringggg stttrrringgg

      output will be 'gggg'

      It is a weird filter, I know, but it needs to be that way.

      3) Yeah, I already read on the subject but can't seem to figure out how to make the changes. My initial idea was that some clever well-places loops would get me what I wanted, but know I'm starting to see it's not that simple.

      Thanks for the help though!

Re: longest common substring (with needed tweaks) (trie)
by tye (Sage) on Oct 27, 2013 at 20:14 UTC

    I'd build a trie but have it track how many strings so far have contained each substring. Then I'd track the length of the longest substring so far found in enough strings. When a substring's find count reaches the threshold, if it is the same length as the longest, then I'd add it to the "found" list. If it is longer, then I'd update the max length and replace the "found" list with just that one substring.

    Since a single substring could appear multiple times in each string, you also need to track the substrings that you have, so far, entered into the trie for the current string (which I'd just do in a hash).

    That method is efficient enough that I ran it against all 172,819 words in my copy of the "enable1" word list, and just had it simultaneously compute the answers for all counts. Producing, in part:

    Note that I was lazy and had it just track (eventually) all 120,964 "answers" and whittled it down to the 91 different answers as the last step. It'd be significantly even more efficient (in both space and time) if I had only tracked the unique answers as it went along. But that probably would have taken more of my time in programming than just waiting a few minutes for the above results took. The script as-is took very little time to write.

    No, I didn't include the code. This looks like a homework problem so I won't be posting the code for at least a couple of weeks.

    - tye        

      Hey tye!

      I read some things on suffix and prefix trees, but didn't get very far. Still, your solution looks really good when dealing with a large number of strings. Fortunately I won't have to deal with that many :)

      It's not homework, it's to help me on my job, but thanks for the guidance!

Re: longest common substring (with needed tweaks)
by Limbic~Region (Chancellor) on Oct 28, 2013 at 12:21 UTC
    Re: longest common substring

    It references a thread I wrote back in 2006 (Longest Common Subsequence ). Substrings and subsequences aren't the same thing though I did solve for both. I just found it odd that there wasn't an off-the-shelf solution that handled more than 2 strings. There may be now - I haven't looked in years.

    Cheers - L~R

Re: longest common substring (with needed tweaks)
by Anonymous Monk on Oct 27, 2013 at 01:04 UTC

    So how could I make the other changes?

    First describe in words what the changes should do? :)

      what's in the 'tweaks' section! :)
Re: longest common substring (with needed tweaks)
by Lennotoecom (Pilgrim) on Oct 27, 2013 at 17:27 UTC
    $_ = <DATA>; $_ = $` if /$/; @a = split //, $_; for $i (0 .. $#a){ $e = $a[$i]; $hash{$e} = 1; for $y ($i+1 .. $#a){ $e .= $a[$y]; $hash{$e} = 1; } } while(<DATA>){ $_ = $` if /$/; @a = split //, $_; %thash = (); for $i (0 .. $#a){ $e = $a[$i]; $thash{$e} = 1 if defined $hash{$e}; for $y ($i+1 .. $#a){ $e .= $a[$y]; $thash{$e} = 1 if defined $hash{$e}; } } foreach $key (keys %hash){ $hash{$key}++ if defined $thash{$key}; } } $max = ''; foreach $key (keys %hash){ if($hash{$key} == 3){ $max = $key if length($max) < length($key); } } print "$max\n"; __DATA__ strrringggg ssttrrringggg stttrrringgg
    output will be
    which is the longest common substring in this case so about your options, change 3 to 2 in the last cycle
    and you'll get the common substring for at least to lines.
    I know this is ugly. No time)
    the algorithm of the monstrosity above:
    1. taking first line.
    2. create all possible combinations of substrings out of it and put it into hash.
    3. in the cycle take each line, and create all possible substrings out of it creating temporal hash.
    4. compare temp. hash with the first, increment only those which are in both.
    5. at the last cycle the number in the if construction, sorts how many lines have to have desirable longest substring.

      Great piece of code Lennotoecom :)

      Struggling a bit to understand it tho, as it's greatly simplified!

      Can you explain me this first line in detail? Never seen the $` before...

      $_ = <DATA>; $_ = $` if /$/; @a = split //, $_;

      Thank you!

        for example:
        $a = 'aa ab c c'; $a=~m/b/; now $` contains 'aa a' $& contains 'b' $' contains ' c c'
        in other words all symbols of a line before the found result
        found result,
        and all the symbols after found results
Re: longest common substring (with needed tweaks)
by oiskuu (Hermit) on Oct 28, 2013 at 01:56 UTC
    I'm not entirely sure if I understood the first parameter correctly.
    It's just a (redundant) check? No matter.
    chomp (my @dta = <DATA>); my ($ntot, $nmin) = split /\s/, shift @dta; die if $nmin < 2 or $nmin > $ntot or $ntot != @dta; sub cpfx { ($_[0]^$_[1]) =~ m/^\0*/s; substr($_[0], 0, length $&) } sub sufs { map {[substr($_[0], -$_) => $_[1]]} 1..length($_[0]) } my @SA = sort {$a->[0] cmp $b->[0]} map {sufs($dta[$_], $_)} 0..$#dta; my @N = (0) x @dta; my ($best, $n, $h, $t) = ("", -$nmin, 0, 0); while ($h < @SA) { ($n += !$N[$SA[$h++]->[1]]++) >= 0 or next; ($n -= !--$N[$SA[$t++]->[1]]) while $n > 0 || $N[$SA[$t]->[1]] > 1; my $pfx = cpfx(map $SA[$_]->[0], $t, $h-1); $best = $pfx if length($best) < length($pfx); } print "\"$best\"\n"; __DATA__ 6 3 You can use the substr() function as an lvalue, in which case EXPR mus +t itself be an lvalue. If you assign something shorter than LENGTH, the string will shrink, and if you assign something longer than LENGTH +, the string will grow to accommodate it. To keep the string the same length, you may need to pad or chop your value using "sprintf". If OFFSET and LENGTH specify a substring that is partly outside the st +ring, only the part within the string is returned. If the substring is beyond either end of the string, substr() returns the undefined val +ue and produces a warning. When used as an lvalue, specifying a substring that is entirely outside the string raises an exception. He +re's an example showing the behavior for boundary cases:

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1059823]
Approved by davido
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2018-03-19 17:13 GMT
Find Nodes?
    Voting Booth?
    When I think of a mole I think of:

    Results (245 votes). Check out past polls.