http://www.perlmonks.org?node_id=1039715


in reply to Finding repeat sequences.

Consider the following:

use strict; use warnings; # 1 2 3 4 5 + 6 # 12345678901234567890123456789012345678901234567890123456 +7890 my $search = 'abcdefabcababeabcdefabcababeabcdefabcababeabcdefabcababe +abcd'; print "searching '$search'\n length=" . length($search) . "\n"; # FIRST, FIND THE LENGTH OF THE TAIL: THIS IS THE SHORTEST SUBSTRING # WHICH MATCHES THE START OF THE STRING. # TRY TO EXPAND THIS NUMBER AS RAPIDLY AS POSSIBLE. my $tail_length = 1; my $tail_step = int(length($search) / 2); while ($tail_step > 0) { $tail_length += $tail_step while ( substr($search, 0, $tail_length + $tail_step) eq substr($search, -($tail_length+$tail_step), ($tail_length+$tail_step))); $tail_step = int($tail_step / 2); } print "tail is '" . substr($search, 0, $tail_length) . "' length=$tail_length\n"; # === my $body_length = length($search) - $tail_length; print "body is '$body_length' characters long ... \n"; # THE BODY MUST CONTAIN AN INTEGRAL NUMBER OF OCCURRENCES # OF THE STRING. # THEREFORE, FIND THE LONGEST ONE THAT OCCURS TWICE. my $longest = $body_length; my $n = $body_length - 1; while ($n > 1) { if ( ($body_length % $n) == 0) { print "try $n\n"; if ( substr($search, 0, $n) eq substr($search, $n, $n) ) { $longest = $n; last; } } $n--; } print "longest string '" . substr($search, 1, $longest) . "' length=$longest\n";

The “key” to this puzzle, as I read it, is the tail.   This is the shortest string at the end of the string which matches the beginning of the string.   The example code tries aggressively to find that number by adding power-of-two-smaller successive increments as many times as it can.   (Note: might this introduce a flaw, vs. a simple 2-up count?)

The second part of the algorithm now tries to find the longest string which occurs an integral number of times in the leading (repeated ...) portion.   We know that the length of this string must be a mathematical factor, i.e. (length mod factor == 0).   Only the first and second occurrence must be considered.   If none can be found, the string consists of one non-repeated occurrence.

Replies are listed 'Best First'.
Re^2: Finding repeat sequences.
by sundialsvc4 (Abbot) on Jun 19, 2013 at 04:22 UTC

    It might even be easier than this ... (pseudocode this time)

    for $t from 1 to int(length(string) / 2): next unless the first $t chars equal the last $t; # i.e. "$t" must be a plausible tail-size. $r = length - $t; $incr = int($r / 2); while( $incr > 0) { next unless ($incr % $r == 0); # i.e. must be a possible repeated-block size in this space solution is found if string(0, $incr) eq string($incr, $incr); # i.e. if we find one repetition then it must be the answer. $incr--; } }
    As before, start by looking for a tail. If you think you found one, look for repetitions in possibly-sized increments of the remaining string. This approach will consider all successively-larger candidates for the "tail," up to and including the largest tail, which is "half of it." For each tail-size, it looks for one repetition of all possible sizes which would fit evenly within this area, knowing that the left-hand portion is defined to consist only of repetitions. Continue for all tail-sizes, and for each of these, all rep-sizes, until the first success is found. It works only for data which does, in fact, match this assertion.

      Oh! I didn't think of that one. But I'm not sure why you have to do all those extra verifications. Since the input data is a repeated pattern, the length of that pattern is how little you can shift your string to the left and have a perfect match.

      $s = 'abcdabcdabceabcdabcdabceab'; for (1..length($s)-1) { print substr($s,0,$_) and last if substr($s,$_) eq substr($s,0 +,length($s)-$_); }
      abcdabcdabce

      I think that the pseudo-coded solution is the strongest one, if it is made to stop after finding the longest possible substring (then last), and if it is allowed to find all possible (reasonably-long) tails, once again starting with the longest one.   (These would be what a human being would probably consider to be the best “right answers.”)

      With very-short tails, as written it might produce wrong answers if it considers only the first two occurrences (consider string 'aaba' if the tail were merely 'a' .. incorrect).   But the essential idea, I think, is still valid.

      One reason why I wrote it this way was in an effort to avoid “churning” memory when dealing with exceptionally-long strings.   You don’t need to consider any string that isn’t a tail, nor, within the head, any repeated-string candidate that won’t fit.   That subdivides the problem into two searches, both of which have only a few possibilities each.   It might not yet be bug-free, but it ought to be close . . .