http://www.perlmonks.org?node_id=1039719


in reply to Re: Finding repeat sequences.
in thread Finding repeat sequences.

It might even be easier than this ... (pseudocode this time)

for $t from 1 to int(length(string) / 2): next unless the first $t chars equal the last $t; # i.e. "$t" must be a plausible tail-size. $r = length - $t; $incr = int($r / 2); while( $incr > 0) { next unless ($incr % $r == 0); # i.e. must be a possible repeated-block size in this space solution is found if string(0, $incr) eq string($incr, $incr); # i.e. if we find one repetition then it must be the answer. $incr--; } }
As before, start by looking for a tail. If you think you found one, look for repetitions in possibly-sized increments of the remaining string. This approach will consider all successively-larger candidates for the "tail," up to and including the largest tail, which is "half of it." For each tail-size, it looks for one repetition of all possible sizes which would fit evenly within this area, knowing that the left-hand portion is defined to consist only of repetitions. Continue for all tail-sizes, and for each of these, all rep-sizes, until the first success is found. It works only for data which does, in fact, match this assertion.

Replies are listed 'Best First'.
Re^3: Finding repeat sequences.
by Eily (Monsignor) on Jun 19, 2013 at 12:40 UTC

    Oh! I didn't think of that one. But I'm not sure why you have to do all those extra verifications. Since the input data is a repeated pattern, the length of that pattern is how little you can shift your string to the left and have a perfect match.

    $s = 'abcdabcdabceabcdabcdabceab'; for (1..length($s)-1) { print substr($s,0,$_) and last if substr($s,$_) eq substr($s,0 +,length($s)-$_); }
    abcdabcdabce
Re^3: Finding repeat sequences.
by sundialsvc4 (Abbot) on Jun 19, 2013 at 13:05 UTC

    I think that the pseudo-coded solution is the strongest one, if it is made to stop after finding the longest possible substring (then last), and if it is allowed to find all possible (reasonably-long) tails, once again starting with the longest one.   (These would be what a human being would probably consider to be the best “right answers.”)

    With very-short tails, as written it might produce wrong answers if it considers only the first two occurrences (consider string 'aaba' if the tail were merely 'a' .. incorrect).   But the essential idea, I think, is still valid.

    One reason why I wrote it this way was in an effort to avoid “churning” memory when dealing with exceptionally-long strings.   You don’t need to consider any string that isn’t a tail, nor, within the head, any repeated-string candidate that won’t fit.   That subdivides the problem into two searches, both of which have only a few possibilities each.   It might not yet be bug-free, but it ought to be close . . .