Re^2: Finding repeat sequences.

in reply to Re: Finding repeat sequences.
in thread Finding repeat sequences.

It might even be easier than this ... (pseudocode this time)

for $t from 1 to int(length(string) / 2):
  next unless the first $t chars equal the last $t;
  # i.e. "$t" must be a plausible tail-size.

  $r = length - $t;
  
  $incr = int($r / 2);
  while( $incr > 0) {
     next unless ($incr % $r == 0);
     # i.e. must be a possible repeated-block size in this space
     solution is found if string(0, $incr) eq string($incr, $incr);
     # i.e. if we find one repetition then it must be the answer.
     $incr--;
  }
}
[download]

As before, start by looking for a tail. If you think you found one, look for repetitions in possibly-sized increments of the remaining string. This approach will consider all successively-larger candidates for the "tail," up to and including the largest tail, which is "half of it." For each tail-size, it looks for one repetition of all possible sizes which would fit evenly within this area, knowing that the left-hand portion is defined to consist only of repetitions. Continue for all tail-sizes, and for each of these, all rep-sizes, until the first success is found. It works only for data which does, in fact, match this assertion.

Replies are listed 'Best First'.
Re^3: Finding repeat sequences. by Eily (Monsignor) on Jun 19, 2013 at 12:40 UTC
Oh! I didn't think of that one. But I'm not sure why you have to do all those extra verifications. Since the input data is a repeated pattern, the length of that pattern is how little you can shift your string to the left and have a perfect match. `$s = 'abcdabcdabceabcdabcdabceab'; for (1..length($s)-1) { print substr($s,0,$_) and last if substr($s,$_) eq substr($s,0 +,length($s)-$_); }` [download] abcdabcdabce	[reply] [d/l]
Re^3: Finding repeat sequences. by sundialsvc4 (Abbot) on Jun 19, 2013 at 13:05 UTC
I think that the pseudo-coded solution is the strongest one, if it is made to stop after finding the longest possible substring (then `last`), and if it is allowed to find all possible (reasonably-long) tails, once again starting with the longest one. (These would be what a human being would probably consider to be the best “right answers.”) With very-short tails, as written it might produce wrong answers if it considers only the first two occurrences (consider string `'aaba'` if the tail were merely `'a'` .. incorrect). But the essential idea, I think, is still valid. One reason why I wrote it this way was in an effort to avoid “churning” memory when dealing with exceptionally-long strings. You don’t need to consider any string that isn’t a tail, nor, within the head, any repeated-string candidate that won’t fit. That subdivides the problem into two searches, both of which have only a few possibilities each. It might not yet be bug-free, but it ought to be close . . .

In Section Seekers of Perl Wisdom