in reply to Finding repeat sequences.
Consider the following:
use strict; use warnings; # 1 2 3 4 5 + 6 # 12345678901234567890123456789012345678901234567890123456 +7890 my $search = 'abcdefabcababeabcdefabcababeabcdefabcababeabcdefabcababe +abcd'; print "searching '$search'\n length=" . length($search) . "\n"; # FIRST, FIND THE LENGTH OF THE TAIL: THIS IS THE SHORTEST SUBSTRING # WHICH MATCHES THE START OF THE STRING. # TRY TO EXPAND THIS NUMBER AS RAPIDLY AS POSSIBLE. my $tail_length = 1; my $tail_step = int(length($search) / 2); while ($tail_step > 0) { $tail_length += $tail_step while ( substr($search, 0, $tail_length + $tail_step) eq substr($search, -($tail_length+$tail_step), ($tail_length+$tail_step))); $tail_step = int($tail_step / 2); } print "tail is '" . substr($search, 0, $tail_length) . "' length=$tail_length\n"; # === my $body_length = length($search) - $tail_length; print "body is '$body_length' characters long ... \n"; # THE BODY MUST CONTAIN AN INTEGRAL NUMBER OF OCCURRENCES # OF THE STRING. # THEREFORE, FIND THE LONGEST ONE THAT OCCURS TWICE. my $longest = $body_length; my $n = $body_length - 1; while ($n > 1) { if ( ($body_length % $n) == 0) { print "try $n\n"; if ( substr($search, 0, $n) eq substr($search, $n, $n) ) { $longest = $n; last; } } $n--; } print "longest string '" . substr($search, 1, $longest) . "' length=$longest\n";
The “key” to this puzzle, as I read it, is the tail. This is the shortest string at the end of the string which matches the beginning of the string. The example code tries aggressively to find that number by adding power-of-two-smaller successive increments as many times as it can. (Note: might this introduce a flaw, vs. a simple 2-up count?)
The second part of the algorithm now tries to find the longest string which occurs an integral number of times in the leading (repeated ...) portion. We know that the length of this string must be a mathematical factor, i.e. (length mod factor == 0). Only the first and second occurrence must be considered. If none can be found, the string consists of one non-repeated occurrence.
|
---|
Replies are listed 'Best First'. | |
---|---|
Re^2: Finding repeat sequences.
by sundialsvc4 (Abbot) on Jun 19, 2013 at 04:22 UTC | |
by Eily (Monsignor) on Jun 19, 2013 at 12:40 UTC | |
by sundialsvc4 (Abbot) on Jun 19, 2013 at 13:05 UTC |