Finding repeat sequences.

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Finding repeat sequences. by DamianConway (Beadle) on Jun 18, 2013 at 21:43 UTC
Here's a possible solution for what—I think—you're trying to achieve: #! /usr/bin/env perl use 5.014; use warnings; # We're going to try matching substrings of this... my $MASTER_STR = 'abcdabcdabceabcdabcdabceab'; # Using this pattern to find (possibly truncated) repetitions... my $REPETITION_PAT = qr{ \A # Start at the beginning of the string (.+?) # Match minimal initial sequence (as $1) (?{$^N}) # Remember it internally (as $^R) (?\| # Then either it's not truncated... \1+ # Rematch it exactly as often as possible \z # ...until the end of the string () # Remember that nothing was left over (as $2) \| # Or else it is truncated... \1* # Rematch it exactly as often as possible (or not) (.+) # Then grab what's left (as $2)(internally as $^N) \z # ...until the end of the string # Then verify that what's left is a proper truncation... (??{ # If what's left ($^N) is substring of repetition ($^R)... $^N eq substr($^R,0,length($^N)) ? q{} # ...then do nothing (i.e. keep matching) : q{(?!)} # ...else initiate backtracking }) ) }xms; # Loop through increasingly truncated substrings... for my $substr (1..length($MASTER_STR)) { # Compute the substring... my $str = substr($MASTER_STR,0,-$substr); # Evaluate the match... if ($str =~ $REPETITION_PAT) { say "$str\n Matched: $1*$2"; } else { say "$str\n Didn't match"; } } [download] Damian	[reply] [d/l]
Re^2: Finding repeat sequences. by DamianConway (Beadle) on Jun 19, 2013 at 22:05 UTC
Here's a significantly optimized version of my previous solution, which draws heavily on anomalous's excellent work, but only bothers to capture the initial repeated component (in `$1`) as that's apparently all that's wanted. `my $REPETITION_PAT = qr{ \A # Start at the beginning of the string (.+?) # Match minimal initial sequence (as $1) \1*+ # Rematch it exactly as often as possible (maybe zero) # Then verify what's left is a proper truncation... (?(?{ index($^N, substr($_,pos())) }) (?!) ) }xms;` [download] It still passes BrowserUK's gauntlet, however in my own testing it's approximately 400 times faster than my previous attempt. I suspect that's still not sufficient to make it competitive with the non-regex solutions. (I know which one I'd rather maintain, though ;-) As you can see, the key to the improvement in regex performance was—as usual—to eliminate opportunities for backtracking or capturing. Damian	[reply] [d/l] [select]
Re^3: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 20, 2013 at 01:08 UTC
That is a very substantial improvement and it now beats tobyink's non-regex solution, but cannot touch hdb's: Partisipants in performance tests: hdb tobyink damianc b: 4 in s: 6 hdb :: 0.000089 s b: 4 in s: 6 tobyink :: 0.000336 s b: 4 in s: 6 damianc :: 0.000081 s b: 4 in s: 43 hdb :: 0.000070 s b: 4 in s: 43 tobyink :: 0.000070 s b: 4 in s: 43 damianc :: 0.000266 s b: 4 in s: 402 hdb :: 0.000083 s b: 4 in s: 402 tobyink :: 0.000092 s b: 4 in s: 402 damianc :: 0.000210 s b: 4 in s: 4000 hdb :: 0.000391 s b: 4 in s: 4000 tobyink :: 0.000158 s b: 4 in s: 4000 damianc :: 0.000719 s b: 26 in s: 32 hdb :: 0.000075 s b: 26 in s: 32 tobyink :: 0.000097 s b: 26 in s: 32 damianc :: 0.000109 s b: 26 in s: 260 hdb :: 0.000076 s b: 26 in s: 260 tobyink :: 0.000153 s b: 26 in s: 260 damianc :: 0.000124 s b: 26 in s: 2600 hdb :: 0.000111 s b: 26 in s: 2600 tobyink :: 0.000152 s b: 26 in s: 2600 damianc :: 0.000182 s b: 26 in s: 26016 hdb :: 0.000125 s b: 26 in s: 26016 tobyink :: 0.000947 s b: 26 in s: 26016 damianc :: 0.000620 s b:64000 in s: 100595 hdb :: 0.006378 s b:64000 in s: 100595 tobyink :: 6.614787 s b:64000 in s: 100595 damianc :: 6.247666 s b:64000 in s: 657902 hdb :: 0.017817 s b:64000 in s: 657902 tobyink :: 48.717781 s b:64000 in s: 657902 damianc :: 12.516777 s b:64000 in s: 6444916 hdb :: 0.235547 s b:64000 in s: 6444916 tobyink :: 582.471826 s b:64000 in s: 6444916 damianc :: 179.466052 s b:64000 in s: 64031416 hdb :: 2.763495 s b:64000 in s: 64031416 tobyink :: 6158.077379 s b:64000 in s: 64031416 damianc :: 2104.112186 s { damianc => { 4 => { 1 => "8.10623168945313e-005", 10 => "0.000266075134277344", + 100 => "0.000210046768188477", 1000 => "0.000719070434570313", all = +> "0.00127625465393066", }, 26 => { 1 => "0.000108957290649414", 10 => "0.000123977661132813", + 100 => "0.000181913375854492", 1000 => "0.000619888305664063", all = +> "0.00103473663330078", }, 64000 => { 1 => "6.2476658821106", 10 => "12.5167770385742", 100 = +> "179.466052055359", 1000 => "2104.11218619347", all => "2302.342681 +16951", }, all => "2302.3449921608", }, hdb => { 4 => { 1 => "8.89301300048828e-005", 10 => "7.00950622558594e-005" +, 100 => "8.29696655273438e-005", 1000 => "0.000391006469726563", all + => "0.000633001327514648", }, 26 => { 1 => "7.48634338378906e-005", 10 => "7.60555267333984e-005 +", 100 => "0.000110864639282227", 1000 => "0.000124931335449219", all + => "0.000386714935302734", }, 64000 => { 1 => "0.006378173828125", 10 => "0.0178170204162598", 1 +00 => "0.235547065734863", 1000 => "2.76349496841431", all => "3.0232 +3722839355", }, all => "3.02425694465637", }, tobyink => { 4 => { 1 => "0.000335931777954102", 10 => "7.00950622558594e-005", + 100 => "9.17911529541016e-005", 1000 => "0.000158071517944336", all +=> "0.000655889511108398", }, 26 => { 1 => "9.70363616943359e-005", 10 => "0.000153064727783203" +, 100 => "0.000151872634887695", 1000 => "0.000946998596191406", all +=> "0.00134897232055664", }, 64000 => { 1 => "6.61478710174561", 10 => "48.7177810668945", 100 +=> "582.471826076508", 1000 => "6158.07737898827", all => "6795.88177 +323341", }, all => "6795.88377809525", }, } [download] I'm not sure why you think it would be any easier to maintain that the non-regex solutions? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^4: Finding repeat sequences. by DamianConway (Beadle) on Jun 20, 2013 at 08:02 UTC
Re^5: Finding repeat sequences. by hdb (Monsignor) on Jun 20, 2013 at 09:00 UTC
Some notes below your chosen depth have not been shown here
Re^5: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 20, 2013 at 15:11 UTC
Some notes below your chosen depth have not been shown here
Re^2: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 18, 2013 at 22:10 UTC
That works perfectly and does exactly what I want. Thank you. I suspect it may run rather slowly as is -- /xms etc. -- but given the included test suite, tuning should be simple :) (BTW: Nice to see you: to see you ... ) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re: Finding repeat sequences. by tobyink (Canon) on Jun 18, 2013 at 22:23 UTC
Not using regexps at all, but... #!/usr/bin/env perl use 5.012; use Test::More; sub find_substring { my $input = shift; my $length = length $input; for my $i (1 .. $length) { my $possible = substr($input, 0, $i); my $repeated = $possible x (1 + int($length / $i)); return $possible if $input eq substr($repeated, 0, $length); } return ""; } my %eg = ( abcdabcdabcdabcdab => "abcd", abcdabcdabceabcdabcdabceab => "abcdabcdabce", aaaabaaaabaaaaabaaaab => "aaaabaaaaba", ); for my $input (sort keys %eg) { my $expected = $eg{$input}; my $got = find_substring($input); is($got, $expected, "result is '$expected' given input '$input'"); } done_testing; [download] Note that when there are multiple possible matches, this returns the shortest, because it doesn't make sense to return the longest - the longest is uninteresting. For example, given the input `abcabca`, it could be that the answer is `abc` repeated two and a bit times, or `abcabc` repeated one and a bit times, or `abcabca` repeated exactly once. (Well, not really "repeated" but you know what I mean. The entire input string itself is always a valid and uninteresting answer.) Or, depending on how the problem is defined, the correct answer might be `abcabcaxx` repeated less than one time - i.e. the first repetition was truncated! So the only interesting answer to return is the shortest possible one. `package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name`	[reply] [d/l] [select]
Re^2: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 18, 2013 at 23:01 UTC
That's interesting...(not faint praise.). (I'm already thinking of optimisations; like build the longest repeat sequence and then use substr to get shorter versions.) For example, given the input abcabca, it could be ... So the only interesting answer to return is the shortest possible one. Hm. Damn you for making me think (again) at this time of night :) There will always be at least one complete substring. If there is more than 1 but less than 2, ie. 1 rep + 1 partial; (I believe) it will always be possible to determine the longest < length string match; because the residual always matches the length( residual ) first characters of the string. So, if the string is 'abcabca'; the rep could be 'abcabc' or 'abc'. But if the rep consists entirely of an exact integer number of reps of a subsubstring, then the substring is that subsubstring and the string consists of rep*n(n>1) + a partial. Thus, I believe that there is only ever one results. It will be interesting to pitch your solution against DamianConway's regex and see how they compare. I simply have no feel for it; but it's a job for tomorrow. Thank you. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^3: Finding repeat sequences. by hdb (Monsignor) on Jun 19, 2013 at 07:46 UTC
UPDATE: WARNING: the following code does not work in all circumstances. Sorry! Here is a variant of tobyink's solution that uses `index` to look ahead when the current candidate string repeats and then enlengthens (is that an English word?) it accordingly. `sub find_substring { my $input = shift; my $length = length $input; my $i = 0; my $possible; while( 1 ) { $possible = substr $input, 0, $i+1; # increase length by 1 $i = index $input, $possible, $i+1; # find next occurence of c +andidate return $input if $i < 0; # if not found return full + string => no repetition $possible = substr $input, 0, $i; # this is the minimum leng +th candidate return $possible if $input eq substr($possible x (1 + int($len +gth / $i)), 0, $length); # success } }` [download] UPDATE: Eily's solution below Re^3: Finding repeat sequences. can be used to avoid the construction of the repeated string (as it is the same just with offset). Therefore, this works even better: `sub find_substring { my $input = shift; my $length = length $input; my $i = 0; while( 1 ) { $i = index( $input, substr( $input, 0, $i+1 ), $i+1); return $input if $i < 0; return substr( $input, 0, $i) if substr( $input, $i ) eq substr($i +nput, 0, $length - $i); } }` [download]	[reply] [d/l] [select]
Re: Finding repeat sequences. by choroba (Cardinal) on Jun 18, 2013 at 19:10 UTC
`"abcdabcdabceabcdabcdabceab" =~ /(.+).*\1/` [download] returns `abcdabcdabce` [download] Can you give more input samples to exemplify the other constraints? لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l] [select]
Re^2: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 18, 2013 at 19:59 UTC
They are kind of hard to come up with, but okay. Given the rep 'aaaabaaaaba' and a string containing one whole and one partial rep 'aaaabaaaabaaaaabaaaab' `$s = 'aaaabaaaabaaaaabaaaab';; $s =~ m[(.+).\1] and print $1;; aaaabaaaab` [download] Which isn't correct because: `aaaabaaaabaaaaabaaaab aaaabaaaab aaaabaaaab 1 ^2` [download] You can fix that by removing the redundant . per LanX's version: `m[(.+)\1]` but then you get: `$s = 'aaaabaaaabaaaaabaaaab';; $s =~ m[(.+)\1] and print $1;; aaaab` [download] Which isn't right: `aaaabaaaabaaaaabaaaab aaaabaaaab aaaabaaaab 1 2 ^3 4` [download] I realise that this is a 'cheat' as there in no complete repetition to find, but it is one possible scenario. Given the string will always consist of 1 or more repetitions of the substring, whatever partial substring (if any) is at the end of the string should match the same number of characters at the start of the string. That's the bit I'm having trouble wrapping my head around. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re: Finding repeat sequences. by LanX (Saint) on Jun 18, 2013 at 19:09 UTC
> Eg. In `abcdabcdabceabcdabcdabceab`; `abcd` is a false rep; the required rep is `abcdabcdabce`; since regex are greedy, this works for me `DB<131> $_='abcdabcdabceabcdabcdabceab' => "abcdabcdabceabcdabcdabceab" DB<132> ($match)= /(.+)\1/ => "abcdabcdabce"` [download] edit: I doubt that the answer is so simple, but this should give you a start to ask more precisely... Cheers Rolf ( addicted to the Perl Programming Language)	[reply] [d/l] [select]
Re^2: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 18, 2013 at 20:01 UTC
Please see Re^2: Finding repeat sequences.. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re: Finding repeat sequences. (only mostly regex) by tye (Sage) on Jun 18, 2013 at 19:40 UTC
I assume that the pattern must repeat at least twice, otherwise, the full string is always the longest answer. A simple regex can get a good guess and tell you when that guess has failed in such a way that each subsequent guess will be more than twice as long as the previous guess so the regex doesn't have to be run very many times: sub repeating { my( $string ) = @_; my( $pattern, $repeat, $end ) = $string =~ /^(.+?)(\1+)(.)$/; while( defined $pattern ) { return $pattern if length($end) <= length($pattern) && $end eq substr($pattern,0,length($end)); print "($pattern) wasn't long enough.\n"; ( $pattern, $repeat, $end ) = $string =~ /^(\Q$pattern$repeat\E.+?)(\1+)(.)$/ } return undef; } my $pattern = repeating( "aabaabaabcaabaabaabca" ); printf "(%s) wins\n", $pattern if $pattern; __END__ (a) wasn't long enough. (aab) wasn't long enough. (aabaabaabc) wins [download] You likely can optimize this by copying less stuff, of course. (Update: Well, I didn't get very rigorous in proving to myself that $pattern.$repeat is always too short. But I believe that to be the case. One should validate or refute that assumption before deciding whether to use this.) - tye	[reply] [d/l]
Re^2: Finding repeat sequences. (only mostly regex) by BrowserUk (Patriarch) on Jun 18, 2013 at 20:04 UTC
I assume that the pattern must repeat at least twice, otherwise, the full string is always the longest answer. I wish that were the case. It mostly will be, but sometimes the string will consist of 1 complete and 1 partial rep. But the partial rep at the end will exactly match the same number of characters at the beginning of the string, so it will always be possible to determine the rep. But how to encode that in a regex or at least avoid a brute force chop and compare? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^3: Finding repeat sequences. (only regex) by tye (Sage) on Jun 18, 2013 at 20:13 UTC
Note that, based on that definition, if the first and last characters are the same, then the answer is "the string minus the last character". Which leads to: `/^(.+?).\1$/` [download] Which leads to a full solution of: `/^((.?).?)\2\1$/` [download] which might be horribly inefficient (at least for some cases) or might not; I haven't considered it. - tye	[reply] [d/l] [select]
Re^4: Finding repeat sequences. (only regex) by BrowserUk (Patriarch) on Jun 18, 2013 at 20:21 UTC
Re^5: Finding repeat sequences. (only regex!) by tye (Sage) on Jun 18, 2013 at 21:17 UTC
Re^3: Finding repeat sequences. (only mostly regex) by choroba (Cardinal) on Jun 18, 2013 at 20:10 UTC
Cannot you find the incomplete repetition with `/^(.).\1$/` [download] ? If it is complete, you get the whole one. لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l]
Re^4: Finding repeat sequences. (only mostly regex) by BrowserUk (Patriarch) on Jun 18, 2013 at 20:18 UTC
Re^5: Finding repeat sequences. (only mostly regex) by choroba (Cardinal) on Jun 18, 2013 at 20:23 UTC
Some notes below your chosen depth have not been shown here
Re: Finding repeat sequences. by rjt (Curate) on Jun 18, 2013 at 20:23 UTC
This smells an awful lot to me like the Longest Repeated Substring Problem, maybe with a bit of a twist. Have you looked at SuffixTree? `use SuffixTree; my $stree = create_tree('abcdabcdabceabcdabcdabceab'); print_tree($stree);` [download] What is not clear to me from your description is whether you are looking for the longest substring with at least one repeat, or whether you are looking for the arbitrary length substring with the highest repeat count, or whether you are looking for the substring which, along with its (adjacent?) repeats comprises the longest length, or something else. Can you provide some more information and examples? A Super Search revealed: Generalized Suffix Tree in Perl finding longest common substring And a whole slew of other interesting threads	[reply] [d/l]
Re^2: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 18, 2013 at 20:39 UTC
What is not clear to me from your description is whether you are looking for the longest substring with at least one repeat, or whether you are looking for the arbitrary length substring with the highest repeat count, or whether you are looking for the substring which, along with its (adjacent?) repeats comprises the longest length, or something else. Can you provide some more information and examples? I thought (believe) I have described the problem exactly. Constructing examples is hard -- I have a program running (for 4+ hours now) generating controlled random string and trying to find exceptional cases. I'll try the description (unsatisfactory) again. The complete string will consist of, and only of, one or more repetitions of a substring, But the last repetition may be truncated. In code: `my $substring = getsubstring(); my $string = $substring x int( rand $N ); substr( $string, -int( rand length( $substring) ) ) = '' if length $ss +tring > length $substring;` [download] That is, all these are valid strings and all have 'fred' as their substring: `fredf fredfr fredfre fredfred fredfredf fredfredfr fredfredfre` [download] With regard to suffix trees, I feel I would probably need a prefix tree (Trie) instead, but these string can be very long and every implementation of Trie I've seen would not handle them. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^3: Finding repeat sequences. by rjt (Curate) on Jun 18, 2013 at 22:31 UTC
I thought (believe) I have described the problem exactly. Oh, I do not question that. What is in question is my own ability to comprehend. :-) Thanks for clarifying. I have a better idea now. This problem oozes recursion and backtracking (which is what most of the regex solutions are trying to accomplish). As much as I like punctuation by the kilo, I might try a more explicit recursive sub solution as a first cut, if for no other reason than to sprinkle some debug and trace the algorithm and data rep. Come up with some firm test cases based on that, then optimize. Edit: Aha, it looks like Mr. Conway hit it on the head!	[reply]
Re: Finding repeat sequences. by AnomalousMonk (Archbishop) on Jun 19, 2013 at 00:24 UTC
`m{ \A (.+?) \1* (.) (?(?{ 0 != index $1, $2 }) (FAIL)) \z }xmsg;` [download] ... seems to do the trick for all the permutations I've seen so far. I have no idea of the performance hit of the `(?(?{ CODE })yes-pattern)` thingy. (Update: `index $1, $2` works just as well as `0 != index $1, $2`) Read more... (4 kB) Update: Just noticed needless `/g` modifier in posted regex. Final regex should be `m{ \A (.+?) \1* (.) (?(?{ index $1, $2 }) (FAIL)) \z }xms` which is functionally identical (i.e., passes all tests), but which I would not expect to differ in performance in any detectable way (untested).	[reply] [d/l] [select]
Re: Finding repeat sequences. (Results:Part 1) by BrowserUk (Patriarch) on Jun 19, 2013 at 16:17 UTC
Below are the results from my first pass -- verifying basic functionality -- of (my adaptions of) the 8 solutions from tye, choroba, DamianConway, tobyink, AnomalousMonk, hdb, Eily, sundialsvc4 in this thread: C:\test>1039630-b.pl14 -SHOW=0 Looking for 'fredfre' in 'fredfrefredfr' tye found 'fredfrefred'; excluded from further consideration svc4 found 'redfrefredf'; excluded from further consideration Looking for 'fredfre' in 'fredfrefredf' Looking for 'fredfre' in 'fredfrefred' hdb found 'none found'; excluded from further consideration Looking for 'fredfre' in 'fredfrefre' Looking for 'fredfre' in 'fredfrefr' Looking for 'fredfre' in 'fredfref' Looking for 'fredfre' in 'fredfre' Looking for 'fredfre' in 'fredfrefredfrefredfr' choroba found 'fredfrefredfre'; excluded from further consideration Looking for 'fredfre' in 'fredfrefredfrefredf' Looking for 'fredfre' in 'fredfrefredfrefred' Looking for 'fredfre' in 'fredfrefredfrefre' Looking for 'fredfre' in 'fredfrefredfrefr' Looking for 'fredfre' in 'fredfrefredfref' Looking for 'fredfre' in 'fredfrefredfre' Eily found 'fredfrefred'; excluded from further consideration Partisipants in performance tests: anomalous tobyink damianc [download] If authors want to correct (my adaptions of) their solutions that's great, (but please don't moan at me If I screwed the pouch adapting them to subroutines :). Here's the test harness: #! perl -slw use strict; use 5.014; use Time::HiRes qw[ time ]; my %tests = ( tye => sub { state $re = qr[^((.?).?)(?=.)\1\2$]; my $r = shift; $$r =~ $re and return $1; return; }, choroba => sub { state $re = qr[^((.).)\1\2$]; my $r = shift; $$r =~ $re and return $1; return; }, damianc => sub { state $re = qr[ ^(.+?)(?{$^N}) ## $1 as $^R (?\| \1+$() ## Exact rep, no $2 \| \1(.+)$ ## Partial rep as $2 ## Check its a proper prefix (??{ $^N eq substr( $^R, 0, length($^N)) ? '' : '(?! +)' }) ) ]x; my $r = shift; $$r =~ $re and return $1; return; }, tobyink => sub { my $input = shift; my $length = length $$input; for my $i ( 1 .. $length ) { my $possible = substr( $$input, 0, $i ); my $repeated = $possible x ( 1 + int( $length / $i ) ); return $possible if $$input eq substr( $repeated, 0, $leng +th ); } return; }, hdb => sub { my $input = shift; my $length = length $$input; my $i = 0; my $possible; while( 1 ) { $possible = substr $$input, 0, $i+1; # increase length by +1 $i = index $$input, $possible, $i+1; # find next occurence + of candidate return if $i < 0; # if not found return full st +ring => no repetition $possible = substr $$input, 0, $i; # this is the minimum + length candidate return $possible if $$input eq substr( $possible x ( 1 + i +nt( $length / $i ) ), 0, $length ); # success } }, Eily => sub { my $input = shift; local $_ = reverse $$input; /^(.)(.+?\1)\2$/ and return reverse $2; return; }, anomalous => sub { state $re = qr[ \A (.+?) \1 (.) (?(?{ 0 != index $1, $2 }) ( +FAIL)) \z ]xms; my $r = shift; $$r =~ $re and return $1; return }, svc4 => sub { my $search = shift; my $tail_length = 1; my $tail_step = int( length( $$search ) / 2 ); while ($tail_step > 0) { $tail_length += $tail_step while substr( $$search, 0, $tail_length + $tail_step ) eq substr( $$search, -( $tail_length + $tail_step ) +, ( $tail_length + $tail_step ) ) ; $tail_step = int( $tail_step / 2 ); } my $body_length = length( $$search ) - $tail_length; my $longest = $body_length; my $n = $body_length - 1; while( $n > 1 ) { if( ( $body_length % $n ) == 0 ) { if( substr( $$search, 0, $n ) eq substr( $$search, $n, + $n ) ) { $longest = $n; last; } } $n--; } return substr( $$search, 1, $longest ); }, ); our $SHOW //= 1; ## basic functionality my $base = 'fredfre'; for my $rep ( 2 .. 3 ) { my $full = $base x $rep; for my $x ( 1 .. length( $base ) ) { my $str = substr( $full, 0, -$x ); ## make partial print "Looking for '$base' in '$str'"; for my $test ( keys %tests ) { my $res; eval { local $SIG{ALRM} = sub { die "timeout" }; alarm 10; $res = $tests{ $test }->( \$str ) // 'none found'; alarm 0; }; $SHOW and printf "%10s -> %s\n", $test, $res // $@; if( $res ne $base and not $base eq $str or $@ eq 'timeout' ) { delete $tests{ $test }; printf "\n%10s found '%s'; excluded from further consi +deration\n\n", $test, $res // $@ ; } } } } print "Partisipants in performance tests: @{[ keys %tests ]}"; exit; [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: Finding repeat sequences. (Results:Part 2. The winner) by BrowserUk (Patriarch) on Jun 19, 2013 at 18:06 UTC
After exclusions on functionality, the three left standing -- DamianConway, tobyink & AnomalousMonk -- went forward to performance testing where the latter quickly fell by the wayside. Of the remaining two, tobyink's solution is hands down winner with a cumulative 66 seconds version DamianConway's 1670 seconds: ... my @bases = ( 'fred', join( '', 'a'..'z' ), unpack( 'b', pack 'Q', 0 .. 99 ), ); my %res; for my $base ( @bases ) { for my $reps ( 1, 10, 100, 1000 ) { my $str = $base x $reps . substr( $base, 0, rand( length $base + ) ); for my $test ( keys %tests ) { my $start = time; my $res; eval { local $SIG{ALRM} = sub { die "timeout\n" }; alarm $T * $reps; $res = $tests{ $test }->( \$str ) // 'none found'; alarm 0; }; if( $@ eq "timeout\n" ) { delete $tests{ $test }; warn "$test timed out [@{[ $T * $reps ]}]; excluded\n" +; next; } my $stop = time; warn "$test: Found '$res' instead of '$base'\n" unless $re +s eq $base; $res{ $test }{ length( $base ) }{ $reps } = $stop - $start +; $res{ $test }{ length( $base ) }{ all } += $stop - $start; $res{ $test }{ all } += $stop - $start; printf "b:%5u in s:%10u %10s :: %f s\n", length( $base ), length( $str ), $test, $stop - $start ; } } } pp \%res; __END__ [18:23:33.12] C:\test>1039630-b.pl14 -SHOW=0 -T=10 tye found 'fredfrefred'; excluded from further consideration svc4 found 'redfrefredf'; excluded from further consideration hdb found 'none found'; excluded from further consideration choroba found 'fredfrefredfre'; excluded from further consideration Eily found 'fredfrefred'; excluded from further consideration Partisipants in performance tests: anomalous tobyink damianc anomalous timed out [10]; excluded { anomalous => { 4 => { 1 => "0.000113964080810547", 10 => "0.000288963317871094 +", 100 => "0.00277900695800781", 1000 => "0.0275580883026123", all => "0.0307400226593018", }, 26 => { 1 => "0.0011751651763916", 10 => "0.00826501846313477", +100 => "0.118575096130371", 1000 => "4.02753186225891", all => "4.15554714202881", }, all => "4.18628716468811", }, damianc => { 4 => { 1 => "0.000234842300415039", 10 => "0.000370979309082031", + 100 => "0.000273942947387695", 1000 => "0.000956058502197266", all => "0.00183582305908203", }, 26 => { 1 => "0.000357866287231445", 10 => "0.000438928604125977" +, 100 => "0.00239300727844238", 1000 => "0.0216219425201416", all => "0.0248117446899414", }, 6400 => { 1 => "1.69514513015747", 10 => "12.9105410575867", 100 +=> "135.586049079895", 1000 => "1520.64571499825", all => "1670.83745026588", }, all => "1670.86409783363", }, tobyink => { 4 => { 1 => "0.000294923782348633", 10 => "0.000179052352905273", + 100 => "5.60283660888672e-005", 1000 => "5.10215759277344e-005", all => "0.000581026077270508", }, 26 => { 1 => "7.60555267333984e-005", 10 => "0.000123977661132813 +", 100 => "0.00015711784362793", 1000 => "0.000526189804077148", all => "0.000883340835571289", }, 6400 => { 1 => "0.071134090423584", 10 => "0.232213973999023", 10 +0 => "4.46547484397888", 1000 => "61.8524870872498", all => "66.6213099956512", }, all => "66.6227743625641", }, } [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^3: Finding repeat sequences. (Results:Part 2. The winner) by hdb (Monsignor) on Jun 19, 2013 at 19:07 UTC
Congrats to tobyink! Thanks to your efforts, BrowserUK, I realized my mistakes in code and thinking. Should you ever get around to re-run your tests, here is a corrected version, don't know whether it is faster then tobyink's. `hdb => sub { my $input = shift; my $length = length $$input; my $i = 0; my $possible; my $j; while( 1 ) { $possible = substr $$input, 0, ++$i; $possible = substr $$input, 0, $i=$j if( ($j = index $$input, $p +ossible, $i) > 0 ); return $possible if substr( $$input, $i ) eq substr($$input, 0, +$length - $i); } },` [download]	[reply] [d/l]
Re^4: Finding repeat sequences. (Results:Part 3. The new winner) by BrowserUk (Patriarch) on Jun 19, 2013 at 19:46 UTC
Re^5: Finding repeat sequences. (Results:Part 3. The new winner) by hdb (Monsignor) on Jun 19, 2013 at 20:03 UTC
Re^2: Finding repeat sequences. (definitions; part $n) by tye (Sage) on Jun 20, 2013 at 00:40 UTC
Looking for 'fredfre' in 'fredfrefredfr' tye found 'fredfrefred'; excluded from further consideration Looks like you changed your definition of your desired results, so it was silly to use a "solution" tailored to a different definition. This particular example matches my original speculation of what made sense, which means you might want to try my original solution. Though, I didn't wade through everything trying to find the various redefinitions of what was desired (besides the ones given in reply to my nodes) and then sort them chronologically in order to figure out what the new requirements actually are. - tye	[reply]
Re^3: Finding repeat sequences. (definitions; part $n) by BrowserUk (Patriarch) on Jun 20, 2013 at 01:11 UTC
There is no new requirement. Re-read the OP. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re: Finding repeat sequences. by sundialsvc4 (Abbot) on Jun 19, 2013 at 04:06 UTC
Consider the following: use strict; use warnings; # 1 2 3 4 5 + 6 # 12345678901234567890123456789012345678901234567890123456 +7890 my $search = 'abcdefabcababeabcdefabcababeabcdefabcababeabcdefabcababe +abcd'; print "searching '$search'\n length=" . length($search) . "\n"; # FIRST, FIND THE LENGTH OF THE TAIL: THIS IS THE SHORTEST SUBSTRING # WHICH MATCHES THE START OF THE STRING. # TRY TO EXPAND THIS NUMBER AS RAPIDLY AS POSSIBLE. my $tail_length = 1; my $tail_step = int(length($search) / 2); while ($tail_step > 0) { $tail_length += $tail_step while ( substr($search, 0, $tail_length + $tail_step) eq substr($search, -($tail_length+$tail_step), ($tail_length+$tail_step))); $tail_step = int($tail_step / 2); } print "tail is '" . substr($search, 0, $tail_length) . "' length=$tail_length\n"; # === my $body_length = length($search) - $tail_length; print "body is '$body_length' characters long ... \n"; # THE BODY MUST CONTAIN AN INTEGRAL NUMBER OF OCCURRENCES # OF THE STRING. # THEREFORE, FIND THE LONGEST ONE THAT OCCURS TWICE. my $longest = $body_length; my $n = $body_length - 1; while ($n > 1) { if ( ($body_length % $n) == 0) { print "try $n\n"; if ( substr($search, 0, $n) eq substr($search, $n, $n) ) { $longest = $n; last; } } $n--; } print "longest string '" . substr($search, 1, $longest) . "' length=$longest\n"; [download] The “key” to this puzzle, as I read it, is the tail. This is the shortest string at the end of the string which matches the beginning of the string. The example code tries aggressively to find that number by adding power-of-two-smaller successive increments as many times as it can. (Note: might this introduce a flaw, vs. a simple 2-up count?) The second part of the algorithm now tries to find the longest string which occurs an integral number of times in the leading (repeated ...) portion. We know that the length of this string must be a mathematical factor, i.e. (length mod factor == 0). Only the first and second occurrence must be considered. If none can be found, the string consists of one non-repeated occurrence.	[reply] [d/l]
Re^2: Finding repeat sequences. by sundialsvc4 (Abbot) on Jun 19, 2013 at 04:22 UTC
It might even be easier than this ... (pseudocode this time) `for $t from 1 to int(length(string) / 2): next unless the first $t chars equal the last $t; # i.e. "$t" must be a plausible tail-size. $r = length - $t; $incr = int($r / 2); while( $incr > 0) { next unless ($incr % $r == 0); # i.e. must be a possible repeated-block size in this space solution is found if string(0, $incr) eq string($incr, $incr); # i.e. if we find one repetition then it must be the answer. $incr--; } }` [download] As before, start by looking for a tail. If you think you found one, look for repetitions in possibly-sized increments of the remaining string. This approach will consider all successively-larger candidates for the "tail," up to and including the largest tail, which is "half of it." For each tail-size, it looks for one repetition of all possible sizes which would fit evenly within this area, knowing that the left-hand portion is defined to consist only of repetitions. Continue for all tail-sizes, and for each of these, all rep-sizes, until the first success is found. It works only for data which does, in fact, match this assertion.	[reply] [d/l]
Re^3: Finding repeat sequences. by Eily (Monsignor) on Jun 19, 2013 at 12:40 UTC
Oh! I didn't think of that one. But I'm not sure why you have to do all those extra verifications. Since the input data is a repeated pattern, the length of that pattern is how little you can shift your string to the left and have a perfect match. `$s = 'abcdabcdabceabcdabcdabceab'; for (1..length($s)-1) { print substr($s,0,$_) and last if substr($s,$_) eq substr($s,0 +,length($s)-$_); }` [download] abcdabcdabce	[reply] [d/l]
Re^3: Finding repeat sequences. by sundialsvc4 (Abbot) on Jun 19, 2013 at 13:05 UTC
I think that the pseudo-coded solution is the strongest one, if it is made to stop after finding the longest possible substring (then `last`), and if it is allowed to find all possible (reasonably-long) tails, once again starting with the longest one. (These would be what a human being would probably consider to be the best “right answers.”) With very-short tails, as written it might produce wrong answers if it considers only the first two occurrences (consider string `'aaba'` if the tail were merely `'a'` .. incorrect). But the essential idea, I think, is still valid. One reason why I wrote it this way was in an effort to avoid “churning” memory when dealing with exceptionally-long strings. You don’t need to consider any string that isn’t a tail, nor, within the head, any repeated-string candidate that won’t fit. That subdivides the problem into two searches, both of which have only a few possibilities each. It might not yet be bug-free, but it ought to be close . . .	[reply]
Re: Finding repeat sequences. by Eily (Monsignor) on Jun 19, 2013 at 08:18 UTC
The easiest way to do it with is regex is to use the reversed string, because if you want to check that the end is included in the repeated pattern, you have to be able to make a reference to it. `$_ = reverse 'abcdabcdabceabcdabcdabceab'; /^(.)(.+?\1)\2$/; print scalar reverse $2;` [download] abcdabcdabce Edit: it probably is faster with the pattern `/^(.+?)(.?\1)\2$/` since it uses a minimum length string as the trailing partial repetition, instead of a maximum one (ie : it does not read the whole input and bactracks). The first one should work in any case though, the second would work only if the base string is repeated at least once, even partially.	[reply] [d/l] [select]
Re: Finding repeat sequences. by LanX (Saint) on Jun 21, 2013 at 01:03 UTC
Still struggling to understand the task... is `$str = ($pattern x $n ) . substr($pattern,0,$k)` with `0 <= $k < ($l = length($pattern))` and the task is to find maximum `$pattern` for a given `$str` to fit these constraints? If yes, some simple mathematics should already considerably minimize the set of possible combinations you need to investigate with regexes. Cheers Rolf ( addicted to the Perl Programming Language) test `DB<109> $pattern='abcdabcdabce' => "abcdabcdabce" DB<110> $n=2,$k=2 => (2, 2) DB<111> $str = ($pattern x $n ) . substr($pattern,0,$k) => "abcdabcdabceabcdabcdabceab" DB<112> $str eq 'abcdabcdabceabcdabcdabceab' => 1` [download]	[reply] [d/l] [select]
Re^2: Finding repeat sequences. by hdb (Monsignor) on Jun 21, 2013 at 10:54 UTC
It is to find the shortest pattern, otherwise $n==1 always. Correction: replaced $n=1 with $n==1	[reply]
Re^2: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 21, 2013 at 01:24 UTC
and the task is to find maximum $pattern to fit these constraints? Um. I cannot see any errors in that. So yes. If yes, some simple mathematics should already considerably minimize the set of possible combinations you need to investigate with regexes. Hm. A realistic, but relatively small, example from my test harness: `b:64000 in s: 640028748 hdb :: 24.290438 s` [download] L=64000, N = 10,000, K=28,740. But those could equally well be: L=16,000, N = 40,001, K=12,740; or (thousands) of other permutations. I don't think it helps. (I'm being very, very conservative; my best guess is 100s, of millions.) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^3: Finding repeat sequences. by LanX (Saint) on Jun 21, 2013 at 02:22 UTC
here a regex solution which works for the shortest possible tail of length k `DB<127> $str => "abcdabcdabceabcdabcdabceab" DB<128> $str=~/^((.+?).)\2$/; $rest=$1, $tail=$2 => ("abcdabcdabceabcdabcdabce", "ab") DB<129> $rest =~ /^(.+?)\1$/; $1 => "abcdabcdabce"` [download] needs to be extended for longer possible tails. But taking the dimensions of your data I doubt that regexes are appropriate. You could test all `$patterns` which repeat at least once (or x times) and calculate `$k = $m % $l` with `$m =length ($str)`, and check if $str starts and ends with the same substring $tail of length $k and then check if the pattern continues repeating. Or start eliminating all possible $tails and check if $l of a repeating pattern is a divisor of the $rest. Had no time to check all the other posted solutions and don't wanna reinvent the wheel, so I better stop here! =) HTH Cheers Rolf ( addicted to the Perl Programming Language)	[reply] [d/l] [select]
Re^4: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 21, 2013 at 02:48 UTC


Welcome to the Monastery
	PerlMonks

Finding repeat sequences.

test