Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Finding repeat sequences.

by BrowserUk (Patriarch)
on Jun 18, 2013 at 18:55 UTC ( [id://1039630]=perlquestion: print w/replies, xml ) Need Help??

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Given a string of arbitrary (long) length that is known to comprise of a number of repetitions of an unknown length substring, (thought the last repetition my be incomplete), how to find that repeat sequence?

Eg. Given 'abcdabcdabcdabcdab' find 'abcd'.

Complications:

  1. The first (shortest) repetition may not constitute the full repetition.

    Eg. In 'abcdabcdabceabcdabcdabceab'; 'abcd' is a false rep; the required rep is 'abcdabcdabce';

  2. As mentioned, the last part of the string may be an incomplete repetition.

    The number of characters 'ignored' at the end of the string should be less than the length of the rep. Is is possible to code that into a regex? I guess it could just be checked afterwards.

I'm assuming that a regex solution would be possible, but I cannot wrap my brain around it today?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: Finding repeat sequences.
by DamianConway (Beadle) on Jun 18, 2013 at 21:43 UTC
    Here's a possible solution for what—I think—you're trying to achieve:
    #! /usr/bin/env perl use 5.014; use warnings; # We're going to try matching substrings of this... my $MASTER_STR = 'abcdabcdabceabcdabcdabceab'; # Using this pattern to find (possibly truncated) repetitions... my $REPETITION_PAT = qr{ \A # Start at the beginning of the string (.+?) # Match minimal initial sequence (as $1) (?{$^N}) # Remember it internally (as $^R) (?| # Then either it's not truncated... \1+ # Rematch it exactly as often as possible \z # ...until the end of the string () # Remember that nothing was left over (as $2) | # Or else it is truncated... \1* # Rematch it exactly as often as possible (or not) (.+) # Then grab what's left (as $2)(internally as $^N) \z # ...until the end of the string # Then verify that what's left is a proper truncation... (??{ # If what's left ($^N) is substring of repetition ($^R)... $^N eq substr($^R,0,length($^N)) ? q{} # ...then do nothing (i.e. keep matching) : q{(?!)} # ...else initiate backtracking }) ) }xms; # Loop through increasingly truncated substrings... for my $substr (1..length($MASTER_STR)) { # Compute the substring... my $str = substr($MASTER_STR,0,-$substr); # Evaluate the match... if ($str =~ $REPETITION_PAT) { say "$str\n Matched: $1*$2"; } else { say "$str\n Didn't match"; } }
    Damian

      Here's a significantly optimized version of my previous solution, which draws heavily on anomalous's excellent work, but only bothers to capture the initial repeated component (in $1) as that's apparently all that's wanted.

      my $REPETITION_PAT = qr{ \A # Start at the beginning of the string (.+?) # Match minimal initial sequence (as $1) \1*+ # Rematch it exactly as often as possible (maybe zero) # Then verify what's left is a proper truncation... (?(?{ index($^N, substr($_,pos())) }) (?!) ) }xms;

      It still passes BrowserUK's gauntlet, however in my own testing it's approximately 400 times faster than my previous attempt. I suspect that's still not sufficient to make it competitive with the non-regex solutions. (I know which one I'd rather maintain, though ;-)

      As you can see, the key to the improvement in regex performance was—as usual—to eliminate opportunities for backtracking or capturing.

      Damian

        That is a very substantial improvement and it now beats tobyink's non-regex solution, but cannot touch hdb's:

        Partisipants in performance tests: hdb tobyink damianc b: 4 in s: 6 hdb :: 0.000089 s b: 4 in s: 6 tobyink :: 0.000336 s b: 4 in s: 6 damianc :: 0.000081 s b: 4 in s: 43 hdb :: 0.000070 s b: 4 in s: 43 tobyink :: 0.000070 s b: 4 in s: 43 damianc :: 0.000266 s b: 4 in s: 402 hdb :: 0.000083 s b: 4 in s: 402 tobyink :: 0.000092 s b: 4 in s: 402 damianc :: 0.000210 s b: 4 in s: 4000 hdb :: 0.000391 s b: 4 in s: 4000 tobyink :: 0.000158 s b: 4 in s: 4000 damianc :: 0.000719 s b: 26 in s: 32 hdb :: 0.000075 s b: 26 in s: 32 tobyink :: 0.000097 s b: 26 in s: 32 damianc :: 0.000109 s b: 26 in s: 260 hdb :: 0.000076 s b: 26 in s: 260 tobyink :: 0.000153 s b: 26 in s: 260 damianc :: 0.000124 s b: 26 in s: 2600 hdb :: 0.000111 s b: 26 in s: 2600 tobyink :: 0.000152 s b: 26 in s: 2600 damianc :: 0.000182 s b: 26 in s: 26016 hdb :: 0.000125 s b: 26 in s: 26016 tobyink :: 0.000947 s b: 26 in s: 26016 damianc :: 0.000620 s b:64000 in s: 100595 hdb :: 0.006378 s b:64000 in s: 100595 tobyink :: 6.614787 s b:64000 in s: 100595 damianc :: 6.247666 s b:64000 in s: 657902 hdb :: 0.017817 s b:64000 in s: 657902 tobyink :: 48.717781 s b:64000 in s: 657902 damianc :: 12.516777 s b:64000 in s: 6444916 hdb :: 0.235547 s b:64000 in s: 6444916 tobyink :: 582.471826 s b:64000 in s: 6444916 damianc :: 179.466052 s b:64000 in s: 64031416 hdb :: 2.763495 s b:64000 in s: 64031416 tobyink :: 6158.077379 s b:64000 in s: 64031416 damianc :: 2104.112186 s { damianc => { 4 => { 1 => "8.10623168945313e-005", 10 => "0.000266075134277344", + 100 => "0.000210046768188477", 1000 => "0.000719070434570313", all = +> "0.00127625465393066", }, 26 => { 1 => "0.000108957290649414", 10 => "0.000123977661132813", + 100 => "0.000181913375854492", 1000 => "0.000619888305664063", all = +> "0.00103473663330078", }, 64000 => { 1 => "6.2476658821106", 10 => "12.5167770385742", 100 = +> "179.466052055359", 1000 => "2104.11218619347", all => "2302.342681 +16951", }, all => "2302.3449921608", }, hdb => { 4 => { 1 => "8.89301300048828e-005", 10 => "7.00950622558594e-005" +, 100 => "8.29696655273438e-005", 1000 => "0.000391006469726563", all + => "0.000633001327514648", }, 26 => { 1 => "7.48634338378906e-005", 10 => "7.60555267333984e-005 +", 100 => "0.000110864639282227", 1000 => "0.000124931335449219", all + => "0.000386714935302734", }, 64000 => { 1 => "0.006378173828125", 10 => "0.0178170204162598", 1 +00 => "0.235547065734863", 1000 => "2.76349496841431", all => "3.0232 +3722839355", }, all => "3.02425694465637", }, tobyink => { 4 => { 1 => "0.000335931777954102", 10 => "7.00950622558594e-005", + 100 => "9.17911529541016e-005", 1000 => "0.000158071517944336", all +=> "0.000655889511108398", }, 26 => { 1 => "9.70363616943359e-005", 10 => "0.000153064727783203" +, 100 => "0.000151872634887695", 1000 => "0.000946998596191406", all +=> "0.00134897232055664", }, 64000 => { 1 => "6.61478710174561", 10 => "48.7177810668945", 100 +=> "582.471826076508", 1000 => "6158.07737898827", all => "6795.88177 +323341", }, all => "6795.88377809525", }, }

        I'm not sure why you think it would be any easier to maintain that the non-regex solutions?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      That works perfectly and does exactly what I want. Thank you.

      I suspect it may run rather slowly as is -- /xms etc. -- but given the included test suite, tuning should be simple :)

      (BTW: Nice to see you: to see you ... )


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding repeat sequences.
by tobyink (Canon) on Jun 18, 2013 at 22:23 UTC

    Not using regexps at all, but...

    #!/usr/bin/env perl use 5.012; use Test::More; sub find_substring { my $input = shift; my $length = length $input; for my $i (1 .. $length) { my $possible = substr($input, 0, $i); my $repeated = $possible x (1 + int($length / $i)); return $possible if $input eq substr($repeated, 0, $length); } return ""; } my %eg = ( abcdabcdabcdabcdab => "abcd", abcdabcdabceabcdabcdabceab => "abcdabcdabce", aaaabaaaabaaaaabaaaab => "aaaabaaaaba", ); for my $input (sort keys %eg) { my $expected = $eg{$input}; my $got = find_substring($input); is($got, $expected, "result is '$expected' given input '$input'"); } done_testing;

    Note that when there are multiple possible matches, this returns the shortest, because it doesn't make sense to return the longest - the longest is uninteresting.

    For example, given the input abcabca, it could be that the answer is abc repeated two and a bit times, or abcabc repeated one and a bit times, or abcabca repeated exactly once. (Well, not really "repeated" but you know what I mean. The entire input string itself is always a valid and uninteresting answer.) Or, depending on how the problem is defined, the correct answer might be abcabcaxx repeated less than one time - i.e. the first repetition was truncated!

    So the only interesting answer to return is the shortest possible one.

    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name

      That's interesting...(not faint praise.). (I'm already thinking of optimisations; like build the longest repeat sequence and then use substr to get shorter versions.)

      For example, given the input abcabca, it could be ... So the only interesting answer to return is the shortest possible one.

      Hm. Damn you for making me think (again) at this time of night :)

      There will always be at least one complete substring.

      If there is more than 1 but less than 2, ie. 1 rep + 1 partial; (I believe) it will always be possible to determine the longest < length string match; because the residual always matches the length( residual ) first characters of the string.

      So, if the string is 'abcabca'; the rep could be 'abcabc' or 'abc'. But if the rep consists entirely of an exact integer number of reps of a subsubstring, then the substring is that subsubstring and the string consists of rep*n(n>1) + a partial.

      Thus, I believe that there is only ever one results.

      It will be interesting to pitch your solution against DamianConway's regex and see how they compare. I simply have no feel for it; but it's a job for tomorrow.

      Thank you.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        UPDATE: WARNING: the following code does not work in all circumstances. Sorry!

        Here is a variant of tobyink's solution that uses index to look ahead when the current candidate string repeats and then enlengthens (is that an English word?) it accordingly.

        sub find_substring { my $input = shift; my $length = length $input; my $i = 0; my $possible; while( 1 ) { $possible = substr $input, 0, $i+1; # increase length by 1 $i = index $input, $possible, $i+1; # find next occurence of c +andidate return $input if $i < 0; # if not found return full + string => no repetition $possible = substr $input, 0, $i; # this is the minimum leng +th candidate return $possible if $input eq substr($possible x (1 + int($len +gth / $i)), 0, $length); # success } }

        UPDATE: Eily's solution below Re^3: Finding repeat sequences. can be used to avoid the construction of the repeated string (as it is the same just with offset). Therefore, this works even better:

        sub find_substring { my $input = shift; my $length = length $input; my $i = 0; while( 1 ) { $i = index( $input, substr( $input, 0, $i+1 ), $i+1); return $input if $i < 0; return substr( $input, 0, $i) if substr( $input, $i ) eq substr($i +nput, 0, $length - $i); } }
Re: Finding repeat sequences.
by choroba (Cardinal) on Jun 18, 2013 at 19:10 UTC
    "abcdabcdabceabcdabcdabceab" =~ /(.+).*\1/
    returns
    abcdabcdabce

    Can you give more input samples to exemplify the other constraints?

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      They are kind of hard to come up with, but okay.

      Given the rep 'aaaabaaaaba' and a string containing one whole and one partial rep 'aaaabaaaabaaaaabaaaab'

      $s = 'aaaabaaaabaaaaabaaaab';; $s =~ m[(.+).*\1] and print $1;; aaaabaaaab

      Which isn't correct because:

      aaaabaaaabaaaaabaaaab aaaabaaaab aaaabaaaab 1 ^2

      You can fix that by removing the redundant .* per LanX's version: m[(.+)\1] but then you get:

      $s = 'aaaabaaaabaaaaabaaaab';; $s =~ m[(.+)\1] and print $1;; aaaab

      Which isn't right:

      aaaabaaaabaaaaabaaaab aaaabaaaab aaaabaaaab 1 2 ^3 4

      I realise that this is a 'cheat' as there in no complete repetition to find, but it is one possible scenario.

      Given the string will always consist of 1 or more repetitions of the substring, whatever partial substring (if any) is at the end of the string should match the same number of characters at the start of the string. That's the bit I'm having trouble wrapping my head around.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding repeat sequences.
by LanX (Saint) on Jun 18, 2013 at 19:09 UTC
    > Eg. In abcdabcdabceabcdabcdabceab; abcd is a false rep; the required rep is abcdabcdabce;

    since regex are greedy, this works for me

    DB<131> $_='abcdabcdabceabcdabcdabceab' => "abcdabcdabceabcdabcdabceab" DB<132> ($match)= /(.+)\1/ => "abcdabcdabce"

    edit: I doubt that the answer is so simple, but this should give you a start to ask more precisely...

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Please see Re^2: Finding repeat sequences..


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding repeat sequences. (only mostly regex)
by tye (Sage) on Jun 18, 2013 at 19:40 UTC

    I assume that the pattern must repeat at least twice, otherwise, the full string is always the longest answer.

    A simple regex can get a good guess and tell you when that guess has failed in such a way that each subsequent guess will be more than twice as long as the previous guess so the regex doesn't have to be run very many times:

    sub repeating { my( $string ) = @_; my( $pattern, $repeat, $end ) = $string =~ /^(.+?)(\1+)(.*)$/; while( defined $pattern ) { return $pattern if length($end) <= length($pattern) && $end eq substr($pattern,0,length($end)); print "($pattern) wasn't long enough.\n"; ( $pattern, $repeat, $end ) = $string =~ /^(\Q$pattern$repeat\E.+?)(\1+)(.*)$/ } return undef; } my $pattern = repeating( "aabaabaabcaabaabaabca" ); printf "(%s) wins\n", $pattern if $pattern; __END__ (a) wasn't long enough. (aab) wasn't long enough. (aabaabaabc) wins

    You likely can optimize this by copying less stuff, of course.

    (Update: Well, I didn't get very rigorous in proving to myself that $pattern.$repeat is always too short. But I believe that to be the case. One should validate or refute that assumption before deciding whether to use this.)

    - tye        

      I assume that the pattern must repeat at least twice, otherwise, the full string is always the longest answer.

      I wish that were the case. It mostly will be, but sometimes the string will consist of 1 complete and 1 partial rep.

      But the partial rep at the end *will* exactly match the same number of characters at the beginning of the string, so it will always be possible to determine the rep.

      But how to encode that in a regex or at least avoid a brute force chop and compare?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Note that, based on that definition, if the first and last characters are the same, then the answer is "the string minus the last character". Which leads to:

        /^(.+?).*\1$/

        Which leads to a full solution of:

        /^((.*?).*?)\2*\1$/

        which might be horribly inefficient (at least for some cases) or might not; I haven't considered it.

        - tye        

        Cannot you find the incomplete repetition with
        /^(.*).*\1$/
        ?

        If it is complete, you get the whole one.

        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Finding repeat sequences.
by rjt (Curate) on Jun 18, 2013 at 20:23 UTC

    This smells an awful lot to me like the Longest Repeated Substring Problem, maybe with a bit of a twist. Have you looked at SuffixTree?

    use SuffixTree; my $stree = create_tree('abcdabcdabceabcdabcdabceab'); print_tree($stree);

    What is not clear to me from your description is whether you are looking for the longest substring with at least one repeat, or whether you are looking for the arbitrary length substring with the highest repeat count, or whether you are looking for the substring which, along with its (adjacent?) repeats comprises the longest length, or something else. Can you provide some more information and examples?

    A Super Search revealed:

      What is not clear to me from your description is whether you are looking for the longest substring with at least one repeat, or whether you are looking for the arbitrary length substring with the highest repeat count, or whether you are looking for the substring which, along with its (adjacent?) repeats comprises the longest length, or something else. Can you provide some more information and examples?

      I thought (believe) I have described the problem exactly. Constructing examples is hard -- I have a program running (for 4+ hours now) generating controlled random string and trying to find exceptional cases.

      I'll try the description (unsatisfactory) again.

      The complete string will consist of, and only of, one or more repetitions of a substring, But the last repetition may be truncated. In code:

      my $substring = getsubstring(); my $string = $substring x int( rand $N ); substr( $string, -int( rand length( $substring) ) ) = '' if length $ss +tring > length $substring;

      That is, all these are valid strings and all have 'fred' as their substring:

      fredf fredfr fredfre fredfred fredfredf fredfredfr fredfredfre

      With regard to suffix trees, I feel I would probably need a prefix tree (Trie) instead, but these string can be very long and every implementation of Trie I've seen would not handle them.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I thought (believe) I have described the problem exactly.

        Oh, I do not question that. What is in question is my own ability to comprehend. :-) Thanks for clarifying. I have a better idea now.

        This problem oozes recursion and backtracking (which is what most of the regex solutions are trying to accomplish). As much as I like punctuation by the kilo, I might try a more explicit recursive sub solution as a first cut, if for no other reason than to sprinkle some debug and trace the algorithm and data rep. Come up with some firm test cases based on that, then optimize.

        Edit: Aha, it looks like Mr. Conway hit it on the head!

Re: Finding repeat sequences.
by AnomalousMonk (Archbishop) on Jun 19, 2013 at 00:24 UTC
    m{ \A (.+?) \1* (.*) (?(?{ 0 != index $1, $2 }) (*FAIL)) \z }xmsg;

    ... seems to do the trick for all the permutations I've seen so far. I have no idea of the performance hit of the  (?(?{ CODE })yes-pattern) thingy. (Update:  index $1, $2 works just as well as  0 != index $1, $2)

    Update: Just noticed needless  /g modifier in posted regex. Final regex should be
        m{ \A (.+?) \1* (.*) (?(?{ index $1, $2 }) (*FAIL)) \z }xms
    which is functionally identical (i.e., passes all tests), but which I would not expect to differ in performance in any detectable way (untested).

Re: Finding repeat sequences. (Results:Part 1)
by BrowserUk (Patriarch) on Jun 19, 2013 at 16:17 UTC

    Below are the results from my first pass -- verifying basic functionality -- of (my adaptions of) the 8 solutions from tye, choroba, DamianConway, tobyink, AnomalousMonk, hdb, Eily, sundialsvc4 in this thread:

    C:\test>1039630-b.pl14 -SHOW=0 Looking for 'fredfre' in 'fredfrefredfr' tye found 'fredfrefred'; excluded from further consideration svc4 found 'redfrefredf'; excluded from further consideration Looking for 'fredfre' in 'fredfrefredf' Looking for 'fredfre' in 'fredfrefred' hdb found 'none found'; excluded from further consideration Looking for 'fredfre' in 'fredfrefre' Looking for 'fredfre' in 'fredfrefr' Looking for 'fredfre' in 'fredfref' Looking for 'fredfre' in 'fredfre' Looking for 'fredfre' in 'fredfrefredfrefredfr' choroba found 'fredfrefredfre'; excluded from further consideration Looking for 'fredfre' in 'fredfrefredfrefredf' Looking for 'fredfre' in 'fredfrefredfrefred' Looking for 'fredfre' in 'fredfrefredfrefre' Looking for 'fredfre' in 'fredfrefredfrefr' Looking for 'fredfre' in 'fredfrefredfref' Looking for 'fredfre' in 'fredfrefredfre' Eily found 'fredfrefred'; excluded from further consideration Partisipants in performance tests: anomalous tobyink damianc

    If authors want to correct (my adaptions of) their solutions that's great, (but please don't moan at me If I screwed the pouch adapting them to subroutines :).

    Here's the test harness:

    #! perl -slw use strict; use 5.014; use Time::HiRes qw[ time ]; my %tests = ( tye => sub { state $re = qr[^((.*?).*?)(?=.)\1*\2$]; my $r = shift; $$r =~ $re and return $1; return; }, choroba => sub { state $re = qr[^((.*).*)\1*\2$]; my $r = shift; $$r =~ $re and return $1; return; }, damianc => sub { state $re = qr[ ^(.+?)(?{$^N}) ## $1 as $^R (?| \1+$() ## Exact rep, no $2 | \1*(.+)$ ## Partial rep as $2 ## Check its a proper prefix (??{ $^N eq substr( $^R, 0, length($^N)) ? '' : '(?! +)' }) ) ]x; my $r = shift; $$r =~ $re and return $1; return; }, tobyink => sub { my $input = shift; my $length = length $$input; for my $i ( 1 .. $length ) { my $possible = substr( $$input, 0, $i ); my $repeated = $possible x ( 1 + int( $length / $i ) ); return $possible if $$input eq substr( $repeated, 0, $leng +th ); } return; }, hdb => sub { my $input = shift; my $length = length $$input; my $i = 0; my $possible; while( 1 ) { $possible = substr $$input, 0, $i+1; # increase length by +1 $i = index $$input, $possible, $i+1; # find next occurence + of candidate return if $i < 0; # if not found return full st +ring => no repetition $possible = substr $$input, 0, $i; # this is the minimum + length candidate return $possible if $$input eq substr( $possible x ( 1 + i +nt( $length / $i ) ), 0, $length ); # success } }, Eily => sub { my $input = shift; local $_ = reverse $$input; /^(.*)(.+?\1)\2*$/ and return reverse $2; return; }, anomalous => sub { state $re = qr[ \A (.+?) \1* (.*) (?(?{ 0 != index $1, $2 }) ( +*FAIL)) \z ]xms; my $r = shift; $$r =~ $re and return $1; return }, svc4 => sub { my $search = shift; my $tail_length = 1; my $tail_step = int( length( $$search ) / 2 ); while ($tail_step > 0) { $tail_length += $tail_step while substr( $$search, 0, $tail_length + $tail_step ) eq substr( $$search, -( $tail_length + $tail_step ) +, ( $tail_length + $tail_step ) ) ; $tail_step = int( $tail_step / 2 ); } my $body_length = length( $$search ) - $tail_length; my $longest = $body_length; my $n = $body_length - 1; while( $n > 1 ) { if( ( $body_length % $n ) == 0 ) { if( substr( $$search, 0, $n ) eq substr( $$search, $n, + $n ) ) { $longest = $n; last; } } $n--; } return substr( $$search, 1, $longest ); }, ); our $SHOW //= 1; ## basic functionality my $base = 'fredfre'; for my $rep ( 2 .. 3 ) { my $full = $base x $rep; for my $x ( 1 .. length( $base ) ) { my $str = substr( $full, 0, -$x ); ## make partial print "Looking for '$base' in '$str'"; for my $test ( keys %tests ) { my $res; eval { local $SIG{ALRM} = sub { die "timeout" }; alarm 10; $res = $tests{ $test }->( \$str ) // 'none found'; alarm 0; }; $SHOW and printf "%10s -> %s\n", $test, $res // $@; if( $res ne $base and not $base eq $str or $@ eq 'timeout' ) { delete $tests{ $test }; printf "\n%10s found '%s'; excluded from further consi +deration\n\n", $test, $res // $@ ; } } } } print "Partisipants in performance tests: @{[ keys %tests ]}"; exit;

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      After exclusions on functionality, the three left standing -- DamianConway, tobyink & AnomalousMonk -- went forward to performance testing where the latter quickly fell by the wayside.

      Of the remaining two, tobyink's solution is hands down winner with a cumulative 66 seconds version DamianConway's 1670 seconds:

      ... my @bases = ( 'fred', join( '', 'a'..'z' ), unpack( 'b*', pack 'Q*', 0 .. 99 ), ); my %res; for my $base ( @bases ) { for my $reps ( 1, 10, 100, 1000 ) { my $str = $base x $reps . substr( $base, 0, rand( length $base + ) ); for my $test ( keys %tests ) { my $start = time; my $res; eval { local $SIG{ALRM} = sub { die "timeout\n" }; alarm $T * $reps; $res = $tests{ $test }->( \$str ) // 'none found'; alarm 0; }; if( $@ eq "timeout\n" ) { delete $tests{ $test }; warn "$test timed out [@{[ $T * $reps ]}]; excluded\n" +; next; } my $stop = time; warn "$test: Found '$res' instead of '$base'\n" unless $re +s eq $base; $res{ $test }{ length( $base ) }{ $reps } = $stop - $start +; $res{ $test }{ length( $base ) }{ all } += $stop - $start; $res{ $test }{ all } += $stop - $start; printf "b:%5u in s:%10u %10s :: %f s\n", length( $base ), length( $str ), $test, $stop - $start ; } } } pp \%res; __END__ [18:23:33.12] C:\test>1039630-b.pl14 -SHOW=0 -T=10 tye found 'fredfrefred'; excluded from further consideration svc4 found 'redfrefredf'; excluded from further consideration hdb found 'none found'; excluded from further consideration choroba found 'fredfrefredfre'; excluded from further consideration Eily found 'fredfrefred'; excluded from further consideration Partisipants in performance tests: anomalous tobyink damianc anomalous timed out [10]; excluded { anomalous => { 4 => { 1 => "0.000113964080810547", 10 => "0.000288963317871094 +", 100 => "0.00277900695800781", 1000 => "0.0275580883026123", all => "0.0307400226593018", }, 26 => { 1 => "0.0011751651763916", 10 => "0.00826501846313477", +100 => "0.118575096130371", 1000 => "4.02753186225891", all => "4.15554714202881", }, all => "4.18628716468811", }, damianc => { 4 => { 1 => "0.000234842300415039", 10 => "0.000370979309082031", + 100 => "0.000273942947387695", 1000 => "0.000956058502197266", all => "0.00183582305908203", }, 26 => { 1 => "0.000357866287231445", 10 => "0.000438928604125977" +, 100 => "0.00239300727844238", 1000 => "0.0216219425201416", all => "0.0248117446899414", }, 6400 => { 1 => "1.69514513015747", 10 => "12.9105410575867", 100 +=> "135.586049079895", 1000 => "1520.64571499825", all => "1670.83745026588", }, all => "1670.86409783363", }, tobyink => { 4 => { 1 => "0.000294923782348633", 10 => "0.000179052352905273", + 100 => "5.60283660888672e-005", 1000 => "5.10215759277344e-005", all => "0.000581026077270508", }, 26 => { 1 => "7.60555267333984e-005", 10 => "0.000123977661132813 +", 100 => "0.00015711784362793", 1000 => "0.000526189804077148", all => "0.000883340835571289", }, 6400 => { 1 => "0.071134090423584", 10 => "0.232213973999023", 10 +0 => "4.46547484397888", 1000 => "61.8524870872498", all => "66.6213099956512", }, all => "66.6227743625641", }, }

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Congrats to tobyink!

        Thanks to your efforts, BrowserUK, I realized my mistakes in code and thinking. Should you ever get around to re-run your tests, here is a corrected version, don't know whether it is faster then tobyink's.

        hdb => sub { my $input = shift; my $length = length $$input; my $i = 0; my $possible; my $j; while( 1 ) { $possible = substr $$input, 0, ++$i; $possible = substr $$input, 0, $i=$j if( ($j = index $$input, $p +ossible, $i) > 0 ); return $possible if substr( $$input, $i ) eq substr($$input, 0, +$length - $i); } },
      Looking for 'fredfre' in 'fredfrefredfr' tye found 'fredfrefred'; excluded from further consideration

      Looks like you changed your definition of your desired results, so it was silly to use a "solution" tailored to a different definition.

      This particular example matches my original speculation of what made sense, which means you might want to try my original solution. Though, I didn't wade through everything trying to find the various redefinitions of what was desired (besides the ones given in reply to my nodes) and then sort them chronologically in order to figure out what the new requirements actually are.

      - tye        

        There is no new requirement. Re-read the OP.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding repeat sequences.
by sundialsvc4 (Abbot) on Jun 19, 2013 at 04:06 UTC

    Consider the following:

    use strict; use warnings; # 1 2 3 4 5 + 6 # 12345678901234567890123456789012345678901234567890123456 +7890 my $search = 'abcdefabcababeabcdefabcababeabcdefabcababeabcdefabcababe +abcd'; print "searching '$search'\n length=" . length($search) . "\n"; # FIRST, FIND THE LENGTH OF THE TAIL: THIS IS THE SHORTEST SUBSTRING # WHICH MATCHES THE START OF THE STRING. # TRY TO EXPAND THIS NUMBER AS RAPIDLY AS POSSIBLE. my $tail_length = 1; my $tail_step = int(length($search) / 2); while ($tail_step > 0) { $tail_length += $tail_step while ( substr($search, 0, $tail_length + $tail_step) eq substr($search, -($tail_length+$tail_step), ($tail_length+$tail_step))); $tail_step = int($tail_step / 2); } print "tail is '" . substr($search, 0, $tail_length) . "' length=$tail_length\n"; # === my $body_length = length($search) - $tail_length; print "body is '$body_length' characters long ... \n"; # THE BODY MUST CONTAIN AN INTEGRAL NUMBER OF OCCURRENCES # OF THE STRING. # THEREFORE, FIND THE LONGEST ONE THAT OCCURS TWICE. my $longest = $body_length; my $n = $body_length - 1; while ($n > 1) { if ( ($body_length % $n) == 0) { print "try $n\n"; if ( substr($search, 0, $n) eq substr($search, $n, $n) ) { $longest = $n; last; } } $n--; } print "longest string '" . substr($search, 1, $longest) . "' length=$longest\n";

    The “key” to this puzzle, as I read it, is the tail.   This is the shortest string at the end of the string which matches the beginning of the string.   The example code tries aggressively to find that number by adding power-of-two-smaller successive increments as many times as it can.   (Note: might this introduce a flaw, vs. a simple 2-up count?)

    The second part of the algorithm now tries to find the longest string which occurs an integral number of times in the leading (repeated ...) portion.   We know that the length of this string must be a mathematical factor, i.e. (length mod factor == 0).   Only the first and second occurrence must be considered.   If none can be found, the string consists of one non-repeated occurrence.

      It might even be easier than this ... (pseudocode this time)

      for $t from 1 to int(length(string) / 2): next unless the first $t chars equal the last $t; # i.e. "$t" must be a plausible tail-size. $r = length - $t; $incr = int($r / 2); while( $incr > 0) { next unless ($incr % $r == 0); # i.e. must be a possible repeated-block size in this space solution is found if string(0, $incr) eq string($incr, $incr); # i.e. if we find one repetition then it must be the answer. $incr--; } }
      As before, start by looking for a tail. If you think you found one, look for repetitions in possibly-sized increments of the remaining string. This approach will consider all successively-larger candidates for the "tail," up to and including the largest tail, which is "half of it." For each tail-size, it looks for one repetition of all possible sizes which would fit evenly within this area, knowing that the left-hand portion is defined to consist only of repetitions. Continue for all tail-sizes, and for each of these, all rep-sizes, until the first success is found. It works only for data which does, in fact, match this assertion.

        Oh! I didn't think of that one. But I'm not sure why you have to do all those extra verifications. Since the input data is a repeated pattern, the length of that pattern is how little you can shift your string to the left and have a perfect match.

        $s = 'abcdabcdabceabcdabcdabceab'; for (1..length($s)-1) { print substr($s,0,$_) and last if substr($s,$_) eq substr($s,0 +,length($s)-$_); }
        abcdabcdabce

        I think that the pseudo-coded solution is the strongest one, if it is made to stop after finding the longest possible substring (then last), and if it is allowed to find all possible (reasonably-long) tails, once again starting with the longest one.   (These would be what a human being would probably consider to be the best “right answers.”)

        With very-short tails, as written it might produce wrong answers if it considers only the first two occurrences (consider string 'aaba' if the tail were merely 'a' .. incorrect).   But the essential idea, I think, is still valid.

        One reason why I wrote it this way was in an effort to avoid “churning” memory when dealing with exceptionally-long strings.   You don’t need to consider any string that isn’t a tail, nor, within the head, any repeated-string candidate that won’t fit.   That subdivides the problem into two searches, both of which have only a few possibilities each.   It might not yet be bug-free, but it ought to be close . . .

Re: Finding repeat sequences.
by Eily (Monsignor) on Jun 19, 2013 at 08:18 UTC

    The easiest way to do it with is regex is to use the reversed string, because if you want to check that the end is included in the repeated pattern, you have to be able to make a reference to it.

    $_ = reverse 'abcdabcdabceabcdabcdabceab'; /^(.*)(.+?\1)\2*$/; print scalar reverse $2;
    abcdabcdabce

    Edit: it probably is faster with the pattern /^(.+?)(.*?\1)\2*$/ since it uses a minimum length string as the trailing partial repetition, instead of a maximum one (ie : it does not read the whole input and bactracks). The first one should work in any case though, the second would work only if the base string is repeated at least once, even partially.

Re: Finding repeat sequences.
by LanX (Saint) on Jun 21, 2013 at 01:03 UTC
    Still struggling to understand the task...

    is

        $str = ($pattern x $n ) . substr($pattern,0,$k)

    with

        0 <= $k < ($l = length($pattern))

    and the task is to find maximum $pattern for a given $str to fit these constraints?

    If yes, some simple mathematics should already considerably minimize the set of possible combinations you need to investigate with regexes.

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    test
    DB<109> $pattern='abcdabcdabce' => "abcdabcdabce" DB<110> $n=2,$k=2 => (2, 2) DB<111> $str = ($pattern x $n ) . substr($pattern,0,$k) => "abcdabcdabceabcdabcdabceab" DB<112> $str eq 'abcdabcdabceabcdabcdabceab' => 1

      It is to find the shortest pattern, otherwise $n==1 always.

      Correction: replaced $n=1 with $n==1

      and the task is to find maximum $pattern to fit these constraints?

      Um. I cannot see any errors in that. So yes.

      If yes, some simple mathematics should already considerably minimize the set of possible combinations you need to investigate with regexes.

      Hm. A realistic, but relatively small, example from my test harness:

      b:64000 in s: 640028748 hdb :: 24.290438 s

      L=64000, N = 10,000, K=28,740.

      But those could equally well be: L=16,000, N = 40,001, K=12,740; or (thousands*) of other permutations.

      I don't think it helps.

      (*I'm being very, very conservative; my best guess is 100s, of millions.)


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        here a regex solution which works for the shortest possible tail of length k

        DB<127> $str => "abcdabcdabceabcdabcdabceab" DB<128> $str=~/^((.+?).*)\2$/; $rest=$1, $tail=$2 => ("abcdabcdabceabcdabcdabce", "ab") DB<129> $rest =~ /^(.+?)\1*$/; $1 => "abcdabcdabce"

        needs to be extended for longer possible tails.

        But taking the dimensions of your data I doubt that regexes are appropriate.

        You could test all $patterns which repeat at least once (or x times) and calculate $k = $m % $l with $m =length ($str), and check if $str starts and ends with the same substring $tail of length $k and then check if the pattern continues repeating.

        Or start eliminating all possible $tails and check if $l of a repeating pattern is a divisor of the $rest.

        Had no time to check all the other posted solutions and don't wanna reinvent the wheel, so I better stop here! =)

        HTH

        Cheers Rolf

        ( addicted to the Perl Programming Language)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1039630]
Approved by LanX
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (4)
As of 2024-04-16 18:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found