Re: Finding repeat sequences.

Here's a possible solution for what—I think—you're trying to achieve:

#! /usr/bin/env perl
use 5.014; use warnings;

# We're going to try matching substrings of this...
my $MASTER_STR = 'abcdabcdabceabcdabcdabceab';

# Using this pattern to find (possibly truncated) repetitions...
my $REPETITION_PAT = qr{

    \A         # Start at the beginning of the string
    (.+?)      # Match minimal initial sequence (as $1)
    (?{$^N})   # Remember it internally (as $^R)

    (?|        # Then either it's not truncated...
        \1+        # Rematch it exactly as often as possible
        \z         # ...until the end of the string
        ()         # Remember that nothing was left over (as $2)

    |          # Or else it is truncated...
        \1*        # Rematch it exactly as often as possible (or not)
        (.+)       # Then grab what's left (as $2)(internally as $^N)
        \z         # ...until the end of the string

        # Then verify that what's left is a proper truncation...
        (??{
            # If what's left ($^N) is substring of repetition ($^R)...
            $^N eq substr($^R,0,length($^N))
                ? q{}       # ...then do nothing (i.e. keep matching)
                : q{(?!)}   # ...else initiate backtracking
        })
    )
}xms;

# Loop through increasingly truncated substrings...
for my $substr (1..length($MASTER_STR)) {

    # Compute the substring...
    my $str = substr($MASTER_STR,0,-$substr);

    # Evaluate the match...
    if ($str =~ $REPETITION_PAT) {
        say "$str\n     Matched: $1*$2";
    }
    else {
        say "$str\n     Didn't match";
    }
}
[download]

Damian

Comment on Re: Finding repeat sequences. Download Code

Replies are listed 'Best First'.
Re^2: Finding repeat sequences. by DamianConway (Beadle) on Jun 19, 2013 at 22:05 UTC
Here's a significantly optimized version of my previous solution, which draws heavily on anomalous's excellent work, but only bothers to capture the initial repeated component (in `$1`) as that's apparently all that's wanted. `my $REPETITION_PAT = qr{ \A # Start at the beginning of the string (.+?) # Match minimal initial sequence (as $1) \1*+ # Rematch it exactly as often as possible (maybe zero) # Then verify what's left is a proper truncation... (?(?{ index($^N, substr($_,pos())) }) (?!) ) }xms;` [download] It still passes BrowserUK's gauntlet, however in my own testing it's approximately 400 times faster than my previous attempt. I suspect that's still not sufficient to make it competitive with the non-regex solutions. (I know which one I'd rather maintain, though ;-) As you can see, the key to the improvement in regex performance was—as usual—to eliminate opportunities for backtracking or capturing. Damian	[reply] [d/l] [select]
Re^3: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 20, 2013 at 01:08 UTC
That is a very substantial improvement and it now beats tobyink's non-regex solution, but cannot touch hdb's: Partisipants in performance tests: hdb tobyink damianc b: 4 in s: 6 hdb :: 0.000089 s b: 4 in s: 6 tobyink :: 0.000336 s b: 4 in s: 6 damianc :: 0.000081 s b: 4 in s: 43 hdb :: 0.000070 s b: 4 in s: 43 tobyink :: 0.000070 s b: 4 in s: 43 damianc :: 0.000266 s b: 4 in s: 402 hdb :: 0.000083 s b: 4 in s: 402 tobyink :: 0.000092 s b: 4 in s: 402 damianc :: 0.000210 s b: 4 in s: 4000 hdb :: 0.000391 s b: 4 in s: 4000 tobyink :: 0.000158 s b: 4 in s: 4000 damianc :: 0.000719 s b: 26 in s: 32 hdb :: 0.000075 s b: 26 in s: 32 tobyink :: 0.000097 s b: 26 in s: 32 damianc :: 0.000109 s b: 26 in s: 260 hdb :: 0.000076 s b: 26 in s: 260 tobyink :: 0.000153 s b: 26 in s: 260 damianc :: 0.000124 s b: 26 in s: 2600 hdb :: 0.000111 s b: 26 in s: 2600 tobyink :: 0.000152 s b: 26 in s: 2600 damianc :: 0.000182 s b: 26 in s: 26016 hdb :: 0.000125 s b: 26 in s: 26016 tobyink :: 0.000947 s b: 26 in s: 26016 damianc :: 0.000620 s b:64000 in s: 100595 hdb :: 0.006378 s b:64000 in s: 100595 tobyink :: 6.614787 s b:64000 in s: 100595 damianc :: 6.247666 s b:64000 in s: 657902 hdb :: 0.017817 s b:64000 in s: 657902 tobyink :: 48.717781 s b:64000 in s: 657902 damianc :: 12.516777 s b:64000 in s: 6444916 hdb :: 0.235547 s b:64000 in s: 6444916 tobyink :: 582.471826 s b:64000 in s: 6444916 damianc :: 179.466052 s b:64000 in s: 64031416 hdb :: 2.763495 s b:64000 in s: 64031416 tobyink :: 6158.077379 s b:64000 in s: 64031416 damianc :: 2104.112186 s { damianc => { 4 => { 1 => "8.10623168945313e-005", 10 => "0.000266075134277344", + 100 => "0.000210046768188477", 1000 => "0.000719070434570313", all = +> "0.00127625465393066", }, 26 => { 1 => "0.000108957290649414", 10 => "0.000123977661132813", + 100 => "0.000181913375854492", 1000 => "0.000619888305664063", all = +> "0.00103473663330078", }, 64000 => { 1 => "6.2476658821106", 10 => "12.5167770385742", 100 = +> "179.466052055359", 1000 => "2104.11218619347", all => "2302.342681 +16951", }, all => "2302.3449921608", }, hdb => { 4 => { 1 => "8.89301300048828e-005", 10 => "7.00950622558594e-005" +, 100 => "8.29696655273438e-005", 1000 => "0.000391006469726563", all + => "0.000633001327514648", }, 26 => { 1 => "7.48634338378906e-005", 10 => "7.60555267333984e-005 +", 100 => "0.000110864639282227", 1000 => "0.000124931335449219", all + => "0.000386714935302734", }, 64000 => { 1 => "0.006378173828125", 10 => "0.0178170204162598", 1 +00 => "0.235547065734863", 1000 => "2.76349496841431", all => "3.0232 +3722839355", }, all => "3.02425694465637", }, tobyink => { 4 => { 1 => "0.000335931777954102", 10 => "7.00950622558594e-005", + 100 => "9.17911529541016e-005", 1000 => "0.000158071517944336", all +=> "0.000655889511108398", }, 26 => { 1 => "9.70363616943359e-005", 10 => "0.000153064727783203" +, 100 => "0.000151872634887695", 1000 => "0.000946998596191406", all +=> "0.00134897232055664", }, 64000 => { 1 => "6.61478710174561", 10 => "48.7177810668945", 100 +=> "582.471826076508", 1000 => "6158.07737898827", all => "6795.88177 +323341", }, all => "6795.88377809525", }, } [download] I'm not sure why you think it would be any easier to maintain that the non-regex solutions? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^4: Finding repeat sequences. by DamianConway (Beadle) on Jun 20, 2013 at 08:02 UTC
I'm not sure why you think it would be any easier to maintain that the non-regex solutions? hdb's solution is extremely clever and obviously vastly superior in performance. Most impressive. Nevertheless, the reasons I'd prefer to have to maintain the regex solution include: The algorithm `find_substring()` uses is extremely clever, perhaps even a little subtle. I always assume that's likely to be a disadvantage for future maintainability. And, indeed, after ten minutes of close inspection, I'm still not entirely sure I fully understand `find_substring()`. By Kernighan's Metric, if I'm not smart enough to understand it, I'm certainly not half smart enough to maintain it. Infinite loops and manually iterated string indexes always make me nervous. They are opportunities for off-by-one errors and overlooked edge-case behaviours to lurk...or to creep in when the code is subsequently updated. I genuinely prefer functional or declarative styles of programming. The regex solution describes exactly what it does (provided you're fluent in the regex dialect...which I am), whereas `find_substring()`'s implementation not at all self-explanatory (to me). Continuing on with the functional/imperative contrast: the regex solution uses exactly one automatically-preset read-only variable. In contrast `find_substring()` uses a couple of manually-assigned read-write variables. In my view that means the latter has several extra places where future well-intentioned modifications could quietly break something. I think that the regex solution would also be much easier to integrate into a larger parsing system, when the current simple text processing task subsequently grows more complicated (which it inevitably will). I can debug the behaviour of the regex-based solution visually using Regexp::Debugger. I'd have to use `perl -d` to debug `find_substring()`. <shudder> The `find_substring()` implementation somehow reminds me of my years of coding in C, and at this point I really don't need that kind of post-traumatic flashback undoing all the therapy. ;-) Damian	[reply] [d/l] [select]
Re^5: Finding repeat sequences. by hdb (Monsignor) on Jun 20, 2013 at 09:00 UTC
Re^6: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 21, 2013 at 16:26 UTC
Some notes below your chosen depth have not been shown here
Re^5: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 20, 2013 at 15:11 UTC
Re^6: Finding repeat sequences. by DamianConway (Beadle) on Jun 20, 2013 at 22:56 UTC
Some notes below your chosen depth have not been shown here
Re^2: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 18, 2013 at 22:10 UTC
That works perfectly and does exactly what I want. Thank you. I suspect it may run rather slowly as is -- /xms etc. -- but given the included test suite, tuning should be simple :) (BTW: Nice to see you: to see you ... ) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]

In Section Seekers of Perl Wisdom