Syntactic Confectionery Delight PerlMonks

### Re: Finding repeat sequences.

 on Jun 18, 2013 at 21:43 UTC ( #1039673=note: print w/replies, xml ) Need Help??

in reply to Finding repeat sequences.

Here's a possible solution for what—I think—you're trying to achieve:
```#! /usr/bin/env perl
use 5.014; use warnings;

# We're going to try matching substrings of this...
my \$MASTER_STR = 'abcdabcdabceabcdabcdabceab';

# Using this pattern to find (possibly truncated) repetitions...
my \$REPETITION_PAT = qr{

\A         # Start at the beginning of the string
(.+?)      # Match minimal initial sequence (as \$1)
(?{\$^N})   # Remember it internally (as \$^R)

(?|        # Then either it's not truncated...
\1+        # Rematch it exactly as often as possible
\z         # ...until the end of the string
()         # Remember that nothing was left over (as \$2)

|          # Or else it is truncated...
\1*        # Rematch it exactly as often as possible (or not)
(.+)       # Then grab what's left (as \$2)(internally as \$^N)
\z         # ...until the end of the string

# Then verify that what's left is a proper truncation...
(??{
# If what's left (\$^N) is substring of repetition (\$^R)...
\$^N eq substr(\$^R,0,length(\$^N))
? q{}       # ...then do nothing (i.e. keep matching)
: q{(?!)}   # ...else initiate backtracking
})
)
}xms;

# Loop through increasingly truncated substrings...
for my \$substr (1..length(\$MASTER_STR)) {

# Compute the substring...
my \$str = substr(\$MASTER_STR,0,-\$substr);

# Evaluate the match...
if (\$str =~ \$REPETITION_PAT) {
say "\$str\n     Matched: \$1*\$2";
}
else {
say "\$str\n     Didn't match";
}
}
Damian

Replies are listed 'Best First'.
Re^2: Finding repeat sequences.
by DamianConway (Beadle) on Jun 19, 2013 at 22:05 UTC

Here's a significantly optimized version of my previous solution, which draws heavily on anomalous's excellent work, but only bothers to capture the initial repeated component (in \$1) as that's apparently all that's wanted.

```my \$REPETITION_PAT = qr{

\A         # Start at the beginning of the string
(.+?)      # Match minimal initial sequence (as \$1)
\1*+       # Rematch it exactly as often as possible (maybe zero)

# Then verify what's left is a proper truncation...
(?(?{ index(\$^N, substr(\$_,pos())) }) (?!) )

}xms;

It still passes BrowserUK's gauntlet, however in my own testing it's approximately 400 times faster than my previous attempt. I suspect that's still not sufficient to make it competitive with the non-regex solutions. (I know which one I'd rather maintain, though ;-)

As you can see, the key to the improvement in regex performance was—as usual—to eliminate opportunities for backtracking or capturing.

Damian

That is a very substantial improvement and it now beats tobyink's non-regex solution, but cannot touch hdb's:

```Partisipants in performance tests: hdb tobyink damianc
b:    4 in s:         6        hdb :: 0.000089 s
b:    4 in s:         6    tobyink :: 0.000336 s
b:    4 in s:         6    damianc :: 0.000081 s
b:    4 in s:        43        hdb :: 0.000070 s
b:    4 in s:        43    tobyink :: 0.000070 s
b:    4 in s:        43    damianc :: 0.000266 s
b:    4 in s:       402        hdb :: 0.000083 s
b:    4 in s:       402    tobyink :: 0.000092 s
b:    4 in s:       402    damianc :: 0.000210 s
b:    4 in s:      4000        hdb :: 0.000391 s
b:    4 in s:      4000    tobyink :: 0.000158 s
b:    4 in s:      4000    damianc :: 0.000719 s
b:   26 in s:        32        hdb :: 0.000075 s
b:   26 in s:        32    tobyink :: 0.000097 s
b:   26 in s:        32    damianc :: 0.000109 s
b:   26 in s:       260        hdb :: 0.000076 s
b:   26 in s:       260    tobyink :: 0.000153 s
b:   26 in s:       260    damianc :: 0.000124 s
b:   26 in s:      2600        hdb :: 0.000111 s
b:   26 in s:      2600    tobyink :: 0.000152 s
b:   26 in s:      2600    damianc :: 0.000182 s
b:   26 in s:     26016        hdb :: 0.000125 s
b:   26 in s:     26016    tobyink :: 0.000947 s
b:   26 in s:     26016    damianc :: 0.000620 s
b:64000 in s:    100595        hdb :: 0.006378 s
b:64000 in s:    100595    tobyink :: 6.614787 s
b:64000 in s:    100595    damianc :: 6.247666 s
b:64000 in s:    657902        hdb :: 0.017817 s
b:64000 in s:    657902    tobyink :: 48.717781 s
b:64000 in s:    657902    damianc :: 12.516777 s
b:64000 in s:   6444916        hdb :: 0.235547 s
b:64000 in s:   6444916    tobyink :: 582.471826 s
b:64000 in s:   6444916    damianc :: 179.466052 s
b:64000 in s:  64031416        hdb :: 2.763495 s
b:64000 in s:  64031416    tobyink :: 6158.077379 s
b:64000 in s:  64031416    damianc :: 2104.112186 s
{
damianc => {
4 => { 1 => "8.10623168945313e-005", 10 => "0.000266075134277344",
+ 100 => "0.000210046768188477", 1000 => "0.000719070434570313", all =
+> "0.00127625465393066", },
26 => { 1 => "0.000108957290649414", 10 => "0.000123977661132813",
+ 100 => "0.000181913375854492", 1000 => "0.000619888305664063", all =
+> "0.00103473663330078", },
64000 => { 1 => "6.2476658821106", 10 => "12.5167770385742", 100 =
+> "179.466052055359", 1000 => "2104.11218619347", all => "2302.342681
+16951", },
all => "2302.3449921608",
},
hdb => {
4 => { 1 => "8.89301300048828e-005", 10 => "7.00950622558594e-005"
+, 100 => "8.29696655273438e-005", 1000 => "0.000391006469726563", all
+ => "0.000633001327514648", },
26 => { 1 => "7.48634338378906e-005", 10 => "7.60555267333984e-005
+", 100 => "0.000110864639282227", 1000 => "0.000124931335449219", all
+ => "0.000386714935302734", },
64000 => { 1 => "0.006378173828125", 10 => "0.0178170204162598", 1
+00 => "0.235547065734863", 1000 => "2.76349496841431", all => "3.0232
+3722839355", },
all => "3.02425694465637",
},
tobyink => {
4 => { 1 => "0.000335931777954102", 10 => "7.00950622558594e-005",
+ 100 => "9.17911529541016e-005", 1000 => "0.000158071517944336", all
+=> "0.000655889511108398", },
26 => { 1 => "9.70363616943359e-005", 10 => "0.000153064727783203"
+, 100 => "0.000151872634887695", 1000 => "0.000946998596191406", all
+=> "0.00134897232055664", },
64000 => { 1 => "6.61478710174561", 10 => "48.7177810668945", 100
+=> "582.471826076508", 1000 => "6158.07737898827", all => "6795.88177
+323341", },
all => "6795.88377809525",
},
}

I'm not sure why you think it would be any easier to maintain that the non-regex solutions?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm not sure why you think it would be any easier to maintain that the non-regex solutions?

hdb's solution is extremely clever and obviously vastly superior in performance. Most impressive.

Nevertheless, the reasons I'd prefer to have to maintain the regex solution include:

• The algorithm find_substring() uses is extremely clever, perhaps even a little subtle. I always assume that's likely to be a disadvantage for future maintainability.
• And, indeed, after ten minutes of close inspection, I'm still not entirely sure I fully understand find_substring(). By Kernighan's Metric, if I'm not smart enough to understand it, I'm certainly not half smart enough to maintain it.
• Infinite loops and manually iterated string indexes always make me nervous. They are opportunities for off-by-one errors and overlooked edge-case behaviours to lurk...or to creep in when the code is subsequently updated.
• I genuinely prefer functional or declarative styles of programming. The regex solution describes exactly what it does (provided you're fluent in the regex dialect...which I am), whereas find_substring()'s implementation not at all self-explanatory (to me).
• Continuing on with the functional/imperative contrast: the regex solution uses exactly one automatically-preset read-only variable. In contrast find_substring() uses a couple of manually-assigned read-write variables. In my view that means the latter has several extra places where future well-intentioned modifications could quietly break something.
• I think that the regex solution would also be much easier to integrate into a larger parsing system, when the current simple text processing task subsequently grows more complicated (which it inevitably will).
• I can debug the behaviour of the regex-based solution visually using Regexp::Debugger. I'd have to use perl -d to debug find_substring(). <shudder>
• The find_substring() implementation somehow reminds me of my years of coding in C, and at this point I really don't need that kind of post-traumatic flashback undoing all the therapy. ;-)

Damian

Re^2: Finding repeat sequences.
by BrowserUk (Pope) on Jun 18, 2013 at 22:10 UTC

That works perfectly and does exactly what I want. Thank you.

I suspect it may run rather slowly as is -- /xms etc. -- but given the included test suite, tuning should be simple :)

(BTW: Nice to see you: to see you ... )

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Create A New User
Node Status?
node history
Node Type: note [id://1039673]
help
Chatterbox?
and John Coltrane plays...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (9)
As of 2017-05-23 20:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
My favorite model of computation is ...

Results (182 votes). Check out past polls.