Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Re^3: A regex that only matches at offset that are multiples of a given N? (Update:almost perfect!)

by smls (Friar)
on Feb 13, 2013 at 17:52 UTC ( #1018588=note: print w/ replies, xml ) Need Help??


in reply to Re^2: A regex that only matches at offset that are multiples of a given N? (Update:almost perfect!)
in thread A regex that only matches at offset that are multiples of a given N?

Maybe you could avoid the double matching, by manually incrementing pos() after each successful match - either inside the body of the while loop, or at the end of the regex using (?{pos() += 4}) ?

(Just an idea, haven't tested it.)


Comment on Re^3: A regex that only matches at offset that are multiples of a given N? (Update:almost perfect!)
Download Code
Re^4: A regex that only matches at offset that are multiples of a given N? (Update:almost perfect!)
by BrowserUk (Pope) on Feb 13, 2013 at 18:45 UTC

    That's a possible approach that is effectively similar to simply ignoring non-aligned matched (the next if pos() % 4; stuff above), but the real saving would only be achieved by only calling out from the regex engine for aligned matches.

    It seems like it ought to be possible to reposition the \G somewhere after the (?:.{$n}) to avoid the duplicates. But I've tried various places including:

    #! perl -slw use strict; my $s = join '', map { ('a'..'z')[ rand 26 ] } 1 .. 1000; our $N //= 4; for my $key ( 'aa' .. 'zz' ) { print "$-[1] : $1" while $s =~ m[(?:.{$N})*(?=($key\G..))]g; }

    And I cannot make it work. (The above silently matches nothing; v-e-r-y s-l-o-w-l-y!).

    And that is pretty much the kind of effect I seem to get every time I try to use \G. I conclude that I have just never managed to acquire a good mental model of what it actually does; despite years of trying on and off.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      that is effectively similar to simply ignoring non-aligned matched

      How so? In johngg's regex, the decision which matches to select is made by the regex engine. Adding the manual pos() incrementation like so...

      m{\G(?:.{$n})*?(?=(fred.*))(?{pos()+=$n})}g
      ...would not change that, it would merely take care of the double-matching (if it works like I imagine).

      The way I understand it, the double-matching is caused by the fact that the automatic pos() incrementation that happens between m/.../g matches will in this case not jump to the right of the last (fred.*) match, but instead to the left of it, because it is within a look-ahead which counts as zero-width. And so the next round matches the same (fred.*) again, since (?:.{$n})*? also matches the empty string.

      It seems like it ought to be possible to reposition the \G somewhere after the (?:.{$n})

      perlop says: "Note also that, currently, \G is only properly supported when anchored at the very beginning of the pattern.". So, apparently that's not possible.

      every time I try to use \G. I conclude that I have just never managed to acquire a good mental model of what it actually does;

      I think a good model is to look at it as the little brother of ^. Normally, a regex could match anywhere inside a string, but prefixing the regex with ^ forces its left end to "stick" to the beginning. Similarly, a repeated m/.../g match could normally match anywhere in the part of the string that is to the right of pos(), but prefixing it with \G forces its left end to stick to pos().

      -----------

      Edit: I tested it now, and embedding the pos()-advancing code in the regex itself like shown here does not work. However, putting it into the body of the while loop (see below) does.

        that is effectively similar to simply ignoring non-aligned matched ... How so?

        Although embedding the called-back code within the regex as a zero-length code assertion looks like its not exiting the regex engine; the effect on performance shows that the need to set-up and tear down a scope and invoke that code has a very significant effect:

        $s='a'x1e6; cmpthese -1, { a => q[ $s =~ m[(.)(?{$c=pos();})]g; ], b => q[ $s =~ m[(.)]g; ], };; Rate a b a 3924/s -- -100% b 4206322/s 107088% --

        If it was possible to avoid running 3 orders of magnitude slower, the benefits would be enormous for my application.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1018588]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (10)
As of 2014-07-28 23:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (210 votes), past polls