http://www.perlmonks.org?node_id=1018598


in reply to Re^3: A regex that only matches at offset that are multiples of a given N? (Update:almost perfect!)
in thread A regex that only matches at offset that are multiples of a given N?

That's a possible approach that is effectively similar to simply ignoring non-aligned matched (the next if pos() % 4; stuff above), but the real saving would only be achieved by only calling out from the regex engine for aligned matches.

It seems like it ought to be possible to reposition the \G somewhere after the (?:.{$n}) to avoid the duplicates. But I've tried various places including:

#! perl -slw use strict; my $s = join '', map { ('a'..'z')[ rand 26 ] } 1 .. 1000; our $N //= 4; for my $key ( 'aa' .. 'zz' ) { print "$-[1] : $1" while $s =~ m[(?:.{$N})*(?=($key\G..))]g; }

And I cannot make it work. (The above silently matches nothing; v-e-r-y s-l-o-w-l-y!).

And that is pretty much the kind of effect I seem to get every time I try to use \G. I conclude that I have just never managed to acquire a good mental model of what it actually does; despite years of trying on and off.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
  • Comment on Re^4: A regex that only matches at offset that are multiples of a given N? (Update:almost perfect!)
  • Select or Download Code

Replies are listed 'Best First'.
Re^5: A regex that only matches at offset that are multiples of a given N? (Update:almost perfect!)
by smls (Friar) on Feb 13, 2013 at 19:29 UTC
    that is effectively similar to simply ignoring non-aligned matched

    How so? In johngg's regex, the decision which matches to select is made by the regex engine. Adding the manual pos() incrementation like so...

    m{\G(?:.{$n})*?(?=(fred.*))(?{pos()+=$n})}g
    ...would not change that, it would merely take care of the double-matching (if it works like I imagine).

    The way I understand it, the double-matching is caused by the fact that the automatic pos() incrementation that happens between m/.../g matches will in this case not jump to the right of the last (fred.*) match, but instead to the left of it, because it is within a look-ahead which counts as zero-width. And so the next round matches the same (fred.*) again, since (?:.{$n})*? also matches the empty string.

    It seems like it ought to be possible to reposition the \G somewhere after the (?:.{$n})

    perlop says: "Note also that, currently, \G is only properly supported when anchored at the very beginning of the pattern.". So, apparently that's not possible.

    every time I try to use \G. I conclude that I have just never managed to acquire a good mental model of what it actually does;

    I think a good model is to look at it as the little brother of ^. Normally, a regex could match anywhere inside a string, but prefixing the regex with ^ forces its left end to "stick" to the beginning. Similarly, a repeated m/.../g match could normally match anywhere in the part of the string that is to the right of pos(), but prefixing it with \G forces its left end to stick to pos().

    -----------

    Edit: I tested it now, and embedding the pos()-advancing code in the regex itself like shown here does not work. However, putting it into the body of the while loop (see below) does.

      that is effectively similar to simply ignoring non-aligned matched ... How so?

      Although embedding the called-back code within the regex as a zero-length code assertion looks like its not exiting the regex engine; the effect on performance shows that the need to set-up and tear down a scope and invoke that code has a very significant effect:

      $s='a'x1e6; cmpthese -1, { a => q[ $s =~ m[(.)(?{$c=pos();})]g; ], b => q[ $s =~ m[(.)]g; ], };; Rate a b a 3924/s -- -100% b 4206322/s 107088% --

      If it was possible to avoid running 3 orders of magnitude slower, the benefits would be enormous for my application.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        If it was possible to avoid running 3 orders of magnitude slower

        Well, in this benchmark the embedded code gets called on every single character in the string, while in the problem at hand it would only get called once for every correct match. Finding the match in the first place would happen completely within the regex engine.

        If you are using the regex with a while loop, you could move the pos()-incrementing into the already existing loop body to avoid creating an additional scope:

        while( $input =~ m{\G(?:.{$n})*?(?=fred(....))}g ) { pos($input) += $n; # ... # do stuff with $1 # ... }

        This could also be adapted to ensure that new occurrences of "fred" may not be matched inside the (....) that belongs to the previous match, by modifying the pos()-incrementing line like this:

        pos($input) += $n * int((length('fred'.$1) - 1) / $n + 1);