Re^3: /g option not making s// find all matches (updated)

by AnomalousMonk (Archbishop)
on May 29, 2018 at 14:51 UTC ( [id://1215385]=note )


in reply to Re^2: /g option not making s// find all matches (updated)
in thread /g option not making s// find all matches

... the \G (?! \A) construct ... how it works ...

This simply asserts that \G (at the previous match point, or at the absolute start of the string on the first match) is true, and that \A (at the absolute start of the string) is not true; i.e., that \G is not matching at the start of the string. This is a little confusing in that (?! ...) is a negative look-ahead, and you may wonder how one can look ahead to the absolute start of a string. However, \G and \A are both zero-width assertions that only test the current match position, so it doesn't matter which way you look as long as you're negating. The following all work identically:
    \G (?! \A)    \G (?<! \A)    (?! \A) \G    (?<! \A) \G
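
A minimal sketch of the construct in use, assuming a made-up task that is not the one from the original thread: turn each '-' in the run directly after a leading '#' into '='. The helper name fill_after_hash and the sample strings are hypothetical. The first alternative starts the chain at the '#'; the second continues it from the end of the previous match, and the "not at \A" part keeps a string with no leading '#' from being touched.

```perl
use strict;
use warnings;

# Four spellings of "at \G, but not at the absolute start of the string".
my @not_at_start = (
    qr{ \G (?!  \A) }xms,
    qr{ \G (?<! \A) }xms,
    qr{ (?!  \A) \G }xms,
    qr{ (?<! \A) \G }xms,
);

# Hypothetical helper: replace each '-' that directly follows a leading '#'
# (or directly follows the previous replacement) with '='.
sub fill_after_hash {
    my ($continue_rx, $string) = @_;
    (my $copy = $string) =~ s{ ( \A \# | $continue_rx ) - }{$1=}xmsg;
    return $copy;
}

# Each variant is tried on a string with and without the leading '#'.
for my $rx (@not_at_start) {
    print join(' ', map { fill_after_hash($rx, $_) } '#---x---', '---x---'), "\n";
}
```

If the (?! \A) part were dropped, a bare \G would also match at the start of '---x---', and the leading run of that string would be rewritten as well.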

The /ms options seem unnecessary here; did you include them just to make the solution as generic as possible?

In line with TheDamian's regex Perl Best Practices, I always use an  /xms tail on every  qr// m// s/// expression I write. Of course, /x allows whitespace and comments; can't be bad. In addition,  ^ $ . always behave in the same way and I don't have to think about it any more; regexes are complicated enough as it is. E.g., the behavior of  . (dot) is "by default, dot matches everything except a newline unless the /s modifier is asserted, in which case dot matches everything — now let's see whether or not there's a /s around anywhere". TheDamian recommends and I prefer to always use the /s modifier and just think about the behavior of dot as "dot matches all". Period. Similarly for the  ^ $ assertions and the /m modifier. (In general, I have quite a bit of respect for TheDamian's PBPs. I don't agree with them all, but I have embraced the regex PBPs completely and wholeheartedly.)

So in answer to your direct question, the universal  /xms tail is not used to make regexes generic so much as to make them less-thought-needed-ic.
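
For anyone who wants to see the difference the /s and /m modifiers make, here is a small sketch. The sample text and the count_matches helper are mine, purely for illustration.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# Hypothetical helper: count the matches of a pattern via list-context m//g.
sub count_matches {
    my ($rx, $string) = @_;
    my $count = () = $string =~ m{$rx}g;
    return $count;
}

# Without /s, dot skips the newlines; with /s, dot matches every character.
printf "dot without /s: %d\n", count_matches(qr{.},  $text);
printf "dot with /s:    %d\n", count_matches(qr{.}s, $text);

# Without /m, ^ and $ anchor to the string; with /m, they anchor to each line.
printf "^ without /m:   %d\n", count_matches(qr{^\w+},  $text);
printf "^ with /m:      %d\n", count_matches(qr{^\w+}m, $text);
printf "\$ without /m:   %d\n", count_matches(qr{\w+$},  $text);
printf "\$ with /m:      %d\n", count_matches(qr{\w+$}m, $text);
```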


Give a man a fish:  <%-{-{-{-<

Replies are listed 'Best First'.
Re^4: /g option not making s// find all matches (updated)
by raygun (Scribe) on May 31, 2018 at 08:35 UTC
    The following all work identically:
    \G (?! \A) \G (?<! \A) (?! \A) \G (?<! \A) \G
    That's brilliant and kind of hurts my brain... :-)
    In line with TheDamian's regex Perl Best Practices, I always use an /xms tail on every qr// m// s/// expression I write.
    Thanks for the pointer to that. I'll look more into it. In general, I like to use default behaviors unless I need to do something that the default can't accomplish. Then the mechanism to override the default (appending /xms in this case) becomes part of the code's self-documentation, alerting the reader that something outside the norm is happening. (That philosophy fails if some program or interface's defaults are insane, but in my experience, perl's are pretty solid.)

    Also, from a readability standpoint, if you have ten regexes, nine of which end in / and the tenth ending in /m, it's easy to see at a glance that the tenth is doing something outside the default. But if you define your default to be /xms, in code with nine regexes ending in /xms and a tenth ending in /xs, the reader is much more likely to overlook the fact that the tenth instance is overriding the local default.

    But, again, I say all this without having digested the rationale for TheDamian's recommendations, so it's all FWIW.
      I like to use default behaviors unless I need to do something that the default can't accomplish.

      IMHO this is a good philosophy and I wholeheartedly condone it.

      Caveats about house coding rules aside, when working on previously unseen code, if I come across a regex with, say, /xms and there's no whitespace, no dots and no anchors in it, that can cause confusion. What has happened to it? Did it have such things previously and they were edited out? Does the coder not know or understand what the modifiers mean? Is there something in the regex which might change at runtime to make the modifiers useful?

      Having read and understood TheDamian's rationale for this I respectfully disagree with it. The beauty of TIMTOWTDI is that everyone can code in the way that they think best. So let's embrace the diversity where it exists for such good reasons.

      ... if you define your default to be /xms, in code with nine regexes ending in /xms and a tenth ending in /xs ...

      But that's the point: you never use anything other than an /xms tail in your own code. If you're dealing with someone else's code, you're on your own, and you may have much bigger problems than just regexes to contend with; that's the way of the world.


      Give a man a fish:  <%-{-{-{-<

        But that's the point: you never use anything other than an /xms tail in your own code.
        I dunno, that seems unnecessarily rigid. A "best practice" should mean "do this unless there's a good reason to do otherwise," not "always blindly do this no matter what." All three of those options modify the regex behavior. What if you need the unmodified behavior?

        I realize "need" might be too strong a word; with /x, for instance, you can always just escape any literal space characters your regex needs. But if there are several of them, and your regex is otherwise simple, all those escapes clutter the code more than just omitting the /x. And someone else encountering m/a\ b\ c/xms in your code will wonder what tricky thing you're trying to do by telling the regex engine to ignore whitespace and then escaping all your whitespace.
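
To make the /x trade-off concrete, here is a rough sketch with made-up strings. Under /x, unescaped whitespace in the pattern is ignored, so a literal space has to be escaped or put in a bracketed class.

```perl
use strict;
use warnings;

# Each entry: [ description, 1 if the match succeeded, else 0 ].
my @results = (
    [ q{'a b c' =~ m/a b c/},          ('a b c' =~ m/a b c/          ? 1 : 0) ],
    [ q{'a b c' =~ m/a b c/x},         ('a b c' =~ m/a b c/x         ? 1 : 0) ],
    [ q{'abc'   =~ m/a b c/x},         ('abc'   =~ m/a b c/x         ? 1 : 0) ],
    [ q{'a b c' =~ m/a\ b\ c/x},       ('a b c' =~ m/a\ b\ c/x       ? 1 : 0) ],
    [ q{'a b c' =~ m/a [ ] b [ ] c/x}, ('a b c' =~ m/a [ ] b [ ] c/x ? 1 : 0) ],
);

printf "%-32s %s\n", $_->[0], $_->[1] ? 'matches' : 'no match' for @results;
```

The second and third lines are the ones that can surprise a reader: with /x, the pattern "a b c" really means "abc".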

        I'm not saying your system is wrong — I'm certain it serves you well — but I don't think I'm sold on it yet.

      AnomalousMonk is probably a better Perl hacker than I am and TheDamian most certainly is but I side with hippo here; hey look another hacker who is better than I. :P I take the regex/substitution flags to be meaningful in the context of the code. If they are not, it might be confusing or waste my time trying to confirm why they are not.

      \G (?<! \A)   (?! \A) \G ... kind of hurts my brain ...

      I got to wondering about all that and thought I might try to clarify it a bit, if only for my own benefit. Say we have the problem "match (and capture) the first  \w character that is not at the start of the string and that is also on a  \b boundary." From the foregoing discussion,  m{ (?! \A) \b (\w) }xms does the trick:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "print qq{'$1'} if 'ab-cd' =~ m{ (?! \A) (\w) }xms;
       print qq{'$1'} if 'ab-cd' =~ m{ \b (\w) }xms;
       print qq{'$1'} if 'ab-cd' =~ m{ (?! \A) \b (\w) }xms;
       print qq{'$1'} if 'ab-cd' =~ m{ \b (?! \A) (\w) }xms;
      "
      'b'
      'a'
      'c'
      'c'

      Leaving out either zero-width assertion makes the match incorrect ('b' or 'a' instead of 'c'). The order of the two assertions doesn't matter because it's a logical conjunction, and if there are no side-effects (and there aren't: we're just examining the match position, not matching and consuming any characters, i.e., not changing the match position), then "A and B" and "B and A" are equivalent expressions.

      So what about the (?! \A) versus (?<! \A) look-ahead/behind business? Here's how I think of it: If you're at the North Pole, in which direction do you have to go to get to the North Pole? The question is moot: You can go exactly zero meters in any direction because you're at the North Pole! Similarly, if your match position is at the start of a string, in which direction do you have to "look" to "see" that you are at the start of the string? All you have to do is examine the match position; "direction" is meaningless. For the \A zero-width assertion, \A, (?= \A) and (?<= \A) are all exactly equivalent. The same reasoning applies to negated assertions: (?! \A) and (?<! \A) are equivalent. Indeed, I think the same reasoning applies to all zero-width assertions. Here's a Test::More demo to bolster your confidence:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "use Test::More 'no_plan';
       use Test::NoWarnings;
       ;;
       my @regexes = (
         'negative look-ahead to \A',
         qr{ (?! \A) \b (\w) }xms,
         qr{ \b (?! \A) (\w) }xms,
         qr{ (?! \A) (?! \B) (\w) }xms,
         qr{ (?! \B) (?! \A) (\w) }xms,
         qr{ (?! \A) (?<! \B) (\w) }xms,
         qr{ (?<! \B) (?! \A) (\w) }xms,
         'negative look-behind to \A',
         qr{ (?<! \A) \b (\w) }xms,
         qr{ \b (?<! \A) (\w) }xms,
         qr{ (?<! \A) (?! \B) (\w) }xms,
         qr{ (?! \B) (?<! \A) (\w) }xms,
         qr{ (?<! \A) (?<! \B) (\w) }xms,
         qr{ (?<! \B) (?<! \A) (\w) }xms,
         'all together now',
         qr{ \b (?! \A) (?! \B) (?<! \A) (?<! \B) (\w) }xms,
         );
       ;;
       REGEX:
       for my $rx (@regexes) {
         if (ref $rx ne 'Regexp') {
           note $rx;
           next REGEX;
           }
         'ab-cd' =~ $rx;
         ok $1 eq 'c', qq{$rx works};
         }
       ;;
       done_testing;
      "
      # negative look-ahead to \A
      ok 1 - (?msx-i: (?! \A) \b (\w) ) works
      ok 2 - (?msx-i: \b (?! \A) (\w) ) works
      ok 3 - (?msx-i: (?! \A) (?! \B) (\w) ) works
      ok 4 - (?msx-i: (?! \B) (?! \A) (\w) ) works
      ok 5 - (?msx-i: (?! \A) (?<! \B) (\w) ) works
      ok 6 - (?msx-i: (?<! \B) (?! \A) (\w) ) works
      # negative look-behind to \A
      ok 7 - (?msx-i: (?<! \A) \b (\w) ) works
      ok 8 - (?msx-i: \b (?<! \A) (\w) ) works
      ok 9 - (?msx-i: (?<! \A) (?! \B) (\w) ) works
      ok 10 - (?msx-i: (?! \B) (?<! \A) (\w) ) works
      ok 11 - (?msx-i: (?<! \A) (?<! \B) (\w) ) works
      ok 12 - (?msx-i: (?<! \B) (?<! \A) (\w) ) works
      # all together now
      ok 13 - (?msx-i: \b (?! \A) (?! \B) (?<! \A) (?<! \B) (\w) ) works
      1..13
      ok 14 - no warnings
      1..14
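
The demo above only exercises the negated forms inside a larger pattern. As a complementary sketch, the bare positive and negated forms can be compared by listing every offset at which each one holds. The helper name match_offsets is an assumption of mine; it relies on the usual behavior of m//g with zero-length matches, which advances through the string one position at a time.

```perl
use strict;
use warnings;

# Hypothetical helper: return every offset in $string at which the
# zero-width pattern $rx matches.
sub match_offsets {
    my ($rx, $string) = @_;
    my @offsets;
    while ($string =~ m{$rx}g) {
        push @offsets, pos($string);
    }
    return @offsets;
}

my @patterns = (
    qr{ \A       }xms,   # plain assertion
    qr{ (?=  \A) }xms,   # positive look-ahead
    qr{ (?<= \A) }xms,   # positive look-behind
    qr{ (?!  \A) }xms,   # negative look-ahead
    qr{ (?<! \A) }xms,   # negative look-behind
);

# The first three should list the same offsets, as should the last two.
for my $rx (@patterns) {
    printf "%-20s holds at: %s\n", $rx, join(' ', match_offsets($rx, 'ab-cd'));
}
```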


      Give a man a fish:  <%-{-{-{-<

