in reply to Re: On zero-width negative lookahead assertions
in thread On zero-width negative lookahead assertions

That works, and I thank you for explaining why. Unfortunately, I can't understand why, in the first case, the parentheses match the leading space, and why putting the \S makes it match, even if there is no non-space character at the end...

I would be glad if you (or anyone else) could explain that further. I think I'll then discover what I misunderstood about zero-width negative lookahead (zwnla) assertions.

Thanks a lot!

Ciao!
--bronto


In theory, there is no difference between theory and practice. In practice, there is.

How backtracking works in regular expressions
by ikegami (Pope) on Sep 10, 2004 at 15:33 UTC

    Note: Perl regexp matching is not necessarily implemented as described below. I'm totally ignorant as to how it is actually implemented. One could say this document describes the specs rather than the implementation.

    It has nothing to do with lookaheads, really. For example, let's look at
    /^ab*bc/

    The regexp can be read as:
    1. Starting at the beginning of the string
    2. Match an 'a'.
    3. Match as many 'b's as possible, but not matching any is ok.
    4. Match a 'b'.
    5. Match a 'c'.

    Match against 'abbbbbbc'
                   01234567
    1) ok! pos = 0. (zw)
    2) ok! Found an 'a' at pos 0. pos = 1.
    3) ok! Found 6 'b's at pos 1 through 6. pos = 7.
    4) fail! Did not find a 'b' at pos 7. Backtrack!
    3) ok! Found 5 'b's at pos 1 through 5. pos = 6.
    4) ok! Found a 'b' at pos 6. pos = 7.
    5) ok! Found a 'c' at pos 7. pos = 8.
    Match!
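    The walk-through above can be checked with a small sketch. It assumes one change to the pattern: capturing parentheses are added around b* so we can see how many 'b's it keeps once backtracking is done. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $str = 'abbbbbbc';

# In list context, a successful match returns its captures.
my ($kept) = $str =~ /^a(b*)bc/;
my $end    = $+[0];    # offset just past the whole match

# b* first grabs all 6 'b's, then gives one back so the literal 'b' can match.
printf "b* kept %d of the 6 'b's; match ends at pos %d\n", length($kept), $end;
```

    This should report 5 'b's kept and an end position of 8, which agrees with the trace.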

    Something similar is occurring with your
    /^root:\s*(?!email)/

    The regexp can be read as:
    1. Starting at the beginning of the string
    2. Match 'root:'.
    3. Match as many '\s's as possible, but not matching any is ok.
    4. Match something other than 'email'.

    Match against 'root: email'
                   01234567890
    1) ok! pos = 0. (zw)
    2) ok! Found a 'root:' at pos 0 through 4. pos = 5.
    3) ok! Found 1 '\s' at pos 5. pos = 6.
    4) fail! Found 'email' at pos 6 through 10. Backtrack!
    3) ok! Found 0 '\s' at pos 5. pos = 5. (zw)
    4) ok! Found something other than 'email' at pos 5. pos = 5. (zwla) (found ' email')
    Match!
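    To see that \s* really ends up with zero characters here, one can wrap it in capturing parentheses. This is only a sketch: the capture group is an addition for inspection and is not part of the original pattern.

```perl
use strict;
use warnings;

my $line = 'root: email';

# Capture what \s* finally keeps after the lookahead has forced a backtrack.
my ($ws) = $line =~ /^root:(\s*)(?!email)/;
my $end  = $+[0];    # offset just past the whole match

printf "matched; whitespace kept: %d char(s); match ends at pos %d\n",
    length($ws), $end;
```

    The match succeeds with an empty capture and ends at pos 5, right after 'root:', as in the trace.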

    Now let's look at my solution
    /^root:\s*(?!email)\S/

    The regexp can be read as:
    1. Starting at the beginning of the string
    2. Match 'root:'.
    3. Match as many '\s's as possible, but not matching any is ok.
    4. Match something other than 'email'.
    5. Match a '\S'.

    Match against 'root: email'
                   01234567890
    1) ok! pos = 0. (zw)
    2) ok! Found a 'root:' at pos 0 through 4. pos = 5.
    3) ok! Found 1 '\s' at pos 5. pos = 6.
    4) fail! Found 'email' at pos 6 through 10. Backtrack!
    3) ok! Found 0 '\s' at pos 5. pos = 5. (zw)
    4) ok! Found something other than 'email' at pos 5. pos = 5. (zwla) (found ' email')
    5) fail! Did not find a '\S' at pos 5. Backtrack!
    Nothing more to try. No match!

    Match against 'root: hisemail'
                   01234567890123
    1) ok! pos = 0. (zw)
    2) ok! Found a 'root:' at pos 0 through 4. pos = 5.
    3) ok! Found 1 '\s' at pos 5. pos = 6.
    4) ok! Found something other than 'email' at pos 6. pos = 6. (zwla) (found 'hisemail')
    5) ok! Found a '\S' at pos 6. pos = 7.
    Match!
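    Both traces can be reproduced with a short test loop. This is a minimal sketch; the %result hash is just a convenient place to keep the outcome.

```perl
use strict;
use warnings;

my $re = qr/^root:\s*(?!email)\S/;

my %result;
for my $line ('root: email', 'root: hisemail') {
    # The trailing \S forces \s* to keep all the whitespace, so the
    # lookahead is tested right where the address starts.
    $result{$line} = $line =~ $re ? 'match' : 'no match';
    print "'$line': $result{$line}\n";
}
```

    The first line should report no match and the second a match.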

    Backtracking means (this might not be an exhaustive list):

    In the case of the first rule,
        look for a match further on.
    In the case of a * rule or ? rule,
        try matching less.
    In the case of a *? rule or ?? rule,
        try matching more.
    In the case of a | or [] rule,
        try matching the next choice.
    Else,
        no match, so backtrack the last matching rule.
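    Each kind of backtracking in the list above can be observed from the outside by capturing what a quantifier or alternation ends up with. The strings and patterns below are made up for illustration and do not come from the thread.

```perl
use strict;
use warnings;

# First rule: 'a' cannot start a match at pos 0 or 1, so the engine moves on.
my $start = 'xxab' =~ /ab/ ? $-[0] : -1;     # 2

# Greedy *: takes all three 'a's, then gives one back so 'ab' can match.
my ($greedy) = 'aaab' =~ /^(a*)ab/;          # 'aa'

# Lazy *?: starts with nothing and takes more until 'b' can match.
my ($lazy) = 'aaab' =~ /^(a*?)b/;            # 'aaa'

# Alternation: 'ca' is tried first, 'at' then fails, so 'c' is tried next.
my ($alt) = 'cat' =~ /^(ca|c)at/;            # 'c'

print "start=$start greedy='$greedy' lazy='$lazy' alt='$alt'\n";
```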
Re^3: On zero-width negative lookahead assertions
by Eimi Metamorphoumai (Deacon) on Sep 10, 2004 at 14:46 UTC
    The regexp engine will match if it can find any way to. So what you're asking for is "root, followed by some number (possibly zero) of whitespace characters, followed by something that is not 'admin@somewhere.here'". So it matches with root, followed by zero spaces, followed by ' admin@somewhere.here' (with a leading space). Since the string ' admin@somewhere.here' isn't 'admin@somewhere.here' (without the space), the lookahead works. That's why you need the \s* inside the lookahead, making it "try to find spaces followed by admin@somewhere.here, and if you can, fail" instead of "look for spaces, but make sure it's not followed by admin@somewhere.here". Subtle, but important.
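    The difference between the two placements of \s* can be seen side by side. This is a sketch: 'someone@else.example' is a made-up address used only as a line that should be accepted.

```perl
use strict;
use warnings;

my $outside = qr/^root:\s*(?!admin\@somewhere\.here)/;   # \s* before the lookahead
my $inside  = qr/^root:(?!\s*admin\@somewhere\.here)/;   # \s* inside the lookahead

my %hits;
for my $line ('root: admin@somewhere.here', 'root: someone@else.example') {
    $hits{$line} = [ ($line =~ $outside ? 1 : 0), ($line =~ $inside ? 1 : 0) ];
    printf "%-30s outside=%d inside=%d\n", "'$line'", @{ $hits{$line} };
}
```

    With \s* outside, the admin line still matches because \s* can back off to zero spaces. With \s* inside, the lookahead sees the spaces and the address together, so the admin line is rejected.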
      Not exactly. It is not

      "followed by something that is not 'admin@somewhere.here'"

      but rather

      "not followed by 'admin@somewhere.here'".

      That makes a difference, because it also matches if nothing follows at all.
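      A quick sketch of that difference, using the shorter 'email' pattern from elsewhere in this thread. The second pattern, with a trailing dot, is an illustrative stand-in for "followed by something that is not 'email'".

```perl
use strict;
use warnings;

my $bare = 'root:';    # nothing at all after the colon

# A negative lookahead is satisfied when nothing follows.
my $lookahead_only = $bare =~ /^root:\s*(?!email)/  ? 1 : 0;   # 1

# Requiring an actual character after the lookahead is not.
my $needs_a_char   = $bare =~ /^root:\s*(?!email)./ ? 1 : 0;   # 0

print "lookahead only: $lookahead_only, needs a char: $needs_a_char\n";
```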

      Uhmmmmm... so the old adage that "* is greedy" has an exception when zwnla assertions come into play; I expected the \s* to eat all the whitespace before the e-mail address. OK. Now I still have to understand why that \S thing works...

      Oh, by the way, I am doing:

      perl -i.bak -pe 'BEGIN { $status = 0 } /^root:(?!\s*admin\@somewhere\.here\s*$)/ and $status = 1 ; END { exit $status }' aliases

      and it seems to work great!

      Ciao!
      --bronto


      In theory, there is no difference between theory and practice. In practice, there is.
        so the old adage that "* is greedy" has an exception
        No, it is always greedy, but its greed is not absolute. It will eat as much as it can, but if that results in failure to match, then it will relinquish some of what it ate (try not to picture that) to allow the whole expression to match. Greed (and the anti-greed of minimal-matching) is tempered by persistence in regexen.

        Recently, hv wrote a tutorial explaining the rules the regex engine uses in trying to find a match.


        Caution: Contents may have been coded under pressure.
        To make it behave as you describe, use (?>\s*). The (?> ) says whatever is in it will match whatever it would match at that point in the string as an independent expression. So if matching all the spaces makes something later on fail, it won't backtrack and try having the \s* match fewer spaces.
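        A minimal sketch of the atomic group in action, again using the shorter 'email' pattern from this thread rather than the full address:

```perl
use strict;
use warnings;

# (?>\s*) keeps whatever whitespace it grabbed and never gives it back.
my $re = qr/^root:(?>\s*)(?!email)/;

my %atomic;
for my $line ('root: email', 'root: hisemail', 'root:') {
    $atomic{$line} = $line =~ $re ? 1 : 0;
    printf "%-18s %s\n", "'$line'", $atomic{$line} ? 'match' : 'no match';
}
```

        Here 'root: email' no longer matches, because the engine cannot retry with zero spaces. A bare 'root:' still matches, since the lookahead is satisfied when nothing follows.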

        (It's really time to unmark all of the extensions as experimental, except perhaps how variables in (?{}) and (??{}) bind.)

Re^3: On zero-width negative lookahead assertions
by Anonymous Monk on Sep 10, 2004 at 14:54 UTC
    There is a non-space character after the \s*. The (?!) part is a zero-width assertion. Zero-width means just that: it doesn't consume any of the string to match. Instead of using the \S, one could also have used:
    /root:(?>\s*)(?!...)/