http://www.perlmonks.org?node_id=929162


in reply to Why do these regex variants behave as they do?

Why does regex 1, with the error, produce the desired output, while regex 2 fails to capture the terminal "g" in the desired link-text and regex 3 fails almost entirely?
  1. / > ( (?: \s | \w )+ ) (?! <\td> ) /mx

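    This works because the error is irrelevant, and redundant, to what is captured. In a regex, \t is a tab, so (?! <\td> ) only asserts that the match is not followed by '<', a tab, 'd', '>'. Unless that odd sequence actually occurs in the string, the assertion always succeeds.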

    Without it, the resultant regex / > ( (?: \s | \w )+ ) /mx still works exactly the same.

    The only place in the string where '>' is immediately followed by a space (\s) or word (\w) character starts at '>Moving'.

    And the string of 1 or more space or word characters ends with the first '<'.

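  2. / > ( (?: \s| \w ) + ) (?! <\/td> ) /mx says that the last captured character in the string must not be followed by </td>.

    The greedy '+' first takes everything up to the '<'; the look-ahead then fails there, so the engine backtracks one character and tries again. So the regex omits the 'g', which is the character followed by that string.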

  3. / > (?: ( \s | \w ) + ) (?! <\/td> ) /mx only captures a single character because that's what it asks for.

    ( \s | \w ) says 'capture either a single space, or a single word character', so it does.

    The presence of the quantifier '+' outside the capturing parens does not change that. Each repetition overwrites the capture, so $1 ends up holding only the last character matched.
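To see all three variants side by side, here is a minimal sketch. The sample string is an assumption (the OP's actual string is not reproduced in this reply); it is made up so that the text ends in 'g' followed by </td>.

```perl
use strict;
use warnings;

# Hypothetical sample string, NOT the OP's data: text ending in 'g',
# immediately followed by </td>.
my $html = '<tr><td>Moving along</td></tr>';

# Regex 1: "<\td>" means '<', TAB, 'd', '>'. That never occurs here,
# so the negative look-ahead always succeeds and changes nothing.
my ($r1) = $html =~ / > ( (?: \s | \w )+ ) (?! <\td> ) /mx;

# Regex 2: the look-ahead fails right after the 'g', so the greedy '+'
# gives back one character and the 'g' is lost.
my ($r2) = $html =~ / > ( (?: \s | \w )+ ) (?! <\/td> ) /mx;

# Regex 3: the capture group sits inside the quantified group, so it is
# overwritten on every repetition; only the last character survives.
my ($r3) = $html =~ / > (?: ( \s | \w )+ ) (?! <\/td> ) /mx;

printf "regex 1: [%s]\n", $r1;   # [Moving along]
printf "regex 2: [%s]\n", $r2;   # [Moving alon]
printf "regex 3: [%s]\n", $r3;   # [n]
```

The regexes are copied from the three items above; only the input and the variable names ($html, $r1 .. $r3) are mine.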


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Why do these regex variants behave as they do?
by AnomalousMonk (Archbishop) on Oct 02, 2011 at 20:34 UTC

    ww:
    Further to BrowserUk's reply, it may be helpful, particularly in the third example, to see the entirety of what is matched (seen in $&, a naughty fellow whom we normally shun) versus what is captured (to $1 from the first capture group).

    Note: in the examples below, I use the character set  [\s \w] as equivalent to  (?:\s|\w) to emphasize the character-set nature of the grouping. The presence of an extra space in the character set is used in an attempt, possibly ill-conceived, to get everything to 'line up right'; the space is redundant because it is included in the  \s 'whitespace' set.

    Note also that I have used a simplified string in the examples, and the 'closing' tag is just '<t>' and '<X>' is the incorrect closing tag (the forward/backward slashes just confuse the issue).

    >perl -wMstrict -le
    "my $s = '<t>Abcd efgh ijK<t>';
     ;;
     print qq{1a '$&' ($1)} if $s =~ m{ > ( (?:\s|\w)+ ) (?!<X>) }xms;
     print qq{1b '$&' ($1)} if $s =~ m{ > ( [\s \w]+ ) (?!<X>) }xms;
     ;;
     print qq{2a '$&' ($1)} if $s =~ m{ > ( (?:\s|\w)+ ) (?!<t>) }xms;
     print qq{2b '$&' ($1)} if $s =~ m{ > ( [\s \w]+ ) (?!<t>) }xms;
     ;;
     print qq{3a '$&' ($1)} if $s =~ m{ > (?: ( \s|\w )+ ) (?!<t>) }xms;
     print qq{3b '$&' ($1)} if $s =~ m{ > (?: ([\s \w])+ ) (?!<t>) }xms;
    "
    1a '>Abcd efgh ijK' (Abcd efgh ijK)
    1b '>Abcd efgh ijK' (Abcd efgh ijK)
    2a '>Abcd efgh ij' (Abcd efgh ij)
    2b '>Abcd efgh ij' (Abcd efgh ij)
    3a '>Abcd efgh ij' (j)
    3b '>Abcd efgh ij' (j)

    Update: Changed example code to print $& first (in single-quotes), then $1 (in parentheses, symbolic of capture) to match the order of their discussion in the text.
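    If you would rather not touch $& at all, one possible alternative (my suggestion, not something used in the code above) is the /p modifier together with ${^MATCH}, both available since Perl 5.10. A small sketch using the same simplified string, with $whole and $captured as illustrative names:

```perl
use strict;
use warnings;

my $s = '<t>Abcd efgh ijK<t>';

my ($whole, $captured);

# /p asks Perl to preserve the matched text in ${^MATCH} for this match,
# so the entire match can be inspected without using $&.
if ( $s =~ m{ > (?: ([\s\w])+ ) (?!<t>) }xmsp ) {
    $whole    = ${^MATCH};   # everything the pattern matched
    $captured = $1;          # only the last repetition of the capture group
}

printf "'%s' (%s)\n", $whole, $captured;   # '>Abcd efgh ij' (j)
```

    This corresponds to case 3b: the whole match is still '>Abcd efgh ij', while the capture holds just the final 'j'.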

Re^2: Why do these regex variants behave as they do?
by ww (Archbishop) on Oct 02, 2011 at 22:05 UTC

    /me slaps head;
    ...wishes he could award multiple ++ to both BrowserUk and AnomalousMonk for clear, concise and brilliantly on-target replies.

    I had played with a char class at some point in this evolution... and didn't quite nail it (didn't even come close?). But your answers made it clear that one "right" (YMMV) approach with a class is:

    #4 Non-capture (grouping) paren-pair contains the capture parens which
    #  use a char_class
    if ( $string4 =~ />(?:([\s|\w]+))(?:<\/td>)/m ) {
        say "regex 4: $1";
    }

    Again, thanks for spotting my blind-spots!

      if ( $string4 =~ />(?:([\s|\w]+))(?:<\/td>)/m ) { ... }

      In  (?:([\s|\w]+)) the '|' (pipe) character in the character set is taken literally (i.e., the set matches any whitespace, word or '|' character) and so is probably not what you intend! Also, the non-capturing grouping is redundant:  ([\s\w]+) should work just as well. Further, the  (?:<\/td>) at the end (in which the non-capturing grouping is also redundant) requires a positive match on this sequence of characters, whereas in previous code this was a  (?!<\/td>) zero-width, negative look-ahead assertion; don't know if this difference was intended or not (Update: although on second thought it probably was intended since it was the negative look-ahead that led to the missing-final-character puzzlement).
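      A quick way to see the literal pipe in action is to feed both forms a string that contains one. The input below is hypothetical, chosen only to expose the difference; $with_pipe and $without_pipe are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical input containing a literal '|' between the tags.
my $string4 = '<td>Moving | along</td>';

# Inside [...] the '|' is an ordinary character, so this class also
# accepts the pipe and the whole text is captured.
my ($with_pipe) = $string4 =~ />([\s|\w]+)<\/td>/;

# Without the '|', the class stops at the pipe and no '>' in the string
# is followed by a run of space/word characters ending at </td>.
my ($without_pipe) = $string4 =~ />([\s\w]+)<\/td>/;

printf "with '|' in class:    %s\n", defined $with_pipe    ? "[$with_pipe]"    : 'no match';
printf "without '|' in class: %s\n", defined $without_pipe ? "[$without_pipe]" : 'no match';
```

      On input with no pipe characters the two classes behave the same, which is why the stray '|' is easy to miss.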