Re: Why do these regex variants behave as they do?

by BrowserUk (Pope)
on Oct 02, 2011 at 17:15 UTC

in reply to Why do these regex variants behave as they do?

Why does regex 1, with the error, produce the desired output, while regex 2 fails to capture the terminal "g" in the desired link-text and regex3 fails almost entirely?
  1. / > ( (?: \s | \w )+ ) (?! <\td> ) /mx

    This works because the error is irrelevant, and redundant, to what is captured.

    Without it, the resultant regex / > ( (?: \s | \w )+ ) /mx still works exactly the same.

    The only place in the string where '>' is immediately followed by a space (\s) or word (\w) character is at '>Moving', so that is where the match starts.

    And the string of 1 or more space or word characters ends with the first '<'.

  2. / > ( (?: \s| \w ) + ) (?! <\/td> ) /mx says that the last captured character in the string must not be followed by </td>.

    So the regex backtracks one character and omits the 'g', which is followed by that string.

  3. / > (?: ( \s | \w ) + ) (?! <\/td> ) /mx only captures a single character because that's what it asks for.

    ( \s | \w ) says 'capture either a single space, or a single word character', so it does.

    The presence of the quantifier '+' outside the capturing parens does not change that; each repetition overwrites the capture, so only the last character matched is kept.
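For anyone who wants to see the three cases side by side, here is a minimal sketch. The sample string is an assumption of mine (the original poster's input is not reproduced in this note); any string whose text ends in 'g' immediately before </td> should behave the same way.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen so the text ends in 'g' right before </td>.
my $html = '<tr><td>Moving targets rolling</td></tr>';

# Regex 1: '\t' is a tab, so the look-ahead wants '<', TAB, 'd', '>'.
# That sequence never occurs here, so the look-ahead changes nothing.
my ($r1) = $html =~ / > ( (?: \s | \w )+ ) (?! <\td> ) /mx;

# Regex 2: the look-ahead now really tests for '</td>', so the engine
# backtracks one character to find a position not followed by it.
my ($r2) = $html =~ / > ( (?: \s | \w )+ ) (?! <\/td> ) /mx;

# Regex 3: the capture group sits inside the quantified group, so each
# repetition overwrites $1 and only the last character matched is kept.
my ($r3) = $html =~ / > (?: ( \s | \w )+ ) (?! <\/td> ) /mx;

print "regex 1: '$r1'\n";   # regex 1: 'Moving targets rolling'
print "regex 2: '$r2'\n";   # regex 2: 'Moving targets rollin'
print "regex 3: '$r3'\n";   # regex 3: 'n'
```

Note that regex 3 yields 'n' rather than 'g' because the same backtracking as in regex 2 happens first.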

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Why do these regex variants behave as they do?
by AnomalousMonk (Canon) on Oct 02, 2011 at 20:34 UTC

    Further to BrowserUk's reply, it may be helpful, particularly in the third example, to see the entirety of what is matched (seen in $&, a naughty fellow whom we normally shun) versus what is captured (to $1 from the first capture group).

    Note: in the examples below, I use the character set  [\s \w] as equivalent to  (?:\s|\w) to emphasize the character-set nature of the grouping. The presence of an extra space in the character set is used in an attempt, possibly ill-conceived, to get everything to 'line up right'; the space is redundant because it is included in the  \s 'whitespace' set.

    Note also that I have used a simplified string in the examples, and the 'closing' tag is just '<t>' and '<X>' is the incorrect closing tag (the forward/backward slashes just confuse the issue).

    >perl -wMstrict -le
    "my $s = '<t>Abcd efgh ijK<t>';
     ;;
     print qq{1a '$&' ($1)} if $s =~ m{ > ( (?:\s|\w)+ ) (?!<X>) }xms;
     print qq{1b '$&' ($1)} if $s =~ m{ > ( [\s \w]+ ) (?!<X>) }xms;
     ;;
     print qq{2a '$&' ($1)} if $s =~ m{ > ( (?:\s|\w)+ ) (?!<t>) }xms;
     print qq{2b '$&' ($1)} if $s =~ m{ > ( [\s \w]+ ) (?!<t>) }xms;
     ;;
     print qq{3a '$&' ($1)} if $s =~ m{ > (?: ( \s|\w )+ ) (?!<t>) }xms;
     print qq{3b '$&' ($1)} if $s =~ m{ > (?: ([\s \w])+ ) (?!<t>) }xms;
    "
    1a '>Abcd efgh ijK' (Abcd efgh ijK)
    1b '>Abcd efgh ijK' (Abcd efgh ijK)
    2a '>Abcd efgh ij' (Abcd efgh ij)
    2b '>Abcd efgh ij' (Abcd efgh ij)
    3a '>Abcd efgh ij' (j)
    3b '>Abcd efgh ij' (j)

    Update: Changed example code to print $& first (in single-quotes), then $1 (in parentheses, symbolic of capture) to match the order of their discussion in the text.

Re^2: Why do these regex variants behave as they do?
by ww (Bishop) on Oct 02, 2011 at 22:05 UTC

    /me slaps head;
    ...wishes he could award multiple ++ to both BrowserUk and AnomalousMonk for clear, concise and brilliantly on-target replies.

    I had played with a char class at some point in this evolution... and didn't quite nail it (didn't even come close?). But your answers made it clear that one "right" (YMMV) approach with a class is:

    #4 Non-capture (grouping) paren-pair contains the capture parens, which use a char_class
    if ( $string4 =~ />(?:([\s|\w]+))(?:<\/td>)/m ) {
        say "regex 4: $1";
    }

    Again, thanks for spotting my blind-spots!

      if ( $string4 =~ />(?:([\s|\w]+))(?:<\/td>)/m ) { ... }

      In  (?:([\s|\w]+)) the '|' (pipe) character in the character set is taken literally (i.e., the set matches any whitespace, word or '|' character) and so is probably not what you intend! Also, the non-capturing grouping is redundant:  ([\s\w]+) should work just as well. Further, the  (?:<\/td>) at the end (in which the non-capturing grouping is also redundant) requires a positive match on this sequence of characters, whereas in previous code this was a  (?!<\/td>) zero-width, negative look-ahead assertion; don't know if this difference was intended or not (Update: although on second thought it probably was intended since it was the negative look-ahead that led to the missing-final-character puzzlement).
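A small sketch of the literal-pipe point, for illustration only. The input string is hypothetical and was chosen to contain a '|' so that the two character sets behave differently; the redundant non-capturing groups are dropped in both patterns.

```perl
use strict;
use warnings;

# Hypothetical input with a literal '|' inside the cell text.
my $cell = '<td>Moving | on</td>';

# Inside [...] the '|' is an ordinary character, not alternation,
# so this set accepts whitespace, word characters AND '|'.
my ($with_pipe) = $cell =~ />([\s|\w]+)<\/td>/;

# Without the '|' the set stops at the pipe, and because '</td>' must
# follow immediately, the whole match fails on this input.
my ($without_pipe) = $cell =~ />([\s\w]+)<\/td>/;

print "with '|' in set:    ", defined $with_pipe    ? "'$with_pipe'"    : 'no match', "\n";
print "without '|' in set: ", defined $without_pipe ? "'$without_pipe'" : 'no match', "\n";
# with '|' in set:    'Moving | on'
# without '|' in set: no match
```

On input without any '|' characters the two patterns capture the same text, which is why the stray pipe is easy to overlook.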
