PerlMonks  

Why do these regex variants behave as they do?

by ww (Bishop)
on Oct 02, 2011 at 16:26 UTC ( #929160=perlquestion )
ww has asked for the wisdom of the Perl Monks concerning the following question:

OfficeLinebacker asked (in "Question why this Regex isn't matching") for a regex to extract the link text from some (HTML) data and was correctly advised to use a module instead. In fact, /me was one of those with that advice... but the OP stuck in my mind and today I got to playing around with some possibilities.

Naturally, I screwed up a few times... that's no great surprise -- "Typos is use," just for starters.

But, IMO, the results of some of those typos/tests raised interesting questions. Here's the base code, with comments indicating how the regex (and the output label -- 1, 2, or 3) varied in separate but otherwise identical scripts:

#!/usr/bin/perl
use strict;
use warnings;
use 5.012;
# 928860a

$\=undef;
my $string = <DATA>;

#1 capture paren pair contains the non-grouping parens
if ( $string =~ />((?:\s|\w)+)(?!<\td>)/m ) {
    say "regex 1: $1";
}

# cf #1 />((?:\s|\w)+)(?!<\td>)/m
#    "</td>" erroneously written as "<\td>" (escaping the "t") captures precisely as intended
# cf #2 />((?:\s|\w)+)(?!<\/td>)/m
#    regex as intended fails to capture the terminal "g"
# cf #3 />(?:(\s|\w)+)(?!<\/td>)/m
#    captures only the (penultimate?) "n"

__DATA__
/td><td class="stdViewCon">Moving Services for MSCD Student Success Building</td></tr></table></TD></TR>
<TR VALIGN=top><TD><table class="stdViewITbl"><tr><td class="stdViewLnkLbl">

Here's a transcript of execution and output from the three almost-identical tests:

C:\>928860a.pl
regex 1: Moving Services for MSCD Student Success Building

C:\>928860b.pl
regex 2: Moving Services for MSCD Student Success Buildin

C:\>928860c.pl
regex 3: n

Admittedly, I've re-read the regex perldocs only in a cursory manner and have reviewed only the index in Mastering Regular Expressions, Second Edition (the one in my lib; Third Edition is current). Does anyone have a pointer to documentation that explains what is, for me, inexplicable: why does regex 1, with the error, produce the desired output, while regex 2 fails to capture the terminal "g" in the desired link-text and regex 3 fails almost entirely?

Re: Why do these regex variants behave as they do?
by BrowserUk (Pope) on Oct 02, 2011 at 17:15 UTC
    Why does regex 1, with the error, produce the desired output, while regex 2 fails to capture the terminal "g" in the desired link-text and regex 3 fails almost entirely?
    1. / > ( (?: \s | \w )+ ) (?! <\td> ) /mx

      This works because the error is irrelevant, and redundant, to what is captured.

      Without it, the resultant regex / > ( (?: \s | \w )+ ) /mx still works exactly the same.

      The only place in the string where '>' is immediately followed by a space (\s) or word (\w) character, starts at '>Moving'.

      And the string of 1 or more space or word characters ends with the first '<'.

    2. / > ( (?: \s| \w ) + ) (?! <\/td> ) /mx says that the last captured character in the string must not be followed by </td>.

      So the regex omits the 'g' which is followed by that string.

    3. / > (?: ( \s | \w ) + ) (?! <\/td> ) /mx only captures a single character because that's what it asks for.

      ( \s | \w ) says 'capture either a single space, or a single word character', so it does.

      The presence of the quantifier '+' outside the capturing parens does not change that.
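The three points above can be reproduced in a single script. This is only a minimal sketch, not the OP's code: the helper name first_capture and the shortened sample string are assumptions made for illustration. It relies on two standard behaviors. When a negative look-ahead fails, the engine backtracks and the greedy + gives back one character. And a capture group placed inside a quantified group is overwritten on every repetition, so after the match it holds only the last one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened stand-in for the OP's DATA line (an assumption for illustration).
my $string = '/td><td class="stdViewCon">Success Building</td></tr>';

# Hypothetical helper: return $1 for a given pattern, or undef on no match.
sub first_capture {
    my ( $re, $text ) = @_;
    return $text =~ $re ? $1 : undef;
}

# 1. "\t" is a tab, so (?!<\td>) looks for "<", TAB, "d>", which never
#    occurs here; the look-ahead therefore never rejects anything.
my $r1 = first_capture( qr/>((?:\s|\w)+)(?!<\td>)/, $string );

# 2. The look-ahead rejects the position right before "</td>", so the
#    greedy + gives back one character and the match ends before the "g".
my $r2 = first_capture( qr/>((?:\s|\w)+)(?!<\/td>)/, $string );

# 3. Same overall match as 2, but the capture group sits inside the
#    repeated group, so $1 keeps only the last repetition.
my $r3 = first_capture( qr/>(?:(\s|\w)+)(?!<\/td>)/, $string );

print "regex 1: $r1\n";
print "regex 2: $r2\n";
print "regex 3: $r3\n";
```

With this sample string the three lines should read "Success Building", "Success Buildin" and "n", mirroring the OP's transcript.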


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      ww:
      Further to BrowserUk's reply, it may be helpful, particularly in the third example, to see the entirety of what is matched (seen in $&, a naughty fellow whom we normally shun) versus what is captured (to $1 from the first capture group).

      Note: in the examples below, I use the character set  [\s \w] as equivalent to  (?:\s|\w) to emphasize the character-set nature of the grouping. The presence of an extra space in the character set is used in an attempt, possibly ill-conceived, to get everything to 'line up right'; the space is redundant because it is included in the  \s 'whitespace' set.

      Note also that I have used a simplified string in the examples, and the 'closing' tag is just '<t>' and '<X>' is the incorrect closing tag (the forward/backward slashes just confuse the issue).

      >perl -wMstrict -le
      "my $s = '<t>Abcd efgh ijK<t>';
       ;;
       print qq{1a '$&' ($1)} if $s =~ m{ > ( (?:\s|\w)+ ) (?!<X>) }xms;
       print qq{1b '$&' ($1)} if $s =~ m{ > ( [\s \w]+ ) (?!<X>) }xms;
       ;;
       print qq{2a '$&' ($1)} if $s =~ m{ > ( (?:\s|\w)+ ) (?!<t>) }xms;
       print qq{2b '$&' ($1)} if $s =~ m{ > ( [\s \w]+ ) (?!<t>) }xms;
       ;;
       print qq{3a '$&' ($1)} if $s =~ m{ > (?: ( \s|\w )+ ) (?!<t>) }xms;
       print qq{3b '$&' ($1)} if $s =~ m{ > (?: ([\s \w])+ ) (?!<t>) }xms;
      "
      1a '>Abcd efgh ijK' (Abcd efgh ijK)
      1b '>Abcd efgh ijK' (Abcd efgh ijK)
      2a '>Abcd efgh ij' (Abcd efgh ij)
      2b '>Abcd efgh ij' (Abcd efgh ij)
      3a '>Abcd efgh ij' (j)
      3b '>Abcd efgh ij' (j)

      Update: Changed example code to print $& first (in single-quotes), then $1 (in parentheses, symbolic of capture) to match the order of their discussion in the text.
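For anyone who would rather not touch $& at all, the same matched-versus-captured comparison can be made with the /p modifier and ${^MATCH}, both available since Perl 5.10. The following is a minimal sketch using the simplified string from the examples above; the variable names $matched and $captured are assumptions chosen for illustration. (On recent perls /p is reportedly a no-op and ${^MATCH} is always available, so treat the modifier as a courtesy to older versions.)

```perl
use strict;
use warnings;

my $s = '<t>Abcd efgh ijK<t>';

my ( $matched, $captured );
if ( $s =~ m{ > (?: ( [\s\w] )+ ) (?!<t>) }xmsp ) {
    $matched  = ${^MATCH};    # the whole match, like $& but requested via /p
    $captured = $1;           # only the last repetition of the capture group
}

print "3b '$matched' ($captured)\n";
```

This should print the same line as case 3b above: the whole match runs from '>' through 'j', while the capture holds the single character 'j'.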

      /me slaps head;
      ...wishes he could award multiple ++ to both BrowserUk and AnomalousMonk for clear, concise and brilliantly on-target replies.

      I had played with a char class at some point in this evolution... and didn't quite nail it (didn't even come close?). But your answers made it clear that one "right" (YMMV) approach with a class is:

      #4 Non-capture (grouping) paren-pair contains the capture parens which use a char_class
      if ( $string4 =~ />(?:([\s|\w]+))(?:<\/td>)/m ) {
          say "regex 4: $1";
      }

      Again, thanks for spotting my blind-spots!

        if ( $string4 =~ />(?:([\s|\w]+))(?:<\/td>)/m ) { ... }

        In  (?:([\s|\w]+)) the '|' (pipe) character in the character set is taken literally (i.e., the set matches any whitespace, word or '|' character) and so is probably not what you intend! Also, the non-capturing grouping is redundant:  ([\s\w]+) should work just as well. Further, the  (?:<\/td>) at the end (in which the non-capturing grouping is also redundant) requires a positive match on this sequence of characters, whereas in previous code this was a  (?!<\/td>) zero-width, negative look-ahead assertion; don't know if this difference was intended or not (Update: although on second thought it probably was intended since it was the negative look-ahead that led to the missing-final-character puzzlement).
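Putting those remarks together gives something like the following. It is a sketch, not ww's final code: both sample strings are made up for illustration and the variable names are assumptions. It drops the literal '|' from the character set, drops the redundant groupings, and shows two ways to tie the capture to the closing tag: consuming </td> as part of the match, or merely asserting it with a zero-width positive look-ahead (?=...).

```perl
use strict;
use warnings;

# Made-up samples (assumptions); the second has a '|' in the link text.
my $string4 = '<td class="stdViewCon">Success Building</td></tr>';
my $piped   = '<td>a|b</td>';

# Character set without the pipe; the closing tag is consumed by the match.
my ($text) = $string4 =~ m{>([\s\w]+)</td>};

# Same capture, but the closing tag is only asserted, not consumed.
my ($text_la) = $string4 =~ m{>([\s\w]+)(?=</td>)};

# [\s|\w] accepts a literal '|'; [\s\w] does not, so the second match fails.
my ($with_pipe)    = $piped =~ m{>([\s|\w]+)</td>};
my ($without_pipe) = $piped =~ m{>([\s\w]+)</td>};

print "regex 4: $text\n";
print "look-ahead: $text_la\n";
print "with pipe in set: $with_pipe\n";
print 'without pipe: ',
    defined $without_pipe ? $without_pipe : '(no match)', "\n";
```

Both forms capture the full link text here, including the final character, because the pattern now requires the closing tag rather than forbidding it. Using m{...} delimiters also avoids having to escape the slash in </td>.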

Reaped: Re: Why do these regex variants behave as they do?
by NodeReaper (Curate) on Oct 03, 2011 at 13:22 UTC
