
Why do these regex variants behave as they do?

by ww (Bishop)
on Oct 02, 2011 at 16:26 UTC (#929160=perlquestion)
ww has asked for the wisdom of the Perl Monks concerning the following question:

OfficeLinebacker asked ( in Question why this Regex isn't matching ) for a regex to extract the link text from some (HTML) data and was correctly advised to use a module instead. In fact, /me was one of those with that advice ...but the OP stuck in my mind and today I got to playing around with some possibilities.

Naturally, I screwed up a few times... that's no great surprise -- "Typos is use," just for starters.

But, IMO, the results of some of those typos/tests raised interesting questions. Here's the base code, with comments indicating how the regex (and the output label -- 1, 2, or 3) varied in separate but otherwise identical scripts:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.012;
    # 928860a

    $\=undef;
    my $string = <DATA>;

    #1 capture paren pair contains the non-grouping parens
    if ( $string =~ />((?:\s|\w)+)(?!<\td>)/m ) {
        say "regex 1: $1";
    }

    # cf #1 />((?:\s|\w)+)(?!<\td>)/m
    #   "</td>" erroneously written as "<\td>" (escaping the "t") captures precisely as intended
    # cf #2 />((?:\s|\w)+)(?!<\/td>)/m
    #   regex as intended fails to capture the terminal "g"
    # cf #3 />(?:(\s|\w)+)(?!<\/td>)/m
    #   captures only the (penultimate?) "n"

    __DATA__
    /td><td class="stdViewCon">Moving Services for MSCD Student Success Building</td></tr></table></TD></TR>
    <TR VALIGN=top><TD><table class="stdViewITbl"><tr><td class="stdViewLnkLbl">

Here's a transcript of execution and output from the three almost-identical tests:

    C:\>928860a.pl
    regex 1: Moving Services for MSCD Student Success Building

    C:\>928860b.pl
    regex 2: Moving Services for MSCD Student Success Buildin

    C:\>928860c.pl
    regex 3: n

Admittedly, I've re-read the regex perldocs only in a cursory manner and have reviewed only the index in Mastering Regular Expressions, Second Edition (the one in my lib; Third Edition is current). Does anyone have a pointer to documentation that explains what is, for me, inexplicable: Why does regex 1, with the error, produce the desired output, while regex 2 fails to capture the terminal "g" in the desired link-text and regex 3 fails almost entirely?

Re: Why do these regex variants behave as they do?
by BrowserUk (Pope) on Oct 02, 2011 at 17:15 UTC
    Why does regex 1, with the error, produce the desired output, while regex 2 fails to capture the terminal "g" in the desired link-text and regex3 fails almost entirely?
    1. / > ( (?: \s | \w )+ ) (?! <\td> ) /mx

      This works because the error is irrelevant, and redundant, to what is captured.

      Without it, the resultant regex / > ( (?: \s | \w )+ ) /mx still works exactly the same.

      The first place in the string where '>' is immediately followed by a space (\s) or word (\w) character is at '>Moving', so that is where the match starts.

      And the string of 1 or more space or word characters ends with the first '<'.

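    2. / > ( (?: \s| \w ) + ) (?! <\/td> ) /mx says that the last captured character in the string must not be followed by </td>.

      The greedy + first runs all the way to the 'g', but there the look-ahead sees </td> and fails, so the regex backs off one character. After the 'n' the next characters are 'g</td>', which does not start with </td>, so the assertion passes and the 'g' is omitted.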

    3. / > (?: ( \s | \w ) + ) (?! <\/td> ) /mx only captures a single character because that's what it asks for.

      ( \s | \w ) says 'capture either a single space, or a single word character', so it does.

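      The presence of the quantifier '+' outside the capturing parens does not change that. Each repetition overwrites the group, so $1 ends up holding only the last character matched; that is the 'n' rather than the 'g' because the look-ahead forces the same one-character back-off as in regex 2.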
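The three explanations above can be checked with a minimal sketch like the one below. It is not the original script: the data line is shortened to a stand-in string (an assumption for brevity), and the variable names are made up for illustration. The extra "bare" variant is regex 1 with the look-ahead removed, to show that the mistyped assertion is irrelevant to what is captured.

```perl
use strict;
use warnings;

# Shortened stand-in for the DATA line in the question (an assumption for brevity).
my $string = qq{/td><td class="stdViewCon">Success Building</td></tr>\n};

# Regex 1 with and without the mistyped look-ahead. In a Perl regex, \t is a
# TAB, so (?!<\td>) only forbids "<", TAB, "d>", which never occurs here.
my ($r1)      = $string =~ />((?:\s|\w)+)(?!<\td>)/m;
my ($r1_bare) = $string =~ />((?:\s|\w)+)/m;

# Regex 2: the greedy + first reaches the "g"; the look-ahead then sees "</td>"
# and fails, so the engine gives back one character and succeeds after the "n".
my ($r2) = $string =~ />((?:\s|\w)+)(?!<\/td>)/m;

# Regex 3: same overall match as regex 2, but the capture group sits inside
# the repetition, so it is overwritten on every pass and keeps the last one.
my ($r3) = $string =~ />(?:(\s|\w)+)(?!<\/td>)/m;

print "regex 1:        $r1\n";
print "regex 1 (bare): $r1_bare\n";
print "regex 2:        $r2\n";
print "regex 3:        $r3\n";
```

In list context the match returns the contents of the capture groups, which is why each result can be assigned directly instead of reading $1 afterwards.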


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      ww:
      Further to BrowserUk's reply, it may be helpful, particularly in the third example, to see the entirety of what is matched (seen in $&, a naughty fellow whom we normally shun) versus what is captured (to $1 from the first capture group).

      Note: in the examples below, I use the character set  [\s \w] as equivalent to  (?:\s|\w) to emphasize the character-set nature of the grouping. The presence of an extra space in the character set is used in an attempt, possibly ill-conceived, to get everything to 'line up right'; the space is redundant because it is included in the  \s 'whitespace' set.

      Note also that I have used a simplified string in the examples, and the 'closing' tag is just '<t>' and '<X>' is the incorrect closing tag (the forward/backward slashes just confuse the issue).

          >perl -wMstrict -le "my $s = '<t>Abcd efgh ijK<t>';
           ;;
           print qq{1a '$&' ($1)} if $s =~ m{ > ( (?:\s|\w)+ ) (?!<X>) }xms;
           print qq{1b '$&' ($1)} if $s =~ m{ > ( [\s \w]+ ) (?!<X>) }xms;
           ;;
           print qq{2a '$&' ($1)} if $s =~ m{ > ( (?:\s|\w)+ ) (?!<t>) }xms;
           print qq{2b '$&' ($1)} if $s =~ m{ > ( [\s \w]+ ) (?!<t>) }xms;
           ;;
           print qq{3a '$&' ($1)} if $s =~ m{ > (?: ( \s|\w )+ ) (?!<t>) }xms;
           print qq{3b '$&' ($1)} if $s =~ m{ > (?: ([\s \w])+ ) (?!<t>) }xms;
           "
          1a '>Abcd efgh ijK' (Abcd efgh ijK)
          1b '>Abcd efgh ijK' (Abcd efgh ijK)
          2a '>Abcd efgh ij' (Abcd efgh ij)
          2b '>Abcd efgh ij' (Abcd efgh ij)
          3a '>Abcd efgh ij' (j)
          3b '>Abcd efgh ij' (j)

      Update: Changed example code to print $& first (in single-quotes), then $1 (in parentheses, symbolic of capture) to match the order of their discussion in the text.
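For anyone who would rather not touch $& at all, the same matched-versus-captured comparison can be sketched with the /p modifier and ${^MATCH}, assuming Perl 5.10 or later. This is an alternative to the one-liner above, not a rewrite of it, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $s = '<t>Abcd efgh ijK<t>';

my ( $matched, $captured );

# /p asks Perl to preserve the matched text in ${^MATCH} for this match,
# without the program-wide side effects historically associated with $&.
if ( $s =~ m{ > (?: ([\s\w])+ ) (?!<t>) }xmsp ) {
    ( $matched, $captured ) = ( ${^MATCH}, $1 );
}

# Whole match first (in single quotes), then the capture (in parentheses).
printf "3b '%s' (%s)\n", $matched, $captured;
```

The whole match runs from '>' through 'j', while the capture group, being repeated, holds only the final character it matched.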

      /me slaps head;
      ...wishes he could award multiple ++ to both BrowserUk and AnomalousMonk for clear, concise and brilliantly on-target replies.

      I had played with a char class at some point in this evolution... and didn't quite nail it (didn't even come close?). But your answers made it clear that one "right" (YMMV) approach with a class is:

          #4 Non-capture (grouping) paren-pair contains the capture parens which use a char_class
          if ( $string4 =~ />(?:([\s|\w]+))(?:<\/td>)/m ) {
              say "regex 4: $1";
          }

      Again, thanks for spotting my blind-spots!

        if ( $string4 =~ />(?:([\s|\w]+))(?:<\/td>)/m ) { ... }

        In  (?:([\s|\w]+)) the '|' (pipe) character in the character set is taken literally (i.e., the set matches any whitespace, word or '|' character) and so is probably not what you intend! Also, the non-capturing grouping is redundant:  ([\s\w]+) should work just as well. Further, the  (?:<\/td>) at the end (in which the non-capturing grouping is also redundant) requires a positive match on this sequence of characters, whereas in previous code this was a  (?!<\/td>) zero-width, negative look-ahead assertion; don't know if this difference was intended or not (Update: although on second thought it probably was intended since it was the negative look-ahead that led to the missing-final-character puzzlement).
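A minimal sketch of regex 4 with those points applied: one capture group, no '|' in the class, and a literal </td> required after the captured text. The data line is again a shortened stand-in (an assumption), and the two extra checks only illustrate that a pipe inside [...] is an ordinary character.

```perl
use strict;
use warnings;

# Shortened stand-in for the data line (an assumption for brevity).
my $string4 = qq{<td class="stdViewCon">Success Building</td></tr>\n};

# Simplified regex 4: the redundant (?:...) wrappers are gone and the class
# no longer contains a literal pipe.
my ($text) = $string4 =~ />([\s\w]+)<\/td>/;

# Inside a bracketed class '|' is not alternation, so [\s|\w] accepts 'a|b'
# while [\s\w] does not.
my $pipe_in_class  = 'a|b' =~ /\A[\s|\w]+\z/ ? 1 : 0;
my $pipe_not_class = 'a|b' =~ /\A[\s\w]+\z/  ? 1 : 0;

print "regex 4: $text\n";
print "class with pipe matches 'a|b':    $pipe_in_class\n";
print "class without pipe matches 'a|b': $pipe_not_class\n";
```

Because the closing tag is now matched positively rather than ruled out by a negative look-ahead, no back-off occurs and the final character is captured.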

