
OfficeLinebacker asked (in Question why this Regex isn't matching) for a regex to extract the link text from some (HTML) data and was correctly advised to use a module instead. In fact, /me was one of those with that advice... but the OP stuck in my mind, and today I got to playing around with some possibilities.

Naturally, I screwed up a few times... that's no great surprise -- "Typos is use," just for starters.

But, IMO, the results of some of those typos/tests raised interesting questions. Here's the base code, with comments indicating how the regex (and the output label -- 1, 2, or 3) varied in separate but otherwise identical scripts:

#!/usr/bin/perl
use strict;
use warnings;
use 5.012;

# 928860a

$\=undef;
my $string = <DATA>;

#1 capture paren pair contains the non-grouping parens
if ( $string =~ />((?:\s|\w)+)(?!<\td>)/m ) { 
    say "regex 1: $1";
}

# cf #1  />((?:\s|\w)+)(?!<\td>)/m 
# "</td>" erroneously written as "<\td>" (escaping the "t") captures precisely as intended

# cf #2  />((?:\s|\w)+)(?!<\/td>)/m
# regex as intended fails to capture the terminal "g"

# cf #3  />(?:(\s|\w)+)(?!<\/td>)/m
# captures only the (penultimate?) "n"

__DATA__
/td><td class="stdViewCon">Moving Services for MSCD Student Success Building</td></tr></table></TD></TR>

<TR VALIGN=top><TD><table class="stdViewITbl"><tr><td class="stdViewLnkLbl">

Here's a transcript of execution and output from the three almost-identical tests:

C:\>928860a.pl
regex 1: Moving Services for MSCD Student Success Building

C:\>928860b.pl
regex 2: Moving Services for MSCD Student Success Buildin

C:\>928860c.pl
regex 3: n

Admittedly, I've re-read the regex perldocs only in a cursory manner and have reviewed only the index in Mastering Regular Expressions, Second Edition (the one in my lib; Third Edition is current). Does anyone have a pointer to documentation that explains what is, for me, inexplicable? Why does regex 1, with the error, produce the desired output, while regex 2 fails to capture the terminal "g" in the desired link text and regex 3 fails almost entirely?
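For anyone who'd rather not juggle three files, here's a minimal sketch (my own consolidation, not one of the scripts above) that runs all three variants against the first line of the data, which is all that the scalar read of DATA picks up. The names @variants and %captured are purely illustrative, and printing $& is an addition, included only to show how much text each whole match consumed alongside what landed in $1.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.012;

# First line of the data, as read by the scalar <DATA> in the scripts above.
my $string = '/td><td class="stdViewCon">Moving Services for MSCD Student Success Building</td></tr></table></TD></TR>';

# The three regex variants, paired with the output label used above.
# (Array and hash names here are hypothetical, chosen for this sketch.)
my @variants = (
    [ 1 => qr/>((?:\s|\w)+)(?!<\td>)/m  ],   # the typo: "\t" instead of "\/t"
    [ 2 => qr/>((?:\s|\w)+)(?!<\/td>)/m ],   # the regex as intended
    [ 3 => qr/>(?:(\s|\w)+)(?!<\/td>)/m ],   # capture parens inside the group
);

my %captured;
for my $variant (@variants) {
    my ( $label, $re ) = @$variant;
    if ( $string =~ $re ) {
        # Save $1 right away; it is only valid for this successful match.
        $captured{$label} = $1;
        say "regex $label: \$1 = '$1' | whole match = '$&'";
    }
    else {
        say "regex $label: no match";
    }
}
```

Comparing the whole match against $1 for each variant at least separates "how far did the regex get" from "what ended up in the capture buffer", which seems like the first thing to pin down.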

In reply to Why do these regex variants behave as they do? by ww
