Re^4: Breaking The Rules II

by wfsp (Abbot)
on Jul 03, 2007 at 19:26 UTC ( #624756=note )

in reply to Re^3: Breaking The Rules II
in thread Breaking The Rules II

Parsing HTML with a parser:
"...far quicker to use a 'bunch of regex' to extract the data I want... ...when things change, I find it easier to adjust the regex than figure out which other module or modules and methods I now require. ... I think that there is an element of 'laziness gone to far'...".

Really, how much real-world HTML have you had to deal with? Some monks do indeed argue that "you can parse HTML with a regex" (e.g. tye), but I've never seen any argue against using a parser. Most monks urge that a parser at least be considered.

As with other aspects of Perl where there are many ways to do it, you tend to settle on what you're most comfortable with. So, naturally, monks will have different favorites. Mine is HTML::TokeParser::Simple, a wonderful module. :-)

But there's more!

Mostly, take the time to learn to use regex well and you'll not need to run off to cpan to grab, and spend time learning to use, one of ten new modules, each of which purport to do what you need, but each of which has its own set of limitations and caveats.
Replace the word 'regex' with 'Perl' and you'll have encompassed all of cpan. What cpan modules do you use? If you took the time to learn perl well you'll not need to. And I say that with no limitations or caveats. :-)

could you kindly leave my gums out of this.

Re^5: Breaking The Rules II (regex HTML)
by tye (Sage) on Jul 03, 2007 at 21:38 UTC

    Oh, it depends on the data. I mostly agree with you, but I also know that it can be quite appropriate to just grab stuff out of the middle with a simpler regex in many cases. Ironically, I find this particularly true of HTML and XML because you are unlikely to find <em>Price:</em> in a place where it doesn't mean the obvious thing. That is, for example, <a href="<em>Price:</em>"> is very unlikely, even if XML (stupidly) doesn't just outright forbid it (it should, IMIO, require the inner angle brackets to be escaped).

    But, of course, there are HTML comments and XML CDATA that do allow for such confusing situations (and I find CDATA a pretty pointless feature). In practice, I find that it is often quite safe to discount this risk, however.
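As a minimal sketch of the trade-off described above: the fragment, the label and the prices below are all invented for illustration and come from no real page. It shows a simple regex grabbing a value after a label, how a comment containing the same label can fool it, and one cheap (not bulletproof) mitigation.

```perl
use strict;
use warnings;

# Hypothetical fragment; the markup and the prices are made up.
my $html = <<'HTML';
<!-- old layout: <em>Price:</em> 9.99 -->
<p><em>Price:</em> 12.50</p>
HTML

# Naive grab: the first match wins, and here that is the one inside the comment.
my ($naive) = $html =~ m{<em>Price:</em>\s*([\d.]+)};

# Cheap mitigation: drop comments first, then apply the same regex.
( my $stripped = $html ) =~ s{<!--.*?-->}{}gs;
my ($price) = $stripped =~ m{<em>Price:</em>\s*([\d.]+)};

print "naive: $naive\n";
print "after stripping comments: $price\n";
```

Whether the extra step is worth it depends on how much you trust the page, which is the risk assessment being argued for here.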

    My personal theory is that this mantra against using regexes against HTML (etc.) probably originated from more narrow situations where there were repeated attempts by novice coders to do more systematic manipulation of HTML. For example, someone proposing to use


    in a web spider is more deserving of being warned to use a proper parser than someone trying to pull one stock price from an internal web site for personal use.

    So the warning has some value even in the latter case, but it is more likely to be overstated. I don't see a rash of hot ranting about such things. It is almost always mentioned, which probably annoys some people.

    I rarely see people admit that a simpler regex against something like HTML can often be a reasonable solution, yet I've certainly had much practical success doing such over the years, so I consider myself qualified to state that it can work well. But designing such a solution well requires taking this risk into account, so it can be important to point out the risk. And the response to such comments often encourages restressing the point.

    So I can see how someone would think that this point is often being made dogmatically. But I don't see any value in trying to lump them into some imaginary group or apply insulting language to them. (:

    - tye        

Re^5: Breaking The Rules II
by BrowserUk (Pope) on Jul 03, 2007 at 21:18 UTC
    Some monks do indeed argue that "you can parse HTML with a regex" (e.g. tye) but I've never seen any argue against using a parser.

    You didn't read what I wrote. I never said "use regex to parse HTML". If I wanted to parse HTML, I would use HTML::Parser (which I mentioned above as a fine pragmatic module). But if I want to extract some strings from amongst some other strings, I use regex.

    To bring the linked example above up to date (the page it used went away many moons ago): one of the web pages I visit frequently is the BBC Weather forecast.

    Here is a script to extract and print a selection of the data that page contains:

    #! perl -slw
    use strict;
    use LWP::Simple;

    our $area ||= 1040;

    my $url = "$area";
    my $html = get $url or die "Failed to get html: $!, $^E";

    my $reTag = qr[< [^>]+ > \s*? ]x;

    my( $state, $temp, $wDir, $wSpeed, $humid, $press, $updown, $vis ) = $html =~ m[
        Current\s+Nearest\s+Observations $reTag : $reTag+
        ( \S+ ) $reTag \s+
        ( \d+ ) .+?
        ( [NSEW]+ ) \s+ \( ( \d+ ) .+?
        Relative\s+Humidity .+? : \s+ ( \d+ ) .+?
        Pressure .+? : \s+ ( \d+ ), \s+ ( \S+ ) , .+?
        Visibility .+? : \s+ ( [^<]+ )
    ]smx or warn 'Regex failed!' and getstore $url, 'weather.failed';

    print <<EOP;
    Sky: $state
    Temp: $temp
    Wind Direction: $wDir
    Speed: $wSpeed
    Humidity: $humid%
    Pressure: $press $updown
    Visibility: $vis
    EOP

    When run it produces output like this:

    C:\test>
    Sky: cloudy
    Temp: 16
    Wind Direction: S
    Speed: 5
    Humidity: 77%
    Pressure: 999 Rising
    Visibility: Very good

    C:\test> -area=3200
    Sky: cloudy
    Temp: 17
    Wind Direction: SW
    Speed: 17
    Humidity: 67%
    Pressure: 999 Rising
    Visibility: Very good
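The tag-skipping idea in the script above can be tried offline. The sketch below reuses its `$reTag` pattern against a hypothetical fragment; the markup is invented for illustration, since the real page layout is not shown in the thread, and the wind fields are left out to keep it short.

```perl
use strict;
use warnings;

# Hypothetical markup, made up for this sketch; not the real BBC page.
my $html = <<'HTML';
<div class="display">
<h3>Current Nearest Observations</h3>:<br>
<span>cloudy</span> 16&deg;C
<p>Relative Humidity (&#37;): 77, Pressure (mB): 999, Rising, <b>Visibility</b>: Very good</p>
</div>
HTML

# One tag plus (lazily) any whitespace after it, as in the script above.
my $reTag = qr[< [^>]+ > \s*? ]x;

# Anchor on literal labels, skip tags with $reTag, capture the values in between.
my ( $state, $temp, $humid, $press, $updown, $vis ) = $html =~ m[
    Current\s+Nearest\s+Observations $reTag : $reTag+
    ( \S+ ) $reTag \s+
    ( \d+ ) .+?
    Relative\s+Humidity .+? : \s+ ( \d+ ) .+?
    Pressure .+? : \s+ ( \d+ ), \s+ ( \S+ ) , .+?
    Visibility .+? : \s+ ( [^<]+ )
]smx or die "Regex failed!\n";

print "Sky: $state\n";
print "Temp: $temp\n";
print "Humidity: $humid%\n";
print "Pressure: $press $updown\n";
print "Visibility: $vis\n";
```

The pattern leans on the labels ("Relative Humidity", "Pressure", "Visibility") rather than on the tag structure, so it survives cosmetic markup changes but fails outright if a label is renamed.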

    Your mission, should you choose to accept it, is to reproduce that script using HTML::TokeParser::Simple and post it here.

    If you took the time to learn perl well you'll not need to. And I say that with no limitations or caveats. :-)

    That's simply not true. For one, I couldn't replace HTML::Parser with Perl code, because it's a Perl wrapper around an XS wrapper around 41k of intensely involved C code. Nor GD for similar reasons. Nor Time::HiRes. Nor... about 60 more modules, but that would just belabour the point.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      #! perl
      use strict;
      use warnings;
      use LWP::Simple;
      use HTML::TokeParser::Simple;

      our $area ||= 1040;
      my $url = "$area";
      my $html = get $url or die "Failed to get html: $!, $^E";

      my $p = HTML::TokeParser::Simple->new(\$html);

      while (my $t = $p->get_tag('div')){
        last if $t->get_attr('class') and $t->get_attr('class') eq 'display';
      }

      my @data;
      while (my $t = $p->get_token){
        last if $t->is_end_tag('div');
        push @data, $t->as_is if $t->is_text;
      }

      # raw data
      #for (0..$#data){
      #  print "$_: $data[$_]\n";
      #}

      my ($dir, $speed) = split(/\(/, $data[5]);

      print <<EOP;
      Sky: $data[2]
      Temp: $data[3]
      Wind Direction: $dir
      Speed: $speed
      Humidity$data[11]
      Pressure$data[15]
      Visibility$data[17]
      EOP

      __DATA__
      Sky: cloudy
      Temp: 14
      Wind Direction: SW
      Speed: 10
      Humidity: 80,
      Pressure: 1004, Rising,
      Visibility: Very good

      ## raw data ##
      0: Current Nearest Observations
      1: :
      2: cloudy
      3: 14
      4: C
      5: SW (10
      6: mph
      7: )
      8: Relative Humidity (
      9: &#37;
      10: )
      11: : 80,
      12: Pressure (
      13: mB
      14: )
      15: : 1004, Rising,
      16: Visibility
      17: : Very good
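The same token-walking approach can be tried without a network connection. This is only a sketch: the fragment is hypothetical (the class name `display` is taken from the script above, everything else is invented), and it assumes HTML::TokeParser::Simple is installed from CPAN. Unlike the script above, it skips whitespace-only text tokens so the indices stay compact.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;    # CPAN module, not part of core Perl

# Hypothetical markup, made up for this sketch; not the real BBC page.
my $html = <<'HTML';
<div class="menu"><a href="/">Home</a></div>
<div class="display">
<h3>Current Nearest Observations</h3>
<span>cloudy</span>
<span>16</span>
</div>
HTML

my $p = HTML::TokeParser::Simple->new( \$html );

# Skip ahead to <div class="display">.
while ( my $t = $p->get_tag('div') ) {
    my $class = $t->get_attr('class');
    last if defined $class and $class eq 'display';
}

# Collect the non-blank text tokens up to the next closing </div>.
my @data;
while ( my $t = $p->get_token ) {
    last if $t->is_end_tag('div');
    next unless $t->is_text;
    my $text = $t->as_is;
    push @data, $text if $text =~ /\S/;
}

print "$_: $data[$_]\n" for 0 .. $#data;
```

Note that the extraction still depends on positions in `@data`, so a layout change that adds or removes a text node shifts the indices, much as a markup change can break a regex.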

