PerlMonks  

Re^3: Breaking The Rules II

by BrowserUk (Pope)
on Jul 02, 2007 at 23:12 UTC


in reply to Re^2: Breaking The Rules II
in thread Breaking The Rules II

I wasn't targeting P::RD. It is a perfectly fine module for those situations where you need to extract full semantic information from the language you are analysing. But even for this, it's certainly not the only game in town, nor necessarily the best choice for any given application.

With respect to shift/reduce conflicts and Parse::YAPP: It's possible to construct ambiguous grammars regardless of which type of parser one targets, and equally possible to resolve them.

My main point was that parsers in general aren't an easy-to-learn, easy-to-use alternative to regex. Especially since, much of the time, when people say "I want to parse ...", they don't want to parse at all. They simply want to extract some information that happens to be embedded within some other information.

For example, for the vast majority of screen scraping applications, the user has no interest whatsoever in extracting any semantic or syntactic information from the surrounding text. Even if that surrounding text happens to be in a form that may or may not comply with one of the myriad variations of some gml-like markup.

Their only interest is locating a specific piece of text that happens to be embedded within a lot of other text. There may be some clues in that other text that they will need to locate the text they are after, but they couldn't give two hoots whether that other text is self-consistent with some gml/html/xhtml standard.

For this type of application, parsing the surrounding HTML requires a considerable amount of effort and time, both programmer time and processor time. And given the flexibility of browsers to DWIM with badly written HTML/XHTML, it would often set the programmer on a hiding to nothing to even try. Luckily, HTML::Parser and friends are pragmatically and specifically written to gloss over the finer points of those standards and operate in a manner that DWTALTPWAMs (Do What The Average, Less Than Perfect, Web Author Means).

Even so, after 5 years, I have still to see any convincing argument against the opinions I expressed when I wrote Being a heretic and going against the party line. I still find it far quicker to use a 'bunch of regex' to extract the data I want from the average, subject-to-change, website than to work out which combination of modules and methods is required to 'do it properly'. And when things change, I find it easier to adjust the regex than to figure out which other module or modules and methods I now require.

I think that there is an element of 'laziness gone too far' in the dogma that regex is "unreadable, unmaintainable and hard". It is a complex tool with complex rules, just like every parser out there. You have to learn to use it, just as with every other parsing tool out there. It has limitations, just like every other parser out there.

And there are several significant advantages to learning to use regex over every other parsing tool out there.

  1. It's always available.
  2. It is applicable to every situation.

    Left recursive; right recursive; top down; bottom up; nibbling; lookahead; maximal chunk; whatever.

  3. You have complete control.

    Need to perform some program logic part way through a parse? No problem, use /gc and while.

    Need to parse a datastream on the fly? No problem, same technique applies.

    Want to just skip over stuff that doesn't matter to your application? No problem. Parse what you need to, skip over what you don't. You don't have to cater for all eventualities, nor restrict yourself to dealing with data that complies with some formalised, published set of rules.

  4. It's fast.
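A minimal sketch of the /gc and while technique mentioned under point 3, for anyone who hasn't met it. The input string and the sub name extract_numeric_pairs are invented for illustration; the only assumption is that we want name=number pairs and nothing else. \G anchors each match where the previous one stopped, and /c stops a failed match from resetting pos, so several patterns can take turns over the same string.

```perl
use strict;
use warnings;

# Hypothetical input: name=number pairs buried in text we don't care about.
my $text = 'noise; width=10 junk height=20; more noise depth=abc; scale=3';

# Walk the string once, keeping only name=number pairs.
sub extract_numeric_pairs {
    my ($str) = @_;
    my %found;
    pos($str) = 0;
    while ( pos($str) < length $str ) {
        if ( $str =~ /\G(\w+)=(\d+)/gc ) {
            # Ordinary program logic can run here, part way through the parse.
            $found{$1} = $2;
        }
        elsif ( $str =~ /\G\w+/gc ) {
            # A word that is not part of a pair we want: skip it whole.
        }
        else {
            # Anything else: step over a single character.
            pos($str) = pos($str) + 1;
        }
    }
    return \%found;
}

my $pairs = extract_numeric_pairs($text);
print "$_=$pairs->{$_}\n" for sort keys %$pairs;
```

The same loop shape should work on a buffer that is refilled as data arrives, though that part is not shown here.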

Mostly, take the time to learn to use regex well and you'll not need to run off to cpan to grab, and spend time learning to use, one of ten new modules, each of which purports to do what you need, but each of which has its own set of limitations and caveats.

I have a regex based parser for math expressions, with precedence and identifiers, assignment and variadic functions. It's all of 60 lines including the comprehensive test suite! One day I'll get around to cleaning it up and posting it somewhere.
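To be clear, what follows is not the 60-line parser mentioned above, which was never posted here. It is only a rough sketch of how a regex-driven expression evaluator with precedence, identifiers and assignment might be put together, using \G and /gc in a recursive descent. All names (evaluate, parse_expr, parse_term, parse_factor, %vars) are hypothetical, and variadic functions are left out.

```perl
use strict;
use warnings;

my %vars;    # hypothetical symbol table for identifiers

# expr := term ( ('+' | '-') term )*
sub parse_expr {
    my ($s) = @_;    # $s is a reference to the input string
    my $val = parse_term($s);
    while ( $$s =~ /\G\s*([-+])/gc ) {
        my $op  = $1;
        my $rhs = parse_term($s);
        $val = $op eq '+' ? $val + $rhs : $val - $rhs;
    }
    return $val;
}

# term := factor ( ('*' | '/') factor )*
sub parse_term {
    my ($s) = @_;
    my $val = parse_factor($s);
    while ( $$s =~ /\G\s*([*\/])/gc ) {
        my $op  = $1;
        my $rhs = parse_factor($s);
        $val = $op eq '*' ? $val * $rhs : $val / $rhs;
    }
    return $val;
}

# factor := number | '-' factor | identifier | '(' expr ')'
sub parse_factor {
    my ($s) = @_;
    return $1 + 0 if $$s =~ /\G\s*(\d+(?:\.\d+)?)/gc;
    if ( $$s =~ /\G\s*-/gc ) {
        my $val = parse_factor($s);
        return -$val;
    }
    if ( $$s =~ /\G\s*([A-Za-z_]\w*)/gc ) {
        my $name = $1;
        die "Unknown identifier '$name'\n" unless exists $vars{$name};
        return $vars{$name};
    }
    if ( $$s =~ /\G\s*\(/gc ) {
        my $val = parse_expr($s);
        $$s =~ /\G\s*\)/gc
            or die "Expected ')' at offset " . ( pos($$s) || 0 ) . "\n";
        return $val;
    }
    die "Unexpected input at offset " . ( pos($$s) || 0 ) . "\n";
}

# statement := identifier '=' expr | expr
sub evaluate {
    my ($src) = @_;
    my $target;
    $target = $1 if $src =~ /\G\s*([A-Za-z_]\w*)\s*=/gc;
    my $val = parse_expr( \$src );
    $src =~ /\G\s*\z/gc
        or die "Trailing input at offset " . ( pos($src) || 0 ) . "\n";
    $vars{$target} = $val if defined $target;
    return $val;
}

print evaluate('x = 2 + 3 * 4'), "\n";
print evaluate('(x - 4) / 5'),   "\n";
```

Each grammar level gets its own sub, so precedence falls out of the call structure rather than from a table. A precedence-climbing loop would be shorter still, but this layout is probably the easier one to read.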


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^4: Breaking The Rules II
by wfsp (Abbot) on Jul 03, 2007 at 19:26 UTC
    Parsing HTML with a parser:
    "...far quicker to use a 'bunch of regex' to extract the data I want... ...when things change, I find it easier to adjust the regex than figure out which other module or modules and methods I now require. ... I think that there is an element of 'laziness gone too far'...".
    No!

    Really, how much real world HTML have you had to deal with? Some monks do indeed argue that "you can parse HTML with a regex" (e.g. tye) but I've never seen any argue against using a parser. Most monks urge that a parser be at least considered.

    As with other aspects of Perl where there are many ways to do it, you tend to settle on what you're most comfortable with. So, naturally, monks will have different favorites. Mine is HTML::TokeParser::Simple, a wonderful module. :-)

    But there's more!

    Mostly, take the time to learn to use regex well and you'll not need to run off to cpan to grab, and spend time learning to use, one of ten new modules, each of which purports to do what you need, but each of which has its own set of limitations and caveats.
    Replace the word 'regex' with 'Perl' and you'll have encompassed all of cpan. What cpan modules do you use? If you took the time to learn perl well you'll not need to. And I say that with no limitations or caveats. :-)

    p.s.
    could you kindly leave my gums out of this.

      Some monks do indeed argue that "you can parse HTML with a regex" (e.g. tye) but I've never seen any argue against using a parser.

      You didn't read what I wrote. I never said "use regex to parse html". If I wanted to parse html, I would use HTML::Parser (which I mentioned above as a fine pragmatic module). But if I want to extract some strings from amongst some other strings, I use regex.

      The page used by the linked example above went away many moons ago, so here is an up-to-date one. One of the web pages I visit frequently is the BBC Weather forecast.

      Here is a script to extract and print a selection of the data that page contains:

      #! perl -slw
      use strict;
      use LWP::Simple;

      our $area ||= 1040;

      my $url = "http://www.bbc.co.uk/weather/5day.shtml?id=$area";
      my $html = get $url or die "Failed to get html: $!, $^E";

      my $reTag = qr[< [^>]+ > \s*? ]x;

      my( $state, $temp, $wDir, $wSpeed, $humid, $press, $updown, $vis ) = $html =~ m[
          Current\s+Nearest\s+Observations $reTag : $reTag+
          ( \S+ ) $reTag \s+
          ( \d+ ) .+?
          ( [NSEW]+ ) \s+ \( ( \d+ ) .+?
          Relative\s+Humidity .+? : \s+ ( \d+ ) .+?
          Pressure .+? : \s+ ( \d+ ), \s+ ( \S+ ) , .+?
          Visibility .+? : \s+ ( [^<]+ )
      ]smx or warn 'Regex failed!' and getstore $url, 'weather.failed';

      print <<EOP;
      Sky: $state
      Temp: $temp
      Wind Direction: $wDir
      Speed: $wSpeed
      Humidity: $humid%
      Pressure: $press $updown
      Visibility: $vis
      EOP

      When run it produces output like this:

      C:\test>getWeather.pl
      Sky: cloudy
      Temp: 16
      Wind Direction: S
      Speed: 5
      Humidity: 77%
      Pressure: 999 Rising
      Visibility: Very good

      C:\test>getWeather.pl -area=3200
      Sky: cloudy
      Temp: 17
      Wind Direction: SW
      Speed: 17
      Humidity: 67%
      Pressure: 999 Rising
      Visibility: Very good

      Your mission, should you choose to accept it, is to reproduce that script using HTML::TokeParser::Simple and post it here.

      If you took the time to learn perl well you'll not need to. And I say that with no limitations or caveats. :-)

      That's simply not true. For one, I couldn't replace HTML::Parser with Perl code, because it's a Perl wrapper around an XS wrapper around 41k of intensely involved C code. Nor GD for similar reasons. Nor Time::HiRes. Nor... about 60 more modules, but that would just belabour the point.


        #! perl
        use strict;
        use warnings;

        use LWP::Simple;
        use HTML::TokeParser::Simple;

        our $area ||= 1040;

        my $url = "http://www.bbc.co.uk/weather/5day.shtml?id=$area";
        my $html = get $url or die "Failed to get html: $!, $^E";

        my $p = HTML::TokeParser::Simple->new(\$html);

        while (my $t = $p->get_tag('div')){
          last if $t->get_attr('class') and $t->get_attr('class') eq 'display';
        }

        my @data;
        while (my $t = $p->get_token){
          last if $t->is_end_tag('div');
          push @data, $t->as_is if $t->is_text;
        }

        # raw data
        #for (0..$#data){
        #  print "$_: $data[$_]\n";
        #}

        my ($dir, $speed) = split(/\(/, $data[5]);

        print <<EOP;
        Sky: $data[2]
        Temp: $data[3]
        Wind Direction: $dir
        Speed: $speed
        Humidity$data[11]
        Pressure$data[15]
        Visibility$data[17]
        EOP

        __DATA__
        Sky: cloudy
        Temp: 14
        Wind Direction: SW
        Speed: 10
        Humidity: 80,
        Pressure: 1004, Rising,
        Visibility: Very good

        ## raw data ##
        0: Current Nearest Observations
        1: :
        2: cloudy
        3: 14
        4: C
        5: SW (10
        6: mph
        7: )
        8: Relative Humidity (
        9: &#37;
        10: )
        11: : 80,
        12: Pressure (
        13: mB
        14: )
        15: : 1004, Rising,
        16: Visibility
        17: : Very good

      Oh, it depends on the data. I mostly agree with you, but I also know that it can be quite appropriate to just grab stuff out of the middle with a simpler regex in many cases. Ironically, I find this particularly true of HTML and XML because you are unlikely to find <em>Price:</em> in a place where it doesn't mean the obvious thing. That is, for example, <a href="<em>Price:</em>"> is very unlikely, even if XML (stupidly) doesn't just outright forbid it (it should, IMIO, require the inner angle brackets to be escaped).

      But, of course, there are HTML comments and XML CDATA that do allow for such confusing situations (and I find CDATA a pretty pointless feature). In practice, I find that it is often quite safe to discount this risk, however.

      My personal theory is that this mantra against using regexes against HTML (etc.) probably originated from more narrow situations where there were repeated attempts by novice coders to do more systematic manipulation of HTML. For example, someone proposing to use

      /href="([^"]+)"/g

      in a web spider is more deserving of being warned to use a proper parser than someone trying to pull one stock price from an internal web site for personal use.
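A small illustration of why that pattern earns the warning in a spider. The HTML fragment and the sub name naive_links are made up for the purpose. On this input the regex should find only the double-quoted link, miss the single-quoted and unquoted ones, and also pick up the link inside the comment.

```perl
use strict;
use warnings;

# Hypothetical HTML fragment, invented for illustration.
my $html = <<'HTML';
<a href="/one">one</a>
<a href='/two'>two</a>
<a href=/three>three</a>
<!-- <a href="/commented-out">old</a> -->
HTML

# The naive extraction: every double-quoted href value, wherever it appears.
sub naive_links {
    my ($doc) = @_;
    return $doc =~ /href="([^"]+)"/g;
}

print "$_\n" for naive_links($html);
```

For a one-off grab from a page whose markup you have already looked at, none of this need matter. For a spider meeting arbitrary pages, all of it does.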

      So the warning has some value even in the latter case, but it is more likely to be overstated. I don't see a rash of hot ranting about such things. It is almost always mentioned, which probably annoys some people.

      I rarely see people admit that a simpler regex against something like HTML can often be a reasonable solution, yet I've certainly had much practical success doing such over the years, so I consider myself qualified to state that it can work well. But designing such a solution well requires taking this risk into account, so it can be important to point out the risk. And the response to such comments often encourages restressing the point.

      So I can see how someone would think that this point is often being made dogmatically. But I don't see any value in trying to lump them into some imaginary group or apply insulting language to them. (:

      - tye        
