PerlMonks
Re: Breaking The Rules II

by BrowserUk (Pope)
on Jul 02, 2007 at 15:51 UTC ( #624496 )


in reply to Breaking The Rules II

I am so glad that you have posted this, even in its incomplete state.

Maybe. Just maybe. It will stop the "Ew. The post contains the words 'parse', so it must be a job for Parse::RecDescent!" crowd from wobbling their collective gums quite so frequently.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^2: Breaking The Rules II
by Porculus (Hermit) on Jul 02, 2007 at 20:57 UTC

    To be fair, while it's not a panacea, Parse::RecDescent is actually a pretty good tool for exactly this kind of job. When I recently tried a similar exercise myself, my first read-through of the manual left me sceptical, but in practice I was pleasantly surprised at how easy it turned out to be to parse expressions of this precise sort with Parse::RecDescent. Whereas I've never managed to get my head round the LALR(1) parsing that Limbic~Region found so intuitive.

    Though I suppose that just reinforces the point that one shouldn't assume that a tool is the right tool for one's own project just because other people prefer it...

    (I do find myself wondering whether the love affair with Yapp would have continued beyond the first shift/reduce conflict...)

      I wasn't targeting P::RD. It is a perfectly fine module for those situations where you need to extract full semantic information from the language you are analysing. But even for this, it's certainly not the only game in town, nor necessarily the best choice for any given application.

      With respect to shift/reduce conflicts and Parse::Yapp: it's possible to construct ambiguous grammars regardless of which type of parser one targets, and equally possible to resolve them.

      My main point was that parsers in general aren't an easy-to-learn-and-use alternative to regexes. Especially since, a lot of the time, when people say "I want to parse ...", they don't want to parse at all. They simply want to extract some information that happens to be embedded within some other information.

      For example, for the vast majority of screen scraping applications, the user has no interest whatsoever in extracting any semantic or syntactic information from the surrounding text. Even if that surrounding text happens to be in a form that may or may not comply with one of the myriad variations of some gml-like markup.

      Their only interest is locating a specific piece of text that happens to be embedded within a lot of other text. There may be some clues in that other text that they will need in order to locate the text they are after, but they couldn't give two hoots whether that other text is self-consistent with some GML/HTML/XHTML standard.

      For this type of application, not only does parsing the surrounding HTML require a considerable amount of effort and time--both programmer time and processor time--but, given the flexibility of browsers to DWIM with badly written HTML/XHTML, it would often set the programmer on a hiding to nothing to even try. Luckily, HTML::Parser and friends are pragmatically and specifically written to gloss over the finer points of those standards and operate in a manner that DWTALTPWAMs (Do What The Average, Less Than Perfect, Web Author Means).

      Even so, after 5 years, I have yet to see any convincing argument against the opinions I expressed when I wrote Being a heretic and going against the party line. I still find it far quicker to use a 'bunch of regex' to extract the data I want from the average, subject-to-change website than to work out which combination of modules and methods is required to 'do it properly'. And when things change, I find it easier to adjust the regex than to figure out which other module or modules and methods I now require.
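      To make the 'bunch of regex' approach concrete, here is a minimal sketch. The HTML fragment and the class name "price" are invented for illustration; the point is that the match targets only the datum of interest and ignores whether the surrounding markup is well-formed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical fragment of a scraped page; note it needn't be valid markup.
my $html = <<'HTML';
<div class="product"><h2>Widget</h2>
<span class="price">&pound;9.99</span></div>
HTML

# Pull out just the piece we care about; everything else is skipped.
my ($price) = $html =~ m{<span\s+class="price">([^<]+)</span>};
print "$price\n";    # prints &pound;9.99
```

      When the site changes, the fix is typically a one-line tweak to the pattern rather than a hunt through a module's object model.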

      I think that there is an element of 'laziness gone too far' in the dogma that regex is "unreadable, unmaintainable and hard". It is a complex tool with complex rules, just like every parser out there. You have to learn to use it, just as with every other parsing tool out there. It has limitations, just like every other parser out there.

      And there are several significant advantages of learning to use regex, over every other parsing tool out there.

      1. It's always available.
      2. It is applicable to every situation.

        Left recursive; right recursive; top down; bottom up; nibbling; lookahead; maximal chunk; whatever.

      3. You have complete control.

        Need to perform some program logic part way through a parse? No problem, use /gc and while.

        Need to parse a datastream on the fly? No problem, the same technique applies.

        Want to skip over stuff that doesn't matter to your application? No problem. Parse what you need to; skip over what you don't. You don't have to cater for all eventualities, nor restrict yourself to dealing with data that complies with some formalised, published set of rules.

      4. It's fast.
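      Point 3 can be sketched concretely. This is a minimal example, with an invented key=value input format, of interleaving program logic with a /gc parse: each branch resumes at \G where the last match left off, failed matches leave pos() alone thanks to /c, and anything unrecognised is simply skipped rather than treated as an error.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $input = 'foo=1; bar=22; <noise> baz=333';
my %value;

while ( 1 ) {
    if    ( $input =~ m/\G\s+/gc )         { next }                 # skip whitespace
    elsif ( $input =~ m/\G(\w+)=(\d+)/gc ) { $value{$1} = $2 }      # a pair we care about
    elsif ( $input =~ m/\G\z/gc )          { last }                 # end of input
    else                                   { $input =~ m/\G./gcs }  # skip anything else
}

print "$_=$value{$_}\n" for sort keys %value;
```

      Because the loop owns the scan position, you can run arbitrary program logic between tokens, or feed the loop from a stream a chunk at a time.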

      Mostly, take the time to learn to use regex well and you'll not need to run off to CPAN to grab, and spend time learning to use, one of ten new modules, each of which purports to do what you need, but each of which has its own set of limitations and caveats.

      I have a regex-based parser for math expressions, with precedence and identifiers, assignment and variadic functions. It's all of 60 lines, including the comprehensive test suite! One day I'll get around to cleaning it up and posting it somewhere.
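      One possible shape for such a parser -- not the actual (unposted) code, just a sketch of the technique -- uses plain recursive descent for precedence, with all the tokenising done by \G ... /gc regexes against a shared string. It handles + - * /, parentheses, unary minus, named variables and assignment; the variadic functions mentioned above are left out to keep it short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ( $src, %var );

sub expr {                             # expr := term (('+'|'-') term)*
    my $v = term();
    while ( $src =~ m/\G\s*([+-])/gc ) {
        my ( $op, $t ) = ( $1, term() );
        $v = $op eq '+' ? $v + $t : $v - $t;
    }
    return $v;
}

sub term {                             # term := factor (('*'|'/') factor)*
    my $v = factor();
    while ( $src =~ m{\G\s*([*/])}gc ) {
        my ( $op, $f ) = ( $1, factor() );
        $v = $op eq '*' ? $v * $f : $v / $f;
    }
    return $v;
}

sub factor {                           # number | '-' factor | variable | '(' expr ')'
    $src =~ m/\G\s*/gc;
    return $1        if $src =~ m/\G(\d+(?:\.\d+)?)/gc;
    return -factor() if $src =~ m/\G-/gc;
    if ( $src =~ m/\G([A-Za-z_]\w*)/gc ) {
        exists $var{$1} or die "undefined variable '$1'\n";
        return $var{$1};
    }
    if ( $src =~ m/\G\(/gc ) {
        my $v = expr();
        $src =~ m/\G\s*\)/gc or die "missing ')'\n";
        return $v;
    }
    die "parse error at '" . substr( $src, pos($src) // 0 ) . "'\n";
}

sub evaluate {                         # [identifier '='] expr
    $src = shift;
    return $var{$1} = expr() if $src =~ m/\G\s*([A-Za-z_]\w*)\s*=(?!=)\s*/gc;
    return expr();
}

print evaluate('x = 2 + 3 * 4'), "\n";    # 14
print evaluate('(x - 4) / 5'),   "\n";    # 2
```

      Note the trick in evaluate(): if the assignment pattern fails, /c leaves pos() untouched, so expr() simply picks up from the start of the string.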


        Parsing HTML with a parser:
        "...far quicker to use a 'bunch of regex' to extract the data I want... ...when things change, I find it easier to adjust the regex than figure out which other module or modules and methods I now require. ... I think that there is an element of 'laziness gone to far'...".
        No!

        Really, how much real-world HTML have you had to deal with? Some monks do indeed argue that "you can parse HTML with a regex" (e.g. tye), but I've never seen any argue against using a parser. Most monks urge that a parser at least be considered.

        As with other aspects of Perl where there are many ways to do it, you tend to settle on what you're most comfortable with. So, naturally, monks will have different favorites. Mine is HTML::TokeParser::Simple, a wonderful module. :-)

        But there's more!

        Mostly, take the time to learn to use regex well and you'll not need to run off to CPAN to grab, and spend time learning to use, one of ten new modules, each of which purports to do what you need, but each of which has its own set of limitations and caveats.
        Replace the word 'regex' with 'Perl' and you'll have encompassed all of CPAN. What CPAN modules do you use? If you took the time to learn Perl well, you'd not need any. And I say that with no limitations or caveats. :-)

        P.S. Could you kindly leave my gums out of this?
