Re^3: Breaking The Rules II
by BrowserUk (Pope) on Jul 02, 2007 at 23:12 UTC
I wasn't targeting P::RD. It is a perfectly fine module for those situations where you need to extract full semantic information from the language you are analysing. But even for this, it's certainly not the only game in town, nor necessarily the best choice for any given application.
With respect to shift/reduce conflicts and Parse::YAPP: It's possible to construct ambiguous grammars regardless of which type of parser one targets, and equally possible to resolve them.
My main point was that parsers in general aren't an easy-to-learn, easy-to-use alternative to regex. Especially as, a lot of the time, when people say: I want to parse ...; they don't want to parse at all. They simply want to extract some information that happens to be embedded within some other information.
For example, for the vast majority of screen scraping applications, the user has no interest whatsoever in extracting any semantic or syntactic information from the surrounding text. Even if that surrounding text happens to be in a form that may or may not comply with one of the myriad variations of some gml-like markup.
Their only interest is locating a specific piece of text that happens to be embedded within a lot of other text. There may be some clues in that other text that they will need to locate the text they are after, but they couldn't give two hoots whether that other text is self-consistent with some gml/html/xhtml standard.
For this type of application, parsing the surrounding HTML requires a considerable amount of effort and time, both programmer time and processor time. And given the flexibility of browsers to DWIM with badly written HTML/XHTML, it would often set the programmer on a hiding to nothing to even try. Luckily, HTML::Parser and friends are pragmatically and specifically written to gloss over the finer points of those standards and operate in a manner that DWTALTPWAMs (Do What The Average, Less Than Perfect, Web Author Means).
Even so, after 5 years, I have still to see any convincing argument against the opinions I expressed when I wrote Being a heretic and going against the party line. I still find it far quicker to use a 'bunch of regex' to extract the data I want from the average, subject-to-change, website than to work out which combination of modules and methods is required to 'do it properly'. And when things change, I find it easier to adjust the regex than to figure out which other module or modules and methods I now require.
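To make the 'bunch of regex' approach concrete, here is a minimal sketch. The markup, the "price" labels and the %price hash are all made up for illustration; the point is only that the label text serves as the landmark and the surrounding tag soup is skipped rather than parsed.

```perl
use strict;
use warnings;

# Hypothetical, deliberately sloppy markup: unclosed tags, mixed case,
# unquoted attributes. None of that matters to the extraction.
my $html = <<'HTML';
<HTML><body><P>Welcome<br>
<table><TR><td class=label>Widget price:<TD><b>$19.99</b>
<tr><td class=label>Gadget price:<td><B>$5.00</b>
</body>
HTML

# Use the label text as the landmark, step over any intervening tags,
# and capture the amount.
my %price;
while ( $html =~ m{(\w+)\s+price:\s*(?:<[^>]*>\s*)*\$(\d+\.\d\d)}gi ) {
    $price{ lc $1 } = $2;
}

printf "%s => %s\n", $_, $price{$_} for sort keys %price;
# prints:
# gadget => 5.00
# widget => 19.99
```

If the site moves the amount into a different tag, or adds another wrapper around it, the regex above is unaffected; if the label wording changes, there is exactly one pattern to adjust.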
I think that there is an element of 'laziness gone too far' in the dogma that regex is "unreadable, unmaintainable and hard". It is a complex tool with complex rules, just like every parser out there. You have to learn to use it, just as with every other parsing tool out there. It has limitations just like every other parser out there.
And there are several significant advantages of learning to use regex, over every other parsing tool out there.
Mostly, take the time to learn to use regex well and you'll not need to run off to CPAN to grab, and spend time learning to use, one of ten new modules, each of which purports to do what you need, but each of which has its own set of limitations and caveats.
I have a regex based parser for math expressions, with precedence and identifiers, assignment and variadic functions. It's all of 60 lines including the comprehensive test suite! One day I'll get around to cleaning it up and posting it somewhere.
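For a flavour of what such a thing can look like, here is a minimal sketch. It is not the 60-line parser mentioned above, just an illustration of the general idea: \G-anchored regexes with /gc do the tokenising and a small precedence-climbing loop does the rest. The names evaluate, expr, primary, %op and %var are all hypothetical, and variadic functions are left out to keep it short.

```perl
use strict;
use warnings;

my %var;    # identifier => value, filled in by assignments

# Binary operators: [ precedence, right-associative?, implementation ]
my %op = (
    '+'  => [ 1, 0, sub { $_[0] + $_[1] } ],
    '-'  => [ 1, 0, sub { $_[0] - $_[1] } ],
    '*'  => [ 2, 0, sub { $_[0] * $_[1] } ],
    '/'  => [ 2, 0, sub { $_[0] / $_[1] } ],
    '**' => [ 3, 1, sub { $_[0]**$_[1] } ],
);

# Both routines take a reference to the source string and consume it
# from pos() using \G-anchored matches; /c keeps pos() on failure.
sub primary {
    my $s = shift;
    return $1 + 0 if $$s =~ /\G\s*(\d+(?:\.\d+)?)/gc;
    if ( $$s =~ /\G\s*\(/gc ) {
        my $v = expr( $s, 1 );
        $$s =~ /\G\s*\)/gc or die "Expected ')'\n";
        return $v;
    }
    if ( $$s =~ /\G\s*-/gc ) {    # unary minus binds looser than **
        my $v = expr( $s, 3 );
        return -$v;
    }
    if ( $$s =~ /\G\s*([A-Za-z_]\w*)/gc ) {
        exists $var{$1} or die "Unknown identifier '$1'\n";
        return $var{$1};
    }
    die "Syntax error at offset " . ( pos($$s) // 0 ) . "\n";
}

sub expr {
    my ( $s, $min ) = @_;
    my $lhs = primary($s);
    while ( $$s =~ /\G\s*(\*\*|[-+*\/])/gc ) {
        my ( $tok, $start ) = ( $1, $-[0] );
        my ( $prec, $right, $fn ) = @{ $op{$tok} };
        # Operator binds too loosely for this level: put it back.
        if ( $prec < $min ) { pos($$s) = $start; last }
        my $rhs = expr( $s, $right ? $prec : $prec + 1 );
        $lhs = $fn->( $lhs, $rhs );
    }
    return $lhs;
}

sub evaluate {
    my $src = shift;
    my $name;
    $name = $1 if $src =~ /\G\s*([A-Za-z_]\w*)\s*=(?!=)/gc;    # optional "name ="
    my $v = expr( \$src, 1 );
    $src =~ /\G\s*\z/gc or die "Trailing junk at offset " . pos($src) . "\n";
    $var{$name} = $v if defined $name;
    return $v;
}

print evaluate('2 + 3 * 4'),       "\n";    # 14
print evaluate('x = (2 + 3) * 4'), "\n";    # 20
print evaluate('x / 8 - 1'),       "\n";    # 1.5
print evaluate('2 ** 3 ** 2'),     "\n";    # 512
```

Adding an operator is one more line in %op plus its symbol in the operator regex in expr.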
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.