|Syntactic Confectionery Delight|
Re: Scraping HTML: orthodoxy and realityby BrowserUk (Pope)
|on Jul 08, 2003 at 11:13 UTC||Need Help??|
Amen grinder++. I whole heartedly agree.
I came to much the same conclusion in Being a heretic and going against the party line. after having only been using perl for a relatively short time. My experiences since have done little to change my mind.
Back in that old post I tried to make a distinction between the need to parse HTML and the need to extract something that just happens to be embedded within stuff that happens to be HTML. This distinction was roundly set upon as being wrong. I still hold with this distinction.
The dictionary definition of parse is
Whilst the dictionary definition of extract is:
From my perspective, when the need is to locate and capture one or more pieces of information from within any amount or structure of other stuff, without regard to the structural or semantic positioning of those pieces within the overall structure, the term extraction more applicable than parsing. If I need to understand the structure, derive semantic meaning from the structure or verify its correctness, then I need to parse, otherwise I just need to extract. After all, Practical Extraction and Reporting is what that Language was first designed to do.
My final, and strongest argument lies in a simple premise. If the information I was after was embedded amongst a lot of Arabic, Greek or Chinese, then noone would expect me to find and use a module that understood those languages, just to extract the bits I needed.
Examine what is said, not who speaks."Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller