in reply to Scraping HTML: orthodoxy and reality
Amen grinder++. I wholeheartedly agree.
I came to much the same conclusion in Being a heretic and going against the party line, after having used Perl for only a relatively short time. My experiences since have done little to change my mind.
Back in that old post I tried to make a distinction between the need to parse HTML and the need to extract something that just happens to be embedded within stuff that happens to be HTML. This distinction was roundly set upon as being wrong. I still hold with this distinction.
The dictionary definition of parse is:
- To break (a sentence) down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.
- To describe (a word) by stating its part of speech, form, and syntactical relationships in a sentence.
- To examine closely or subject to detailed analysis, especially by breaking up into components: “What are we missing by parsing the behavior of chimpanzees into the conventional categories recognized largely from our own behavior?” (Stephen Jay Gould).
- To make sense of; comprehend: I simply couldn't parse what you just said.
Whilst the dictionary definition of extract is:
- To draw or pull out, often with great force or effort: extract a wisdom tooth; used tweezers to extract the splinter.
- To obtain despite resistance: extract a promise.
- To obtain from a substance by chemical or mechanical action, as by pressure, distillation, or evaporation.
- To remove for separate consideration or publication; excerpt.
- To derive or obtain (information, for example) from a source.
- To deduce (a principle or doctrine); construe (a meaning).
- To derive (pleasure or comfort) from an experience.
- Mathematics. To determine or calculate (the root of a number).
From my perspective, when the need is to locate and capture one or more pieces of information from within any amount or structure of other stuff, without regard to the structural or semantic positioning of those pieces within the overall structure, the term extraction is more applicable than parsing. If I need to understand the structure, derive semantic meaning from the structure, or verify its correctness, then I need to parse; otherwise I just need to extract. After all, Practical Extraction and Reporting is what that Language was first designed to do.
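To make the distinction concrete, here is a minimal sketch of extraction in the sense above. The sample markup, the `price` class and the `extract_prices` helper are all invented for the example. It assumes the prices always appear in exactly this form, and it would need changing if the markup did.

```perl
use strict;
use warnings;

# Hypothetical sample page; the markup and prices are invented for illustration.
my $html = <<'HTML';
<html><body>
<table>
<tr><td class="item">Widget</td><td class="price">$4.50</td></tr>
<tr><td class="item">Gadget</td><td class="price">$12.00</td></tr>
</table>
</body></html>
HTML

# Extraction: capture whatever sits between two known landmarks,
# paying no attention to the structure of the surrounding document.
sub extract_prices {
    my ($text) = @_;
    return $text =~ m{<td class="price">\$([\d.]+)</td>}g;
}

my @prices = extract_prices($html);
print join(', ', @prices), "\n";   # prints 4.50, 12.00
```

Nothing here knows or cares that the surrounding text is HTML. If the task required knowing which row or table a price belonged to, or checking that the markup was well formed, that would be parsing, and a real parser would be the better tool.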
My final, and strongest, argument lies in a simple premise. If the information I was after was embedded amongst a lot of Arabic, Greek or Chinese, then no one would expect me to find and use a module that understood those languages just to extract the bits I needed.
Re: Re: Scraping HTML: orthodoxy and reality
by tilly (Archbishop) on Jul 08, 2003 at 16:25 UTC
by demerphq (Chancellor) on Jul 08, 2003 at 17:00 UTC