in reply to Scraping HTML: orthodoxy and reality

Amen, grinder++. I wholeheartedly agree.

I came to much the same conclusion in Being a heretic and going against the party line, after having used Perl for only a relatively short time. My experiences since have done little to change my mind.

Back in that old post I tried to make a distinction between the need to parse HTML and the need to extract something that just happens to be embedded within stuff that happens to be HTML. This distinction was roundly set upon as being wrong. I still hold with this distinction.

The dictionary definition of parse is:

  1. To break (a sentence) down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.
  2. To describe (a word) by stating its part of speech, form, and syntactical relationships in a sentence.
  3.
    a. To examine closely or subject to detailed analysis, especially by breaking up into components: “What are we missing by parsing the behavior of chimpanzees into the conventional categories recognized largely from our own behavior?” (Stephen Jay Gould).
    b. To make sense of; comprehend: I simply couldn't parse what you just said.

Whilst the dictionary definition of extract is:

  1. To draw or pull out, often with great force or effort: extract a wisdom tooth; used tweezers to extract the splinter.
  2. To obtain despite resistance: extract a promise.
  3. To obtain from a substance by chemical or mechanical action, as by pressure, distillation, or evaporation.
  4. To remove for separate consideration or publication; excerpt.
  5.
    a. To derive or obtain (information, for example) from a source.
    b. To deduce (a principle or doctrine); construe (a meaning).
    c. To derive (pleasure or comfort) from an experience.
  6. Mathematics. To determine or calculate (the root of a number).

From my perspective, when the need is to locate and capture one or more pieces of information from within any amount or structure of other stuff, without regard to the structural or semantic positioning of those pieces within the overall structure, the term extraction is more applicable than parsing. If I need to understand the structure, derive semantic meaning from the structure or verify its correctness, then I need to parse; otherwise I just need to extract. After all, Practical Extraction and Reporting is what that Language was first designed to do.
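
To make the distinction concrete, here is a minimal sketch of extraction in this sense. It is written in Python purely for illustration; the HTML fragment, the pattern and the name extract_prices are all made up for the example, and nothing in it tries to understand the markup.

```python
import re

# Hypothetical page fragment; the markup around the prices is irrelevant here.
html = """
<table><tr><td class="x"><b>Widget</b></td>
<td>Price: <font color=red>$19.99</font></td></tr>
<tr><td><b>Gadget</b></td><td>Price: $5.00</td></tr></table>
"""

# Extraction: find the pieces we want and ignore the structure around them.
PRICE = re.compile(r"\$(\d+\.\d{2})")

def extract_prices(text):
    """Return every dollar amount found in text, as floats, in document order."""
    return [float(amount) for amount in PRICE.findall(text)]

print(extract_prices(html))  # prints [19.99, 5.0]
```

The same pattern would work unchanged if the prices were buried in plain text or a log file, which is the sense in which this is extraction rather than parsing.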

My final, and strongest, argument lies in a simple premise. If the information I was after was embedded amongst a lot of Arabic, Greek or Chinese, then no one would expect me to find and use a module that understood those languages, just to extract the bits I needed.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Re: Re: Scraping HTML: orthodoxy and reality
by tilly (Archbishop) on Jul 08, 2003 at 16:25 UTC
    Ironically, the distinction that you draw is the same one that I use to argue against using regular expressions for parsing problems.

    Regular expressions are designed as a tool for locating specific patterns in a sea of stuff. (Well, until Perl 6 that is...) Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it. Parsing is a lot more work, but for structured text it is going to give much more robust solutions. For instance, you avoid different kinds of data being mistaken for each other.
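
A small sketch of the kind of mix-up meant by "different kinds of data being mistaken for each other", in Python for illustration. The fragment and the class name TextOnly are hypothetical; html.parser is Python's standard-library HTML tokenizer, used here only to tell visible text apart from comments and attributes.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: one visible price, one in a comment, one in an attribute.
html = '<p>Price: $5.00</p><!-- old price: $9.99 --><img alt="$1.00 off">'

PRICE = re.compile(r"\$\d+\.\d{2}")

# A regex over the raw text cannot tell the three kinds of data apart.
regex_hits = PRICE.findall(html)

class TextOnly(HTMLParser):
    """Collect price-like strings from text nodes only."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_data(self, data):
        # Called for text between tags; comments and attributes never get here.
        self.hits.extend(PRICE.findall(data))

parser = TextOnly()
parser.feed(html)
parser.close()

print(regex_hits)   # prints ['$5.00', '$9.99', '$1.00']
print(parser.hits)  # prints ['$5.00']
```

Whether the extra hits matter depends entirely on the job at hand; the sketch only shows where the two approaches diverge.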

    The problem is that people are used to using regular expressions for text manipulation, and then set out to solve what is really a parsing problem with regular expressions. Then they fail (and may or may not realize this). This happens so routinely that the knee-jerk response is that virtually anything which can be done with parsing should be, rather than with regular expressions. And indeed this is good advice to give to someone who doesn't understand the parsing wheels - if only to avoid the problem of all problems looking like nails for the one hammer (regexps) that you have.

    However, the two kinds of problems are different, yet they do overlap. Where they do overlap, it isn't necessarily obvious which is more practical. It isn't even necessarily obvious from the problem specification - sometimes you need to make a guess about how the code will evolve to know that...

      Parsing is the task of taking structured information and analyzing the structure. This is a very different task, and regular expressions (as they currently are) are simply not designed to do it.

      Parsing typically has two phases though: the first is Tokenization and the second Parse Tree Generation (I'm sure there is a better term but I forget what it is). These phases more often than not occur in sync, but they need not. Either way, regexes are perfectly suited to tokenization.
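
As a rough sketch of the tokenization phase done with a regex (Python for illustration; the token names, the toy arithmetic grammar and the function tokenize are assumptions made up for this example, not anything from the tokenizer mentioned in this thread):

```python
import re

# One alternation, one named group per token type. Order matters:
# earlier alternatives win when more than one could match.
TOKEN = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)   # integer or decimal literal
  | (?P<IDENT>[A-Za-z_]\w*)     # identifier
  | (?P<OP>[-+*/()])            # single-character operator or parenthesis
  | (?P<SKIP>\s+)               # whitespace, discarded
  | (?P<MISMATCH>.)             # anything else is an error
""", re.VERBOSE)

def tokenize(text):
    """Split text into (kind, value) pairs; raise ValueError on stray characters."""
    tokens = []
    for match in TOKEN.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(
                f"unexpected character {match.group()!r} at offset {match.start()}"
            )
        tokens.append((kind, match.group()))
    return tokens

for token in tokenize("2 * (x + 10.5)"):
    print(token)
```

The output is a flat list of tokens; building a tree from it (matching the parentheses, applying precedence) is the second phase, and that is the part a single regex does not do.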

      I learned the most about regexes from writing a regex tokenizer and parser. I learned a lot more from the tokenizer than from the parser tho. :-) Writing regexes to tokenize regexes is a fun head trip. (Incidentally the whole idea was to be able to use regexes to specify and generate random test data.)


      <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...