Re: Scraping HTML: orthodoxy and reality
by chunlou (Curate)
on Jul 08, 2003 at 19:17 UTC
"Parse" vs. "extract," or "regular language" vs. "context-free," are indeed important distinctions, as some monks have pointed out. Parsing data is a (more or less) mechanical process; extracting info is a human (or A.I.) process.
Suppose you want to extract info by paragraph. Consider the following text fragment:
Look at the table below...
Could you behold the secret this unfolds?
A bit more, a bit more, irrelevant thought, a new paragraph...
________________________________________
You might see either two or three paragraphs (two if you consider "Look... unfolds?" as one paragraph). Now, let's look at the HTML of the above text fragment:
A parser might see only one paragraph, the one between the <p> and </p> tags. But there is also a <p></p> pair inside the table. Is that a paragraph, the parser might ask?
Suppose the parser takes into consideration that some people use <br><br> to denote the end of a paragraph. "Look..." and "Could..." might be considered two paragraphs. What about "A bit..."? Or are "Look..." and the table two paragraphs?
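To make the ambiguity concrete, here is a minimal Python sketch that applies both syntactic rules to the same input: "every <p> tag is a paragraph" and "<br><br> ends a paragraph". The markup in FRAGMENT is a hypothetical stand-in assumed for illustration, not the HTML from this post, and ParagraphCounter and count_paragraphs are made-up names.

```python
from html.parser import HTMLParser

# Hypothetical stand-in markup (an assumption, not the post's original HTML).
FRAGMENT = (
    "<p>Look at the table below...<br><br>"
    "<table><tr><td><p></p></td></tr></table>"
    "Could you behold the secret this unfolds?<br><br>"
    "A bit more, a bit more, irrelevant thought, a new paragraph...</p>"
)


class ParagraphCounter(HTMLParser):
    """Collects two purely syntactic paragraph counts in one pass."""

    def __init__(self):
        super().__init__()
        self.p_tags = 0            # heuristic 1: every <p> start tag is a paragraph
        self.chunks = [""]         # heuristic 2: text split at each <br><br>
        self._prev_was_br = False  # True right after a single <br>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_tags += 1
        if tag == "br" and self._prev_was_br:
            self.chunks.append("")  # second <br> in a row: start a new chunk
            self._prev_was_br = False
        else:
            self._prev_was_br = (tag == "br")

    def handle_endtag(self, tag):
        if tag != "br":
            self._prev_was_br = False  # another tag closed between two <br>

    def handle_data(self, data):
        if data.strip():
            self._prev_was_br = False  # real text between two <br> breaks the pair
        self.chunks[-1] += data

    def paragraphs(self):
        """Non-empty text chunks under the <br><br> heuristic."""
        return [c.strip() for c in self.chunks if c.strip()]


def count_paragraphs(html):
    """Return (number of <p> tags, list of <br><br>-delimited chunks)."""
    counter = ParagraphCounter()
    counter.feed(html)
    counter.close()
    return counter.p_tags, counter.paragraphs()


if __name__ == "__main__":
    p_tags, chunks = count_paragraphs(FRAGMENT)
    print(p_tags, len(chunks))
```

On this stand-in input the first rule counts two paragraphs (one of them the empty <p></p> in the table) and the second counts three. Neither rule is wrong as parsing; they just answer different questions, and neither knows which answer the reader wanted.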
Humans read semantically; machines, mostly syntactically. That's why extracting info is not the same problem as parsing data.