PerlMonks
Re: Scraping HTML: orthodoxy and reality
by chunlou (Curate) on Jul 08, 2003 at 19:17 UTC ( [id://272397]=note )
"Parse" vs "extract" or "regular language" vs "context free," etc. are indeed important distinctions to be made, as pointed out by some monks. Parsing data is a (more or less) mechanical process; extracting info is a human (A.I.) process. Suppose you want to extract info by paragraph. Consider the following text fragment: ________________________________________ Look at the table below...
Could you behold the secret this unfolds? A bit more, a bit more, irrelevant thought, a new paragraph... ________________________________________ You might see either two or three paragraphs (if you consider "Look... unfolds?" as one paragraph). Now, let's look at the html of the above text fragment:
A parser might see only one paragraph, between the <p> and </p> tags. There is also a <p></p> pair in the table. Is that a paragraph, a parser might ask? Suppose the parser takes into account that some people use <br><br> to mark the end of a paragraph. Then "Look..." and "Could..." might be considered two paragraphs. What about "A bit..."? Or are "Look..." and the table two paragraphs? Humans read semantically; machines read mostly syntactically. That's why extracting info is not the same problem as parsing data.
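To make the point concrete, here is a rough sketch in Python of two purely syntactic "paragraph" rules applied to the same markup. The HTML string is a made-up stand-in of roughly the shape described above (an outer <p>, a table holding an empty <p></p>, and <br><br> breaks); it is not the markup from the original post, and the function names are hypothetical.

```python
import re

# ASSUMPTION: hypothetical markup, not the original post's HTML.
HTML = (
    "<p>Look at the table below...<br><br>"
    "<table><tr><td><p></p></td></tr></table>"
    "Could you behold the secret this unfolds?<br><br>"
    "A bit more, a bit more, irrelevant thought, a new paragraph...</p>"
)


def count_p_tags(html):
    """Rule 1: every opening <p> tag starts a paragraph."""
    return len(re.findall(r"<p\b[^>]*>", html, flags=re.IGNORECASE))


def br_chunks(html):
    """Rule 2: two or more consecutive <br> tags end a paragraph."""
    pieces = re.split(r"(?:<br\s*/?>\s*){2,}", html, flags=re.IGNORECASE)
    texts = []
    for piece in pieces:
        # Replace remaining tags with spaces, then collapse whitespace.
        text = " ".join(re.sub(r"<[^>]+>", " ", piece).split())
        if text:
            texts.append(text)
    return texts


print(count_p_tags(HTML))    # 2, one of which is the empty <p></p> in the table
print(len(br_chunks(HTML)))  # 3
```

Both rules are mechanical and both are defensible, yet they disagree with each other, and neither can say whether the table belongs with "Look..." or stands alone. Deciding that takes a reading of the content, not just the tags.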