Re^3: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!) by BrowserUk (Pope)
on Nov 26, 2010 at 04:34 UTC
in my opinion the example html is so bad as to be practically of no use, and you might as well use module or whatever to strip html altogether, and just base the scraping of the well defined terms that have a following colon.
There is a simple maxim taught to me by my first boss in programming: don't do what won't benefit you.
All we have to go on is that bad html snippet the OP posted. In all likelihood, all he has to go on is that html snippet grabbed from whatever website it came from. We could try to predict what might happen in the future and cater for it, but the highest probability is that whatever we guess will be wrong.
The only sensible thing to do is work with what we know. And what we know for now is that the simple regex used works. If, in the future, the html changes, then the 5 minutes it took to construct the program above may need to be repeated. If it then changes again, maybe there would be some pattern to the change that might suggest a better approach. But it might never change; and any effort expended now to try and cater for unknown changes that might never happen would be entirely wasted.
If these numbers were embedded in a plain text document, no one here would blink an eye about using regex. But add a few <> into the mix and suddenly many start trotting out cargo-cult wisdom: "Don't parse HTML/XML/XHTML/whatever with regex"; completely missing that most of the time nobody wants to parse the html; they just want to extract some small subset of text from a larger set of text. I.e. they want to do exactly what regexes are designed to do.
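To make the "text within text" point concrete, here is a minimal sketch in Python. The page fragment, the field labels and the pattern are all hypothetical: they are neither the OP's actual html nor the regex from the program referred to above. It only shows the general idea of pulling a few labelled numbers out of markup without parsing the markup.

```python
import re

# Hypothetical page fragment; the OP's real html is not shown in this post.
PAGE = """
<td><font size="1">Height: <b>182</b> cm</font></td>
<td><font size="1">Weight: <b>75.5</b> kg</font></td>
"""

# A label, a colon, then lazily skip anything (tags included)
# up to the first number that follows. The labels are assumptions.
FIELD = re.compile(r"(Height|Weight):.*?(\d+(?:\.\d+)?)", re.S)

def extract_fields(text):
    """Return {label: number} for every labelled number found in text."""
    return {label: float(value) for label, value in FIELD.findall(text)}

if __name__ == "__main__":
    print(extract_fields(PAGE))
```

The pattern never looks at the tags themselves; they are just more characters to skip between the label and the number, exactly as they would be in a plain text file.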
basing a regex for html scraping on the value of a particular attribute is particularly bad, e.g. don't look for "font size="1">"....if you must base it on the font tag, just look for the tag and nearest closing brace, as an anchor.
I'll take your word for the quality or lack thereof of the html, because I neither know nor care. It's just text within text to me.
For now, what I've suggested to the OP works. And it works 500 times more quickly than his existing solution. If he gets to use it once before the source changes, he can afford to spend 3 working days re-writing it and still have gained. And it took me less than 5 minutes to write this version and maybe 10 to test it; most of which was taken up generating 1000 test pages. If he gets to use it 10 times, he's saved himself enough time to take a month's vacation.
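The arithmetic behind those figures can be checked with a short sketch. It assumes that the "3 minutes" in the thread title is the run time the 500x figure applies to, and that a working day is 8 hours; neither assumption is stated in the post itself, and the constant and function names are made up for illustration.

```python
# Figures taken from the thread: 50,000 pages in 3 minutes, 500 times faster.
NEW_RUN_MINUTES = 3
SPEEDUP = 500
HOURS_PER_WORKING_DAY = 8  # assumption, not stated in the post

def working_days_saved(runs):
    """Working days of run time saved after `runs` uses of the fast version."""
    old_run_minutes = NEW_RUN_MINUTES * SPEEDUP  # 1500 minutes, i.e. 25 hours
    saved_minutes = (old_run_minutes - NEW_RUN_MINUTES) * runs
    return saved_minutes / 60 / HOURS_PER_WORKING_DAY

if __name__ == "__main__":
    for runs in (1, 10):
        print(runs, "run(s):", round(working_days_saved(runs), 1), "working days")
```

Under those assumptions one run saves a little over 3 working days and ten runs save a little over 31, which is in line with the "3 working days" and "a month's vacation" estimates.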
It's simple. It works. Job done. And if it requires change next week, or next month or next year, it is simple enough that it won't require deep knowledge of half a dozen co-dependent packages and APIs in order to fix it.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.