in reply to
Re: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)
in thread how to quickly parse 50000 html documents?
the example html is truly appaling, using outdated html attributes for presentation, and not providing any elements that specify document structure. it is just the conditions that these sorts of html's live in, that make it conducive to break frequently and unexpectedly...as the person owning the document uses frontpage (or worse) msword to generate html content, with outdated html attributes all over the place. as there's no real structural html elements, with just look/feel elements and data intetmixed, changes to document by owner are usually very naive in terms of valid/sane html. for example you may start seeing at some stage several empty opening and closing font tags. still looks exactly same as before, but the naive generation of html using msword or older frontpage, now breaks the scraping code and you're back at square one trying to figure it out.
whilst scraping (sometimes very bad) html is inevitable, the way you go about it can make some difference. basing a regex for html scraping on the value of a particular attribute is particularly bad, e.g. don't look for "font size="1">"....if you must base it on the font tag, just look for the tag and nearest closing brace, as an anchor.
i've worked with a federated searching java based engine for some time, and it is exactly when the vendor wrote scraping code to match frequently changing html (e.g. html attributes) that often ended up breaking these scrapers. so instead of moving onto bigger and better things..you end up maitaining a whole lot of scrapers that break all the time, and you're at the pointy end of the "fix it now".
in my opinion the example html is so bad as to be practically of no use, and you might as well use module or whatever to strip html altogether, and just base the scraping of the well defined terms that have a following colon.
btw when i mentioned "naive" html document owner, i don't mean to be nasty, just means they don't know or do any better for whatever reason. it's naive in terms of using html with regard to spec and current best practice.
the hardest line to type correctly is: stty erase ^H