in reply to
HTML Parsing (ick)
What you would have liked to have done (at the time) was to wrap each significant piece of content in a <span> carrying an identifying class=, even if no style was ever defined for that class. The class name would then have served as a semantic tag that conclusively identifies, within the data stream itself, which bits of content are the relevant ones, so that an XPath expression (like the one shown in a previous comment) could be used consistently to extract them. Without such tags, “parsing the HTML is the easy part, and reliably picking out the data within that HTML is the hard part.” The job will depend on finding totally reliable place markers within the templates, and on making 100% sure that you get all the right data in every case.
That being said ... what are your chances, now, of being able to make changes to the templates which (I hope ...) drive the production of those web pages? Or do you work for someone else now? ;-) If you could add span tags with dummy class names, that certainly would make this job far easier and more reliable. (With such tags, and provided the pages are well-formed enough to be read as XML, the whole job could be done using XSLT stylesheets.)
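
To make the idea concrete, here is a minimal sketch in Python (standard library only) of pulling data out by class name. Everything specific in it is an assumption made for illustration: the page fragment, the class names "product-name" and "price", and the helper extract_by_class are all invented, and the fragment is deliberately well-formed so that an XML parser will accept it. Real-world HTML often is not, and would need a more forgiving parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical template output. The class names "product-name" and "price"
# are invented for this example; they carry no styling, only meaning.
PAGE = """\
<div>
  <h1>Catalog</h1>
  <p>Item: <span class="product-name">Widget</span>,
     price <span class="price">9.95</span></p>
  <p>Item: <span class="product-name">Gadget</span>,
     price <span class="price">12.50</span></p>
</div>
"""


def extract_by_class(markup, class_name):
    """Return the text of every <span> whose class attribute equals class_name.

    Assumes the markup is well-formed XML. The match is an exact comparison
    on the whole attribute value, so class="price big" would NOT match "price".
    """
    root = ET.fromstring(markup)
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    xpath = f".//span[@class='{class_name}']"
    return [(span.text or "").strip() for span in root.iterfind(xpath)]


names = extract_by_class(PAGE, "product-name")
prices = extract_by_class(PAGE, "price")
records = list(zip(names, prices))
print(records)
```

The point is that the extraction code never has to know where in the page layout the data lives; it only has to know the class names, so the surrounding markup can change freely without breaking it.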