There are several twisty corridors here in the Monastery in which demoronizer cobwebs hang from the ceiling; IMO they're well worth pursuing by anyone interested in cleaning up the .html produced by ANY of MS's Word, Excel or supposedly WYSIWYG products. Look under the covers, and what you get is remarkable bloat and non-conformant code.
So, a few keywords for future Super_Searchers: "HTML, html, MS, Microsoft, Office, Word, Excel, FrontPage, PowerPoint, Publisher, cleanup, parse" ...and there surely could be more (arguably even Notepad, which, when in word-wrap mode, adds MS-ish line ends at every displayed wrap position).
But (...sigh!)... even the latest Word->html output does not exactly demonstrate that the allegedly-enlightened giant in Redmond has learned to avoid making the same mistakes in different (i.e., incompatible) ways.
...and, oh yes, a (deprecated) disclaimer: I don't hate W32; I just hate cleaning up MS .html to W3C standards. Fair warning, also: I should probably use a sig like "html 4.01 dinosaur".