in reply to
Re: Best of the Best Users in Perlmonks site
in thread Best of the Best Users in Perlmonks site
There are several twisty corridors here in the Monastery in which demoronizer cobwebs hang from the ceiling; IMO they're well worth pursuing by anyone interested in cleaning up the .html produced by ANY of MS's Word, Excel or other supposedly WYSIWYG products. Look under the covers, and what you get is remarkable bloat and non-conformant code.
So, a few keywords for future Super_Searchers: "HTML, html, MS, Microsoft, Office, Word, Excel, FrontPage, PowerPoint, Publisher, cleanup, parse" ...and there surely could be more (arguably even Notepad, which when in word-wrap mode adds MS-ish line ends at every displayed wrap position).
davidrw and astroboy offered links to useful alternate tools in Word HTML issues. There is also a bit of discussion re the issues implied in samtregar's remark in this thread.
Self-updating of demoronizer is laid out very nicely by derby in Re^3: Reg Ex to strip MS smart quotes.
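For anyone who hasn't met the problem yet, here is a minimal sketch of the sort of substitution this kind of cleanup performs. It is emphatically not demoronizer's own code: the name `demoronize` and the table `SMART_TO_ASCII` are hypothetical, the table covers only a handful of characters, and it assumes the input bytes really are Windows-1252.

```python
# Hypothetical helper, NOT taken from demoronizer: map a few Windows-1252
# "smart" punctuation characters to plain ASCII equivalents.
SMART_TO_ASCII = {
    "\u2018": "'",    # left single quote   (cp1252 byte 0x91)
    "\u2019": "'",    # right single quote  (cp1252 byte 0x92)
    "\u201c": '"',    # left double quote   (cp1252 byte 0x93)
    "\u201d": '"',    # right double quote  (cp1252 byte 0x94)
    "\u2013": "-",    # en dash             (cp1252 byte 0x96)
    "\u2014": "--",   # em dash             (cp1252 byte 0x97)
    "\u2026": "...",  # ellipsis            (cp1252 byte 0x85)
}


def demoronize(raw: bytes) -> str:
    """Decode bytes as Windows-1252 (an assumption about the input) and
    replace the characters listed above with ASCII."""
    # Bytes that cp1252 leaves undefined become U+FFFD instead of raising.
    text = raw.decode("cp1252", errors="replace")
    return text.translate(str.maketrans(SMART_TO_ASCII))


if __name__ == "__main__":
    print(demoronize(b"\x93Hello\x94 \x96 it\x92s\x85"))
```

A real cleanup pass also has to deal with the markup bloat itself (Office-specific attributes, conditional comments and the like), which a character table like this does not touch.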
But (... sigh! )...even the latest Word->html output does not exactly demonstrate that the allegedly-enlightened giant in Redmond has learned to avoid making the same mistakes in different (i.e., incompatible) ways.
...and, oh yes, a (deprecated) disclaimer: I don't hate W32; I just hate cleaning up MS .html to W3C standards.
Fair warning, also: I should probably use a sig like html 4.01 dinosaur