There are several twisty corridors here in the Monastery in which demoronizer cobwebs hang from the ceiling; IMO they're well worth pursuing by anyone interested in cleaning up the .html produced by ANY of MS's Word, Excel or other supposedly WYSIWYG products. Look under the covers, and what you get is remarkable bloat and non-conformant code.
So, a few keywords for future Super_Searchers: "HTML, html, MS, Microsoft, Office, Word, Excel, FrontPage, PowerPoint, Publisher, cleanup, parse" ...and there surely could be more (arguably even Notepad, which when in word-wrap mode adds MS-ish line ends at every displayed wrap position).
davidrw and astroboy offered links to useful alternate tools in Word HTML issues. There is also a bit of discussion re the issues implied in samtregar's remark in this thread.
Self-updating of demoronizer is laid out very nicely by derby in Re^3: Reg Ex to strip MS smart quotes.
But (... sigh! )...even the latest Word->html output does not exactly demonstrate that the allegedly-enlightened giant in Redmond has learned to avoid making the same mistakes in different (i.e., incompatible) ways.
...and, oh yes, a (deprecated) disclaimer: I don't hate W32; I just hate cleaning up MS .html to W3C standards.
Fair warning, also: I should probably use a sig like "HTML 4.01 dinosaur".