http://www.perlmonks.org?node_id=457311


in reply to Re: Word HTML issues
in thread Word HTML issues

Unfortunately, Demoronizer worked better on the html generated by the version M$Word which was current when Demoronizer (Oh, I love that name) was written than it does on the output from more recent Word versions; the newer ones use all manner of new and sometimes unpleasant, non-standard html (or, more recently, XML, which also tends to be unpleasant to try to convert).

Corion's advice to have your users to provide RTF (or even, plain text) for conversion should work better than (the latest version I've found) of Demoronizer... and I even took at whack at updating it to deal with additional versions of what Word claims is .html.

However, I see other recommendations for cleanup below... and I, for one, am going to check them out. You may find them valuable (and easier) than either Demoronizer or than learning enough (standards complaint) .html to convert .txt or .rtf.