<?xml version="1.0" encoding="windows-1252"?>
<node id="524507" title="Re^2: &quot;Demoronizer&quot; in Best of the Best Users in Perlmonks site" created="2006-01-20 11:07:20" updated="2006-01-20 06:07:20">
<type id="11">
note</type>
<author id="352046">
ww</author>
<data>
<field name="doctext">
&lt;p&gt;There are several twisty corridors here in the Monastery in which demoronizer cobwebs hang from the ceiling; IMO they're well worth pursuing by anyone interested in cleaning up the .html produced by &lt;big&gt;&lt;b&gt;ANY &lt;/b&gt;&lt;/big&gt; of MS's Word, Excel or other supposedly WYSIWYG products.  Look under the covers, and what you get is remarkable bloat and non-conformant code.&lt;/p&gt;

&lt;p&gt;So, a few keywords for future Super_Searchers: "HTML, html, MS, Microsoft, Office, Word, Excel, FrontPage, PowerPoint, Publisher, cleanup, parse" ...and there surely could be more (arguably even Notepad, which when in word-wrap mode adds MS-ish line ends at every displayed wrap position).&lt;/p&gt;

&lt;p&gt;[davidrw] and [astroboy] offered links to useful alternate tools in [id://457280]. There's also a bit of discussion there of the issues implied in [samtregar]'s remark in this thread.&lt;/p&gt;

&lt;p&gt;Self-updating of demoronizer is laid out very nicely by [derby] in [Re^3: Reg Ex to strip MS smart quotes].&lt;/p&gt;

&lt;p&gt;&lt;b&gt;But &lt;/b&gt; (&lt;i&gt;... sigh!&lt;/i&gt;) ...even the latest Word-&gt;html output does not exactly demonstrate that the allegedly-enlightened giant in Redmond has learned to avoid making the same mistakes in different (i.e., incompatible) ways.&lt;/p&gt;

&lt;p&gt; ...and, oh yes, a (deprecated) disclaimer: I don't hate W32; I just hate cleaning up MS .html to W3C standards.&lt;/p&gt;

Fair warning, also: I should probably use a sig like &lt;b&gt;&lt;small&gt;html 4.01 dinosaur&lt;/small&gt;&lt;/b&gt;</field>
<field name="root_node">
524176</field>
<field name="parent_node">
524238</field>
</data>
</node>
