note
ww
<p>If your "trustworthy" means "valid" (and I infer that it does; please correct me if that's not so), then the only reliable method I can suggest is largely (and painstakingly) manual. I don't know of a module that can do all you ask.</p>
<ol><li>take a look (with your Mark I eyeball) at the rendered version in an IE version contemporary with the version of MS-Word that produced the page</li>
<li>extract <b>only</b> the rendered (editorial) content (something that's amenable to automation)</li>
<li>manually build valid html (and CSS) to match any aspect of the writer's rendering you consider important/relevant.</li>
</ol>
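<p>As a rough illustration of the automatable step 2, here is a minimal sketch in Python (chosen only for illustration; the same idea ports to an HTML parser in any language). It assumes the standard-library <tt>html.parser</tt> is good enough for your pages, and the names <tt>TextExtractor</tt> and <tt>extract_text</tt> are hypothetical. It keeps only text nodes and skips the contents of <tt>style</tt>, <tt>script</tt> and <tt>title</tt>; it is not a complete scraper.</p>

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping non-editorial elements."""

    SKIP = {"style", "script", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored entirely.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text outside skipped elements, collapsing whitespace.
        if not self._skip_depth and data.strip():
            self.parts.append(" ".join(data.split()))


def extract_text(html):
    """Return the text content of `html` as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

<p>Note that joining fragments with a space loses paragraph boundaries and can insert a space where inline markup split a word, so the result still needs the manual pass described in step 3.</p>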
<p>Clearly, it would help to know the character of the material you're trying to scrape/repost: if it's mere blather you're trying to record for historical interest, much of the formatting may be irrelevant; if it's a paper with a lot of math that needs to be rendered just as the writer prepared it, IN THE MS-WORD ORIGINAL, you have a quite different challenge.</p>
<p>Older versions of MS-Word use enormously verbose (and redundant) CSS constructs with gay abandon, along with proprietary XML schemas. The html is typically non-compliant, but not -- as you appear (from the penultimate and final sentences of your first para) to believe -- chiefly by virtue of missing elements ("tags"). Its most obvious fault is the failure to single- or double-quote many attribute values.</p>
<p>Now, fixing the quotes problem can be fairly straightforward. Here's a snippet whose sole purpose is to render a blank line between a table and a normal para:</p>
<c><p class=MsoNormal align=center style='text-align:center'><![if !supportEmptyParas]> <![endif]><o:p></o:p></p></c>
<p>OK, to fix that, your hypothetical module need only be able to parse <c>class=....</c> and wrap the attribute value in quotes... and similarly for <c>align=</c> and its value, <c>center</c> -- or, preferably, read the whole thing and throw out everything except <c><p></p></c></p>
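<p>A minimal sketch of the quoting fix, again in Python and again assuming the standard-library <tt>html.parser</tt> copes with your input: parse each tag and write it back out with every attribute value double-quoted. <tt>AttributeQuoter</tt> and <tt>quote_attributes</tt> are hypothetical names. As written it deliberately drops comments, doctype declarations and Word's conditional sections rather than trying to preserve them, and it lowercases tag and attribute names.</p>

```python
from html import escape
from html.parser import HTMLParser


class AttributeQuoter(HTMLParser):
    """Hypothetical helper: re-emit markup with double-quoted attribute values."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _render_attrs(self, attrs):
        rendered = []
        for name, value in attrs:
            if value is None:
                # Bare attribute with no value, e.g. <td nowrap>.
                rendered.append(f" {name}")
            else:
                # The parser hands back unescaped values, so escape them again.
                rendered.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(rendered)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._render_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._render_attrs(attrs)} />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def quote_attributes(html):
    """Return `html` with all attribute values double-quoted."""
    parser = AttributeQuoter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

<p>Fed the opening tag of the blank-line snippet above, this should give <tt>&lt;p class="MsoNormal" align="center" style="text-align:center"&gt;</tt>.</p>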
<p>But (oops!), look at the code for just one piece of the table I mentioned:</p>
<c><tr style='height:31.5pt'>
<td width=181 valign=top style='width:135.75pt;border:solid windowtext .5pt;
border-top:none;mso-border-top-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt;
height:31.5pt'></c>
<p>W3C-compliant browsers ignore tags, attributes and CSS properties they don't understand, so this will probably render more-or-less as intended in most modern browsers, even if you do nothing. But using point measure for sizes, rather than <tt>ems</tt>, gives some browsers headaches; analysing this much hoo-hah for what could have been a simple <c><tr class="x"><td class="y">...</td></tr></c> (after the initial overhead of reading the style sheet) chews up ticks/CPU cycles while the would-be reader fidgets ... or maybe goes away mad.</p>
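<p>One way to get from the inline hoo-hah toward the simple class-based form, sketched in Python under loud assumptions: the function names (<tt>clean_style</tt>, <tt>build_class_map</tt>) and the generated class names (<tt>w1</tt>, <tt>w2</tt>, ...) are hypothetical, and splitting on semicolons is naive (it would break on values that themselves contain a semicolon). The sketch drops the proprietary <tt>mso-</tt> declarations and assigns one class name to each distinct remaining style string.</p>

```python
def clean_style(style):
    """Drop proprietary mso-* declarations from an inline style string."""
    kept = []
    for decl in style.split(";"):
        if ":" not in decl:
            continue  # empty or malformed fragment
        prop, value = decl.split(":", 1)
        prop = prop.strip().lower()
        if prop.startswith("mso-"):
            continue  # Word-only property; browsers ignore it anyway
        # Collapse line breaks and runs of spaces inside the value.
        kept.append(f"{prop}:{' '.join(value.split())}")
    return ";".join(kept)


def build_class_map(styles):
    """Map each distinct cleaned style string to a generated class name."""
    class_for = {}
    for style in styles:
        cleaned = clean_style(style)
        if cleaned and cleaned not in class_for:
            class_for[cleaned] = f"w{len(class_for) + 1}"
    return class_for
```

<p>The resulting map can be written out once as a style sheet, with each element's <tt>style</tt> attribute replaced by the matching <tt>class</tt>. Converting the point sizes to <tt>ems</tt> is a separate judgment call that this sketch does not attempt.</p>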
<p>All this assumes your reference to vetting content means you don't want to censor, paraphrase or revise the original writer's work; that your goal is solely translation from MS-html to html.<br> ... and, a question: <b>In your opinion or experience, what "<i>tags are missing</i>" from your source?</b></p>
935506