in reply to Looking for an HTML structure-cleaner
If your "trustworthy" means "valid" (and I infer that it does; please correct me if that's not so), then the only reliable method I can suggest is largely (and painstakingly) manual. I don't know a module that can do all you ask.
- take a look (with your Mark I eyeball) at the rendered version in an IE version contemporary with the version of MS-Word that produced the page
- extract only the rendered (editorial) content (something that's amenable to automation)
- manually build valid html (and CSS) to match any aspect of the writer's rendering you consider important/relevant.
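The second step above (pulling out only the editorial content) is the part that lends itself to automation. A minimal sketch, assuming Python's standard `html.parser` is acceptable and that Word's conditional markers can simply be discarded; the names `TextExtractor` and `extract_text` are mine, purely for illustration, and are not from any existing module:

```python
import re
from html.parser import HTMLParser

# Word's conditional markers, e.g. <![if !supportEmptyParas]> and <![endif]>.
# Assumption: they carry no editorial content, so they are stripped up front.
MS_CONDITIONAL = re.compile(r"<!\[[^\]]*\]>")


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <style> and <script>."""

    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs (including the non-breaking spaces
        # Word uses to pad empty paragraphs) and collapse internal whitespace.
        if not self._skip_depth and data.strip():
            self.chunks.append(" ".join(data.split()))


def extract_text(html):
    """Return the visible text fragments of an MS-Word HTML string, in order."""
    parser = TextExtractor()
    parser.feed(MS_CONDITIONAL.sub("", html))
    parser.close()
    return parser.chunks
```

This only recovers the words; deciding which aspects of the writer's formatting matter is still a job for the eyeball.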
Clearly, it would help to know the character of the material you're trying to scrape/repost: if it's mere blather you're trying to record for historical interest, much of the formatting may be irrelevant; if it's a paper with a lot of math that needs to be rendered just as the writer prepared it, IN THE MS-WORD ORIGINAL, you have a quite different challenge.
Older versions of MS-Word use enormously verbose CSS constructs with gay abandon (redundancy) and proprietary xml schemas. The html is typically non-compliant, but not -- as you appear (from the penultimate and final sentences of your first para) to believe -- chiefly by virtue of missing elements ("tags"). What it does do is fail, utterly, to single- or double-quote attribute values.
Now, fixing the quotes problem can be fairly straightforward. Here's a snippet whose sole purpose is to render a blank line between a table and a normal para:
<p class=MsoNormal align=center style='text-align:center'><![if !supportEmptyParas]> <![endif]><o:p></o:p></p>
OK, to fix that your hypothetical module need only be able to parse class=... and wrap the attribute value in quotes (class="MsoNormal"), and similarly for align= and its value, center -- or, preferably, read the whole thing, and throw out everything except <p></p>.
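For what it's worth, here is a rough sketch of the quoting fix, again assuming Python's standard `html.parser`. It re-emits every start tag with its attribute values double-quoted, and (my assumption) strips Word's conditional markers first. `AttributeQuoter` and `quote_attributes` are hypothetical names, and the parser lowercases tag and attribute names as a side effect:

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumption: Word's <![if ...]> / <![endif]> markers can be dropped outright.
MS_CONDITIONAL = re.compile(r"<!\[[^\]]*\]>")


class AttributeQuoter(HTMLParser):
    """Re-serialise markup with every attribute value double-quoted."""

    def __init__(self):
        # Keep entities such as &nbsp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_start(self, tag, attrs, closer):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute, e.g. "nowrap"
                parts.append(name)
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def quote_attributes(html):
    """Return html with conditional markers removed and attributes quoted."""
    parser = AttributeQuoter()
    parser.feed(MS_CONDITIONAL.sub("", html))
    parser.close()
    return "".join(parser.out)
```

Fed the paragraph above, this should yield class="MsoNormal" align="center" style="text-align:center" on the p element. It does not throw the empty paragraph away; that would be a further step.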
But (oops!), look at the code for just one piece of the table I mentioned:
<tr style='height:31.5pt'> <td width=181 valign=top style='width:135.75pt;border:solid windowtext .5pt; border-top:none;mso-border-top-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt; height:31.5pt'>
W3C-compliant browsers ignore tags and style properties they don't understand, so this will probably render more-or-less as intended in most modern browsers, even if you do nothing. But using point measure for sizes, rather than ems, gives some browsers headaches; and analysing this much hoo-hah for what could have been a simple <tr class="x"><td class="y">...</td></tr> (after the initial overhead of reading the style sheet) chews up ticks/CPU cycles while the would-be reader fidgets ... or maybe goes away mad.
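To illustrate the class-based alternative, here is a rough Python sketch that swaps each distinct inline style for a generated class and collects the rules into a small style sheet. The function names and the "w" class prefix are invented for the example. It assumes quoted style attributes, drops mso-* properties, and does not cope with elements that already carry a class attribute (they would end up with two):

```python
import re

# Matches a quoted style attribute, including the whitespace before it.
STYLE_ATTR = re.compile(r"""\sstyle=(['"])(.*?)\1""", re.S)


def normalise(style):
    """Tidy a declaration list and drop Word-only mso-* properties."""
    decls = []
    for decl in style.split(";"):
        if ":" not in decl:
            continue
        prop, value = decl.split(":", 1)
        prop = prop.strip().lower()
        if prop.startswith("mso-"):
            continue
        decls.append(f"{prop}:{' '.join(value.split())}")
    return ";".join(decls)


def styles_to_classes(html, prefix="w"):
    """Return (html with class attributes, generated style sheet text)."""
    classes = {}  # normalised CSS text -> generated class name

    def replace(match):
        css = normalise(match.group(2))
        if not css:
            return ""  # nothing left once the mso-* properties are gone
        name = classes.setdefault(css, f"{prefix}{len(classes) + 1}")
        return f' class="{name}"'

    body = STYLE_ATTR.sub(replace, html)
    sheet = "\n".join(f".{name} {{{css}}}" for css, name in classes.items())
    return body, sheet
```

Identical declarations share one class, so the rows and cells of a Word table should collapse to a handful of rules. Converting the point sizes to ems is a separate decision this sketch leaves alone.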
All this assumes your reference to vetting content means you don't want to censor, paraphrase or revise the original writer's work; that your goal is solely translation from MS-html to html.
... and, a question: In your opinion or experience, what "tags are missing" from your source?