Re: Looking for an HTML structure-cleaner
by ww (Bishop)
on Nov 03, 2011 at 03:04 UTC
If your "trustworthy" means "valid" (and I infer that it does; please correct me if that's not so), then the only reliable method I can suggest is largely (and painstakingly) manual. I don't know a module that can do all you ask.
Clearly, it would help to know the character of the material you're trying to scrape/repost: if it's mere blather you're trying to record for historical interest, much of the formatting may be irrelevant; if it's a paper with a lot of math that needs to be rendered just as the writer prepared it, IN THE MS-WORD ORIGINAL, you have a quite different challenge.
Older versions of MS-Word use enormously verbose CSS constructs with gay abandon (redundancy) and proprietary XML schemas. The HTML is typically non-compliant, but not -- as you appear (from the penultimate and final sentences of your first para) to believe -- chiefly by virtue of missing elements ("tags"), with the exception of an utter failure to single- or double-quote attribute values.
Now, fixing the quotes problem can be fairly straightforward. Here's a snippet whose sole purpose is to render a blank line between a table and a normal para:
<p class=MsoNormal align=center style='text-align:center'><![if !supportEmptyParas]> <![endif]><o:p></o:p></p>
OK, to fix that, your hypothetical module need only be able to parse class=... and wrap the attribute value in quotes, and similarly for align= and its value, center -- or, preferably, read the whole thing and throw out everything except <p></p>
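To make those two options concrete, here is a rough sketch of both, written in Python with nothing but regexes. The function names quote_attributes and strip_word_markup are made up for illustration, not taken from any module, and the patterns assume attribute values never contain a ">" or a nested quote. Real Word output will break those assumptions sooner or later, so treat it as a starting point rather than a cleaner.

```python
import re

# A start tag: "<", a letter, then anything up to the next ">".
# Assumption: no ">" appears inside an attribute value.
TAG = re.compile(r"<[A-Za-z][^<>]*>")

# name=value, where value is already quoted (group 2) or bare (group 3).
ATTR = re.compile(r"""(\s[\w:.-]+)=(?:("[^"]*"|'[^']*')|([^\s"'<>]+))""")

# Word's conditional blocks, e.g. <![if !supportEmptyParas]> and <![endif]>.
CONDITIONAL = re.compile(r"<!\[[^\]]*\]>")

# Office-namespace tags such as <o:p> and </o:p>.
OFFICE_TAG = re.compile(r"</?o:[^<>]*>")


def quote_attributes(html):
    """Double-quote bare attribute values; leave quoted ones untouched."""
    def fix_attr(m):
        name, quoted, bare = m.groups()
        return m.group(0) if quoted is not None else f'{name}="{bare}"'

    def fix_tag(m):
        return ATTR.sub(fix_attr, m.group(0))

    return TAG.sub(fix_tag, html)


def strip_word_markup(html):
    """Throw out Word-specific markup and reduce each <p ...> to a bare <p>."""
    html = CONDITIONAL.sub("", html)
    html = OFFICE_TAG.sub("", html)
    html = re.sub(r"<p\b[^<>]*>", "<p>", html)
    # A paragraph holding only whitespace becomes an empty paragraph.
    return re.sub(r"<p>\s*</p>", "<p></p>", html)


sample = ("<p class=MsoNormal align=center style='text-align:center'>"
          "<![if !supportEmptyParas]> <![endif]><o:p></o:p></p>")

print(quote_attributes(sample))
print(strip_word_markup(sample))  # prints <p></p>
```

The first call only adds the missing quotes around MsoNormal and center; the second is the "throw out everything" route and leaves just the empty paragraph.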
But (oops!), look at the code for just one piece of the table I mentioned:
W3C-compliant browsers ignore tags they don't understand, so this will probably render more-or-less as intended in most modern browsers, even if you do nothing. But using point measure for sizes, rather than ems, gives some browsers headaches; analysing this much hoo-hah for what could have been a simple <tr class="x"><td class="y">...</td></tr> (after the initial overhead of reading the style sheet) chews up ticks/CPU cycles while the would-be reader fidgets ... or maybe goes away mad.
All this assumes your reference to vetting content means you don't want to censor, paraphrase or revise the original writer's work; that your goal is solely translation from MS-html to html.