Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Re: Looking for an HTML structure-cleaner

by ww (Archbishop)
on Nov 03, 2011 at 03:04 UTC ( [id://935570]=note: print w/replies, xml ) Need Help??


in reply to Looking for an HTML structure-cleaner

If your "trustworthy" means "valid" (and I infer that it does; please correct me if that's not so), then the only reliable method I can suggest is largely (and painstakingly) manual. I don't know a module that can do all you ask.

  1. take a look (with your Mark I eyeball) at the rendered version in an IE version contemporary with the version of MS-Word that produced the page
  2. extract only the rendered (editorial) content (something that's amenable to automation)
  3. manually build valid html (and CSS) to match any aspect of the writer's rendering you consider important/relevant.

Clearly, it would help to know the character of the material you're trying to scrape/repost: if it's mere blather you're trying to record for historical interest, much of the formatting may be irrelevant; if it's a paper with a lot of math that needs to be rendered just as the writer prepared it, IN THE MS-WORD ORIGINAL, you have a quite different challenge.

Older version of MS-Word use enormously verbose CSS constructs with gay abandon (redundancy) and proprietary xml schemas . The html is typically non-compliant, but not -- as you appear (from the penultimate and final sentences of your first para) to believe -- chiefly by virtue of missing elements ("tags") -- with the exception of an utter failure to single- or double-quote attributes.

Now, fixing the quotes problem can be fairly straightforward. Here's a snippet whose sole purpose is to render a blank line between a table and a normal para:

<p class=MsoNormal align=center style='text-align:center'><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

OK, to fix that your hypothetical module need only be able to parse class=.... and insert quotes between the equal sign and the attribute... and similarly between align=" and its attribute, "center" -- or, preferably, read the whole thing, and throw out everything except <c><p></p>

But (oops!), look at the code for just one piece of the table I mentioned:

<tr style='height:31.5pt'> <td width=181 valign=top style='width:135.75pt;border:solid windowte +xt .5pt; border-top:none;mso-border-top-alt:solid windowtext .5pt;padding:0in + 5.4pt 0in 5.4pt; height:31.5pt'>

w3c compliant browsers ignore tags they don't understand, so this will probably render more-or-less as intended in most modern browsers, even if you do nothing. But using point measure for sizes, rather than ems gives some browsers headaches; analysing this much hoo-hah for what could have been a simple <tr class="x"><td class="y">...</td></tr> (after the initial overhead of reading the style sheet) chews up ticks/CPU cycles while the would-be reader fidgets ... or maybe goes away mad.

All this assumes your reference to vetting content means you don't want to censor, paraphrase or revise the original writer's work; that your goal is solely translation from MS-html to html.
    ... and, a question: In your opinion or experience, what "tags are missing" from your source?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://935570]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (3)
As of 2024-04-25 17:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found