Looking for an HTML structure-cleaner

by sundialsvc4 (Abbot)
on Nov 02, 2011 at 22:17 UTC (#935506=perlquestion)
sundialsvc4 has asked for the wisdom of the Perl Monks concerning the following question:

I have an application that needs to display trustworthy HTML content produced by an application (an old Microsoft Word) that did not necessarily produce “complete and correct” HTML as perceived by present-day web browsers (e.g. Internet Explorer 8/9).   What I would like to quickly find is a module that, given an HTML text-string as input, will do what is necessary to clean up the structure of that string.   For example, if tags are missing it will insert them.

What is happening right now is that the HTML provided is being blindly inserted into the template (Template Toolkit, of course ...) and sometimes that results in an ill-formed HTML page.   Most browsers are pretty tolerant of these things, but Microsoft’s (of course...) generally are not.

Again, I am not trying to “vet” the HTML content, merely to find a way to compensate graciously for its structural shortcomings (whatever those may be).

Re: Looking for an HTML structure-cleaner
by ww (Archbishop) on Nov 03, 2011 at 03:04 UTC

    If your "trustworthy" means "valid" (and I infer that it does; please correct me if that's not so), then the only reliable method I can suggest is largely (and painstakingly) manual. I don't know a module that can do all you ask.

    1. take a look (with your Mark I eyeball) at the rendered version in an IE version contemporary with the version of MS-Word that produced the page
    2. extract only the rendered (editorial) content (something that's amenable to automation)
    3. manually build valid html (and CSS) to match any aspect of the writer's rendering you consider important/relevant.

    Clearly, it would help to know the character of the material you're trying to scrape/repost: if it's mere blather you're trying to record for historical interest, much of the formatting may be irrelevant; if it's a paper with a lot of math that needs to be rendered just as the writer prepared it, IN THE MS-WORD ORIGINAL, you have a quite different challenge.

    Older versions of MS-Word use enormously verbose CSS constructs with gay abandon (redundancy) and proprietary XML schemas. The HTML is typically non-compliant, but not -- as you appear (from the penultimate and final sentences of your first para) to believe -- chiefly by virtue of missing elements ("tags"); the main culprit is instead a failure to single- or double-quote attribute values.

    Now, fixing the quotes problem can be fairly straightforward. Here's a snippet whose sole purpose is to render a blank line between a table and a normal para:

    <p class=MsoNormal align=center style='text-align:center'><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

    OK, to fix that, your hypothetical module need only parse class=... and put quotes around the attribute value, and do the same for align= and its value, center -- or, preferably, read the whole thing and throw out everything except <p></p>.
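The quoting fix described above can be sketched in a few lines of core Perl. This is only an illustration under stated assumptions: the helper names quote_attributes and _quote_tag are made up for this example, and the regular expressions assume that no attribute value contains a literal ">" character. A real HTML parser would be more robust.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap bare attribute values in double quotes.
# Only start tags are examined; text outside tags is left untouched.
sub quote_attributes {
    my ($html) = @_;
    $html =~ s{(<[A-Za-z][^<>]*>)}{ _quote_tag($1) }ge;
    return $html;
}

# Hypothetical helper: quote every bare value inside a single start tag.
sub _quote_tag {
    my ($tag) = @_;
    # A value is double-quoted, single-quoted, or bare (no spaces or quotes).
    $tag =~ s{(\s[\w:-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)}{
        my ($name, $value) = ($1, $2);
        $value = qq{"$value"} unless $value =~ /^["']/;   # already quoted? keep as-is
        "$name=$value";
    }ge;
    return $tag;
}

print quote_attributes(
    q{<p class=MsoNormal align=center style='text-align:center'>}
), "\n";
```

Applied to the opening tag of the snippet above, this quotes the class and align values and leaves the already single-quoted style attribute alone. Word's conditional markers such as <![if !supportEmptyParas]> do not start with a letter, so the sketch passes them through unchanged.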

    But (oops!), look at the code for just one piece of the table I mentioned:

    <tr style='height:31.5pt'> <td width=181 valign=top style='width:135.75pt;border:solid windowtext .5pt; border-top:none;mso-border-top-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt; height:31.5pt'>

    W3C-compliant browsers ignore tags and style properties they don't understand, so this will probably render more-or-less as intended in most modern browsers, even if you do nothing. But using point measure for sizes, rather than ems, gives some browsers headaches; analysing this much hoo-hah for what could have been a simple <tr class="x"><td class="y">...</td></tr> (after the initial overhead of reading the style sheet) chews up ticks/CPU cycles while the would-be reader fidgets ... or maybe goes away mad.

    All this assumes your reference to vetting content means you don't want to censor, paraphrase or revise the original writer's work; that your goal is solely translation from MS-html to html.
        ... and, a question: In your opinion or experience, what "tags are missing" from your source?

Re: Looking for an HTML structure-cleaner
by Anonymous Monk on Nov 03, 2011 at 08:24 UTC

    You could try HTML::Tidy, which is a front-end to tidyp (aka HTML Tidy). I doubt it cleans up your CSS, though.

    "tidyp will validate your HTML, and output cleaned-up HTML."
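A minimal sketch of how HTML::Tidy might be applied to a fragment before it goes into the template. It assumes the CPAN module and its tidyp library are installed. The wrapper name tidy_fragment is hypothetical, and the option names passed to new() are tidy configuration options written with underscores; whether show_body_only gives exactly the output wanted for Word-generated fragments is an assumption that should be checked against real data.

```perl
use strict;
use warnings;
use HTML::Tidy;    # CPAN module, not core; wraps the tidyp library

# Hypothetical wrapper: return a structurally repaired copy of a fragment.
sub tidy_fragment {
    my ($fragment) = @_;
    my $tidy = HTML::Tidy->new({
        show_body_only => 1,    # assumption: emit the body content only, not a whole page
        tidy_mark      => 0,    # do not add a generator <meta> tag
        wrap           => 0,    # do not re-wrap long lines
    });
    my $clean = $tidy->clean($fragment);

    # Report what tidy complained about, so that bad items can be reviewed.
    warn $_->as_string, "\n" for $tidy->messages;
    return $clean;
}

print tidy_fragment('<p><b>Question 1<p>Second paragraph');
```

The cleaned string can then be handed to the template in place of the raw content. As noted above, this addresses structure only; Word's inline CSS is passed along as it is.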

Re: Looking for an HTML structure-cleaner
by Anonymous Monk on Nov 02, 2011 at 22:59 UTC
Re: Looking for an HTML structure-cleaner
by sundialsvc4 (Abbot) on Nov 03, 2011 at 13:44 UTC

    The documents in question include, e.g., literary passages, sometimes-elaborate math formatting and so on, all of which are being presented as a part of a vocational test-giving application.   So, the formatting (egregious as it sometimes is...) is important.   My problem is that sometimes the markup is incomplete.   Therefore, when my template embeds it in a <div> tag and the embedded text lacks the proper closing tags, that <div> is no longer seen as enclosing it.

    So, thinking about this requirement a little bit more, I guess that I am really most concerned with “DOM structure” matters ... of making sure that the content, whatever it is, gets wedged into the container.   I really don’t want to delve into the guts of that content.   I simply want to keep it inside the box.   Of course I thought about using the <frame> tag, but the deployment is so bandwidth-constrained that the result looks perfectly dreadful.
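The "keep it inside the box" requirement can be illustrated with a small tag-balancing pass in core Perl. This is a sketch under stated assumptions, not a validator: balance_fragment is a hypothetical helper, it tokenises tags with a regular expression, it does not understand comments or HTML's optional end tags (such as an unclosed <p>), and it simply closes whatever is still open and drops closing tags that have no opener.

```perl
use strict;
use warnings;

# Elements that never take a closing tag.
my %VOID = map { $_ => 1 }
    qw(area base br col embed hr img input link meta param source track wbr);

# Hypothetical helper: make a fragment self-contained so that it cannot
# close, or leak out of, the container it is embedded in.
sub balance_fragment {
    my ($html) = @_;
    my @open;        # stack of currently open element names
    my $out = '';
    my $pos = 0;     # offset just past the previous tag

    while ($html =~ m{<(/?)([A-Za-z][\w:-]*)((?:"[^"]*"|'[^']*'|[^'">])*)>}g) {
        my ($closing, $name, $rest) = ($1, lc($2), $3);
        my ($start, $end) = ($-[0], $+[0]);
        $out .= substr($html, $pos, $start - $pos);    # text before this tag
        $pos = $end;

        if (!$closing) {
            $out .= substr($html, $start, $end - $start);
            push @open, $name unless $VOID{$name} || $rest =~ m{/\s*$};
        }
        elsif (grep { $_ eq $name } @open) {
            # Close any elements left open inside the one being closed.
            while (@open) {
                my $top = pop @open;
                $out .= "</$top>";
                last if $top eq $name;
            }
        }
        # Otherwise: a stray closing tag with no matching opener is dropped.
    }
    $out .= substr($html, $pos);           # trailing text
    $out .= "</$_>" for reverse @open;     # close whatever is still open
    return $out;
}

my $fragment = '<table><tr><td><b>x</td></div>';
print '<div class="item">', balance_fragment($fragment), '</div>', "\n";
```

In this example the stray </div>, which would otherwise end the enclosing container early, is dropped, and the unclosed <b>, <tr> and <table> are closed before the wrapper's own </div>. The class name "item" is only a placeholder.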

    You’re right about the Word-generated HTML content, tho’ ... it is hideous.   But, it works.   And I basically want to keep it working.
