PerlMonks  


If your "trustworthy" means "valid" (and I infer that it does; please correct me if that's not so), then the only reliable method I can suggest is largely (and painstakingly) manual. I don't know of a module that can do all you ask.

  1. Take a look (with your Mark I eyeball) at the rendered version in an IE version contemporary with the version of MS-Word that produced the page.
  2. Extract only the rendered (editorial) content (something that's amenable to automation).
  3. Manually build valid html (and CSS) to match any aspect of the writer's rendering you consider important/relevant.
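
Step 2 is the part that automates easily. Here's a minimal sketch in Python, using only the standard library's html.parser, of pulling out just the text a reader would see. The names TextExtractor and rendered_text are my own inventions for illustration, and I'm assuming you want style and script blocks skipped; real Word output may need more special cases than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the visible text, dropping all markup (hypothetical helper)."""

    SKIPPED = {"style", "script"}  # assumption: their contents are never editorial

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()  # also drops runs that are only whitespace
        if text:
            self.chunks.append(text)


def rendered_text(html):
    """Return the non-empty text runs found in html, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Feeding it something like rendered_text("<p class=MsoNormal>Hello <b>world</b></p>") yields the text runs as a list, which you can then pour into hand-built markup in step 3.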

Clearly, it would help to know the character of the material you're trying to scrape/repost: if it's mere blather you're trying to record for historical interest, much of the formatting may be irrelevant; if it's a paper with a lot of math that needs to be rendered just as the writer prepared it, IN THE MS-WORD ORIGINAL, you have a quite different challenge.

Older versions of MS-Word use enormously verbose CSS constructs with gay abandon (redundancy) and proprietary xml schemas. The html is typically non-compliant, but not chiefly by virtue of missing elements ("tags"), as you appear to believe (judging from the penultimate and final sentences of your first para). The chief offense is an utter failure to single- or double-quote attributes.

Now, fixing the quotes problem can be fairly straightforward. Here's a snippet whose sole purpose is to render a blank line between a table and a normal para:

<p class=MsoNormal align=center style='text-align:center'><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

OK, to fix that, your hypothetical module need only be able to parse class=... and wrap the attribute's value in quotes, and similarly for align= and its value, center. Or, preferably, it could read the whole thing and throw out everything except <p></p>.
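
For what it's worth, the quoting fix can be sketched with a couple of regexes. This is Python rather than a real module, the names quote_attrs and quote_attrs_in_tag are made up for the example, and it assumes no tag contains a literal > or < inside a quoted value; a proper parser would be safer for anything beyond a quick cleanup pass.

```python
import re

# A start tag: '<', a letter, then anything up to the next '>'.
# Assumption: no '<' or '>' appears inside a quoted attribute value.
TAG = re.compile(r"<[A-Za-z][^<>]*>")

# name=value, where value is double-quoted, single-quoted, or bare.
ATTR = re.compile(r"""([\w:-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)""")


def quote_attrs_in_tag(tag):
    """Wrap bare attribute values in double quotes; leave quoted ones alone."""
    def fix(match):
        name, value = match.group(1), match.group(2)
        if value[0] in "\"'":
            return match.group(0)  # already quoted, keep as written
        return f'{name}="{value}"'
    return ATTR.sub(fix, tag)


def quote_attrs(html):
    """Apply quote_attrs_in_tag to every start tag found in html."""
    return TAG.sub(lambda m: quote_attrs_in_tag(m.group(0)), html)
```

Run over the snippet above, this turns class=MsoNormal align=center into class="MsoNormal" align="center" and leaves the already-quoted style attribute and the <![if ...]> markers untouched.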

But (oops!), look at the code for just one piece of the table I mentioned:

<tr style='height:31.5pt'> <td width=181 valign=top style='width:135.75pt;border:solid windowtext .5pt; border-top:none;mso-border-top-alt:solid windowtext .5pt;padding:0in 5.4pt 0in 5.4pt; height:31.5pt'>

W3C-compliant browsers ignore tags (and CSS properties, such as mso-border-top-alt) they don't understand, so this will probably render more-or-less as intended in most modern browsers, even if you do nothing. But using point measure for sizes, rather than ems, gives some browsers headaches; and analysing this much hoo-hah for what could have been a simple <tr class="x"><td class="y">...</td></tr> (after the initial overhead of reading the style sheet) chews up ticks/CPU cycles while the would-be reader fidgets ... or maybe goes away mad.
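
If you do want to boil the inline styles down toward that class="x" form, here's a rough Python sketch of one way to start. It assumes the Word-only declarations are exactly those whose property name begins with mso-, and the function names (clean_style, assign_classes) and the generated class names c1, c2, ... are purely illustrative.

```python
def clean_style(style):
    """Drop 'mso-*' declarations from an inline style string and tidy spacing."""
    kept = []
    for decl in style.split(";"):
        decl = " ".join(decl.split())  # collapse stray whitespace
        if not decl:
            continue
        prop = decl.split(":", 1)[0].strip().lower()
        if prop.startswith("mso-"):
            continue  # assumption: only Word understands these
        kept.append(decl)
    return "; ".join(kept)


def assign_classes(styles):
    """Map each distinct cleaned style to a short class name (c1, c2, ...)."""
    classes = {}
    for style in styles:
        cleaned = clean_style(style)
        if cleaned and cleaned not in classes:
            classes[cleaned] = f"c{len(classes) + 1}"
    return classes
```

The dictionary that comes back can be written out once as a style sheet, with each element then carrying only its class name. Whether to convert the point sizes to ems is a separate decision this sketch leaves alone.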

All this assumes your reference to vetting content means you don't want to censor, paraphrase or revise the original writer's work; that your goal is solely translation from MS-html to html.
    ... and, a question: In your opinion or experience, what "tags are missing" from your source?


In reply to Re: Looking for an HTML structure-cleaner by ww
in thread Looking for an HTML structure-cleaner by sundialsvc4
