PerlMonks  

Word HTML issues

by mlhmich (Novice)
on May 15, 2005 at 19:44 UTC ( #457280=perlquestion )
mlhmich has asked for the wisdom of the Perl Monks concerning the following question:

I am using HTMLArea3 and my users paste Word HTML into it. Sometimes it looks OK in HTMLArea, sometimes it does not. Either way, when it gets posted to a site, it looks VERY bad. In Dreamweaver there is a "Clean up Word HTML" option. Is there any way to do something like that in Perl, with regexes? I am not very good with regexes, but has someone done something like this?

Thanks for all the monks' help,
Mike

Update: How does that script work with all the # in it?

20050515 Edit by ysth: restore original question

20050516 Edit by Corion: Unconsidered. Was considered: Animator: retitle: issues with HTML generated by (Microsoft) Word (Keep/Edit/Delete: 7/10/1)

Re: Word HTML issues
by Corion (Pope) on May 15, 2005 at 19:57 UTC

    Although I haven't used it, there is the Demoronizer, which purports to clean up the HTML generated by Word. I'm not sure whether it will help you. You could also disallow pasting Word stuff, because I'm not sure how HTMLArea3 handles pasted Word documents, as it doesn't have access to the special Word formatting. You could consider having your users paste or upload RTF, and then convert the RTF to proper HTML.

      Unfortunately, Demoronizer worked better on the HTML generated by the version of M$Word that was current when Demoronizer (oh, I love that name) was written than it does on the output from more recent Word versions; the newer ones use all manner of new and sometimes unpleasant, non-standard HTML (or, more recently, XML, which also tends to be unpleasant to convert).

      Corion's advice to have your users provide RTF (or even plain text) for conversion should work better than the latest version of Demoronizer I've found... and I even took a whack at updating it to deal with additional versions of what Word claims is .html.

      However, I see other recommendations for cleanup below... and I, for one, am going to check them out. You may find them more valuable (and easier) than either Demoronizer or learning enough (standards-compliant) HTML to convert .txt or .rtf.

Re: Word HTML issues
by davidrw (Prior) on May 15, 2005 at 20:03 UTC
    Yuck. Below is a script that I used recently to strip out a lot of Word-generated junk. Caveat emptor: it was a quick & dirty solution for my specific files, but some of the regexes may be of some use. Note that it gets rid of everything in between <!...> tags, and also strips pretty much all style junk with mso in it.
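
    This is not the script referred to above, only a minimal sketch of the two cleanups it describes: dropping everything between <! and >, and dropping style junk that mentions mso. The helper name clean_word_html, the class=Mso... rule, and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not the original poster's script.
sub clean_word_html {
    my ($html) = @_;

    # Drop everything between <! and the next >, e.g. comments and
    # conditional blocks. Note this also removes a <!DOCTYPE ...> line,
    # and a comment containing '>' would only be partly removed.
    $html =~ s#<!.*?>##sg;

    # Drop style attributes whose value mentions mso (Word-specific CSS).
    $html =~ s#\s+style\s*=\s*"[^"]*mso[^"]*"##sig;
    $html =~ s#\s+style\s*=\s*'[^']*mso[^']*'##sig;

    # Assumed extra rule: drop Word class names such as class=MsoNormal.
    $html =~ s#\s+class\s*=\s*["']?Mso\w*["']?##sig;

    return $html;
}

my $dirty = qq{<p class=MsoNormal style="mso-margin-top-alt:auto"><!-- junk -->Hello <b>world</b></p>};
print clean_word_html($dirty), "\n";   # prints: <p>Hello <b>world</b></p>
```

    Regexes like these are as quick & dirty as the caveat above says; they do not parse the HTML, so unusual markup can slip through or be mangled.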

    As for a more generic approach, I haven't used one, but a quick CPAN search for HTML yields HTML::Scrubber and HTML::Sanitizer, which (at a two-second glance) look promising.

      The OP changed his question into:

      How does that script work with all the # in it?

      I assume this was meant to be a reply to this post, so I'll post my answer here.

      The # in s#class=section#class="Section"#sg; for example, is used as a regex delimiter, not as a comment. It is the same as s/class=section/class="Section"/sg; except that with s/// you would need to escape any / in the pattern (which could make it less readable, though that does not apply in this case since the regex contains no /).
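
      A small check of that equivalence, assuming a made-up input string (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $with_hash  = '<div class=section>text</div>';
my $with_slash = $with_hash;

# A '#' placed directly after the s operator is the delimiter, not a comment.
$with_hash  =~ s#class=section#class="Section"#sg;
$with_slash =~ s/class=section/class="Section"/sg;

print "$with_hash\n";                               # prints: <div class="Section">text</div>
print "same result\n" if $with_hash eq $with_slash; # prints: same result
```

      One thing to watch: there must be no whitespace between the s and the #. Written as "s #...", Perl treats the # as the start of a comment.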

Re: Word HTML issues
by astroboy (Chaplain) on May 15, 2005 at 20:11 UTC
Re: Word HTML issues
by davidrw (Prior) on May 15, 2005 at 21:07 UTC
        How does that script work with all the # in it?

    First, please don't replace your original content like that (I think the editors are going to fix it); just reply to the replies instead.

    Anyway, I think you're asking about my usage of things like s#foo#stuff#. Because of how the quote-like operators work (see the perlop and perlre man pages), all of these do the same thing:
      s/foo/stuff/
      s#foo#stuff#
      s!foo!stuff!
      s?foo?stuff?
    In this case, I used s### instead of s/// for two reasons:
    • The # is pretty legible since it's visually a block.
    • Since I'm dealing with HTML tags, I don't have to worry about escaping /'s. For example, these two are identical, but one is obviously easier to read & write:
      s/<tr><td>.*?<\/td><\/tr>/FOO/;
      s#<tr><td>.*?</td></tr>#FOO#;
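
    To see the escaping point in action, here is a short sketch run against a made-up table string. The s{...}{...} form is not mentioned above; it is another standard delimiter choice, added here for comparison.

```perl
use strict;
use warnings;

my $row = '<table><tr><td>one</td></tr><tr><td>two</td></tr></table>';

# Same substitution three ways; only the s/// form needs \/ in closing tags.
(my $slashes = $row) =~ s/<tr><td>.*?<\/td><\/tr>/FOO/;
(my $hashes  = $row) =~ s#<tr><td>.*?</td></tr>#FOO#;
(my $braces  = $row) =~ s{<tr><td>.*?</td></tr>}{FOO};

# Without /g only the first row is replaced.
print "$hashes\n";   # prints: <table>FOO<tr><td>two</td></tr></table>
print "all three agree\n" if $slashes eq $hashes and $braces eq $hashes;
```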
