
Converting Word97 (or later) exported HTML to valid HTML

by projekt21 (Friar)
on Nov 06, 2001 at 15:35 UTC (#123551)
projekt21 has asked for the wisdom of the Perl Monks concerning the following question:

The problem arose when one of our customers wanted to feed formatted text from a Word 97 doc file into an RDBMS to generate web pages with dynamic content. So I exported the text as HTML from Word, opened the file in a text editor and was confronted with a horror of HTML (you may know the kind).

My first approach was to use HTML::Parser and a modified version of one of its example scripts to drop some tags (like <font>). HTML::Parser did a good job on that but left ugly things like <b><i> ... </b></i>, which is improperly nested and therefore not valid.

So I took a look at HTML::TreeBuilder and wrote the following sub to do the work. It works fine, but I want to ask my fellow monks for deeper knowledge.

Are there other ways to handle Word's HTML output and get valid HTML from it? Please give me some directions (other than HTML Tidy, which can't be used here). Thanks.

# ... snippet ...
use HTML::TreeBuilder;

# tags to ignore (the tag is removed, its content is kept)
my @ignore_tags = qw(font big small body dir html);
# tags to drop together with their content
my @ignore_elements = qw(script style head);

##########################################################
sub clean_up_htmltree {
##########################################################
    my $input = shift;
    my $warn  = 0;
    my $htmlex;

    my $h = HTML::TreeBuilder->new;
    $h->ignore_unknown(0);
    $h->warn($warn);
    $h->parse($input);
    $h->eof;    # tell the parser that the input is complete

    foreach (@ignore_tags) {
        $htmlex = 1, next if lc($_) eq "html";  # remove <html>...</html>?
        while (my $ok = $h->look_down('_tag', "$_")) {
            $ok->replace_with_content;
        }
    }
    foreach (@ignore_elements) {
        while (my $ok = $h->look_down('_tag', "$_")) {
            $ok->detach;
        }
    }

    # arguments: entities to encode, indent, optional end tags
    my $output = $h->as_HTML(undef, " ", {});
    $h = $h->delete();  # nuke it!

    if ($htmlex) {
        $output =~ s:^\s*<html>::m;
        $output =~ s:</html>\s*$::m;
    }
    return $output;
}

alex pleiner <>
zeitform Internet Dienste

Replies are listed 'Best First'.
Re: Converting Word97 (or later) exported HTML to valid HTML
by Corion (Pope) on Nov 06, 2001 at 15:50 UTC
    Honestly, as I read the title of your node, HTML tidy sprang immediately to my mind, as it even has command line switches specifically for cleaning up Office HTML. On that website, there is also code showing how to call HTML tidy from Perl, including some proposed error checking which seems mostly geared for Unix. On second thought, it is not really clear why they use the code they use, so I'll post it here, together with my replacement:
    ## This is what I think is needed beforehand :
    open( TIDY, "html-tidy $commandline|")
        or die "Couldn't spawn html-tidy : $!\n";
    my @output;
    @output = <TIDY>;
    ## Here begins their code :
    if (close(TIDY) == 0) {
        my $exitcode = $? >> 8;
        if ($exitcode == 1) {
            printf STDERR "tidy issued warning messages\n";
        } elsif ($exitcode == 2) {
            printf STDERR "tidy issued error messages\n";
        } else {
            die "tidy exited with code: $exitcode\n";
        }
    } else {
        printf STDERR "tidy detected no errors\n";
    }
    I think this could simply be done with the following code, but I haven't checked all possible outcomes...
    my @output = qx(html-tidy $commandline);
    my $exitcode = $? >> 8;
    if ($exitcode == 0) {
        # exit code 0 means tidy had nothing to complain about
        printf STDERR "tidy detected no errors\n";
    } elsif ($exitcode == 1) {
        printf STDERR "tidy issued warning messages\n";
    } elsif ($exitcode == 2) {
        printf STDERR "tidy issued error messages\n";
    } else {
        die "tidy exited with code: $exitcode\n";
    }
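One outcome the qx version above cannot easily tell apart is "the program never started" or "the program was killed by a signal", because both end up hidden in `$?`. A minimal sketch of how those cases might be separated, assuming a Unix-like system and Perl 5.8 or later for the list form of pipe `open`; the helper name `run_and_classify` and its status keywords are made up for illustration, and only the exit codes 1 and 2 come from the code quoted above:

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and classify how it ended.
# Returns a list: a status keyword followed by the captured output lines.
sub run_and_classify {
    my (@cmd) = @_;

    # The list form of pipe open bypasses the shell, so arguments
    # need no quoting.
    my $pid = open(my $fh, '-|', @cmd);
    return ('spawn_failed') unless defined $pid;

    my @output = <$fh>;
    close($fh);    # waits for the child and sets $? to its wait status

    # The low 7 bits of $? hold the signal number, if any.
    return ('signalled', @output) if $? & 127;

    # The high byte holds the exit code.
    my $exitcode = $? >> 8;
    return ('ok',       @output) if $exitcode == 0;
    return ('warnings', @output) if $exitcode == 1;
    return ('errors',   @output) if $exitcode == 2;
    return ('unknown',  @output);
}
```

It could be called as `my ($status, @html) = run_and_classify('html-tidy', @args);`, where `@args` stands for whatever switches `$commandline` held as a string.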

    Wrapping it up, unless you tell us a really convincing reason why html-tidy is not possible (and with not possible I also mean putting html-tidy into a Perl script, writing it out to /tmp, starting it there and afterwards deleting the file again), I'll stick with this solution :-)

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: Converting Word97 (or later) exported HTML to valid HTML
by jmcnamara (Monsignor) on Nov 06, 2001 at 15:52 UTC

    The demoroniser might help.

    Update: Here is the version that was updated by Larry Rosler and TomC.


Re: Converting Word97 (or later) exported HTML to valid HTML
by projekt21 (Friar) on Nov 06, 2001 at 16:50 UTC

    Thanks for the reply.

    I've checked all of those, but:

    • demoroniser removes the biggest horrors but leaves some behind (e.g. <b><i> ... </b></i>). Maybe I can change the code.
    • tidy is the tool of choice (under normal conditions). As I mentioned in CB, the script/website runs on a provider's server where I am not allowed to install software (poor customer's choice). Anyway, I need to drop all CSS stuff, which would require post-processing tidy's output.
    • wvHtml looks interesting, too. I may implement a doc file upload. Anyway, both restrictions mentioned before (no install of software, no CSS stuff) apply here, too.
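The "drop all CSS stuff" requirement from the list above could probably be handled with the same HTML::TreeBuilder module used in the original post. What follows is only a sketch under assumptions: the helper name `strip_css` is hypothetical, and "CSS stuff" is taken to mean `<style>` elements plus `style` and `class` attributes, which may not cover everything Word emits.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: remove <style> elements and the style/class
# attributes from an HTML string and return the cleaned HTML.
sub strip_css {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    # Drop <style> elements together with their content.
    $_->delete for $tree->look_down(_tag => 'style');

    # Visit every element; passing undef as the value deletes
    # the attribute.
    for my $el ($tree->look_down(sub { 1 })) {
        $el->attr($_, undef) for qw(style class);
    }

    # arguments: entities to encode, indent, optional end tags
    my $out = $tree->as_HTML(undef, ' ', {});
    $tree->delete;
    return $out;
}
```

Because it works on the parsed tree rather than on the raw text, the same pass could run on tidy's output or directly on the Word export.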

    Thanks for your comments and wisdom. I'll sleep on this (for a night or two) before I go on.

    alex pleiner <>
    zeitform Internet Dienste

      If you can run CGIs, chances are you can upload precompiled binaries, or compile your own binaries on their server from CGIs, and then call them from other scripts. Unless they need to approve scripts before they put them live, in which case, obfuscate anything and see if they put it live when they don't understand it.

      /msg me if you want some more specific hints on doing things on shared servers that the admin thought they could stop.

      the hatter

Re: Converting Word97 (or later) exported HTML to valid HTML
by jeroenes (Priest) on Nov 06, 2001 at 16:11 UTC
    It is a noble goal to produce nice HTML from the stuff that word spits out. Noble, but difficult.

    There is a tool for that. I'm browsing now to find that tool... here it is: 'mswordview'. Let me download and try... oh, new project page here. Looks nice, there should be HTML 4.0, LaTeX, plain text, PS, PDF output... compiling/testing (oh, you only need wv, skip the libwv)

    At a glance the output is decent HTML. The authors claim W3C HTML 4.0 compliance. Methinks that 'wordview' is the way to go.


Re: Converting Word97 (or later) exported HTML to valid HTML
by andye (Curate) on Nov 06, 2001 at 18:00 UTC
    You're so right - it's really quite horrendous. I've used two solutions for this in the past (neither Perl though, sorry) :
    • Microsoft themselves have released a utility to do this - presumably available from their website
    • Macromedia Dreamweaver has a specific function to do this
    The second of these obviously can't be incorporated in a script, the first probably can't, but perhaps you could persuade your users to run their html files through the Microsoft utility, on their Windows desktop?

    hth a little,

      I've found a Word filter from Microsoft that is supposed to output cleaner HTML. (I assume this is what you were talking about.)

      I also tend to use Dreamweaver for this task, but it does leave some of the CSS stuff behind, so some cleanup is still required.

      Update: Although I still haven't tested the output, it appears that the MS Word filter can be used from the command line, as a standalone GUI application, or from within Word, and can batch process multiple files.

      Impossible Robot
Re: Converting Word97 (or later) exported HTML to valid HTML
by Hero Zzyzzx (Curate) on Nov 06, 2001 at 21:36 UTC

    I do this with a file upload and wp2html, which creates really lean HTML and has the added bonus of working with WordPerfect docs too. I'm really happy with this solution- it's fast as heck, the HTML is pretty good and you have mucho control over the generated HTML.

    While you can get the source, there is a 5 pound licensing fee. (very reasonable, considering the amount of work that must have gone into this). The author is very responsive, too.

    I've tried wvHTML too, I like wp2html better because it keeps the intent of the document, and a good amount of the formatting without trying to stay TOO true to the original format of the document. Basically, wp2html gets the good stuff, while wvHTML jumps through too many hoops to keep the converted document looking like the original Word doc.

    If you can upload a compiled binary, I highly suggest you check it out. It rocks!

    -Any sufficiently advanced technology is
    indistinguishable from doubletalk.
