How to clean-up Microsoft Word HTML

by starlight (Novice)
on Dec 06, 2002 at 01:32 UTC ( #217959=perlquestion )
starlight has asked for the wisdom of the Perl Monks concerning the following question:

Hi everybody,
My question concerns removing the dreadful HTML tags created by Microsoft Word's "Save as HTML..." feature.

(I know, I know... Never mind why I have to deal with it in the first place.)

Are there any freely available scripts or modules to clean up this mess? Perhaps something using HTML::Parser and/or HTML::Tagset?

Re: How to clean-up Microsoft Word HTML
by pfaut (Priest) on Dec 06, 2002 at 01:53 UTC

    Would this help? I've never used it successfully myself (the documents I tried to fix might not have suffered from the problems this tool addresses) and I don't know exactly what problems you're trying to solve.

      Here's my solution for producing pretty (hand-editable) HTML. Set $dos for CR/LF line endings and $nostyle to remove all style information:

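      #!/usr/bin/perl
      use strict;
      use warnings;

      my $nostyle = 1;    # true: remove all style information
      my $dos     = 1;    # true: write CR/LF line endings

      # Slurp all input into one string
      my $text = '';
      while (<>) { $text .= $_ }

      $text =~ s/content="Microsoft Word \d+"/content=""/g;    # blank the generator meta tag
      $text =~ s/(\r|\n)+/ /g;                                  # join everything onto one line
      $text =~ s/<\/?o:.+?>//g;                                 # Office namespace tags (<o:p> etc.)
      $text =~ s/<!--.*?-->//g;                                 # comments, one at a time (non-greedy)
      $text =~ s/xmlns(:.+?)?=".+?"//g;                         # namespace declarations
      $text =~ s/mso-[^:;']+:\s?[^;']*;//g;                     # mso-* property followed by ';'
      $text =~ s/mso-[^:;']+:\s?[^;']*'/'/g;                    # mso-* property that ends the style value
      $text =~ s/style=''//g;                                   # style attributes left empty
      $text =~ s#style='.+?'##g if $nostyle;
      $text =~ s/<link rel=File-List href=".+?">//g;
      $text =~ s/class=\w+//g;                                  # unquoted class names (MsoNormal etc.)
      $text =~ s/<\/?st1:\w+>//g;                               # smart tags
      $text =~ s/\s+>/>/g;                                      # whitespace before '>'
      $text =~ s/>\s+</></g;                                    # whitespace between tags
      $text =~ s/\s+/ /g;                                       # collapse remaining whitespace
      $text =~ s#</?span>##g if $nostyle;
      $text =~ s#<span style='font-size:12.0pt;\s?'>(.+?)</span>#$1#g;
      $text =~ s#<span[^>]*>\s*</span>##g;                      # empty spans
      $text =~ s#<span>(.+?)</span>#$1#g;                       # unwrap attribute-less spans (non-greedy)

      # Re-introduce line breaks so the result is hand-editable
      $text =~ s/(<\w.+?>)/\n$1/g;                              # newline before each opening tag
      $text =~ s/\n<b>/<b>/g;                                   # but keep <b> inline
      $text =~ s#</(html|body|head|tr|td|table|div)>#\n</$1>#g;
      $text =~ s#\n<html>#<html>#;
      $text =~ s#\n#\r\n#g if $dos;

      print $text;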
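
To see what a few of these substitutions do, here is a minimal sketch that applies a handful of patterns modeled on the script above to a made-up Word fragment. The sub name clean_word_fragment and the sample markup are assumptions of this sketch, not part of the posted script, and the patterns use non-greedy matching and negated character classes so that one match cannot run past the comment or style value it started in.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the posted script): apply a small subset of
# Word-cleanup substitutions to a string and return the cleaned copy.
sub clean_word_fragment {
    my ($text) = @_;
    $text =~ s/(\r|\n)+/ /g;                  # join lines so no tag spans a newline
    $text =~ s/<\/?o:.+?>//g;                 # drop Office namespace tags such as <o:p>
    $text =~ s/<!--.*?-->//g;                 # drop comments one at a time
    $text =~ s/mso-[^:;']+:\s?[^;']*;//g;     # mso-* property followed by ';'
    $text =~ s/mso-[^:;']+:\s?[^;']*'/'/g;    # mso-* property that ends the style value
    $text =~ s/style=''//g;                   # style attributes left empty
    $text =~ s/class=\w+//g;                  # unquoted class names such as MsoNormal
    $text =~ s/\s+>/>/g;                      # whitespace before '>'
    $text =~ s/\s+/ /g;                       # collapse remaining whitespace
    return $text;
}

# Made-up sample in the style of Word's "Save as HTML" output
my $sample = "<p class=MsoNormal style='mso-pagination:none'>Hello<o:p></o:p></p>\n"
           . "<!-- a --><b>kept</b><!-- b -->";

print clean_word_fragment($sample), "\n";    # prints: <p>Hello</p> <b>kept</b>
```

With a greedy `.+` in the comment pattern, the same input would lose the <b>kept</b> element, because the match would run from the first <!-- to the last -->.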
Re: How to clean-up Microsoft Word HTML
by reclaw (Curate) on Dec 06, 2002 at 02:52 UTC


      Cleaning up Word HTML is actually the exact purpose for which Tidy was created. It started as a W3C project, or at least was hosted there for a time. I understand it's an excellent piece of software though I have only tinkered with it because I write my HTML in Notepad. *grin*
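
For anyone who wants to drive Tidy from Perl, here is a rough sketch. It assumes the tidy command-line program is installed and on the PATH; word-2000 and clean are Tidy configuration options, while the helper names tidy_command and run_tidy are hypothetical and exist only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for one Tidy run.
sub tidy_command {
    my ($in, $out) = @_;
    return (
        'tidy',
        '--word-2000', 'yes',    # ask Tidy to strip Word-specific markup
        '--clean',     'yes',    # replace presentational markup with style rules
        '-o', $out,              # write the result to this file
        $in,
    );
}

# Hypothetical helper: run Tidy and return its exit status
# (0 = ok, 1 = warnings, 2 = errors). Assumes tidy is on the PATH.
sub run_tidy {
    my ($in, $out) = @_;
    my $rc = system(tidy_command($in, $out));
    die "could not run tidy: $!\n" if $rc == -1;
    return $rc >> 8;
}

# Only runs when called as: script.pl input.html output.html
if (@ARGV == 2) {
    my $status = run_tidy(@ARGV);
    warn "tidy reported errors\n" if $status >= 2;
}
```

Passing the command as a list to system avoids the shell, so file names with spaces need no extra quoting.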

Re: How to clean-up Microsoft Word HTML
by impossiblerobot (Deacon) on Dec 06, 2002 at 02:56 UTC
