PerlMonks  

How to clean-up Microsoft Word HTML

by starlight (Novice)
on Dec 06, 2002 at 01:32 UTC (#217959)
starlight has asked for the wisdom of the Perl Monks concerning the following question:

Hi everybody,
My question concerns removing the dreadful HTML tags created by Microsoft Word's "Save as HTML..." feature.

(I know, I know... Never mind why I have to deal with it in the first place.)

Are there any freely available scripts or modules to clean up this mess? Perhaps something using HTML-Parser and/or HTML-Tagset?

Re: How to clean-up Microsoft Word HTML
by pfaut (Priest) on Dec 06, 2002 at 01:53 UTC

    Would this help? I've never used it successfully myself (the documents I tried to fix might not have suffered from the problems this tool addresses) and I don't know exactly what problems you're trying to solve.

      Here's my solution for producing pretty (hand-editable) HTML. Set $dos for CR/LF line endings and $nostyle to remove all style information:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $nostyle = 1;   # 1 = remove all style information
      my $dos     = 1;   # 1 = write CR/LF line endings

      # Read all input into one string.
      my $text = join '', <>;

      $text =~ s/content="Microsoft Word \d+"/content="wordclean.pl"/g;
      $text =~ s/(\r|\n)+/ /g;
      # Strip Office tags, comments, namespaces and mso-* styles.
      $text =~ s/<\/?o:.+?>//g;
      $text =~ s/<!--.+?-->//g;   # non-greedy: one comment at a time
      $text =~ s/xmlns(:.+?)?=".+?"//g;
      $text =~ s/mso-.+?:\s?.+?'/'/g;
      $text =~ s/mso-.+?:\s?.+?;//g;
      $text =~ s/style=''//g;
      $text =~ s#style='.+?'##g if $nostyle;
      $text =~ s/<link rel=File-List href=".+?">//g;
      $text =~ s/class=\w+//g;
      $text =~ s/<\/?st1:\w+>//g;
      # Collapse whitespace.
      $text =~ s/\s+>/>/g;
      $text =~ s/>\s+</></g;
      $text =~ s/\s+/ /g;
      # Unwrap redundant spans.
      $text =~ s#</?span>##g if $nostyle;
      $text =~ s#<span style='font-size:12.0pt;\s?'>(.+?)</span>#$1#g;
      $text =~ s#<span[^>]*>\s*</span>##g;
      $text =~ s#<span>(.+?)</span>#$1#g;
      # Start each tag on its own line.
      $text =~ s/(<\w.+?>)/\n$1/g;
      $text =~ s/\n<b>/<b>/g;
      $text =~ s#</(html|body|head|tr|td|table|div)>#\n</$1>#g;
      $text =~ s#\n<html>#<html>#;
      $text =~ s#\n#\r\n#g if $dos;
      print $text;
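
To see what a few of those substitutions do in practice, here is a minimal sketch that applies a small subset of them, in the same order, to a made-up fragment of Word HTML. The helper name `strip_word_markup` and the sample input are illustrative assumptions, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies a subset of the substitutions from the
# script above to a single string and returns the cleaned result.
sub strip_word_markup {
    my ($html) = @_;
    $html =~ s/<\/?o:.+?>//g;        # Office namespace tags such as <o:p>
    $html =~ s/mso-.+?:\s?.+?'/'/g;  # mso-* declaration that ends the style value
    $html =~ s/mso-.+?:\s?.+?;//g;   # remaining mso-* declarations
    $html =~ s/style=''//g;          # style attributes left empty
    $html =~ s/class=\w+//g;         # Word class names such as MsoNormal
    $html =~ s/\s+>/>/g;             # whitespace left before a closing '>'
    return $html;
}

# Made-up sample input, for illustration only.
my $sample = q{<p class=MsoNormal style='mso-pagination:none;'>Hello<o:p></o:p></p>};
print strip_word_markup($sample), "\n";   # prints <p>Hello</p>
```

Because these are plain regular expressions rather than a real HTML parser, they can misfire on unusual markup, so it is worth checking the output on your own documents.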
Re: How to clean-up Microsoft Word HTML
by reclaw (Curate) on Dec 06, 2002 at 02:52 UTC

      ++Reclaw

      Cleaning up Word HTML is actually the exact purpose for which Tidy was created. It started as a W3C project, or at least was hosted there for a time. I understand it's an excellent piece of software though I have only tinkered with it because I write my HTML in Notepad. *grin*
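
For anyone who wants to try Tidy from a Perl script, here is a minimal sketch that shells out to the command-line tool. It assumes a `tidy` executable is on your PATH and that your version supports the `word-2000`, `bare` and `clean` options; the helper name `tidy_command` and the file names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for the tidy command-line tool.
sub tidy_command {
    my ($in, $out) = @_;
    return (
        'tidy',
        '-quiet',               # suppress nonessential output
        '--word-2000', 'yes',   # strip the surplus markup Word 2000 adds
        '--bare',      'yes',   # strip Microsoft-specific HTML
        '--clean',     'yes',   # replace presentational tags with CSS
        '-output',     $out,    # write the result here
        $in,
    );
}

# Only runs when called as: perl wordtidy.pl input.html output.html
if (@ARGV == 2) {
    system(tidy_command(@ARGV));
    die "could not run tidy: $!\n" if $? == -1;
    my $status = $? >> 8;
    # tidy exits with 1 for warnings only and 2 for errors.
    die "tidy reported errors (exit status $status)\n" if $status >= 2;
}
```

Passing the command to `system` as a list avoids the shell, so file names containing spaces or quotes need no extra escaping.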

Re: How to clean-up Microsoft Word HTML
by impossiblerobot (Deacon) on Dec 06, 2002 at 02:56 UTC
