http://www.perlmonks.org?node_id=310195

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, can anyone suggest how I can convert a Word (.doc) file into HTML format? Is there a module for it? Any suggestions would be appreciated. Thank you!

Replies are listed 'Best First'.
Re: Convert word(.doc) file to html file
by Corion (Patriarch) on Nov 26, 2003 at 11:12 UTC

    Perl under Windows has the great capability of automating other programs through the Win32::OLE module. That way you can remotely control MS Word the same as with Visual Basic for Applications through the Office Object Model.

    The easiest way to get a (Visual Basic) stub of what you want to do is:

    1. Practice what you want to do with the program manually
    2. Switch on the macro recorder
    3. Do what you want to automate one final time
    4. Stop the macro recorder
    5. Look at the recorded macros
    6. Convert the recorded Visual Basic macro to Perl

    Conversion of the recorded Visual Basic macro to Perl is a fairly mechanical process, and what you need to know beyond that should be explained in the Win32::OLE documentation.

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: Convert word(.doc) file to html file
by davis (Vicar) on Nov 26, 2003 at 11:29 UTC

    It's not Perl, and the results are not perfect, but wvWare has a program called wvHtml, which does exactly what you want.
    cheers


    davis
    It's not easy to juggle a pregnant wife and a troubled child, but somehow I managed to fit in eight hours of TV a day.
      wvWare++

      I had to convert about 80 or so Word docs into HTML, and even though wvHtml didn't work very well for me, wvText did. I created a 'pipeline':

      WORD -> wvText -> custom Perl -> HTML::FromText -> HTML Tidy  
      
      The results are still up over at the Nashville Film Festival 2003 films page.

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)
      
Re: Convert word(.doc) file to html file
by falic (Beadle) on Nov 26, 2003 at 13:22 UTC
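    Use Win32::OLE; it should allow you to open a Word document and save it as an HTML file.

    Should be easy enough, along the lines of:

    use Win32::OLE;
    use Win32::OLE::Const 'Microsoft.Word';   # imports wdFormatHTML and the other Word constants

    # 'Quit' names the method Win32::OLE calls on Word when $Word is destroyed
    my $Word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: " . Win32::OLE->LastError();

    # Full paths are safest; Word does not run in the script's working directory
    my $Doc = $Word->Documents->Open($File);

    $Doc->SaveAs( { Filename => $HTMLFile, FileFormat => wdFormatHTML } );

    $Doc->Close();
    $Word->Quit();    # the Application object has Quit, not Close

    Where $File is the Word .doc file and $HTMLFile is the HTML output file.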
      I assume then that this can be automated if you supply all filenames etc in advance?
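
To make the batch question concrete, here is a minimal sketch of looping the same Open/SaveAs calls over a list of files. It assumes Windows with Word installed. Taking the file list from the command line and naming each output after its input are illustrative choices, not anything from the posts above.

```perl
use strict;
use warnings;
use File::Spec;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';   # imports wdFormatHTML

# Assumption: the .doc files to convert are passed as command-line arguments.
my @files = @ARGV or die "usage: $0 file.doc [file.doc ...]\n";

# One Word instance is reused for every file; 'Quit' shuts it down
# automatically when $word goes out of scope.
my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Cannot start Word: ", Win32::OLE->LastError(), "\n";

for my $file (@files) {
    # Word does not share the script's working directory, so use full paths.
    my $doc_path = File::Spec->rel2abs($file);

    # Derive the output name: report.doc becomes report.html.
    my $html_path = $doc_path;
    unless ($html_path =~ s/\.doc$/.html/i) {
        warn "Skipping $file: not a .doc file\n";
        next;
    }

    my $doc = $word->Documents->Open($doc_path);
    unless ($doc) {
        warn "Cannot open $doc_path: ", Win32::OLE->LastError(), "\n";
        next;
    }

    $doc->SaveAs({ FileName => $html_path, FileFormat => wdFormatHTML });
    $doc->Close(0);    # 0 = wdDoNotSaveChanges, so Word never prompts
}
```

Note that the Windows command shell does not expand wildcards, so `*.doc` would have to be expanded by the script itself (for example with `glob`) before the loop.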
Re: Convert word(.doc) file to html file
by wine (Scribe) on Nov 26, 2003 at 11:32 UTC
    If you don't mind using an external program, you might also want to look at wvWare, an open source library for converting a whole range of Word documents to a number of output formats, including HTML. The library comes with some easy-to-use front ends.

    Look at: http://wvware.sourceforge.net/

    - wine

Re: Convert word(.doc) file to html file
by EvdB (Deacon) on Nov 26, 2003 at 11:14 UTC
    Word files are in a proprietary format. Your best bet is to open the file in Word and save it as HTML from there... If you need to do this for many files, perhaps Word can be scripted on your platform?

    --tidiness is the memory loss of environmental mnemonics

      Word will convert .doc files to HTML easily but there's a drawback. Word has a tendency to throw a lot of junk into HTML files--I've seen webpages at my workplace that specifically request that nobody use Word to edit them because the author doesn't want to deal with the messes Word makes.
Re: Convert word(.doc) file to html file
by thraxil (Prior) on Nov 26, 2003 at 17:06 UTC

    Have Word save as HTML, as others have suggested; then you'll probably want to run the result through something like HTML Tidy to clean up the horrible markup that Word produces.

    Also, if you ever switch to OpenOffice, you can convert those documents to XHTML pretty easily.
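
As a rough illustration of the kind of junk Word adds, the sketch below strips a few common Word-specific constructs with plain regexes. `strip_word_markup` is a hypothetical helper written for this example, and regex-based cleanup is only an approximation; it is not a substitute for a real tool such as HTML Tidy.

```perl
use strict;
use warnings;

# Hypothetical helper: removes a few Word-specific constructs from HTML.
# It is deliberately crude and does not parse the document.
sub strip_word_markup {
    my ($html) = @_;

    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    $html =~ s/<!--\[if\b.*?<!\[endif\]-->//gs;

    # Office namespace tags such as <o:p> and </o:p> (inner text is kept)
    $html =~ s{</?[ovw]:[^>]*>}{}g;

    # class=MsoNormal and similar Word style classes
    $html =~ s/\s+class="?Mso\w+"?//g;

    # style attributes that contain mso-* declarations; note that the
    # whole attribute is dropped, including any non-mso declarations
    $html =~ s/\s+style="[^"]*mso-[^"]*"//g;

    return $html;
}

my $sample = '<p class=MsoNormal style="mso-margin-top-alt:auto">Hello<o:p></o:p></p>';
print strip_word_markup($sample), "\n";    # prints <p>Hello</p>
```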

Re: Convert word(.doc) file to html file
by warthurton (Sexton) on Nov 26, 2003 at 21:52 UTC