docx to html conversion.

by huchister (Acolyte)
on Feb 26, 2013 at 20:59 UTC (#1020755=perlquestion)
huchister has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to convert a docx file into HTML using Perl.

I've searched CPAN modules and tried a bunch of different Linux conversion programs; some of them don't work correctly and others have version compatibility issues. (I'm running CentOS 5.9.)

Rather than going through a version upgrade, I'm looking for any Perl module that lets me turn docx input into HTML output.

So far I've tried...

'docbook2html', 'abiword', 'unoconv', 'MSWord::ToHTML'(fail)

Please recommend any good (free) conversion software or Perl module you know of.

Thank you for your future support.

UPDATE 2/27/12: found a solution; see my reply below.

Replies are listed 'Best First'.
Re: docx to html conversion.
by ww (Archbishop) on Feb 26, 2013 at 22:19 UTC
    First: I second kennethk's questions. Second, here are a few remarks of my own:
    1. libreoffice will probably do the job very nicely.
    2. If it won't, it will at least reduce the .docx to .rtf or even text, which may be a lot easier to work with
      • ...unless the document has some substantive reason for fancy formatting -- which is to say, a reason the data would be less useful if not colorized, italicized, boldfaced, indented, outdented... blah, blah, blah.
    3. Alternatively, CPAN has many offerings which may be useful -- though I didn't take the time to check their applicability to centos. A simple google of "site: docx" will show you the wealth of options.
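
    As a rough sketch of remark 1, LibreOffice can be driven headlessly from Perl. This assumes a `soffice` binary on the PATH that supports `--headless --convert-to`, which may not hold for the packages available on CentOS 5.9. The helper names `libreoffice_cmd` and `docx_to_html` and the file paths are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for a headless LibreOffice
# conversion. Assumes `soffice` is on the PATH.
sub libreoffice_cmd {
    my ($docx, $outdir) = @_;
    return ('soffice', '--headless', '--convert-to', 'html',
            '--outdir', $outdir, $docx);
}

# Hypothetical helper: run the conversion; returns true on success.
# List-form system() bypasses the shell, so file names with spaces are safe.
sub docx_to_html {
    my ($docx, $outdir) = @_;
    return system(libreoffice_cmd($docx, $outdir)) == 0;
}

# Usage (paths are placeholders):
# docx_to_html('/tmp/input.docx', '/tmp/out') or warn "conversion failed\n";
```

    The HTML file should land in the output directory under the input file's base name; the result would still need checking for tables and indentation.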

    If you didn't program your executable by toggling in binary, it wasn't really programming!

      I tried unoconv again to convert docx to HTML. The output was not really what I wanted, but with a couple of tweaks I was able to turn the docx into usable HTML.
      If anybody needs a reference, below is the example code / Unix command line I worked with.

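      use HTML::TreeBuilder;
      use HTML::Entities qw(decode_entities);

      # convert the docx to HTML with unoconv
      `unoconv --stdout -f html "$docxfileloc" > "$htmfile"`;
      my $t = HTML::TreeBuilder->new_from_file("$upload_dir/$htmfile");
      my $body = $t->look_down(_tag => q{body});
      my @content = $body->detach_content;    # grab the children of <body>
      # serialize the children, leaving out the <body>, </body> tags
      my $html = join '', map { ref $_ ? $_->as_HTML : $_ } @content;
      $html = decode_entities($html);         # decode special characters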
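
      The backtick call above passes the file name through the shell. A minimal sketch of the same unoconv step using a list-form pipe open, which sidesteps shell quoting problems; `capture_stdout` is a hypothetical helper name, and the unoconv flags are the ones already used above.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without a shell and return its stdout.
sub capture_stdout {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "cannot run $cmd[0]: $!";
    local $/;                       # slurp mode: read everything at once
    my $out = <$fh>;
    close($fh) or die "$cmd[0] failed (exit status " . ($? >> 8) . ")";
    return defined $out ? $out : '';
}

# Usage (the path is a placeholder):
# my $html = capture_stdout('unoconv', '--stdout', '-f', 'html', '/tmp/input.docx');
```

      With the HTML held in a string, the intermediate file is no longer needed and the string can be handed to the parser directly.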

      That said, if possible, use abiword for docx -> html. Its output is better than unoconv's; I just couldn't use it because of a version compatibility issue.

      Thank you for your replies, and I hope my solution helps others.
      I looked on CPAN; only 6 packages came back and none of them was well suited. Converting to RTF won't help since the documents contain a few tables and indentation. I shall try out libreoffice as you mentioned. It's always frustrating when you are forced to work with limited resources. Thanks!
Re: docx to html conversion.
by kennethk (Abbot) on Feb 26, 2013 at 21:52 UTC

    I don't mean to be unhelpful, but is there a reason you aren't doing this on a Windows platform? Is there a reason your input has to be docx? The most common answer to this type of issue is Win32::OLE, i.e. make Windows do it because they don't play well with others.

    See also Convert word(.doc) file to html file.

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Right. The client supplies input only as docx, and I have no choice but to work on a Linux platform. I shall try out Win32::OLE as you suggested.

Node Type: perlquestion [id://1020755]
Approved by Corion