http://www.perlmonks.org?node_id=1020772


in reply to docx to html conversion.

First, I second kennethk's questions. Second, here are a few remarks of my own:
  1. libreoffice will probably do the job very nicely.
  2. If it won't, it will at least reduce the .docx to .rtf or even text, which may be a lot easier to work with
    • ...unless the document has some substantive reason for fancy formatting -- which is to say, a reason the data would be less useful if not colorized, italicized, boldfaced, indented, outdented... blah, blah, blah.
  3. Alternatively, CPAN has many offerings that may be useful, though I didn't take the time to check their applicability to CentOS. A simple Google search for "site:cpan.org docx" will show you the wealth of options.
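
For remark 1, a minimal sketch of driving LibreOffice from Perl, assuming the binary is on the PATH as soffice (on some systems it is named libreoffice) and that its headless --convert-to mode is available. The helper names docx_to_html_cmd, expected_html_path and convert_docx are hypothetical, made up for this illustration; the quality of the resulting HTML is not something this sketch can promise.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Build the argument list for a headless LibreOffice conversion.
# $soffice is the name or path of the LibreOffice binary (an assumption;
# adjust it to whatever your system installs).
sub docx_to_html_cmd {
    my ($soffice, $docx, $outdir) = @_;
    return ($soffice, '--headless', '--convert-to', 'html',
            '--outdir', $outdir, $docx);
}

# LibreOffice names the output after the input file, with the new extension.
sub expected_html_path {
    my ($docx, $outdir) = @_;
    my ($name) = fileparse($docx, qr/\.[^.]*/);
    return File::Spec->catfile($outdir, "$name.html");
}

# Run the conversion and return the path of the HTML file, or die.
sub convert_docx {
    my ($docx, $outdir) = @_;
    my @cmd = docx_to_html_cmd('soffice', $docx, $outdir);
    # List-form system() bypasses the shell, so odd file names are safe.
    system(@cmd) == 0 or die "conversion failed: @cmd (exit $?)\n";
    my $html = expected_html_path($docx, $outdir);
    -e $html or die "expected output $html was not created\n";
    return $html;
}
```

Calling convert_docx('/path/to/report.docx', '/tmp/out') would then return the path of the generated HTML file if the conversion succeeds.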

If you didn't program your executable by toggling in binary, it wasn't really programming!

Re^2: docx to html conversion.
by huchister (Acolyte) on Feb 27, 2013 at 20:46 UTC

    I tried unoconv again to convert docx to html. The result was not really what I wanted, but with a couple of tweaks I was able to turn the docx into usable html.
    If anybody needs a reference, below is the example code / Unix command line I worked with.

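    use HTML::TreeBuilder;
    use HTML::Entities qw(decode_entities);

    # Convert the docx to html; the file is written relative to the current
    # directory, which is assumed here to be $upload_dir
    `unoconv --stdout -f html "$docxfileloc" > "$htmfile"`;

    my $t       = HTML::TreeBuilder->new_from_file("$upload_dir/$htmfile");
    my $body    = $t->look_down(_tag => q{body});    # grab <body>
    my @content = $body->detach_content;

    # Serialize the children only, excluding the <body>, </body> tags;
    # plain text nodes are strings rather than objects, so pass them through
    my $html = join '', map { ref $_ ? $_->as_HTML : $_ } @content;
    $html = decode_entities($html);                  # decode special characters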

    Also, if possible, use AbiWord for docx -> html. Its output is better than unoconv's; I just couldn't use it due to a version compatibility issue.

    Thank you for your replies. I hope my solution helps others.
Re^2: docx to html conversion.
by huchister (Acolyte) on Feb 27, 2013 at 13:58 UTC
    I've tried looking on CPAN; only 6 packages came back and none of them was well suited. Converting to RTF won't help, since the document has a few tables and indentations. I shall try out LibreOffice as you mentioned. It's always frustrating when you're forced to work with limited resources. Thanks!