PerlMonks  

Re: docx to html conversion.

by ww (Bishop)
on Feb 26, 2013 at 22:19 UTC


in reply to docx to html conversion.

First: I second kennethk's questions... Second, here are a few remarks of my own:

  1. LibreOffice will probably do the job very nicely.
  2. If it won't, it will at least reduce the .docx to .rtf or even plain text, which may be a lot easier to work with
    • ...unless the document has some substantive reason for fancy formatting -- which is to say, a reason the data would be less useful if not colorized, italicized, boldfaced, indented, outdented... blah, blah, blah.
  3. Alternatively, CPAN has many offerings which may be useful -- though I didn't take the time to check their applicability to CentOS. A simple Google search for "site:CPAN.org docx" will show you the wealth of options.
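
For point 1, a minimal sketch of driving LibreOffice from Perl in headless mode. The binary name "soffice" and the helper names docx_to_html_cmd and html_path_for are assumptions for illustration (some installs name the binary "libreoffice"); the quality of the resulting HTML is not something this sketch can promise.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Build the argument list for a headless LibreOffice conversion.
# "soffice" is an assumed binary name; adjust for your install.
sub docx_to_html_cmd {
    my ($docx, $outdir) = @_;
    return ('soffice', '--headless', '--convert-to', 'html',
            '--outdir', $outdir, $docx);
}

# LibreOffice names the output after the input file, swapping the extension.
sub html_path_for {
    my ($docx, $outdir) = @_;
    my ($base) = fileparse($docx, qr/\.[^.]*/);
    return File::Spec->catfile($outdir, "$base.html");
}

# Only convert when a file is named on the command line.
if (@ARGV) {
    my $docx   = $ARGV[0];
    my $outdir = defined $ARGV[1] ? $ARGV[1] : '.';
    system(docx_to_html_cmd($docx, $outdir)) == 0
        or die "conversion failed: $?\n";
    print html_path_for($docx, $outdir), "\n";
}
```

Passing the command as a list to system() avoids shell quoting problems with file names that contain spaces.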

If you didn't program your executable by toggling in binary, it wasn't really programming!


Re^2: docx to html conversion.
by huchister (Acolyte) on Feb 27, 2013 at 13:58 UTC
    I tried looking on CPAN; only 6 packages came back and none of them were well suited. Converting to RTF won't help, since the document has a few tables and indentations. I shall try out LibreOffice as you mentioned. It's always frustrating when you are forced to work with limited resources. Thanks!
Re^2: docx to html conversion.
by huchister (Acolyte) on Feb 27, 2013 at 20:46 UTC

    I tried unoconv again to convert docx into html. The final product was not really desirable, but with a couple of tweaks I was able to turn the docx into html.
    If anybody needs a reference, below is the example code / unix line I worked with.

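    use HTML::TreeBuilder;
    use HTML::Entities qw(decode_entities);

    # $docxfileloc, $htmfile and $upload_dir are set elsewhere in the script;
    # the redirect target assumes the script runs from $upload_dir
    `unoconv --stdout -f html "$docxfileloc" > "$htmfile"`;

    my $t       = HTML::TreeBuilder->new_from_file("$upload_dir/$htmfile");
    my $body    = $t->look_down(_tag => q{body});    # grab <body>
    my @content = $body->detach_content;
    # serialize the children, excluding the <body>, </body> tags;
    # plain text children come back as strings, not elements
    my $html = join q{}, map { ref $_ ? $_->as_HTML : $_ } @content;
    $html = decode_entities($html);                  # decode special characters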

    Also, if it's possible, use AbiWord for docx -> html. Its output is better than unoconv's; I just couldn't use it due to a version compatibility issue.
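
For anyone who can run AbiWord, a rough sketch of calling its command-line converter from Perl. The helper name abiword_cmd and the file names are hypothetical, and this was not tested in the thread, so whether your installed version reads .docx needs checking first.

```perl
use strict;
use warnings;

# Build an AbiWord command-line conversion: --to picks the output format
# and --to-name the output file. The sub name is hypothetical.
sub abiword_cmd {
    my ($docx, $html) = @_;
    return ('abiword', '--to=html', "--to-name=$html", $docx);
}

# Run only when given an input and an output file name.
if (@ARGV == 2) {
    system(abiword_cmd(@ARGV)) == 0
        or die "abiword failed: $?\n";
}
```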

    Thank you for your replies; I hope my solution helps others.
