http://www.perlmonks.org?node_id=882419


in reply to Re: docx to txt
in thread docx to txt

I am back from some days of holidays ;)

Oh, I would really appreciate it if you could post your solution!

Re^3: docx to txt
by elef (Friar) on Jan 19, 2011 at 20:12 UTC
    I'm guessing that was directed towards me.
    Apologies for not answering sooner, I just found this post now.

    Anyway, here's the code. Keep in mind that I'm by no means an experienced coder. Also, this code was taken from a much larger project, so some of the silly variable names have a logic you can't see here, and some logging and reporting has been left out. I have only tested this on Windows, but it's designed to work on most Linux distros and OS X as well:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Spec;

    # IDENTIFY OS
    my $OS;
    if    ($^O =~ /mswin/i)  { $OS = "Windows"; print "OS detected: Windows\n" }
    elsif ($^O =~ /linux/i)  { $OS = "Linux";   print "OS detected: Linux\n" }
    elsif ($^O =~ /darwin/i) { $OS = "Mac";     print "OS detected: Mac OS X\n" }
    else {
        print "\nUnable to detect OS type, choose your OS:\n\nWindows Any version of Microsoft Windows\nMac Any flavour of Mac OS X\nLinux Linux of some sort\n\n";
        do {
            chomp ($OS = <STDIN>);
            print "\nIncorrect OS type. Try again.\n\n" unless $OS eq "Windows" or $OS eq "Mac" or $OS eq "Linux";
        } until ($OS eq "Windows" or $OS eq "Mac" or $OS eq "Linux");
    }

    # IDENTIFY SCRIPT PATH
    my $script = File::Spec->rel2abs( __FILE__ );
    $script =~ /(.*)[\/\\](.*)/;    # split at the last slash or backslash
    my $scriptpath = $1;

    if (-d "$scriptpath/scripts/docx2txt") {
        # print "\nScript folder found.\n";# comment out
    } else {
        do {
            print "\nThe script path found automatically (${scriptpath}) is not correct.\n"
                . "Please drag and drop the aligner script here and press enter. "
                . "(If your OS doesn't support drag & drop, copy-paste the path here.\n"
                . "You can paste by right clicking in the window or right clicking the icon in the top left corner of this window.)\n";
            chomp ($script = <STDIN>);
            $script =~ / *[\"\'](.*)[\/\\](.*)[\"\'] */;
            $scriptpath = $1;
            $scriptpath =~ s/^\s+//;    # strip leading whitespace
            $scriptpath =~ s/\s+$//;    # strip trailing whitespace
            if (-e "$scriptpath/scripts/docx2txt") {print "\nScript folder identified correctly.\n"}
        } until (-e "$scriptpath/scripts/docx2txt");
    }

    # DRAG AND DROP INPUT FILE
    my $file1_full;
    print "\n\nDrag and drop your input file here and press enter.\n";
    chomp ($file1_full = <STDIN>);
    $file1_full =~ s/^\s+//;    # strip leading whitespace
    $file1_full =~ s/\s+$//;    # strip trailing whitespace

    # split into folder and file name, dropping any surrounding quotes
    $file1_full =~ /^[\"\']?(.*)[\/\\]([^\"\']*)[\"\']?$/;
    my $folder = $1;
    my $file1 = $2;

    # split the file name into base name and extension
    $file1 =~ /(.*)\.(.*)/;
    my $f1 = $1;
    my $ext = lc($2);

    # CONVERT DOCX TO UTF-8 TXT
    if ($OS eq "Windows") {
        # create config file, run docx2txt.exe modded to use win config file
        open (DOCX2TXTCONFIG, "<", "$scriptpath/scripts/docx2txt/docx2txt.config") or die "Can't open file: $!";
        unlink "$scriptpath/scripts/docx2txt/docx2txt_win.config";
        open (DOCX2TXTCONFIG_WIN, ">>", "$scriptpath/scripts/docx2txt/docx2txt_win.config") or die "Can't open file: $!";
        while (<DOCX2TXTCONFIG>) {
            # point the unzip setting at the bundled unzip.exe
            s/^unzip *=>.*$/unzip => \'$scriptpath\\scripts\\docx2txt\\unzip\\unzip\.exe\',/;
            print DOCX2TXTCONFIG_WIN $_;
        }
        close DOCX2TXTCONFIG;
        close DOCX2TXTCONFIG_WIN;

        system ("\"$scriptpath\\scripts\\docx2txt\\docx2txt_win.exe\" \"$folder/$file1\" \"$folder/${f1}.txt\"");
    } else {
        # linux and mac both use the original docx2txt.pl and both have unzip at /usr/bin/unzip
        system ("perl \"$scriptpath/scripts/docx2txt/docx2txt.pl\" \"$folder/$file1\" \"$folder/${f1}.txt\"");
    }

    # work with the txt file from now on
    $file1 = "${f1}.txt";

    # CHECK FILE SIZE, ABORT IF 0
    # -s returns undef if the file is missing, so treat that as 0 as well
    my $file_1_size = (-s "$folder/$file1") || 0;
    if ($file_1_size == 0) {
        print "\n\nThe file conversion seems to have failed: the generated file is empty.\nABORTING.\n\n";
        sleep 3;
        die;
    }

    # DONE
    print "\n$file1 created ($file_1_size bytes).\nPress enter to quit.\n";
    <STDIN>;
    Now, this requires docx2txt.pl on *nix, and docx2txt.exe and unzip.exe on Windows. The script looks for these in scripts/docx2txt; I have uploaded the necessary files here. Of course, you can get your own unzip binary and generate docx2txt.exe yourself with pp, which is what I did, or just use the .pl on Windows as well if your users can be expected to have Perl installed.
    The file won't be up here for long, so here's a summary in case someone reads this when I've already yanked it:
    Docx2txt needs to unzip the docx (zip) files. To make this work on Windows, I have modded the original Perl script to use a different config file (docx2txt_win.config), which the main script generates at runtime, filling in the path to unzip.exe (scripts/docx2txt/unzip) according to what folder it's in. Then I generated an executable (docx2txt_win.exe) out of this slightly modified script. On Linux and OS X systems, the original .pl is used without modifications, as these OSes can reasonably be expected to have an unzip utility at /usr/bin/unzip.
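In case the upload is gone, the config rewriting step described above can be sketched on its own roughly like this. It is only an illustration, not the exact code from the project: the helper name rewrite_unzip_line, the sample config lines and the C:\tools path are all made up for the example, and the only thing assumed about the config file is that it has a line starting with "unzip =>", which is what the main script's regex looks for.

```perl
use strict;
use warnings;

# Hypothetical helper: if the line is the "unzip => ..." setting, point it at
# the given unzip binary; every other line is returned unchanged.
sub rewrite_unzip_line {
    my ($line, $unzip_path) = @_;
    # "." does not match the trailing newline, so the line ending is kept.
    $line =~ s/^unzip\s*=>.*$/unzip => '${unzip_path}',/;
    return $line;
}

# Made-up sample config lines, standing in for docx2txt.config.
my @config = (
    "unzip => '/usr/bin/unzip',\n",
    "another_setting => 'value',\n",
);

# Made-up location of the bundled unzip.exe.
my $unzip = 'C:\\tools\\scripts\\docx2txt\\unzip\\unzip.exe';

my @rewritten = map { rewrite_unzip_line($_, $unzip) } @config;
print @rewritten;
```

In the real script, the lines come from docx2txt.config and the result is written to docx2txt_win.config, which the modified docx2txt_win.exe then reads.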