Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re^2: docx to txt

by welle (Beadle)
on Jan 14, 2011 at 21:57 UTC ( #882419=note: print w/ replies, xml ) Need Help??


in reply to Re: docx to txt
in thread docx to txt

I am back from some days of holidays ;)

Oh, I would really appreciate if you could post your solution!


Comment on Re^2: docx to txt
Re^3: docx to txt
by elef (Friar) on Jan 19, 2011 at 20:12 UTC
    I'm guessing that was directed towards me.
    Apologies for not answering sooner, I just found this post now.

    Anyway, here's the code (keep in mind that I'm by no means an experienced coder... also, this code was taken from a much larger project, so some of the silly variable names have a logic you can't see here. It's also missing some logging and reporting I left out etc. I have only tested this on Windows, but it's designed to work on most linux distros and OSX as well.):
    #!/usr/bin/perl use strict; use warnings; use File::Spec; # IDENTIFY OS my $OS; if ($^O =~ /mswin/i) {$OS = "Windows";print "OS detected: Windows\n"} elsif ($^O =~ /linux/i) {$OS = "Linux";print "OS detected: Linux\n"} elsif ($^O =~ /darwin/i) {$OS = "Mac";print "OS detected: Mac OS X\n"} + else {print "\nUnable to detect OS type, choose your OS:\n\nWindows + Any version of Microsoft Windows\nMac Any flavour of Mac OS X\nLi +nux Linux of some sort\n\n"; do { chomp ($OS = <STDIN>); print "\nIncorrect OS type. Try again.\n\n" unless $OS eq "Windows +" or $OS eq "Mac" or $OS eq "Linux";} until ($OS eq "Windows" or $OS +eq "Mac" or $OS eq "Linux"); } # IDENTIFY SCRIPT PATH my $script = File::Spec->rel2abs( __FILE__ ); $script =~ /(.*)[\/|\\](.*)/; my $scriptpath = $1; if (-d "$scriptpath/scripts/docx2txt") { # print "\nScript folder found.\n";# comment out } else { do { print "\nThe script path found automatically (${scriptpath}) i +s not correct.\nPlease drag and drop the aligner script here and pres +s enter. (If your OS doesn't support drag & drop, copy-paste the path + here. You can paste by right clicking in the window or right clickin +g the icon in the top left corner of this window.)\n"; chomp ($script = <STDIN>); $script =~ / *[\"\'](.*)[\/\\](.*)[\"\'] */; $scriptpath = $1; $scriptpath =~ s/^\s+//; # strip leading wh +itespace $scriptpath =~ s/\s+$//; # strip trailing w +hitespace if (-e "$scriptpath/scripts/docx2txt") {print "\nScript folder + identified correctly.\n"} } until (-e "$scriptpath/scripts/docx2txt"); } # DRAG AND DROP INPUT FILE my $file1_full; print "\n\nDrag and drop your input file here and press enter.\n"; chomp ($file1_full = <STDIN>); $file1_full =~ s/^\s+//; # strip leading whitespace $file1_full =~ s/\s+$//; # strip trailing whitespac +e $file1_full =~ /^[\"\']?(.*)[\/\\]([^\"\']*)[\"\']?$/; my $folder = $1; my $file1 = $2; $file1 =~ /(.*)\.(.*)/; my $f1 = $1; my $ext = lc($2); # CONVERT DOCX TO UTF-8 TXT if ($OS eq "Windows") { # create config file, run docx2txt.exe modded to use win config fi +le open (DOCX2TXTCONFIG, "<", "$scriptpath/scripts/docx2txt/docx2txt. +config") or die "Can't open file: $!"; unlink "$scriptpath/scripts/docx2txt/docx2txt_win.config"; open (DOCX2TXTCONFIG_WIN, ">>", "$scriptpath/scripts/docx2txt/docx +2txt_win.config") or die "Can't open file: $!"; while (<DOCX2TXTCONFIG>) { s/^unzip *=>.*$/unzip => \'$scriptpath\\scripts\\docx2 +txt\\unzip\\unzip\.exe\',/; print DOCX2TXTCONFIG_WIN $_; } close DOCX2TXTCONFIG; close DOCX2TXTCONFIG_WIN; system ("\"$scriptpath\\scripts\\docx2txt\\docx2txt_win.exe\" \"$f +older/$file1\" \"$folder/${f1}.txt\""); } else { # linux and mac both use the original docx2txt.pl and both ha +ve unzip at usr/bun/unzip system ("perl \"$scriptpath/scripts/docx2txt/docx2txt.pl\" \"$fold +er/$file1\" \"$folder/${f1}.txt\""); } #work with the txt file from now on $file1 = "${f1}.txt"; # CHECK FILE SIZE, ABORT IF 0 my $file_1_size = -s "$folder/$file1"; if ($file_1_size == 0) { print "\n\nThe file conversion seems to have failed: the generated + file is empty. ABORTING.\n\n"; sleep 3; die; } # DONE print "\n$file1 created ($file_1_size bytes).\nPress enter to quit.\n" +; <STDIN>;
    Now, this requires docx2txt.pl for *nix, and docx2txt.exe and unzip.exe on windows. It looks for these in scripts/docx2txt, I have uploaded the necessary files here. Of course you can get your own unzip binary and generate docx2txt.exe yourself with pp, which is what I did, or just use the .pl on Windows as well if your users can be expected to have perl installed.
    The file won't be up here for long, so here's a summary in case someone reads this when I've already yanked it:
    Docx2txt needs to unzip the docx (zip) files. To make this work on Windows, I have modded the original perl script to use a different config file (docx2txt_win.config) which the main script generates at runtime, filling in the path to unzip.exe (scripts/doxc2txt/unzip) according to what folder it's in. Then I generated an executable (docx2txt_win.exe) out of this slightly modified script. On Linux and OS X systems, the original .pl is used without modifications as these OSes can reasonably be expected to have an unzip utility at usr/bin/unzip.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://882419]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (6)
As of 2015-07-04 02:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (57 votes), past polls