Re: docx to txt

by elef (Friar)
on Jan 09, 2011 at 22:14 UTC


in reply to docx to txt

I just noticed you wrote docx2txt "could be a starting point, but I am having some difficulties to figure out how to integrate it into my application."

It seems pretty straightforward to me, but here's what I did in case you need it: out of laziness, I kept docx2txt in a separate file and just called it with
system ("perl \"$pathtodocx/docx2txt.pl\" \"${infile}.doc\" \"${outfile}.txt\"");

Now, my app needs to run on both *nix and Windows, and Windows compatibility required quite a bit of trickery. I modified the .pl slightly, bundled an unzip program with my application and wrote some code that generates a .config for docx2txt at runtime, to let docx2txt know where unzip.exe is. It's not what I would call elegant, but it works. If you need this to run on Windows computers other than your own, I can paste the code here.

Replies are listed 'Best First'.
Re^2: docx to txt
by Anonymous Monk on Jan 10, 2011 at 12:06 UTC
    system, autodie
    { use autodie qw' system '; system $^X, "$pathtodocx/docx2txt.pl", "$infile.doc", "$outfile.txt"; }
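The list form of system used above passes each argument to the child process as-is, so paths containing spaces need no manual quoting. For cases where autodie is not wanted, the following is a minimal sketch of the same call with an explicit status check. The helper name run_checked is hypothetical and not part of the thread; the demo call runs a trivial Perl one-liner so the block works without docx2txt.pl being present.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original posts): run a command given
# as a list, without going through a shell, and return its exit code.
# Dies if the program could not be started or was killed by a signal.
sub run_checked {
    my @cmd = @_;
    my $rc = system { $cmd[0] } @cmd;    # indirect-object form: list semantics, no shell
    die "Could not run $cmd[0]: $!\n" if $rc == -1;
    die sprintf("%s died from signal %d\n", $cmd[0], $? & 127) if $? & 127;
    return $? >> 8;                      # exit code of the child
}

# Demo with the running perl ($^X) and a trivial one-liner.
my $exit = run_checked($^X, '-e', 'exit 0');
print "exit code: $exit\n";

# With the variables from the post, the call would look like:
# run_checked($^X, "$pathtodocx/docx2txt.pl", "$infile.doc", "$outfile.txt");
```

Returning the exit code instead of dying on a non-zero status leaves it to the caller to decide whether a failed conversion is fatal.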
Re^2: docx to txt
by welle (Beadle) on Jan 14, 2011 at 21:57 UTC

    I am back from some days of holidays ;)

    Oh, I would really appreciate if you could post your solution!

      I'm guessing that was directed towards me.
      Apologies for not answering sooner, I just found this post now.

      Anyway, here's the code. Keep in mind that I'm by no means an experienced coder. Also, this code was taken from a much larger project, so some of the silly variable names follow a logic you can't see here, and some logging and reporting has been left out. I have only tested this on Windows, but it's designed to work on most Linux distros and OS X as well.
      #!/usr/bin/perl
      use strict;
      use warnings;
      use File::Spec;

      # IDENTIFY OS
      my $OS;
      if    ($^O =~ /mswin/i)  { $OS = "Windows"; print "OS detected: Windows\n" }
      elsif ($^O =~ /linux/i)  { $OS = "Linux";   print "OS detected: Linux\n" }
      elsif ($^O =~ /darwin/i) { $OS = "Mac";     print "OS detected: Mac OS X\n" }
      else {
          print "\nUnable to detect OS type, choose your OS:\n\n"
              . "Windows  Any version of Microsoft Windows\n"
              . "Mac  Any flavour of Mac OS X\n"
              . "Linux  Linux of some sort\n\n";
          do {
              chomp ($OS = <STDIN>);
              print "\nIncorrect OS type. Try again.\n\n"
                  unless $OS eq "Windows" or $OS eq "Mac" or $OS eq "Linux";
          } until ($OS eq "Windows" or $OS eq "Mac" or $OS eq "Linux");
      }

      # IDENTIFY SCRIPT PATH
      my $script = File::Spec->rel2abs( __FILE__ );
      $script =~ /(.*)[\/\\](.*)/;
      my $scriptpath = $1;

      if (-d "$scriptpath/scripts/docx2txt") {
          # print "\nScript folder found.\n";   # comment out
      } else {
          do {
              print "\nThe script path found automatically (${scriptpath}) is not correct.\n"
                  . "Please drag and drop the aligner script here and press enter. "
                  . "(If your OS doesn't support drag & drop, copy-paste the path here. \n"
                  . "You can paste by right clicking in the window or right clicking "
                  . "the icon in the top left corner of this window.)\n";
              chomp ($script = <STDIN>);
              $script =~ / *[\"\'](.*)[\/\\](.*)[\"\'] */;
              $scriptpath = $1;
              $scriptpath =~ s/^\s+//;    # strip leading whitespace
              $scriptpath =~ s/\s+$//;    # strip trailing whitespace
              if (-e "$scriptpath/scripts/docx2txt") { print "\nScript folder identified correctly.\n" }
          } until (-e "$scriptpath/scripts/docx2txt");
      }

      # DRAG AND DROP INPUT FILE
      my $file1_full;
      print "\n\nDrag and drop your input file here and press enter.\n";
      chomp ($file1_full = <STDIN>);
      $file1_full =~ s/^\s+//;    # strip leading whitespace
      $file1_full =~ s/\s+$//;    # strip trailing whitespace
      $file1_full =~ /^[\"\']?(.*)[\/\\]([^\"\']*)[\"\']?$/;
      my $folder = $1;
      my $file1 = $2;
      $file1 =~ /(.*)\.(.*)/;
      my $f1 = $1;
      my $ext = lc($2);

      # CONVERT DOCX TO UTF-8 TXT
      if ($OS eq "Windows") {
          # create config file, run docx2txt.exe modded to use win config file
          open (DOCX2TXTCONFIG, "<", "$scriptpath/scripts/docx2txt/docx2txt.config") or die "Can't open file: $!";
          unlink "$scriptpath/scripts/docx2txt/docx2txt_win.config";
          open (DOCX2TXTCONFIG_WIN, ">>", "$scriptpath/scripts/docx2txt/docx2txt_win.config") or die "Can't open file: $!";
          while (<DOCX2TXTCONFIG>) {
              s/^unzip *=>.*$/unzip => \'$scriptpath\\scripts\\docx2txt\\unzip\\unzip\.exe\',/;
              print DOCX2TXTCONFIG_WIN $_;
          }
          close DOCX2TXTCONFIG;
          close DOCX2TXTCONFIG_WIN;
          system ("\"$scriptpath\\scripts\\docx2txt\\docx2txt_win.exe\" \"$folder/$file1\" \"$folder/${f1}.txt\"");
      } else {
          # linux and mac both use the original docx2txt.pl and both have unzip at /usr/bin/unzip
          system ("perl \"$scriptpath/scripts/docx2txt/docx2txt.pl\" \"$folder/$file1\" \"$folder/${f1}.txt\"");
      }

      # work with the txt file from now on
      $file1 = "${f1}.txt";

      # CHECK FILE SIZE, ABORT IF 0
      # -s returns undef if the file does not exist and 0 if it is empty
      my $file_1_size = -s "$folder/$file1";
      if (!$file_1_size) {
          print "\n\nThe file conversion seems to have failed: the generated file is missing or empty. \nABORTING.\n\n";
          sleep 3;
          die;
      }

      # DONE
      print "\n$file1 created ($file_1_size bytes).\nPress enter to quit.\n";
      <STDIN>;
      Now, this requires docx2txt.pl on *nix, and docx2txt.exe and unzip.exe on Windows. It looks for these in scripts/docx2txt; I have uploaded the necessary files here. Of course you can get your own unzip binary and generate docx2txt.exe yourself with pp, which is what I did, or just use the .pl on Windows as well if your users can be expected to have perl installed.
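The choice between the bundled executable and the plain .pl can be isolated in one small function. The following is a rough sketch under assumptions: the helper name converter_command is hypothetical, the file names follow the script above (docx2txt_win.exe and docx2txt.pl under scripts/docx2txt), and the command is returned as a list so it can be handed to the list form of system without manual quoting.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not from the original post): build the converter
# command as a list for the given OS label ("Windows", "Mac" or "Linux").
sub converter_command {
    my ($os, $scriptpath, $in, $out) = @_;
    my @dir = ($scriptpath, 'scripts', 'docx2txt');
    if ($os eq 'Windows') {
        # Bundled executable, as in the script above.
        return (File::Spec->catfile(@dir, 'docx2txt_win.exe'), $in, $out);
    }
    # Elsewhere: run the original script with the perl that is already running.
    return ($^X, File::Spec->catfile(@dir, 'docx2txt.pl'), $in, $out);
}

# Example usage with placeholder paths; the command is only printed here.
my @cmd = converter_command('Linux', '/opt/app', '/tmp/in.docx', '/tmp/in.txt');
print join(' ', @cmd), "\n";
# To actually run it: system(@cmd) == 0 or die "conversion failed: $?";
```

File::Spec->catfile joins the path with the separator of the platform the code runs on, which avoids hard-coding backslashes for the Windows branch.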
      The file won't be up here for long, so here's a summary in case someone reads this when I've already yanked it:
      Docx2txt needs to unzip the docx (zip) files. To make this work on Windows, I modded the original perl script to use a different config file (docx2txt_win.config), which the main script generates at runtime, filling in the path to unzip.exe (scripts/docx2txt/unzip) according to what folder it's in. Then I generated an executable (docx2txt_win.exe) out of this slightly modified script. On Linux and OS X systems, the original .pl is used without modifications, as these OSes can reasonably be expected to have an unzip utility at /usr/bin/unzip.
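The config generation step described in this summary can be sketched on its own. This is a minimal version under assumptions: the helper name write_win_config is hypothetical, and the config is assumed to contain a line of the form unzip => '...', which is the format the substitution in the script above matches. It uses lexical filehandles and opens the output for writing instead of unlinking and appending.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): copy a docx2txt-style
# config file, replacing the "unzip => ..." line with the given path.
sub write_win_config {
    my ($src, $dst, $unzip_path) = @_;
    open my $in,  '<', $src or die "Can't read $src: $!";
    open my $out, '>', $dst or die "Can't write $dst: $!";
    while (my $line = <$in>) {
        # The path is interpolated as-is; its backslashes are not re-interpreted.
        # The trailing newline is kept because .* stops before it.
        $line =~ s/^unzip\s*=>.*$/unzip => '$unzip_path',/;
        print {$out} $line;
    }
    close $in;
    close $out or die "Can't close $dst: $!";
    return;
}

# Example call with the paths used in the script (commented out so the
# block runs without those files present):
# write_win_config(
#     "$scriptpath/scripts/docx2txt/docx2txt.config",
#     "$scriptpath/scripts/docx2txt/docx2txt_win.config",
#     "$scriptpath\\scripts\\docx2txt\\unzip\\unzip.exe",
# );
```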
