#!/usr/bin/perl
use strict;
use warnings;
use Text::FromAny;
my $tFromAny = Text::FromAny->new(file => '/path/to/docx');
my $text = $tFromAny->text;
print $text, "\n";
There's an app for that.
Err, I mean, there's a perl script for that.
Here it is: http://sourceforge.net/projects/docx2txt/
It's a fairly simple Perl script that does just what you need. If you're not on *nix, you'll need to provide an unzip program; by default it just uses /usr/bin/unzip.
Works great for me both on Win7 and Ubuntu.
Thanks for the tip on Text::FromAny though, it looks intriguing. If you need a quick-and-dirty solution for all sorts of random formats, it sure is hard to beat. The CPAN page is silent about what it uses to decipher docx (or any other format except for PDF), so I'd stick with docx2txt if docx is all that's needed.
Note that docx2txt is not perfect, so if your files contain hyperlinks, tables, headers, footers, footnotes etc., the results may not be exactly what you want - but then docx -> txt will always be lossy by definition.
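To make the "lossy by definition" point concrete, here is a minimal sketch of what any docx -> txt converter has to do at its core: pull word/document.xml out of the zip container and throw away the markup. This is not how docx2txt itself is written. It assumes the core module IO::Uncompress::Unzip is available, that document.xml is UTF-8, and the sub names (document_xml_to_text, docx_to_text) are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Turn the raw XML of word/document.xml into rough plain text.
# Everything except paragraph ends, simple tabs and line breaks is discarded,
# which is exactly why tables, footnotes, hyperlinks etc. get lost.
sub document_xml_to_text {
    my ($xml) = @_;
    $xml =~ s{</w:p>}{\n}g;            # end of paragraph -> newline
    $xml =~ s{<w:tab\s*/>}{\t}g;       # tab character element -> tab
    $xml =~ s{<w:br\b[^>]*/>}{\n}g;    # line or page break -> newline
    $xml =~ s{<[^>]+>}{}g;             # drop every remaining tag
    # decode the five predefined XML entities in a single pass
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $xml =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;
    return $xml;
}

# Read word/document.xml straight out of the .docx, no external unzip needed.
sub docx_to_text {
    my ($docx_path) = @_;
    my $xml;
    unzip $docx_path => \$xml, Name => 'word/document.xml'
        or die "Cannot read $docx_path: $UnzipError\n";
    utf8::decode($xml);                # assumption: the part is UTF-8 encoded
    return document_xml_to_text($xml);
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print docx_to_text($ARGV[0]);
}
```

Headers, footers and footnotes live in other parts of the zip, so a sketch like this never even sees them. For anything beyond a quick look at the body text, docx2txt is still the better bet.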
For what it’s worth, the XML format used in a DOCX file is thoroughly documented by Microsoft (it is standardized as Office Open XML, ECMA-376), and there are published schemas as well. (I rather think that some guv’mint along the way must have read them the “no more proprietary document formats for public documents” riot-act... which would have been a good thing. I know the Library of Congress has been very outspoken about that.)
I just noticed you wrote docx2txt "could be a starting point, but I am having some difficulties to figure out how to integrate it into my application."
It seems pretty straightforward to me, but here's what I did in case you need it:
I just kept docx2txt in a separate file out of laziness and just used
system ("perl \"$pathtodocx/docx2txt.pl\" \"${infile}.doc\" \"${outfile}.txt\"");
Now, my app needs to run on both *nix and Windows, and Windows compatibility required quite a bit of trickery. I modified the .pl slightly, bundled an unzip program with my application and wrote some code that generates a .config for docx2txt at runtime, to let docx2txt know where unzip.exe is. It's not what I would call elegant, but it works. If you need this to run on Windows computers other than your own, I can paste the code here.
{
use autodie qw' system ';
system $^X, "$pathtodocx/docx2txt.pl", "$infile.doc", "$outfile.txt";
}
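The point of the list form above is that the arguments go to the child process as-is, without a shell in between, so paths with spaces need no hand-written \" quoting. If you would rather not pull in autodie, something like the following sketch does the status checking by hand. The helper name run_or_die is made up, and the paths in the usage comment are the same placeholders as above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a command given as a list (program, then arguments) and die with a
# readable message if it could not be started or did not exit cleanly.
sub run_or_die {
    my @cmd = @_;
    my $status = system(@cmd);    # more than one element: no shell is involved
    if ($status == -1) {
        die "Failed to start '$cmd[0]': $!\n";
    }
    if ($status & 127) {
        die sprintf("'%s' died with signal %d\n", $cmd[0], $status & 127);
    }
    if ($status >> 8) {
        die sprintf("'%s' exited with status %d\n", $cmd[0], $status >> 8);
    }
    return 1;
}

# Usage, with the same placeholder variables as in the post above:
# run_or_die($^X, "$pathtodocx/docx2txt.pl", "$infile.doc", "$outfile.txt");
```

$^X is the perl binary that is running the current script, which saves you from depending on whatever "perl" happens to be first in the PATH.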
I am back from a few days of holiday ;) Oh, I would really appreciate it if you could post your solution!
I'm guessing that was directed towards me.
Apologies for not answering sooner, I just found this post now.
Anyway, here's the code. Keep in mind that I'm by no means an experienced coder. Also, this code was taken from a much larger project, so some of the silly variable names have a logic you can't see here, and some logging and reporting is left out. I have only tested this on Windows, but it's designed to work on most Linux distros and OS X as well:
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;
# IDENTIFY OS
my $OS;
if ($^O =~ /mswin/i) {$OS = "Windows";print "OS detected: Windows\n"}
elsif ($^O =~ /linux/i) {$OS = "Linux";print "OS detected: Linux\n"}
elsif ($^O =~ /darwin/i) {$OS = "Mac";print "OS detected: Mac OS X\n"}
else {
    print "\nUnable to detect OS type, choose your OS:\n\nWindows Any version of Microsoft Windows\nMac Any flavour of Mac OS X\nLinux Linux of some sort\n\n";
    do {
        chomp ($OS = <STDIN>);
        print "\nIncorrect OS type. Try again.\n\n" unless $OS eq "Windows" or $OS eq "Mac" or $OS eq "Linux";
    } until ($OS eq "Windows" or $OS eq "Mac" or $OS eq "Linux");
}
# IDENTIFY SCRIPT PATH
my $script = File::Spec->rel2abs( __FILE__ );
$script =~ /(.*)[\/\\](.*)/;
my $scriptpath = $1;
if (-d "$scriptpath/scripts/docx2txt") {
    # print "\nScript folder found.\n";# comment out
} else {
    do {
        print "\nThe script path found automatically (${scriptpath}) is not correct.\nPlease drag and drop the aligner script here and press enter. (If your OS doesn't support drag & drop, copy-paste the path here. You can paste by right clicking in the window or right clicking the icon in the top left corner of this window.)\n";
        chomp ($script = <STDIN>);
        $script =~ s/^\s+//;                         # strip leading whitespace
        $script =~ s/\s+$//;                         # strip trailing whitespace
        # quotes are optional: drag and drop adds them, a pasted path usually has none
        $scriptpath = ($script =~ /^[\"\']?(.*)[\/\\]([^\"\']*)[\"\']?$/) ? $1 : "";
        if (-d "$scriptpath/scripts/docx2txt") {print "\nScript folder identified correctly.\n"}
    } until (-d "$scriptpath/scripts/docx2txt");
}
# DRAG AND DROP INPUT FILE
my $file1_full;
print "\n\nDrag and drop your input file here and press enter.\n";
chomp ($file1_full = <STDIN>);
$file1_full =~ s/^\s+//; # strip leading whitespace
$file1_full =~ s/\s+$//;                      # strip trailing whitespace
$file1_full =~ /^[\"\']?(.*)[\/\\]([^\"\']*)[\"\']?$/;
my $folder = $1;
my $file1 = $2;
$file1 =~ /(.*)\.(.*)/;
my $f1 = $1;
my $ext = lc($2);
# CONVERT DOCX TO UTF-8 TXT
if ($OS eq "Windows") {
    # create config file, run docx2txt.exe modded to use win config file
    open (my $config_in, "<", "$scriptpath/scripts/docx2txt/docx2txt.config") or die "Can't open file: $!";
    # ">" truncates any config left over from a previous run
    open (my $config_out, ">", "$scriptpath/scripts/docx2txt/docx2txt_win.config") or die "Can't open file: $!";
    while (<$config_in>) {
        # point the unzip entry at the bundled unzip.exe
        s/^unzip *=>.*$/unzip => \'$scriptpath\\scripts\\docx2txt\\unzip\\unzip\.exe\',/;
        print $config_out $_;
    }
    close $config_in;
    close $config_out;
    system ("\"$scriptpath\\scripts\\docx2txt\\docx2txt_win.exe\" \"$folder/$file1\" \"$folder/${f1}.txt\"");
} else { # linux and mac both use the original docx2txt.pl and both have unzip at /usr/bin/unzip
    system ("perl \"$scriptpath/scripts/docx2txt/docx2txt.pl\" \"$folder/$file1\" \"$folder/${f1}.txt\"");
}
#work with the txt file from now on
$file1 = "${f1}.txt";
# CHECK FILE SIZE, ABORT IF 0
my $file_1_size = -s "$folder/$file1";
if (!$file_1_size) { # -s is false for both a missing and an empty file
    print "\n\nThe file conversion seems to have failed: the generated file is missing or empty. ABORTING.\n\n";
    sleep 3;
    die "Conversion failed\n";
}
# DONE
print "\n$file1 created ($file_1_size bytes).\nPress enter to quit.\n";
<STDIN>;
Now, this requires docx2txt.pl for *nix, and docx2txt.exe and unzip.exe on Windows. It looks for these in scripts/docx2txt. I have uploaded the necessary files here. Of course you can get your own unzip binary and generate docx2txt.exe yourself with pp, which is what I did, or just use the .pl on Windows as well if your users can be expected to have perl installed.
The file won't be up here for long, so here's a summary in case someone reads this when I've already yanked it:
Docx2txt needs to unzip the docx (zip) files. To make this work on Windows, I have modded the original perl script to use a different config file (docx2txt_win.config) which the main script generates at runtime, filling in the path to unzip.exe (scripts/docx2txt/unzip) according to what folder it's in. Then I generated an executable (docx2txt_win.exe) out of this slightly modified script.
On Linux and OS X systems, the original .pl is used without modifications as these OSes can reasonably be expected to have an unzip utility at /usr/bin/unzip.
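One more note on the script above: the hand-rolled regexes that split the dropped path into folder, file name and extension could probably be replaced with core modules. Here is a rough sketch using File::Basename and File::Spec. The sub name split_input_path and the hash keys are made up for illustration, and I have only thought this through for *nix-style paths.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Split a dragged-and-dropped path into its parts and work out the name
# of the .txt file that the conversion should produce next to it.
sub split_input_path {
    my ($raw) = @_;
    $raw =~ s/^\s+|\s+$//g;            # strip leading and trailing whitespace
    $raw =~ s/^(["'])(.*)\1$/$2/;      # strip one pair of matching quotes
    # the suffix pattern takes only the last ".something" as the extension
    my ($base, $folder, $suffix) = fileparse($raw, qr/\.[^.]*/);
    $suffix =~ s/^\.//;                # drop the dot
    return {
        folder => $folder,             # keeps its trailing separator
        base   => $base,
        ext    => lc $suffix,
        txt    => File::Spec->catfile($folder, "$base.txt"),
    };
}

# Usage, with a made-up path:
# my $parts = split_input_path('"/home/me/My Docs/report.docx"');
# print "will write $parts->{txt}\n";
```

fileparse picks its path rules from the OS perl is running on, so the same call should cope with backslashes on Windows, but that part is untested here.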