http://www.perlmonks.org?node_id=881233

welle has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need to access a .docx file and convert its content to txt. No fancy formatting needed. I am experiencing several issues with Win32::OLE (everything worked fine with .doc files and Word 2003), so I would like a pure Perl solution. For the same purpose with Excel I use Spreadsheet::ParseExcel.

Does anyone know of a script, module, or anything else that does this? Since docx files are zipped XML files, it should be possible to parse them, but it may not be that straightforward, and I don't want to reinvent the wheel. A Google search only turned up docx2txt (http://docx2txt.sourceforge.net/). It could be a starting point, but I am having difficulty figuring out how to integrate it into my application. Any advice would be great. Welle

Replies are listed 'Best First'.
Re: docx to txt
by Khen1950fx (Canon) on Jan 08, 2011 at 16:59 UTC
    See: Text::FromAny.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::FromAny;

    my $tFromAny = Text::FromAny->new(file => '/path/to/docx');
    print my $text = $tFromAny->text, "\n";
Re: docx to txt
by elef (Friar) on Jan 08, 2011 at 20:20 UTC
    There's an app for that.

    Err, I mean, there's a perl script for that.
    Here it is: http://sourceforge.net/projects/docx2txt/
    It's a fairly simple perl script that does just what you need. If you're not on *nix, you'll need to provide an unzip program; by default it just uses /usr/bin/unzip.

    Works great for me both on Win7 and Ubuntu.

    Thanks for the tip on Text::FromAny though, it looks intriguing. If you need a quick-and-dirty solution for all sorts of random formats, it sure is hard to beat. The CPAN page is silent about what it uses to decipher docx (or any other format except for PDF), so I'd stick with docx2txt if docx is all that's needed. Note that docx2txt is not perfect, so if your files contain hyperlinks, tables, headers, footers, footnotes etc., the results may not be exactly what you want - but then docx -> txt will always be lossy by definition.
Re: docx to txt
by sundialsvc4 (Abbot) on Jan 08, 2011 at 17:02 UTC

    For what it’s worth, the XML format used in a DOCX file is thoroughly documented by Microsoft (it is standardized as Office Open XML, ECMA-376), and XML Schema definitions are published as well. (I rather think that some guv’mint along the way must have read them the “no more proprietary document formats for public documents” riot-act... which would have been a good thing. I know the Library of Congress has been very outspoken about that.)
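    Because the format is documented and the body text lives in the word/document.xml member of the zip archive, a rough pure-Perl extraction is possible with core modules only. The following is a minimal sketch, not a full converter: the sub names docx_xml_to_text and docx_to_text are made up for illustration, the tag handling is regex-based rather than a real XML parse, and headers, footers, footnotes and numeric character references are ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Turn the raw XML of word/document.xml into plain text.
# Deliberately crude: paragraphs become lines, run-level tabs and
# line breaks are kept, every other tag is simply dropped.
sub docx_xml_to_text {
    my ($xml) = @_;
    $xml =~ s{</w:p>}{\n}g;     # end of paragraph -> newline
    $xml =~ s{<w:tab/>}{\t}g;   # tab inside a run -> tab character
    $xml =~ s{<w:br/>}{\n}g;    # plain line break -> newline
    $xml =~ s{<[^>]+>}{}g;      # strip all remaining tags
    # Decode the five predefined XML entities in a single pass.
    my %ent = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $xml =~ s{&(lt|gt|quot|apos|amp);}{$ent{$1}}g;
    return $xml;
}

# Read word/document.xml out of the .docx (a zip archive) and convert it.
sub docx_to_text {
    my ($file) = @_;
    my $xml;
    unzip($file => \$xml, Name => 'word/document.xml')
        or die "Cannot read word/document.xml from $file: $UnzipError\n";
    return docx_xml_to_text($xml);
}

# Usage: perl thisscript.pl input.docx > output.txt
print docx_to_text($ARGV[0]) if @ARGV;
```

    The output is whatever bytes the document contains (normally UTF-8). For anything beyond plain paragraphs, docx2txt or a proper XML parser is likely the safer choice.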

Re: docx to txt
by elef (Friar) on Jan 09, 2011 at 22:14 UTC
    I just noticed you wrote docx2txt "could be a starting point, but I am having some difficulties to figure out how to integrate it into my application."

    It seems pretty straightforward to me, but here's what I did in case you need it: I just kept docx2txt in a separate file out of laziness and just used
    system ("perl \"$pathtodocx/docx2txt.pl\" \"${infile}.doc\" \"${outfile}.txt\"");

    Now, my app needs to run on both *nix and Windows, and Windows compatibility required quite a bit of trickery. I modified the .pl slightly, bundled an unzip program with my application and wrote some code that generates a .config for docx2txt at runtime, to let docx2txt know where unzip.exe is. It's not what I would call elegant, but it works. If you need this to run on Windows computers other than your own, I can paste the code here.
      system, autodie
      {
          use autodie qw' system ';
          system $^X, "$pathtodocx/docx2txt.pl", "$infile.doc", "$outfile.txt";
      }
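      If autodie is not available, roughly the same safety can be had by hand: pass the command as a list so that no shell is involved (and paths with spaces need no quoting), then inspect the return value and $?. This is only a sketch; run_or_die is a hypothetical helper name, and the variables in the usage comment are the ones from the post above.

```perl
use strict;
use warnings;

# Run a command given as a list (bypassing the shell) and die with a
# readable message if it cannot be started or exits unsuccessfully.
sub run_or_die {
    my @cmd = @_;
    my $rc = system { $cmd[0] } @cmd;
    if ($rc == -1) {
        die "Could not start $cmd[0]: $!\n";
    }
    elsif ($? & 127) {
        die sprintf "%s died with signal %d\n", $cmd[0], $? & 127;
    }
    elsif ($? >> 8) {
        die sprintf "%s failed with exit status %d\n", $cmd[0], $? >> 8;
    }
    return 1;
}

# Usage, with the variables from the post above:
# run_or_die($^X, "$pathtodocx/docx2txt.pl", "$infile.doc", "$outfile.txt");
```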

      I am back from some days of holidays ;)

      Oh, I would really appreciate if you could post your solution!

        I'm guessing that was directed towards me.
        Apologies for not answering sooner, I just found this post now.

        Anyway, here's the code. Keep in mind that I'm by no means an experienced coder. Also, this code was taken from a much larger project, so some of the silly variable names follow a logic you can't see here, and I left out some logging and reporting. I have only tested this on Windows, but it's designed to work on most Linux distros and OS X as well:
        #!/usr/bin/perl
        use strict;
        use warnings;
        use File::Spec;

        # IDENTIFY OS
        my $OS;
        if    ($^O =~ /mswin/i)  {$OS = "Windows"; print "OS detected: Windows\n"}
        elsif ($^O =~ /linux/i)  {$OS = "Linux";   print "OS detected: Linux\n"}
        elsif ($^O =~ /darwin/i) {$OS = "Mac";     print "OS detected: Mac OS X\n"}
        else {
            print "\nUnable to detect OS type, choose your OS:\n\n",
                  "Windows    Any version of Microsoft Windows\n",
                  "Mac        Any flavour of Mac OS X\n",
                  "Linux      Linux of some sort\n\n";
            do {
                chomp ($OS = <STDIN>);
                print "\nIncorrect OS type. Try again.\n\n"
                    unless $OS eq "Windows" or $OS eq "Mac" or $OS eq "Linux";
            } until ($OS eq "Windows" or $OS eq "Mac" or $OS eq "Linux");
        }

        # IDENTIFY SCRIPT PATH
        my $script = File::Spec->rel2abs( __FILE__ );
        $script =~ /(.*)[\/|\\](.*)/;
        my $scriptpath = $1;

        if (-d "$scriptpath/scripts/docx2txt") {
            # print "\nScript folder found.\n";# comment out
        } else {
            do {
                print "\nThe script path found automatically (${scriptpath}) is not correct.\n",
                      "Please drag and drop the aligner script here and press enter. ",
                      "(If your OS doesn't support drag & drop, copy-paste the path here. \n",
                      "You can paste by right clicking in the window or right clicking ",
                      "the icon in the top left corner of this window.)\n";
                chomp ($script = <STDIN>);
                $script =~ / *[\"\'](.*)[\/\\](.*)[\"\'] */;
                $scriptpath = $1;
                $scriptpath =~ s/^\s+//;    # strip leading whitespace
                $scriptpath =~ s/\s+$//;    # strip trailing whitespace
                if (-e "$scriptpath/scripts/docx2txt") {print "\nScript folder identified correctly.\n"}
            } until (-e "$scriptpath/scripts/docx2txt");
        }

        # DRAG AND DROP INPUT FILE
        my $file1_full;
        print "\n\nDrag and drop your input file here and press enter.\n";
        chomp ($file1_full = <STDIN>);
        $file1_full =~ s/^\s+//;    # strip leading whitespace
        $file1_full =~ s/\s+$//;    # strip trailing whitespace
        $file1_full =~ /^[\"\']?(.*)[\/\\]([^\"\']*)[\"\']?$/;
        my $folder = $1;
        my $file1 = $2;
        $file1 =~ /(.*)\.(.*)/;
        my $f1 = $1;
        my $ext = lc($2);

        # CONVERT DOCX TO UTF-8 TXT
        if ($OS eq "Windows") {
            # create config file, run docx2txt.exe modded to use win config file
            open (DOCX2TXTCONFIG, "<", "$scriptpath/scripts/docx2txt/docx2txt.config") or die "Can't open file: $!";
            unlink "$scriptpath/scripts/docx2txt/docx2txt_win.config";
            open (DOCX2TXTCONFIG_WIN, ">>", "$scriptpath/scripts/docx2txt/docx2txt_win.config") or die "Can't open file: $!";
            while (<DOCX2TXTCONFIG>) {
                s/^unzip *=>.*$/unzip => \'$scriptpath\\scripts\\docx2txt\\unzip\\unzip\.exe\',/;
                print DOCX2TXTCONFIG_WIN $_;
            }
            close DOCX2TXTCONFIG;
            close DOCX2TXTCONFIG_WIN;
            system ("\"$scriptpath\\scripts\\docx2txt\\docx2txt_win.exe\" \"$folder/$file1\" \"$folder/${f1}.txt\"");
        }
        else {    # linux and mac both use the original docx2txt.pl and both have unzip at /usr/bin/unzip
            system ("perl \"$scriptpath/scripts/docx2txt/docx2txt.pl\" \"$folder/$file1\" \"$folder/${f1}.txt\"");
        }

        #work with the txt file from now on
        $file1 = "${f1}.txt";

        # CHECK FILE SIZE, ABORT IF 0
        # -s returns undef when the file does not exist, so fall back to 0
        my $file_1_size = (-s "$folder/$file1") || 0;
        if ($file_1_size == 0) {
            print "\n\nThe file conversion seems to have failed: the generated file is empty. \nABORTING.\n\n";
            sleep 3;
            die;
        }

        # DONE
        print "\n$file1 created ($file_1_size bytes).\nPress enter to quit.\n";
        <STDIN>;
        Now, this requires docx2txt.pl on *nix, and docx2txt.exe and unzip.exe on Windows. It looks for these in scripts/docx2txt; I have uploaded the necessary files here. Of course you can get your own unzip binary and generate docx2txt.exe yourself with pp, which is what I did, or just use the .pl on Windows as well if your users can be expected to have perl installed.
        The file won't be up here for long, so here's a summary in case someone reads this when I've already yanked it:
        Docx2txt needs to unzip the docx (zip) files. To make this work on Windows, I have modded the original perl script to use a different config file (docx2txt_win.config) which the main script generates at runtime, filling in the path to unzip.exe (scripts/docx2txt/unzip) according to what folder it's in. Then I generated an executable (docx2txt_win.exe) out of this slightly modified script. On Linux and OS X systems, the original .pl is used without modifications as these OSes can reasonably be expected to have an unzip utility at /usr/bin/unzip.
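        As a side note, the regex-based path splitting in the script could probably be handed to the core modules File::Basename and File::Spec instead. The sketch below is only an illustration of that idea, not part of the original project: split_input_path and txt_path_for are hypothetical names, and fileparse follows the path rules of the OS the script runs on.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Split a dragged-and-dropped path into folder, base name and
# lower-cased extension, after removing padding and quotes.
sub split_input_path {
    my ($path) = @_;
    $path =~ s/^\s+|\s+$//g;          # strip surrounding whitespace
    $path =~ s/^(["'])(.*)\1$/$2/;    # strip one pair of matching quotes
    my ($base, $dir, $ext) = fileparse($path, qr/\.[^.]*/);
    return ($dir, $base, lc $ext);
}

# Build the path of the .txt file that sits next to the input file.
sub txt_path_for {
    my ($path) = @_;
    my ($dir, $base) = split_input_path($path);
    return File::Spec->catfile($dir, "$base.txt");
}

# Usage: my $out = txt_path_for($file1_full);
```

        The returned extension includes the leading dot, so a check such as `$ext eq '.docx'` can be used before calling docx2txt.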