 
PerlMonks  

I'm guessing that was directed towards me.
Apologies for not answering sooner, I just found this post now.

Anyway, here's the code. Keep in mind that I'm by no means an experienced coder. Also, this code was taken from a much larger project, so some of the silly variable names follow a logic you can't see here, and some logging and reporting has been left out. I have only tested this on Windows, but it's designed to work on most Linux distros and OS X as well:
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# IDENTIFY OS
my $OS;
if    ($^O =~ /mswin/i)  { $OS = "Windows"; print "OS detected: Windows\n" }
elsif ($^O =~ /linux/i)  { $OS = "Linux";   print "OS detected: Linux\n" }
elsif ($^O =~ /darwin/i) { $OS = "Mac";     print "OS detected: Mac OS X\n" }
else {
    print "\nUnable to detect OS type, choose your OS:\n\n"
        . "Windows  Any version of Microsoft Windows\n"
        . "Mac      Any flavour of Mac OS X\n"
        . "Linux    Linux of some sort\n\n";
    do {
        chomp ($OS = <STDIN>);
        print "\nIncorrect OS type. Try again.\n\n"
            unless $OS eq "Windows" or $OS eq "Mac" or $OS eq "Linux";
    } until ($OS eq "Windows" or $OS eq "Mac" or $OS eq "Linux");
}

# IDENTIFY SCRIPT PATH
my $script = File::Spec->rel2abs( __FILE__ );
$script =~ /(.*)[\/\\](.*)/;    # split at the last slash or backslash
my $scriptpath = $1;

if (-d "$scriptpath/scripts/docx2txt") {
    # print "\nScript folder found.\n";   # debug output, disabled
} else {
    do {
        print "\nThe script path found automatically (${scriptpath}) is not correct.\n"
            . "Please drag and drop the aligner script here and press enter. "
            . "(If your OS doesn't support drag & drop, copy-paste the path here.\n"
            . "You can paste by right clicking in the window or right clicking "
            . "the icon in the top left corner of this window.)\n";
        chomp ($script = <STDIN>);
        $script =~ s/^\s+//;    # strip leading whitespace
        $script =~ s/\s+$//;    # strip trailing whitespace
        # Quotes are optional; only update the path if the input parses,
        # so a stale $1 from an earlier match is never reused.
        if ($script =~ /^[\"\']?(.*)[\/\\][^\"\']*[\"\']?$/) {
            $scriptpath = $1;
        }
        if (-e "$scriptpath/scripts/docx2txt") {print "\nScript folder identified correctly.\n"}
    } until (-e "$scriptpath/scripts/docx2txt");
}

# DRAG AND DROP INPUT FILE
my $file1_full;
print "\n\nDrag and drop your input file here and press enter.\n";
chomp ($file1_full = <STDIN>);
$file1_full =~ s/^\s+//;    # strip leading whitespace
$file1_full =~ s/\s+$//;    # strip trailing whitespace
$file1_full =~ /^[\"\']?(.*)[\/\\]([^\"\']*)[\"\']?$/
    or die "Can't parse the input path: $file1_full\n";
my $folder = $1;
my $file1 = $2;
$file1 =~ /(.*)\.(.*)/
    or die "The input file has no extension: $file1\n";
my $f1 = $1;
my $ext = lc($2);

# CONVERT DOCX TO UTF-8 TXT
if ($OS eq "Windows") {
    # create config file, run docx2txt.exe modded to use win config file
    open (my $config_in, "<", "$scriptpath/scripts/docx2txt/docx2txt.config") or die "Can't open file: $!";
    unlink "$scriptpath/scripts/docx2txt/docx2txt_win.config";
    open (my $config_out, ">>", "$scriptpath/scripts/docx2txt/docx2txt_win.config") or die "Can't open file: $!";
    while (<$config_in>) {
        # point the unzip entry at the bundled unzip.exe
        s/^unzip *=>.*$/unzip => \'$scriptpath\\scripts\\docx2txt\\unzip\\unzip\.exe\',/;
        print $config_out $_;
    }
    close $config_in;
    close $config_out;
    system ("\"$scriptpath\\scripts\\docx2txt\\docx2txt_win.exe\" \"$folder/$file1\" \"$folder/${f1}.txt\"");
} else {
    # linux and mac both use the original docx2txt.pl and both have unzip at /usr/bin/unzip
    system ("perl \"$scriptpath/scripts/docx2txt/docx2txt.pl\" \"$folder/$file1\" \"$folder/${f1}.txt\"");
}

# work with the txt file from now on
$file1 = "${f1}.txt";

# CHECK FILE SIZE, ABORT IF 0
my $file_1_size = -s "$folder/$file1";
if (!$file_1_size) {    # also true when the file is missing (-s returns undef)
    print "\n\nThe file conversion seems to have failed: the generated file is empty or missing.\nABORTING.\n\n";
    sleep 3;
    die;
}

# DONE
print "\n$file1 created ($file_1_size bytes).\nPress enter to quit.\n";
<STDIN>;
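As an aside, the script splits the dropped path into folder, file name and extension with hand-written regexes. A possible alternative, not what the script above does, is the core File::Basename module. The sketch below is only an illustration: the helper name split_dropped_path and its optional $fstype argument are made up here, and it assumes the dropped path may arrive wrapped in quotes and padded with whitespace.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse fileparse_set_fstype);

# Hypothetical helper: split a dragged-and-dropped path into
# (folder, base name, lower-cased extension).
# $fstype is optional and defaults to the current OS ($^O).
sub split_dropped_path {
    my ($raw, $fstype) = @_;
    $raw =~ s/^\s+|\s+$//g;              # trim surrounding whitespace
    $raw =~ s/^(["'])(.*)\1$/$2/;        # drop one pair of matching quotes

    # Temporarily switch File::Basename to the requested path syntax.
    my $previous = fileparse_set_fstype(defined $fstype ? $fstype : $^O);
    my ($base, $dir, $suffix) = fileparse($raw, qr/\.[^.]*/);
    fileparse_set_fstype($previous);

    $dir =~ s{[\\/]\z}{} if length($dir) > 1;   # no trailing separator
    $suffix =~ s/^\.//;                          # "docx", not ".docx"
    return ($dir, $base, lc $suffix);
}
```

With the 'MSWin32' syntax both / and \ count as separators, which is roughly what the regexes in the script accept. The extension is everything after the last dot, as in the original.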
Now, this requires docx2txt.pl on *nix, and docx2txt_win.exe and unzip.exe on Windows. The script looks for these in scripts/docx2txt; I have uploaded the necessary files here. Of course, you can get your own unzip binary and generate docx2txt_win.exe yourself with pp, which is what I did, or just use the .pl on Windows as well if your users can be expected to have Perl installed.
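If you go the "just use the .pl everywhere" route, one way to sidestep the manual quoting is to hand system a list instead of a single string. This is a rough sketch, not tested against docx2txt itself: the helper name docx2txt_command is invented for the example, and it assumes docx2txt.pl takes the input and output file as its two arguments, as in the script above. On Windows you would still have to point docx2txt's config at an unzip binary.

```perl
use strict;
use warnings;

# Hypothetical helper: build the docx2txt.pl call as a list so that
# paths containing spaces need no shell quoting.
# $^X is the path of the perl binary running the current script.
sub docx2txt_command {
    my ($scriptpath, $input, $output) = @_;
    return ($^X, "$scriptpath/scripts/docx2txt/docx2txt.pl", $input, $output);
}

# Usage (not run here):
#   my @cmd = docx2txt_command($scriptpath, "$folder/$file1", "$folder/${f1}.txt");
#   system(@cmd) == 0 or warn "docx2txt.pl exited with status $?\n";
```

Using $^X rather than a bare "perl" means the same interpreter that runs the main script also runs docx2txt.pl, even if perl is not on the PATH.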
The file won't be up here for long, so here's a summary in case someone reads this when I've already yanked it:
Docx2txt needs to unzip the docx (zip) files. To make this work on Windows, I have modded the original Perl script to use a different config file (docx2txt_win.config), which the main script generates at runtime, filling in the path to unzip.exe (scripts/docx2txt/unzip) according to what folder it's in. Then I generated an executable (docx2txt_win.exe) out of this slightly modified script. On Linux and OS X systems, the original .pl is used without modifications, as these OSes can reasonably be expected to have an unzip utility at /usr/bin/unzip.
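The config rewrite boils down to replacing one line. Here is a minimal, self-contained sketch of that step working on a string instead of files. The function name set_unzip_path and the other_option line in the usage comment are invented for illustration; the only thing taken from the script is that the config has a line starting with "unzip =>".

```perl
use strict;
use warnings;

# Hypothetical helper: return the config text with its "unzip => ..."
# line pointing at $unzip_exe. All other lines pass through unchanged.
# Assumes $unzip_exe contains no single quote.
sub set_unzip_path {
    my ($config_text, $unzip_exe) = @_;
    # /m lets ^ and $ match at every line of the text, not just its ends.
    $config_text =~ s/^unzip *=>.*$/unzip => '$unzip_exe',/m;
    return $config_text;
}

# Usage (the config content here is made up):
#   my $new = set_unzip_path("other_option => 1,\nunzip => '/usr/bin/unzip',\n",
#                            'C:\tools\unzip.exe');
```

Backslashes inside $unzip_exe are inserted as they are, because Perl interpolates the variable into the replacement without re-reading its contents for escapes.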

In reply to Re^3: docx to txt by elef
in thread docx to txt by welle
