PerlMonks  

Converting doc to txt without WIN32::OLE

by mrguy123 (Hermit)
on Jun 20, 2012 at 07:27 UTC ( [id://977242]=perlquestion )

mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks
I have about 30 Word documents that I need to parse, and I would like to convert them to text. I know the best solution is to use Win32::OLE, but since I am working on Unix this is a bit of a problem. There are solutions, but for now I am looking in other directions.
What I really need is a super simple Perl module that gets the text from a Word document and dumps it into a string. I tried using Text::Extract::Word, but it only worked on one doc out of 5.
So, any ideas?

There’s no point in being grown up if you can’t act a little childish sometimes

P.S: I need this solution primarily for doc and not docx documents

UPDATE: Installed the latest version of Text::Extract::Word (it appears that I was working with the legacy code for some reason). It now works! Thanks for all your help
Guy

Replies are listed 'Best First'.
Re: Converting doc to txt without WIN32::OLE
by Corion (Patriarch) on Jun 20, 2012 at 07:39 UTC

    If you don't want to use Win32::OLE, I would look at OpenOffice / LibreOffice for converting the doc file to something else. Searching for MSWord also shows some modules, but there is nothing like Spreadsheet::ParseExcel for Word.

    Simply scanning the file for strings will likely not work well if the strings are stored as UTF-16 or wide characters.
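    To illustrate the point about UTF-16, here is a minimal sketch. It assumes the text is stored as UTF-16LE and uses a made-up sample string rather than a real .doc file; the helper name ascii_runs is hypothetical. It scans for runs of printable ASCII bytes, the way a naive string scan would, and compares the result for UTF-8 and UTF-16LE encodings of the same text.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Return every run of 4 or more printable ASCII bytes,
# roughly what a naive "scan the file for strings" pass does.
sub ascii_runs {
    my ($bytes) = @_;
    return $bytes =~ /([\x20-\x7E]{4,})/g;
}

my $text  = 'Hello Monks';                # sample text (assumption)
my $utf8  = encode('UTF-8',    $text);
my $utf16 = encode('UTF-16LE', $text);    # each ASCII char is followed by \x00

my @from_utf8  = ascii_runs($utf8);
my @from_utf16 = ascii_runs($utf16);

printf "UTF-8 runs found:    %d\n", scalar @from_utf8;
printf "UTF-16LE runs found: %d\n", scalar @from_utf16;

# Decoding with the right encoding first makes the text visible again.
my $decoded = decode('UTF-16LE', $utf16);
print "Decoded: $decoded\n";
```

    The interleaved zero bytes break up every ASCII run, so the scan finds nothing in the UTF-16LE data until it has been decoded.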

      Hi Corion, thanks for the tip
      I will try some of the modules you suggested, maybe I will get lucky :)
      Open Office is also a good solution, but I need to see if there is a batch option for conversion before I go there
        You don't really need a "batch option" with OO/LO; just write your script so that it accepts multiple docs (from @ARGV or from a list in a file) and feeds their names/full paths to OO.

        UPDATE: PS. You can write offsite links here as "[http://www.foo/bar.htm]" and, for PM nodes, as "[id://977242]". The latter form is especially preferred, so that you don't inadvertently log out a reader who is following your node on, say, PM.org via a link to PM.com.

        See also Markup in the Monastery for less expensive (keystroke-wise) ways of marking paras.
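        A minimal sketch of the "feed the names from @ARGV to OO/LO" idea follows. It assumes a LibreOffice install with soffice on the PATH and relies on its headless --convert-to option; older OpenOffice builds may need a different approach. The helper name build_convert_cmd and the output directory "out" are made up for illustration.

```perl
use strict;
use warnings;

# Build the argument list for one headless LibreOffice run that
# converts all given documents to plain text in $outdir.
# Assumption: 'soffice' is on the PATH.
sub build_convert_cmd {
    my ($outdir, @docs) = @_;
    return (
        'soffice', '--headless',
        '--convert-to', 'txt:Text',
        '--outdir', $outdir,
        @docs,
    );
}

# Convert whatever was named on the command line; do nothing otherwise.
if (@ARGV) {
    my @cmd = build_convert_cmd('out', @ARGV);   # 'out' is a hypothetical directory
    system(@cmd) == 0
        or die "soffice conversion failed (exit status $?)\n";
}
```

        Saved under a name of your choice, it could be run as "perl doc2txt.pl *.doc", leaving one .txt file per input document in the output directory. Passing the command to system as a list avoids shell quoting problems with file names that contain spaces.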

Re: Converting doc to txt without WIN32::OLE
by Khen1950fx (Canon) on Jun 20, 2012 at 16:55 UTC

    I did some work on Text::Extract::Word last September. See: Re: Problem in Text::Extract::Word.

    I put two super simple scripts together for you that put the files into a string. For the test docs, I used the .doc files from the t/ directory of Text::Extract::Word:
    #!/usr/bin/perl -l
    # single file
    use strict qw/refs/;
    use warnings FATAL => 'all';
    use Text::Extract::Word;
    binmode STDOUT, ':encoding(UTF-8)';

    my $file      = '/root/Desktop/xls/test1.doc';
    my $extractor = Text::Extract::Word->new($file);
    my $string    = $extractor->get_text;
    print "$string";
    close STDOUT;

    #!/usr/bin/perl
    # multiple files
    BEGIN { $| = 1; }
    use autodie;
    use strict qw/refs/;
    use Text::Extract::Word;
    use warnings FATAL => 'all';
    binmode STDOUT, ':encoding(UTF-8)';

    my (@data) = qw(
        test1.doc test2.doc test3.doc
        test4.doc test5.doc test6.doc
    );

    foreach my $data (@data) {
        my $file = Text::Extract::Word->new($data);
        my $str  = $file->get_text;
        print "$str===>File done<===\n\n";
        sleep 2;
    }
    close STDOUT;
      Hi, thanks for the scripts
      Problem is that when I run them I get:
      Can't locate object method "new" via package "Text::Extract::Word" at monks.pl line 11.
      I guess that something went wrong with the installation, or that the module isn't stable. I will try to fix it. The legacy interface works, but not very well (some docs just aren't parsed):
      # legacy interface
      use Text::Extract::Word qw(get_all_text);
      my $text = get_all_text("test1.doc");
      OK, so it seems I am working with the legacy code for some reason. I am now installing the new version...hope it works better
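      One way to tell which interface an installed copy offers is to ask the package with can() before calling anything. The sketch below is only an illustration: the helper names pick_interface and extract_text are made up, and it assumes the legacy get_all_text function is visible in the Text::Extract::Word package, as the import line above suggests.

```perl
use strict;
use warnings;

# Report which interface a loaded package offers:
# 'object' for the new()/get_text style, 'legacy' for get_all_text().
sub pick_interface {
    my ($class) = @_;
    return 'object' if $class->can('new');
    return 'legacy' if $class->can('get_all_text');
    return 'none';
}

# Extract the text of one file with whichever interface is installed.
sub extract_text {
    my ($file) = @_;
    require Text::Extract::Word;    # loaded only when actually needed
    my $kind = pick_interface('Text::Extract::Word');
    return Text::Extract::Word->new($file)->get_text
        if $kind eq 'object';
    return Text::Extract::Word->can('get_all_text')->($file)
        if $kind eq 'legacy';
    die "Text::Extract::Word offers neither interface\n";
}

# Print the text of every file named on the command line.
if (@ARGV) {
    print extract_text($_) for @ARGV;
}
```

      If pick_interface reports 'legacy', that matches the "Can't locate object method" error above and points to an old copy of the module being found first in @INC.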
