PerlMonks  

Pull all text from msword document

by boat73 (Scribe)
on Jul 22, 2005 at 16:33 UTC (#477281=perlquestion)
boat73 has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise ones. I have a script that opens a Word document and saves it as a text file. This code works fine. Does anyone know of a way to capture the text from the Word document into an array without saving it to a text file first? The working code is below, and as always I am open to any and all criticism, both good and bad. Thanks in advance.
use Win32::OLE;
use constant wdCRLF           => 0;
use constant wdFormatText     => 2;
use constant wdOpenFormatAuto => 0;

$doc    = "c:\\temp\\test.doc";
$txtdoc = "$ENV{TEMP}\\reportmacro.txt";
$Win32::OLE::Warn = 3;

my $wd_object = (Win32::OLE->GetActiveObject('Word.Application')
    || Win32::OLE->new('Word.Application', 'Quit'));

##### MAKE WORD APP VISIBLE(1), NOT VISIBLE(0) ####
$wd_object->{Visible} = 1;

$wd_object->Documents->Open({
    FileName              => "$doc",
    ConfirmConversions    => 0,
    ReadOnly              => 0,
    AddToRecentFiles      => 0,
    PasswordDocument      => '',
    PasswordTemplate      => '',
    Revert                => 0,
    WritePasswordDocument => '',
    WritePasswordTemplate => '',
    Format                => wdOpenFormatAuto,
    XMLTransform          => ''});

$wd_object->ActiveDocument->SaveAs({
    FileName                => "$txtdoc",
    FileFormat              => wdFormatText,
    LockComments            => 0,
    password                => '',
    AddToRecentFiles        => 1,
    WritePassword           => '',
    ReadOnlyRecommended     => 0,
    EmbedTrueTypeFonts      => 0,
    SaveNativePictureFormat => 0,
    SaveFormsData           => 0,
    SaveAsAOCELetter        => 0,
    Encoding                => 1252,
    InsertLineBreaks        => 1,
    AllowSubstitutions      => 0,
    LineEnding              => wdCRLF});

$wd_object->ActiveDocument->Close();

Re: Pull all text from msword document
by davidrw (Prior) on Jul 22, 2005 at 16:52 UTC
    A few random thoughts:
    • add use strict; and use warnings;
    • I assume the SaveAs() method has to take a filename and not a filehandle. If it could take a filehandle, IO::Scalar would help you out.
    • If you do the equivalent of a "select all", what data type/structure do you get? Can you get plain text from that?
    • If you do a "select all" and "copy" to shove it into the clipboard, is it any easier to get the plain text from that? (This is probably not the ideal route.)
    • Using a temp file isn't the worst thing in the world. File::Temp will make it easier, too.
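The File::Temp suggestion could look roughly like the sketch below. It is only an illustration: `read_lines` is a hypothetical helper, and the `print` that writes two sample lines stands in for Word's SaveAs call so that the snippet runs without Word installed. In the original script you would close the handle first and pass `$txtdoc` as the FileName argument to SaveAs.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a text file into an array, one element per
# line, with the line endings (LF or CRLF) stripped.
sub read_lines {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    my @lines = <$in>;
    close $in;
    s/\r?\n\z// for @lines;
    return @lines;
}

# File::Temp picks a unique file name and deletes the file at program exit.
my ($fh, $txtdoc) = tempfile(SUFFIX => '.txt', UNLINK => 1);

# Stand-in for Word's SaveAs (assumption for this sketch): write sample text.
print {$fh} "first paragraph\nsecond paragraph\n";
close $fh or die "Cannot close $txtdoc: $!";

my @word = read_lines($txtdoc);
print "$_\n" for @word;
```

This still goes through the file system, but the script no longer has to choose a file name or clean up after itself.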
Re: Pull all text from msword document
by wfsp (Abbot) on Jul 22, 2005 at 16:57 UTC
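    #!/bin/perl5
    use strict;
    use warnings;
    use Win32::OLE qw(in);    # import in() explicitly for iterating OLE collections

    # Attach to a running Word instance that already has the document open.
    my $w = Win32::OLE->GetActiveObject('Word.Application')
        or die "Word is not running\n";
    my $d = $w->ActiveDocument;
    my $paras = $d->Paragraphs;

    my @word;
    foreach my $para (in $paras) {
        my $text = $para->Range->{text};
        chop $text;    # remove the trailing \r (paragraph mark)
        push @word, $text;
    }
    print "$_\n" for @word;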
      Perfect, thanks so much. I have also added warnings and strict to the code. Thanks to all those that responded.
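A possible variation on the paragraph loop, offered as a sketch rather than as what anyone in the thread used: Word's object model also exposes the whole document body as a single Range through the document's Content property, so the text can be fetched in one call and split on the carriage return that ends each paragraph. The helper names `split_paragraphs` and `active_document_text` are made up for this example, and a sample string stands in for the live document so the snippet runs on any platform.

```perl
use strict;
use warnings;

# Word ends every paragraph with a carriage return, so splitting on \r
# yields one array element per paragraph (trailing empty fields are dropped).
sub split_paragraphs {
    my ($text) = @_;
    return split /\r/, $text;
}

# Windows only: fetch the full text of the active document from a running
# Word instance. Win32::OLE is loaded lazily so the rest runs elsewhere too.
sub active_document_text {
    require Win32::OLE;
    my $word = Win32::OLE->GetActiveObject('Word.Application')
        or die "Word is not running\n";
    return $word->ActiveDocument->Content->{Text};
}

# Sample string standing in for active_document_text() (assumption).
my $text = "First paragraph\rSecond paragraph\r";
my @word = split_paragraphs($text);
print "$_\n" for @word;
```

With Word running and a document open, replacing the sample string with `active_document_text()` should give the same kind of array as the loop above, using a single OLE call instead of one per paragraph.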
Re: Pull all text from msword document
by socketdave (Curate) on Jul 22, 2005 at 16:59 UTC
    I've always used a program called antiword for this:

    http://www.winfield.demon.nl/

    I wonder how much trouble it would be to incorporate this work into a Perl module that could take this out of Win32::OLE territory...
