Pull all text from msword document

by boat73 (Scribe)
on Jul 22, 2005 at 16:33 UTC (#477281=perlquestion)
boat73 has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise ones. I have a script that opens a Word document and saves it as a text file. This code works fine. Does anyone know of a way to capture the text from the Word document into an array without saving it to a text file first? Below is the working code; as always, I am open to any and all criticism, both good and bad. Thanks in advance.
use Win32::OLE;
use constant wdCRLF           => 0;
use constant wdFormatText     => 2;
use constant wdOpenFormatAuto => 0;

$doc    = "c:\\temp\\test.doc";
$txtdoc = "$ENV{TEMP}\\reportmacro.txt";
$Win32::OLE::Warn = 3;

my $wd_object = (Win32::OLE->GetActiveObject('Word.Application')
    || Win32::OLE->new('Word.Application', 'Quit'));

##### MAKE WORD APP VISIBLE(1), NOT VISIBLE(0) ####
$wd_object->{Visible} = 1;

$wd_object->Documents->Open({
    FileName              => "$doc",
    ConfirmConversions    => 0,
    ReadOnly              => 0,
    AddToRecentFiles      => 0,
    PasswordDocument      => '',
    PasswordTemplate      => '',
    Revert                => 0,
    WritePasswordDocument => '',
    WritePasswordTemplate => '',
    Format                => wdOpenFormatAuto,
    XMLTransform          => '',
});

$wd_object->ActiveDocument->SaveAs({
    FileName                => "$txtdoc",
    FileFormat              => wdFormatText,
    LockComments            => 0,
    password                => '',
    AddToRecentFiles        => 1,
    WritePassword           => '',
    ReadOnlyRecommended     => 0,
    EmbedTrueTypeFonts      => 0,
    SaveNativePictureFormat => 0,
    SaveFormsData           => 0,
    SaveAsAOCELetter        => 0,
    Encoding                => 1252,
    InsertLineBreaks        => 1,
    AllowSubstitutions      => 0,
    LineEnding              => wdCRLF,
});

$wd_object->ActiveDocument->Close();

Re: Pull all text from msword document
by davidrw (Prior) on Jul 22, 2005 at 16:52 UTC
    A few random thoughts:
    • Add use strict; and use warnings;
    • I assume the SaveAs() method has to take a filename and not a filehandle. If it could take a filehandle, IO::Scalar would help you out.
    • If you do the equivalent of a "select all", what data type/structure do you get? Can you get plain text from that?
    • If you do a "select all" and "copy" to shove it into the clipboard, is it any easier to get the plain text from that? (This is probably not the ideal route.)
    • Using a temp file isn't the worst thing in the world, and File::Temp will make it easier, too.
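The File::Temp suggestion above might look roughly like the sketch below. It is not from the original post: the helper name read_lines and the sample text are made up for illustration, and the Word SaveAs() call is replaced by a plain print so the snippet runs without Word. In the real script you would close the handle and pass $tmp_name as the FileName argument to SaveAs() instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a text file into an array,
# one element per line, with the line endings removed.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close $fh;
    s/\r?\n\z// for @lines;    # handles both CRLF and LF endings
    return @lines;
}

# tempfile() creates the file and returns an open handle plus its name.
# UNLINK => 1 asks File::Temp to delete the file when the program exits.
my ($tmp_fh, $tmp_name) = tempfile(SUFFIX => '.txt', UNLINK => 1);

# Stand-in for Word writing the text export (assumption for this demo).
binmode $tmp_fh;
print {$tmp_fh} "first paragraph\r\nsecond paragraph\r\n";
close $tmp_fh;    # close before another program writes to or reads the file

my @word = read_lines($tmp_name);
print "$_\n" for @word;
```

This keeps the temp-file route but removes the hard-coded path and the cleanup worry.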
Re: Pull all text from msword document
by wfsp (Abbot) on Jul 22, 2005 at 16:57 UTC
    #!/bin/perl5
    use strict;
    use warnings;
    use Win32::OLE qw(in);    # in() is not exported by default

    my $w = Win32::OLE->GetActiveObject('Word.Application');
    my $d = $w->ActiveDocument;
    my $paras = $d->Paragraphs;

    my @word;
    foreach my $para (in $paras) {
        my $text = $para->Range->{text};
        chop $text;    # remove the trailing \r (paragraph mark)
        push @word, $text;
    }
    print "$_\n" for @word;
      Perfect, thanks so much. I have also added the warnings and strict to the code. Thanks to all those that responded.
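A possible variation on the paragraph loop above, offered as a sketch rather than as anything posted in the thread: Word's Document object has a Content range, so the whole text can be fetched in one OLE call and split on the "\r" paragraph marks in Perl. The names split_paragraphs and document_text are hypothetical helpers, and the sample string only stands in for a real document when no file is given on the command line. Tables and other special content may leave extra control characters in the text, which this sketch does not handle.

```perl
use strict;
use warnings;

# Hypothetical helper: split Word range text into paragraphs.
# Word terminates each paragraph with "\r".
sub split_paragraphs {
    my ($text) = @_;
    return () unless defined $text && length $text;
    $text =~ s/\r\z//;               # drop the final paragraph mark
    return split /\r/, $text, -1;    # -1 keeps trailing empty paragraphs
}

# Hypothetical helper: fetch all document text in one call.
# Needs Windows with Word installed, so Win32::OLE is loaded only here.
sub document_text {
    my ($path) = @_;
    require Win32::OLE;
    my $word = Win32::OLE->GetActiveObject('Word.Application')
        || Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: " . Win32::OLE->LastError();
    my $doc = $word->Documents->Open({ FileName => $path, ReadOnly => 1 })
        or die "Cannot open $path: " . Win32::OLE->LastError();
    my $text = $doc->Content->{Text};
    $doc->Close(0);                  # 0 = wdDoNotSaveChanges
    return $text;
}

# With a file argument, read from Word; otherwise use sample text.
my $text = @ARGV ? document_text($ARGV[0]) : "Title\r\rBody text\r";
my @word = split_paragraphs($text);
print "$_\n" for @word;
```

Splitting in Perl avoids one OLE round trip per paragraph, which may matter for long documents.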
Re: Pull all text from msword document
by socketdave (Curate) on Jul 22, 2005 at 16:59 UTC
    I've always used a program called antiword for this:

    http://www.winfield.demon.nl/

    I wonder how much trouble it would be to incorporate this work into a Perl module that could take this out of Win32::OLE territory...
