Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Re: (Get text of Word Document)going through a Win32 MSWORD doc

by buzzcutbuddha (Chaplain)
on Dec 20, 2001 at 22:28 UTC ( #133555=note: print w/ replies, xml ) Need Help??

in reply to going through a Win32 MSWORD doc

If what you meant by indexing is grabbing each word in a Word Document and creating an index of those words to find them later, the following will give you all of the words in a document. I'll let you focus on the indexing part. :)

#!/usr/bin/perl # general use directives use strict; use warnings; # project specific use directives # this comes with the standard ActiveState # distribution. You can also look for # a newer version with PPM use Win32::OLE; my $wd; # get the document # use the full path eval { $wd = Win32::OLE->GetObject('C:/pathto/document/foo.doc') }; die "Unable to load document\n" if $@; # all of the Word document data members I'm using # are explained in the MSDN documentation of the # external interfaces of a Word Document. # if you have MSDN, search for "Word OLE". # get the number of paragraphs my $paraCount = $wd->{Paragraphs}->Count; # set the counter my $foo = 0; my @words; while ($foo++ < $paraCount) { push @words, split /\s/, $wd->{Paragraphs}{$foo}{Range}{Text}; } #clean up at the end undef $wd;
That's how you get the words of a word document out and into an array. You may prefer a different data structure, but again, I'll leave that up to you! I hope this helps.

Comment on Re: (Get text of Word Document)going through a Win32 MSWORD doc
Download Code

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://133555]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2015-11-25 03:59 GMT
Find Nodes?
    Voting Booth?

    What would be the most significant thing to happen if a rope (or wire) tied the Earth and the Moon together?

    Results (670 votes), past polls