http://www.perlmonks.org?node_id=1019638

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day, Bros. I am preparing an index of a book manuscript done in Word (2010) and I want to write a script that will grab the text from each page separately. After some searching around I can only find code examples that do things like print the document, change margins, etc. Can anyone point me in the right direction? I've done quite a bit with OLE and Outlook and Excel, but I don't know the Word object model and would prefer to avoid climbing that learning curve if possible.

Replies are listed 'Best First'.
Re: Win32 OLE Word Get Page Text
by ww (Archbishop) on Feb 19, 2013 at 18:40 UTC

    One simple-minded, OTTOMH approach (unless MS Word's formatting is somehow important):

    1. in Word 2010, edit in an end_of_record (EOR) marker of any flavor you like at the end of each page, so long as it won't appear in the text.
    2. save the whole (edited) .doc as .txt
    3. read the .txt
    4. split on EOR to create one array per page (named for its page number).
    5. split each page's contents on spaces into a second-level array of individual words (named for the page from which they came)
    6. index to your heart's content...

    Of course, this may not be the most efficient approach, but it certainly avoids "climbing that (Word object model) curve."
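
Steps 3 to 5 above could be sketched in Perl roughly as follows. This is only an illustration under assumptions: the marker string `@@EOR@@`, the sub name `pages_to_words`, and the sample text are all hypothetical, and a hash keyed by page number stands in for the "array named for page number" idea.

```perl
use strict;
use warnings;

# Hypothetical marker typed into the document at the end of each page (step 1).
my $EOR = '@@EOR@@';

# Split exported text into pages on the marker, then each page into words.
# Returns a hash ref: page number => array ref of that page's words.
sub pages_to_words {
    my ($text, $marker) = @_;
    my %words_on;
    my $page = 0;
    for my $chunk (split /\Q$marker\E/, $text) {
        $page++;
        # split ' ' skips leading whitespace and splits on runs of whitespace
        $words_on{$page} = [ split ' ', $chunk ];
    }
    return \%words_on;
}

# Stand-in for the contents of the saved .txt file (steps 2 and 3).
my $sample = "Call me Ishmael. $EOR Some years ago $EOR never mind how long";

my $words_on = pages_to_words($sample, $EOR);
for my $page (sort { $a <=> $b } keys %$words_on) {
    printf "page %d: %s\n", $page, join ' ', @{ $words_on->{$page} };
}
```

For a real manuscript, `$sample` would be replaced by the slurped contents of the exported .txt file; punctuation still clings to the words here and would need stripping before indexing.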


    If you didn't program your executable by toggling in binary, it wasn't really programming!

Re: Win32 OLE Word Get Page Text
by nikosv (Deacon) on Feb 19, 2013 at 22:07 UTC
    Use the OLE Viewer to inspect the type libraries and get an understanding of the automation interfaces provided by the Word object.
    Using the OLE/COM Object Viewer
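
For the automation route, a minimal and untested sketch with Win32::OLE might look like the following. It assumes Windows with Word installed, and it relies on Word's predefined `\Page` bookmark, which covers the page containing the current selection. The sub name `page_texts` is hypothetical, the numeric constants are the values of the corresponding Word enumerations, and page boundaries depend on how Word repaginates the document on the machine running the script.

```perl
use strict;
use warnings;

# Values of the Word enumerations used below.
use constant {
    wdStatisticPages => 2,    # WdStatistic
    wdGoToPage       => 1,    # WdGoToItem
    wdGoToAbsolute   => 1,    # WdGoToDirection
};

# Returns a list with the text of each page of the document at $path.
# $path should be a full path, since Word resolves it, not Perl.
sub page_texts {
    my ($path) = @_;
    require Win32::OLE;       # loaded lazily; needs Windows and Word

    # 'Quit' is called on the Word object when it goes out of scope.
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: " . Win32::OLE->LastError;
    my $doc = $word->Documents->Open($path)
        or die "Cannot open $path: " . Win32::OLE->LastError;

    my $pages = $doc->ComputeStatistics(wdStatisticPages);
    my @text;
    for my $n (1 .. $pages) {
        # Move the selection to page $n, then read that page's range.
        $word->Selection->GoTo(wdGoToPage, wdGoToAbsolute, $n);
        push @text, $doc->Bookmarks->Item('\Page')->Range->Text;
    }
    $doc->Close(0);           # 0 = close without saving changes
    return @text;
}

# Only runs when a document path is given on the command line.
if (@ARGV) {
    my $n = 0;
    printf "--- page %d ---\n%s\n", ++$n, $_ for page_texts($ARGV[0]);
}
```

Each returned string could then be split into words in the same way as in the plain-text approach.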