PerlMonks  

Indexing of Word documents

by axiomcrs (Initiate)
on Jun 07, 2013 at 20:08 UTC ( [id://1037746]=perlquestion )

axiomcrs has asked for the wisdom of the Perl Monks concerning the following question:

I have about 1000 Word documents that are semi-actively modified using Word 2003. They are all text-based, i.e. no pictures, graphs or the like, and contain, among other pieces of data, names of people. I need to find each name and the page number it is on, and then create an index appended to the end of the concatenation of all the doc files. I have written a Perl script using Win32::OLE and have had some success, but Win32::OLE seems poorly documented and somewhat flaky. The script works more or less, but I can't seem to fix the last remaining bugs. I believe they are related to saving the file and to opening and closing the documents. Is there a better way of doing this? I would have preferred to keep the finished document as a Word document, but perhaps this is problematic. Would it be better to extract the text from the Word docs and convert it to a PDF file? I can't determine whether it is possible to find some text in a PDF file and get the page number the text is on using PDF::API2, CAM::PDF or PDF::Core. Also, might plain text be a better choice? I can provide the Perl script I have using Win32::OLE if it is of any help. Thanks.

Replies are listed 'Best First'.
Re: Indexing of Word documents
by davies (Prior) on Jun 08, 2013 at 11:26 UTC

    My understanding - and that's liable to be very wrong - is that a page number is a very ephemeral thing in Word and therefore not stored in the file. The issue is that it depends on the computer and the available printers. The available fonts and the selected paper size have a major effect on the pagination, so this is not decided by Word until print time. Therefore, any page number you get out of Word is going to be unreliable and my understanding is that, while it can be calculated by Word most of the time given a configuration, it won't store that information and therefore it can't be extracted from a different machine.

    Regards,

    John Davies

      Your assumption is correct ... and it is even more ephemeral since it also depends on the page dimensions.   The 8-1/2x11 paper that’s common in USA, versus the A4-or-whatever page in Europe, and so on ... your default margins ... even font size.   OLE is going to be necessary here ... or ...

      What about Visual Basic or VBA?   Seriously:   it’s built-in to Word anyway, and it can instantiate a Word.Document object, and ... heck, it even has regexes buried in there someplace.   Maybe it could (“if I on-ly had a brai-i-in...”)   ;-)   do the first-step of finding words and page numbers, writing that to a file for consumption by more-intelligent beings.

Re: Indexing of Word documents
by CountZero (Bishop) on Jun 08, 2013 at 07:55 UTC
    If you need to find the page number in the Word document, then exporting the Word files to PDF or text will not work, because you will have lost the Word "pages". So you are stuck with working with the Word file and Win32::OLE.

    This is exactly the reason I stopped working with Word and switched to LaTeX, where using Perl to add an index is trivially easy.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Indexing of Word documents
by Laurent_R (Canon) on Jun 07, 2013 at 21:12 UTC

    I have never used Win32::OLE, so I may be totally on the wrong track, but reading from a file and trying to write to the same file does not work properly most of the time. Perhaps all you need to do is write to a copy of each file (and then do the necessary housekeeping, deleting old files and renaming new ones, or maybe you want to do that before).
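The copy-then-rename idea above can be sketched in plain Perl. This is a minimal illustration, not the original poster's script: `edit_via_copy` and the `.work` suffix are hypothetical names, and the callback is where the actual Win32::OLE editing would go.

```perl
use strict;
use warnings;
use File::Copy qw(copy move);

# Hypothetical helper: run $edit on a working copy of $path and replace the
# original only after the edit has succeeded.
sub edit_via_copy {
    my ($path, $edit) = @_;
    my $work = "$path.work";    # assumed naming scheme for the working copy

    copy($path, $work) or die "copy $path -> $work failed: $!";

    # Run the edit on the working copy; remove the copy if the edit dies.
    my $ok = eval { $edit->($work); 1 };
    unless ($ok) {
        my $err = $@ || 'unknown error';
        unlink $work;
        die "edit failed, original left untouched: $err";
    }

    # Housekeeping: the finished copy takes the place of the original.
    move($work, $path) or die "move $work -> $path failed: $!";
    return $path;
}
```

With this shape, a failed save or a crash in the middle of an edit leaves the original document as it was, which may make the remaining open/save/close bugs easier to isolate.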

Re: Indexing of Word documents
by flexvault (Monsignor) on Jun 08, 2013 at 18:45 UTC

    axiomcrs,

    On *nix, I have written indexing software to build book indexes. Author sends a manuscript in Word. I make sure the header or footer has the page number in it. I then export that to 'PDF' format. I then use the *nix command 'pdftohtml' to generate a large html file which includes the page number for the expected page size of the book.

    Maybe a variation of this will generate what you need.
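One possible variation, sketched under assumptions: poppler's `pdftohtml` has an `-xml` mode that wraps each page in a `<page number="N">` element, so the page number can be read from the markup rather than from a header or footer. The sample string below is hypothetical and only imitates that shape; `text_by_page` is an illustrative name, and the regex parsing is there to keep the sketch dependency-free (a real XML parser would be more robust, and entities such as `&amp;` are not decoded here).

```perl
use strict;
use warnings;

# Return { page_number => "text of that page" } from pdftohtml -xml style output.
sub text_by_page {
    my ($xml) = @_;
    my %page_text;
    while ($xml =~ m{<page\s+number="(\d+)"[^>]*>(.*?)</page>}gs) {
        my ($number, $body) = ($1, $2);
        my @lines;
        while ($body =~ m{<text\b[^>]*>(.*?)</text>}gs) {
            my $line = $1;
            $line =~ s/<[^>]+>//g;    # drop inline markup such as <b> or <i>
            push @lines, $line;
        }
        $page_text{$number} = join ' ', @lines;
    }
    return \%page_text;
}

# Hypothetical sample imitating the shape of `pdftohtml -xml` output.
my $sample = <<'XML';
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="80" width="300" height="16" font="0">Bob Jones arrived in 1901.</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="80" width="300" height="16" font="0">Later, <b>Mary Smith</b> wrote to Bob Jones.</text>
</page>
</pdf2xml>
XML

my $pages = text_by_page($sample);
for my $n (sort { $a <=> $b } keys %$pages) {
    print "page $n: $pages->{$n}\n" if $pages->{$n} =~ /Bob Jones/;
}
```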

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

Re: Indexing of Word documents
by rpnoble419 (Pilgrim) on Jun 10, 2013 at 04:31 UTC

    Since you are indexing the data from Word, the exact page number does not matter, for the reasons stated by davies. A better solution is to index the document by paragraph, using a paragraph counter as the reference for each key word. The number of paragraphs remains the same regardless of how the document reflows; only an edit to the document can change the paragraph count.

      Thanks for all the suggestions. Here are some more details. This script is to create an index for a book. The Word files will only reside on one computer, so the issues with changing computers and different printers go away. Using paragraphs does not work, since any paragraph could span two pages, and then the page number associated with a name would be wrong. I am not forced to do this with Word, so changing to PDF could be an option, since an index for a book can be provided with a PDF file. I was asking about using PDF, but could not determine whether page numbers are associated with the text. For instance, if I search for bob jones in the PDF file, is there metadata that tells what page number that name appears on?
        hey flexvault, does the pdftohtml program give page numbers as metadata for the text?
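On tying a name to a page once the text is out of the PDF: poppler's `pdftotext` separates pages with a form feed character (`\f`), so the page number is simply the position of the chunk the name falls in. The sketch below assumes text of that kind; `build_index`, `format_index` and the three-page sample are hypothetical, for illustration only.

```perl
use strict;
use warnings;

# Build { name => [page numbers] } from text whose pages are separated by
# form feeds (\f), as in pdftotext output.
sub build_index {
    my ($text, @names) = @_;
    my @pages = split /\f/, $text;
    my %index;
    for my $i (0 .. $#pages) {
        # Collapse whitespace so a name broken across lines still matches.
        (my $flat = $pages[$i]) =~ s/\s+/ /g;
        for my $name (@names) {
            push @{ $index{$name} }, $i + 1 if $flat =~ /\b\Q$name\E\b/i;
        }
    }
    return \%index;
}

# Format entries as "Name: 1, 3", sorted by name.
sub format_index {
    my ($index) = @_;
    return map { "$_: " . join(', ', @{ $index->{$_} }) } sort keys %$index;
}

# Hypothetical three-page sample.
my $text = "Bob Jones met\nMary Smith.\fNothing here.\fbob\njones returned.";
print "$_\n" for format_index(build_index($text, 'Bob Jones', 'Mary Smith'));
```

Because the lookup is per page rather than per paragraph, a paragraph that runs across two pages is not a problem. A name that is itself split across a page break would be missed by this sketch and would need separate handling.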
