PerlMonks  

A copyeditor needs help to get started with a Perl project

by wordsmith (Acolyte)
on Nov 04, 2004 at 08:40 UTC ( [id://405113]=perlquestion )

wordsmith has asked for the wisdom of the Perl Monks concerning the following question:

I'm an ex-IT person who has recently switched to copyediting STM books for a living. I would like to use Perl to automate the more mechanical aspects of my work. The manuscript files come in MS Word format, and I want to generate a report that will extract all hyphenated terms, capitalized phrases, acronyms (expanded and unexpanded), etc.

Conflicting terms (e.g., the same term could be uppercase in one chapter and lowercase in another) would also be identified in the report. Also, the utility should be able to flag all the terms in the manuscript file that appear in another text file containing keywords input by me. One of the tasks of the copyeditor is to make sure the book is consistent in the way terms appear in different chapters.

I'm hoping to write a utility program in Perl that will generate such a report and help me make the manuscript consistent. A few questions:

  1. I'm assuming Perl is the right tool for such a utility. Stupid question, still would appreciate confirmation from an expert.

  2. Do any readymade libraries exist for such tasks? I'd prefer to write the code myself because I could tweak it to suit my private perversions; still, it would be nice to know.

  3. Can I search MS Word files directly using Perl or do I need to save the Word file as a text file? If I could work directly with the Word file, the added functionality of searching headers, tables, etc., would be a great plus.

In passing, is the MS Word format a state secret?

My Perl level, you ask? Tyro, sir, tyro. Just downloaded ActiveState 5.8, have bookmarked an online book, have a Perl primer in my drawer, and am looking forward to having fun. I'm just rarin' to go. I have done a fair amount of programming in the past and am not afraid of writing code.

Thanks in advance!

Wordsmith

Janitored by Arunbear - added <p> tags for readability


Replies are listed 'Best First'.
Re: A copyeditor needs help to get started with a Perl project
by demerphq (Chancellor) on Nov 04, 2004 at 09:00 UTC

    While Perl is a very good tool for these types of jobs, I wonder if in this case VB macros aren't more suitable. I should think that by interacting with Word's object model you could get the Word engine itself to do many of these tasks. Having said that, if you do go with Perl, I would probably end up converting the Word files into a more suitable text-only form (Word documents are actually binary objects with lots of stuff in them) and then work from there.

    Possibly a hybrid approach would be to make a macro in VB that automatically extracts each chapter to a text file, and then you could have a Perl script that operates on the result. There is also the option of talking to Word's COM interface from Perl, but I wouldn't bother until I was familiar with doing it in VB. The whole VBA editor/design process can be quite illuminating as to how the object model works, with reasonably good documentation and support such as a visual debugger and automatic method/property selection. Recording macros and then studying the generated code is a good way to learn. Once you have working VB code, it's not too difficult to translate it to Perl via the Win32 modules.

    Altogether, I imagine the difficulty will depend on how automated you want the process to be. If you simply save the documents as text and then have a script that does the various jobs you describe, you avoid a lot of the complexity of interfacing with the Word document/engine. As your Perl skills improve, you could look at automating the process.
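For the "save as text, then script it" route, here is a minimal sketch of what a first pass might look like. It is only an illustration, not anything posted in this thread: the function name `scan_text`, the sample sentence, and the regexes that decide what counts as a "term" or an "acronym" are all my own assumptions and will need tuning for real manuscripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect hyphenated terms, acronyms, and case-conflicting spellings
# from a block of plain text (e.g. a chapter saved from Word as .txt).
# scan_text is a hypothetical helper name, not from any module.
sub scan_text {
    my ($text) = @_;
    my ( %hyphenated, %acronym, %forms );

    # Assumption: a "term" is a run of letters/digits, optionally
    # joined by hyphens.
    while ( $text =~ /\b([A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*)\b/g ) {
        my $term = $1;
        $hyphenated{$term}++ if $term =~ /-/;
        # Assumption: an acronym is 2+ capitals with an optional plural "s".
        $acronym{$term}++ if $term =~ /^[A-Z]{2,}s?$/;
        # Group spellings that differ only in case.
        $forms{ lc $term }{$term}++;
    }

    # Keep only the groups seen with more than one capitalization.
    my %conflict;
    for my $key ( keys %forms ) {
        my @seen = sort keys %{ $forms{$key} };
        $conflict{$key} = \@seen if @seen > 1;
    }
    return ( \%hyphenated, \%acronym, \%conflict );
}

my $sample = "A key-stream cipher hides data. Each Firewall logs packets, "
           . "and every firewall rule is audited. TCP and UDP differ.";
my ( $hyph, $acro, $conf ) = scan_text($sample);

print "Hyphenated: @{[ sort keys %$hyph ]}\n";
print "Acronyms:   @{[ sort keys %$acro ]}\n";
print "Conflicts:  $_ => @{ $conf->{$_} }\n" for sort keys %$conf;
```

One caveat: a word that is capitalized only because it starts a sentence will show up as a "conflict" with its lowercase form, so a real report would probably want to skip or down-rank sentence-initial words.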

    BTW, as an editor you may like to know that your node was put up for editorial/janitorial consideration to add some markup to make your node more readable. Normally you should use P tags like: <p>blah blah</p> to break up your nodes. Reading a long paragraph like yours is difficult on a screen. There are markup tips and links underneath where you can post, please review them. ;-)

    Cheers,

    ---
    demerphq

      Thanks, everyone, for all your suggestions. I think I'll save as text and unleash Perl on the text files for starters. Later, I'll try and wade into the more complex part.

      Yes, your comments on the poor formatting of my post were on target. I'll use the HTML tags from now on.

Re: A copyeditor needs help to get started with a Perl project
by tachyon (Chancellor) on Nov 04, 2004 at 09:33 UTC

    Working with Word documents directly from Perl *can* be done. Working with them from VB is possibly easier. You will basically be converting VB examples to Win32::OLE in Perl if you wish to run with Perl. If you can process the manuscripts as text files then Perl is definitely the weapon of choice. Here are a few Win32::OLE examples that work with Word files. I just ripped them out of the toolbox so you will need to fiddle with them. They worked happily in their native environment ;-)
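As a rough idea of what the Win32::OLE route looks like, here is a small sketch. It is not the toolbox code mentioned above, and I have not run it against any particular Word version. `Win32::OLE->new`, `Documents->Open`, `Content->Text` and `Close` are the usual Word automation names as far as I know; the helper names `normalize_word_text` and `word_body_text` are made up for this example. It assumes Windows with Word installed, and that Word hands back paragraph ends as "\r" and table cell ends as "\r\x07".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: Word returns paragraph ends as "\r" and marks the end of
# each table cell with "\x07". Convert to plain newline-separated text.
sub normalize_word_text {
    my ($text) = @_;
    $text =~ s/\x07//g;       # drop end-of-cell markers
    $text =~ s/\r\n?/\n/g;    # paragraph marks -> newlines
    return $text;
}

# Pull the main body text out of a Word file via Word's COM interface.
# Windows only: needs Win32::OLE and an installed copy of Word.
# Pass an absolute path; Word does not resolve paths relative to the script.
sub word_body_text {
    my ($path) = @_;
    require Win32::OLE;       # loaded lazily so the file compiles elsewhere
    my $word = Win32::OLE->new( 'Word.Application', 'Quit' )
        or die "Cannot start Word: " . Win32::OLE->LastError();
    my $doc = $word->Documents->Open($path)
        or die "Cannot open $path: " . Win32::OLE->LastError();
    my $text = $doc->Content->Text;
    $doc->Close(0);           # 0 = close without saving changes
    return normalize_word_text($text);
}

# Only attempt the COM call when given a file name on Windows.
if ( @ARGV && $^O eq 'MSWin32' ) {
    print word_body_text( $ARGV[0] );
}
```

Note that `Content` covers the main document body only; headers, footers and footnotes live in other parts of the object model and would need separate handling.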

Re: A copyeditor needs help to get started with a Perl project
by BrowserUk (Patriarch) on Nov 04, 2004 at 16:25 UTC

    Personally, I'd stick to dealing with the text as text. The format of Word documents is horribly complex, badly documented (if at all), and subject to variation from version to version of Word. It also contains a heap of stuff that has nothing to do with the content or even its markup. (I should also add that I don't use Word for anything. WordPad is as much of a word processor as I will ever need.)

    As a first pass, I would

    1. Break the text into terms (words; but including things like leading or trailing quotes, embedded hyphens and apostrophes etc.).
    2. Store the terms in a hash as the key, and sufficient context to allow for easy searching.
    3. Sort the terms case insensitively, (retaining case) so that 'This' and 'this' (and 'tihs') sort adjacently.
    4. Flag terms not found in the dictionary. (optionally, also flag those in your word list--not done yet).

      Present the terms (+flags +counts) as a scrolling list.

    5. Allow inspection of the contexts for selected words.
    6. Copy the context to the clipboard for easy searching in Word or other editor.

    And here's some crude, incomplete code to do most of that. Fixing it up to use a Win32::Console or one of the GUI interfaces etc. is left as an exercise :)

    Like I said, crude, but a simple, effective?, possible approach.
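The original listing is not reproduced here. As an independent illustration of steps 1 to 4 above (not BrowserUk's code), a sketch might look like the following. The names `index_terms` and `report`, the flag characters, and the tiny in-line dictionary and watch list are all assumptions made for the example. Unlike step 1 as described, this version trims leading and trailing punctuation from each term to keep the index small.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an index: term => list of "line N: text" contexts.
# index_terms is a hypothetical helper name.
sub index_terms {
    my (@lines) = @_;
    my %index;
    my $n = 0;
    for my $line (@lines) {
        $n++;
        # Split on whitespace, then trim punctuation from both ends,
        # keeping embedded hyphens and apostrophes.
        for my $term ( split ' ', $line ) {
            $term =~ s/^[^A-Za-z0-9]+//;
            $term =~ s/[^A-Za-z0-9]+$//;
            next unless length $term;
            push @{ $index{$term} }, "line $n: $line";
        }
    }
    return \%index;
}

# Produce report rows sorted case-insensitively (case retained), flagged
# '?' when the term is not in the dictionary and '!' when it is on the
# watch list. Both lookups are hashes keyed by lowercase word.
sub report {
    my ( $index, $dict, $watch ) = @_;
    my @rows;
    for my $term ( sort { lc $a cmp lc $b or $a cmp $b } keys %$index ) {
        my $flags = '';
        $flags .= '?' unless $dict->{ lc $term };
        $flags .= '!' if $watch->{ lc $term };
        push @rows, sprintf '%-2s %-12s %d', $flags, $term,
            scalar @{ $index->{$term} };
    }
    return @rows;
}

my @text = (
    "This firewall drops packets.",
    "Restart the Firewall if tihs fails.",
);
my %dict  = map { $_ => 1 } qw(this firewall drops packets restart the if fails);
my %watch = ( firewall => 1 );

my $index = index_terms(@text);
print "$_\n" for report( $index, \%dict, \%watch );
print "  $_\n" for @{ $index->{tihs} };    # inspect contexts for one term
```

In a real run the dictionary would come from a word-list file and the watch list from the copyeditor's own keyword file, one term per line, both lowercased on the way in.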


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

    Janitored by Arunbear - added readmore tags, as per Monastery guidelines

Re: A copyeditor needs help to get started with a Perl project
by gaal (Parson) on Nov 04, 2004 at 09:42 UTC
    FWIW, if you prefer Perl to VB, you can save your data as HTML in Word, edit that, and reread it. Word reads its own HTML very well, for all I know with no loss of information at all.

      Word reads its own HTML very well

      That is a very good thought. As horrid as Word HTML is to the naked eye, an HTML parser such as HTML::Parser should let you whip through it with ease, editing the text but leaving the puke vomit markup formatting. Then, as you say, let Word convert its own excreta back into native format. This conversion is essentially just padding with huge numbers of null bytes for every real character, thus 'Hello World!' as a text file is 13 bytes but in .DOC format it needs a mere 19,456 :-)

        *shrug*

        A VB runtime sounds like it's going to be bigger than either :)

        (BTW, some versions of Word had a bug where the first time you saved a file after an edit, its size would be about double what it'd be if you immediately saved it again. Or did this happen only when the file was saved as RTF?)

        tachyon's phrasing is admirable -- wish I'd said it first! (and read 405192 et seq, as well -- inserted 1700GMT) -- but sparks a small question.

        Based on wordsmith's original description, won't cleaning up the "puke" "vomit" bovine manure (ok, "markup" or "formatting") be almost as important to the desired "consistency" as emending the actual text? The boldfacing, font changes, etc. in the original .doc may be formatting conventions for which consistency is also desired.

        Also, I'm curious (enough so to read/experiment soon, unless some oracle here knows with certainty) whether an HTML parser can actually make sense of the many conditionals Word inserts while saving as .html garbage.

      Working with the HTML version of the Word file is an interesting idea. By the way, I just need to read the file; writing will be to a new report file. The original Word file will remain untouched. I will enter the corrections manually. The "automation" in my original post only refers to the process of identifying inconsistencies in the document.

      But can Perl recognize document elements such as headers, fonts, superscripts, tables, etc., in the HTML file? Let me give you a sample real-life scenario. Let us say a chapter has 100 numbered reference items at the end of the chapter, which are cited in text by superscripted integers. The problem: generate a report that will identify the reference items that have not been cited in the text.

      To accomplish this, Perl would have to recognize superscripted elements in the HTML file. Pardon my ignorance, but can it?

        Yes, it can.

        I don't have sample data at hand right now, so I can't give you the exact example, but if you inserted the footnote (or endnote) with the standard Insert Footnote menu item, the integer is well tagged with something like "<span class="footnoteReference">....</span>" tags. If you're on a Windows machine, just save a demo document as HTML and look at the result with a text editor, it should be clear enough. Don't be intimidated by the several KBs of CSS in the beginning :)
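To make the uncited-references scenario concrete, here is a small sketch. It assumes the citations end up in the saved HTML as plain `<sup>...</sup>` elements containing integers; as noted above, the real markup may differ (for example a span with a footnote class), so check a sample file first. It also uses a regex instead of a proper HTML parser to stay within core Perl, which is fine for a quick check but fragile on messy markup. The function name `uncited_references` is made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the reference numbers (1..$total) that are never cited in a
# superscript. Assumption: citations look like <sup>12</sup> or
# <sup>3,4</sup>, possibly with attributes and nested tags.
sub uncited_references {
    my ( $html, $total ) = @_;
    my %cited;
    while ( $html =~ m{<sup\b[^>]*>(.*?)</sup>}gis ) {
        my $inner = $1;
        $inner =~ s/<[^>]+>//g;                   # drop nested tags
        $cited{$_} = 1 for $inner =~ /(\d+)/g;    # every integer counts
    }
    return grep { !$cited{$_} } 1 .. $total;
}

my $html = 'Shown earlier<sup>1</sup> and again'
         . '<sup><span class="x">3,4</span></sup>.';
my @missing = uncited_references( $html, 5 );
print "Uncited references: @missing\n";
```

A range written as "3-5" would be read as citing only 3 and 5, so ranges need extra handling if the book uses them. Superscripts used for exponents or ordinals would also be counted as citations.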

Re: A copyeditor needs help to get started with a Perl project
by perlcapt (Pilgrim) on Nov 04, 2004 at 23:08 UTC
    I think that MS has very good tools for XML generation and editing. Isn't there a utility for going between Word and XML? XML has the depth of the book organization but allows easy manipulation by Perl and many other tools.

    I'm talking off the top of my head here, and would be hard put to give a demonstration, but this is something I would investigate if I were to do another book or more book-scale editing.

    What are STM books?

    Update:

    I found this commercial suite of products for doing the conversions between Word and XML.
    perlcapt
    -ben

      Yes, I really should pick up XML because I'm in the publishing industry. I can see 3 milestones for my project:

      (1) Perl with text files.

      (2) Perl with HTML files for the greater functionality of being able to handle Word document elements such as superscripts, fonts, etc. (Or bite the bullet and do it with VB if this approach fails to work).

      (3) Perl with XML. At this point, I should have a marketable product and the big bucks should start flowing in. :-)

      What are STM books? STM stands for "scientific, technical, and medical." But it's the IT books that drive us up the wall with the jargon, acronyms, and terms uppercased or not depending on the author's whims. My Perl project is primarily directed toward taming IT books. Most sciences have fairly stable conventions regarding nomenclature, but not IT methinks.

      Would you say "I bought two mouses from the store" or "I bought two mice from the store"? And which would you choose: keystream, key stream, or key-stream? We had one book where it appeared all three ways.
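The keystream / key stream / key-stream case can be caught mechanically by comparing terms with hyphens and spaces squeezed out. The sketch below is only one way to do it, and the name `spelling_variants` is made up for the example. It treats a two-word open form as a variant only if the same letters also occur somewhere as a closed or hyphenated word, which keeps the noise down but is still an assumption that will produce the odd false hit (for instance "a part" against "apart", or word pairs that straddle a sentence boundary).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group closed, hyphenated, and open spellings of the same compound.
# Returns a hash ref: squeezed lowercase key => sorted list of forms seen.
sub spelling_variants {
    my ($text) = @_;
    my @words = $text =~ /([A-Za-z]+(?:-[A-Za-z]+)*)/g;

    # Pass 1: closed and hyphenated forms, keyed by their letters only.
    my %forms;
    for my $w (@words) {
        ( my $key = lc $w ) =~ tr/-//d;
        $forms{$key}{ lc $w }++;
    }

    # Pass 2: an open form ("key stream") only counts if the same letters
    # were also seen as a single closed or hyphenated word.
    for my $i ( 0 .. $#words - 1 ) {
        my ( $w1, $w2 ) = @words[ $i, $i + 1 ];
        next if "$w1$w2" =~ /-/;
        my $key = lc "$w1$w2";
        $forms{$key}{ lc "$w1 $w2" }++ if exists $forms{$key};
    }

    # Keep only the keys that were written in more than one way.
    my %variants;
    for my $key ( keys %forms ) {
        my @seen = sort keys %{ $forms{$key} };
        $variants{$key} = \@seen if @seen > 1;
    }
    return \%variants;
}

my $text = "The keystream is secret. Each key stream is unique, "
         . "so a key-stream must never repeat.";
my $variants = spelling_variants($text);
for my $key ( sort keys %$variants ) {
    print "$key: ", join( ' / ', @{ $variants->{$key} } ), "\n";
}
```

This says nothing about mouses versus mice; irregular plurals would need a word list of preferred forms rather than pattern matching.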

      By the way, are there no well-known freeware Word-to-XML converters?

      Thanks, everybody, for all the help.

        I can't find BYTE Magazine's style guide which has all of these issues pinned down after thousands of man-hours debating them, but I did find two links that are probably even better:

        I have a personal/professional interest in the .doc -> XML subject and am pursuing it as a result of your comments. I'll post an update to this message when I have found out more about it. It appears as though there is a Word plugin or enhancement for going between XML and Word documents.

        Update:

        Here is the Microsoft Toolbox for Word/XML. I'll try it out and let you know.
        perlcapt
        -ben

Node Type: perlquestion [id://405113]
Approved by Tomte
Front-paged by diotalevi