A copyeditor needs help to get started with a Perl project

wordsmith has asked for the wisdom of the Perl Monks concerning the following question:

Re: A copyeditor needs help to get started with a Perl project by demerphq (Chancellor) on Nov 04, 2004 at 09:00 UTC
While Perl is a very good tool for these types of jobs, I wonder if in this case VB macros aren't more suitable. I should think that by interacting with Word's object model you could get the Word engine itself to do many of these tasks.

Having said that, if you do go with Perl, I would probably end up converting the Word files into a more suitable text-only form (Word documents are actually binary objects with lots of stuff in them) and then work there. Possibly a hybrid approach would be to make a macro in VB that automatically extracts each chapter to a text file, and then you could have a Perl script that operates on the result.

There are also options for interfacing with Word's COM interface from Perl, but until I was familiar with doing it in VB I wouldn't bother. The whole VBA editor/design process can be quite illuminating as to how the object model works, with reasonably good documentation and support like a visual debugger and automatic method/property selection. Making macros and then studying the generated code is a good way to learn. Once you have working VB code it's not too difficult to translate it to Perl via the Win32 modules.

Altogether I imagine the difficulty will depend on how automated you want the process to be. If you simply save the documents as text and then have a script that does various jobs like you describe, you avoid a lot of the complexity of interfacing to the Word document/engine. As your Perl skills improve you could look at automating the process.

BTW, as an editor you may like to know that your node was put up for editorial/janitorial consideration to add some markup to make it more readable. Normally you should use P tags like `<p>blah blah</p>` to break up your nodes. Reading a long paragraph like yours is difficult on a screen. There are markup tips and links underneath where you can post; please review them. ;-)

Cheers, --- demerphq
Re^2: A copyeditor needs help to get started with a Perl project by wordsmith (Acolyte) on Nov 04, 2004 at 13:17 UTC
Thanks, everyone, for all your suggestions. I think I'll save as text and unleash Perl on the text files for starters. Later, I'll try to wade into the more complex part. Yes, your comments on the poor formatting of my post were on target. I'll use the HTML tags from now on.
Re: A copyeditor needs help to get started with a Perl project by tachyon (Chancellor) on Nov 04, 2004 at 09:33 UTC
Working with Word documents directly from Perl can be done. Working with them from VB is possibly easier. You will basically be converting VB examples to Win32::OLE in Perl if you wish to run with Perl. If you can process the manuscripts as text files then Perl is definitely the weapon of choice. Here are a few Win32::OLE examples that work with Word files. I just ripped them out of the toolbox so you will need to fiddle with them. They worked happily in their native environment ;-) Read more... (7 kB)
Re: A copyeditor needs help to get started with a Perl project by BrowserUk (Patriarch) on Nov 04, 2004 at 16:25 UTC
Personally, I'd stick to dealing with the text as text. The format of Word documents is horribly complex, badly documented if at all, and subject to variation from version to version of Word. It also contains a heap of stuff that has nothing to do with the content or even its markup. (I should also add that I don't use Word for anything. WordPad is as much of a word processor as I will ever need.)

As a first pass, I would:

1. Break the text into terms (words, but including things like leading or trailing quotes, embedded hyphens, apostrophes, etc.).
2. Store the terms as the keys of a hash, along with sufficient context to allow for easy searching.
3. Sort the terms case-insensitively (retaining case) so that 'This' and 'this' (and 'tihs') sort adjacently.
4. Flag terms not found in the dictionary. (Optionally, also flag those in your word list--not done yet.)
5. Present the terms (+flags +counts) as a scrolling list.
6. Allow inspection of the contexts for selected words.
7. Copy the context to the clipboard for easy searching in Word or another editor.

And here's some crude, incomplete code to do most of that. Fixing it up to use a Win32::Console or one of the GUI interfaces etc. is left as an exercise :) Read more... (6 kB)

Like I said, crude, but a simple, effective(?), possible approach.

Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Janitored by Arunbear - added readmore tags, as per Monastery guidelines
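For readers following along without the collapsed code, here is a minimal sketch of the first few steps in that list (split into terms, store with context, sort case-insensitively, flag unknown terms). It is not BrowserUk's code. The sub names, the tiny inline `%dictionary`, and the sample lines are all hypothetical stand-ins; a real run would load a word list from a file and read the manuscript saved as text.

```perl
use strict;
use warnings;

# ASSUMPTION: a tiny inline word list standing in for a real dictionary file.
my %dictionary = map { $_ => 1 } qw(a is of sample text the this);

# Build an index: term => list of [line number, line text] contexts.
sub index_terms {
    my (@lines) = @_;
    my %index;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        # A term is a run of letters, allowing embedded hyphens and apostrophes.
        while ($line =~ /([A-Za-z]+(?:['-][A-Za-z]+)*)/g) {
            push @{ $index{$1} }, [ $line_no, $line ];
        }
    }
    return \%index;
}

# Sort case-insensitively but keep the original case,
# so 'This', 'this' and 'Tihs' end up next to each other.
sub sorted_terms {
    my ($index) = @_;
    return sort { lc($a) cmp lc($b) or $a cmp $b } keys %$index;
}

# A term is flagged when its lowercase form is not in the dictionary.
sub is_flagged {
    my ($term) = @_;
    return !$dictionary{ lc $term };
}

my @sample = (
    'This is a sample of the text.',
    'Tihs is this text.',
);
my $index = index_terms(@sample);
for my $term (sorted_terms($index)) {
    my $count = @{ $index->{$term} };
    my $flag  = is_flagged($term) ? '*' : ' ';
    my $where = join ',', map { $_->[0] } @{ $index->{$term} };
    printf "%s %-10s %d  (lines %s)\n", $flag, $term, $count, $where;
}
```

Keeping the whole line as context is the simplest choice for a sketch like this; for long paragraphs a window of a few words either side of the term would probably be easier to paste into Word's search box.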
Re: A copyeditor needs help to get started with a Perl project by gaal (Parson) on Nov 04, 2004 at 09:42 UTC
FWIW, if you prefer Perl to VB, you can save your data as HTML in Word, edit that, and reread it. Word reads its own HTML very well, for all I know with no loss of information at all.
Re^2: A copyeditor needs help to get started with a Perl project by tachyon (Chancellor) on Nov 04, 2004 at 10:01 UTC
"Word reads its own HTML very well." That is a very good thought. As horrid as Word HTML is to the naked eye, an HTML parser should let you whip through it with ease, editing the text but leaving the ~~puke vomit markup~~ formatting. Then, as you say, let Word convert its own excreta back into native format. This conversion is essentially just padding with huge numbers of null bytes for every real character, thus 'Hello World!' as a text file is 13 bytes but in .DOC format it needs a mere 19,456 :-)
Re^3: A copyeditor needs help to get started with a Perl project by gaal (Parson) on Nov 04, 2004 at 11:06 UTC
*shrug* A VB runtime sounds like it's going to be bigger than either :) (BTW, some versions of Word had a bug where the first time you saved a file after an edit, its size would be about double what it would be if you immediately saved it again. Or did this happen only when the file was saved as RTF?)
Re^4: A copyeditor needs help to get started with a Perl project by tachyon (Chancellor) on Nov 04, 2004 at 15:30 UTC
Re^5: A copyeditor needs help to get started with a Perl project by gaal (Parson) on Nov 04, 2004 at 15:51 UTC
Re^3: A copyeditor needs help to get started with a Perl project by ww (Archbishop) on Nov 04, 2004 at 14:55 UTC
tachyon's phrasing is admirable -- wish I'd said it first! (and read 405192 et seq. as well -- inserted 1700 GMT) -- but it sparks a small question. Based on wordsmith's original description, won't cleaning up the "~~puke~~" "~~vomit~~" ~~bovine manure~~ (ok, "markup" or "formatting") be almost as important to the desired "consistency" as emending the actual text? The boldfacing, ^{font changes}, etc. in the original .doc may be formatting conventions for which consistency is also desired. Also, I'm curious (enough so to read/experiment soon, unless some oracle here knows with certainty) whether an HTML parser can actually make sense of the many conditionals Word inserts while saving as ~~.html~~ garbage.
Re^2: A copyeditor needs help to get started with a Perl project by wordsmith (Acolyte) on Nov 05, 2004 at 08:44 UTC
Working with the HTML version of the Word file is an interesting idea. By the way, I just need to read the file; writing will be to a new report file. The original Word file will remain untouched. I will enter the corrections manually. The "automation" in my original post only refers to the process of identifying inconsistencies in the document. But can Perl recognize document elements such as headers, fonts, superscripts, tables, etc., in the HTML file? Let me give you a sample real-life scenario. Let us say a chapter has 100 numbered reference items at the end of the chapter, which are cited in text by superscripted integers. The problem: generate a report that will identify the reference items that have not been cited in the text. To accomplish this, Perl would have to recognize superscripted elements in the HTML file. Pardon my ignorance, but can it?
Re^3: A copyeditor needs help to get started with a Perl project by gaal (Parson) on Nov 05, 2004 at 09:44 UTC
Yes, it can. I don't have sample data at hand right now, so I can't give you the exact example, but if you inserted the footnote (or endnote) with the standard Insert Footnote menu item, the integer is well tagged with something like `<span class="footnoteReference">....</span>` tags. If you're on a Windows machine, just save a demo document as HTML and look at the result with a text editor; it should be clear enough. Don't be intimidated by the several KBs of CSS at the beginning :)
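To make the uncited-references scenario concrete, here is a rough sketch, assuming the citations end up as plain `<sup>...</sup>` elements holding numbers, comma lists, or ranges. That markup is an assumption, not what Word is known to emit; check a saved demo document first and adjust the pattern (for example to a `<span class="...">` like the one mentioned above). The sub name `uncited_references` and the sample HTML are made up for illustration, and a proper HTML parser would be more robust than a regex on real Word output.

```perl
use strict;
use warnings;

# Return the reference numbers in 1..$total that never appear
# as a superscripted citation in the HTML text.
sub uncited_references {
    my ($html, $total) = @_;
    my %cited;
    # ASSUMPTION: citations look like <sup>12</sup> or <sup>3,7-9</sup>.
    while ($html =~ m{<sup\b[^>]*>(.*?)</sup>}gis) {
        my $inner = $1;
        $inner =~ s/<[^>]+>//g;    # drop nested tags such as <span ...>
        for my $part (split /\s*,\s*/, $inner) {
            if ($part =~ /^\s*(\d+)\s*-\s*(\d+)\s*$/) {
                $cited{$_} = 1 for $1 .. $2;    # a range such as 7-9
            }
            elsif ($part =~ /^\s*(\d+)\s*$/) {
                $cited{$1} = 1;                 # a single number
            }
        }
    }
    return grep { !$cited{$_} } 1 .. $total;
}

# Hypothetical sample: 8 reference items, of which 1, 2, 4, 5 and 7 are cited.
my $html = <<'HTML';
<p>Early work<sup>1</sup> was later extended.<sup>2,4-5</sup></p>
<p>See also the survey.<sup><span class="x">7</span></sup></p>
HTML

my @missing = uncited_references($html, 8);
print "Uncited reference items: @missing\n";    # prints: Uncited reference items: 3 6 8
```

Superscripts that are not citations (exponents, ordinals like "2nd") would need filtering in a real chapter; the sketch only ignores superscripts that are not numbers.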
Re: A copyeditor needs help to get started with a Perl project by perlcapt (Pilgrim) on Nov 04, 2004 at 23:08 UTC
I think that MS has very good tools for XML generation and editing. Isn't there a utility for going between Word and XML? XML has the depth of the book organization but allows for easy manipulation by Perl and many other tools. I'm talking off the top of my head here, and would be hard put to give a demonstration, but this is something I would investigate if I were to do another book or more book-scale editing. What are STM books? Update: I found this commercial suite of products for doing the conversions between Word and XML. perlcapt -ben
Re^2: A copyeditor needs help to get started with a Perl project by wordsmith (Acolyte) on Nov 06, 2004 at 18:08 UTC
Yes, I really should pick up XML because I'm in the publishing industry. I can see 3 milestones for my project: (1) Perl with text files. (2) Perl with HTML files for the greater functionality of being able to handle Word document elements such as superscripts, fonts, etc. (Or bite the bullet and do it with VB if this approach fails to work.) (3) Perl with XML. At this point, I should have a marketable product and the big bucks should start flowing in. :-) "What are STM books?" STM stands for "scientific, technical, and medical." But it's the IT books that drive us up the wall with the jargon, acronyms, and terms uppercased or not depending on the author's whims. My Perl project is primarily directed toward taming IT books. Most sciences have fairly stable conventions regarding nomenclature, but not IT, methinks. Would you say "I bought two mouses from the store" or "I bought two mice from the store"? And which would you choose: keystream, key stream, or key-stream? We had one book where it appeared all three ways. By the way, are there no well-known freeware Word-to-XML converters? Thanks, everybody, for all the help.
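The keystream / key stream / key-stream case is the kind of inconsistency the text-file milestone could already catch. A minimal sketch, assuming plain ASCII text and only two-part compounds: every pair of adjacent words is reduced to a joined lowercase key, and any key that turns up in more than one spelling is reported. The sub name `compound_variants` and the sample sentence are hypothetical.

```perl
use strict;
use warnings;

# Return a hash ref: joined key => { spelling => count } for every
# two-word compound that appears in more than one spelling
# (closed, open, or hyphenated). Case is ignored.
sub compound_variants {
    my ($text) = @_;
    my (%single, %forms);

    # Count every single word, lowercased (catches the closed form).
    $single{ lc $1 }++ while $text =~ /\b([A-Za-z]+)\b/g;

    # Record each pair of adjacent words joined by a hyphen or whitespace.
    # The lookahead leaves the second word unconsumed so pairs may overlap.
    while ($text =~ /\b([A-Za-z]+)(-|\s+)(?=([A-Za-z]+)\b)/g) {
        my ($first, $sep, $second) = (lc($1), $2, lc($3));
        my $joiner = $sep eq '-' ? '-' : ' ';
        $forms{ $first . $second }{ $first . $joiner . $second }++;
    }

    # Add the closed spelling wherever it also occurs as a single word.
    for my $key (keys %forms) {
        $forms{$key}{$key} = $single{$key} if $single{$key};
    }

    # Keep only the keys that were spelled more than one way.
    my %report;
    for my $key (keys %forms) {
        $report{$key} = $forms{$key} if keys %{ $forms{$key} } > 1;
    }
    return \%report;
}

my $text = 'The keystream is secret. A key stream must not repeat. '
         . 'Reuse of the key-stream is fatal.';
my $report = compound_variants($text);
for my $key (sort keys %$report) {
    my $spellings = $report->{$key};
    print join(', ', map { "'$_' x $spellings->{$_}" } sort keys %$spellings), "\n";
}
```

Pairs split by punctuation are deliberately not joined, so "secret. A" never counts as a compound. The report says nothing about which spelling is right; that stays with the copyeditor and the house style guide.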
Re^3: A copyeditor needs help to get started with a Perl project by perlcapt (Pilgrim) on Nov 06, 2004 at 21:04 UTC
I can't find BYTE Magazine's style guide, which has all of these issues pinned down after thousands of man-hours debating them, but I did find two links that are probably even better: the UC Berkeley-IT Style Guide and the IEEE Computer Society Style Guide. I have a personal/professional interest in the .doc -> XML subject and am pursuing it as a result of your comments. I'll post an update to this message when I have found out more about it. It appears as though there is a Word plugin or enhancement for going between XML and Word documents. Update: Here is the Microsoft Toolbox for Word/XML. I'll try it out and let you know. perlcapt -ben

