|Perl: the Markov chain saw|
Re^3: LaTeX Abbreviations for Linguistsby graff (Chancellor)
|on Sep 05, 2009 at 16:54 UTC||Need Help??|
Um... okay, given the date of your last reply, that comment line in your code can be interpreted correctly, but you should be aware that taken by itself, that date string could mean three different things.
This regex in your code looks wrong, and is very different from the one I recommended (and from the one I quoted out of the original version of your script):
It creates a character class that includes "e" two times, and also includes period and open and close parens. It will match things like "\)g.", "\(g.", "\.g.", "\xg.", etc, and probably won't match things that you want it to match, like "exg". I realize now that in my earlier reply, I left out a colon; I've updated that accordingly, and I apologize for that mistake.
Consistent indentation is a nice thing, and so is using @ARGV for things like asking for usage help and providing file names and options -- please get acquainted with using @ARGV (and Getopt::Std and/or Getopt::Long), because making the user manually type things in after the script is running is a Real Pain™.
A long list of "configuration" or "initialization" data (your "@lgr" list) would be handled more cleanly (and would be easier to maintain) as a __DATA__ segment that gets read into an array or hash on start-up.
The regex alternation character (vertical-bar, |) does not work as such inside a regex character class (between square brackets), it just matches a literal "|" -- so you should study the perlre man page to understand how character classes work.
Also, when you want to delete all occurrences of particular characters from a string, using tr/xyz//d is much more efficient than s/x//g; s/y//g; s/z//g; (using tr is even more efficient than s/[xyz]//g).
For a small script like this, modifying global-scope variables inside of subroutines isn't such a big deal, because the script is small, but it's usually not a good idea. As a general rule (and in the interest of creating subroutines that are modular and easy to maintain and adapt), it's better to pass data to subs as parameters, and have the sub either return its resulting data to the caller, or modify its parameters in-place (because they were passed as references).
If you document your code with POD, it will be easier to read and maintain the documentation, which is important. If the documentation includes a brief description of what the code actually does, that will help you to organize your thoughts in a sensible way about the algorithm, and then organize your code according to what makes sense (and is documented). As it is, there's a lot of inefficiency in your code, because the algorithm hasn't been thought out. In particular, you are using arrays where you should be using hashes.
This note in your help text is not necessarily true:
The script could be in the shell's execution PATH, so it doesn't have to be where the data file is; also, a user can provide (relative or absolute) paths for both the script and the data file, so they don't both have to be in the same place. Also, it's always a good idea to include the filename string and $! in the error message when you "die" on a failed "open" call.
I happened to notice this one odd entry in your long list of LGR abbrevs:
There's nothing in your code that handles this "N" prefix on other abbrevs, so things like "NPST" and "NSG" will never be labeled as "nonpast" or "nonsingular" in your output. Also, that explanation will never appear in the output either, unless "N-" happens to occur in the tex file.
One last point: do you have a suitable "test.tex" file that contains at least one example of every kind of abbreviation you intend to handle with this script, along with some variety of "normal" content? If not, make one. The point would be to make sure that all these abbreviations get listed as intended.
Of course, you can't anticipate all the ways that "normal" LaTeX content might cause your script to miss things that are real abbreviations (e.g. if two abbrevs occur next to each other separated by a single space, the second one will be missed), or to list things as abbrevs when they really aren't (e.g. FULL WORDS IN UPPERCASE, or any single-digit number). But even a little bit of testing is better than none.
Here's how I would write your script (though this version won't behave exactly the same as yours, and might have some mistakes in it -- I didn't have any LaTeX files with abbreviations to test it on):
(note that you have to fill in the part about defining what abbreviations are, and that part should be updated as you refine your code)
(update: added missing ABBR file handle in last print statement)