deprecated has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks.

I've been plunked down into a new development environment and I am writing code that others after I am gone (I am a contractor) need to be able to read and maintain. For purposes of maintainability and readability, I wrote the following sub instead of using a regular expression:

sub is_ms_num { # field separator, normally ' ', modified for slice comparison. # see perldoc perlvar. my $ofs = $"; $" = ''; my $string = shift; my @test = split //, $string; chomp($string); $^W=0; # check to see if we have the two letter designation if ("@test[0,1]" !~ /[A-Za-z][A-Za-z]/) { return undef } # the next two should be 00 or greater if ("@test[2,3]" !~ /0[0-9]/) { return undef } # the next four are the actual manuscript number, and should be # numeric if ("@test[4,5,6,7]" !~ /[0-9]{4}/) { return undef } # char 8 is PR's checksum # JACS is allowed to have an additional six chars, but nobody else.. +. if (($test[9]) and ("@test[0,1]" !~ /JA/)) { return undef } # verify that the JACS additional chars are kosher if ((defined $test[9]) and # it might very well be zero... ("@test[0,1]" =~ /JA/) and ("@test[9,10,11,12,13,14]" !~ /-[0-9]{2}-[0-9]{2}/)) { return un +def } $" = $ofs; $^W=1; # looks like its good, return it. return $string; }
which then uses the following subs:
sub atomize { my $string = shift; my @atoms = split / /, $string; # the most we are going to combine is word -1, word, and word +1. my @molecules; my $ofs = $"; for ( my $i = 0; $i < $#atoms; $i++ ) { $" = ''; my $molecule = "@atoms[$i-1,$i,$i+1]"; push @molecules, $molecule; } $" = $ofs; return (\@atoms, \@molecules); } sub search_ms_nums { my @words = (@_); foreach my $word (@words) { return $word if is_ms_num( $word ); } return undef; }
in a loop that looks like this:
if ($header =~ /^[Ss]ubject/) { my ($words, $strings) = atomize( $header ); $key = search_ms_nums(@{$words}) || search_ms_nums(@{$strings} +); }
ugh, though, it is pitifully slow. I mean reeeeeally slow. I resisted using a regular expression not because I couldnt craft one, but because they tend to be totally unmaintainable (in is_ms_num()). (yes I know about the /x modifier, but I dont think even that would clarify the regex necessary for the people who will be maintaining this code)

Dominus makes a point on his website for his upcoming book that usually recursion isnt necessary, that a single iteration over an array will produce the desired results and be faster as well as clearer. Well, I am certainly using recursion here. The way that this code is being used is to examine subject lines from an inbox. It is possible that the mailserver is pitifully slow -- I cannot actually inspect the perl process itself to see if it is eating up the whole CPU, but from uptime(1) I can see that something is sending the load to 3.5.

The time taken to parse these messages is on the order of 20-60 seconds per message. With an inbox of 180 messages, this could take 2-3 hours. Thats really just unacceptable. You could just say "well, move it to a faster server." The code will actually not be running on the Ultra 2 I am writing it on, but chances are, it wont be faster than an Ultra 10.

I'd appreciate any insights into making the code a little more efficient without sacrificing the (perhaps overfriendly) readability of it.

brother dep.

update: Okay, I spoke to the lead developer and he said that while he isnt particularly fond of regular expressions being used because of code maintainability I explained that it was saving quite a bit of code and would definitely improve the speed of the sub. So here is the RE version:

sub is_ms_num_re { my $rval = shift; if ($rval =~ /(?:IC|JA|CM|OL|OM)\d{6}[\w+-](?:-\d{,3}-\d{,3})?/) { return $rval; } else { undef; } }
which, annoyingly, isnt much faster at all. We need a new development server. *sigh*.

Laziness, Impatience, Hubris, and Generosity.

Replies are listed 'Best First'.
Re: Optimization for readability and speed (code)
by merlyn (Sage) on Apr 30, 2001 at 22:17 UTC
    Augh! I'd scream if I had to maintain that code!

    Write it as mostly a single regular expression. You are using a number of very un-perl-like things. The fact that you have to set $" should be the first clue.

    Writing it as a single regular expression will almost certainly speed it up as well. You're making Perl do all the things it hates doing. I know, I've had many chats with Perl. {grin}

    -- Randal L. Schwartz, Perl hacker

      However, the developers who are going to be maintaining code like this are not perl developers and feel that perl is "part of the problem" because the CGI's on the site (which I had no hand in) are misbehaving. It almost reminds me of C code, and I wanted to make sure that others were able to read it. It is readable, if not very perlful, imho.


      Laziness, Impatience, Hubris, and Generosity.

        I'm again going to disagree. What you've written is neither C nor Perl, and something kinda awful in between. Perl experts will wince, and C experts will lack the Perl understanding to pick it up.

        Either make it native Perl, or don't write it in Perl.

        It's also broken as Perl. You return from the middle of the subroutine that sets $", but you don't reset it! That breaks the rest of the program. Use this instead (if you must):

        local $" = '';
        Then no matter what exit you take, the caller's value of $" is preserved.

        -- Randal L. Schwartz, Perl hacker

Re (tilly) 1: Optimization for readability and speed (code)
by tilly (Archbishop) on Apr 30, 2001 at 23:36 UTC
    I am with everyone else. Turning a string into an array of characters which you then join slices together from is a mistake. Plus your handling of globals is broken, rather than hand-code that, use local. That is what it is for. And Perl's readability and maintainability comes in making code short and sweet, not in overcommenting.

    If you really want to avoid REs and make C programmers comfortable, then I would point you at unpack and substr. Then you at least get some higher order constructs to compensate you for not writing in C.

    Oh, a question. I don't know what atomize is supposed to do, but the first molecule in your molecules list is unlikely to be exactly what you planned on...

    But still I would structure this by somewhere documenting the format, put in an RE, and somewhere (possibly before the RE, possibly in general documentation) put a pointer at japhy's module for taking an RE apart and explaining it in plain English.

Re: Optimization for readability and speed (code)
by Masem (Monsignor) on Apr 30, 2001 at 22:41 UTC
    <insert all of merlyn's comments here>. In addition to sticking to the RE, given the situation that you need to allow others to maintain this code, I would include in your comments the output of OGRE or some other such tool that english-fies the regex for non native regex speakers.

    Also, I would try to make sure that you grab as much as you can from the regex even if your code now does not currently need it. Say that you have a 5 field line, separated by white space, and your app currently needs only 4 of the 5 fields. I would still declare and grab all 5 fields, with commenting describing what the field is given the input format. This way, if it needs to be modified down the road to use the 5th field, they simply need to start *after* the regex, instead of possibly messing up the regex and causing the other 4 fields to be inaccessible.

    Dr. Michael K. Neylon - || "You've left the lens cap of your mind on again, Pinky" - The Brain
      I've tried to get to OGRE before, and as far as I can tell it's broken. Any word on this?


      Laziness, Impatience, Hubris, and Generosity.

Re: Optimization for readability and speed (code)
by KM (Priest) on Apr 30, 2001 at 22:34 UTC
    Why not just write it as you normally would, in a more efficient, RE sort of way? I wouldn't sacrifice performace, style, etc... because the code may be maintained by someone who doesn't know how to read Perl.It is up to the employeer to find qualified people.. or train them to be qualified, IMO.


Re: Optimization for readability and speed (code)
by cLive ;-) (Prior) on Apr 30, 2001 at 22:36 UTC
    I'm with Randal on this one. Why not just add the following comment?
    # match a number using regular expreeion - for more # information, read "Learning Perl" or "Programming Perl", # both published by O'Reilly at: #

    I mean, regular expressions are a foundation of Perl. Letting people loose on a Perl script who don't know what a regular expression is, is like letting a graphic designer build web pages without seeing or understanding the underlying code...

    Ah, yes, well, ahem... cLive ;-)

Re: Optimization for readability and speed (code)
by MeowChow (Vicar) on May 01, 2001 at 00:20 UTC
    Regexen are not indigenous to Perl - they're simply a very good tool for solving a subset of the problems which programmers regularly face, such as your own. Regex implementations are available for most most common languages, including JavaScript, Visual Basic, Java, and C/C++. Avoiding them because they're too Perlish makes about as much sense as avoiding map and grep because they're too Lispish. In other words, it makes no sense at all.
                   s aamecha.s a..a\u$&owag.print
Re: Optimization for readability and speed (code)
by jeroenes (Priest) on May 01, 2001 at 10:30 UTC
    To speed up the regex:

    1. Loose the grouping where you can. Do you really need to check the two-caps, of could you just say \L\w{2}?

    2. Use the 'o' modifier, as you don't use variables in your regex (see perlop).

    Hope this helps,

    "We are not alone"(FZ)

      Perl automatically does the second optimization for you - it recompiles a regex only if it contains interpolated variables.
                     s aamecha.s a..a\u$&owag.print