Hello, Monks.

I've been plunked down into a new development environment and I am writing code that others after I am gone (I am a contractor) need to be able to read and maintain. For purposes of maintainability and readability, I wrote the following sub instead of using a regular expression:

sub is_ms_num { # field separator, normally ' ', modified for slice comparison. # see perldoc perlvar. my $ofs = $"; $" = ''; my $string = shift; my @test = split //, $string; chomp($string); $^W=0; # check to see if we have the two letter designation if ("@test[0,1]" !~ /[A-Za-z][A-Za-z]/) { return undef } # the next two should be 00 or greater if ("@test[2,3]" !~ /0[0-9]/) { return undef } # the next four are the actual manuscript number, and should be # numeric if ("@test[4,5,6,7]" !~ /[0-9]{4}/) { return undef } # char 8 is PR's checksum # JACS is allowed to have an additional six chars, but nobody else.. +. if (($test[9]) and ("@test[0,1]" !~ /JA/)) { return undef } # verify that the JACS additional chars are kosher if ((defined $test[9]) and # it might very well be zero... ("@test[0,1]" =~ /JA/) and ("@test[9,10,11,12,13,14]" !~ /-[0-9]{2}-[0-9]{2}/)) { return un +def } $" = $ofs; $^W=1; # looks like its good, return it. return $string; }
which then uses the following subs:
sub atomize { my $string = shift; my @atoms = split / /, $string; # the most we are going to combine is word -1, word, and word +1. my @molecules; my $ofs = $"; for ( my $i = 0; $i < $#atoms; $i++ ) { $" = ''; my $molecule = "@atoms[$i-1,$i,$i+1]"; push @molecules, $molecule; } $" = $ofs; return (\@atoms, \@molecules); } sub search_ms_nums { my @words = (@_); foreach my $word (@words) { return $word if is_ms_num( $word ); } return undef; }
in a loop that looks like this:
if ($header =~ /^[Ss]ubject/) { my ($words, $strings) = atomize( $header ); $key = search_ms_nums(@{$words}) || search_ms_nums(@{$strings} +); }
ugh, though, it is pitifully slow. I mean reeeeeally slow. I resisted using a regular expression not because I couldnt craft one, but because they tend to be totally unmaintainable (in is_ms_num()). (yes I know about the /x modifier, but I dont think even that would clarify the regex necessary for the people who will be maintaining this code)

Dominus makes a point on his website for his upcoming book that usually recursion isnt necessary, that a single iteration over an array will produce the desired results and be faster as well as clearer. Well, I am certainly using recursion here. The way that this code is being used is to examine subject lines from an inbox. It is possible that the mailserver is pitifully slow -- I cannot actually inspect the perl process itself to see if it is eating up the whole CPU, but from uptime(1) I can see that something is sending the load to 3.5.

The time taken to parse these messages is on the order of 20-60 seconds per message. With an inbox of 180 messages, this could take 2-3 hours. Thats really just unacceptable. You could just say "well, move it to a faster server." The code will actually not be running on the Ultra 2 I am writing it on, but chances are, it wont be faster than an Ultra 10.

I'd appreciate any insights into making the code a little more efficient without sacrificing the (perhaps overfriendly) readability of it.

brother dep.

update: Okay, I spoke to the lead developer and he said that while he isnt particularly fond of regular expressions being used because of code maintainability I explained that it was saving quite a bit of code and would definitely improve the speed of the sub. So here is the RE version:

sub is_ms_num_re { my $rval = shift; if ($rval =~ /(?:IC|JA|CM|OL|OM)\d{6}[\w+-](?:-\d{,3}-\d{,3})?/) { return $rval; } else { undef; } }
which, annoyingly, isnt much faster at all. We need a new development server. *sigh*.

Laziness, Impatience, Hubris, and Generosity.

In reply to Optimization for readability and speed (code) by deprecated

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":