PerlMonks  

Hello, Monks.

I've been plunked down into a new development environment, and I am writing code that others need to be able to read and maintain after I am gone (I am a contractor). For purposes of maintainability and readability, I wrote the following sub instead of using a regular expression:

    sub is_ms_num {
        # field separator, normally ' ', modified for slice comparison.
        # see perldoc perlvar.
        my $ofs = $";
        $" = '';
        my $string = shift;
        my @test = split //, $string;
        chomp($string);
        $^W=0;

        # check to see if we have the two letter designation
        if ("@test[0,1]" !~ /[A-Za-z][A-Za-z]/) { return undef }

        # the next two should be 00 or greater
        if ("@test[2,3]" !~ /0[0-9]/) { return undef }

        # the next four are the actual manuscript number, and should be
        # numeric
        if ("@test[4,5,6,7]" !~ /[0-9]{4}/) { return undef }

        # char 8 is PR's checksum
        # JACS is allowed to have an additional six chars, but nobody else...
        if (($test[9]) and ("@test[0,1]" !~ /JA/)) { return undef }

        # verify that the JACS additional chars are kosher
        if ((defined $test[9]) and # it might very well be zero...
            ("@test[0,1]" =~ /JA/) and
            ("@test[9,10,11,12,13,14]" !~ /-[0-9]{2}-[0-9]{2}/)) { return undef }

        $" = $ofs;
        $^W=1;
        # looks like its good, return it.
        return $string;
    }
which is then used by the following subs:
    sub atomize {
        my $string = shift;
        my @atoms = split / /, $string;
        # the most we are going to combine is word -1, word, and word +1.
        my @molecules;
        my $ofs = $";
        for ( my $i = 0; $i < $#atoms; $i++ ) {
            $" = '';
            my $molecule = "@atoms[$i-1,$i,$i+1]";
            push @molecules, $molecule;
        }
        $" = $ofs;
        return (\@atoms, \@molecules);
    }

    sub search_ms_nums {
        my @words = (@_);
        foreach my $word (@words) {
            return $word if is_ms_num( $word );
        }
        return undef;
    }
in a loop that looks like this:
    if ($header =~ /^[Ss]ubject/) {
        my ($words, $strings) = atomize( $header );
        $key = search_ms_nums(@{$words}) || search_ms_nums(@{$strings});
    }
Ugh, though, it is pitifully slow. I mean reeeeeally slow. I resisted using a regular expression in is_ms_num() not because I couldn't craft one, but because they tend to be totally unmaintainable. (Yes, I know about the /x modifier, but I don't think even that would clarify the necessary regex for the people who will be maintaining this code.)
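For comparison, here is a minimal sketch of what a commented /x version of the same checks might look like. The names $MS_NUM_RE and is_ms_num_x are hypothetical, and the sketch makes assumptions the original sub does not: the whole string is anchored, the checksum character is required, and nothing may follow the optional suffix.

```perl
use strict;
use warnings;

# Hypothetical pattern approximating the checks in is_ms_num().
# Assumptions: anchored at both ends, checksum character required.
my $MS_NUM_RE = qr{
    \A
    (?<journal> [A-Za-z]{2} )          # two-letter journal designation
    0 [0-9]                            # two digits, 00 through 09, as in the original check
    [0-9]{4}                           # four-digit manuscript number
    \S                                 # checksum character
    (?<suffix> -[0-9]{2}-[0-9]{2} )?   # optional -NN-NN suffix
    \z
}x;

sub is_ms_num_x {
    my ($string) = @_;
    return unless defined $string;
    chomp $string;
    return unless $string =~ $MS_NUM_RE;
    # Only JA (JACS) manuscripts may carry the extra suffix.
    return if defined( $+{suffix} ) && $+{journal} ne 'JA';
    return $string;
}
```

Named captures need perl 5.10 or later. Whether this reads better than the per-character version is a judgment call for whoever maintains it.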

Dominus makes a point on his website for his upcoming book that usually recursion isn't necessary, that a single iteration over an array will produce the desired results and be faster as well as clearer. Well, I am certainly using recursion here. This code is used to examine subject lines from an inbox. It is possible that the mailserver is pitifully slow. I cannot actually inspect the perl process itself to see if it is eating up the whole CPU, but from uptime(1) I can see that something is sending the load to 3.5.
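A flat scan along those lines might look like the sketch below. It checks each word first and then each run of three adjacent words, which is the same priority as the atomize()/search_ms_nums() pair, but without building the intermediate arrays. The name find_ms_num is hypothetical, and the validity check is passed in as a code reference so the sketch does not depend on any particular checker.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first candidate accepted by $is_valid,
# or an empty list / undef if nothing in the header qualifies.
sub find_ms_num {
    my ($header, $is_valid) = @_;
    my @words = split ' ', $header;

    # First pass: individual words.
    for my $word (@words) {
        return $word if $is_valid->($word);
    }

    # Second pass: word-1, word, word+1 joined with no separator.
    # The range is empty when there are fewer than three words.
    for my $i ( 1 .. $#words - 1 ) {
        my $joined = join '', @words[ $i - 1 .. $i + 1 ];
        return $joined if $is_valid->($joined);
    }

    return;
}

# Toy checker for illustration only; the real rules live in is_ms_num().
my $toy_check = sub { $_[0] =~ /\A[A-Z]{2}[0-9]{6}\S\z/ };
my $key = find_ms_num( 'Subject: Re: JA0112345 revision', $toy_check );
```

Both passes are linear in the number of words, so for a subject line this should cost very little either way.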

The time taken to parse these messages is on the order of 20-60 seconds per message. With an inbox of 180 messages, this could take 1-3 hours. That's really just unacceptable. You could just say "well, move it to a faster server." The code will actually not be running on the Ultra 2 I am writing it on, but chances are, it won't be faster than an Ultra 10.

I'd appreciate any insights into making the code a little more efficient without sacrificing the (perhaps overfriendly) readability of it.

thanks
brother dep.

update: Okay, I spoke to the lead developer. He isn't particularly fond of regular expressions because of code maintainability, but I explained that one would save quite a bit of code and would definitely improve the speed of the sub. So here is the RE version:

    sub is_ms_num_re {
        my $rval = shift;
        if ($rval =~ /(?:IC|JA|CM|OL|OM)\d{6}[\w+-](?:-\d{,3}-\d{,3})?/) {
            return $rval;
        }
        else {
            undef;
        }
    }
which, annoyingly, isn't much faster at all. We need a new development server. *sigh*.
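If the two versions perform about the same, the check itself may not be the bottleneck, and timing it away from the mail fetch would show that. Below is a minimal sketch using the core Benchmark module. The sample strings are made up, and check_split and check_regex are simplified, hypothetical stand-ins for the two styles rather than the subs above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample words; real subject lines would be better input.
my @samples = ( 'JA0112345', 'IC0198765', 'Re:', 'revision', 'JA0112345-01-02' );

# Per-character style: split the string and test slices of it.
sub check_split {
    my ($s) = @_;
    my @c = split //, $s;
    return unless @c >= 9;
    return unless join( '', @c[ 0, 1 ] ) =~ /\A[A-Za-z]{2}\z/;
    return unless join( '', @c[ 2, 3 ] ) =~ /\A0[0-9]\z/;
    return unless join( '', @c[ 4 .. 7 ] ) =~ /\A[0-9]{4}\z/;
    return $s;
}

# Single-regex style: the same prefix rules in one match.
sub check_regex {
    my ($s) = @_;
    return unless $s =~ /\A[A-Za-z]{2}0[0-9][0-9]{4}./s;
    return $s;
}

# A negative count asks Benchmark to run each sub for at least
# that many CPU seconds, then print a comparison table.
cmpthese( -1, {
    split_slices => sub { my @hits = grep { defined( check_split($_) ) } @samples },
    single_regex => sub { my @hits = grep { defined( check_regex($_) ) } @samples },
} );
```

If both rates come out in the thousands per second, the 20-60 seconds per message is being spent somewhere else, most likely in fetching the mail. Separately, on perls before 5.34 I believe `\d{,3}` is matched as literal text rather than as a quantifier, so `\d{0,3}` or `\d{2}` may be closer to what was intended in is_ms_num_re().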

--
Laziness, Impatience, Hubris, and Generosity.


In reply to Optimization for readability and speed (code) by deprecated
