Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
Hello, Monks.

I've been plunked down into a new development environment and I am writing code that others after I am gone (I am a contractor) need to be able to read and maintain. For purposes of maintainability and readability, I wrote the following sub instead of using a regular expression:

sub is_ms_num { # field separator, normally ' ', modified for slice comparison. # see perldoc perlvar. my $ofs = $"; $" = ''; my $string = shift; my @test = split //, $string; chomp($string); $^W=0; # check to see if we have the two letter designation if ("@test[0,1]" !~ /[A-Za-z][A-Za-z]/) { return undef } # the next two should be 00 or greater if ("@test[2,3]" !~ /0[0-9]/) { return undef } # the next four are the actual manuscript number, and should be # numeric if ("@test[4,5,6,7]" !~ /[0-9]{4}/) { return undef } # char 8 is PR's checksum # JACS is allowed to have an additional six chars, but nobody else.. +. if (($test[9]) and ("@test[0,1]" !~ /JA/)) { return undef } # verify that the JACS additional chars are kosher if ((defined $test[9]) and # it might very well be zero... ("@test[0,1]" =~ /JA/) and ("@test[9,10,11,12,13,14]" !~ /-[0-9]{2}-[0-9]{2}/)) { return un +def } $" = $ofs; $^W=1; # looks like its good, return it. return $string; }
which then uses the following subs:
sub atomize { my $string = shift; my @atoms = split / /, $string; # the most we are going to combine is word -1, word, and word +1. my @molecules; my $ofs = $"; for ( my $i = 0; $i < $#atoms; $i++ ) { $" = ''; my $molecule = "@atoms[$i-1,$i,$i+1]"; push @molecules, $molecule; } $" = $ofs; return (\@atoms, \@molecules); } sub search_ms_nums { my @words = (@_); foreach my $word (@words) { return $word if is_ms_num( $word ); } return undef; }
in a loop that looks like this:
if ($header =~ /^[Ss]ubject/) { my ($words, $strings) = atomize( $header ); $key = search_ms_nums(@{$words}) || search_ms_nums(@{$strings} +); }
ugh, though, it is pitifully slow. I mean reeeeeally slow. I resisted using a regular expression not because I couldnt craft one, but because they tend to be totally unmaintainable (in is_ms_num()). (yes I know about the /x modifier, but I dont think even that would clarify the regex necessary for the people who will be maintaining this code)

Dominus makes a point on his website for his upcoming book that usually recursion isnt necessary, that a single iteration over an array will produce the desired results and be faster as well as clearer. Well, I am certainly using recursion here. The way that this code is being used is to examine subject lines from an inbox. It is possible that the mailserver is pitifully slow -- I cannot actually inspect the perl process itself to see if it is eating up the whole CPU, but from uptime(1) I can see that something is sending the load to 3.5.

The time taken to parse these messages is on the order of 20-60 seconds per message. With an inbox of 180 messages, this could take 2-3 hours. Thats really just unacceptable. You could just say "well, move it to a faster server." The code will actually not be running on the Ultra 2 I am writing it on, but chances are, it wont be faster than an Ultra 10.

I'd appreciate any insights into making the code a little more efficient without sacrificing the (perhaps overfriendly) readability of it.

thanks
brother dep.

update: Okay, I spoke to the lead developer and he said that while he isnt particularly fond of regular expressions being used because of code maintainability I explained that it was saving quite a bit of code and would definitely improve the speed of the sub. So here is the RE version:

sub is_ms_num_re { my $rval = shift; if ($rval =~ /(?:IC|JA|CM|OL|OM)\d{6}[\w+-](?:-\d{,3}-\d{,3})?/) { return $rval; } else { undef; } }
which, annoyingly, isnt much faster at all. We need a new development server. *sigh*.

--
Laziness, Impatience, Hubris, and Generosity.


In reply to Optimization for readability and speed (code) by deprecated

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chanting in the Monastery: (11)
    As of 2015-07-07 12:47 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









      Results (88 votes), past polls