Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Hello esteemed monks. It has been a while since I've posted much more than the odd SoPW, so I figured I'd put up something meaty for you folks to chew on a while. Hopefully it's not too boring, and my general Perl naivete is not too evident.

I have been working on a logfile parser and analyzer (can we get any more stereotypical?) for the past few days, and it's presented me with the opportunity to push my Perl knowledge just that much further. I thought perhaps I'd share this with you all, as well as try to acknowledge and thank all the Fine Monks who gave me excellent (and fruitful!) food for thought.

The first problem I noticed with my quick-n-dirty solution (it really wasn't *that* quick-n-dirty, considering I spent enough time to separate the code out into a few handy modules, which turned out to be quite a boon later), was the moving average calculations were taking quite a while. That search for optimization led to my call for help at calculate average from range of hash values (optimize). All the replies gave me some insight into better solving the problem, and in the end I used an approach that was somewhat of a combination of Aristotle's code at Re^4: calculate average from range of hash values (optimize) and demerphq's ideas at Re: calculate average from range of hash values (optimize)(reply). This resulted in a dramatic speedup over my original code. Looking back at that original code is actually quite embarrassing, come to think of it.

Here's what I came up with:

{ my @values; my $prev_first; my $prev_last; sub cache { my ($data, $first, $last) = @_; unless ($prev_first and $prev_last) { @values = grep defined, @{$data}{$first .. $last}; } else { for ($prev_first .. $first-1) { shift @values if defined $data->{$_}; } for ($prev_last+1 .. $last) { push @values, $data->{$_} if defined $data->{$_}; } } $prev_first = $first; $prev_last = $last; sum(@values) / @values; } }

And here's the benchmark results compared to a few other versions:

Rate orig zaxo buk aris cache orig 6.26/s -- -22% -42% -80% -98% zaxo 8.00/s 28% -- -26% -75% -98% buk 10.9/s 74% 36% -- -66% -97% aris 31.8/s 408% 297% 192% -- -91% cache 368/s 5775% 4496% 3283% 1057% --

After I was satisfied with the speed improvement here (and who wouldn't be?), I decided to focus on the date parsing routine. Initially I simply passed the date directly to Date::Parse and gave it no second thought. Stuffed the resulting epoch seconds straight into my datastructure and plowed ahead. When I noticed this had become a serious bottleneck, I decided to write a custom parser. Splitting up the date string into its chunks was easy enough, but I deferred the actual work of calculating epoch seconds to Time::Local.

The routine was still running rather slowly. I suspected my regular expression was goofy, so I worked on that for a while. Then I tried a parser that didn't use a single matching regular expression at all (but instead used split). I was not getting substantial speedups, and was about to give up in frustration, when I realized I could use a cache like my moving averages calculation. That way I did not have to repetitively run timelocal, which was the main slow point. Implementing the cache brought runtime down to acceptible levels, and basically completed the excercise. I am proud I was able to "discover" this optimization without need of a SoPW post, though I'll probably be shown a much better way to accomplish the goal once I post this code. :)

So here's the parsing code:

{ my %cache; sub cache { my $str = shift; my ($wday, $mon, $mday, $time, $year) = split /\s+/, $str; my ($hour, $min, $sec) = split ':', $time; my $timebase; if ($cache{"$year$mon$mday"}) { $timebase = $cache{"$year$mon$mday"}; } else { $timebase = timelocal(0, 0, 0, $mday, $mtrans{$mon}, $year-190 +0); $cache{"$year$mon$mday"} = $timebase; } return $timebase + $sec + $min*60 + $hour*60*60; } }

And here're the benchmark results:

s/iter D::P buggy fixed nonre cache D::P 9.08 -- -76% -77% -77% -92% buggy 2.16 320% -- -2% -3% -66% fixed 2.12 328% 2% -- -2% -65% nonre 2.09 335% 4% 2% -- -65% cache 0.735 1136% 194% 188% 184% --

Moral of the story? Writing simple code and optimizing it to hell can be fun and interesting, and even slightly rewarding. Maybe other people can share my joy. :)

PS: If there's any interest in the full benchmarking code, or even the complete log parsing/analyzing modules... send me a /msg and I'll put the stuff up on my scratchpad, or something.

In reply to Adventures in optimization (or: the good and bad side of being truly bored) by revdiablo

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others surveying the Monastery: (9)
    As of 2019-11-13 15:46 GMT
    Find Nodes?
      Voting Booth?
      Strict and warnings: which comes first?

      Results (72 votes). Check out past polls.