http://www.perlmonks.org?node_id=280265

Hello esteemed monks. It has been a while since I've posted much more than the odd SoPW, so I figured I'd put up something meaty for you folks to chew on a while. Hopefully it's not too boring, and my general Perl naivete is not too evident.

I have been working on a logfile parser and analyzer (can we get any more stereotypical?) for the past few days, and it's presented me with the opportunity to push my Perl knowledge just that much further. I thought perhaps I'd share this with you all, as well as try to acknowledge and thank all the Fine Monks who gave me excellent (and fruitful!) food for thought.

The first problem I noticed with my quick-n-dirty solution (it really wasn't *that* quick-n-dirty, considering I spent enough time to separate the code out into a few handy modules, which turned out to be quite a boon later), was the moving average calculations were taking quite a while. That search for optimization led to my call for help at calculate average from range of hash values (optimize). All the replies gave me some insight into better solving the problem, and in the end I used an approach that was somewhat of a combination of Aristotle's code at Re^4: calculate average from range of hash values (optimize) and demerphq's ideas at Re: calculate average from range of hash values (optimize)(reply). This resulted in a dramatic speedup over my original code. Looking back at that original code is actually quite embarrassing, come to think of it.

Here's what I came up with:

{ my @values; my $prev_first; my $prev_last; sub cache { my ($data, $first, $last) = @_; unless ($prev_first and $prev_last) { @values = grep defined, @{$data}{$first .. $last}; } else { for ($prev_first .. $first-1) { shift @values if defined $data->{$_}; } for ($prev_last+1 .. $last) { push @values, $data->{$_} if defined $data->{$_}; } } $prev_first = $first; $prev_last = $last; sum(@values) / @values; } }

And here's the benchmark results compared to a few other versions:

Rate orig zaxo buk aris cache orig 6.26/s -- -22% -42% -80% -98% zaxo 8.00/s 28% -- -26% -75% -98% buk 10.9/s 74% 36% -- -66% -97% aris 31.8/s 408% 297% 192% -- -91% cache 368/s 5775% 4496% 3283% 1057% --

After I was satisfied with the speed improvement here (and who wouldn't be?), I decided to focus on the date parsing routine. Initially I simply passed the date directly to Date::Parse and gave it no second thought. Stuffed the resulting epoch seconds straight into my datastructure and plowed ahead. When I noticed this had become a serious bottleneck, I decided to write a custom parser. Splitting up the date string into its chunks was easy enough, but I deferred the actual work of calculating epoch seconds to Time::Local.

The routine was still running rather slowly. I suspected my regular expression was goofy, so I worked on that for a while. Then I tried a parser that didn't use a single matching regular expression at all (but instead used split). I was not getting substantial speedups, and was about to give up in frustration, when I realized I could use a cache like my moving averages calculation. That way I did not have to repetitively run timelocal, which was the main slow point. Implementing the cache brought runtime down to acceptible levels, and basically completed the excercise. I am proud I was able to "discover" this optimization without need of a SoPW post, though I'll probably be shown a much better way to accomplish the goal once I post this code. :)

So here's the parsing code:

{ my %cache; sub cache { my $str = shift; my ($wday, $mon, $mday, $time, $year) = split /\s+/, $str; my ($hour, $min, $sec) = split ':', $time; my $timebase; if ($cache{"$year$mon$mday"}) { $timebase = $cache{"$year$mon$mday"}; } else { $timebase = timelocal(0, 0, 0, $mday, $mtrans{$mon}, $year-190 +0); $cache{"$year$mon$mday"} = $timebase; } return $timebase + $sec + $min*60 + $hour*60*60; } }

And here're the benchmark results:

s/iter D::P buggy fixed nonre cache D::P 9.08 -- -76% -77% -77% -92% buggy 2.16 320% -- -2% -3% -66% fixed 2.12 328% 2% -- -2% -65% nonre 2.09 335% 4% 2% -- -65% cache 0.735 1136% 194% 188% 184% --

Moral of the story? Writing simple code and optimizing it to hell can be fun and interesting, and even slightly rewarding. Maybe other people can share my joy. :)

PS: If there's any interest in the full benchmarking code, or even the complete log parsing/analyzing modules... send me a /msg and I'll put the stuff up on my scratchpad, or something.