Hello esteemed monks. It has been a while since I've posted much more than the odd SoPW, so I figured I'd put up something meaty for you folks to chew on a while. Hopefully it's not too boring, and my general Perl naivete is not too evident.

I have been working on a logfile parser and analyzer (can we get any more stereotypical?) for the past few days, and it's presented me with the opportunity to push my Perl knowledge just that much further. I thought perhaps I'd share this with you all, as well as try to acknowledge and thank all the Fine Monks who gave me excellent (and fruitful!) food for thought.

The first problem I noticed with my quick-n-dirty solution (it really wasn't *that* quick-n-dirty, considering I spent enough time to separate the code out into a few handy modules, which turned out to be quite a boon later), was the moving average calculations were taking quite a while. That search for optimization led to my call for help at calculate average from range of hash values (optimize). All the replies gave me some insight into better solving the problem, and in the end I used an approach that was somewhat of a combination of Aristotle's code at Re^4: calculate average from range of hash values (optimize) and demerphq's ideas at Re: calculate average from range of hash values (optimize)(reply). This resulted in a dramatic speedup over my original code. Looking back at that original code is actually quite embarrassing, come to think of it.

Here's what I came up with:

{ my @values; my $prev_first; my $prev_last; sub cache { my ($data, $first, $last) = @_; unless ($prev_first and $prev_last) { @values = grep defined, @{$data}{$first .. $last}; } else { for ($prev_first .. $first-1) { shift @values if defined $data->{$_}; } for ($prev_last+1 .. $last) { push @values, $data->{$_} if defined $data->{$_}; } } $prev_first = $first; $prev_last = $last; sum(@values) / @values; } }

And here's the benchmark results compared to a few other versions:

Rate orig zaxo buk aris cache orig 6.26/s -- -22% -42% -80% -98% zaxo 8.00/s 28% -- -26% -75% -98% buk 10.9/s 74% 36% -- -66% -97% aris 31.8/s 408% 297% 192% -- -91% cache 368/s 5775% 4496% 3283% 1057% --

After I was satisfied with the speed improvement here (and who wouldn't be?), I decided to focus on the date parsing routine. Initially I simply passed the date directly to Date::Parse and gave it no second thought. Stuffed the resulting epoch seconds straight into my datastructure and plowed ahead. When I noticed this had become a serious bottleneck, I decided to write a custom parser. Splitting up the date string into its chunks was easy enough, but I deferred the actual work of calculating epoch seconds to Time::Local.

The routine was still running rather slowly. I suspected my regular expression was goofy, so I worked on that for a while. Then I tried a parser that didn't use a single matching regular expression at all (but instead used split). I was not getting substantial speedups, and was about to give up in frustration, when I realized I could use a cache like my moving averages calculation. That way I did not have to repetitively run timelocal, which was the main slow point. Implementing the cache brought runtime down to acceptible levels, and basically completed the excercise. I am proud I was able to "discover" this optimization without need of a SoPW post, though I'll probably be shown a much better way to accomplish the goal once I post this code. :)

So here's the parsing code:

{ my %cache; sub cache { my $str = shift; my ($wday, $mon, $mday, $time, $year) = split /\s+/, $str; my ($hour, $min, $sec) = split ':', $time; my $timebase; if ($cache{"$year$mon$mday"}) { $timebase = $cache{"$year$mon$mday"}; } else { $timebase = timelocal(0, 0, 0, $mday, $mtrans{$mon}, $year-190 +0); $cache{"$year$mon$mday"} = $timebase; } return $timebase + $sec + $min*60 + $hour*60*60; } }

And here're the benchmark results:

s/iter D::P buggy fixed nonre cache D::P 9.08 -- -76% -77% -77% -92% buggy 2.16 320% -- -2% -3% -66% fixed 2.12 328% 2% -- -2% -65% nonre 2.09 335% 4% 2% -- -65% cache 0.735 1136% 194% 188% 184% --

Moral of the story? Writing simple code and optimizing it to hell can be fun and interesting, and even slightly rewarding. Maybe other people can share my joy. :)

PS: If there's any interest in the full benchmarking code, or even the complete log parsing/analyzing modules... send me a /msg and I'll put the stuff up on my scratchpad, or something.

Replies are listed 'Best First'.
Re: Adventures in optimization (or: the good and bad side of being truly bored)
by Limbic~Region (Chancellor) on Aug 02, 2003 at 16:12 UTC
    I am a little confused. You say that Time::Local wasn't fast enough for you, so you did it yourself by adding caching. That's exactly what Time::Local does:

    These routines are quite efficient and yet are always guaranteed to agree with localtime() and gmtime(). We manage this by caching the start times of any months we've seen before. If we know the start time of the month, we can always calculate any time within the month. The start times are calculated using a mathematical formula.

    So the only thing I could see that would make it slower than your method is the fact that it validates the input before acting on the data, which you can turn off using the timelocal_nocheck option and of course the fact that methods are always slower than subs which are always slower than inline code.

    Ok - now that little nit is over, it is almost always turns out that to improve the speed of a giving piece of code to add caching (if possible) or change the algorithm.

    As far as optimizing a piece of code being fun and interesting - I quite agree. You should be using Devel::Profile to determine exactly what piece of the code is chewing up time and focus there as it will often not be where you think the problem is. You should also always Benchmark on an idle system. Your test data should be thoroughly varied as "real" data can often swing your results in the other direction. By that I mean that comparing index to a regex might be slower for strings up to 30 characters, but then it may take the lead and ultimately demolish the regex. If you haven't taken that into account in your test data, you will still end up with poorly optimized code.

    Optimized code that isn't the result of a new algorithm or caching is almost always harder to maintain. I have such a piece of code that I not very proud of even though it gets the job done. It is better to learn the better algorithms than make your code unreadable. If you absolutely need that much speed and can't get it from better algorithms/caching - maybe it is time to learn a new language.

    Cheers - L~R

      And to add caching - Memoize!

      As for actual profiling, its really easy to run your scripts under Devel::DProf (which comes with perl don'cha know) `perl5 -d:DProf test.pl` and after its finished, dprofpp. Sooper easy. I worked out an order of magnitude increase in some perl code speed by repeatedly testing my changes and assumptions against profiled code. And for those that are curious - it turns out that regex capturing can be a real dog some times.

        Your attention to detail never ceases to amaze me. I didn't want to mention Memoize at the beginning since it would not work for what revdiablo was trying to do, but I did want to mention it later as a "generally you can add caching simply by using Memoize by Dominus" and completely forgot.

        Thanks - L~R

        Indeed. Devel::DProf is good stuff. I was very happy with my results when using it and Benchmark to, erm, profile and benchmark my code.

        As for Memoize, Limbic~Region's reply is correct. It won't really work for me in this case, because I willshould never have the case where my subroutine gets called twice with the same arguments (unless of course there's something wrong in my log and I get duplicate entries). Thanks for mentioning this though, I've been meaning to look at it for a good while now.

        Regarding Memoize, maybe im weird but i've never found it to be particularly useful. I have never had a real (dev and toy, ok) circumstance where I can cache based on arguments alone, and splitting my function into subfuctions that can be so Memoized never seemed to me to be the right thing to do. I think caching properly requires some degree of thought and understanding of the overall process, which combined with the ease of establishing a cache in Perl I have never used Memoize in any production code.


        <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

      Thanks for the reply Limbic~Region. I didn't know about the caching built into Time::Local, and I will definitely look into it. Please note that I do still use Time::Local, but I only call it once per each day found in my log, rather than tens of thousands of times per day.

      Also, though I didn't mention it in my original post, I did make extensive use of Devel::DProf. As mentioned later in a reply from diotalevi, it really is easy to use, and the results are quite useful. I highly recommend it for anybody interested in improving the speed of their code, and probably should have said something about it in my original post. Between this and Benchmark, determining where to speed up one's program is all but simple. :)

      I totally agree with the rest of your post too. My first optimization-related SoPW brought to my attention the problem using test data that is not exactly representative -- though it wasn't a huge issue there. And in the case of both subroutines, a new algorithm combined with caching was indeed the way I gained the huge performance increases. Again thanks for the reply.

      Update: on further investigation, I'm not sure how useful Time::Local's caching will be in my situation. It caches at the month level, whereas I'm caching at the day level. Since my logs span only about a week of time, and month-level caching would still result in 10s of thousands more full timelocal calls than I currently do, I think it won't help me too much. (Note: this is all untested speculation. Please be advised to destroy my assumptions at will.)

        on further investigation... 10s of thousands more .. calls

        Are you sure? I thought the cache worked like this


        Having said that, I'm in the odd position that I too didn't realize the nocheck option in Time::Local, and I also wrote my own caching for it, but I did it based on hour. I am parsing dates like "20030401101123" and i endup doing something like the following (time_to_unix is just a wrapper around timelocal that knows how to split the above string (fragment) correctly)

        ($cache{substr($date,0,10)}||=time_to_unix(substr($date,0,10))) + substr($date,10,2)*60 + substr($date,12,2);

        which also gave me a several thousandfold time increase in my time calculations. Incidentally I think this approach will probably signifigantly outperform using timelocal() (and its caching) directly. The hash lookup on the first section of the date is far cheaper than splitting the date and then passing all of its parts on the stack, having timelocal do its checks and caching, which presumably resemble


        anyway, and then getting the results back over the stack. We trade many ops for just a few. And we get a cool bonus, since Time::Local is still validating its input the cache actually acts as a validating filter too. Only valid YMDH's get into it, and if we dont have a hit we either have an unknown valid YMDH or a bad date. Both of which Time::Local handles for us. So we get a serious speed benefit without losing any of the safety of Time::Local.


        <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

        Actually... I wouldn't normally think of Benchmark unless I'm considering altering my perl style. It isn't going to help you find the slow parts in your program, isn't even going to tell you whether the speed difference is even meaningful. I guess the only time I ever actually reach for it is when I'm doing very odd things and want to know which odd thing performs less worse. Outside of that... *shrug*.