Re: Re: Adventures in optimization (or: the good and bad side of being truly bored)

by revdiablo (Prior)
on Aug 02, 2003 at 19:40 UTC (#280312)


in reply to Re: Adventures in optimization (or: the good and bad side of being truly bored)
in thread Adventures in optimization (or: the good and bad side of being truly bored)

Thanks for the reply, Limbic~Region. I didn't know about the caching built into Time::Local, and I will definitely look into it. Please note that I do still use Time::Local, but I only call it once for each day found in my log, rather than tens of thousands of times per day.

Also, though I didn't mention it in my original post, I did make extensive use of Devel::DProf. As mentioned later in a reply from diotalevi, it really is easy to use, and the results are quite useful. I highly recommend it for anybody interested in improving the speed of their code, and probably should have said something about it in my original post. Between this and Benchmark, determining where to speed up one's program is all but simple. :)

I totally agree with the rest of your post too. My first optimization-related SoPW brought to my attention the problem of using test data that is not exactly representative -- though it wasn't a huge issue there. And in the case of both subroutines, a new algorithm combined with caching was indeed the way I gained the huge performance increases. Again, thanks for the reply.

Update: on further investigation, I'm not sure how useful Time::Local's caching will be in my situation. It caches at the month level, whereas I'm caching at the day level. Since my logs span only about a week of time, and month-level caching would still result in 10s of thousands more full timelocal calls than I currently do, I think it won't help me too much. (Note: this is all untested speculation. Please be advised to destroy my assumptions at will.)
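For readers who want to see what day-level caching around timelocal might look like, here is a minimal sketch. It is not my actual code: the timestamp format ("YYYY-MM-DD HH:MM:SS"), the sub name log_time_to_epoch, and the %midnight_for hash are all assumptions made up for illustration. It also assumes no DST change happens partway through a day, since the offset within the day is plain arithmetic.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Hypothetical log timestamp format: "YYYY-MM-DD HH:MM:SS".
my %midnight_for;      # "YYYY-MM-DD" => epoch seconds at local midnight
my $full_calls = 0;    # how many times timelocal() was really called

sub log_time_to_epoch {
    my ($stamp) = @_;
    my ($day, $h, $m, $s) =
        $stamp =~ /^(\d{4}-\d\d-\d\d) (\d\d):(\d\d):(\d\d)$/
        or die "unparseable timestamp: $stamp";

    # One full timelocal() call per distinct day ...
    unless (exists $midnight_for{$day}) {
        my ($year, $mon, $mday) = split /-/, $day;
        $midnight_for{$day} = timelocal(0, 0, 0, $mday, $mon - 1, $year);
        $full_calls++;
    }

    # ... and plain arithmetic for every other entry on that day.
    # Assumption: no DST shift occurs within the day.
    return $midnight_for{$day} + $h * 3600 + $m * 60 + $s;
}

my @stamps = ('2003-08-02 00:00:05', '2003-08-02 19:40:00', '2003-08-03 00:00:00');
my @epochs = map { log_time_to_epoch($_) } @stamps;
print "$full_calls timelocal calls for ", scalar(@stamps), " entries\n";
```

With an entry every 5 seconds, almost every lookup hits the hash, so the number of full timelocal calls tracks the number of distinct days rather than the number of log lines.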

Re: Re: Re: Adventures in optimization (or: the good and bad side of being truly bored)
by demerphq (Chancellor) on Aug 04, 2003 at 07:54 UTC

    on further investigation... 10s of thousands more .. calls

    Are you sure? I thought the cache worked like this

    $cache{$yearmonth}+$days*(24*60*60)+$hours*(60*60)+$mins*60+$secs;

    Having said that, I'm in the odd position that I too didn't know about the nocheck option in Time::Local, and I also wrote my own caching for it, but I did it based on the hour. I am parsing dates like "20030401101123" and I end up doing something like the following (time_to_unix is just a wrapper around timelocal that knows how to split the above string (fragment) correctly):

    ($cache{substr($date,0,10)}||=time_to_unix(substr($date,0,10))) + substr($date,10,2)*60 + substr($date,12,2);

    which also made my time calculations several thousand times faster. Incidentally, I think this approach will probably significantly outperform using timelocal() (and its caching) directly. The hash lookup on the first section of the date is far cheaper than splitting the date and then passing all of its parts on the stack, having timelocal do its checks and caching, which presumably resemble

    $cache{$year.$month}

    anyway, and then getting the results back over the stack. We trade many ops for just a few. And we get a cool bonus: since Time::Local is still validating its input, the cache actually acts as a validating filter too. Only valid YMDHs get into it, and if we don't have a hit we have either an unknown valid YMDH or a bad date, both of which Time::Local handles for us. So we get a serious speed benefit without losing any of the safety of Time::Local.
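The hour-level cache described above could be fleshed out roughly as follows. This is only a sketch: the body of time_to_unix is an assumed implementation (the post only says it wraps timelocal and knows how to split the fragment), and date_to_unix is a hypothetical name for the one-liner.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

my %cache;    # "YYYYMMDDHH" => epoch seconds at the top of that hour

# Assumed implementation of the wrapper: split a "YYYYMMDDHH" fragment
# and hand it to timelocal(), which croaks on out-of-range fields.
sub time_to_unix {
    my ($ymdh) = @_;
    my ($year, $mon, $mday, $hour) = unpack 'A4 A2 A2 A2', $ymdh;
    return timelocal(0, 0, $hour, $mday, $mon - 1, $year);
}

# Hypothetical name for the cached conversion shown in the post.
sub date_to_unix {
    my ($date) = @_;    # e.g. "20030401101123"
    my $ymdh = substr($date, 0, 10);
    return ($cache{$ymdh} ||= time_to_unix($ymdh))
         + substr($date, 10, 2) * 60     # minutes past the hour
         + substr($date, 12, 2);         # seconds past the minute
}

print date_to_unix('20030401101123'), "\n";

# A bad date never produces a cached value, because timelocal() croaks.
my $ok = eval { date_to_unix('20031340101123'); 1 };
print "bad date rejected\n" unless $ok;
```

One thing the sketch makes visible: only the year, month, day and hour pass through timelocal's checks. The minutes and seconds are added with plain arithmetic, so a stamp ending in "9999" would not be rejected by this scheme.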


    ---
    demerphq

    <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...

      demerphq++. Thanks for the reply. I did subsequently benchmark Time::Local with _nocheck, and while it was faster than without _nocheck, my home-brew cache was still substantially faster. Interesting that you decided to cache at the hour level, rather than the day. I chose the day level because converting hours to seconds is a relatively trivial calculation, but then again I guess converting days to seconds is too, so maybe caching at the month level would be just as good.

      Now I wonder if the different caching level is the reason _nocheck is slower. Perhaps it's due to the additional subroutine call, and not the different caching at all. But again this is all rank speculation... (I'm actively resisting the urge to break out my benchmark.pl and test hour, day, and month-level caching, but I think I need to just be happy with the performance I've got.)
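      If I did give in to the urge, the comparison might look something like the sketch below. Everything in it is assumed for illustration: the synthetic data (one "YYYYMMDDHHMMSS" stamp every 5 seconds across a single day), the sub names, and the hand-rolled caches, which stand in for my real code and are not Time::Local's own cache. The month-level version also assumes no DST change within the month.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Time::Local qw(timelocal);

# Synthetic data (an assumption): one stamp every 5 seconds on 2003-08-02.
my @stamps;
for (my $sec = 0; $sec < 86_400; $sec += 5) {
    push @stamps, sprintf '20030802%02d%02d%02d',
        int($sec / 3600), int($sec / 60) % 60, $sec % 60;
}

# No caching: a full timelocal() call for every entry.
sub direct {
    my ($y, $mo, $d, $h, $mi, $s) = unpack 'A4 A2 A2 A2 A2 A2', $_[0];
    return timelocal($s, $mi, $h, $d, $mo - 1, $y);
}

my (%by_month, %by_day, %by_hour);

# Cache on "YYYYMM"; days, hours, minutes and seconds are arithmetic.
sub month_cached {
    my $t = $_[0];
    return ($by_month{substr($t, 0, 6)} ||= direct(substr($t, 0, 6) . '01000000'))
         + (substr($t, 6, 2) - 1) * 86_400
         + substr($t, 8, 2) * 3_600 + substr($t, 10, 2) * 60 + substr($t, 12, 2);
}

# Cache on "YYYYMMDD"; hours, minutes and seconds are arithmetic.
sub day_cached {
    my $t = $_[0];
    return ($by_day{substr($t, 0, 8)} ||= direct(substr($t, 0, 8) . '000000'))
         + substr($t, 8, 2) * 3_600 + substr($t, 10, 2) * 60 + substr($t, 12, 2);
}

# Cache on "YYYYMMDDHH"; minutes and seconds are arithmetic.
sub hour_cached {
    my $t = $_[0];
    return ($by_hour{substr($t, 0, 10)} ||= direct(substr($t, 0, 10) . '0000'))
         + substr($t, 10, 2) * 60 + substr($t, 12, 2);
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    direct => sub { direct($_)       for @stamps },
    month  => sub { month_cached($_) for @stamps },
    day    => sub { day_cached($_)   for @stamps },
    hour   => sub { hour_cached($_)  for @stamps },
});
```

      The numbers would depend on the machine and on how representative the synthetic stamps are, so I wouldn't read much into small differences between the three cached variants.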

      PS: Based on your reply here and to my post about moving averages, I have to wonder if you're not doing something relatively similar? Hopefully my posts have been somewhat helpful to you, but more likely it seems that your posts have been more helpful to me. ;)

      Update: Just thought I might clarify a bit:

      on further investigation... 10s of thousands more .. calls

      Are you sure? I thought the cache worked like this ...

      I meant 10s of thousands more calls to timelocal. Your example is essentially how my cache works (though there are a few things I notice that would probably make it a touch quicker than mine). My log has 10s of thousands of entries between each unique day (an entry every 5 seconds, to be precise), so using Perl's math operations instead of a call to timelocal for all those entries is a huge win.

        Interesting that you decided to cache at the hour level, rather than the day.

        I chose the hour and not the day because it ended up producing a "reasonable" number of entries in the cache. I typically deal with data spread over 30 days, so my cache is usually around 720 entries. If you are dealing with times that are spread over only a day, then I would suggest you go to the minute level of resolution, which would mean a cache of around 1440 entries (both numbers are actually low when you factor in the behaviour of Perl's hashes and the amount of space actually used up).
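        The entry counts above are just span divided by bucket size. A tiny sketch of that arithmetic, with a hypothetical helper name (max_cache_entries) and ignoring the per-key overhead of the hash itself:

```perl
use strict;
use warnings;

# Seconds covered by one cache entry at each granularity.
my %bucket_seconds = (minute => 60, hour => 3_600, day => 86_400);

# Upper bound on distinct cache keys for data spanning $days days.
sub max_cache_entries {
    my ($days, $granularity) = @_;
    my $bucket = $bucket_seconds{$granularity}
        or die "unknown granularity: $granularity";
    return $days * 86_400 / $bucket;
}

printf "%-6s over %2d days: %d entries\n", $_->[1], $_->[0], max_cache_entries(@$_)
    for [30, 'hour'], [1, 'minute'], [7, 'day'];
```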

        As for the analysis side of it, I think it's pretty clear. We are both manipulating strings. A hash lookup on a string of the sizes we are dealing with is far less work than dissecting the string into the required sizes and order (and perhaps supplying additional values), pushing them onto the stack, having timelocal pull them off the stack, build a fragment that it can use to check its cache, and return the value over the stack again. If you add it up, it's probably 4 or 5 times more operations (depending on how you define the term) to call the subroutine, which in both of our cases will most likely be for a time we have already encountered.


        ---
        demerphq
Re: Re: Re: Adventures in optimization (or: the good and bad side of being truly bored)
by diotalevi (Canon) on Aug 03, 2003 at 01:09 UTC

    Actually... I wouldn't normally think of Benchmark unless I'm considering altering my Perl style. It isn't going to help you find the slow parts of your program, and it isn't going to tell you whether the speed difference is even meaningful. I guess the only time I ever actually reach for it is when I'm doing very odd things and want to know which odd thing performs less badly. Outside of that... *shrug*.
