Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Special case mktime: semi-sequential access

by not_japh (Novice)
on Jun 30, 2013 at 16:21 UTC ( #1041613=perlquestion: print w/ replies, xml ) Need Help??
not_japh has asked for the wisdom of the Perl Monks concerning the following question:

I had a particular need for a fast mktime. Search foo lacking, I wrote a naive implementation (below) which seems to work very well, for me at least.

The use case is parsing logfile timestamps from ctime()-ish format into time_t epoch integer. The simplifying assumption is that once you've used a smart date parser to parse one timestamp, the next one you parse can be done much faster by looking only at the previous $time result and the new H:M:S, as long as both stamps lie within the same day. Fortunately, giant logfiles are often conveniently arranged in order so that this is usually true.

The hypothesis appears confirmed by some benchmarking of 1e7 (sequential!) dates:

# POSIX::mktime took: 139 wallclock secs (33.27 usr + + 40.64 sys = 73.91 CPU) # Time::Local::timegm_nocheck took: 58 wallclock secs (57.67 usr + + 0.15 sys = 57.82 CPU) # FastMktime took: 22 wallclock secs (22.53 usr + + 0.02 sys = 22.55 CPU)
Any thoughts appreciated. Should this be cpanned, or round-canned?
#!/usr/bin/perl # Drop-in POSIX::mktime() replacement. This is optimized for repeated # calls within the same date by caching previous result: if the cache # applies, it just performs the H:M:S calculation within the same day, # if not it falls through to POSIX::mktime() and stores the result for # next invocation. use strict; use warnings; use POSIX qw(mktime); package FastMktime; require Exporter; our @ISA = qw(Exporter); our @EXPORT = qw(caching_mktime); my $mk_day = 0; my $mk_mon = 0; my $mk_year = 0; my $mk_time = 0; my $mk_hms = 0; sub caching_mktime { my ($s, $m, $h, $day, $mon, $year) = @_; my $time; my $hmsarg = 3600 * $h + 60 * $m + $s; if ($mk_day > 0 && $mk_day == $day && $mk_mon == $mon && $mk_year == + $year) { $time = $mk_time + $hmsarg - $mk_hms; } else { $time = POSIX::mktime($s, $m, $h, $day, $mon, $year); $mk_day = $day; $mk_mon = $mon; $mk_year = $year; $mk_time = $time; $mk_hms = $hmsarg; } return $time; } 1;

Comment on Special case mktime: semi-sequential access
Select or Download Code
Re: Special case mktime: semi-sequential access
by RichardK (Priest) on Jun 30, 2013 at 17:35 UTC

    Those are interesting results, I wonder why the POSIX::mktime includes so much system time, 40+ seconds?

    Can you post your benchmark code ? I would be interested to try and replicate your result.

    Also, where are you getting 1e7 dates from? maybe reading a file?

    My first thought on seeing such a big difference is OS file caching -- does the result still hold if you run your benchmark 5 times in a row?

      Yeah there might be some swapping but no direct files. I converted the benchmark (code below) to use timethese(5) to do some repeating.

      Weak machine:
      Benchmark: timing 5 iterations of FastMktime::caching_mktime, POSIX::m +ktime, TIME::Local::timegm_nocheck... FastMktime::caching_mktime: 120 wallclock secs (120.21 usr + 0.05 sys + = 120.26 CPU) @ 0.04/s (n=5) POSIX::mktime: 360 wallclock secs (165.42 usr + 193.81 sys = 359.23 CP +U) @ 0.01/s (n=5) TIME::Local::timegm_nocheck: 293 wallclock secs (292.72 usr + 0.14 sy +s = 292.86 CPU) @ 0.02/s (n=5)
      stronger machine:
      Benchmark: timing 5 iterations of FastMktime::caching_mktime, POSIX::m +ktime, TIME::Local::timegm_nocheck... FastMktime::caching_mktime: 65 wallclock secs (64.77 usr + 0.02 sys = + 64.79 CPU) @ 0.08/s (n=5) POSIX::mktime: 118 wallclock secs (58.16 usr + 60.47 sys = 118.63 CPU) + @ 0.04/s (n=5) TIME::Local::timegm_nocheck: 143 wallclock secs (143.07 usr + 0.54 sy +s = 143.61 CPU) @ 0.03/s (n=5)
      benchmark.t
      #!/usr/bin/perl use FastMktime; use Benchmark qw(:all); use POSIX qw(mktime); use Time::Local; my $sstart = time; my $sstop = $sstart + 1e7; my @dates; # Stuff 1e7 sequential datestamps into an array. for (my $s=$sstart; $s<$sstop; $s++) { push(@dates, [ (gmtime $s)[0..5] ]); } my $tmp; my $results = timethese(1, { 'POSIX::mktime' => sub { $tmp = POSIX::mktime(@$_) for @dates }, 'TIME::Local::timegm_nocheck' => sub { $tmp = Time::Local::timegm_ +nocheck(@$_) for @dates }, 'FastMktime::caching_mktime' => sub { $tmp = FastMktime::caching_m +ktime(@$_) for @dates }, });

        Yep, that's very interesting, I get similar sorts of results here but mktime only takes 33 seconds. my machine isn't that new so I don't know why there such a big difference. I'm running perl v5.16 on 64 bit Linux.

        mktime seems to stat /etc/localtime every time so that's where the system time is coming from, it's odd that it isn't cached, but I don't think I want to delve into the depths of the posix library to find out more ;).

        I guess you could tweak your code a little bit, if you're after every last cycle :-

        A valid $day can never be zero so there no point testing if $mk_day > 0, $mk_day == $day is enough

        and if you create your times for midnight i.e.  $time = POSIX::mktime(0, 0, 0, $day, $mon, $year); then you can drop $mk_hms and save yourself one whole subtraction :)

Re: Special case mktime: semi-sequential access
by flexvault (Parson) on Jun 30, 2013 at 19:40 UTC

    Welcome not_japh,

      The hypothesis appears confirmed by some benchmarking of 1e7 (sequential!) dates:

    Originally I looked at this and thought that this is a job for 'use Memoize;', but then I thought if the inputs are sequential, maybe it's a real time program and once a second changes, you never go back. This is an alternate approach with-out the calculations.

    sub new_caching_mktime { my ($sec, $min, $hour, $day, $mon, $year) = @_; my $time; my $curinput = "$sec:$min:$hour:$day:$mon:$year"; if ( $curinput eq $saveinput ) { $time = $savetime; } else { $time = POSIX::mktime( $sec, $min, $hour, $day, $mon, $year ); $savetime = $time; $saveinput = $curinput; } return $time; }

    Obviously '$savetime' and '$saveinput' needs to be defined for global use.

    On most modern computers you can call a small subroutine like this hundreds of thousands times per second, and each time the cached time will be used.

    Personally, what I like is that you're using caching to improve the performance of the script. That technique will help you lots of times in the future.

    Good Luck...Ed

    "Well done is better than well said." - Benjamin Franklin

      To me it looks like a multiplication by 60 is simple enough and doesn't deserve caching. On the other hand, calculating date is tricky as it involves legacy code dating back to the Roman empire. So the OP approach definitely has some grounds. I wonder what benchmark would say though.

      Besides, a log file is not always sequential if several applications run at once and buffering is involved.

        Sequential not necessary. Just within the same day is enough to avoid the date calculations by doing the time ones yourself.

      Your version is more compact and satisfying, but it's slower because it missed the point (probably because I didn't state it!). Mine is not not memoizing, so perhaps calling it caching_* was a poor choice.

      You can do math to avoid calling mktime not just in the cases it's seen, but also for other times within the same day. The math is simply this: if you know any epoch time in some day of interest, and you know what HMS that maps to, you can do two multiplies and a few adds to find a new epoch for a new HMS in that same day.

      Benchmark: timing 5 iterations of FastMktime::caching_mktime, FastMkti +me::new_caching_mktime, POSIX::mktime, TIME::Local::timegm_nocheck... FastMktime::caching_mktime: 65 wallclock secs (64.36 usr + 0.02 sys = + 64.38 CPU) @ 0.08/s (n=5) FastMktime::new_caching_mktime: 220 wallclock secs (157.44 usr + 62.10 + sys = 219.54 CPU) @ 0.02/s (n=5) POSIX::mktime: 122 wallclock secs (60.43 usr + 60.86 sys = 121.29 CPU) + @ 0.04/s (n=5) TIME::Local::timegm_nocheck: 141 wallclock secs (140.86 usr + 0.30 sy +s = 141.16 CPU) @ 0.04/s (n=5)

        not_japh,

        Now that you explained why you are using the calculations, you method makes good sense. The previous suggestion of using midnight also may improve the performance a little more. Your call.

        Regards...Ed

        "Well done is better than well said." - Benjamin Franklin

Re: Special case mktime: semi-sequential access
by CountZero (Bishop) on Jul 01, 2013 at 13:54 UTC
    I think memoizing is something to look into, but you will have to split the time-calculations into the date part and the time part. If you calculate and memoize the value for the date at 00:00 hour then you will benefit even if the dates are not sequential and you save yourself the burden of checking whether the date is indeed within the same day. One out of sequence log entry will "destroy" your cache, whereas memoize will not be bothered.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1041613]
Approved by ww
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (7)
As of 2014-07-12 03:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (238 votes), past polls