http://www.perlmonks.org?node_id=1043832

Endless has asked for the wisdom of the Perl Monks concerning the following question:

Hello my new favorite friends,

As part of a lexical processing project I'm working on, I'm parsing millions of dates and converting them to epoch time. However, Devel::NYTProf showed me that I was losing massive amounts of time by calling Date::Parse's str2time; I guess that's the price you pay for something that seemed like the perfect, effortless way to parse the dates.

So, my question is, how can I most efficiently parse these dates, for those of you who have a sense of the benchmarks? Here was the WRONG way (removing it doubled my speed!):

    # Dates of form 'Fri, 01 Mar 2013 01:21:14 +0000'
    my $created_at = str2time($value);

Update: Solution

Thanks to the discussion between BrowserUk and rjt, a high-speed solution emerged that looked something like this:

    use Inline C => q@
    int epoch_sec(char * date) {
        char *tz_str = date + 26;
        struct tm tm;
        int tz;

        if ( strlen(date) != 31
            || strptime(date, "%a, %d %b %Y %T", &tm) == NULL
            || sscanf(tz_str, "%d", &tz) != 1) {
            printf("Invalid date %s\n", date);
            return 0;
        }

        return timegm(&tm)
            - (tz < 0 ? -1 : 1)*(abs(tz)/100*3600 + abs(tz)%100*60);
    }
    @;

    our $date = "Fri, 01 Mar 2013 01:21:14 +0200";
    my $newDate = epoch_sec($date);
    say $newDate;

Thanks! You guys are incredible.

Replies are listed 'Best First'.
Re: High-speed Date Formatting
by rjt (Curate) on Jul 12, 2013 at 00:39 UTC

    Update: Another quantum leap brought to you by Inline::C and <time.h> for a 2500% increase over Date::Parse::str2time. Fast enough, I trust?

    Updated with BrowserUk's method

    This one's a natural for split (prior to getting really funky)

    Benchmark code

    Output:

                      Rate str2time split_nocheck BrowserUk inline_c
    str2time       16554/s       --          -87%      -93%     -96%
    split_nocheck 126315/s     663%            --      -45%     -71%
    BrowserUk     229820/s    1288%           82%        --     -47%
    inline_c      437015/s    2540%          246%       90%       --
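The benchmark code itself is not reproduced above. As a rough idea of how such a comparison can be set up, here is a minimal sketch using the core Benchmark module. The sub names (parse_regex, parse_split, compare_parsers) are made up for illustration, and both candidates hand the fields to Time::Local's timegm, so only the field-extraction step differs. It is not the code that produced the table above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Time::Local qw(timegm);

# Month name => zero-based month number, as timegm() expects.
my %MON;
@MON{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 0 .. 11;

# Candidate 1: pull the fields out with a regex.
sub parse_regex {
    my ($d, $m, $y, $H, $M, $S) = $_[0] =~
        /^\w{3}, (\d\d) (\w{3}) (\d{4}) (\d\d):(\d\d):(\d\d)/
        or die "Bad format: $_[0]";
    return timegm($S, $M, $H, $d, $MON{$m}, $y);
}

# Candidate 2: split on spaces and colons (no validation at all).
sub parse_split {
    my (undef, $d, $m, $y, $H, $M, $S) = split /[ :]/, $_[0];
    return timegm($S, $M, $H, $d, $MON{$m}, $y);
}

# Hypothetical driver: $count is passed straight to cmpthese().
sub compare_parsers {
    my ($count) = @_;
    my $date = 'Fri, 01 Mar 2013 01:21:14 +0000';
    cmpthese($count, {
        regex => sub { parse_regex($date) },
        split => sub { parse_split($date) },
    });
}

compare_parsers(100_000);
```

Passing a negative count such as -3 makes cmpthese run each candidate for at least that many CPU seconds, which tends to give steadier rates than a fixed iteration count.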

      Whoo! Looks like that's the winner, rjt. I suspect the ability to switch into C can be a major strength for Perl.

      The constants for BrowserUk are wrong. They should be:
      use constant MONTHS => { qw[ Jan 0 Feb 31 Mar 59 Apr 90 May 120 Jun 151 Jul 181 Aug 212 Sep 243 Oct 273 Nov 304 Dec 334 ] };
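One way to double-check a table like this is to derive it from the month lengths rather than typing it by hand. A small sketch (variable names are illustrative) that accumulates the days of a non-leap year:

```perl
use strict;
use warnings;

my @names = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
my @days  = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);   # non-leap year

# offset{month} = number of days that precede the 1st of that month.
my %offset;
my $total = 0;
for my $i (0 .. $#names) {
    $offset{ $names[$i] } = $total;
    $total += $days[$i];
}

print join(' ', map { "$_ $offset{$_}" } @names), "\n";
```

The printed pairs match the corrected constants, with Sep 243, Oct 273 and Nov 304.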
Re: High-speed Date Formatting
by tobyink (Canon) on Jul 12, 2013 at 01:12 UTC

    You missed an opportunity to title your post "Speed Dating"!

    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
Re: High-speed Date Formatting
by BrowserUk (Patriarch) on Jul 12, 2013 at 00:23 UTC

    If you can live with its limitations(1), you could try something like this:

    #! perl -slw
    use strict;

    use constant MONTHS => { qw[
        Jan 0 Feb 31 Mar 59 Apr 90 May 120 Jun 151
        Jul 181 Aug 212 Sep 242 Oct 272 Nov 303 Dec 334
    ] };

    sub str2epoch {
        my( $d, $m, $y, $H, $M, $S ) = $_[0] =~
            m[^.... (\d\d) (...) (\d\d\d\d) (\d\d):(\d\d):(\d\d)]
            or die "Bad format $_[0]";
        my $leaps = int( ($y - 1970) / 4 + 0.5 );
        (((($y-1970)*365 +$leaps+MONTHS->{$m}+($d-1))*24 +$H)*60 +$M)*60 +$S;
    }

    my $date = 'Fri, 01 Mar 2013 01:21:14 +0000';;
    print str2epoch( $date );
    print scalar localtime str2epoch( $date );
    print scalar localtime str2epoch( 'Fri, 12 Jul 2013 01:20:34' );;

    __END__
    C:\test\primes>..\str2epoch.pl
    1362100874
    Fri Mar  1 01:21:14 2013
    Fri Jul 12 02:20:34 2013

    And if you need a little more speed, you might parse the numbers with unpack instead of the regex engine.
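A minimal sketch of what the unpack variant might look like, assuming every input has exactly the fixed layout 'Fri, 01 Mar 2013 01:21:14 +0000'. The sub name str2epoch_unpack and the plain %MONTHS hash are illustrative, the month offsets are the corrected ones from elsewhere in this thread, and the arithmetic (and therefore the limitations) is otherwise the same as in the regex version.

```perl
use strict;
use warnings;

# Days before the first of each month in a non-leap year.
my %MONTHS = (
    Jan => 0,   Feb => 31,  Mar => 59,  Apr => 90,
    May => 120, Jun => 151, Jul => 181, Aug => 212,
    Sep => 243, Oct => 273, Nov => 304, Dec => 334,
);

sub str2epoch_unpack {
    # 'x' skips one byte, 'An' takes n characters:
    # skip "Fri, ", then day, month name, year, hours, minutes, seconds.
    my ($d, $m, $y, $H, $M, $S) =
        unpack 'x5 A2 x A3 x A4 x A2 x A2 x A2', $_[0];
    die "Bad format $_[0]" unless defined $m && exists $MONTHS{$m};

    my $leaps = int(($y - 1970) / 4 + 0.5);
    return (((($y - 1970) * 365 + $leaps + $MONTHS{$m} + ($d - 1)) * 24
        + $H) * 60 + $M) * 60 + $S;
}

print str2epoch_unpack('Fri, 01 Mar 2013 01:21:14 +0000'), "\n";   # 1362100874
```

Unlike the regex, unpack does not check that the fields are digits, so this only makes sense when the input is already known to be well formed.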

    1 Limitations include:

    • No leap seconds;
    • No daylight savings;
    • No timezones;
    • Only works for another 87 years.

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      No timezones;

      This much is (sort of) trivial to fix. Add  ([+-]\d{4}) to the end of your regex, and then include - $tz/100*3600 - $tz%100*60 at the end of the expression.

      sub str2epoch {
          my( $d, $m, $y, $H, $M, $S, $tz ) = $_[0] =~
              m/^.... (\d\d) (...) (\d\d\d\d) (\d\d):(\d\d):(\d\d) ([+-]\d{4})/
              or die "Bad format $_[0]";
          my $leaps = int( ($y - 1970) / 4 + 0.5 );
          (((($y-1970)*365 +$leaps+MONTHS->{$m}+($d-1))*24 +$H)*60 +$M)*60 +$S
              - ($tz < 0 ? -1 : 1)*(substr($tz,1,2)*3600 + substr($tz,3)*60);
      }
      • No leap seconds;
      • No daylight savings;
      • Only works for another 87 years.

      See my answer for timegm, which does slow things down, but is a bit more robust. ++ for pure-Perl that out-performs the "efficient" Time::Local routines and works in most all cases, though!

      Edit: Updated $tz calc for fractional hours.
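For comparison, a minimal sketch of the timegm route mentioned above, using the core Time::Local module and the same offset handling. This is an illustration rather than the code from the other answer; the sub name str2epoch_timegm and the %MON lookup are made up here.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Month name => zero-based month number, as timegm() expects.
my %MON;
@MON{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 0 .. 11;

sub str2epoch_timegm {
    my ($d, $m, $y, $H, $M, $S, $sign, $tzh, $tzm) = $_[0] =~
        /^\w{3}, (\d\d) (\w{3}) (\d{4}) (\d\d):(\d\d):(\d\d) ([+-])(\d\d)(\d\d)$/
        or die "Bad format: $_[0]";
    die "Unknown month: $m" unless exists $MON{$m};

    # Offset east of UTC in seconds; subtracting it converts local time to UTC.
    my $offset = ($tzh * 3600 + $tzm * 60) * ($sign eq '-' ? -1 : 1);
    return timegm($S, $M, $H, $d, $MON{$m}, $y) - $offset;
}

print str2epoch_timegm('Fri, 01 Mar 2013 01:21:14 +0200'), "\n";   # 1362093674
```

timegm does the calendar arithmetic itself, so leap years are handled without a hand-written formula, at the cost of the extra call.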

        No timezones; This much is trivial to fix....

        Indeed, that is sufficiently efficient to make it silly not to include it. Thank you.

        Though it seems silly not to let the regex do its work. I reformulated that as:

        sub str2epoch {
            my( $d, $m, $y, $H, $M, $S, $tzs, $tzh, $tzm ) = $_[0] =~
                m[^.... (\d\d) (...) (\d\d\d\d) (\d\d):(\d\d):(\d\d) ([+-])(\d\d)(\d\d)]
                or die "Bad format $_[0]";
            my $leaps = int( ($y - 1970) / 4 + 0.5 );
            # The offset is parenthesized so the sign applies to hours and minutes alike.
            (((($y-1970)*365 +$leaps+MONTHS->{$m}+($d-1))*24 +$H)*60 +$M)*60 +$S
                - ($tzs eq '-' ? -1 : 1)*($tzh*3600 + $tzm*60);
        }
        See my answer for timegm, which does slow things down, but is a bit more robust

        All the others can be handled, but eventually you just end up with the morass that is DateTime which I have no time for :)

        (There is a simple workaround for the 2100 problem, but it wouldn't come to mind as I write that. And the last version I wrote is archived on a CD somewhere.)



      Another small speed up can be achieved by replacing

      my $leaps = int( ($y - 1970) / 4 + 0.5 );
      (((($y-1970)*365 +$leaps+MONTHS->{$m}+($d-1))*24 +$H)*60 +$M)*60 +$S;

      with

      (((int(($y-1970)*365.25-.5)+MONTHS->{$m}+$d)*24 +$H)*60 +$M)*60 +$S;

      In my experiments this gave between 5 and 10%. All other attempts using split and unpack are much slower. I found substr to be very fast:

      sub str2epoch3 {
          (((int((substr($_[0],12,4)-1970)*365.25-.5)
              + MONTHS->{substr($_[0],8,3)} + substr($_[0],5,2))*24
              + substr($_[0],17,2))*60
              + substr($_[0],20,2))*60
              + substr($_[0],23,2);
      }

      about 60% faster than BrowserUk's code. I wonder whether there is something wrong...
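One way to check whether something is wrong is to compare the fast routine against a slower reference on a few dates. The sketch below does that with the core Time::Local module. The reference_epoch helper and the sample dates are illustrative, the MONTHS table uses the corrected constants, and the time zone field is ignored by both routines.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

use constant MONTHS => { qw[
    Jan 0 Feb 31 Mar 59 Apr 90 May 120 Jun 151
    Jul 181 Aug 212 Sep 243 Oct 273 Nov 304 Dec 334
] };

my %MON;
@MON{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 0 .. 11;

# The substr-based routine from the post above, unchanged apart from layout.
sub str2epoch3 {
    (((int((substr($_[0],12,4)-1970)*365.25-.5)
        + MONTHS->{substr($_[0],8,3)} + substr($_[0],5,2))*24
        + substr($_[0],17,2))*60
        + substr($_[0],20,2))*60
        + substr($_[0],23,2);
}

# Reference value from Time::Local (UTC; the offset field is ignored here too).
sub reference_epoch {
    my $s = shift;
    return timegm(substr($s,23,2), substr($s,20,2), substr($s,17,2),
                  substr($s,5,2), $MON{substr($s,8,3)}, substr($s,12,4));
}

for my $date ('Fri, 01 Mar 2013 01:21:14 +0000',
              'Sun, 01 Jan 2012 00:00:00 +0000',
              'Thu, 01 Mar 2012 00:00:00 +0000') {
    printf "%s  diff=%d\n", $date, str2epoch3($date) - reference_epoch($date);
}
```

On these three samples the two March dates agree, while 1 Jan 2012 comes out 86400 seconds (one day) high. That suggests the 365.25 shortcut, like the $leaps formula it replaces, is only exact outside January and February of leap years, which may or may not matter for a given data set.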

      Here is my full code:

      Only works for another 87 years.

      This reminds me of a comment that I actually found in some source-code once:   “dig me up and I’ll fix it then.”

Re: High-speed Date Formatting
by fullermd (Priest) on Jul 12, 2013 at 07:20 UTC

    If they're all the same format (or possibly a sufficiently small number of formats), you may be able to use strptime. If you can get something that just hits the C func, it may be quite fast (and even if it doesn't, being able to specify a format rather than call a DWIM interpretation function should be faster anyway).

    I see a POSIX::strptime that seems like it may thunk straight through to libc's implementation. DateTime::Format::Strptime looks like a perl reimplementation, so may be slower (but probably more portable, if your system's strptime is broken).
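As an illustration of the format-string approach, here is a minimal sketch that uses Time::Piece instead of the modules named above, simply because Time::Piece ships with Perl and also provides a strptime. The sub name str2epoch_strptime is made up, the offset is split off and applied by hand rather than relying on any %z handling, and no claim is made about how its speed compares.

```perl
use strict;
use warnings;
use Time::Piece;

sub str2epoch_strptime {
    my ($str) = @_;

    # First 25 characters are the date and time; the rest is the +hhmm offset.
    my ($stamp, $sign, $tzh, $tzm) =
        $str =~ /^(.{25}) ([+-])(\d\d)(\d\d)$/
        or die "Bad format: $str";

    # Called as a class method, strptime() returns a UTC-based object.
    my $t = Time::Piece->strptime($stamp, '%a, %d %b %Y %H:%M:%S');

    # Offset east of UTC in seconds; subtracting it converts local time to UTC.
    my $offset = ($tzh * 3600 + $tzm * 60) * ($sign eq '-' ? -1 : 1);
    return $t->epoch - $offset;
}

print str2epoch_strptime('Fri, 01 Mar 2013 01:21:14 +0200'), "\n";
```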

      Great idea! In fact, I had the same one earlier in this thread at least six hours ago. :-) Here's the relevant sub by itself:

      use Inline C => q@
      int epoch_sec(char * date) {
          char *tz_str = date + 26;
          struct tm tm;
          int tz;

          if ( strlen(date) != 31
              || strptime(date, "%a, %d %b %Y %T", &tm) == NULL
              || sscanf(tz_str, "%d", &tz) != 1) {
              printf("Invalid date %s\n", date);
              return 0;
          }

          return timegm(&tm)
              - (tz < 0 ? -1 : 1)*(abs(tz)/100*3600 + abs(tz)%100*60);
      }
      @;

      Performs about 25 times faster than str2time.