
Re: Which DateTime:: module to use?

by Kenosis (Priest)
on Aug 26, 2012 at 23:02 UTC ( #989888=note )

in reply to Which DateTime:: module to use?

The Date::Calc module's Date_to_Time will convert your date/time information almost directly, with the addition of a trailing 0 for seconds. Given this, and your task, perhaps the following will assist your coding:

use Modern::Perl;
use Date::Calc qw/Date_to_Time/;

my $duration = 10 * 60;
my $fileName = 'data.txt';

open my $fh, '<', $fileName or die "Unable to open file: $!";

for ( ; ; ) {
    my ( $first, @lines );
    while ( my $line = <$fh> ) {
        my ( $date, $time, $values ) = split ',', $line, 3;
        my @dateTime = "$date $time 0" =~ /(\d+)/g;
        my $timeInSecs = Date_to_Time @dateTime;
        $first = $timeInSecs unless $first;
        push @lines, "$timeInSecs\t$values";
        last if $timeInSecs - $first >= $duration;
    }

    # Start work with chunk of lines in array
    #
    do { /(\S+)\t(\S+)/; say "Seconds: $1; Values: $2" } for @lines;
    say '----------';
    #
    # End work with chunk of lines in array

    last if eof $fh;
}

close $fh;

Output (lines were added to your original data set):

Seconds: 1294953900; Values: 1.33508,1.33524,1.33470,1.33494,391
Seconds: 1294954020; Values: 1.33508,1.33524,1.33470,1.33494,391
Seconds: 1294954200; Values: 1.33494,1.33506,1.33447,1.33453,318
Seconds: 1294954320; Values: 1.33494,1.33506,1.33447,1.33453,318
Seconds: 1294954500; Values: 1.33453,1.33483,1.33417,1.33434,426
----------
Seconds: 1294954620; Values: 1.33453,1.33483,1.33417,1.33434,426
Seconds: 1294954800; Values: 1.33434,1.33468,1.33417,1.33467,309
Seconds: 1294954920; Values: 1.33434,1.33468,1.33417,1.33467,309
Seconds: 1294955100; Values: 1.33471,1.33493,1.33465,1.33465,233
Seconds: 1294955220; Values: 1.33434,1.33468,1.33417,1.33467,309
----------
Seconds: 1294955400; Values: 1.33465,1.33475,1.33443,1.33463,184
Seconds: 1294955520; Values: 1.33465,1.33475,1.33443,1.33463,184
Seconds: 1294955700; Values: 1.33463,1.33519,1.33463,1.33493,344
Seconds: 1294955820; Values: 1.33465,1.33475,1.33443,1.33463,184
Seconds: 1294956000; Values: 1.33494,1.33563,1.33489,1.33524,318
----------
Seconds: 1294956120; Values: 1.33494,1.33563,1.33489,1.33524,318
Seconds: 1294956300; Values: 1.33524,1.33551,1.33512,1.33549,182
----------

The script reads in N lines from the data file, based upon the value of $duration, which in this case is set to 10 minutes (always express $duration in seconds). The output shows clusters of lines falling within 10-minute intervals. The routine may grab one line beyond the duration, but hopefully that granularity is sufficient for your analysis.
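To make the conversion step above concrete, here is a minimal sketch, assuming the timestamps are treated as UTC. Date::Calc's Date_to_Time takes (year, month, day, hour, min, sec); a record such as "2011.01.13,21:25" carries no seconds field, hence the trailing 0. Core Time::Local's timegm performs the same arithmetic if Date::Calc is not installed:

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Sketch of the date/time-to-epoch conversion, using only core modules.
# The sample record is illustrative; real input would come from the file.
my ($date, $time) = split /,/, "2011.01.13,21:25";
my @dateTime = "$date $time 0" =~ /(\d+)/g;   # (2011, 01, 13, 21, 25, 0)
my ($y, $mo, $d, $h, $mi, $s) = @dateTime;

my $epoch = timegm($s, $mi, $h, $d, $mo - 1, $y);
print "$epoch\n";   # 1294953900, matching the first line of output above
```

Note that timegm takes its arguments in the opposite order (seconds first) and a zero-based month.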

Not knowing how you want to work with your data, the line push @lines, "$timeInSecs\t$values"; can be changed to push only the raw lines (or whatever you may need) onto the array @lines.
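For instance, here is a minimal sketch of one such alternative, assuming the same loop as above: store each chunk entry as a hash so later analysis can reach the epoch or the raw line by name. The sample values mirror the output above; the field names are illustrative, not part of the original script.

```perl
use strict;
use warnings;

# Sketch: keep both the computed epoch and the untouched raw line per entry.
my @lines;
my $line       = "2011.01.13,21:25,1.33508,1.33524,1.33470,1.33494,391";
my $timeInSecs = 1294953900;

push @lines, { epoch => $timeInSecs, raw => $line };

print "$lines[0]{epoch} => $lines[0]{raw}\n";
```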

Hope this helps!

Replies are listed 'Best First'.
Re^2: Which DateTime:: module to use?
by kejv2 (Acolyte) on Sep 01, 2012 at 23:21 UTC

    Thanks to all for your suggestions! Eventually I decided to first write a shorter script that goes through ~50k lines of my input file and checks for any gaps in the arithmetic progression. I have made several versions of this script in order to benchmark different external modules and also a few techniques, and I will choose the best one for the final program. Here is the testing script:

    #!/usr/bin/perl
    use strict;
    use warnings;

    BEGIN {                                          #234567
        local $Date::Calc::XS_DISABLE = 1;           #2
        require Date::Calc;                          #234567
        Date::Calc->import( qw/Date_to_Time/ );      #234567
    }                                                #234567
    use Time::Local qw/timelocal_nocheck/;           #1

    my $file = $ARGV[0];
    my ($time_frame) = $file =~ /(\d+)/;
    $\ = "\n";
    $, = ",";                                        #56
    my @dateTime;                                    #56
    my ($date, $time);                               #5
    my ($year, $mon, $day, $hour, $min);             #12347

    for (1..10) {
        open DATA, $file or die "Can't open file $file";
        my ($t0, $t1, $sec_int, $min_int);
        while (<DATA>) {
            ($year, $mon, $day, $hour, $min) = /^(\d{4}).(\d{2}).(\d{2}),(\d{2}):(\d{2})/o;  #123
            ($year, $mon, $day, $hour, $min) = /^(\d{4}).(\d{2}).(\d{2}),(\d{2}):(\d{2})/;   #7
            ($year, $mon, $day, $hour, $min) = /^(\d+).(\d+).(\d+),(\d+):(\d+)/o;            #4
            ($date, $time) = split ',';                                                      #5
            @dateTime = "$date $time 0" =~ /(\d+)/g;                                         #5
            @dateTime = /^(\d{4}).(\d{2}).(\d{2}),(\d{2}):(\d{2})/o;                         #6
            $t1 = Date_to_Time @dateTime;                                                    #5
            $t1 = Date_to_Time(@dateTime, 0);                                                #6
            $t1 = Date_to_Time $year, $mon, $day, $hour, $min, 0;                            #2347
            $t1 = timelocal_nocheck(0, $min, $hour, $day, $mon-1, $year);                    #1
            if ( defined $t0 ) {
                $sec_int = $t1 - $t0;
                $min_int = int $sec_int/60;
                # warn "Leap second encountered around $year.$mon.$day $hour:$min" if $sec_int % 60;
                print "$year.$mon.$day $hour:$min - $min_int"                                #12347
                print @dateTime, " - ", $min_int                                             #56
                    unless $min_int == $time_frame or $min_int > 36*60;
            }
            $t0 = $t1;
        }
        close DATA;
    }

    It is a template, with the numbers after the comments indicating which versions each line belongs to. All lines with a trailing comment that does not contain the particular version number are commented out for that testing run. Versions 1-3 use the Time::Local, Date::Calc and Date::Calc::XS modules respectively. The remaining versions all use Date::Calc::XS and differ slightly in how the input is parsed. In each version I process the file 10 times to spread out the overhead of compiling and loading modules. Here is the shell script with which I measure the times:

    #!/bin/bash
    ROUNDS=$1
    TIMEFORMAT=%R
    OUT=bench.out

    for VS in {1..7}; do
        for ((i=1;i<=$ROUNDS;i++)) do
            { time sed -r "/#[0-9]+$/ s/^/#/; /#[0-9]*${VS}[0-9]*$/ s/^#//" \
                | perl - EURUSD5.csv >/dev/null; } 2>> $OUT
        done
        echo Version $VS:
        perl -ne'
            BEGIN{ my ($sum, $sumsq, $count) }
            $count++;
            $sum   += $_;
            $sumsq += $_*$_;
            END{
                $mean   = $sum/$count;
                $stddev = sprintf "%.4f", sqrt( $sumsq/$count - $mean*$mean );
                $mean   = sprintf "%.3f", $mean;
                print "\tMean: ${mean}s\n\tStdDev: ${stddev}s\n\n"
            }
        ' $OUT
        rm $OUT
    done

    You can see that I also 'time' the generation of the script from the template (because I don't know a better way), but that part is really negligible (about 10 ms). When I run this shell script with the argument 30 (meaning each batch of 10 runs through the file is repeated 30 times) I get the following results:

    Version 1:
        Mean:   8.945s
        StdDev: 0.2205s
    Version 2:
        Mean:   4.329s
        StdDev: 0.1283s
    Version 3:
        Mean:   1.639s
        StdDev: 0.0477s
    Version 4:
        Mean:   1.706s
        StdDev: 0.0586s
    Version 5:
        Mean:   2.898s
        StdDev: 0.1029s
    Version 6:
        Mean:   1.976s
        StdDev: 0.0992s
    Version 7:
        Mean:   1.659s
        StdDev: 0.0631s

    I was really surprised at how slow the timelocal_nocheck function from the Time::Local module is (v1). For a core module it seems quite inefficient, because Date_to_Time from Date::Calc (v2) appears to do the same thing (convert an array of time values to an epoch time) in half the time. The biggest surprise, however, came when I installed the Date::Calc::XS version with C internals (v3): you can see it is almost 3 times faster than the pure-Perl version.
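    For anyone wanting to reproduce such comparisons without the template-and-sed machinery, the core Benchmark module can time the candidate calls directly. Here is a sketch using only core modules (Date::Calc::XS may not be installed everywhere); one known cost in timelocal is the local-timezone handling, which timegm_nocheck skips:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Time::Local qw(timelocal_nocheck timegm_nocheck);

# Compare two core epoch-conversion routines over a fixed timestamp.
my @t = (0, 25, 21, 13, 0, 2011);   # sec, min, hour, mday, mon-1, year

cmpthese(200_000, {
    'timelocal_nocheck' => sub { timelocal_nocheck(@t) },
    'timegm_nocheck'    => sub { timegm_nocheck(@t) },
});
```

    cmpthese prints a rate table with relative percentage differences, which is often more informative than wall-clock means.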

    To comment on other (more or less cosmetic) versions:

    • (v4) - if you know the exact structure of your data, you should exploit it as much as possible.
    • (v5) - good to know that a poorly chosen way of parsing the data can have quite a big impact.
    • (v6) - it was quite a surprise to me that capturing the regex matches into specific variables rather than into an array is considerably more efficient.
    • (v7) - here, on the other hand, we see only a minimal difference between compiling the regex once (/o) and not.
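    The v6/v7 observations can be combined in a small sketch: qr// precompiles a pattern once (the modern equivalent of the /o modifier used in the template), and the captures go straight into named scalars. The sample line is illustrative:

```perl
use strict;
use warnings;

# Precompile the pattern once, outside any loop, instead of relying on /o.
my $re = qr/^(\d{4})\.(\d{2})\.(\d{2}),(\d{2}):(\d{2})/;

my $line = "2011.01.13,21:25,1.33508,1.33524,1.33470,1.33494,391";
my ($year, $mon, $day, $hour, $min) = $line =~ $re;
print "$year-$mon-$day $hour:$min\n";   # 2011-01-13 21:25
```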

    If anyone has comments on the output of a particular benchmark version, or other ideas for improving performance further, that would be great.

      You might want to tighten up your REs a little. '.' matches anything, including a digit; it might be better to code a specific separator character, or perhaps use \s+ so that the separator is more explicit. Just a thought; it probably works fine as it is.
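      To illustrate the suggestion above, here is a small sketch with a deliberately malformed sample line: the loose pattern with bare '.' accepts it, while escaping the separators rejects it.

```perl
use strict;
use warnings;

# Loose vs tightened versions of the date pattern from the benchmark script.
my $loose = qr/^(\d{4}).(\d{2}).(\d{2}),(\d{2}):(\d{2})/;
my $tight = qr/^(\d{4})\.(\d{2})\.(\d{2}),(\d{2}):(\d{2})/;

my $good = "2011.01.13,21:25";
my $bad  = "2011x01z13,21:25";   # wrong separators, still all digits elsewhere

print "loose accepts malformed line\n"  if $bad  =~ $loose;
print "tight rejects malformed line\n"  if $bad  !~ $tight;
print "tight still accepts good line\n" if $good =~ $tight;
```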
