kejv2 has asked for the
wisdom of the Perl Monks concerning the following question:
Hello,
first let me describe my use case. I need to process a file full of dates in one specific format (%Y.%m.%d,%H:%M), so the precision I will be working with is minutes. Here is a snippet of the input:
2011.01.13,21:25,1.33508,1.33524,1.33470,1.33494,391
2011.01.13,21:30,1.33494,1.33506,1.33447,1.33453,318
2011.01.13,21:35,1.33453,1.33483,1.33417,1.33434,426
2011.01.13,21:40,1.33434,1.33468,1.33417,1.33467,309
2011.01.13,21:45,1.33471,1.33493,1.33465,1.33465,233
2011.01.13,21:50,1.33465,1.33475,1.33443,1.33463,184
2011.01.13,21:55,1.33463,1.33519,1.33463,1.33493,344
2011.01.13,22:00,1.33494,1.33563,1.33489,1.33524,318
2011.01.13,22:05,1.33524,1.33551,1.33512,1.33549,182
Unfortunately, the arithmetic progression you would deduce from this is not always exact. First, there are no data for weekends (though the hour and minute progression is usually preserved), and second, there are some gaps, so 13:40 can be followed by 13:50. This appears to be quite rare, but it prevents me from handling the data in a nicer way.
And now the goal: I will process the file once, picking the 1st date and then comparing the next N dates to it until the difference is, say, 12 hours. (I will definitely not want to compare dates at a finer granularity than hours.) Before each comparison I will do some work with the row. Once the specified time is reached, I will continue by comparing the Nth date to the next M dates until the difference is again 12 hours, and so on.
As I am not a very experienced (Perl) coder, I first searched CPAN for the right module to use and was quite overwhelmed (and confused) by the rich selection of modules for date and time manipulation. Because I will process my file tens of thousands of times (each time with different parameters) and it has about 1M rows, running time will be quite a concern. Not that my program will be used in any real-time applications, but it would be nice if it ran for 10 minutes instead of 10 hours.
For the date-string parsing part I will probably use DateTime::Format::Strptime, so the result will be a DateTime object. But what about the comparisons? Should I add the specified duration (12 hours) to one object and then compare the two using the 'compare' method, or first compute the difference in hours and then compare it (as a plain number) with my time interval? Also, is there a preferred way to exploit the fact that one of the objects stays constant for each few hundred successive comparisons, or will Perl (or the chosen module) do some sort of caching for me?
Last, I would like to emphasize again that I can surely do all this somehow, but I also want it to be quite fast (though not at too great an expense of readability).
Re: Which DateTime:: module to use? by flexvault (Vicar) on Aug 26, 2012 at 17:13 UTC
kejv2,
Just to help you get a better answer, if you use <code> and </code> and give 4 or 5 sample rows of the input data and then the expected results from that data, you'll get a lot of good to excellent answers for pointers. Just add it to the original post and we'll know what you're talking about.
Thank you
"Well done is better than well said." - Benjamin Franklin
Re: Which DateTime:: module to use? by moritz (Cardinal) on Aug 26, 2012 at 18:19 UTC
You can split out the values and pass them to Time::Local::timelocal (which is a core module), and get a UNIX timestamp back. That is just a number of seconds since a reference date, so to check whether two dates are 12 hours apart, you can just compare their difference against 12 * 60 * 60.
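A minimal sketch of that idea, under a few assumptions: it uses timegm rather than timelocal (treating the timestamps as UTC, which sidesteps daylight-saving jumps), the helper name row_to_epoch is made up for illustration, and the third sample row is hypothetical, added only so that a 12-hour boundary is reached. The deadline is computed once per window, so each row costs only one numeric comparison.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert a row such as "2011.01.13,21:25,..." to epoch seconds.
# Assumption: timestamps are treated as UTC (timegm, not timelocal).
sub row_to_epoch {
    my ($line) = @_;
    my ($year, $mon, $day, $hour, $min) =
        $line =~ /^(\d{4})\.(\d{2})\.(\d{2}),(\d{2}):(\d{2})/
        or die "Unparseable row: $line";
    return timegm(0, $min, $hour, $day, $mon - 1, $year);
}

my $window = 12 * 60 * 60;    # 12 hours in seconds

my @rows = (
    '2011.01.13,21:25,1.33508,1.33524,1.33470,1.33494,391',
    '2011.01.13,22:05,1.33524,1.33551,1.33512,1.33549,182',
    '2011.01.14,09:25,1.33524,1.33551,1.33512,1.33549,182',    # hypothetical row
);

my $deadline;
for my $row (@rows) {
    my $t = row_to_epoch($row);
    $deadline //= $t + $window;    # set once, when the first window opens
    if ($t >= $deadline) {
        print "window closed at: $row\n";
        $deadline = $t + $window;  # the boundary row starts the next window
    }
}
```

Whether UTC is the right interpretation depends on how the data file was produced; swapping timegm for timelocal keeps the rest of the logic unchanged.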
Re: Which DateTime:: module to use? by Kenosis (Deacon) on Aug 26, 2012 at 23:02 UTC
The Date::Calc module's Date_to_Time will convert your date/time information almost directly, with the addition of a trailing 0 for seconds. Given this, and your task, perhaps the following will assist your coding:
use Modern::Perl;
use Date::Calc qw/Date_to_Time/;

my $duration = 10 * 60;
my $fileName = 'data.txt';

open my $fh, '<', $fileName or die "Unable to open file: $!";

for ( ; ; ) {
    my ( $first, @lines );

    while ( my $line = <$fh> ) {
        my ( $date, $time, $values ) = split ',', $line, 3;
        my @dateTime   = "$date $time 0" =~ /(\d+)/g;
        my $timeInSecs = Date_to_Time @dateTime;
        $first = $timeInSecs unless $first;
        push @lines, "$timeInSecs\t$values";
        last if $timeInSecs - $first >= $duration;
    }

    # Start work with chunk of lines in array
    #
    do { /(\S+)\t(\S+)/; say "Seconds: $1; Values: $2" }
      for @lines;
    say '----------';
    #
    # End work with chunk of lines in array

    last if eof $fh;
}

close $fh;
Output (lines were added to your original data set):
Seconds: 1294953900; Values: 1.33508,1.33524,1.33470,1.33494,391
Seconds: 1294954020; Values: 1.33508,1.33524,1.33470,1.33494,391
Seconds: 1294954200; Values: 1.33494,1.33506,1.33447,1.33453,318
Seconds: 1294954320; Values: 1.33494,1.33506,1.33447,1.33453,318
Seconds: 1294954500; Values: 1.33453,1.33483,1.33417,1.33434,426
----------
Seconds: 1294954620; Values: 1.33453,1.33483,1.33417,1.33434,426
Seconds: 1294954800; Values: 1.33434,1.33468,1.33417,1.33467,309
Seconds: 1294954920; Values: 1.33434,1.33468,1.33417,1.33467,309
Seconds: 1294955100; Values: 1.33471,1.33493,1.33465,1.33465,233
Seconds: 1294955220; Values: 1.33434,1.33468,1.33417,1.33467,309
----------
Seconds: 1294955400; Values: 1.33465,1.33475,1.33443,1.33463,184
Seconds: 1294955520; Values: 1.33465,1.33475,1.33443,1.33463,184
Seconds: 1294955700; Values: 1.33463,1.33519,1.33463,1.33493,344
Seconds: 1294955820; Values: 1.33465,1.33475,1.33443,1.33463,184
Seconds: 1294956000; Values: 1.33494,1.33563,1.33489,1.33524,318
----------
Seconds: 1294956120; Values: 1.33494,1.33563,1.33489,1.33524,318
Seconds: 1294956300; Values: 1.33524,1.33551,1.33512,1.33549,182
----------
The script will read in N lines from a data file, based upon the value of $duration, which, in this case, is set to 10 minutes (always set $duration to seconds). The output shows clusters of lines within 10 minute intervals. The routine may grab one line beyond the duration, but hopefully that data granularity is sufficient for your analysis.
Not knowing how you want to work with your data, the push @lines, "$timeInSecs\t$values"; line can be changed to push only the raw lines (or whatever you may need) onto the array @lines.
Hope this helps!
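If grabbing one line beyond the duration matters, one possible variation is to let the boundary row open the next chunk instead of closing the current one. The following is only a sketch of that idea: the function name chunk_rows is hypothetical, and the core Time::Local::timegm stands in for Date_to_Time so that the example runs without extra modules.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Group rows into chunks that each span less than $duration seconds.
# A row at or past the boundary starts a new chunk rather than being
# appended to the current one.
sub chunk_rows {
    my ($duration, @rows) = @_;
    my (@chunks, $first);
    for my $row (@rows) {
        # The first five digit runs are year, month, day, hour, minute.
        my ($y, $m, $d, $H, $M) = $row =~ /(\d+)/g;
        my $t = timegm(0, $M, $H, $d, $m - 1, $y);
        if (!defined $first or $t - $first >= $duration) {
            push @chunks, [];
            $first = $t;
        }
        push @{ $chunks[-1] }, $row;
    }
    return @chunks;
}

my @rows = (
    '2011.01.13,21:25,1.33508,1.33524,1.33470,1.33494,391',
    '2011.01.13,21:30,1.33494,1.33506,1.33447,1.33453,318',
    '2011.01.13,21:35,1.33453,1.33483,1.33417,1.33434,426',
    '2011.01.13,21:40,1.33434,1.33468,1.33417,1.33467,309',
    '2011.01.13,21:45,1.33471,1.33493,1.33465,1.33465,233',
);

# Report how many rows landed in each 10-minute chunk.
print scalar(@$_), " row(s)\n" for chunk_rows(10 * 60, @rows);
```

With a 10-minute duration the five sample rows fall into chunks of 2, 2 and 1 rows. This version holds all rows in memory, so for a 1M-row file it would need to be adapted to read from the filehandle line by line.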
|
#!/usr/bin/perl
use strict;
use warnings;
BEGIN { #234567
    local $Date::Calc::XS_DISABLE = 1; #2
    require Date::Calc; #234567
    Date::Calc->import( qw/Date_to_Time/ ); #234567
} #234567
use Time::Local qw/timelocal_nocheck/; #1
my $file = $ARGV[0];
my ($time_frame) = $file =~ /(\d+)/;
$\ = "\n";
$, = ","; #56
my @dateTime; #56
my ($date, $time); #5
my ($year, $mon, $day, $hour, $min); #12347
for (1..10) {
    open DATA, $file or die "Can't open file $file";
    my ($t0, $t1, $sec_int, $min_int);
    while (<DATA>) {
        ($year, $mon, $day, $hour, $min) = /^(\d{4}).(\d{2}).(\d{2}),(\d{2}):(\d{2})/o; #123
        ($year, $mon, $day, $hour, $min) = /^(\d{4}).(\d{2}).(\d{2}),(\d{2}):(\d{2})/; #7
        ($year, $mon, $day, $hour, $min) = /^(\d+).(\d+).(\d+),(\d+):(\d+)/o; #4
        ($date, $time) = split ','; #5
        @dateTime = "$date $time 0" =~ /(\d+)/g; #5
        @dateTime = /^(\d{4}).(\d{2}).(\d{2}),(\d{2}):(\d{2})/o; #6
        $t1 = Date_to_Time @dateTime; #5
        $t1 = Date_to_Time(@dateTime,0); #6
        $t1 = Date_to_Time $year, $mon, $day, $hour, $min, 0; #2347
        $t1 = timelocal_nocheck(0, $min, $hour, $day, $mon-1, $year); #1
        if ( defined $t0 ) {
            $sec_int = $t1 - $t0;
            $min_int = int $sec_int/60;
            # warn "Leap second encountered around $year.$mon.$day $hour:$min" if $sec_int % 60;
            print "$year.$mon.$day $hour:$min - $min_int" #12347
            print @dateTime, " - ", $min_int #56
                unless $min_int == $time_frame or $min_int > 36*60;
        }
        $t0 = $t1;
    }
    close DATA;
}
It is a template in which the numbers in the trailing comments indicate which versions each line belongs to. For a given test run, every line that ends with such a comment but does not contain that run's version number is commented out. Versions 1-3 use the Time::Local, Date::Calc and Date::Calc::XS modules respectively. The remaining versions all use Date::Calc::XS and differ slightly in how the input is parsed. In each version I process the file 10 times to filter out (or rather spread out) the overhead of compiling and loading modules. Here is the shell script I use to measure the times:
#!/bin/bash
ROUNDS=$1
TIMEFORMAT=%R
OUT=bench.out
for VS in {1..7}; do
    for ((i=1;i<=$ROUNDS;i++)) do
        { time sed -r "/#[0-9]+$/ s/^/#/; /#[0-9]*${VS}[0-9]*$/ s/^#//" check_completeness.pl.tpl | perl - EURUSD5.csv >/dev/null; } 2>> $OUT
    done
    echo Version $VS:
    perl -ne'
        BEGIN{ my ($sum, $sumsq, $count) }
        $count++; $sum+=$_; $sumsq+=$_*$_;
        END{
            $mean = $sum/$count;
            $stddev = sprintf "%.4f", sqrt( $sumsq/$count - $mean*$mean );
            $mean = sprintf "%.3f", $mean;
            print "\tMean: ${mean}s\n\tStdDev: ${stddev}s\n\n"
        }
    ' $OUT
    rm $OUT
done
You can see that I also 'time' the generation of the script from the template (because I don't know a better way), but this is really negligible (about 10ms). When I run this shell script with the argument 30 (meaning each batch of 10 passes through the file is repeated 30 times) I get the following results:
Version 1:
Mean: 8.945s
StdDev: 0.2205s
Version 2:
Mean: 4.329s
StdDev: 0.1283s
Version 3:
Mean: 1.639s
StdDev: 0.0477s
Version 4:
Mean: 1.706s
StdDev: 0.0586s
Version 5:
Mean: 2.898s
StdDev: 0.1029s
Version 6:
Mean: 1.976s
StdDev: 0.0992s
Version 7:
Mean: 1.659s
StdDev: 0.0631s
I was really surprised at how slow the timelocal_nocheck function from the Time::Local module is (v1). If that is considered a core module, it must be quite badly written, because Date_to_Time from Date::Calc (v2) seems to do the same thing: convert a list of time values to an epoch time. The biggest surprise, however, came when I installed the Date::Calc::XS version with C internals (v3). You can see it's almost 3 times faster than the pure Perl version.
To comment on the other (more or less cosmetic) versions:
- (v4) - if you know the exact structure of your data, you should exploit it as much as possible.
- (v5) - good to know that the wrong way of parsing the data can have quite a big impact.
- (v6) - it's quite a surprise to me that storing the values captured by the regex in individual variables rather than in an array is considerably more efficient.
- (v7) - here, on the other hand, we see only a minimal difference between using the /o modifier on the regex and not using it.
If anyone has comments on the results of a particular benchmark version, or other ideas for improving performance still further, that would be great.
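As one possible alternative to timing whole script runs from the shell, the core Benchmark module can compare parsing variants inside a single process, which takes the compile and module-load overhead out of the measurement. The sketch below is illustrative only: the sub names parse_regex and parse_split are made up, and it measures just the parsing step (roughly the v3 and v5 styles), not the conversion to epoch time.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = '2011.01.13,21:25,1.33508,1.33524,1.33470,1.33494,391';

# Variant A: capture the fixed-width fields directly with one regex.
sub parse_regex {
    my ($y, $m, $d, $H, $M) =
        $_[0] =~ /^(\d{4})\.(\d{2})\.(\d{2}),(\d{2}):(\d{2})/;
    return ($y, $m, $d, $H, $M);
}

# Variant B: split on commas first, then collect every run of digits.
sub parse_split {
    my ($date, $time) = split /,/, $_[0];
    my @fields = "$date $time" =~ /(\d+)/g;
    return @fields;
}

# Run each variant for at least one CPU second and print a rate table.
cmpthese(-1, {
    regex => sub { my @f = parse_regex($line) },
    split => sub { my @f = parse_split($line) },
});
```

The absolute rates will differ from machine to machine; only the relative ordering in the printed table is meaningful, and it says nothing about file I/O, which the shell-based timing does include.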
|
You might want to tighten up your REs a little. '.' matches any character, including a digit, so it might be better to code a specific separator character, or perhaps use \s+ so that the separator is more explicit. Just a thought; it probably works fine as it is.
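For illustration, here is a sketch of a stricter pattern for the sample rows, assuming the separators are always a literal dot, comma and colon. The variable name $row_re and the malformed second input are invented for the example.

```perl
use strict;
use warnings;

# Literal '.', ',' and ':' separators; /x allows spacing for readability.
my $row_re = qr/^(\d{4}) \. (\d{2}) \. (\d{2}) , (\d{2}) : (\d{2}) ,/x;

for my $line ('2011.01.13,21:25,1.33508', '2011x01x13,21:25,1.33508') {
    if (my @f = $line =~ $row_re) {
        print "ok:  @f\n";
    }
    else {
        print "bad: $line\n";    # the unescaped '.' version would accept this
    }
}
```

The first line matches and yields the five fields; the second is rejected, whereas a pattern with a bare '.' between the digit groups would match it.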
Re: Which DateTime:: module to use? by Cristoforo (Chaplain) on Aug 27, 2012 at 14:03 UTC
Here are 2 additional solutions. One uses DateTime and the other uses Date::Parse and POSIX. I don't know about speed, but you could try them out and see. Which one is easiest to read might also be a consideration (as you stated).
#!/usr/bin/perl
use strict;
use warnings;
use DateTime::Format::Strptime;

my $dt = DateTime::Format::Strptime->new( pattern => '%Y.%m.%d %H:%M');

chomp(my @line = split /,/, <DATA>);
my @data = \@line;
my $beg = $dt->parse_datetime("@line[0,1]")->truncate( to => 'hour' );
my $end = $beg->clone->add(hours => 12);
printf "%s -+- %s\n", map tr/T/ /r, $beg, $end;

while (<DATA>) {
    chomp(my @line = split /,/);
    my $date = $dt->parse_datetime("@line[0,1]")->truncate( to => 'hour' );
    if ($date < $end) {
        push @data, \@line;
    }
    else {
        process(@data);
        @data = \@line;
        $end = $date->clone->add(hours => 12);
        printf "%s -+- %s\n", map tr/T/ /r, $date, $end;
    }
}
process(@data);

sub process {
    my @data = @_;
    # do something with data
    my @sum;
    my @idx = 2 .. 5;
    for my $line (@data) {
        for my $col (@idx) {
            $sum[$col] += $line->[$col];
        }
    }
    printf "Avg of %d lines", scalar @data;
    print +(map {sprintf "%10.5f", $sum[$_] / @data} @idx), "\n";
}
The Date::Parse solution is nearly the same.
#!/usr/bin/perl
use strict;
use warnings;
use Date::Parse qw/ str2time /;
use POSIX qw/ strftime /;

chomp(my @line = split /,/, <DATA>);
my @data = \@line;
my $beg = str2time("@line[0,1]");
$beg = int($beg / 3600) * 3600; # truncate minutes and seconds
my $end = $beg + 12 * 60 * 60; # add 12 hours
printf "%s -+- %s\n",
    map {strftime "%Y-%m-%d %H:%M:%S", localtime $_} $beg,$end;

while (<DATA>) {
    chomp(my @line = split /,/);
    my $date = str2time("@line[0,1]");
    $date = int($date / 3600) * 3600; # truncate minutes and seconds
    if ($date < $end) {
        push @data, \@line;
    }
    else {
        process(@data);
        @data = \@line;
        $end = $date + 12 * 60 * 60;
        printf "%s -+- %s\n",
            map {strftime "%Y-%m-%d %H:%M:%S", localtime $_} $date, $end;
    }
}
process(@data);

sub process {
    my @data = @_;
    # do something with data
    my @sum;
    my @idx = 2 .. 5;
    for my $line (@data) {
        for my $col (@idx) {
            $sum[$col] += $line->[$col];
        }
    }
    printf "Avg of %d lines", scalar @data;
    print +(map {sprintf "%10.5f", $sum[$_] / @data} @idx), "\n";
}
__DATA__
2011.01.13,21:25,1.33508,1.33524,1.33470,1.33494,391
2011.01.13,21:30,1.33494,1.33506,1.33447,1.33453,318
2011.01.13,21:35,1.33453,1.33483,1.33417,1.33434,426
2011.01.13,21:40,1.33434,1.33468,1.33417,1.33467,309
2011.01.13,21:45,1.33471,1.33493,1.33465,1.33465,233
2011.01.13,21:50,1.33465,1.33475,1.33443,1.33463,184
2011.01.13,21:55,1.33463,1.33519,1.33463,1.33493,344
2011.01.13,22:00,1.33494,1.33563,1.33489,1.33524,318
2011.01.13,22:05,1.33524,1.33551,1.33512,1.33549,182
2011.01.14,22:05,1.33524,1.33551,1.33512,1.33549,182
These produced the output:
C:\Old_Data\perlp>perl t33.pl
2011-01-13 21:00:00 -+- 2011-01-14 09:00:00
Avg of 9 lines 1.33478 1.33509 1.33458 1.33482
2011-01-14 22:00:00 -+- 2011-01-15 10:00:00
Avg of 1 lines 1.33524 1.33551 1.33512 1.33549