Thanks to all for their suggestions! In the end I decided to first write a shorter script that goes through the ~50k lines of my input file and checks for gaps in the arithmetic progression of timestamps. I made several versions of this script in order to benchmark different external modules and also a few techniques, and I will choose the best one for the final programme. Here is the testing script:
#!/usr/bin/perl
use strict;
use warnings;
BEGIN { #234567
local $Date::Calc::XS_DISABLE = 1; #2
require Date::Calc; #234567
Date::Calc->import( qw/Date_to_Time/ ); #234567
} #234567
use Time::Local qw/timelocal_nocheck/; #1
my $file = $ARGV[0];
my ($time_frame) = $file =~ /(\d+)/;
$\ = "\n";
$, = ","; #56
my @dateTime; #56
my ($date, $time); #5
my ($year, $mon, $day, $hour, $min); #12347
for (1..10) {
open DATA, '<', $file or die "Can't open file $file: $!";
my ($t0, $t1, $sec_int, $min_int);
while (<DATA>) {
($year, $mon, $day, $hour, $min) = /^(\d{4}).(\d{2}).(\d{2}),(\d{2}):(\d{2})/o; #123
($year, $mon, $day, $hour, $min) = /^(\d{4}).(\d{2}).(\d{2}),(\d{2}):(\d{2})/; #7
($year, $mon, $day, $hour, $min) = /^(\d+).(\d+).(\d+),(\d+):(\d+)/o; #4
($date, $time) = split ','; #5
@dateTime = "$date $time 0" =~ /(\d+)/g; #5
@dateTime = /^(\d{4}).(\d{2}).(\d{2}),(\d{2}):(\d{2})/o; #6
$t1 = Date_to_Time @dateTime; #5
$t1 = Date_to_Time(@dateTime,0); #6
$t1 = Date_to_Time $year, $mon, $day, $hour, $min, 0; #2347
$t1 = timelocal_nocheck(0, $min, $hour, $day, $mon-1, $year); #1
if ( defined $t0 ) {
$sec_int = $t1 - $t0;
$min_int = int $sec_int/60;
# warn "Leap second encountered around $year.$mon.$day $hour:$min" if $sec_int % 60;
print "$year.$mon.$day $hour:$min - $min_int" #12347
print @dateTime, " - ", $min_int #56
unless $min_int == $time_frame or $min_int > 36*60;
}
$t0 = $t1;
}
close DATA;
}
It is a template: the digits in a line's trailing comment indicate which versions that line belongs to. For a given test run, every line that has such a trailing comment but does not contain that run's version number is commented out. Versions 1-3 use the Time::Local, Date::Calc (pure Perl) and Date::Calc::XS modules respectively. The remaining versions all use Date::Calc::XS and differ slightly in how the input is parsed. In each version I process the file 10 times to spread out the overhead of compiling and loading modules. Here is the shell script with which I measure the times:
#!/bin/bash
ROUNDS=$1
TIMEFORMAT=%R
OUT=bench.out
for VS in {1..7}; do
for ((i=1;i<=$ROUNDS;i++)) do
{ time sed -r "/#[0-9]+$/ s/^/#/; /#[0-9]*${VS}[0-9]*$/ s/^#//" check_completeness.pl.tpl | perl - EURUSD5.csv >/dev/null; } 2>> $OUT
done
echo Version $VS:
perl -ne'
BEGIN{ my ($sum, $sumsq, $count) }
$count++; $sum+=$_; $sumsq+=$_*$_;
END{
$mean = $sum/$count;
$stddev = sprintf "%.4f", sqrt( $sumsq/$count - $mean*$mean );
$mean = sprintf "%.3f", $mean;
print "\tMean: ${mean}s\n\tStdDev: ${stddev}s\n\n"
}
' $OUT
rm $OUT
done
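To make the template mechanism easier to follow, here is a minimal Perl sketch of what the sed expression does to each template line. The helper name select_version and the three sample lines are only for illustration; the benchmark itself uses sed as shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the same two substitutions as the sed
# expression to one template line, for version number $vs.
sub select_version {
    my ($line, $vs) = @_;
    # Step 1: comment out every line that ends in a "#digits" tag.
    $line =~ s/^/#/ if $line =~ /#[0-9]+$/;
    # Step 2: uncomment it again if the tag contains this version's digit.
    $line =~ s/^#// if $line =~ /#[0-9]*${vs}[0-9]*$/;
    return $line;
}

# Three lines taken from the template: tagged for v1 only,
# tagged for v2/3/4/7, and untagged.
my @template = (
    'use Time::Local qw/timelocal_nocheck/; #1',
    '$t1 = Date_to_Time $year, $mon, $day, $hour, $min, 0; #2347',
    'if ( defined $t0 ) {',
);

print select_version($_, 3), "\n" for @template;
```

For version 3 the first line comes out commented, while the other two are left as they are. Because the match is on single digits, this scheme only works for up to nine versions.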
As you can see, I also 'time' the generation of the script from the template (because I don't know a better way), but this is negligible (about 10ms). When I run this shell script with the argument 30 (meaning each batch of 10 passes through the file is repeated 30 times), I get the following results:
Version 1:
Mean: 8.945s
StdDev: 0.2205s
Version 2:
Mean: 4.329s
StdDev: 0.1283s
Version 3:
Mean: 1.639s
StdDev: 0.0477s
Version 4:
Mean: 1.706s
StdDev: 0.0586s
Version 5:
Mean: 2.898s
StdDev: 0.1029s
Version 6:
Mean: 1.976s
StdDev: 0.0992s
Version 7:
Mean: 1.659s
StdDev: 0.0631s
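For reference, the Mean and StdDev figures come from running sums, which gives the population standard deviation (dividing by N, not N-1). A standalone sketch of the same calculation follows; the sub name mean_stddev and the three sample timings are made up for illustration and are not measured values.

```perl
use strict;
use warnings;

# Hypothetical helper: mean and population standard deviation computed
# from running sums, as in the one-liner of the shell script.
sub mean_stddev {
    my @samples = @_;
    my ($sum, $sumsq, $count) = (0, 0, 0);
    for my $x (@samples) {
        $count++;
        $sum   += $x;
        $sumsq += $x * $x;
    }
    my $mean = $sum / $count;
    my $var  = $sumsq / $count - $mean * $mean;
    $var = 0 if $var < 0;    # guard against tiny negative values from rounding
    return ($mean, sqrt $var);
}

# Made-up timings in seconds, not taken from the benchmark.
my ($mean, $stddev) = mean_stddev(1.60, 1.64, 1.68);
printf "Mean: %.3fs\nStdDev: %.4fs\n", $mean, $stddev;
```

With 30 samples per version the difference between dividing by N and by N-1 is small, so it should not change the ranking of the versions.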
I was really surprised how slow the timelocal_nocheck function from the Time::Local module is (v1). If that is considered a core module, it must be quite badly written, because Date_to_Time from Date::Calc (v2) seems to do the same job: convert a list of time values to an epoch time. The biggest surprise, however, came when I installed the Date::Calc::XS version with C internals (v3). You can see it is about 2.6 times faster than the pure Perl version (1.639s vs. 4.329s).
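One caveat on "does the same": as far as I can tell from the Date::Calc documentation, Date_to_Time is described as a replacement for timegm() and ignores the local time zone, whereas timelocal_nocheck interprets its arguments as local time. A closer counterpart in Time::Local would therefore be timegm_nocheck, which I have not benchmarked here. The sketch below only illustrates that the two functions return different values; the TZ setting is an arbitrary assumption for the demonstration.

```perl
use strict;
use warnings;
use POSIX qw/tzset/;
use Time::Local qw/timelocal_nocheck timegm_nocheck/;

# Assumption for illustration only: a fixed UTC+1 zone without DST,
# given as a POSIX TZ string (the zone name "XXX" is arbitrary).
$ENV{TZ} = 'XXX-1';
tzset();

# 2010-01-15 12:00:00 in Time::Local argument order:
# (sec, min, hour, mday, month-1, year)
my @when = (0, 0, 12, 15, 0, 2010);

my $local = timelocal_nocheck(@when);    # arguments read as local time
my $utc   = timegm_nocheck(@when);       # arguments read as UTC

print "timegm - timelocal = ", $utc - $local, " seconds\n";
```

For gap checking this also matters for correctness: with local time, a daylight saving switch could show up as a one-hour jump that is not really in the data.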
To comment on other (more or less cosmetic) versions:
- (v4) - if you know the exact structure of your data, you should exploit it as much as possible.
- (v5) - good to know that a poor way of parsing the data can have quite a big impact.
- (v6) - it is quite a surprise to me that capturing the regex matches into individual variables (v3, 1.639s) rather than into an array (1.976s) is considerably more efficient.
- (v7) - here, on the other hand, we see only a minimal difference between using the /o modifier on the regex or not. That is not too surprising, since /o only matters for patterns that interpolate variables.
If anyone has comments on the results of any of the benchmarked versions, or other ideas on how to improve performance still further, that would be great.
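One idea I have not tried yet for keeping the sed step and perl startup out of the measurement is to time the loop inside the script itself with the core Time::HiRes module. This is only a sketch: one_pass is a hypothetical stand-in for a single pass over the input file.

```perl
use strict;
use warnings;
use Time::HiRes qw/gettimeofday tv_interval/;

# Hypothetical stand-in for one pass over the input file.
sub one_pass {
    my $n = 0;
    $n += $_ for 1 .. 100_000;
    return $n;
}

my $start = [gettimeofday];              # wall-clock time before the loop
one_pass() for 1 .. 10;                  # same 10 repetitions as in the template
my $elapsed = tv_interval($start);       # seconds elapsed since $start, as a float

# Report on STDERR so that STDOUT stays free for the gap report.
printf STDERR "%.3f\n", $elapsed;
```

The shell script would then collect this number in place of the output of 'time', so the 'time' prefix would have to be dropped.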