http://www.perlmonks.org?node_id=1197142

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I am working with some data in which I need to take an average every 12 lines (every 5 seconds) and print that average out along with the time that the data was recorded (so one reading every minute). I have done this in 2 ways but neither are particularly neat or speedy. I am relatively new to Perl so I apologise to anyone who thinks I have butchered the code. The data looks as follows:
acceleration (mg) - 2013-10-09 10:00:00 - 2013-10-16 09:59:55 - sample +Rate = 5 seconds, imputed 46.3, 0 17.1, 0 30.1, 0 38.4, 0 97.1, 0 87.3, 0 84, 0 78.5, 0 67.9, 0 83.5, 0 155, 0 103.5, 0
The following is my first method - which as you can see is very chunky and just a bit of a bodge job.
#!/usr/bin/perl use strict; use warnings; use Getopt::Long; use Date::Calc qw(:all); my $input_file = undef; GetOptions ( "input=s" => \$input_file, ); open INPUT, $input_file or die "Can't open $input_file for input: $!"; my @files = undef; #list of files to run script on while(my $line = <INPUT>){ chomp($line); push @files, $line; } for(my $i=1; $i < scalar @files; $i++) { if(-e $files[$i]){ open IN, "gunzip -c $files[$i]|" or die "Can't open $files[$i] for input: $!"; open OUT, "> $files[$i]_out.txt" or die "Can't open $files[$i]_out.txt for output: $!"; my $firstline = 1; my @dt = undef; # date and time my @date1 = undef; my @date2 = undef; if($firstline==1){ #extract header line and get start a +nd end date from line chomp(my $line = <IN>); my @header = split(/,/, $line); @dt= $header[0] =~ /(\d+)/g; @date1 = ($dt[0], $dt[1], $dt[2], $dt[3], $dt[4], $dt[5]); @date2 = ($dt[6], $dt[7], $dt[8], $dt[9], $dt[10], $dt[11] +); $firstline = 0; } else { <IN> for 1..1 } print OUT "Date\tTime\tDay\tmg\n"; #print title my $count = 0; while(my $line1 = <IN>){ #whilst reading f +ile get 12 lines and print time and print average of 12 lines chomp($line1); my @mg1 = split(/,/, $line1); my $line2 = <IN>; chomp($line2); my @mg2 = split(/,/, $line2); my $line3 = <IN>; chomp($line3); my @mg3 = split(/,/, $line3); my $line4 = <IN>; chomp($line4); my @mg4 = split(/,/, $line4); my $line5 = <IN>; chomp($line5); my @mg5 = split(/,/, $line5); my $line6 = <IN>; chomp($line6); my @mg6 = split(/,/, $line6); my $line7 = <IN>; chomp($line7); my @mg7 = split(/,/, $line7); my $line8 = <IN>; chomp($line8); my @mg8 = split(/,/, $line8); my $line9 = <IN>; chomp($line9); my @mg9 = split(/,/, $line9); my $line10 = <IN>; chomp($line10); my @mg10 = split(/,/, $line10); my $line11 = <IN>; chomp($line11); my @mg11 = split(/,/, $line11); my $line12 = <IN>; chomp($line12); my @mg12 = split(/,/, $line12); my ($y, $mo, $d, $h, $m, $s) = Add_Delta_DHMS(@date1, 0, 0 +, $count, 0); printf OUT qq(%d-%02d-%02d %02d:%02d:%02d), $y, $mo, $d, $ +h, $m, $s; print OUT "\t" . Day_of_Week($y, $mo, $d); print OUT "\t" . (($mg1[0]+$mg2[0]+$mg3[0]+$mg4[0]+$mg5[0] ++$mg6[0]+$mg7[0]+$mg8[0]+$mg9[0]+$mg10[0]+$mg11[0]+$mg12[0])/12) . "\ +n"; $count +=1 } close IN; close OUT } }
My second code is a bit more streamline but it takes longer to run.
for(my $i=1; $i < scalar @files; $i++) { #everything same as before un +til we get to reading the files if(-e $files[$i]){ open IN, "gunzip -c $files[$i]|" or die "Can't open $files[$i] for input: $!"; open OUT, "> $files[$i]_out.txt" or die "Can't open $files[$i]_out.txt for output: $!"; my $firstline = 1; my @dt = undef; # date and time my @date1 = undef; my @date2 = undef; if($firstline==1){ chomp(my $line = <IN>); my @header = split(/,/, $line); @dt= $header[0] =~ /(\d+)/g; @date1 = ($dt[0], $dt[1], $dt[2], $dt[3], $dt[4], $dt[5]); @date2 = ($dt[6], $dt[7], $dt[8], $dt[9], $dt[10], $dt[11] +); $firstline = 0; } else { <IN> for 1..1 } print OUT "Date\tTime\tDay\tmg\n"; my $count = 0; my $avg_count = 0; while(<IN>){ #this is w +here it changes, instead of doing the same thing 12 times I say if th +e line number (-1 due to header) modulos 12 is 0 then print the avera +ge and set it back to 0 my ($y, $mo, $d, $h, $m, $s) = Add_Delta_DHMS(@date1, 0, 0 +, $count, 0); chomp($_); my @line = split(/,/, $_); $avg_count += $line[0]; if((($.)-1)%12 ==0){ printf OUT qq(%d-%02d-%02d %02d:%02d:%02d), $y, $mo, $ +d, $h, $m, $s; print OUT "\t" . Day_of_Week($y, $mo, $d); print OUT "\t" . ($avg_count/12) . "\n"; $count +=1; $avg_count =0; } } close IN; close OUT } }
I know that is probably a lot to read and probably doesn't make sense, and I apologise. The scripts get the job done but I am looking for help in improving my skills and making everything clearer. The 1st code takes 1.8 seconds to run (file has 120960 lines) and second takes 9 seconds to run for the same file. Any help would be greatly appreciated, as this needs to be run for about 100,000 files.

Replies are listed 'Best First'.
Re: What can I do to improve my code - I'm a beginner
by AnomalousMonk (Archbishop) on Aug 10, 2017 at 14:16 UTC
    my $input_file = undef;
    ...
    my @files = undef;

    These are examples of types of statements that I see in several places in your code and that I consider programming tics that should be addressed with psychotherapy or powerful behavior modifying drugs.

    The first,
        my $input_file = undef;
    defines a lexical scalar variable that is default-initialized to undef — and then explicitly initializes the variable to undef. I see no point to the explicit initialization, but it can do no harm.

    The second type of statement,
        my @files = undef;
    can potentially do some damage because it doesn't do what I think you think it does. A lexical array is defined in an empty state by default, but this statement explicitly initializes the array with a single undef element.

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le "my @files = undef; ;; dd \@files; ;; print 'number of elements in array: ', scalar(@files); " [undef] number of elements in array: 1
    Is this what you expect and want? I thought not. In the example code you posted, this semantic error, by good fortune, does no harm (that I can see), but it is the kind of error that can bite you in the ass at any time given the opportunity.

    Update: If you want to go the explicit useless initialization route for list-type variables, the correct syntax is
        my @array = ();
        my %hash  = ();
    but again, these statements | initializations would IMHO just be more evidence of a need for medical intervention. The only circumstance I can see in which such statements | initializations can barely be justified is when an initial empty state is vital to the correct operation of a succeeding algorithm and you don't want to bother emphasizing this fact by going to all the trouble of typing a comment.


    Give a man a fish:  <%-{-{-{-<

      To the psychiatrist, I go for my conversion therapy then! But seriously thank you for the constructive criticism. As I say I'm still very much a beginner and it is comments like these which are going to make me better. Could you please explain something for me though? Regarding setting the array to undef - I completely understand what you are saying. I was wondering why the array was 1 element longer than what it was supposed to be and now I know. I also have an understanding with what you are saying with the lexical scalar variable $input_file. But I am not sure what I need to do to get around these. As you can see from my code, I need them to be global variables, so I define them outside of the WHILE loops. If you say that you see them as redundant can you tell me how to get around doing this? Once again, thanks!
        ... I am not sure what I need to do to get around these. ... I need them to be global variables, so I define them outside of the WHILE loops. If you say that you see them as redundant can you tell me how to get around doing this?

        Please understand that what I see as redundant is useless explicit initialization of a variable. In the case of
            my $input_file = undef;
        the assignment does exactly what is done by default; it is purely redundant.

        Semantically erroneous statements like
            my @files = undef;
        are redundant in the present circumstances because the very next thing you do with these arrays (as far as I can see) is to assign them valid data, thus undoing the erroneous initialization. In other circumstances, the erroneous initialization may lead to a nasty bug.

        ... I need them to be global variables, so I define them outside of the WHILE loops.

        Of course, you need to define lexical variables where you need to use them. On this note, you only use the  @dt lexical within the scope of the  if($firstline==1){ ... } statement block, so that's exactly where I would define it:

        if($firstline==1){ ... my @dt = $header[0] =~ /(\d+)/g; ... } else { <IN> for 1..1 }
        (Incidentally, the  <IN> for 1..1 statement is needlessly involved: if you just want to read and discard one line, the simple  <IN>; statement does the trick.)


        Give a man a fish:  <%-{-{-{-<

Re: What can I do to improve my code - I'm a beginner (Updated)
by thanos1983 (Parson) on Aug 10, 2017 at 10:39 UTC

    Hello Anonymous Monk,

    Some commends that could improve the speed of your script. Unfortunately I do not have the time to review it all and commend as many things I could suggest (sorry for that) but with a quick look:

    Regarding your while loop. Since you are using a while loop (meaning that automatically) will read one line at a time you are assigning next line and repeating the same process for the first 12 lines, why not use also an if condition based on line number. Sample bellow:

    The data that I used are coming from Re: Multiple values for a single key (Updated), but it should work out of the box for your case also.

    By creating a HASHES OF ARRAYS you have the ability to extract the keys and values easier.

    Update: Or if you prefer to reduce it by one line more and create HASHES OF ARRAYS and use as a key the line number (for easier data retrieval) you can do it like this. Sample bellow:

    Update2: You can reduce to minimum, just check if line contains comma (process) else skip.

    Update3: Even further:

    Update4: Line numbering reset, sorry just remembered you said you want to read every 12 lines a file with thousands of lines:

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
      Thanks for your reply! As I say I'm a beginner so this took quite a while to digest but now I've re-read it I'm starting to understand what you are saying. Thank you for your help!
Re: What can I do to improve my code - I'm a beginner
by Marshall (Canon) on Aug 13, 2017 at 09:59 UTC
    Hi Anon Monk!
    I encourage you to sign up for an account. An account is free and helps at least me understand who is posting what...

    I found your code hard to follow. I present an alternative below and will offer a few comments.

    • I have no comments about the command line and Get::Opts. My code starts with a list of files to process.
    • The use of array subscripts is rare in Perl. My code below does have one such use to index the translation of a number into a string representing the day of week.
      Your code has an inappropriate and I think an incorrect use. for(my $i=1; $i < scalar @files; $i++) I think $i=0 would be correct. But in any event, in Perl use a foreach iterator over an array.
    • Instead of complicated flags like if($firstline==1){}, call a subroutine to deal with the first line.
    • Then setup a loop to process the 12 line chunks. I show a very standard way to do this. process_one_minute() will be called repeatedly until there a no complete 12 line chunks left.
    • I am sure that there are some formatting and spacing issues with the printout.
    • But I hope that his helps..
    • Some Code: The Input: The Output:
Re: What can I do to improve my code - I'm a beginner
by Anonymous Monk on Aug 10, 2017 at 14:47 UTC
    @date1 = ($dt[0], $dt[1], $dt[2], $dt[3], $dt[4], $dt[5]);
    This kind of thing is redundant and error-prone. That's why we have array slices!
    @date1 = @dt[0..5];
    Another redundancy:
    chomp($line1); my @mg1 = split(/,/, $line1); my $line2 = <IN>; chomp($line2); my @mg2 = split(/,/, $line2); my $line3 = <IN>; chomp($line3); ...
    You've replaced it with a loop that calls Add_Delta_DHMS on every line, not just once per twelve lines, so it's not equivalent at all. I would tend to write something like this:
    my ($sum) = split /,/, $line; for (2..12) { my ($val) = split /,/, <IN>; $sum += $val; }
      That array slice thing makes so much sense! Thank you! Also I now understand why my second code was running a lot slower - because I was calling the Add_Delta_DHMS every line. You've been a big help! Thank you!
Re: What can I do to improve my code - I'm a beginner
by Anonymous Monk on Aug 12, 2017 at 17:54 UTC

    Some other comments, suggestions:

    - There is this section where you read in the list of input files:

    my @files = undef; #list of files to run script on while(my $line = <INPUT>){ chomp($line); push @files, $line; }

    The @files = undef part was addressed already, but there is one more typical beginner pattern here. You read the file line by line then push every line to your array. This is inefficient and unnecessary. The <> operator in list context will read the entire file into the array for you, and you can use chomp on the array to remove newlines from every line. So a simple

    my @files = <INPUT>; chomp @files;

    is enough.

    - I bet your script spends most of its time in the Add_Delta_DHMS function. Because you keep your dates in their complicated, human readable formats, you have to do complicated date-processing arithmetic every time you want to add 5 seconds to them. (It was the right call to use a module like Date::Calc instead of rolling your own buggy time increment code, but you pay the price for it: it's slow.) It would be better to convert the date/time you read from the first line of the file to a Unix timestamp with Time::Local, so that incrementing the rolling timestamp simply become $t += 5;, then use strftime or localtime when you print the date.

    - Those chomps in the inner loop are unnecessary. If you really want to pare it down, and you are sure about your input format, you can even do away with the splits. If you have a string like "45.6, foo", and force it into scalar context (e.g. with an operator like +), Perl will take the 45.6 and ignore the rest. So you can even do something like

    my $avg = <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN>; $avg /= 12;
      Honestly - thank you so much. I wondered why it was taking so much longer and this answered my question. Thank you!