What can I do to improve my code

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I am working with some data that is sampled every 5 seconds. I need to average every 12 lines (one minute of readings) and print that average along with the time the data was recorded, giving one reading per minute. I have done this in two ways, but neither is particularly neat or speedy. I am relatively new to Perl, so I apologise to anyone who thinks I have butchered the code. The data looks as follows:

acceleration (mg) - 2013-10-09 10:00:00 - 2013-10-16 09:59:55 - sampleRate = 5 seconds,    imputed
46.3,    0
17.1,    0
30.1,    0
38.4,    0
97.1,    0
87.3,    0
84,    0
78.5,    0
67.9,    0
83.5,    0
155,    0
103.5,    0

The following is my first method - which as you can see is very chunky and just a bit of a bodge job.

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Date::Calc qw(:all);

my $input_file = undef;

GetOptions (
    "input=s" => \$input_file,
);
open INPUT, $input_file
    or die "Can't open $input_file for input: $!";

my @files = undef;   #list of files to run script on
while(my $line = <INPUT>){
    chomp($line);
    push @files, $line;
}


for(my $i=1; $i < scalar @files; $i++) {
    if(-e $files[$i]){
        open IN, "gunzip -c $files[$i]|"
            or die "Can't open $files[$i] for input: $!";

        open OUT, "> $files[$i]_out.txt"
            or die "Can't open $files[$i]_out.txt for output: $!";

        my $firstline = 1;
        my @dt = undef; # date and time
        my @date1 = undef;
        my @date2 = undef;

        if($firstline==1){        #extract header line and get start and end date from line
            chomp(my $line = <IN>);
            my @header = split(/,/, $line);
            @dt= $header[0] =~ /(\d+)/g;
            @date1 = ($dt[0], $dt[1], $dt[2], $dt[3], $dt[4], $dt[5]);
            @date2 = ($dt[6], $dt[7], $dt[8], $dt[9], $dt[10], $dt[11]);
            $firstline = 0;
        } else {
            <IN> for 1..1
        }
        print OUT "Date\tTime\tDay\tmg\n";    #print title

        my $count = 0;

        while(my $line1 = <IN>){                     #whilst reading file get 12 lines and print time and print average of 12 lines
            chomp($line1);
            my @mg1 = split(/,/, $line1);
            my $line2 = <IN>;
            chomp($line2);
            my @mg2 = split(/,/, $line2);
            my $line3 = <IN>;
            chomp($line3);
            my @mg3 = split(/,/, $line3);
            my $line4 = <IN>;
            chomp($line4);
            my @mg4 = split(/,/, $line4);
            my $line5 = <IN>;
            chomp($line5);
            my @mg5 = split(/,/, $line5);
            my $line6 = <IN>;
            chomp($line6);
            my @mg6 = split(/,/, $line6);
            my $line7 = <IN>;
            chomp($line7);
            my @mg7 = split(/,/, $line7);
            my $line8 = <IN>;
            chomp($line8);
            my @mg8 = split(/,/, $line8);
            my $line9 = <IN>;
            chomp($line9);
            my @mg9 = split(/,/, $line9);
            my $line10 = <IN>;
            chomp($line10);
            my @mg10 = split(/,/, $line10);
            my $line11 = <IN>;
            chomp($line11);
            my @mg11 = split(/,/, $line11);
            my $line12 = <IN>;
            chomp($line12);
            my @mg12 = split(/,/, $line12);
            my ($y, $mo, $d, $h, $m, $s) = Add_Delta_DHMS(@date1, 0, 0, $count, 0);
            printf OUT qq(%d-%02d-%02d %02d:%02d:%02d), $y, $mo, $d, $h, $m, $s;
            print OUT "\t" . Day_of_Week($y, $mo, $d);
            print OUT "\t" . (($mg1[0]+$mg2[0]+$mg3[0]+$mg4[0]+$mg5[0]+$mg6[0]+$mg7[0]+$mg8[0]+$mg9[0]+$mg10[0]+$mg11[0]+$mg12[0])/12) . "\n";
            $count +=1
        }
        close IN;
        close OUT
    }
}

My second version is a bit more streamlined, but it takes longer to run.

for(my $i=1; $i < scalar @files; $i++) { #everything same as before until we get to reading the files
    if(-e $files[$i]){
        open IN, "gunzip -c $files[$i]|"
            or die "Can't open $files[$i] for input: $!";

        open OUT, "> $files[$i]_out.txt"
            or die "Can't open $files[$i]_out.txt for output: $!";

        my $firstline = 1;
        my @dt = undef; # date and time
        my @date1 = undef;
        my @date2 = undef;

        if($firstline==1){
            chomp(my $line = <IN>);
            my @header = split(/,/, $line);
            @dt= $header[0] =~ /(\d+)/g;
            @date1 = ($dt[0], $dt[1], $dt[2], $dt[3], $dt[4], $dt[5]);
            @date2 = ($dt[6], $dt[7], $dt[8], $dt[9], $dt[10], $dt[11]);
            $firstline = 0;
        } else {
            <IN> for 1..1
        }
        print OUT "Date\tTime\tDay\tmg\n";

        my $count = 0;
        my $avg_count = 0;

        # This is where it changes: instead of doing the same thing 12 times, I say that if the
        # line number (-1 due to header) modulo 12 is 0, then print the average and set it back to 0.
        while(<IN>){
            my ($y, $mo, $d, $h, $m, $s) = Add_Delta_DHMS(@date1, 0, 0, $count, 0);
            chomp($_);
            my @line = split(/,/, $_);
            $avg_count += $line[0];
            if((($.)-1)%12 ==0){
                printf OUT qq(%d-%02d-%02d %02d:%02d:%02d), $y, $mo, $d, $h, $m, $s;
                print OUT "\t" . Day_of_Week($y, $mo, $d);
                print OUT "\t" . ($avg_count/12) . "\n";
                $count +=1;
                $avg_count =0;
            }
        }
        close IN;
        close OUT
    }
}

I know that is probably a lot to read and probably doesn't make sense, and I apologise. The scripts get the job done but I am looking for help in improving my skills and making everything clearer. The 1st code takes 1.8 seconds to run (file has 120960 lines) and second takes 9 seconds to run for the same file. Any help would be greatly appreciated, as this needs to be run for about 100,000 files.


Replies are listed 'Best First'.
Re: What can I do to improve my code - I'm a beginner by AnomalousMonk (Archbishop) on Aug 10, 2017 at 14:16 UTC
`my $input_file = undef;`
`...`
`my @files = undef;`

These are examples of statement types that I see in several places in your code and that I consider programming tics that should be addressed with psychotherapy or powerful behavior-modifying drugs.

The first, `my $input_file = undef;`, defines a lexical scalar variable that is default-initialized to undef, and then explicitly initializes the variable to undef. I see no point to the explicit initialization, but it can do no harm.

The second type of statement, `my @files = undef;`, can potentially do some damage because it doesn't do what I think you think it does. A lexical array is defined in an empty state by default, but this statement explicitly initializes the array with a single undef element.

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
      "my @files = undef;
       ;;
       dd \@files;
       ;;
       print 'number of elements in array: ', scalar(@files);
      "
    [undef]
    number of elements in array: 1

Is this what you expect and want? I thought not. In the example code you posted, this semantic error, by good fortune, does no harm (that I can see), but it is the kind of error that can bite you in the ass at any time given the opportunity.

Update: If you want to go the explicit useless initialization route for list-type variables, the correct syntax is

    my @array = ();
    my %hash = ();

but again, these initializations would IMHO just be more evidence of a need for medical intervention. The only circumstance I can see in which such initializations can barely be justified is when an initial empty state is vital to the correct operation of a succeeding algorithm and you don't want to bother emphasizing this fact by going to all the trouble of typing a comment.

Give a man a fish: `<%-{-{-{-<`
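To connect this to the posted script, here is a minimal sketch comparing the two declarations once file names are pushed onto the array. The file names are invented for illustration and do not come from the original post.

```perl
use strict;
use warnings;

# Hypothetical file names, used only for illustration.
my @names = ('a.csv.gz', 'b.csv.gz');

my @with_undef = undef;    # starts with ONE element, which is undef
push @with_undef, @names;

my @empty;                 # starts with ZERO elements
push @empty, @names;

printf "declared with '= undef': %d elements\n", scalar @with_undef;
printf "declared with plain my:  %d elements\n", scalar @empty;
```

The stray element at index 0 would explain why the posted loop starts at `$i=1`: it skips the `undef` that the declaration put there.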
Re^2: What can I do to improve my code - I'm a beginner by Anonymous Monk on Aug 11, 2017 at 08:43 UTC
To the psychiatrist I go for my conversion therapy, then! But seriously, thank you for the constructive criticism. As I say, I'm still very much a beginner, and it is comments like these that are going to make me better.

Could you please explain something for me though? Regarding setting the array to undef, I completely understand what you are saying. I was wondering why the array was one element longer than it was supposed to be, and now I know. I also understand what you are saying about the lexical scalar variable $input_file. But I am not sure what I need to do to get around these. As you can see from my code, I need them to be global variables, so I define them outside of the WHILE loops. If you say that you see them as redundant, can you tell me how to get around doing this? Once again, thanks!
Re^3: What can I do to improve my code - I'm a beginner by AnomalousMonk (Archbishop) on Aug 11, 2017 at 14:53 UTC
"... I am not sure what I need to do to get around these. ... I need them to be global variables, so I define them outside of the WHILE loops. If you say that you see them as redundant can you tell me how to get around doing this?"

Please understand that what I see as redundant is useless explicit initialization of a variable. In the case of `my $input_file = undef;` the assignment does exactly what is done by default; it is purely redundant. Semantically erroneous statements like `my @files = undef;` are redundant in the present circumstances because the very next thing you do with these arrays (as far as I can see) is to assign them valid data, thus undoing the erroneous initialization. In other circumstances, the erroneous initialization may lead to a nasty bug.

"... I need them to be global variables, so I define them outside of the WHILE loops."

Of course, you need to define lexical variables where you need to use them. On this note, you only use the `@dt` lexical within the scope of the `if($firstline==1){ ... }` statement block, so that's exactly where I would define it:

    if($firstline==1){
        ...
        my @dt = $header[0] =~ /(\d+)/g;
        ...
    } else {
        <IN> for 1..1
    }

(Incidentally, the `<IN> for 1..1` statement is needlessly involved: if you just want to read and discard one line, the simple `<IN>;` statement does the trick.)
Re: What can I do to improve my code - I'm a beginner (Updated) by thanos1983 (Parson) on Aug 10, 2017 at 10:39 UTC
Hello Anonymous Monk,

Some comments that could improve the speed of your script. Unfortunately I do not have the time to review all of it and comment on everything I could suggest (sorry for that), but at a quick look:

Regarding your while loop: a while loop already reads one line at a time, yet inside it you read the next line by hand and repeat the same steps for each of the 12 lines. Why not use an if condition based on the line number instead? Sample below:

[code sample not shown, 2 kB]

The data that I used comes from Re: Multiple values for a single key (Updated), but it should work out of the box for your case as well.

[code sample not shown, 308 bytes]

By creating a HASH OF ARRAYS you can extract the keys and values more easily.

Update: Or, if you prefer to shorten it by one more line, create a HASH OF ARRAYS and use the line number as the key (for easier data retrieval). Sample below:

[code sample not shown, 2 kB]

Update2: You can reduce it to the minimum: just check whether the line contains a comma (process it), else skip it.

[code sample not shown, 471 bytes]

Update3: Even further:

[code sample not shown, 461 bytes]

Update4: Line numbering reset; sorry, I just remembered that you want to process a file with thousands of lines in groups of 12:

[code sample not shown, 461 bytes]

Hope this helps, BR.

Seeking for Perl wisdom...on the process of learning...not there...yet!
Re^2: What can I do to improve my code - I'm a beginner (Updated) by Anonymous Monk on Aug 11, 2017 at 08:35 UTC
Thanks for your reply! As I say, I'm a beginner, so this took quite a while to digest, but now that I've re-read it I'm starting to understand what you are saying. Thank you for your help!
Re: What can I do to improve my code - I'm a beginner by Marshall (Canon) on Aug 13, 2017 at 09:59 UTC
Hi Anon Monk! I encourage you to sign up for an account. An account is free and helps at least me understand who is posting what...

I found your code hard to follow. I present an alternative below and will offer a few comments. I have no comments about the command line and Getopt::Long. My code starts with a list of files to process.

The use of array subscripts is rare in Perl. My code below does have one such use, to index the translation of a number into a string representing the day of the week. Your code has an inappropriate and, I think, incorrect use: `for(my $i=1; $i < scalar @files; $i++)`. I think $i=0 would be correct. But in any event, in Perl use a foreach iterator over an array.

Instead of complicated flags like `if($firstline==1){}`, call a subroutine to deal with the first line. Then set up a loop to process the 12-line chunks. I show a very standard way to do this. `process_one_minute()` will be called repeatedly until there are no complete 12-line chunks left.

I am sure that there are some formatting and spacing issues with the printout. But I hope that this helps.

Some Code:

[code sample not shown, 3 kB]

The Input:

[sample not shown, 495 bytes]

The Output:

[sample not shown, 450 bytes]
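The structure described in this reply can be sketched roughly as follows. This is not Marshall's collapsed code, which is not reproduced in this copy of the thread; it is a minimal illustration under assumptions. The name `process_one_minute` comes from the reply, while `parse_header`, the in-memory demo data and the output format are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read the header line and return the six
# start-date fields (year, month, day, hour, minute, second).
sub parse_header {
    my ($fh) = @_;
    my $header = <$fh>;
    return unless defined $header;
    my @dt = $header =~ /(\d+)/g;
    return @dt[0 .. 5];
}

# Read 12 lines and return their average, or undef when fewer than
# 12 lines remain (an incomplete chunk is discarded).
sub process_one_minute {
    my ($fh) = @_;
    my $sum = 0;
    for (1 .. 12) {
        my $line = <$fh>;
        return unless defined $line;
        my ($mg) = split /,/, $line;    # first column only
        $sum += $mg;
    }
    return $sum / 12;
}

# Demo input: the header and the 12 readings shown in the question,
# read through an in-memory filehandle instead of a gunzip pipe.
my $data = "acceleration (mg) - 2013-10-09 10:00:00 - 2013-10-16 09:59:55"
         . " - sampleRate = 5 seconds,    imputed\n"
         . join '', map { "$_,    0\n" }
               qw(46.3 17.1 30.1 38.4 97.1 87.3 84 78.5 67.9 83.5 155 103.5);

open my $fh, '<', \$data or die "Can't open in-memory data: $!";
my @start = parse_header($fh);
my @averages;
while (defined(my $avg = process_one_minute($fh))) {
    push @averages, $avg;
}
close $fh;

printf "%d-%02d-%02d %02d:%02d:%02d\t%.3f\n", @start, $averages[0];
```

For a list of files, the outer loop would then be `foreach my $file (@files) { ... }` rather than a C-style index loop.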
Re: What can I do to improve my code - I'm a beginner by Anonymous Monk on Aug 10, 2017 at 14:47 UTC
    @date1 = ($dt[0], $dt[1], $dt[2], $dt[3], $dt[4], $dt[5]);

This kind of thing is redundant and error-prone. That's why we have array slices!

    @date1 = @dt[0..5];

Another redundancy:

    chomp($line1);
    my @mg1 = split(/,/, $line1);
    my $line2 = <IN>;
    chomp($line2);
    my @mg2 = split(/,/, $line2);
    my $line3 = <IN>;
    chomp($line3);
    ...

You've replaced it with a loop that calls `Add_Delta_DHMS` on every line, not just once per twelve lines, so it's not equivalent at all. I would tend to write something like this:

    my ($sum) = split /,/, $line;
    for (2..12) {
        my ($val) = split /,/, <IN>;
        $sum += $val;
    }
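Both suggestions can be tried in isolation with a small, self-contained sketch. The 24 dummy readings and the in-memory filehandle `$in` are assumptions made for the demo; the slice and the inner loop follow the reply.

```perl
use strict;
use warnings;

# Array slices: split the 12 captured numbers into start and end date.
my @dt    = '2013-10-09 10:00:00 - 2013-10-16 09:59:55' =~ /(\d+)/g;
my @date1 = @dt[0 .. 5];
my @date2 = @dt[6 .. 11];

# Demo data (assumed): 24 readings, i.e. two complete 12-line chunks.
my $data = join '', map { "$_,    0\n" } 1 .. 24;
open my $in, '<', \$data or die "Can't open in-memory data: $!";

my @averages;
while (my $line = <$in>) {
    # First reading of the chunk comes from the line just read ...
    my ($sum) = split /,/, $line;
    # ... and the remaining 11 are read inside the loop.
    for (2 .. 12) {
        my ($val) = split /,/, <$in>;
        $sum += $val;
    }
    push @averages, $sum / 12;
}
close $in;

print "start: @date1\n";
print "end:   @date2\n";
print "averages: @averages\n";
```

Note that this sketch assumes the number of data lines is a multiple of 12; a trailing partial chunk would make `<$in>` return `undef` inside the inner loop and trigger warnings.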
Re^2: What can I do to improve my code - I'm a beginner by Anonymous Monk on Aug 11, 2017 at 08:52 UTC
That array slice thing makes so much sense! Thank you! Also, I now understand why my second code was running a lot slower: because I was calling Add_Delta_DHMS on every line. You've been a big help! Thank you!
Re: What can I do to improve my code - I'm a beginner by Anonymous Monk on Aug 12, 2017 at 17:54 UTC
Some other comments and suggestions:

- There is this section where you read in the list of input files:

      my @files = undef;   #list of files to run script on
      while(my $line = <INPUT>){
          chomp($line);
          push @files, $line;
      }

  The `@files = undef` part was addressed already, but there is one more typical beginner pattern here. You read the file line by line, then push every line onto your array. This is inefficient and unnecessary. The <> operator in list context will read the entire file into the array for you, and you can use chomp on the array to remove the newline from every line. So a simple

      my @files = <INPUT>;
      chomp @files;

  is enough.

- I bet your script spends most of its time in the Add_Delta_DHMS function. Because you keep your dates in their complicated, human-readable format, you have to do complicated date-processing arithmetic every time you want to add 5 seconds to them. (It was the right call to use a module like Date::Calc instead of rolling your own buggy time-increment code, but you pay the price for it: it's slow.) It would be better to convert the date/time you read from the first line of the file to a Unix timestamp with Time::Local, so that incrementing the rolling timestamp simply becomes `$t += 5;`, then use strftime or localtime when you print the date.

- Those chomps in the inner loop are unnecessary. If you really want to pare it down, and you are sure about your input format, you can even do away with the splits. If you have a string like "45.6, foo" and force it into numeric context (e.g. with an operator like +), Perl will take the 45.6 and ignore the rest. So you can even do something like

      my $avg = <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN> + <IN>;
      $avg /= 12;
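A minimal sketch of the first two suggestions, under stated assumptions: the file names are invented, the input is read from an in-memory filehandle, and the timestamp is handled in UTC with `timegm`/`gmtime` rather than `timelocal`/`localtime` so that daylight-saving changes cannot shift the output. The step is 60 seconds because the posted script emits one averaged row per minute.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);
use POSIX qw(strftime);

# List-context read: one statement instead of a while/push loop.
# The file names are hypothetical.
my $list = "a.csv.gz\nb.csv.gz\n";
open my $fh, '<', \$list or die "Can't open in-memory list: $!";
my @files = <$fh>;
chomp @files;
close $fh;

# Convert the start time from the header to epoch seconds once.
my ($y, $mo, $d, $h, $mi, $s) = '2013-10-09 10:00:00' =~ /(\d+)/g;
# timegm takes (sec, min, hour, mday, month 0-11, year)
my $start = timegm($s, $mi, $h, $d, $mo - 1, $y);

# Day of week as 1 (Monday) .. 7 (Sunday), the convention used by
# Date::Calc's Day_of_Week; gmtime's own wday is 0 (Sunday) .. 6.
my $wday = (gmtime($start))[6] || 7;

# Advancing the clock is now plain integer addition.
my $t = $start;
my @stamps;
for (1 .. 3) {
    push @stamps, strftime('%Y-%m-%d %H:%M:%S', gmtime($t));
    $t += 60;    # one output row per minute
}

print "files: @files\n";
print "day of week: $wday\n";
print "$_\n" for @stamps;
```

On the third suggestion: adding unsplit lines such as `"46.3,    0\n"` does yield the leading number, but under `use warnings` each one produces an "isn't numeric" warning, so the split-free variant is best kept for scripts where that warning category is deliberately disabled.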
Re^2: What can I do to improve my code - I'm a beginner by Anonymous Monk on Aug 25, 2017 at 09:29 UTC
Honestly, thank you so much. I wondered why it was taking so much longer and this answered my question. Thank you!
