http://www.perlmonks.org?node_id=176832


in reply to Processing large files many times over

Edit: This was originally its own root level question. tye moved it here to merge the two related threads.

Okay, here's the story.
I wrote a perl script to analyze some data from a repository of large files (ranging from roughly 600 to 2500KB, the majority around 1300-1550KB). The first script I'll show you handled the lower end of the spectrum (~600KB) in about 3 to 5 minutes (per file, that is), and the higher end in an hour and more (MUCH more; try 12 hours). I could not explain this discrepancy myself.

The REALLY interesting thing is that when I took the looping structure out (the stuff that gobbled up all the txt files in an array of directories) and just took one of the mega-files (1550 or so KB) into the 'chunk' of the code it took . . . about 25 seconds!!!!!!

Okay, so I've had that first script running about a week (with no hope of stopping!), this second one I'll show you (actually two scripts, one's a wrapper, repeatedly calling the other) handles a file in about 25 seconds! (now, I used some optimizations, but the original test I was talking about--taking out the for loops--was with the old, slow code).

Now, someone out there has to know why (and it sure isn't me). I'm intrigued; what could it be?

To get to the code, you'll have to read more.
These were quick writeups. I know I should have used strict/warnings/etc. I'm going to improve the coding style; please ignore the discrepancies in the comments. This is several versions compiled in a hurry while trying to find one that will work.

First version

#!/usr/bin/perl
#######################################
## test.pl
## Program to parse a txt file given in
## tab-delimited format of physio data
## Parses all files in a directory and
## compiles the results into one file
## Looks through desired folders in dir
#######################################

$dir_path = 'G:\\Test Data';
@dir_folds = ('Ts1', 'Ts10', 'Ts12', 'Ts13', 'Ts14', 'Ts15', 'Ts16', 'Ts17',
              'Ts18', 'Ts19', 'Ts2', 'Ts20', 'Ts21', 'Ts22', 'Ts23', 'Ts24',
              'Ts25', 'Ts26', 'Ts27', 'Ts3', 'Ts4', 'Ts5', 'Ts6', 'Ts7',
              'Ts8', 'Ts9'); ## holds the folders at $dir_path where desired files are located
$full_path; ## Full directory path
$avg_file = ">>averages.txt"; ## Name of file averages are written to (opened for appending)
$full_name; ## Full file name (i.e., with directory path)
$ln_num = 0; ## Line number, used in @zeroat array
$i = 0; ## Reference counter
$j = 0; ## Reference counter (specifically for @avg_frmt array; counts through file loop)
$sum1 = 0; ## HRT sum
$sum2 = 0; ## SKT sum
$sum3 = 0; ## EMG sum
$avg11 = 0; ## HRT avg 1
$avg12 = 0; ## HRT avg 2
$avg21 = 0; ## SKT avg 1
$avg22 = 0; ## SKT avg 2
$avg31 = 0; ## EMG avg 1
$avg32 = 0; ## EMG avg 2
@files; ## Array to hold desired filenames
@avg_frmt; ## Array to hold formatting for $avg_file document
@lines; ## Array to hold lines of file
@lines1; ## Holds first part to be averaged
@lines2; ## Holds second part to be averaged
@zeroat; ## Tells where zeros are at in array @lines

## Do procedure for all desired folders at $dir_path
for $dir_fold(@dir_folds) {
  ## Print Status: i.e., which folder the program is currently on
  print "$dir_fold\n";

  ## Retrieve all .txt files at $dir_path\$dir_fold
  $full_path = "$dir_path\\$dir_fold";
  opendir(DIR, $full_path) or die "$full_path failed to open: $!";
  @files = grep { /\.txt$/ } readdir(DIR);
  closedir(DIR);

  ## Begin looping over files, compiling averages for each
  for $i_file(@files) {
    next if $i_file =~ /RAREEVENT/;
    $full_name = "$full_path\\$i_file";
    open(IN,$full_name) or die "$i_file failed to open: $!";
    @lines = <IN>;

    ## Print Status: i.e., which file the program is currently on
    print "\t$i_file\n";

    ## Retrieve desired rows
    for $curline(@lines) {
      $curline =~ /.*?\t.*?\t.*?\t.*?\t([05])/; ## parse line
      $zeros[$ln_num] = $curline;
      $zeroat[$i] = $ln_num if $1 == 0;
      $ln_num++;
      $i++;
    }

    ## Take Average
    ## Get all points between the starting and ending points, and separate into different arrays
    for $i(@zeroat) { ## $i is an index in @lines to where a zero is at
      $lines[$i] =~ /(.*?)\t.*?\t.*?\t.*?\t([05])/; ## parse line
      if ($1 > .5 && $2 == 0) {
        @lines1 = @lines[0..$i-1]; ## @lines1 equals the first $i-1 elements of @lines
        @lines2 = @lines[$i+1..$#lines]; ## @lines2 equals everything past the $i+1 element of @lines
        @lines = (@lines1,@lines2); ## @lines equals @lines1 followed by @lines2 ({$i}th element removed)
      } ## the zero is in the middle: split for averaging
    }

    ## Reset sums
    $sum1 = 0;
    $sum2 = 0;
    $sum3 = 0;
    for $i(@lines1) { ## go through first part and average
      $i =~ /.*?\t(.*?)\t(.*?)\t(.*?)\t5/; ## parse line
      $sum1 += $1;
      $sum2 += $2;
      $sum3 += $3;
    }

    ## Get first average
    $avg11 = $sum1/$#lines1;
    $avg21 = $sum2/$#lines1;
    $avg31 = $sum3/$#lines1;

    ## Reset sums
    $sum1 = 0;
    $sum2 = 0;
    $sum3 = 0;
    for $i(@lines2) { ## go through second part and average
      $i =~ /.*?\t(.*?)\t(.*?)\t(.*?)\t5/; ## parse line
      $sum1 += $1;
      $sum2 += $2;
      $sum3 += $3;
    }

    ## Get second average
    $avg12 = $sum1/$#lines2;
    $avg22 = $sum2/$#lines2;
    $avg32 = $sum3/$#lines2;

    ## Put averages into tab delimited columns with desired format: File name followed by tab followed
    ## by averages; first line is resting condition; second line is cloud condition.
    $avg_frmt[$j] = "$i_file\t$avg11\t$avg21\t$avg31\n"; ## HRT, SKT, EMG is the
    $avg_frmt[$j+1] = "$i_file\t$avg12\t$avg22\t$avg32\n"; ## order for the averages
    $j += 2;
  } ## End looping over files in folder
} ## End looping over folders in directory

## Open and print averages to $avg_file
open(OUT,$avg_file) or die "$avg_file failed to be created: $!";
print OUT @avg_frmt;

Second version--file 'test.pl'

#!/usr/bin/perl
#######################################
## final_v02_TEST.pl
## test.pl (for short)
## Program to parse a txt file given in
## tab-delimited format of physio data
## Parses all files in a directory and
## compiles the results into one file
## Looks through desired folders in dir
##
## Possible Additions:
## ADDITIONS MADE
##
## final_v02_TEST.pl Version .02 (6/20)
## - ADDED 'close IN;' statement (6/21)
## - Code altered to only work on given
##   text file (6/24)
## - NOTE USAGE:
##   perl test.pl FILE_NAME DIR_PATH
#######################################

$i_file; ## Name of input file
$dir_path; ## Directory path for $i_file
$full_name; ## '$dir_path\$i_file'
$avg_file = ">>averages.txt"; ## Name of file where file averages are written to (append mode)
$ln_num = 0; ## Line number in current file, used in @zeroat array to mark zeros
$i = 0; ## Reference counter
$j = 0; ## Reference counter (specifically for @avg_frmt array; counts through file loop)
@sums[3]; ## Holds HRT, SKT, and EMG sums, respectively
@avg->[3][2]; ## Holds 1st HRT, 2nd HRT, 1st SKT, 2nd SKT, 1st EMG, and 2nd EMG, averages, respectively
@files; ## Array to hold desired filenames for current folder
@avg_frmt; ## Array to hold formatting for $avg_file document (i.e., the formatted output)
@lines; ## Array to hold lines of current file
@lines1; ## Holds first part to be averaged
@lines2; ## Holds second part to be averaged
@zeroat; ## Tells where zeros are at in array @lines (holds the line number of the zeros; an index to @lines)

$i_file = $ARGV[0]; ## Get file name from command line (first argument)
$dir_path = $ARGV[1]; ## Get directory path from command line (second argument)
$full_name = "$dir_path\\$i_file";
open(IN,$full_name) or die "$i_file failed to open: $!";
@lines = <IN>; ## Give file input to @lines
close IN;

## Retrieve desired rows
for $curline(@lines) { ## $curline contains the current line being worked on
  (reverse $line) =~ /^\s*([05])/; ## Get 0 or 5 from end
  $zeros[$ln_num] = $curline;
  $zeroat[$i] = $ln_num if $1 == 0;
  $ln_num++;
  $i++;
}

## Take Average
## Get all points between the starting and ending points, and separate into different arrays
for $i(@zeroat) { ## $i is an index in @lines to where a zero is at
  $lines[$i] =~ /^([0-9]+.?[0-9]*)\t.*([05])\s*$/; ## Get time (first column) and 0 or 5 (last column)
  if ($1 > .5 && $2 == 0) { ## {If} time ($1) is more than .5 {AND} end column ($2) is 0 ...
    splice @lines,$i,1; ## Remove $lines[$i] from @lines
    @lines1 = @lines; ## Copy necessary for next statement
    @lines2 = splice @lines1,$i,$#lines1-$i+1; ## Splice removes desired elements from @lines1,
                                               ## which are given to @lines2 (splice's return value)
  } ## the zero is in the middle: split for averaging
}

## Reset sums
@sums = map { $sums[$_] = 0 } (0..2);
for $i(@lines1) { ## go through first part and average
  @vals = split /\t/, $lines[$i]; ## Each column in the line has its own place in @vals
  map { $sums[$_] += $vals[$_] } (0..2);
}

## Get first average
map { $avg->[$_][1] = $sums[$_]/$#lines1 } (0..2);

## Reset sums
@sums = map { $sums[$_] = 0 } (0..2);
for $i(@lines2) { ## go through second part and average
  @vals = split /\t/, $lines[$i]; ## Each column in the line has its own place in @vals
  map { $sums[$_] += $vals[$_] } (0..2);
}

## Get second average
map { $avg->[$_][2] = $sums[$_]/$#lines1 } (0..2);

## Put averages into tab delimited columns with desired format: File name followed by tab followed
## by averages; first line is resting condition; second line is cloud condition.
$avg_frmt[$j] = "$i_file\t$avg->[1][1]\t$avg->[2][1]\t$avg->[3][1]\n"; ## HRT, SKT, EMG is the
$avg_frmt[$j+1] = "$i_file\t$avg->[1][2]\t$avg->[2][2]\t$avg->[3][2]\n"; ## order for the averages
$j += 2;

## Open and print averages to $avg_file
open(OUT,$avg_file) or die "$avg_file failed to be created: $!";
print OUT @avg_frmt;

Second version--Wrapper

#!/usr/bin/perl
#############################
# wrapper.pl
#
# Runs test.pl on designated
# directory and files
#############################
use strict;
use warnings;

my $base_dir = 'G:\Test Data';
my @included_dirs = ('Ts1', 'Ts10', 'Ts12', 'Ts13', 'Ts14', 'Ts15', 'Ts16', 'Ts17',
                     'Ts18', 'Ts19', 'Ts2', 'Ts20', 'Ts21', 'Ts22', 'Ts23', 'Ts24',
                     'Ts25', 'Ts26', 'Ts27', 'Ts3', 'Ts4', 'Ts5', 'Ts6', 'Ts7',
                     'Ts8', 'Ts9');
my @files;

for my $dir(@included_dirs) {
  ## STATUS CHECK
  print "$dir\n";

  opendir(DIR, "$base_dir\\$dir") or die "$dir failed to open: $!";
  @files = grep { /\.txt$/ } readdir(DIR);
  closedir(DIR);

  for my $file(@files) {
    next if $file =~ /RAREEVENT/;

    ## STATUS CHECK
    print "\t$file\n\t\tRunning test.pl";

    my $arg1 = $file;
    my $arg2 = "$base_dir\\$dir";
    system('E:\perl\bin\perl', 'C:\WINDOWS\Profiles\chemphysio\Desktop\Test data\TEST\test.pl', $arg1, $arg2);

    ## STATUS CHECK
    print "\t\tControl returned\n";
  }
}

Edit by tye to clean up "read more" bit.

Replies are listed 'Best First'.
Re: A very odd happening (at least. . . to me)
by maverick (Curate) on Jun 24, 2002 at 16:09 UTC
    From a quick glance at the code, one of the first questions that comes to mind is "how many files are in these directories?" I suspect that part of the source of your slowness is that you read both the entire list of files, and the entire contents of each file into memory. If you alter your reading structure like so:
    opendir(DIR,"$base_dir\\$dir") or die "$dir failed to open: $!";
    while (defined(my $file = readdir(DIR))) {
        next unless $file =~ /\.txt$/;
        # etc, etc.
        open(IN,"$full_name") || die "can't open $!";
        while (my $line = <IN>) {
            # processing
        }
        close(IN);
    }
    closedir(DIR);
    you won't have the overhead of all the memory allocation. In your second example there's a system call to a secondary perl script. That's going to be time consuming too. Consider making the second perl program a subroutine...that will avoid a fork, exec, and compile for every file you have.

    HTH

    /\/\averick
    OmG! They killed tilly! You *bleep*!!

Re: A very odd happening (at least. . . to me)
by educated_foo (Vicar) on Jun 24, 2002 at 16:11 UTC
    I (and probably others here) would be curious to see what is going on with your code, but when it's just thrown at the community in a big heap, it's not likely to be looked at. Reading raw code isn't much fun. I would suggest
    • Simplifying the code as much as possible. Debugging statements and obviously unimportant details can be safely removed.
    • Presenting pseudo-code along with the original code.
    • Giving a high-level English description of your algorithm.
    Doing these things will make the problem more accessible to the reader, and may even help you understand what's going on in your own code.

    You might also want to outline what you have figured out about the problem so far. This is partly to help the reader, and partly to prove that you have invested your own time and interest. Sure, it's both annoying and a waste of your time to have to prove to a bunch of strangers in writing that you have "R'ed the FM" etc., but CS/IT people often seem to demand that proof before deigning to answer questions.

    /s

      I would also offer the point that your commenting style's a little difficult to follow. Consider throwing a little more whitespace at this code to make it more readable. (For instance: When a variable has a sufficiently advanced use as to require a long comment, consider putting a blank line above and below it, to visually separate it from the rest of the variables.)

      I realize that's not quite what needs to be done with the code, but believe me -- the easier your code is to read, the more likely you're going to get helpful responses.

      In case you're curious, there is a largish list of recommended practices for whitespace, comments, etc. Nothing there is set in stone, but nobody will complain if you do the things it outlines. ;)

      -----------------------
      You are what you think.

Re: A very odd happening (at least. . . to me)
by kvale (Monsignor) on Jun 24, 2002 at 16:09 UTC
    Well, there is a lot happening in the two pieces of code, so it is hard to say for sure what is making the second version faster. But one thing you have improved is replacing all that array copying with splicing. Even better would be to manipulate just indices of the single @lines array for your computations.
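
    As a rough illustration of the index idea, here is a minimal sketch that averages a range of rows in place instead of building @lines1 and @lines2. The helper name average_range, the made-up data, and the hard-coded $split_at are assumptions for the example, not part of the posted scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: average columns 1..3 (hrt, skt, emg) of the rows
# $rows->[$from] .. $rows->[$to] without copying any part of the array.
sub average_range {
    my ($rows, $from, $to) = @_;
    my @sums  = (0, 0, 0);
    my $count = $to - $from + 1;
    return if $count < 1;                 # empty range: nothing to average
    for my $idx ($from .. $to) {
        my @cols = split /\t/, $rows->[$idx];
        $sums[$_] += $cols[ $_ + 1 ] for 0 .. 2;
    }
    return map { $_ / $count } @sums;
}

# Tiny made-up data set: time, hrt, skt, emg, button flag.
my @lines = (
    "0\t60\t80\t1\t5\n",
    "1\t62\t82\t3\t5\n",
    "2\t0\t0\t0\t0\n",      # the row that splits the file
    "3\t70\t90\t5\t5\n",
    "4\t74\t94\t7\t5\n",
);
my $split_at = 2;   # index of the separating row, assumed found elsewhere

my @first  = average_range(\@lines, 0, $split_at - 1);
my @second = average_range(\@lines, $split_at + 1, $#lines);
print join("\t", @first),  "\n";
print join("\t", @second), "\n";
```

    Only the array reference and two integers are passed around, so nothing the size of the file is ever duplicated.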

    -Mark
Re: A very odd happening (at least. . . to me)
by dimmesdale (Friar) on Jun 24, 2002 at 16:54 UTC
    UPDATED First, let me apologize: my computer is running slowly because of the program, I've been at it forever, and I'm a little bit frustrated. I got happier this morning when I thought I'd made an improvement in the speed, but now I'm not so sure.

    Description of problem
    The resulting file 'averages.txt' is incorrect. The averages are wrong (they are too low). I went back to the 'slow' code (that was tested, and did work), but it's not working anymore, it seems. I would be VERY grateful if anyone could help me with a solution that works and won't take five weeks. See below for a description of the code.

    The above was the problem. HOWEVER, now that I have that under control (the array @avg_frmt is correct under debugging tests) I have an odd problem. Nothing is printing to the averages.txt file. It is being created, but it is blank.

    Description of Code
    The directories have 12 files each (that are of interest to us, i.e., .txt). These files contain 30-60 thousand lines of data, in the following format:
    "(time)\t(hrt)\t(skt)\t(emg)\t(0 or 5)"
    (They are measured values that I'm trying to analyze)

    Here's an example of a chunk:

    0            61.2245  83.129   0.000128174  0
    0.000333333  61.2245  83.1305  0.000128174  0
    0.000666667  61.2245  83.132   0.000109863  0
    0.001        61.2245  83.129   0.000115967  5
    0.00133333   61.2245  83.132   0.000115967  5
    0.00166667   61.2245  83.1305  0.00012207   5
    0.002        61.2245  83.132   0.000115967  5
    0.00233333   61.2245  83.132   0.00012207   5
    0.00266667   61.2245  83.132   0.000115967  5
    0.003        61.2245  83.132   0.00012207   5
    0.00333333   61.2245  83.132   0.00012207   5
    0.00366667   61.2245  83.1335  0.000134277  5
    0.004        61.2245  83.132   0.000140381  5
    0.00433333   61.2245  83.1305  0.00012207   5
    0.00466667   61.2245  83.132   0.000134277  5
    0.005        61.2245  83.132   0.000115967  5
    0.00533333   61.2245  83.1335  0.000128174  5
    0.00566667   61.2245  83.1335  0.00012207   5
    0.006        61.2245  83.132   0.000134277  5
    0.00633333   61.2245  83.1351  0.000134277  5
    The 0 at the end represents the push of a button (a 5 for no push). It separates the data into two conditions (the first average we want and the second). The zeros at the beginning represent the start, so we treat those just like it were a 5. However, there is a group of zeros in the middle that we are interested in. Take the first line of data, to the first 0 (from the ones in the middle) and average the desired values. THEN, from the last zero (from the ones in the middle) we average until the end.
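
    To make that description concrete, here is a minimal single-pass sketch of the same idea that keeps only one line in memory at a time. It is an illustration under assumptions, not the code I am running: the name split_averages is made up, a zero only counts as a "middle" zero once time is above .5 (the threshold used in the scripts), every zero row in the middle group is skipped, and the second average restarts after each later zero so that it begins after the last one.

```perl
use strict;
use warnings;

# Hypothetical helper: read rows of "time\thrt\tskt\temg\tflag" from a
# filehandle and return two array refs of averages: the rows before the
# first mid-file zero, and the rows after the last one.
sub split_averages {
    my ($fh, $min_time) = @_;
    $min_time = 0.5 unless defined $min_time;   # assumed threshold

    my @sum   = ([0, 0, 0], [0, 0, 0]);   # running sums per condition
    my @count = (0, 0);                   # rows counted per condition
    my $part  = 0;                        # 0 = before the press, 1 = after

    while (my $line = <$fh>) {
        chomp $line;
        my ($time, $hrt, $skt, $emg, $flag) = split /\t/, $line;
        next unless defined $flag;        # skip blank or short lines

        if ($flag == 0 && $time > $min_time) {
            # A mid-file zero: switch to the second condition and drop
            # anything collected since an earlier zero.
            $part     = 1;
            $sum[1]   = [0, 0, 0];
            $count[1] = 0;
            next;
        }

        my @vals = ($hrt, $skt, $emg);
        $sum[$part][$_] += $vals[$_] for 0 .. 2;
        $count[$part]++;
    }

    # Divide by the row count; an empty condition yields undef.
    return map {
        my $p = $_;
        $count[$p] ? [ map { $_ / $count[$p] } @{ $sum[$p] } ] : undef;
    } 0 .. 1;
}

# Made-up rows: a starting zero, a group of zeros in the middle.
my $data = join '', map { join("\t", @$_) . "\n" } (
    [0,   60, 80, 1, 0],
    [0.4, 62, 82, 3, 5],
    [0.8, 99, 99, 9, 0],
    [0.9, 99, 99, 9, 0],
    [1.0, 70, 90, 5, 5],
    [1.2, 74, 94, 7, 5],
);
open my $fh, '<', \$data or die "in-memory open failed: $!";
my ($first, $second) = split_averages($fh);
close $fh;
print join("\t", @$first),  "\n";
print join("\t", @$second), "\n";
```

    The starting zero is counted like a 5 because its time is below the threshold, which matches how the zeros at the beginning are supposed to be treated.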

    Files with the name RAREEVENT in them we ignore.

    I'd be glad to clarify anything.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $base_dir = 'G:\Test Data';
    my @included_dirs = ('Ts1', 'Ts10', 'Ts12', 'Ts13', 'Ts14', 'Ts15', 'Ts16', 'Ts17',
                         'Ts18', 'Ts19', 'Ts2', 'Ts20', 'Ts21', 'Ts22', 'Ts23', 'Ts24',
                         'Ts25', 'Ts26', 'Ts27', 'Ts3', 'Ts4', 'Ts5', 'Ts6', 'Ts7',
                         'Ts8', 'Ts9');
    my @files;

    for my $dir(@included_dirs) {
      opendir(DIR, "$base_dir\\$dir") or die "$dir failed to open: $!";
      @files = grep { /\.txt$/ } readdir(DIR);
      closedir(DIR);
      print "$dir\n";
      for my $file(@files) {
        next if $file =~ /RAREEVENT/;
        print "\t$file\n";
        my $arg1 = $file;
        my $arg2 = "$base_dir\\$dir";
        process_file($arg1,$arg2);
      }
    }

    sub process_file {
      my $i_file; ## Name of input file
      my $dir_path; ## Directory path for $i_file
      my $full_name; ## '$dir_path\$i_file'
      my $avg_file = ">>averages.txt"; ## Name of file where file averages are written to (append mode)
      my $ln_num = 0; ## Line number in current file, used in @zeroat array to mark zeros
      my $i = 0; ## Reference counter
      my $sum1 = 0; ## HRT sum
      my $sum2 = 0; ## SKT sum
      my $sum3 = 0; ## EMG sum
      my $avg11 = 0; ## HRT avg 1
      my $avg12 = 0; ## HRT avg 2
      my $avg21 = 0; ## SKT avg 1
      my $avg22 = 0; ## SKT avg 2
      my $avg31 = 0; ## EMG avg 1
      my $avg32 = 0; ## EMG avg 2
      my @files; ## Array to hold desired filenames for current folder
      my @avg_frmt; ## Array to hold formatting for $avg_file document (i.e., the formatted output)
      my @lines; ## Array to hold lines of current file
      my @lines1; ## Holds first part to be averaged
      my @lines2; ## Holds second part to be averaged
      my @zeroat; ## Tells where zeros are at in array @lines (holds the line number of the zeros; an index to @lines)

      $i_file = shift; ## Get file name from @_
      $dir_path = shift; ## Get directory path from @_
      $full_name = "$dir_path\\$i_file";
      open(IN,$full_name) or die "$i_file failed to open: $!";
      @lines = <IN>; ## Give file input to @lines
      close IN;

      ## Retrieve desired rows
      for my $curline(@lines) {
        $curline =~ /.*?\t.*?\t.*?\t.*?\t([05])/; ## parse line
        $zeroat[$i++] = $ln_num if $1 == 0;
        $ln_num++;
      }

      ## Take Average
      ## Get all points between the starting and ending points, and separate into different arrays
      LOOP: for my $i(@zeroat) { ## $i is an index in @lines to where a zero is at
        $lines[$i] =~ /(.*?)\t.*?\t.*?\t.*?\t([05])/; ## parse line
        if ($1 > .5 && $2 == 0) {
          @lines1 = @lines[0..$i-1]; ## @lines1 equals the first $i-1 elements of @lines
          @lines2 = @lines[$i+1..$#lines]; ## @lines2 equals everything past the $i+1 element of @lines
          last LOOP;
        } ## the zero is in the middle: split for averaging
      }

      ## Reset sums
      $sum1 = 0;
      $sum2 = 0;
      $sum3 = 0;
      for my $i(@lines1) { ## go through first part and average
        $i =~ /.*?\t(.*?)\t(.*?)\t(.*?)\t[05]/; ## parse line
        $sum1 += $1;
        $sum2 += $2;
        $sum3 += $3;
      }

      ## Get first average
      $avg11 = $sum1/$#lines1;
      $avg21 = $sum2/$#lines1;
      $avg31 = $sum3/$#lines1;

      ## Reset sums
      $sum1 = 0;
      $sum2 = 0;
      $sum3 = 0;
      for my $i(@lines2) { ## go through second part and average
        $i =~ /.*?\t(.*?)\t(.*?)\t(.*?)\t[05]/; ## parse line
        $sum1 += $1;
        $sum2 += $2;
        $sum3 += $3;
      }

      ## Get second average
      $avg12 = $sum1/$#lines2;
      $avg22 = $sum2/$#lines2;
      $avg32 = $sum3/$#lines2;

      ## Put averages into tab delimited columns with desired format: File name followed by tab followed
      ## by averages; first line is resting condition; second line is cloud condition.
      $avg_frmt[0] = "$i_file\t$avg11\t$avg21\t$avg31\n"; ## HRT, SKT, EMG is the
      $avg_frmt[1] = "$i_file\t$avg12\t$avg22\t$avg32\n"; ## order for the averages

      ## Open and print averages to $avg_file
      open(OUT,$avg_file) or die "$avg_file failed to be created: $!";
      print OUT @avg_frmt;
    }

    Added closing code tag - dvergin 2002-06-24

      It won't make a miraculous difference, but you might try parsing each line only once at the top of the function, rather than re-parsing them each time, i.e.:
      my @lines = map { [ /.*?\t(.*?)\t(.*?)\t(.*?)\t(\d)/ ] } <IN>;
      # ... rest of function
      Regex matching can be expensive, so if you're doing the same match multiple times, it's usually better to do it once and save the results.

      I also noticed you're using $average = $sum / $#things to take an average. Despite appearances, $#things isn't the number of elements in @things; it's the index of the last element, which is one less than the count. Instead, you'll want to use $average = $sum / @things, since an array evaluates to its length in a scalar context.
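
      A small sketch of the difference, using a made-up four-element array (the variable names are only for illustration):

```perl
use strict;
use warnings;

my @things = (10, 20, 30, 40);
my $sum = 0;
$sum += $_ for @things;

my $last_index = $#things;          # index of the last element
my $count      = scalar @things;    # number of elements

my $wrong = $sum / $last_index;     # divides by one too few
my $right = $sum / $count;          # the actual mean

print "last index $last_index, count $count\n";
print "wrong $wrong, right $right\n";
```

      Writing scalar @things (or just @things on the right-hand side of a division) makes the intent explicit and also avoids dividing by zero when the array holds exactly one element.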

      /s