in reply to Processing large files many times over
Edit: This was originally its own root level question. tye moved it here to merge the two related threads.
Okay, here's the story. I wrote a Perl script to analyze some data from a repository of large files (ranging from roughly 600 to 2500KB, the majority around 1300-1550KB). The first script I'll show you handled the lower end of the spectrum (~600KB) in about 3 to 5 minutes (per file, that is), and the higher end in an hour or more (MUCH more; try 12 hours). I could not explain this discrepancy myself.
The REALLY interesting thing is that when I took the looping structure out (the stuff that gobbled up all the txt files in an array of directories) and just fed one of the mega-files (1550 or so KB) into the 'chunk' of the code, it took . . . about 25 seconds!!!!!!
Okay, so I've had that first script running for about a week (with no hope of stopping!). The second one I'll show you (actually two scripts; one is a wrapper that repeatedly calls the other) handles a file in about 25 seconds! (Now, I used some optimizations, but the original test I was talking about, taking out the for loops, was with the old, slow code.)
Now, someone out there has to know why (and it sure isn't me). I'm intrigued; what could it be?
These were quick writeups. I know I should have used strict/warnings/etc. I'm going to improve the coding style; ignore the discrepancies in comments. This is several versions compiled in a hurry trying to find one that will work.
First version
#!/usr/bin/perl
#######################################
## test.pl
## Program to parse a txt file given in
## tab-delimited format of physio data
## Parses all files in a directory and
## compiles the results into one file
## Looks through desired folders in dir
#######################################

$dir_path = 'G:\\Test Data';
@dir_folds = ('Ts1', 'Ts10', 'Ts12', 'Ts13', 'Ts14', 'Ts15', 'Ts16', 'Ts17', 'Ts18',
              'Ts19', 'Ts2', 'Ts20', 'Ts21', 'Ts22', 'Ts23', 'Ts24', 'Ts25', 'Ts26', 'Ts27',
              'Ts3', 'Ts4', 'Ts5', 'Ts6', 'Ts7', 'Ts8', 'Ts9'); ## holds the folders at $dir_path where desired files are located
$full_path; ## Full directory path
$avg_file = ">>averages.txt"; ## Name of file averages are written to (opened for appending)
$full_name; ## Full file name (i.e., with directory path)
$ln_num = 0; ## Line number, used in @zeroat array
$i = 0; ## Reference counter
$j = 0; ## Reference counter (specifically for @avg_frmt array; counts through file loop)
$sum1 = 0; ## HRT sum
$sum2 = 0; ## SKT sum
$sum3 = 0; ## EMG sum
$avg11 = 0; ## HRT avg 1
$avg12 = 0; ## HRT avg 2
$avg21 = 0; ## SKT avg 1
$avg22 = 0; ## SKT avg 2
$avg31 = 0; ## EMG avg 1
$avg32 = 0; ## EMG avg 2
@files; ## Array to hold desired filenames
@avg_frmt; ## Array to hold formatting for $avg_file document
@lines; ## Array to hold lines of file
@lines1; ## Holds first part to be averaged
@lines2; ## Holds second part to be averaged
@zeroat; ## Tells where zeros are at in array @lines

## Do procedure for all desired folders at $dir_path
for $dir_fold(@dir_folds) {
  ## Print Status: i.e., which folder the program is currently on
  print "$dir_fold\n";

  ## Retrieve all .txt files at $dir_path\$dir_fold
  $full_path = "$dir_path\\$dir_fold";
  opendir(DIR, $full_path) or die "$full_path failed to open: $!";
  @files = grep { /\.txt$/ } readdir(DIR);
  closedir(DIR);

  ## Begin looping over files, compiling averages for each
  for $i_file(@files) {
    next if $i_file =~ /RAREEVENT/;
    $full_name = "$full_path\\$i_file";
    open(IN,$full_name) or die "$i_file failed to open: $!";
    @lines = <IN>;

    ## Print Status: i.e., which file the program is currently on
    print "\t$i_file\n";

    ## Retrieve desired rows
    for $curline(@lines) {
      $curline =~ /.*?\t.*?\t.*?\t.*?\t([05])/; ## parse line
      $zeros[$ln_num] = $curline;
      $zeroat[$i] = $ln_num if $1 == 0;
      $ln_num++;
      $i++;
    }

    ## Take Average
    ## Get all points between the starting and ending points, and separate into different arrays
    for $i(@zeroat) { ## $i is an index in @lines to where a zero is at
      $lines[$i] =~ /(.*?)\t.*?\t.*?\t.*?\t([05])/; ## parse line
      if ($1 > .5 && $2 == 0) {
        @lines1 = @lines[0..$i-1]; ## @lines1 equals the first $i-1 elements of @lines
        @lines2 = @lines[$i+1..$#lines]; ## @lines2 equals everything past the $i+1 element of @lines
        @lines = (@lines1,@lines2); ## @lines equals @lines1 followed by @lines2 ({$i}th element removed)
      } ## the zero is in the middle: split for averaging
    }

    ## Reset sums
    $sum1 = 0;
    $sum2 = 0;
    $sum3 = 0;
    for $i(@lines1) { ## go through first part and average
      $i =~ /.*?\t(.*?)\t(.*?)\t(.*?)\t5/; ## parse line
      $sum1 += $1;
      $sum2 += $2;
      $sum3 += $3;
    }

    ## Get first average
    $avg11 = $sum1/$#lines1;
    $avg21 = $sum2/$#lines1;
    $avg31 = $sum3/$#lines1;

    ## Reset sums
    $sum1 = 0;
    $sum2 = 0;
    $sum3 = 0;
    for $i(@lines2) { ## go through second part and average
      $i =~ /.*?\t(.*?)\t(.*?)\t(.*?)\t5/; ## parse line
      $sum1 += $1;
      $sum2 += $2;
      $sum3 += $3;
    }

    ## Get second average
    $avg12 = $sum1/$#lines2;
    $avg22 = $sum2/$#lines2;
    $avg32 = $sum3/$#lines2;

    ## Put averages into tab delimited columns with desired format: File name followed by tab followed
    ## by averages; first line is resting condition; second line is cloud condition.
    $avg_frmt[$j] = "$i_file\t$avg11\t$avg21\t$avg31\n"; ## HRT, SKT, EMG is the
    $avg_frmt[$j+1] = "$i_file\t$avg12\t$avg22\t$avg32\n"; ## order for the averages
    $j += 2;
  } ## End looping over files in folder
} ## End looping over folders in directory

## Open and print averages to $avg_file
open(OUT,$avg_file) or die "$avg_file failed to be created: $!";
print OUT @avg_frmt;
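One visible difference between the looping script above and a single-file run is that `$ln_num`, `$i`, and `@zeroat` are set once at the top and never cleared between files, whereas a fresh process per file starts with them empty. Whether that accounts for the timings is an assumption, not something established here. The sketch below only illustrates the carry-over on made-up in-memory data; `@fake_files` and `slots_per_file` are hypothetical names, not part of the original scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for two input files: each "file" is a list of
# tab-delimited lines whose last column is 0 or 5.
my @fake_files = (
    [ "0.1\t1\t2\t3\t5\n", "0.7\t1\t2\t3\t0\n", "0.9\t1\t2\t3\t5\n" ],
    [ "0.1\t4\t5\t6\t5\n", "0.8\t4\t5\t6\t0\n" ],
);

# Returns the size of the zero-marker array after each file, either
# carrying the counter and array across files or resetting them first.
sub slots_per_file {
    my ($reset_each_file) = @_;
    my (@zeroat, @sizes);
    my $ln_num = 0;
    for my $lines (@fake_files) {
        if ($reset_each_file) {
            @zeroat = ();
            $ln_num = 0;
        }
        for my $line (@$lines) {
            my ($flag) = $line =~ /\t([05])\s*$/;
            $zeroat[$ln_num] = $ln_num if defined $flag && $flag == 0;
            $ln_num++;
        }
        push @sizes, scalar @zeroat;
    }
    return @sizes;
}

my @carried = slots_per_file(0);
my @reset   = slots_per_file(1);
print "carried over: @carried\n";
print "reset:        @reset\n";
```

On this toy input the carried-over run leaves 2 and then 5 slots in the marker array, while the reset run leaves 2 both times, so any loop over that array does more work on each later file unless the state is cleared.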
Second version--file 'test.pl'
#!/usr/bin/perl
#######################################
## final_v02_TEST.pl
## test.pl (for short)
## Program to parse a txt file given in
## tab-delimited format of physio data
## Parses all files in a directory and
## compiles the results into one file
## Looks through desired folders in dir
##
## Possible Additions:
## ADDITIONS MADE
##
## final_v02_TEST.pl Version .02 (6/20)
##  - ADDED 'close IN;' statement (6/21)
##  - Code altered to only work on given
##    text file (6/24)
##  - NOTE USAGE:
##     perl test.pl FILE_NAME DIR_PATH
#######################################

$i_file; ## Name of input file
$dir_path; ## Directory path for $i_file
$full_name; ## '$dir_path\$i_file'
$avg_file = ">>averages.txt"; ## Name of file where file averages are written to (append mode)
$ln_num = 0; ## Line number in current file, used in @zeroat array to mark zeros
$i = 0; ## Reference counter
$j = 0; ## Reference counter (specifically for @avg_frmt array; counts through file loop)
@sums[3]; ## Holds HRT, SKT, and EMG sums, respectively
@avg->[3][2]; ## Holds 1st HRT, 2nd HRT, 1st SKT, 2nd SKT, 1st EMG, and 2nd EMG, averages, respectively
@files; ## Array to hold desired filenames for current folder
@avg_frmt; ## Array to hold formatting for $avg_file document (i.e., the formatted output)
@lines; ## Array to hold lines of current file
@lines1; ## Holds first part to be averaged
@lines2; ## Holds second part to be averaged
@zeroat; ## Tells where zeros are at in array @lines (holds the line number of the zeros; an index to @lines)

$i_file = $ARGV[0]; ## Get file name from command line (first argument)
$dir_path = $ARGV[1]; ## Get directory path from command line (second argument)
$full_name = "$dir_path\\$i_file";
open(IN,$full_name) or die "$i_file failed to open: $!";
@lines = <IN>; ## Give file input to @lines
close IN;

## Retrieve desired rows
for $curline(@lines) { ## $curline contains the current line being worked on
  (reverse $line) =~ /^\s*([05])/; ## Get 0 or 5 from end
  $zeros[$ln_num] = $curline;
  $zeroat[$i] = $ln_num if $1 == 0;
  $ln_num++;
  $i++;
}

## Take Average
## Get all points between the starting and ending points, and separate into different arrays
for $i(@zeroat) { ## $i is an index in @lines to where a zero is at
  $lines[$i] =~ /^([0-9]+.?[0-9]*)\t.*([05])\s*$/; ## Get time (first column) and 0 or 5 (last column)
  if ($1 > .5 && $2 == 0) { ## {If} time ($1) is more than .5 {AND} end column ($2) is 0 ...
    splice @lines,$i,1; ## Remove $lines[$i] from @lines
    @lines1 = @lines; ## Copy necessary for next statement
    @lines2 = splice @lines1,$i,$#lines1-$i+1; ## Splice removes desired elements from @lines1,
                                               ## which are given to @lines2 (splice's return value)
  } ## the zero is in the middle: split for averaging
}

## Reset sums
@sums = map { $sums[$_] = 0 } (0..2);
for $i(@lines1) { ## go through first part and average
  @vals = split /\t/, $lines[$i]; ## Each column in the line has its own place in @vals
  map { $sums[$_] += $vals[$_] } (0..2);
}

## Get first average
map { $avg->[$_][1] = $sums[$_]/$#lines1 } (0..2);

## Reset sums
@sums = map { $sums[$_] = 0 } (0..2);
for $i(@lines2) { ## go through second part and average
  @vals = split /\t/, $lines[$i]; ## Each column in the line has its own place in @vals
  map { $sums[$_] += $vals[$_] } (0..2);
}

## Get second average
map { $avg->[$_][2] = $sums[$_]/$#lines1 } (0..2);

## Put averages into tab delimited columns with desired format: File name followed by tab followed
## by averages; first line is resting condition; second line is cloud condition.
$avg_frmt[$j] = "$i_file\t$avg->[1][1]\t$avg->[2][1]\t$avg->[3][1]\n"; ## HRT, SKT, EMG is the
$avg_frmt[$j+1] = "$i_file\t$avg->[1][2]\t$avg->[2][2]\t$avg->[3][2]\n"; ## order for the averages
$j += 2;

## Open and print averages to $avg_file
open(OUT,$avg_file) or die "$avg_file failed to be created: $!";
print OUT @avg_frmt;
Second version--Wrapper
#!/usr/bin/perl
#############################
# wrapper.pl
#
# Runs test.pl on designated
# directory and files
#############################

use strict;
use warnings;

my $base_dir = 'G:\Test Data';
my @included_dirs = ('Ts1', 'Ts10', 'Ts12', 'Ts13', 'Ts14', 'Ts15', 'Ts16', 'Ts17',
                     'Ts18', 'Ts19', 'Ts2', 'Ts20', 'Ts21', 'Ts22', 'Ts23', 'Ts24',
                     'Ts25', 'Ts26', 'Ts27', 'Ts3', 'Ts4', 'Ts5', 'Ts6', 'Ts7', 'Ts8', 'Ts9');
my @files;

for my $dir(@included_dirs) {
  ## STATUS CHECK
  print "$dir\n";

  opendir(DIR, "$base_dir\\$dir") or die "$dir failed to open: $!";
  @files = grep { /\.txt$/ } readdir(DIR);
  closedir(DIR);

  for my $file(@files) {
    next if $file =~ /RAREEVENT/;

    ## STATUS CHECK
    print "\t$file\n\t\tRunning test.pl";

    my $arg1 = $file;
    my $arg2 = "$base_dir\\$dir";
    system('E:\perl\bin\perl', 'C:\WINDOWS\Profiles\chemphysio\Desktop\Test data\TEST\test.pl', $arg1, $arg2);

    ## STATUS CHECK
    print "\t\tControl returned\n";
  }
}
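The wrapper above calls `system` in list form and ignores the result, so a test.pl run that dies would look the same as one that succeeds. A minimal sketch of checking the child's exit status follows. `run_checked` is a hypothetical helper name, and `$^X` (the perl that is currently running) stands in for the real interpreter and script paths; termination by a signal is not handled here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Runs a command in list form (no shell involved) and returns its exit
# code, or dies if the program could not be started at all.
sub run_checked {
    my @cmd = @_;
    system(@cmd);
    die "failed to start '$cmd[0]': $!" if $? == -1;
    return $? >> 8;    # the high byte of $? holds the child's exit code
}

# Placeholder children: one exits cleanly, one exits with code 3.
my $ok  = run_checked($^X, '-e', 'exit 0');
my $bad = run_checked($^X, '-e', 'exit 3');
print "first child exited with $ok, second with $bad\n";
```

In a wrapper like the one above, a non-zero return could be printed next to the file name so that failed files are easy to spot in the status output.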
Edit by tye to clean up "read more" bit.