PerlMonks  

Processing large files many times over

by dimmesdale (Friar)
on Jun 24, 2002 at 17:49 UTC ( #176869=perlquestion )
dimmesdale has asked for the wisdom of the Perl Monks concerning the following question:

I posted something akin to this at the node entitled A very odd happening (at least to me).

Here I want to clarify my question. Can anyone help me make this code run quickly?

I've tried a plethora of techniques, and it seems nothing works. I have 27 folders, each with 12 files that I want to process (600-2500 KB each, most around 1500 KB). I've stripped the code down to just reading in the file and running a few lines of code (details at the aforementioned node), where @lines holds the file's contents:

for my $curline (@lines) {
    next unless (reverse $curline) =~ /^\s*([05])/;
    $zeroat[$i++] = $ln_num if $1 == 0;
    $ln_num++;
}
for $i (@zeroat) {
    $lines[$i] =~ /^([0-9]+.?[0-9]*)\t.*([05])\s*$/;
    if ($1 > .5 && $2 == 0) {
        splice @lines, $i, 1;
        @lines1 = @lines;
        @lines2 = splice @lines1, $i, $#lines1 - $i + 1;
    }
}

The code aside, though, it seems that just reading in these files takes too long. What can I do (do I need a faster computer, read line by line, optimize code, etc.)?

A very odd happening (at least. . . to me)
by dimmesdale (Friar) on Jun 24, 2002 at 15:45 UTC

    Edit: This was originally its own root level question. tye moved it here to merge the two related threads.

    Okay, here's the story.
    I wrote a perl script to analyze some data from a repository of large files (ranging roughly from 600 to 2500KB, the majority around 1300-1550KB). The first script I'll show you handled the lower end of the spectrum (~600KB) in about 3 to 5 minutes per file, and the higher end in an hour or more (MUCH more; try 12 hours). I could not explain this discrepancy myself.

    The REALLY interesting thing is that when I took the looping structure out (the stuff that gobbled up all the txt files in an array of directories) and just fed one of the mega-files (1550 or so KB) into the 'chunk' of the code, it took . . . about 25 seconds!

    Okay, so I've had that first script running about a week (with no hope of stopping!), this second one I'll show you (actually two scripts, one's a wrapper, repeatedly calling the other) handles a file in about 25 seconds! (now, I used some optimizations, but the original test I was talking about--taking out the for loops--was with the old, slow code).

    Now, someone out there has to know why (and it sure isn't me). I'm intrigued; what could it be?


      From a quick glance at the code, one of the first questions that comes to mind is "how many files are in these directories?" I suspect that part of the source of your slowness is that you read both the entire list of files, and the entire contents of each file into memory. If you alter your reading structure like so:
      opendir(DIR, "$base_dir\\$dir") or die "$dir failed to open: $!";
      while (my $file = readdir(DIR)) {
          next unless $file =~ /\.txt$/;
          # etc, etc.
          open(IN, "$full_name") || die "can't open $!";
          while (my $line = <IN>) {
              # processing
          }
          close(IN);
      }
      closedir(DIR);
      you won't have the overhead of all the memory allocation. In your second example there's a system call to a secondary perl script. That's going to be time consuming too. Consider making the second perl program a subroutine...that will avoid a fork, exec, and compile for every file you have.

      HTH

      /\/\averick
      OmG! They killed tilly! You *bleep*!!

      Well, there is a lot happening in the two pieces of code, so it is hard to say for sure what is making the second version faster. But one thing you have improved is replacing all that array copying with splicing. Even better would be to manipulate just indices of the single @lines array for your computations.

      -Mark
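      Along the lines of that suggestion, here is a minimal sketch (the data and the split index are made up stand-ins, not the real file contents) of computing both averages from slices of the single @lines array, so no extra named copies of the file are ever built:

      ```perl
      use strict;
      use warnings;

      # Stand-in data: in the real script @lines holds the file's lines and
      # $split would be the index found by the zero-detection pass.
      my @lines = (1 .. 10);
      my $split = 4;    # pretend element 4 is the line to drop

      # Average over slices of the one @lines array instead of copying it
      # into @lines1 and @lines2.
      my ($sum1, $sum2) = (0, 0);
      $sum1 += $_ for @lines[0 .. $split - 1];
      $sum2 += $_ for @lines[$split + 1 .. $#lines];

      my $avg1 = $sum1 / $split;             # 4 elements before the split
      my $avg2 = $sum2 / ($#lines - $split); # 5 elements after the split
      print "$avg1 $avg2\n";                 # 2.5 8
      ```

      The slices are read in place; nothing is spliced out or stored in separate arrays, which is where the copying cost in the original goes.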
      I (and probably others here) would be curious to see what is going on with your code, but when it's just thrown at the community in a big heap, it's not likely to be looked at. Reading raw code isn't much fun. I would suggest
      • Simplifying the code as much as possible. Debugging statements and obviously unimportant details can be safely removed.
      • Presenting pseudo-code along with the original code.
      • Giving a high-level English description of your algorithm.
      Doing these things will make the problem more accessible to the reader, and may even help you understand what's going on in your own code.

      You might also want to outline what you have figured out about the problem so far. This is partly to help the reader, and partly to prove that you have invested your own time and interest. Sure, it's both annoying and a waste of your time to have to prove to a bunch of strangers in writing that you have "R'ed the FM" etc., but CS/IT people often seem to demand that proof before deigning to answer questions.

      /s

        I would also offer the point that your commenting style's a little difficult to follow. Consider throwing a little more whitespace at this code to make it more readable. (For instance: When a variable has a sufficiently advanced use as to require a long comment, consider putting a blank line above and below it, to visually separate it from the rest of the variables.)

        I realize that's not quite what needs to be done with the code, but believe me -- the easier your code is to read, the more likely you're going to get helpful responses.

        In case you're curious, there is a largish list of recommended practices for whitespace, comments, etc. Nothing there is set in stone, but nobody will complain if you do the things it outlines. ;)

        -----------------------
        You are what you think.

      UPDATED: First, let me apologize: my computer is running slowly because of the program, I've been at it forever, and I'm a little bit frustrated. I got happier this morning when I thought I'd made an improvement in the speed, but now I'm not so sure.

      Description of problem
      The resulting file 'averages.txt' is incorrect. The averages are wrong (they are too low). I went back to the 'slow' code (which was tested, and did work), but it's not working anymore, it seems. I would be VERY grateful if anyone could help me with a solution that works and won't take five weeks. See below for a description of the code.

      The above was the problem. HOWEVER, now that I have that under control (the array @avg_frmt is correct under debugging tests) I have an odd problem. Nothing is printing to the averages.txt file. It is being created, but it is blank.

      Description of Code
      The directories have 12 files each that are of interest to us (i.e., .txt). These files contain 30-60 thousand lines of data, in the following format:
      "(time)\t(hrt)\t(skt)\t(emg)\t(0 or 5)"
      (They are measured values that I'm trying to analyze)

      Here's an example of a chunk:

      0           61.2245  83.129   0.000128174  0
      0.000333333 61.2245  83.1305  0.000128174  0
      0.000666667 61.2245  83.132   0.000109863  0
      0.001       61.2245  83.129   0.000115967  5
      0.00133333  61.2245  83.132   0.000115967  5
      0.00166667  61.2245  83.1305  0.00012207   5
      0.002       61.2245  83.132   0.000115967  5
      0.00233333  61.2245  83.132   0.00012207   5
      0.00266667  61.2245  83.132   0.000115967  5
      0.003       61.2245  83.132   0.00012207   5
      0.00333333  61.2245  83.132   0.00012207   5
      0.00366667  61.2245  83.1335  0.000134277  5
      0.004       61.2245  83.132   0.000140381  5
      0.00433333  61.2245  83.1305  0.00012207   5
      0.00466667  61.2245  83.132   0.000134277  5
      0.005       61.2245  83.132   0.000115967  5
      0.00533333  61.2245  83.1335  0.000128174  5
      0.00566667  61.2245  83.1335  0.00012207   5
      0.006       61.2245  83.132   0.000134277  5
      0.00633333  61.2245  83.1351  0.000134277  5
      The 0 at the end represents the push of a button (a 5 means no push). It separates the data into two conditions (the first average we want and the second). The zeros at the beginning represent the start, so we treat those just as if they were 5s. However, there is a group of zeros in the middle that we are interested in. Take the data from the first line up to the first 0 of that middle group and average the desired values; THEN, from the last zero of the middle group, average until the end.

      Files with the name RAREEVENT in them we ignore.

      I'd be glad to clarify anything.

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $base_dir = 'G:\Test Data';
      my @included_dirs = ('Ts1', 'Ts10', 'Ts12', 'Ts13', 'Ts14', 'Ts15', 'Ts16',
                           'Ts17', 'Ts18', 'Ts19', 'Ts2', 'Ts20', 'Ts21',
                           'Ts22', 'Ts23', 'Ts24', 'Ts25', 'Ts26', 'Ts27', 'Ts3',
                           'Ts4', 'Ts5', 'Ts6', 'Ts7', 'Ts8', 'Ts9');
      my @files;

      for my $dir (@included_dirs) {
          opendir(DIR, "$base_dir\\$dir") or die "$dir failed to open: $!";
          @files = grep { /\.txt$/ } readdir(DIR);
          closedir(DIR);
          print "$dir\n";
          for my $file (@files) {
              next if $file =~ /RAREEVENT/;
              print "\t$file\n";
              my $arg1 = $file;
              my $arg2 = "$base_dir\\$dir";
              process_file($arg1, $arg2);
          }
      }

      sub process_file {
          my $i_file;      ## Name of input file
          my $dir_path;    ## Directory path for $i_file
          my $full_name;   ## '$dir_path\$i_file'
          my $avg_file = ">>averages.txt"; ## File where file averages are written to (append mode)
          my $ln_num = 0;  ## Line number in current file, used in @zeroat array to mark zeros
          my $i = 0;       ## Reference counter
          my $sum1 = 0;    ## HRT sum
          my $sum2 = 0;    ## SKT sum
          my $sum3 = 0;    ## EMG sum
          my $avg11 = 0;   ## HRT avg 1
          my $avg12 = 0;   ## HRT avg 2
          my $avg21 = 0;   ## SKT avg 1
          my $avg22 = 0;   ## SKT avg 2
          my $avg31 = 0;   ## EMG avg 1
          my $avg32 = 0;   ## EMG avg 2
          my @files;       ## Array to hold desired filenames for current folder
          my @avg_frmt;    ## Array to hold formatting for $avg_file document (i.e., the formatted output)
          my @lines;       ## Array to hold lines of current file
          my @lines1;      ## Holds first part to be averaged
          my @lines2;      ## Holds second part to be averaged
          my @zeroat;      ## Tells where zeros are at in @lines (holds the line number of the zeros; an index to @lines)

          $i_file   = shift;  ## Get file name from @_
          $dir_path = shift;  ## Get directory path from @_
          $full_name = "$dir_path\\$i_file";

          open(IN, $full_name) or die "$i_file failed to open: $!";
          @lines = <IN>;  ## Give file input to @lines
          close IN;

          ## Retrieve desired rows
          for my $curline (@lines) {
              $curline =~ /.*?\t.*?\t.*?\t.*?\t([05])/;  ## parse line
              $zeroat[$i++] = $ln_num if $1 == 0;
              $ln_num++;
          }

          ## Take Average
          ## Get all points between the starting and ending points, and separate into different arrays
          LOOP: for my $i (@zeroat) {  ## $i is an index in @lines to where a zero is at
              $lines[$i] =~ /(.*?)\t.*?\t.*?\t.*?\t([05])/;  ## parse line
              if ($1 > .5 && $2 == 0) {
                  @lines1 = @lines[0..$i-1];        ## @lines1 equals the first $i-1 elements of @lines
                  @lines2 = @lines[$i+1..$#lines];  ## @lines2 equals everything past the $i+1 element of @lines
                  last LOOP;
              }  ## the zero is in the middle: split for averaging
          }

          ## Reset sums
          $sum1 = 0; $sum2 = 0; $sum3 = 0;
          for my $i (@lines1) {  ## go through first part and average
              $i =~ /.*?\t(.*?)\t(.*?)\t(.*?)\t[05]/;  ## parse line
              $sum1 += $1;
              $sum2 += $2;
              $sum3 += $3;
          }
          ## Get first average
          $avg11 = $sum1/$#lines1;
          $avg21 = $sum2/$#lines1;
          $avg31 = $sum3/$#lines1;

          ## Reset sums
          $sum1 = 0; $sum2 = 0; $sum3 = 0;
          for my $i (@lines2) {  ## go through second part and average
              $i =~ /.*?\t(.*?)\t(.*?)\t(.*?)\t[05]/;  ## parse line
              $sum1 += $1;
              $sum2 += $2;
              $sum3 += $3;
          }
          ## Get second average
          $avg12 = $sum1/$#lines2;
          $avg22 = $sum2/$#lines2;
          $avg32 = $sum3/$#lines2;

          ## Put averages into tab delimited columns with desired format: file name followed by a tab,
          ## followed by averages; first line is resting condition; second line is cloud condition.
          $avg_frmt[0] = "$i_file\t$avg11\t$avg21\t$avg31\n";  ## HRT, SKT, EMG is the
          $avg_frmt[1] = "$i_file\t$avg12\t$avg22\t$avg32\n";  ## order for the averages

          ## Open and print averages to $avg_file
          open(OUT, $avg_file) or die "$avg_file failed to be created: $!";
          print OUT @avg_frmt;
      }

      Added closing code tag - dvergin 2002-06-24

        It won't make a miraculous difference, but you might try parsing each line only once at the top of the function, rather than re-parsing them each time, i.e.:
        my @lines = map { [ /.*?\t(.*?)\t(.*?)\t(.*?)\t(\d)/ ] } <IN>;
        # ... rest of function
        Regex matching can be expensive, so if you're doing the same match multiple times, it's usually better to do it once and save the results.

        I also noticed you're using $average = $sum / $#things to take an average. Despite appearances, $#things isn't the number of elements in @things; it's the last index, which is one less. Instead, you'll want to use $average = $sum / @things, since an array evaluates to its length in scalar context.

        /s
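        To make the $#things point concrete, here is a tiny illustration (the variable names are invented for the example):

        ```perl
        use strict;
        use warnings;

        my @things = (2, 4, 6);
        my $sum = 0;
        $sum += $_ for @things;       # 12

        # $#things is the highest index (2), one less than the element count.
        my $wrong = $sum / $#things;  # 12 / 2 = 6 -- not the mean
        my $right = $sum / @things;   # 12 / 3 = 4 -- @things in scalar context is the count
        print "$wrong $right\n";      # 6 4
        ```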

Re: Processing large files many times over
by kvale (Monsignor) on Jun 24, 2002 at 18:38 UTC
    A couple of possibilities occur to me. First, depending on the amount of RAM you have and the other processes running, reading whole files in may consume enough memory to start swapping, which will really slow things down. So it may be best to process one file at a time and read line by line, constructing your array as you go, rather than splicing the whole array representing your file.

    Second, the first regexp could be written as
    next unless $curline =~ /([05])\s*$/;
    eliminating a reverse operation on each line.

    -Mark
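    For the curious, the two patterns can be compared with the core Benchmark module. This sketch (the sample line is made up to match the format described above) first checks that both extract the same trailing flag, then times them:

    ```perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # A made-up sample line in the "(time)\t(hrt)\t(skt)\t(emg)\t(0 or 5)" format.
    my $line = "0.001\t61.2245\t83.129\t0.000115967\t5\n";

    # Both patterns pull out the same flag...
    my ($rev) = (reverse $line) =~ /^\s*([05])/;
    my ($end) = $line =~ /([05])\s*$/;
    print "rev=$rev end=$end\n";   # rev=5 end=5

    # ...but the end-anchored match avoids reversing every line.
    cmpthese(-1, {
        reversed => sub { (reverse $line) =~ /^\s*([05])/ },
        anchored => sub { $line =~ /([05])\s*$/ },
    });
    ```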
      I know that 'thank you' replies aren't much welcomed (I seem to remember a few nasty comments a while back), but I have to THANK YOU tremendously. It's the computer's RAM! I can't believe I didn't think of it (well, actually, it's not that hard to think I could have missed it). The files are taking only 5 seconds each (thus far). You saved me from (more) countless headaches! It seemed the slower my programs were going the faster I was getting depressed.
Re: Processing large files many times over
by maverick (Curate) on Jun 24, 2002 at 18:43 UTC
    Given this little snippet, it appears that the goal of the first loop is to find all the lines that end with 0 and record their positions in an array. The next loop then reads that array. So, you're iterating over your data twice, but you don't need to. You've got one complete copy of the file in memory, and another nearly complete copy. It takes time to create and destroy those copies. You don't need to do that either.
    open(IN, $my_file) || die "Can't open: $!";
    while (my $line = <IN>) {
        next unless $line =~ /0\s*$/;
        $line =~ /^(\d+.?\d*)/;
        if ($1 > .5) {
            # etc, etc
        }
    }
    here's the break down line by line
    • Open the file
    • Process each line in turn. So, we're not storing the whole thing in memory (takes time to allocate said memory)
    • The if $1 == 0 part of your first loop only kept lines that ended in 0 for @zeroat. Just test directly for the line ending in 0 and throw away the rest.
    • no need for the regexp to catch the last character, we already know it's a 0
    • Test only the > .5 part, again no need for the $2 == 0 since we already threw away every line that didn't end in 0.
    • Rest of processing
    HTH

    /\/\averick
    OmG! They killed tilly! You *bleep*!!
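    Putting those points together, here is a hedged single-pass sketch of the whole averaging job, under the data-format assumptions described earlier in the thread; the subroutine name and the tiny in-memory data set are invented for illustration, and the marker block is assumed to be the flag-0 lines with time > .5:

    ```perl
    use strict;
    use warnings;

    # One pass, line by line: nothing is slurped, no arrays of lines are built.
    sub average_two_parts {
        my ($fh) = @_;
        my (@sum, @cnt);
        my $part = 0;    # 0 = before the button press, 1 = after
        while (my $line = <$fh>) {
            chomp $line;
            my ($time, $hrt, $skt, $emg, $flag) = split /\t/, $line;
            if ($flag == 0 && $time > .5) {   # marker line: switch parts, don't average
                $part = 1;
                next;
            }
            my @vals = ($hrt, $skt, $emg);    # leading zeros (time <= .5) fall through here
            $sum[$part][$_] += $vals[$_] for 0 .. 2;
            $cnt[$part]++;
        }
        # Return [hrt, skt, emg] averages for each part, dividing by the count.
        return map { my $p = $_;
                     [ map { $sum[$p][$_] / $cnt[$p] } 0 .. 2 ] } 0, 1;
    }

    # Tiny made-up data set: two lines, a button-press marker, two more lines.
    my $data = join '', map { join("\t", @$_) . "\n" }
        [0.1, 1, 2, 3, 5], [0.2, 3, 4, 5, 5],
        [0.6, 9, 9, 9, 0],
        [0.7, 5, 6, 7, 5], [0.8, 7, 8, 9, 5];
    open my $fh, '<', \$data or die $!;
    my ($before, $after) = average_two_parts($fh);
    print "@$before | @$after\n";   # 2 3 4 | 6 7 8
    ```

    Reading from an in-memory scalar here is just for the demo; pointing the filehandle at each real file keeps memory flat no matter how big the files get.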
