http://www.perlmonks.org?node_id=596081


in reply to How to deal with Huge data

First of all, what is the nature of the text file? Does it have repeated keys inside the same file?

Using hashes does consume a lot of memory... but you can always divide and conquer. ;-)

Supposing that you have two different files that you want to merge, you could first try to reduce each one by looking for repeated keys and summing their numeric data.
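
For example, a first pass over one of the files could look something like this (a minimal sketch, assuming tab-separated lines with a key followed by a single numeric column; the file names are made up):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Collapse repeated keys by summing their values, then write a smaller file.
  my %sum;
  open my $in, '<', 'file_a.txt' or die "Cannot open file_a.txt: $!";
  while ( my $line = <$in> ) {
      chomp $line;
      my ( $key, $value ) = split /\t/, $line;
      $sum{$key} += $value;    # repeated keys collapse into a single total
  }
  close $in;

  open my $out, '>', 'file_a.reduced.txt' or die "Cannot write file_a.reduced.txt: $!";
  print {$out} "$_\t$sum{$_}\n" for sort keys %sum;
  close $out;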

Of course, this depends on being able to reduce each file to an acceptable size. If that is not possible, you could consider working with slices of the file or using a database; since a database keeps the data on disc, this should work. You can use any database, but DBD::DBM looks ideal for your needs.
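
A rough sketch of the DBD::DBM idea, assuming the same tab-separated key/value layout; the table and column names are invented for the example, and a simple two-column table fits DBD::DBM's defaults:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use DBI;

  # The running totals live in a DBM file on disc, not in a Perl hash.
  my $dbh = DBI->connect( 'dbi:DBM:', undef, undef, { RaiseError => 1 } );
  $dbh->do('CREATE TABLE totals (id TEXT, total INTEGER)');

  my $sel = $dbh->prepare('SELECT total FROM totals WHERE id = ?');
  my $ins = $dbh->prepare('INSERT INTO totals (id, total) VALUES (?, ?)');
  my $upd = $dbh->prepare('UPDATE totals SET total = ? WHERE id = ?');

  open my $in, '<', 'huge_file.txt' or die "Cannot open huge_file.txt: $!";
  while ( my $line = <$in> ) {
      chomp $line;
      my ( $id, $value ) = split /\t/, $line;
      $sel->execute($id);
      if ( my ($old) = $sel->fetchrow_array ) {
          $upd->execute( $old + $value, $id );    # key already seen: add to its total
      }
      else {
          $ins->execute( $id, $value );           # first time this key appears
      }
  }
  close $in;
  $dbh->disconnect;

It will be slower than an in-memory hash, but the memory footprint stays flat no matter how big the file gets.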

What do you mean by saying "The cols are different in each file (but not always ...)"? Does this mean the columns are in different places? It is easier to normalize that first (put all columns and values in a previously defined position) and then start working. For instance, once the files have their columns in the correct positions, you can even forget about jumping over the first line: the program can easily print the column names in the output file later.
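
The normalization step itself is simple. A sketch, assuming tab-separated files and an invented list of canonical column names:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Whatever order the columns arrive in, write them out in one agreed order.
  # @wanted is an assumption -- replace it with the real column names.
  my @wanted = qw(id sample_1 sample_2 sample_3);

  open my $in,  '<', 'input.txt'      or die "Cannot open input.txt: $!";
  open my $out, '>', 'normalized.txt' or die "Cannot write normalized.txt: $!";

  chomp( my $header = <$in> );
  my @names = split /\t/, $header;
  my %pos;
  @pos{@names} = 0 .. $#names;    # column name -> position in this particular file

  print {$out} join( "\t", @wanted ), "\n";
  while ( my $line = <$in> ) {
      chomp $line;
      my @fields = split /\t/, $line;
      # columns missing from this file become empty strings, keeping the layout fixed
      print {$out} join( "\t",
          map { defined $pos{$_} && defined $fields[ $pos{$_} ] ? $fields[ $pos{$_} ] : '' } @wanted ), "\n";
  }
  close $in;
  close $out;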

You're using hash references inside hashes... the more complicated the structures you use, the more memory the program will require.

Some other tips:

  1. If you're on a UNIX OS, convert the DOS newline characters (\r\n) to UNIX newlines (\n) before you start processing the files. Then you can avoid the regular expression and use chomp to remove the newline instead.
  2. Do not declare a variable with my every time the program enters a loop. Declare the variable before the loop and then, inside the loop, once the variable is no longer needed, just clear the value it holds. This is faster.
  3. Use Benchmark to check whether changes like these actually pay off (a sketch follows this list).
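
Since Benchmark is recommended anyway, here is a sketch of how tips 1 and 2 could be measured against each other; the sample data is made up and kept in memory so that only the Perl code is timed, not the disc:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  # 10_000 fake tab-separated lines, already converted to UNIX newlines.
  my @lines = map { join( "\t", 'key' . $_, 1 .. 10 ) . "\n" } 1 .. 10_000;

  cmpthese( -3, {
      # regex strip plus "my" inside the loop (the style being discussed)
      regex_and_my => sub {
          for my $line (@lines) {
              ( my $copy = $line ) =~ s/[\r\n]//g;
              my @t = split /\t/, $copy;
          }
      },
      # chomp plus an array declared once, outside the loop
      chomp_and_predeclared => sub {
          my @t;
          for my $line (@lines) {
              my $copy = $line;
              chomp $copy;
              @t = split /\t/, $copy;
          }
      },
  } );

Whatever numbers come out, they only matter if they survive once the real file I/O is back in the picture.
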
Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill

Re^2: How to deal with Huge data
by chromatic (Archbishop) on Jan 23, 2007 at 23:06 UTC
    Do not declare a variable with my every time the program enters a loop. Declare the variable before the loop and then, inside the loop, once the variable is no longer needed, just clear the value it holds. This is faster.

    The OP is doing I/O. How could this possibly matter, if it's even true?

      I didn't quite understand what "OP" means, but anyway... the tip is a bit off-topic, since it's not related to the memory issue. But it is a tip anyway.

      Doing things like the code below:

      my @samples;    # used below but never declared in the original snippet
      my @t;
      T: while ( my $line = <GSE> ) {
          $line =~ s/[\r\n]//g;
          @t = split( /\t/, $line );    # the original read $ligne here -- a typo for $line
          if ( $. == 1 ) {
              shift(@t);
              @samples = @t;
              next T;
          }
          # ... (rest of the loop body was not shown in the post) ...
          @t = ();    # empty the array instead of letting my re-create it each pass
      }

      This should avoid a memory allocation every time the variable would otherwise be created and destroyed. If it really does not work like that, please let me know.

      Alceu Rodrigues de Freitas Junior
      ---------------------------------
      "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill

        That's just noise. Even if Perl doesn't keep around an AV internally to avoid the cost of reallocating a variable (and I believe there's an optimization which does exactly that), look at all of the other, more expensive, work in that snippet:

        • Reading a line from a file. Here's the biggest time sink: doing system calls, seek times, transferring data across multiple busses, checking for cache hits and paying for cache misses, running through any IO layers....
        • Doing an unanchored regular expression with a character class; that means examining every character in the string and allocating and building an entirely new string--and just try to guess beforehand how long that new string needs to be.
        • Creating new SVs for every tab-separated element in the line.

        You have to do a tremendous amount of optimization before hoisting your variable declaration out of the loop makes any measurable difference, and that's if Perl doesn't do that optimization already. Besides that, changing the memory layout of your program probably has a bigger effect on performance, if you take I/O out of the picture. What if you create an extra page fault per loop by needing an extra page? What if you fragment memory more this way? How do you even measure this in a meaningful way?

        Thus I say it's a silly pseudo-optimization.