http://www.perlmonks.org?node_id=1030434


in reply to Parallel::Forkmanager and large hash, running out of memory

If your data does not fit in the available memory, you will have to use an algorithm that does not require random access to the data.

In practice that means you have to use one of the following alternatives:

  • sort the data (for instance with the Unix sort utility) so that related records end up next to each other, and then process the sorted stream sequentially (see the sketch below);

  • keep the data on disk instead of in a Perl hash, for instance in a DBM file or in a database such as SQLite or MySQL.
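
For instance, a minimal sketch of the sort-and-stream approach (the file name big_data.txt and the key-in-the-first-field layout are just assumptions for illustration): the external sort does the heavy lifting on disk, and the script only ever holds the records for one key in memory.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Let the external Unix sort arrange the records by key on disk,
    # then stream the sorted output line by line.
    open my $sorted, '-|', 'sort', '-k1,1', 'big_data.txt'
        or die "cannot run sort: $!";

    my ($current_key, @group);
    while (my $line = <$sorted>) {
        chomp $line;
        my ($key) = split ' ', $line, 2;
        if (defined $current_key and $key ne $current_key) {
            process_group($current_key, \@group);
            @group = ();
        }
        $current_key = $key;
        push @group, $line;
    }
    process_group($current_key, \@group) if defined $current_key;
    close $sorted;

    # Only the records sharing a single key are in memory at any time.
    sub process_group {
        my ($key, $records) = @_;
        printf "%s => %d record(s)\n", $key, scalar @$records;
    }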


Re^2: Parallel::Forkmanager and large hash, running out of memory
by Laurent_R (Canon) on Apr 24, 2013 at 21:22 UTC

    Yeah, I had a similar problem recently: first remove the duplicates from two files A and B, and then compare the two files to output four files: data in A and not in B, data in B and not in A, and two files with the common data (well, records having the same key in each file). Often, both operations can be done very efficiently using hashes (or CPAN modules that use hashes). But my specific problem was that the files were simply too big to fit in memory (about 15 gigabytes each); even storing only the comparison keys did not work.

    I tried various solutions based on tied hashes and the DBM modules, but it turned out that loading the data into DBM files was awfully slow.
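
    (To illustrate the kind of thing that turned out to be slow: a tied-hash load with one of the DBM modules looks roughly like the sketch below. DB_File and the file name are only an example, not necessarily what was actually used; the point is that every store goes through the on-disk file, which adds up over hundreds of millions of records.)

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # Tie a Perl hash to an on-disk Berkeley DB file so the keys
        # live on disk instead of in RAM.
        tie my %seen, 'DB_File', 'keys.db', O_RDWR | O_CREAT, 0666, $DB_HASH
            or die "cannot tie keys.db: $!";

        while (my $line = <STDIN>) {
            my ($key) = split ' ', $line, 2;
            print $line unless $seen{$key}++;   # keep the first occurrence only
        }

        untie %seen;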

    I finally decided to sort the files (using the Unix sort utility), process each file sequentially to remove the duplicates, and then read the two files in parallel to detect the "orphan" records and the common records. This ended up being quite efficient (about 30 minutes to sort the data and remove the duplicates, and another 20 minutes to dispatch the records where they should go). I have of course no idea whether the same approach can apply to the OP's requirement.
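
    For what it is worth, the parallel read is a classic merge of two sorted files; a minimal sketch (file names, key extraction and output names are all invented for the example) could look like this:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Both inputs are assumed to be sorted on their key (first field
        # here) and already de-duplicated.
        open my $fa, '<', 'A_sorted_unique.txt' or die $!;
        open my $fb, '<', 'B_sorted_unique.txt' or die $!;
        open my $only_a,   '>', 'A_not_B.txt'   or die $!;
        open my $only_b,   '>', 'B_not_A.txt'   or die $!;
        open my $common_a, '>', 'A_common.txt'  or die $!;
        open my $common_b, '>', 'B_common.txt'  or die $!;

        my $key_of = sub { (split ' ', $_[0], 2)[0] };
        my $la = <$fa>;
        my $lb = <$fb>;

        while (defined $la and defined $lb) {
            my $cmp = $key_of->($la) cmp $key_of->($lb);
            if    ($cmp < 0) { print {$only_a} $la; $la = <$fa>; }
            elsif ($cmp > 0) { print {$only_b} $lb; $lb = <$fb>; }
            else {
                # Same key in both files: a "common" record for each side.
                print {$common_a} $la;
                print {$common_b} $lb;
                $la = <$fa>;
                $lb = <$fb>;
            }
        }

        # Whatever is left in one file has no counterpart in the other.
        print {$only_a} $la, <$fa> if defined $la;
        print {$only_b} $lb, <$fb> if defined $lb;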

    This is getting slightly off-topic, but I actually made (or started to make) a relatively generic module to do this, because it is something we have to do quite frequently, each time with a different data format (although usually some kind of CSV) and different comparison keys. The module is still under development, as I keep adding new functionality to it (such as, once I have the "common" files, i.e. the records sharing the same comparison key, comparing the rest of the data). The nice thing about this approach is that the files need to be sorted only once: after that, all the various operations can be done one after the other, since the records remain sorted.

    Once the module is finished, I am hoping to make it available on the CPAN, but I will first need to learn how to build a CPAN distribution, as I have never done that yet (even though I am using and testing this module at work, I am developing it entirely at home in my free time, so that I remain the sole owner of the code and my company cannot object to open-source distribution).

    Now, reading this thread, I realize that an SQLite or MySQL solution might work just as well or possibly even better. But then, I am not sure it would be easy to get these products installed on our production servers (especially since I cannot prove that this would be useful without trying it first).
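
    (For the record, the SQLite route should need nothing beyond the DBI and DBD::SQLite modules, since DBD::SQLite embeds the engine and the database is just a local file. A minimal sketch, with a made-up table layout and tab-separated input, might look like this:)

        use strict;
        use warnings;
        use DBI;

        # The whole database is a single local file; no server to install.
        my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do('CREATE TABLE IF NOT EXISTS records (k TEXT PRIMARY KEY, v TEXT)');

        my $ins = $dbh->prepare(
            'INSERT OR IGNORE INTO records (k, v) VALUES (?, ?)');

        while (my $line = <STDIN>) {
            chomp $line;
            my ($key, $value) = split /\t/, $line, 2;
            $ins->execute($key, $value);    # duplicate keys are silently skipped
        }

        $dbh->commit;      # one big transaction keeps the bulk load reasonably fast
        $dbh->disconnect;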

Re^2: Parallel::Forkmanager and large hash, running out of memory
by mabossert (Scribe) on Apr 24, 2013 at 15:27 UTC

    So, the keys and values I am storing are hashes similar to MD5... so this approach would probably work... but I am not sure how to go about it. Can you give me a pointer in the right direction? I don't need a step-by-step... but maybe an existing module for this?