Parallel::Forkmanager and large hash, running out of memory

mabossert has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parallel::Forkmanager and large hash, running out of memory by salva (Canon) on Apr 24, 2013 at 15:07 UTC
If you data does not fit in the available memory you will have to use an algorithm that does not require random access to the data. In practice that means you have to use one of the following alternatives: Compacting/compressing your data in some way so that it fits in the available RAM Sorting the data and then processing it sequentially. Using a multi-pass approach. The data is divided in ranges that fit on the available memory and then, you process all the data repeatly but only considering the data in one range every time.	[reply]
Re^2: Parallel::Forkmanager and large hash, running out of memory by Laurent_R (Canon) on Apr 24, 2013 at 21:22 UTC
Yeah, I had a similar problem recently: first remove duplicates from two files A and B, and then comparing the two files to output four files: data in A and not in B, data in B and not in A and two files with common data (well records having the same key from each file). Often, both operations can be done very efficiently using hashes (or using CPAN modules that use hashes). But my specific problem here is that the files were simply too big to fit in memory (about 15 gigabytes each), even only storing only the comparison keys did not work. I tried various solutions using the various tied hashes and DBM modules, but it turned out that loading the data into DBM files was awfully slow. I finally decided to sort the files (using the Unix utility) and process sequentially each file to remove the duplicates and then reading the two files in parallel to detect the "orphan" records and the common records. This ended up to be quite efficient (about 30 minutes for sorting the data and removing duplicates, and another 20 minutes to dispatch the records where they should go. I have of course no idea whether the same can apply to the OP's requirement. This is getting slightly off-topic, but I actually made (or started to make) a relatively generic module to do this, because this is something that we have to do quite frequently, but each time with different data format (although usually CSV type of format) and different comparison keys. This module is still under development, as I am adding new functionalities into it (like, once I have the "common" files, i.e. records with the same comparison key, comparing the data). The good thing with this approach is that the files need to be sorted only once; once they are sorted, all the various operations can be done one after the other, the records remain sorted. Once this module is finished, I am hoping to make is available on the CPAN, but I will first need to learn how to build a CPAN distribution, as I have never done that yet (even though I am using and testing this module at work, I am developing it entirely at home on my free time, so that I can be the only owner of the code and so that my company cannot object to open source distribution). Now, reading this thread, I realize that it could be that an SQLlite or MySQL solution might work as well or possibly even better. But then, I am not sure that it will be easy to obtain the installation of these products on our production servers (especially that I cannot prove that this would be useful without first trying).	[reply]
Re^2: Parallel::Forkmanager and large hash, running out of memory by mabossert (Scribe) on Apr 24, 2013 at 15:27 UTC
So, the keys and values I am storing are hashes similar to MD5...so, this approach would probably work...but not sure about how to go about it. Can you give me a pointer in the right direction? I don't need a step-by-step...but maybe an existing module for this?	[reply]
Re: Parallel::Forkmanager and large hash, running out of memory by Random_Walk (Prior) on Apr 24, 2013 at 15:16 UTC
If I understand the crux of the matter is: You need to look up data in a file that is too big to hold in RAM? Perhaps you can build an index for the file that is smaller and then use that to look into the file. If performance is an issue memoizing your lookup may help. But then again premature optimization and all that... You may also want to put it into a DB. SQLite is great way if you just want to create it then use it for lookups. Not so great if you need concurrent access, then you may need a real DB server Cheers, R. Pereant, qui ante nos nostra dixerunt!	[reply]
Re^2: Parallel::Forkmanager and large hash, running out of memory by mabossert (Scribe) on Apr 24, 2013 at 15:25 UTC
Gotcha...thanks for the quick response. Will SQLite handle large sizes? I thought it had a 2GB limit...or am I smoking proverbial crack?	[reply]
Re^3: Parallel::Forkmanager and large hash, running out of memory by Anonymous Monk on Apr 24, 2013 at 15:30 UTC
https://sqlite.org/limits.html	[reply]
Re^4: Parallel::Forkmanager and large hash, running out of memory by mabossert (Scribe) on Apr 24, 2013 at 16:42 UTC
Re^5: Parallel::Forkmanager and large hash, running out of memory by pokki (Monk) on Apr 24, 2013 at 18:37 UTC
Re: Parallel::Forkmanager and large hash, running out of memory by talexb (Chancellor) on Apr 24, 2013 at 15:04 UTC
It seems you must be leaving out some important ionformation. If you're converting a single TSV into an RDF, and just doing that is choking on a machine with half a terabyte of RAM, there's clearly something horribly wrong. Alex / talexb / Toronto "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds	[reply]
Re^2: Parallel::Forkmanager and large hash, running out of memory by mabossert (Scribe) on Apr 24, 2013 at 15:22 UTC
Sorry if I was not clear enough. I am processing thousands of files. The files that contain the values that I am loading into a hash are at about 1800 files right now.	[reply]
Re^3: Parallel::Forkmanager and large hash, running out of memory by talexb (Chancellor) on Apr 24, 2013 at 15:30 UTC
Oh -- so you're not processing a single file at once -- that's what I took away from your OP. My mistake. If you're loading stuff into a hash, and that's overflowing memory, then it sounds like you'll need another approach, and the one that comes to mind right away is to use a database. Alex / talexb / Toronto "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds	[reply]
Re: Parallel::Forkmanager and large hash, running out of memory by sundialsvc4 (Abbot) on Apr 24, 2013 at 19:59 UTC
Strange as it may initially seem to suggest it ... perhaps the very best approach to this problem would be to create a simple-minded program that “finds one TSV file and converts it to RDF,” then, if necessary, to (e.g. from the command-line ...) spawn as many concurrent copies of “that one simple-minded program” as you know that you have CPUs. Reduce the problem to a simple subset that “can be parallelized, if necessary,” among a collection of one-or-more processes that do not (have to) care if other instances of themselves exist. “One solitary instance” can solve the problem. “n instances” can merely do it faster. Q.E.D.	[reply]
Re^2: Parallel::Forkmanager and large hash, running out of memory by mabossert (Scribe) on Apr 24, 2013 at 21:57 UTC
I wish it were that simple...unfortunately, the needed lookup values are distributed across a couple thousand files and it is not possible to predict which file it will be found in... As-is, I am running the code in parallel using Parallel::Forkmanager, which seems to be working just fine...as long as I don't run out of memory ;-)	[reply]
Re^3: Parallel::Forkmanager and large hash, running out of memory by sundialsvc4 (Abbot) on Apr 25, 2013 at 13:05 UTC
A problem like that one could be handled by scanning all the files ahead of time and pushing the lookup values into a database table. This would avoid the need to “look for” the answers you want, which could largely defeat your efforts at parallelization. A pre-scanner could loop through the directory, query to see if it has seen this particular file before, and if not, grab the lookups and store them. Each time, it would only consider new files. (In the database table, you could also note whether a particular file had already been processed. Something like an SHA1 hash could be used to recognize changes.) This, once again, could be used to reduce the problem to a single-process handler that can be run in parallel with itself on the same and/or different systems. By all means, if you have now hit-upon a procedure that works, I am not suggesting that you rewrite it.	[reply]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks