Beefy Boxes and Bandwidth Generously Provided by pair Networks Frank
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Parallel::Forkmanager and large hash, running out of memory

by mabossert (Sexton)
on Apr 24, 2013 at 14:35 UTC ( #1030423=perlquestion: print w/ replies, xml ) Need Help??
mabossert has asked for the wisdom of the Perl Monks concerning the following question:

I have searched and found several alternatives including DB_File and different flavors of Tie for the hash. Here is my problem:

I am reading in several thousand files and converting from a TSV format to RDF. The process is pretty simple and works quite well except for one part. I need to be able to look up a value in one of the files and find corresponding values in another (this is only true of one particular file type). When I first started, I was able to (just barely) store a hash containing the contents of the needed files for lookups (really, its a join, ;-)). The overall size of the files to be processed is over a terabyte and growing each time...I am working on a large server that has 24 CPU's and 512GB of memory...but the use of the forked processes has only exacerbated the problem....so, the more data I am getting, the fewer parallel processes I can run...and now have gotten to the point where even one process is running of memory.

I would appreciate any suggestions for how to handle the lookup efficiently...I did see a few posts related to Bloom::Filter...which looks promising, but frankly, I am not comfortable with too many false positives...and am not sure how to handle those.

Update: Thanks to all for the suggestions...I ended up retooling the code to dump needed values into a table in a MySql DB...I have my fingers crossed as it is crunching through the data now...Thanks again!

Comment on Parallel::Forkmanager and large hash, running out of memory
Re: Parallel::Forkmanager and large hash, running out of memory
by talexb (Canon) on Apr 24, 2013 at 15:04 UTC

    It seems you must be leaving out some important ionformation.

    If you're converting a single TSV into an RDF, and just doing that is choking on a machine with half a terabyte of RAM, there's clearly something horribly wrong.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      Sorry if I was not clear enough. I am processing thousands of files. The files that contain the values that I am loading into a hash are at about 1800 files right now.

        Oh -- so you're not processing a single file at once -- that's what I took away from your OP. My mistake.

        If you're loading stuff into a hash, and *that's* overflowing memory, then it sounds like you'll need another approach, and the one that comes to mind right away is to use a database.

        Alex / talexb / Toronto

        "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Parallel::Forkmanager and large hash, running out of memory
by salva (Monsignor) on Apr 24, 2013 at 15:07 UTC
    If you data does not fit in the available memory you will have to use an algorithm that does not require random access to the data.

    In practice that means you have to use one of the following alternatives:

    • Compacting/compressing your data in some way so that it fits in the available RAM
    • Sorting the data and then processing it sequentially.
    • Using a multi-pass approach. The data is divided in ranges that fit on the available memory and then, you process *all* the data repeatly but only considering the data in one range every time.

      So, the keys and values I am storing are hashes similar to MD5...so, this approach would probably work...but not sure about how to go about it. Can you give me a pointer in the right direction? I don't need a step-by-step...but maybe an existing module for this?

      Yeah, I had a similar problem recently: first remove duplicates from two files A and B, and then comparing the two files to output four files: data in A and not in B, data in B and not in A and two files with common data (well records having the same key from each file). Often, both operations can be done very efficiently using hashes (or using CPAN modules that use hashes). But my specific problem here is that the files were simply too big to fit in memory (about 15 gigabytes each), even only storing only the comparison keys did not work.

      I tried various solutions using the various tied hashes and DBM modules, but it turned out that loading the data into DBM files was awfully slow.

      I finally decided to sort the files (using the Unix utility) and process sequentially each file to remove the duplicates and then reading the two files in parallel to detect the "orphan" records and the common records. This ended up to be quite efficient (about 30 minutes for sorting the data and removing duplicates, and another 20 minutes to dispatch the records where they should go. I have of course no idea whether the same can apply to the OP's requirement.

      This is getting slightly off-topic, but I actually made (or started to make) a relatively generic module to do this, because this is something that we have to do quite frequently, but each time with different data format (although usually CSV type of format) and different comparison keys. This module is still under development, as I am adding new functionalities into it (like, once I have the "common" files, i.e. records with the same comparison key, comparing the data). The good thing with this approach is that the files need to be sorted only once; once they are sorted, all the various operations can be done one after the other, the records remain sorted. Once this module is finished, I am hoping to make is available on the CPAN, but I will first need to learn how to build a CPAN distribution, as I have never done that yet (even though I am using and testing this module at work, I am developing it entirely at home on my free time, so that I can be the only owner of the code and so that my company cannot object to open source distribution).

      Now, reading this thread, I realize that it could be that an SQLlite or MySQL solution might work as well or possibly even better. But then, I am not sure that it will be easy to obtain the installation of these products on our production servers (especially that I cannot prove that this would be useful without first trying).

Re: Parallel::Forkmanager and large hash, running out of memory
by Random_Walk (Parson) on Apr 24, 2013 at 15:16 UTC

    If I understand the crux of the matter is: You need to look up data in a file that is too big to hold in RAM?

    Perhaps you can build an index for the file that is smaller and then use that to look into the file. If performance is an issue memoizing your lookup may help. But then again premature optimization and all that...

    You may also want to put it into a DB. SQLite is great way if you just want to create it then use it for lookups. Not so great if you need concurrent access, then you may need a real DB server

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!

      Gotcha...thanks for the quick response. Will SQLite handle large sizes? I thought it had a 2GB limit...or am I smoking proverbial crack?

Re: Parallel::Forkmanager and large hash, running out of memory
by sundialsvc4 (Monsignor) on Apr 24, 2013 at 19:59 UTC

    Strange as it may initially seem to suggest it ... perhaps the very best approach to this problem would be to create a simple-minded program that “finds one TSV file and converts it to RDF,” then, if necessary, to (e.g. from the command-line ...) spawn as many concurrent copies of “that one simple-minded program” as you know that you have CPUs.   Reduce the problem to a simple subset that “can be parallelized, if necessary,” among a collection of one-or-more processes that do not (have to) care if other instances of themselves exist.   “One solitary instance” can solve the problem.   “n instances” can merely do it faster.   Q.E.D.

      I wish it were that simple...unfortunately, the needed lookup values are distributed across a couple thousand files and it is not possible to predict which file it will be found in...

      As-is, I am running the code in parallel using Parallel::Forkmanager, which seems to be working just fine...as long as I don't run out of memory ;-)

        A problem like that one could be handled by scanning all the files ahead of time and pushing the lookup values into a database table.   This would avoid the need to “look for” the answers you want, which could largely defeat your efforts at parallelization.   A pre-scanner could loop through the directory, query to see if it has seen this particular file before, and if not, grab the lookups and store them.   Each time, it would only consider new files.   (In the database table, you could also note whether a particular file had already been processed.   Something like an SHA1 hash could be used to recognize changes.)

        This, once again, could be used to reduce the problem to a single-process handler that can be run in parallel with itself on the same and/or different systems.

        By all means, if you have now hit-upon a procedure that works, I am not suggesting that you rewrite it.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1030423]
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (5)
As of 2014-04-21 02:47 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (489 votes), past polls