http://www.perlmonks.org?node_id=1030423

mabossert has asked for the wisdom of the Perl Monks concerning the following question:

I have searched and found several alternatives, including DB_File and the various tied-hash modules. Here is my problem:

I am reading in several thousand files and converting them from a TSV format to RDF. The process is pretty simple and works quite well except for one part: I need to be able to look up a value in one of the files and find the corresponding values in another (this is only true of one particular file type). When I first started, I was able to (just barely) store a hash containing the contents of the needed files for lookups (really, it's a join ;-)). The overall size of the files to be processed is over a terabyte and growing with each run. I am working on a large server that has 24 CPUs and 512GB of memory, but forking worker processes has only exacerbated the problem: the more data I get, the fewer parallel processes I can run, and I have now gotten to the point where even one process is running out of memory.
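(For context, the DB_File route mentioned above moves the hash out of process memory and into an on-disk BerkeleyDB file, so each forked worker only pays for the OS page cache. A minimal sketch follows; the file names and column positions are placeholders for illustration, not taken from the actual code.)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie the lookup hash to an on-disk B-tree so it is not held in RAM.
    my %lookup;
    tie %lookup, 'DB_File', 'lookup.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
        or die "Cannot tie lookup.db: $!";

    # Load phase: key/value columns from the "join" file
    # (column positions 0 and 1 are assumptions).
    open my $fh, '<', 'join_file.tsv' or die "open: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = (split /\t/, $line)[0, 1];
        $lookup{$key} = $value;
    }
    close $fh;

    # Lookup phase: reads go through the disk cache, not the heap.
    my $val = $lookup{'some_key'};    # undef if the key is absent

    untie %lookup;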

I would appreciate any suggestions on how to handle the lookup efficiently. I did see a few posts related to Bloom::Filter, which looks promising, but frankly I am not comfortable with too many false positives and am not sure how to handle them.
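(For what it's worth, a Bloom filter produces false positives but never false negatives, so the usual pattern is to use it as a cheap pre-filter: a "no" definitively skips the expensive lookup, and every "yes" is confirmed against the authoritative store. A minimal sketch using CPAN's Bloom::Filter; the capacity, error rate, file name, and key column are illustrative assumptions:)

    use strict;
    use warnings;
    use Bloom::Filter;

    # Size the filter for the expected key count; both numbers
    # here are illustrative and should be tuned to the real data.
    my $filter = Bloom::Filter->new(
        capacity   => 1_000_000,
        error_rate => 0.001,
    );

    # Load phase: add every join key from the lookup file.
    open my $fh, '<', 'join_file.tsv' or die "open: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($key) = (split /\t/, $line)[0];    # assumed key column
        $filter->add($key);
    }
    close $fh;

    # Query phase: false means definitely absent, so most records
    # skip the slow path entirely; true may be a false positive.
    my $candidate = 'some_key';
    if ($filter->check($candidate)) {
        # Roughly error_rate of these hits are false positives, so
        # confirm against the real store (DB_File, MySQL, ...) here;
        # the confirming lookup is what keeps the result correct.
        print "'$candidate' may be present, confirming...\n";
    }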

Update: Thanks to all for the suggestions. I ended up retooling the code to dump the needed values into a table in a MySQL DB. I have my fingers crossed as it is crunching through the data now. Thanks again!
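(A rough sketch of what that MySQL retooling could look like with DBI, for anyone landing here later. The connection parameters, table layout, and column positions are placeholders, not the poster's actual code; an indexed key column is what keeps the per-record join fast.)

    use strict;
    use warnings;
    use DBI;

    # Connection details are placeholders.
    my $dbh = DBI->connect(
        'DBI:mysql:database=rdf_lookup;host=localhost',
        'user', 'password',
        { RaiseError => 1, AutoCommit => 0 },
    );

    # One-time load: the primary key on k gives indexed lookups.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS lookup (
            k VARCHAR(255) NOT NULL,
            v TEXT,
            PRIMARY KEY (k)
        )
    });

    my $ins = $dbh->prepare('INSERT INTO lookup (k, v) VALUES (?, ?)');
    open my $fh, '<', 'join_file.tsv' or die "open: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = (split /\t/, $line)[0, 1];
        $ins->execute($key, $value);
    }
    close $fh;
    $dbh->commit;    # one commit avoids per-row transaction overhead

    # Per-record lookup during the TSV-to-RDF conversion.
    my $sel = $dbh->prepare('SELECT v FROM lookup WHERE k = ?');
    $sel->execute('some_key');
    my ($value) = $sel->fetchrow_array;

    $dbh->disconnect;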