http://www.perlmonks.org?node_id=1023666

Laurent_R has asked for the wisdom of the Perl Monks concerning the following question:

Hello dear esteemed PerlMonks,

I am using a tied SDBM file to store a large volume of data and I get the following error after about 20 minutes of running time:

sdbm store returned -1, errno 22, key "0677202576" at -e line 11, <> line 15830957.

As can be seen, the keys for my hash are just ten-digit phone numbers, so the problem has nothing to do with the key being too long.

It seems more likely that I have hit some hard limit on the size of the DBM file. The sizes of the DBM files after the failure are:

-rwxr-xr-x 1 prod dqd     262144 2013-03-15 10:23 DBM_DOS.dir
-rwxr-xr-x 1 prod dqd 2147429376 2013-03-15 10:23 DBM_DOS.pag

The size of the .pag file, 2147429376 bytes, is just under 2^31 (2147483648), which may be a hard limit in the underlying C libraries.
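To make the arithmetic explicit, here is a small check of how far the .pag file is from the 2^31 boundary. The interpretation as a signed 32-bit file offset limit is an assumption, not something verified on this system, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Size of DBM_DOS.pag as reported by ls after the failure.
my $pag_size = 2_147_429_376;

# 2^31: one past the largest value of a signed 32-bit integer.
# Assumption: the DBM library uses a signed 32-bit file offset.
my $limit = 2**31;

# Bytes left before the file would reach the boundary: 54272.
my $remaining = $limit - $pag_size;

print "limit:     $limit\n";
print "remaining: $remaining bytes\n";
```

So the file stopped growing about 53 KB short of 2^31 bytes, which fits the idea of a 2 GB ceiling but does not prove it.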

The platform is AIX 6.1.6.15.

Please note that I have also tried dbmopen, NDBM and ODBM, with a similar problem: I am not able to load all my data (the input has 30.4 million records) with any of these.
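For context, the loading pattern looks roughly like the sketch below. It is not the actual one-liner: the function name load_into_sdbm, the "key;value" input format and the file names are assumptions made for illustration. It ties a hash to an SDBM file and traps the exception raised when a store fails, so that the number of records loaded before the failure can be reported.

```perl
use strict;
use warnings;
use Fcntl;        # exports O_RDWR and O_CREAT
use SDBM_File;

# Hypothetical loader: reads "key;value" lines from an open filehandle
# and stores them in an SDBM file. Returns the number of records stored.
sub load_into_sdbm {
    my ($dbm_path, $in_fh) = @_;

    tie my %h, 'SDBM_File', $dbm_path, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $dbm_path: $!";

    my $count = 0;
    while (my $line = <$in_fh>) {
        chomp $line;
        my ($key, $value) = split /;/, $line, 2;
        next unless defined $value;    # skip malformed lines

        # A failed store dies with "sdbm store returned -1, ...",
        # so wrap it in eval to stop cleanly instead of crashing.
        my $ok = eval { $h{$key} = $value; 1 };
        unless ($ok) {
            warn "Store failed after $count records: $@";
            last;
        }
        $count++;
    }

    untie %h;
    return $count;
}
```

A call such as `load_into_sdbm('DBM_DOS', \*STDIN)` would create DBM_DOS.dir and DBM_DOS.pag in the current directory. Trapping the error does not remove the size limit; it only makes the failure point easier to see.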

I have found another way of doing what I needed, but it would still be very nice if I could use tied hashes on such volumes of data. Does any of you know a recipe or workaround to make it possible with tied hashes?

As a side note, this problem has led me to develop a file comparison script that reads two sorted files, A and B, in parallel and writes three output files: records that are in both A and B, records that are only in A, and records that are only in B. I looked for modules doing this type of file comparison and there does not seem to be any (or they use hashes, which makes them unusable for large data sets). Since this is something I do regularly, I have put this utility in a module that I can easily reuse. The question is: in your opinion, would it be useful to make this module available on CPAN? I am asking because it would require quite a bit of additional work on my part to make it more general-purpose than it currently is, and I would not want to do this additional work unless it can be really useful to other people (not to mention the other things that would have to be done, like providing a test suite and installation procedures and scripts, which I have absolutely no clue how to do).
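The comparison described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the module itself: the function name compare_sorted_files and its arguments are hypothetical, and it assumes one record per line, both files sorted in the same string order (as compared by Perl's cmp), and no duplicate records within a file.

```perl
use strict;
use warnings;

# Hypothetical helper: walk two sorted files in parallel and split the
# records into "in both", "only in A" and "only in B" output files.
# Only one record per input file is held in memory at any time.
sub compare_sorted_files {
    my ($file_a, $file_b, $out_both, $out_only_a, $out_only_b) = @_;

    open my $in_a,   '<', $file_a     or die "Cannot open $file_a: $!";
    open my $in_b,   '<', $file_b     or die "Cannot open $file_b: $!";
    open my $both,   '>', $out_both   or die "Cannot open $out_both: $!";
    open my $only_a, '>', $out_only_a or die "Cannot open $out_only_a: $!";
    open my $only_b, '>', $out_only_b or die "Cannot open $out_only_b: $!";

    # Read the next record from a handle; returns undef at end of file.
    my $next = sub {
        my ($fh) = @_;
        my $line = <$fh>;
        chomp $line if defined $line;
        return $line;
    };

    my $rec_a = $next->($in_a);
    my $rec_b = $next->($in_b);

    while (defined $rec_a and defined $rec_b) {
        my $cmp = $rec_a cmp $rec_b;
        if ($cmp == 0) {       # same record in both files
            print {$both} "$rec_a\n";
            $rec_a = $next->($in_a);
            $rec_b = $next->($in_b);
        }
        elsif ($cmp < 0) {     # A is behind: record exists only in A
            print {$only_a} "$rec_a\n";
            $rec_a = $next->($in_a);
        }
        else {                 # B is behind: record exists only in B
            print {$only_b} "$rec_b\n";
            $rec_b = $next->($in_b);
        }
    }

    # One file is exhausted; whatever remains in the other is unmatched.
    while (defined $rec_a) {
        print {$only_a} "$rec_a\n";
        $rec_a = $next->($in_a);
    }
    while (defined $rec_b) {
        print {$only_b} "$rec_b\n";
        $rec_b = $next->($in_b);
    }

    close $_ or die "Close failed: $!" for $in_a, $in_b, $both, $only_a, $only_b;
    return;
}
```

It could be called as `compare_sorted_files('A.txt', 'B.txt', 'both.txt', 'only_a.txt', 'only_b.txt')`, where all five file names are placeholders. Because it never builds a hash, memory use stays constant regardless of the size of the inputs, at the cost of requiring both files to be sorted beforehand.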