Problems with SDBM

by Laurent_R (Canon)
on Mar 15, 2013 at 11:06 UTC [id://1023666]

Laurent_R has asked for the wisdom of the Perl Monks concerning the following question:

Hello dear esteemed PerlMonks,

I am using a tied SDBM file to store a large volume of data and I get the following error after about 20 minutes of running time:

sdbm store returned -1, errno 22, key "0677202576" at -e line 11, <> line 15830957.
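
For reference, here is a minimal sketch of the kind of tie I am using (the record layout in the split is only illustrative, not my actual format):

    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    # Tie the hash to the on-disk SDBM pair DBM_DOS.dir / DBM_DOS.pag
    tie my %dos, 'SDBM_File', 'DBM_DOS', O_RDWR | O_CREAT, 0640
        or die "Cannot tie DBM_DOS: $!";

    while (<>) {
        chomp;
        my ($phone, $value) = split /;/, $_, 2;   # illustrative record layout
        $dos{$phone} = $value;    # this store is what eventually fails with errno 22
    }
    untie %dos;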

As can be seen, the keys for my hash are just ten-digit phone numbers, so the problem has nothing to do with the key being too long.

It seems more likely that I have hit some physical limit on the size of the DBM files. The size of the DBM files after the failure is:

-rwxr-xr-x 1 prod dqd     262144 2013-03-15 10:23 DBM_DOS.dir
-rwxr-xr-x 1 prod dqd 2147429376 2013-03-15 10:23 DBM_DOS.pag

The file size 2147429376 is just under 2^31 (2,147,483,648 bytes), which may be a physical limit in the underlying C libraries.

The platform is running on AIX 6.1.6.15.


Please note that I have also tried dbmopen, NDBM and ODBM, with a similar problem: I am not able to load all of my data (the input has 30.4 million records) with any of these.

I have found another way of doing what I needed, but it would still be very nice if I could use tied hashes on such volumes of data. Do any of you know a recipe or workaround to make it possible with tied hashes?

As a side note, this problem led me to develop a file comparison script in which I read two sorted files, A and B, in parallel and extract the data into three output files: records that are in both A and B, records that are only in A, and records that are only in B. I looked for modules doing this type of file comparison and there does not seem to be any (or they use hashes, which makes them unusable for large data sets). Since this is something I do regularly, I have put this utility in a module that I can easily reuse.

The question is: in your opinion, would it be useful to make this module available on CPAN? I am asking because it would require quite a bit of additional work on my part to make it more general-purpose than it currently is, and I would not want to do this additional work unless it can really be useful to other people (not to speak of the other things that would have to be done, like providing a test suite and installation procedures and scripts, which I have absolutely no clue how to do).
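
To illustrate the idea, here is a stripped-down sketch of that comparison (the file names and the whole-line comparison are simplifications of what the module actually does):

    use strict;
    use warnings;

    # Read two sorted files in parallel and split the records into
    # "in both", "only in A" and "only in B".
    open my $fa, '<', 'a.sorted'       or die "a.sorted: $!";
    open my $fb, '<', 'b.sorted'       or die "b.sorted: $!";
    open my $both,   '>', 'both.out'   or die "both.out: $!";
    open my $only_a, '>', 'only_a.out' or die "only_a.out: $!";
    open my $only_b, '>', 'only_b.out' or die "only_b.out: $!";

    my $rec_a = <$fa>;
    my $rec_b = <$fb>;
    while (defined $rec_a and defined $rec_b) {
        if    ($rec_a lt $rec_b) { print $only_a $rec_a; $rec_a = <$fa>; }
        elsif ($rec_a gt $rec_b) { print $only_b $rec_b; $rec_b = <$fb>; }
        else                     { print $both   $rec_a; $rec_a = <$fa>; $rec_b = <$fb>; }
    }
    # Whatever remains in either file is unique to that file.
    while (defined $rec_a) { print $only_a $rec_a; $rec_a = <$fa>; }
    while (defined $rec_b) { print $only_b $rec_b; $rec_b = <$fb>; }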

Replies are listed 'Best First'.
Re: Problems with SDBM
by BrowserUk (Patriarch) on Mar 15, 2013 at 11:14 UTC
    Do any of you know a recipe or workaround to make it possible with tied hashes?

    BerkeleyDB doesn't suffer the 2GB limitation, and it's faster than the *dbm modules for lookups.
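
    A minimal sketch of what that might look like (the file name and value are just placeholders, not taken from your setup):

        use BerkeleyDB;

        # Tie a hash to a Berkeley DB file; no 2 GB .pag limit here.
        tie my %h, 'BerkeleyDB::Hash',
            -Filename => 'phones.db',
            -Flags    => DB_CREATE
          or die "Cannot open phones.db: $! $BerkeleyDB::Error";

        $h{'0677202576'} = 'whatever you need to store';
        untie %h;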


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Problems with SDBM
by Tux (Canon) on Mar 15, 2013 at 12:48 UTC

    Every module that can tie a hash to some persistence mechanism has pros and cons. It depends not only on the size of your data or the number of elements in the hash, but also on how the hash is used. How do reads compare to writes? Is it write once, read often? Is the reason to tie resource limits, or is it persistence?

    YMMV across architectures and the type of data stored.

    See this table, this table, this graph, this graph, this graph, and this graph for speed comparisons. They compare DB_File with other serializer modules. I wanted to see the results after I wrote Tie::Hash::DBD, which I created after I ran into serious trouble when DB_File hit resource limits and caused data corruption.
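
    For example, a minimal Tie::Hash::DBD tie to a SQLite backend looks something like this (the DSN and file name are only an illustration):

        use strict;
        use warnings;
        use Tie::Hash::DBD;

        # Persist the hash in a SQLite file instead of a *dbm pair.
        tie my %hash, 'Tie::Hash::DBD', 'dbi:SQLite:dbname=cache.db';

        $hash{'0677202576'} = 'value';   # reads and writes go through DBI
        untie %hash;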


    Enjoy, Have FUN! H.Merijn

      Hi, thanks everyone for the answers already provided.

      The main reason to tie is resource limits: the input data has about 30 million records (slightly less than 2 GB) and that is just too large for a hash (an untied hash, that is). Having said that, persistence would also be a bonus, because later processes would use the same data and would not have to load it again. But persistence is not the primary reason for using tied hashes.

      I am not too concerned with speed at this point (although it might become important later, given the large data volume); my concern is that the process fails when I have loaded only about half of the data (15.8 million records), presumably because of the large volume of data. I could use several tied hashes to get around this size limit, but that would be rather awkward and unwieldy (and not very scalable).
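
      Just to illustrate what I mean by several tied hashes, it might look roughly like this (sharding on the last digit of the key; the file names and record layout are only illustrative):

        use strict;
        use warnings;
        use Fcntl;
        use SDBM_File;

        # One SDBM file per trailing digit, so no single .pag file reaches 2 GB.
        my @shard;
        for my $i (0 .. 9) {
            my %h;
            tie %h, 'SDBM_File', "DBM_DOS.$i", O_RDWR | O_CREAT, 0640
                or die "Cannot tie DBM_DOS.$i: $!";
            $shard[$i] = \%h;
        }

        while (<>) {
            chomp;
            my ($phone, $value) = split /;/, $_, 2;
            $shard[ substr $phone, -1 ]{$phone} = $value;
        }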

      It seems that Berkeley DB is not available on our system, so it will not be an option.

      Those are the most confusing graphs I've ever seen.

      For example: both this & this are labelled "Write records per second", carry the same numbers on the x-axis, and list the same (10) DBs in the legend; but they are totally different graphs. On one, only 7 lines; on the other, 9 of the 10.

      DB_File hit resource limits and caused data corruption.

      What resource limit?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        The graphs are part of a talk, so you are missing the spoken explanation here :)

        The graphs come in pairs. The second is a zoom-in on the bottom part of the first. You might notice that the lines that are high on the first graph of each set do not appear on the second. The colors might have hinted at this.

        The resource limits were mainly memory. At the start, most memory was available. Halfway through the long-running process, the system also needed (lots of) memory for other processes and started swapping. The tied hashes were about 4 GB each (4 of them).


        Enjoy, Have FUN! H.Merijn

      after I ran into serious trouble when DB_File hit resource limits and caused data corruption.

      Which backend, which db_version?

      Another benchmark at SQLite vs CDB_File vs BerkeleyDB

        I have to guess here, as it is too long ago to be sure, and the trouble hit the fan at a customer site with fewer resources than where they had tested the script (which had to run a long analysis on two databases that took close to 30 hours, which makes it obvious why data corruption after 20+ hours is not an option).

        I started Tie::Hash::DBD in August 2010, which makes me assume we ran perl-5.10.1/64all on HP-UX 11.11 (at the customer site) with DB_File-1.020 targeting libdb-4.2.52 (after which I stopped upgrading, as Oracle made it close to impossible to port).

        Is that what you wanted to know?


        Enjoy, Have FUN! H.Merijn

        New actual numbers (higher is better), now including BerkeleyDB:

        updated with *DBM_File columns (compressed the output a bit to make it "fit")

Linux 3.4.33-2.24-desktop [openSUSE 12.2 (Mantis)] i386 Core(TM) i7-2620M CPU @ 2.70GHz/800(4) i686 7969 Mb
This is perl 5, version 16, subversion 3 (v5.16.3) built for i686-linux-64int

  Size  op    GDBM    NDBM    ODBM    SDBM  DB_File  CDB_File  BerkDB  Redis  Redis2  SQLite     Pg  mysql  CSV
------  --  ------  ------  ------  ------  -------  --------  ------  -----  ------  ------  -----  -----  ---
    20  rd   32573   27972   27855  165289    24752   1111111   18587   4754    7186   30257   6197   3003  883
    20  wr   19685   10678    9182   20855     6361     26917    5762   4848    6289   11961   2107    723  953
   200  rd  142959  113636  116822  161550    62092   1333333   53404   5033    7507   37943   6312   1143  124
   200  wr   65189   54555   64578   89007    58479    221483   37800   7700    8325   25687   4092   1417  230
   600  rd  155925  114832  120992  183486    49285   1263157   43687   6366    7551   37657  11386    428    -
   600  wr  101437   71633   83148  109950    44886    444115   41649   8717    6311   27700   5081    670    -
  2000  rd  156311   97092  102202  138169    44295   1006036   39761   6209    8277   34599  10931    142    -
  2000  wr  100376   76438   82474  107060    40711    577700   39096   8724   12205   27475   6241    260    -
 20000  rd  141094   92384   94677  123507    49629    693096   43771   6098    8201   30522   9721      -    -
 20000  wr   94704   76299   80329  103297    30815    527676   29369   8284    8595   23866   5667      -    -
200000  rd  134909  110688   99839  138195    45577    677541   40508   5385    7658   30463   8482      -    -
200000  wr   51296   58657   59119   99944    28033    592327   26488   7949    9728   22878   5160      -    -

        Below is the script I run


        Enjoy, Have FUN! H.Merijn
