Any limit to large hashes?

by Dogg (Scribe)
on Dec 07, 2000 at 19:05 UTC

Dogg has asked for the wisdom of the Perl Monks concerning the following question:

I'm going to write a little script that will generate random sequences of letters and then work with an outside program to analyze these sequences. The entire sequence space is greater than 4x10^15, so I'll want to test more than 100,000 sequences.

But I only want to test each sequence once. My first thought was to put each analyzed sequence into a hash and then check the hash for a match before starting analysis of a new sequence. Is there a point, though, where a hash becomes problematic in terms of size? My other thought was to just check a file (all sequences and the analysis results will have to be written to a file) for the new sequence. This seems like it would take quite a bit longer, however.
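
A rough sketch of that hash-based approach (the alphabet, sequence length, and analyze() call below are placeholders, not details from the actual script):

    use strict;
    use warnings;

    my @letters = ('A' .. 'Z');    # placeholder alphabet
    my $seq_len = 12;              # placeholder sequence length
    my %seen;                      # sequences already analyzed

    while (keys(%seen) < 100_000) {
        my $seq = join '', map { $letters[rand @letters] } 1 .. $seq_len;
        next if exists $seen{$seq};    # already tested, try another
        $seen{$seq} = 1;
        # analyze($seq);               # hand the new sequence to the outside program
    }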

Any thoughts?

Replies are listed 'Best First'.
Re (tilly) 1: Any limit to large hashes?
by tilly (Archbishop) on Dec 07, 2000 at 19:26 UTC
    Perl's hashing algorithm computes a 32-bit number and then does a shift trick that could result in the hashing starting to collide more often after about 100,000,000 elements. Of course, if you are hashing that much stuff you are likely to run into the fact that a Perl process is (even on 64-bit systems) unable to handle more than about 2 GB of memory (assuming you have that much available between RAM and swap).

    If this is an issue, you can tie the hash to a tempfile using DB_File. That is fine for data sets up to your available disk space, the maximum file size on your system (on many systems that is 2 GB), or, IIRC, about 100 terabytes (depending on how Berkeley DB was compiled).

    If you have only 100,000 or so keys in your hash, I wouldn't worry about it. :-)

Re: Any limit to large hashes?
by arturo (Vicar) on Dec 07, 2000 at 19:21 UTC

    Your hashes can get as big as your system memory (RAM and swap) will allow (subject to OS limitations such as ulimit). Hashes are generally more memory-intensive than arrays, but a hash with 100K keys doesn't, AFAIK, present Perl with any special problem (as long as the memory is there).

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Re: Any limit to large hashes?
by Dogg (Scribe) on Dec 08, 2000 at 01:03 UTC
    Thanks for the advice. I went with a simple hash and was able to buzz through a test of 1,000 sequences in 3 seconds. I then kicked it up a notch and tried 2.5 million. It looks like it ran linearly up until memory was full (on an SGI Indigo2 with 192 MB of RAM).

    I put the time points I collected below if anyone's interested. It looks like I'll need something more advanced to get significantly above 2 million sequences.

    Thanks for the help.
    Minutes   Thousands of sequences
     1.0000     50.000
     2.0000     87.000
     3.0000    131.00
     4.0000    174.00
     5.0000    218.00
     7.5000    326.00
    10.000     435.00
    12.000     523.00
    15.000     652.00
    17.500     762.00
    69.000    2097.0
    77.000    2097.0
    90.000    2097.0
      If you have BerkeleyDB installed, slip this in at the beginning (where %seq_hash is whatever your big hash is):
      use Fcntl;     # for the O_RDWR and O_CREAT flags
      use DB_File;
      tie(%seq_hash, "DB_File", "tmp.db", O_RDWR|O_CREAT, 0666);
      This will be significantly slower, but should reduce the memory requirements drastically.
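
      Once the hash is tied, the lookup code does not need to change; the only extra step is to untie the hash when the run finishes so everything is flushed out to tmp.db. A rough sketch (next_sequence() and analyze() are placeholders for whatever generates and tests a sequence):

        while (my $seq = next_sequence()) {    # placeholder generator
            next if exists $seq_hash{$seq};    # same check as with an in-memory hash
            $seq_hash{$seq} = 1;
            analyze($seq);                     # placeholder for the outside program
        }
        untie %seq_hash;                       # flush and close tmp.db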
