Dogg has asked for the wisdom of the Perl Monks concerning the following question:
I'm going to write a little script that will generate random sequences of letters and then work with an outside program to analyze these sequences. The entire sequence space is greater than 4x10^15, so I'll want to test more than 100,000 sequences.
But I only want to test each sequence once. My first thought was to put each analyzed sequence into a hash and then check the hash for a match before starting analysis of a new sequence. Is there a point, though, where a hash becomes problematic in terms of size? My other thought was to just check a file (all sequences and the analysis results will have to be written to a file) for the new sequence. This seems like it would take quite a bit longer, however.
Any thoughts?
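A minimal sketch of the hash approach. The alphabet and sequence length below are assumptions for illustration: a four-letter (DNA-style) alphabet with length-26 sequences gives 4**26 ≈ 4.5x10^15 possibilities, in the ballpark of the sequence space quoted above, and `analyze` is a hypothetical stand-in for the hand-off to the outside program:

```perl
use strict;
use warnings;

# Assumed parameters -- the real alphabet and length may differ:
my @alphabet = qw(A C G T);
my $len      = 26;            # 4**26 ~ 4.5e15 possible sequences
my $target   = 100_000;       # how many unique sequences to test

my %seen;                     # one key per sequence already analyzed

while (keys %seen < $target) {
    my $seq = join '', map { $alphabet[ int rand @alphabet ] } 1 .. $len;
    next if $seen{$seq}++;    # skip duplicates; the ++ records it either way
    # analyze($seq);          # hypothetical call out to the external program
}
```

With 100,000 keys of 26 characters each, the hash stays comfortably small; the `$seen{$seq}++` idiom both tests and records membership in one step.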
Re (tilly) 1: Any limit to large hashes?
by tilly (Archbishop) on Dec 07, 2000 at 19:26 UTC
Perl's hashing algorithm computes a 32-bit number and then does a shift trick, which could cause keys to start colliding noticeably more often after about 100,000,000 elements. Of course, if you are hashing that much data you are likely to run into the fact that a Perl process (even on 64-bit systems) cannot handle much more than about 2 GB of memory (assuming you have that much available between RAM and swap).

If this is an issue, you can tie the hash to a tempfile using DB_File; that is fine for data sets up to your available disk space, your maximum filesize (on many systems that is 2 GB), or IIRC about 100 terabytes, depending on how Berkeley DB was compiled.

If you have only 100,000 or so keys in your hash, I wouldn't worry about it. :-)
Re: Any limit to large hashes?
by arturo (Vicar) on Dec 07, 2000 at 19:21 UTC
Your hashes can get as big as your system memory (RAM and swap) will allow, subject to OS limitations such as ulimit. Hashes are generally more memory-intensive than arrays, but a hash w/ 100K keys doesn't, AFAIK, present Perl with any special problem (as long as the memory is there).
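One way to check the "as long as the memory is there" part empirically is the CPAN module Devel::Size (an assumption -- it is not in core and may not be installed). This sketch builds a hash shaped like the problem (100K keys of 26 characters each) and reports its total footprint:

```perl
use strict;
use warnings;
use Devel::Size qw(total_size);   # CPAN module, not in core Perl

# Build a hash shaped like the problem: 100_000 keys, 26 chars each
my %h = map { sprintf('%026d', $_) => 1 } 1 .. 100_000;

printf "hash uses %.1f MB\n", total_size(\%h) / (1024 * 1024);
```

`total_size` walks the whole structure, keys and values included, so it gives a more honest number than guessing from key lengths alone.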
Philosophy can be made out of anything. Or less -- Jerry A. Fodor
Re: Any limit to large hashes?
by Dogg (Scribe) on Dec 08, 2000 at 01:03 UTC
Thanks for the advice. I went with a simple hash and was able to buzz through a test of 1,000 sequences in 3 seconds. I then kicked it up a notch and tried to do 2.5 million. It looks like it ran linearly up until the memory was full (on an SGI Indigo2 with 192 MB RAM).
I put the time points I collected below if anyone's interested. It looks like I'll need something more advanced to get significantly above 2 million sequences.
Thanks for the help.
Minutes   Sequences (thousands)
   1.0       50
   2.0       87
   3.0      131
   4.0      174
   5.0      218
   7.5      326
  10.0      435
  12.0      523
  15.0      652
  17.5      762
  69.0     2097
  77.0     2097
  90.0     2097
If you have Berkeley DB installed, slip this in at the beginning (where %seq_hash is whatever your big hash is):

use Fcntl;     # supplies the O_RDWR and O_CREAT flag constants
use DB_File;
tie(%seq_hash, "DB_File", "tmp.db", O_RDWR|O_CREAT);

This will be significantly slower, but should reduce the memory requirements drastically.