PerlMonks  

Disk based hash (as opposed to RAM based)

by techtruth (Novice)
on Oct 07, 2012 at 19:37 UTC ( #997703=perlquestion )
techtruth has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I would like to call upon your wisdom again today.

How can I create a hash that stores its data to a physical disk?

I have looked over some documentation and it looks like there are many ways to do this, but I can't seem to make sense of it. One method involved writing a special Perl module and defining my own functions for "store", "delete", etc. I would like to avoid that if possible. I also saw the use of tie and Tie::StdHash but found them confusing. Currently I am using tie with DB_File to tie a hash to a DB file, but am having trouble inserting new data.

I have a need to store roughly 5 GB of data as a hash of arrays, thus I need to not use my system's RAM. My problem comes when I attempt to push a new value onto an array.

Is there a simple way to do this that I am missing? My code functions without writing the hash to a file, but fails when I tie the hash to a disk. The speed of read/writes on the hash is still of importance to me, although I realize writing to disk is much slower than RAM.

Here is an example of my code:
    use DB_File;

    my %hash;
    unlink "tempfile";    # Remove previous file, if any
    tie %hash, "DB_File", "tempfile", O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot open file 'tempfile': $!\n";

    while ( $sourceString =~ /example(key)regex(value)example\b/ig ) {
        my $key   = $1;
        my $value = $2;
        push( @{ $hash{$key} }, $value );    # Push the value into the hash
    }
I understand that if I wrote my own handlers for "store", "delete", etc. I could make the values be appended to an array each time a new value was assigned, but I would like to stay away from hairy situations...

Update:

I need to store around 1000 values in each array.

Solved:

I have returned from MySQL land with a solution. Since my input data is formatted as strings, "value.key", I wrote them in bulk to a temporary file. I then used MySQL's LOAD DATA INFILE to populate a temporary table, and then used INSERT with combinations of MySQL's string functions to make a table with two columns: key and value. The insert took all the data in the temporary table and inserted it into the new 2-column table. Now I can "SELECT ... WHERE key equals" to emulate Perl's amazing hashes. Not as fast as I would like it to be, but I can process massive input files.
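A rough sketch of the SQL side of this approach (the table and file names are hypothetical, and the string splitting assumes the "value.key" input format described above):

```sql
-- Load the raw "value.key" lines into a one-column staging table.
LOAD DATA INFILE '/tmp/pairs.txt' INTO TABLE staging (line);

-- Split each line into its value and key parts.
INSERT INTO kv (k, v)
SELECT SUBSTRING_INDEX(line, '.', -1),   -- text after the last '.'
       SUBSTRING_INDEX(line, '.', 1)     -- text before the first '.'
FROM staging;

-- Emulate a Perl hash lookup by key.
SELECT v FROM kv WHERE k = 'somekey';
```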

Thank you monks. I accept and appreciate your wisdom.

Re: Disk based hash (as opposed to RAM based)
by Corion (Pope) on Oct 07, 2012 at 19:46 UTC

    How does your program fail?

    I remember that maybe DB_File (or BDB_File?) had a filesize limit of 4GB, or maybe a hash key count limit of some size.

Can't use string ("ARRAY(0x18fe8258)") as an ARRAY ref while "strict refs" in use at scrape.pl

      It seems to me that the hash is failing to read an array as an array ref. But push requires the first argument to be an array... This error only happens when I have the hash tied to the db file.

The size of the file isn't a worry to me. I said 5 GB as a rough example.

Re: Disk based hash (as opposed to RAM based)
by kcott (Abbot) on Oct 07, 2012 at 20:02 UTC
Re: Disk based hash (as opposed to RAM based)
by BrowserUk (Pope) on Oct 07, 2012 at 20:09 UTC

Unless there have been some recent changes, DB_File doesn't support storing nested structures as values; only scalar values.

    One way around the limitation would be to store your array values as joined strings:

    my $key   = $1;
    my $value = $2;
    $hash{ $key } .= ' ' . $value;    # Append the value to the hash entry

    And then split the string to recover the array when you need it.
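A minimal, untied sketch of that join/split round trip (a plain %hash stands in for the DB_File-tied one):

```perl
use strict;
use warnings;

# Plain hash standing in for the DB_File-tied one.
my %hash;

# Append each value to a single space-separated string per key.
for my $pair ( [ foo => 'a' ], [ foo => 'b' ], [ bar => 'c' ] ) {
    my ( $key, $value ) = @$pair;
    $hash{$key} = defined $hash{$key} ? "$hash{$key} $value" : $value;
}

# Recover the array when needed.
my @values = split ' ', $hash{foo};
print "@values\n";    # prints "a b"
```

This only works if the values can never contain the separator character, which is why the follow-up below suggests a serialiser for anything more diverse.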

The alternative is to use something like MLDBM or DBM::Deep or possibly DBD::SQLite.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong


      One way around the limitation would be to store your array values as joined strings:
      That works for simple strings, but once you get to more diverse strings, you should probably consider a serializer. I find that JSON::XS works great for this purpose, as does Data::Dumper with an eval. The latter method may not always be the best choice, though. JSON is more portable and can be read by another non-Perl application more easily.
      For example:

      # Using JSON::XS
      use strict;
      use warnings;
      use DB_File;
      use JSON::XS qw( encode_json );

      my %hash;
      unlink "tempfile";    # Remove previous file, if any
      tie %hash, "DB_File", "tempfile", O_RDWR | O_CREAT, 0666, $DB_HASH
          or die "Cannot open file 'tempfile': $!\n";

      while ( $sourceString =~ /example(key)regex(value)example\b/ig ) {
          my $key   = $1;
          my $value = $2;
          my $string_to_insert = encode_json( [ $key, $value ] );
          push( @{ $hash{$key} }, $string_to_insert );    # Push the value into the hash
      }
      # And then use decode_json() from JSON::XS somewhere else to get an array of your values.

      HTH

      ~Thomas~
      confess( "I offer no guarantees on my code." );

        I realise that you are trying to be helpful; but I do not think you have thought this through.

        • Firstly, the OP clearly states Hash of Arrays.

          Hence, catering for anything more is overkill.

        • More importantly, the OP's code clearly shows that he needs to build up the arrays piecemeal -- i.e. value by value.

          If he were to use a serialiser module for this, he would need to deserialise the current state of the appropriate array, add the latest new element, and then re-serialise, for each line in the file -- which would be horribly slow no matter which of the serialiser alternatives he used.

          The only other alternative would be to wait until each array was complete in memory before serialising and adding to DB_File, but that would mean waiting until the entire file had been read, and thus the entire structure would need to be held in memory before serialisation could be performed. And if he had the memory to do that, he wouldn't be looking to use a disk-based hash.

        For a one-off process, he might consider pre-sorting the input file by the key field, so that the contents of each (sub) array could be built up in memory before being serialised once, but for that to be a viable option requires a whole set of circumstances that are not in evidence from the OP.
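That pre-sort idea can be sketched like this (a plain hash stands in for the tied one, and the hard-coded input lines are a stand-in for the pre-sorted file):

```perl
use strict;
use warnings;

# Input pairs, already sorted by key (in real use, read these from
# the pre-sorted file and store into the DB_File-tied hash instead).
my @lines = ( "apple red", "apple green", "pear yellow" );

my %hash;
my ( $current, @values );
for my $line (@lines) {
    my ( $key, $value ) = split ' ', $line, 2;
    if ( defined $current and $key ne $current ) {
        $hash{$current} = join ' ', @values;    # serialise once per key
        @values = ();
    }
    $current = $key;
    push @values, $value;
}
$hash{$current} = join ' ', @values if defined $current;

print "$hash{apple}\n";    # prints "red green"
```

Because the input is sorted, each key's array is complete the moment the key changes, so each value is serialised exactly once.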



Re: Disk based hash (as opposed to RAM based)
by techtruth (Novice) on Oct 07, 2012 at 22:57 UTC

    I have looked around more and found http://search.cpan.org/~chorny/MLDBM-2.04/lib/MLDBM.pm

    This seemed to do the trick with a bit of modification to my code. However, now an error 22 is output, which means that the values I am storing under each key are too large. I suppose it is worth mentioning that my data is "wide" in the sense that each key will have 1000 or so values that I hoped to store in the array.

    use MLDBM;    # this gets SDBM and Data::Dumper
    use Fcntl;    # to get them constants

    my $dbm;
    my %hash;
    unlink 'testmldbm.dir';
    unlink 'testmldbm.pag';
    $dbm = tie %hash, 'MLDBM', 'testmldbm', O_CREAT|O_RDWR, 0640
        or die $!;

    while ( $html =~ m/example(key)regex(value)example/ig ) {
        my ( $key, $value ) = ( $1, $2 );
        my $tempArray = $hash{$key};
        push( @{ $tempArray }, $value );
        $hash{$key} = $tempArray;
        undef $tempArray;    # Just for fun-zies
    }

    MLDBM docs do say something about BerkeleyDB not having this limit, because of how it stores the values. (MLDBM apparently stores my array as a serialised string instead of a real array.) I will look into the BerkeleyDB approach... I was hoping to avoid using a database and use just a file, because otherwise I might as well dump it into a temporary table in my database (which the program also uses). I would just like to take advantage of Perl's amazing hashes if I could.

      I think BrowserUk's mention of DBM::Deep is on track. It should be able to handle whatever level of depth you want to throw at it, and arbitrarily long key lengths. The fact that it's written in pure-Perl is a good feature, as its compatibility with Perl's flexible data types should be higher than is sometimes symptomatic of solutions built in the more rigid environment of lower-level statically typed languages.
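For instance, a minimal DBM::Deep sketch (the filename is hypothetical) in which push works directly against the on-disk structure:

```perl
use strict;
use warnings;
use DBM::Deep;    # CPAN module, not core

unlink 'techtruth.db';    # hypothetical filename; start fresh
my $db = DBM::Deep->new( 'techtruth.db' );

# DBM::Deep ties nested structures transparently, so push works
# directly against the stored array -- no serialise/deserialise
# round trip in your own code:
$db->{somekey} = [];
push @{ $db->{somekey} }, $_ for qw( one two three );

print scalar @{ $db->{somekey} }, "\n";    # prints 3
```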


      Dave

Re: Disk based hash (as opposed to RAM based)
by aufflick (Deacon) on Oct 08, 2012 at 04:43 UTC
    If it were me, for that sort of data size I wouldn't be using any sort of tied hash. Do you need a hash to pass in to someone else's API? (In which case you couldn't assume they wouldn't force all the data into RAM at once anyway.)

    What's the scenario that you need to use it in? What format does your data come in? If you need fairly sequential access then there are a lot of storage/access options. If you need a lot of random access, I'd suggest importing the data into an SQLite database (a one-time process) and then referring to that data from your analysis script. SQLite will automatically do memory-based caching etc. for you, so it should be surprisingly fast (and free you from worrying about memory utilisation).
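A small sketch of that approach with DBI and DBD::SQLite (an in-memory database here for brevity; point dbname at a file to keep the data on disk):

```perl
use strict;
use warnings;
use DBI;    # requires DBD::SQLite

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1 } );
$dbh->do( 'CREATE TABLE kv ( k TEXT, v TEXT )' );

my $ins = $dbh->prepare( 'INSERT INTO kv ( k, v ) VALUES ( ?, ? )' );
$ins->execute( 'somekey', $_ ) for qw( one two three );

# Emulate @{ $hash{somekey} } with a keyed lookup.
my $values = $dbh->selectcol_arrayref(
    'SELECT v FROM kv WHERE k = ? ORDER BY rowid', undef, 'somekey' );
print "@$values\n";    # prints "one two three"
```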

Re: Disk based hash (as opposed to RAM based)
by techtruth (Novice) on Oct 09, 2012 at 03:18 UTC
    After asking myself a series of questions about how best to do this, I believe the best way is to load everything into a temporary database table and use the database to process all this data. I would love to use Perl's hashes (which is initially why I chose Perl), but this approach is the best I can come up with. I will post the code once I finish it. I am rolling my own code on this one and am not using an API or other library of code other than Perl and MySQL. My thought is to load the data, then use a combination of MySQL's LEFT, SUBSTR, and other string functions to populate a second column with the values I would use as a key. MySQL's SELECT ... WHERE should then let me retrieve data in a way that mirrors Perl's way of getting values from a hash by key. Does that make sense? I will post code later.
      I have ventured a little farther into MySQL land than I wanted to. Still working on an issue or two that isn't relevant to the monastery; I will return.
Re: Disk based hash (as opposed to RAM based)
by techtruth (Novice) on Nov 01, 2012 at 23:49 UTC
    solved. see above. :)
