 
PerlMonks  

Optimizing Iterating over a giant hash

by oron (Novice)
on Dec 25, 2009 at 22:24 UTC ( #814383=perlquestion )
oron has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a VERY big hash I've built with the following structure:
$my_hash{$entry}{$timeStamp}{$uid} = number;
Basically, each (entry, timeStamp) pair should contain a list of users and a number for each user.
I've built this to (naturally) gather information that was not grouped the way I wanted.
The problem is that I later need to print the entire hash so that for each (entry, timestamp) pair I print the list of user x number on the same row, and this takes a LONG time (hours; the computer is still working).
I should mention that I can't use the "keys" keyword for iterating (keys %my_hash) because that leads to an "Out of memory!" message (big hash...).
What I've done so far is this (it works, but takes a long time):

my $temp;
while (($entry, $temp) = each %my_hash) {
    foreach $timeStamp (sort keys %$temp) {
        print $Hentry "$entry\t$timeStamp\t", scalar(keys %{$my_hash{$entry}{$timeStamp}});
        foreach my $id (sort keys %{$temp->{$timeStamp}}) {
            print $Hentry "\t$id\t", $my_hash{$entry}{$timeStamp}{$id};
        }
        print $Hentry "\n";
    }
}

That's the only thing that finally let me iterate without using too much memory.

So what I'm looking for is either a better way to store that data or (preferably) a good way to iterate through the hash.
Any suggestions?
Thanks :)

PS
I've tried using DBM::Deep to save on memory, but it just takes too long to fill the hash this way (I have millions of entries...).

Re: Optimizing Iterating over a giant hash
by jethro (Monsignor) on Dec 25, 2009 at 23:47 UTC

    You might use a divide and conquer method.

    Store the data sequentially into files, one file for each different $entry value. If that number is too big (bigger than the number of files you can have open at any one time), you might group the entry values with a suitable scheme.

    The difference from DBM::Deep is that you write to these files sequentially; it should be much faster because of buffering.

    After that you can reread and work through the files one by one. With luck, a single file will now fit into memory without having to resort to swapping (which is probably what makes your application slow at the moment).
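A minimal sketch of the bucketing idea described above; the function names, file names, and record layout are illustrative, not from the thread:

```perl
use strict;
use warnings;

# Phase 1: append each record sequentially to a per-entry bucket file.
# %fh caches open handles; real code would cap the number of open files
# and group entries into fewer buckets if the cap is exceeded.
my %fh;

sub bucket_record {
    my ($entry, $timestamp, $uid, $number) = @_;
    unless ($fh{$entry}) {
        open $fh{$entry}, '>>', "bucket_$entry.tmp"
            or die "cannot open bucket for $entry: $!";
    }
    print { $fh{$entry} } "$timestamp\t$uid\t$number\n";
}

# Phase 2: process one bucket at a time, so only a single entry's data
# ever needs to fit in memory.
sub process_bucket {
    my ($entry) = @_;
    my %by_ts;
    open my $in, '<', "bucket_$entry.tmp"
        or die "cannot read bucket for $entry: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($timestamp, $uid, $number) = split /\t/, $line;
        $by_ts{$timestamp}{$uid} = $number;
    }
    close $in;
    return \%by_ts;
}
```

Because the phase-1 writes are append-only and buffered by the stdio layer, they stay fast even with millions of records.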

Re: Optimizing Iterating over a giant hash
by biohisham (Priest) on Dec 26, 2009 at 01:45 UTC
    Also check Suffering from Buffering; you might want to set $| to a nonzero value. This can alleviate the 'out of memory!', and, just like jethro suggested, consider divide and conquer...



    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
      you might want to set $| to a nonzero, this can alleviate the 'out of memory!'

      How does that work?

        When the program output is directed to the terminal, the filehandle associated with the output is in line-buffered mode, but it is block-buffered otherwise (when the output is directed to a filehandle other than the terminal STDOUT)... Buffering cannot be switched off, but setting the output-autoflush variable to a nonzero value makes print flush the buffer after every statement instead of waiting for the buffer to fill.

        I understand that 'out of memory!' would flash when the buffer is not autoflushed in such cases and there's a large amount of data. Now, if I am mistaken or cloudy on this one (I believe I am, somehow), you're welcome to go ahead and correct/clarify, for I am basically from a non-IT background...



        Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
Re: Optimizing Iterating over a giant hash
by kyle (Abbot) on Dec 26, 2009 at 05:27 UTC

    This might be a little better.

    my $temp;
    while (($entry, $temp) = each %my_hash) {
        foreach $timeStamp (sort keys %$temp) {
            my @tskeys = sort keys %{$temp->{$timeStamp}};
            print $Hentry "$entry\t$timeStamp\t", scalar @tskeys;
            print $Hentry "\t$_\t", $my_hash{$entry}{$timeStamp}{$_} for @tskeys;
            print $Hentry "\n";
        }
    }

    What I've done here is factor out the repeated call to keys on %{$temp->{$timeStamp}} and turned the foreach my $id loop into the statement modifier form. The lack of a block might be faster.

      Fully agree with eliminating the multiple calls to keys.

      I'm also a big believer in reducing nested hash references. For the small price of assigning another variable, you can eliminate the cost of several steps of hash key lookups, and improve readability as a bonus.

      my $temp;
      while (($entry, $temp) = each %my_hash) {
          foreach $timeStamp (sort keys %$temp) {
              my $hash_ref = $temp->{$timeStamp};
              my @tskeys = sort keys %$hash_ref;
              print $Hentry "$entry\t$timeStamp\t", scalar @tskeys;
              print $Hentry "\t$_\t", $hash_ref->{$_} for @tskeys;
              print $Hentry "\n";
          }
      }
Re: Optimizing Iterating over a giant hash
by GrandFather (Cardinal) on Dec 28, 2009 at 21:21 UTC
Re: Optimizing Iterating over a giant hash
by oron (Novice) on Dec 30, 2009 at 16:58 UTC
    OK, problem solved :)
    First of all, thanks for all the answers.

    What I eventually did was some version of divide and conquer. When reading the initial data (filling the hash), I instead wrote it out in a more suitable form to a new file, which was then sorted (shamefully, with the Linux sort utility). Then I could process the lines that started out the same (same entry + timestamp) and output them. This also gave me a sorted result.
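The sort-then-group approach can be sketched like this, assuming the flattened file holds one tab-separated entry/timestamp/uid/number record per line and has already been sorted with the external sort utility (the function name and output layout are illustrative):

```perl
use strict;
use warnings;

# Read an externally sorted flat file and emit one output row per
# (entry, timestamp) group. Because the input is sorted, all lines of
# a group are adjacent, so only one group is ever held in memory.
sub group_sorted {
    my ($in, $out) = @_;
    my ($cur_key, @pairs);
    my $flush = sub {
        return unless defined $cur_key;
        # row: entry \t timestamp \t user-count \t uid \t number ...
        print {$out} join("\t", $cur_key, scalar(@pairs) / 2, @pairs), "\n";
    };
    while (my $line = <$in>) {
        chomp $line;
        my ($entry, $ts, $uid, $n) = split /\t/, $line;
        my $key = "$entry\t$ts";
        if (!defined $cur_key or $key ne $cur_key) {
            $flush->();                      # finish the previous group
            ($cur_key, @pairs) = ($key);     # start a new one
        }
        push @pairs, $uid, $n;
    }
    $flush->();                              # don't forget the last group
}
```

The external sort does the expensive work on disk with merge passes, so the Perl side stays a cheap single streaming pass.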

    I liked the idea of separating the internal hash into a list; this actually might decrease lookups and not run out of memory, since those lists are relatively short.

    I did not use a database because I was under the impression that I need an SQL server (for example) to be running, and I don't have one. Am I wrong? This could be useful...
      There is, on the one hand, a core module called DB_File, which enables you to, simply speaking, put a hash on disk, but every access to the hash is then a filesystem I/O operation (which means it is slow). There is another module on CPAN, called BerkeleyDB, but its documentation has many TODOs in it.
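For reference, a minimal DB_File sketch. Note that DB_File stores flat key/value strings, so a nested structure has to be flattened into the key; the key layout and file name below are just assumptions for illustration:

```perl
use strict;
use warnings;
use DB_File;   # core module tying a hash to a Berkeley DB file
use Fcntl;     # for the O_RDWR / O_CREAT open flags

# Tie a hash to an on-disk DB; every read/write goes through the file.
tie my %disk, 'DB_File', 'counts.dbm', O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "cannot tie: $!";

# Flatten the nested (entry, timestamp, uid) keys into a single string key.
$disk{"entryA\t1261780800\tuser42"} = 7;

untie %disk;   # flush and close the underlying file
```

The per-access I/O cost is why it is slow to fill interactively, though it never holds the whole hash in memory.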

      I strongly recommend you have a play with SQLite (DBD::SQLite) to dip your toe into the database waters. It is completely stand-alone, even to the extent that the DBD 'driver' includes the database engine. It is ideal for the sort of application this thread discusses (although I'm not recommending you re-engineer your current solution). Having database tools in your toolbox is pretty likely to be useful to you, to the extent that having a bit of a play now is likely to pay dividends in the longer term.
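A minimal sketch of what that looks like with DBD::SQLite; the table layout, column names, and file name are illustrative:

```perl
use strict;
use warnings;
use DBI;

# The SQLite engine is embedded in the DBD driver: no server process needed.
my $dbh = DBI->connect('dbi:SQLite:dbname=counts.sqlite', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS counts (
        entry     TEXT,
        timestamp INTEGER,
        uid       TEXT,
        number    INTEGER
    )
});

# Bulk-insert inside one transaction; committing per row would be slow.
$dbh->begin_work;
my $ins = $dbh->prepare('INSERT INTO counts VALUES (?, ?, ?, ?)');
$ins->execute('entryA', 1261780800, 'user42', 7);
$dbh->commit;

# Let the database do the grouping and sorting instead of a giant hash.
my $rows = $dbh->selectall_arrayref(
    'SELECT uid, number FROM counts
     WHERE entry = ? AND timestamp = ? ORDER BY uid',
    undef, 'entryA', 1261780800);

$dbh->disconnect;
```

Everything lives in the single `counts.sqlite` file, which can simply be deleted when you are done.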

      True laziness is hard work
