PerlMonks  

Re: Re: Slurping BIG files into Hashes

by waswas-fng (Curate)
on Jun 18, 2003 at 18:56 UTC ( #266952=note )


in reply to Re: Slurping BIG files into Hashes
in thread Slurping BIG files into Hashes

Looks like you have something goofy going on there. Look at the time report at the bottom of this post for my runtime on a 2-processor Sun box.

open (CONFIG, "<iaout.txt") || die "Couldn't open config file: $!";
my %lookup;
while (<CONFIG>) {
    # first 13 characters are the key, the rest of the line is the value
    $lookup{substr($_, 0, 13)} = substr($_, 13);
}

# script used to generate data in the form of:
# 21 random alpha chars per line
#
#use Data::Random qw(:all);
#open IA, ">iaout.txt";
#for $x (1 .. 160000) {
#    my @random_chars = rand_chars( set => 'alphanumeric', min => 21, max => 21 );
#    print IA @random_chars, "\n";
#}

[1:50pm] 161 [/var/tmp]: time perl t
2.95u 0.18s 0:03.30 94.8%

How similar are the key parts of the data in your file? I am wondering if you are getting a very high collision rate on the key for some reason. Either that or memory is my best guess.

-Waswas


Re: Re: Re: Slurping BIG files into Hashes
by Elgon (Curate) on Jun 18, 2003 at 19:33 UTC

    Aha, 'twas written...

    "I am wondering if you are getting a very high collision rate on the key for some reason?"

    I reckon this is the problem: the key values are very similar all the way through, and there's not a lot that can be done about it. Hmmm... I'm trying to think of a better data structure. Thanks for the help, everybody who contributed.

    Elgon

    PS - The box is an 8-processor Sun server running Solaris with 8GB of RAM. Neither I/O nor memory seems to be the problem, judging from continuous observation of the stats.

    update - Thanks to BrowserUK et al. for their help. Unfortunately, the version of Perl we are using is 5.004_5 and I am not allowed to change it. Oh well. I'm trying to find a workaround as we speak...

    update 2 - Thanks to jsprat, the script now runs in about a minute. Ta to all...

    Please, if this node offends you, re-read it. Think for a bit. I am almost certainly not trying to offend you. Remember - Please never take anything I do or say seriously.

      Try presizing the hash - keys %lookup = 160_000;

      If it is hash collisions, this might solve the problem.
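For illustration, here is a minimal sketch of presizing ahead of the load loop. The sample records are made up (a 13-character key followed by an 8-character value), and they are read through an in-memory filehandle only so that the snippet runs on its own; the real script would open iaout.txt instead. Lexical and in-memory filehandles assume a much newer Perl than the 5.004 mentioned above, so treat this as a sketch rather than a drop-in replacement.

```perl
use strict;
use warnings;

# Made-up sample data: 1000 lines of 21 characters each
# (13-character key + 8-character value), standing in for iaout.txt.
my $data = join '', map { sprintf("KEY%010dVAL%05d\n", $_, $_) } 1 .. 1000;

my %lookup;
keys(%lookup) = 1000;    # ask for enough buckets before inserting anything

# Read from the in-memory string instead of a real file (assumption: Perl 5.8+).
open(my $fh, '<', \$data) or die "Couldn't open sample data: $!";
while (my $line = <$fh>) {
    chomp $line;         # unlike the original loop, drop the trailing newline
    $lookup{ substr($line, 0, 13) } = substr($line, 13);
}
close($fh);

printf "%d keys loaded\n", scalar keys %lookup;
```

Presizing only saves the repeated bucket-array growth during the load; whether it also helps with collisions depends on how the keys hash.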

      dominus has an interesting bit at perl.plover.com called When Hashes Go Wrong.

      Update: Meant to ask you to "print scalar %lookup;" after all is done. scalar %hash will give you the number of used buckets / number of allocated buckets. If the number of used buckets is low (like 1/16) all your hash items have been put in the same bucket!
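A hedged sketch of that bucket check follows. The keys are made up for illustration. One caveat: as far as I know, from Perl 5.26 onward a hash in scalar context returns the number of keys rather than the used/allocated string, and the old string is available from Hash::Util::bucket_ratio instead. The sketch therefore uses bucket_ratio when the running perl provides it and falls back to scalar(%lookup) otherwise.

```perl
use strict;
use warnings;

my %lookup;
keys(%lookup) = 160_000;    # presize, as suggested above

# Made-up 13-character keys standing in for the real data.
$lookup{ sprintf("KEY%010d", $_) } = 1 for 1 .. 160_000;

# Look up Hash::Util::bucket_ratio at runtime; on perls without it,
# scalar(%lookup) still yields the "used/allocated" string.
my $bucket_ratio = eval { require Hash::Util; Hash::Util->can('bucket_ratio') };
my $usage = $bucket_ratio ? $bucket_ratio->(\%lookup) : scalar(%lookup);

my ($used, $allocated) = split m{/}, $usage;
printf "%d of %d buckets used for %d keys\n",
    $used, $allocated, scalar keys %lookup;
```

A used count that is tiny compared with the number of keys would point at the collision problem discussed in this thread.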
