Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

datastructure of array and hash?

by cool (Scribe)
on Aug 02, 2007 at 07:36 UTC ( #630227=perlquestion: print w/replies, xml ) Need Help??

cool has asked for the wisdom of the Perl Monks concerning the following question:

If we have a huge array and a hash with same values. And we search for a particular value, which is going to take more time and WHY? Can anyone tell me, how these work inside? Or in other words, why are hashes stored randomly??
#! /usr/bin/perl use strict; #array.pl my @array; for (1..10000) { my $x=$_; $x="a".$x; push (@array, $x); } foreach (@array) { print "$array[$_]=>$_", "\n"; }
#! /usr/bin/perl use strict; #hash.pl my %hash; for (1..10000) { my $x=$_; $x="a".$x; $hash{$_}=$x; } foreach (keys %hash) { print "$hash{$_}=>$_", "\n"; }
[sid@nucleix scope]$ time perl hash.pl > A real 0m0.103s user 0m0.097s sys 0m0.006s [sid@nucleix scope]$ time perl array.pl > B real 0m0.091s user 0m0.086s sys 0m0.005s

Replies are listed 'Best First'.
Re: datastructure of array and hash?
by roboticus (Chancellor) on Aug 02, 2007 at 11:53 UTC

    cool:

    Hashes are stored randomly because they use a function to turn a key into a slot number. So when you do a read or store, it can be very quick: you just calculate the slot number from the key, and then get the data out of the specified slot. An array is faster still, as you already have the slot number in hand.

    So, is an array faster? Yes, for certain things. If you have a dense collection of integer numeric keys, say even numbers from 100 to 200, then you can use a simple array to get your values: $val = $arr[($key-100)/2];. If your index values are sparse, then you can waste a good bit of memory to hold nothing, and hashes may be very attractive as your key range increases to a large size with very sparse data. A hash, on the other hand, need not be very large in this case.

    Another case where arrays can be better is if you're always processing a set of items in order and you don't really care about random access. In that case, an array is fine, you just add stuff to your array and process it in order for your application. A hash won't help you here. This is the situation that your benchmark demonstrates: You get all the overhead of the hash (calculating the key), with none of the benefits.

    Frequently, however, you have a text string that you want to use as a key. In that case, you could approach it with an array by having two values in each array slot (key and value), or you could have two arrays, one for the keys and one for the values. In the first case, you have to search half the array (on average) to find the value for a specified key. In the second case, you have to search half the key array (on average) to find the key, then you can directly look up the value from the second array with the same index. Either way, using the array makes you have to do a lot of lookups to get a particular value. For the hash you need only calculate the expected slot position from the key and check to see if it's there. One calculation and one lookup, resulting in *much* faster access. Here's an example where we search for random values:

    #! /usr/bin/perl use strict; # Build the dataset my $iterations = 100000; my $range = 10000; my $num_keys = 1000; my %hash = map { "a_" . int(rand($range)) => 0 } (0..$num_keys); my @array = sort keys %hash; my $cnt_array=0; my $cnt_hash=0; # look for random array entries my $start = time; for (0..$iterations) { my $key = "a_" . int(rand($range)); for (0..@array) { ++$cnt_array if $key eq $array[$_]; } } my $ttl = time - $start; print "Array search took $ttl seconds, found $cnt_array\n"; # look for random hash entries $start = time; for (0..$iterations) { my $key = "a_" . int(rand($range)); ++$cnt_hash if defined $hash{$key}; } $ttl = time - $start; print "Hash search took $ttl seconds, found $cnt_hash\n";

    When I ran this, I got the results:<\p>

    $ ./foo.pl Array search took 33 seconds, found 9534 Hash search took 0 seconds, found 9426

    So hashes definitely have their use...

    Note: I glossed over some of the details of a hash, so if you want to really get into the details you'll need to google a bit.

    ...roboticus

    Update: I just ran it with 10 million iterations, and got:

    $ ./foo.pl Array search took 3258 seconds, found 955277 Hash search took 16 seconds, found 954543
    So in this boneheaded example, the hash was about 200 times faster than an array search.
Re: datastructure of array and hash?
by citromatik (Curate) on Aug 02, 2007 at 08:21 UTC

    It is stated (for example, in Mastering Algorithms with Perl) that it's about 30% faster to store data in an array than in a hash, and about 20% faster to retrieve data from an array than from a hash. More or less it is what you are getting with your own benchmark

    A simple explanation would be that this happens because the subscript in an array tells Perl where to find that value in memory, while a hash must first convert its key into a hash value.

    citromatik

Re: datastructure of array and hash?
by holli (Abbot) on Aug 02, 2007 at 08:49 UTC
    Do yourself a favour and read about map:
    my @array = map { "a$_" } (1..10000); my %hash = map { $_ => "a$_" } (1..10000);


    holli, /regexed monk/
Re: datastructure of array and hash?
by Anonymous Monk on Aug 02, 2007 at 19:25 UTC
Re: datastructure of array and hash?
by Aim9b (Monk) on Aug 02, 2007 at 16:19 UTC
    Perl internals (& my neophyte status) not withstanding, I can only offer the following... Every memory location in every computer ever made, is accessed "numerically". That's just the way things are, so if you want the corresponding "Value" of green to the "Key" of Color, You must use a "hashing algorithm" (ha)to compute a numerical memory address from the bits-value of 'color'. There are several known hashing Algorithms and each is designed to work better than another on different sizes of files, arays, tables, hashes, or what-have-you. On the systems I work on, we can set the seed of these ha's to 0,1,2,etc. & create a more efficient (polynomial)Algorithm as the file size increases. Not sure if perl utilizes different ha's internally or not. By accessing an Array via its relative numeric index, getting the nth element would be faster, but looking for the element containing 'Color' would require a sort & a B-Tree Algorithm, making it slower than a hash. IMHO, but again, I'm just learning. ;-) That's it, for what it's worth.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://630227]
Approved by Corion
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (7)
As of 2020-04-02 20:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    The most amusing oxymoron is:
















    Results (26 votes). Check out past polls.

    Notices?