Beefy Boxes and Bandwidth Generously Provided by pair Networks RobOMonk
Syntactic Confectionery Delight
 
PerlMonks  

Re: An irrational coding choice found

by BrowserUk (Pope)
on Mar 18, 2012 at 22:31 UTC ( #960336=note: print w/ replies, xml ) Need Help??


in reply to An irrational coding choice found

First, you should ignore AoH versus HoH, and instead concentrate of array versus hash.

If the 'keys' to your data are integers, contiguous, and smallish (or can be arranged to be smallish by the subtraction of some constant), use an array.

Because:

  • arrays use less memory;
  • arrays are faster;

Vis:

$t=time; $a[ $_ ] = $_ for 1 .. 1e6; print time() - $t; print total_si +ze \@a;; 0.208095073699951 32388768 $t=time; $h{ $_ } = $_ for 1 .. 1e6; print time() - $t; print total_si +ze \%h;; 0.586999893188477 112277576

So, 60% faster and 70% less space used. (That's a single specific example, but quite typical.)

Wherever in a nested structure -- be it AoH or HoA or AoHoA .v. HoHoH -- that the keys lend themselves to the use of arrays, your code will benefit from using them in most cases.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?


Comment on Re: An irrational coding choice found
Download Code
Re^2: An irrational coding choice found
by Tux (Monsignor) on Mar 19, 2012 at 07:05 UTC

    True but do not forget the distribution of the keys and the size of your data set. Size may matter too! Note the following example, one single entry with a simple key

    $ perl -MDevel::Size=total_size -E'$key=123;$foo{$key}=1;$foo[$key]=1; +say total_size\%foo;say total_size\@foo' 205 1080 $ perl -MDevel::Size=total_size -E'$key=123456;$foo{$key}=1;$foo[$key] +=1;say total_size\%foo;say total_size\@foo' 208 987744 $ perl -MDevel::Size=total_size -E'$key=123456789;$foo{$key}=1;$foo[$k +ey]=1;say total_size\%foo;say total_size\@foo' 211 987654408

    (I consider 123 to be small). Making the "simple" key smaller might be possible in many cases, but the calculation method to getting it to "smallish" will probably defeat the gain over hashes.

    To me the most important reasons to use arrays are:

    • Data must stay in original order
    • Data is not guaranteed to be unique
    • The "target" API works only with lists/arrays

    Enjoy, Have FUN! H.Merijn
      Making the "simple" key smaller might be possible in many cases, but the calculation method to getting it to "smallish" will probably defeat the gain over hashes.

      The subtraction of a simple constant makes almost no difference:

      $t=time; $a[ $_-123 ] = $_ for 123 .. 123+1e6; print time() - $t; prin +t total_size \@a;; 0.252718925476074 32388792 $t=time; $h{ $_ } = $_ for 123 .. 123+1e6; print time() - $t; print to +tal_size \%h;; 0.625 112278277
      To me the most important reasons to use arrays are: 1) Data must stay in original order 2) Data is not guaranteed to be unique 3) The "target" API works only with lists/arrays

      Of those, the first two are moot. If the data can be stored in an array, then it can also be stored in a hash whilst meeting both of those criteria.

      That is, it is the keys of a hash that must be unique, and that is easily achieved by incrementing a integer variable as you build the hash. And once you've done that retrieval in insertion order is just a matter of iterating the keys.

      For your third criteria, if the APIs don't accept hashes, then there is no choice.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

        I quite often have data sets where I have "keys" but no values. Only the fact that a "key" is present is important. If this set gets me unique keys, and the keys are all numeric and smallish, the choice is easy ("select c_foo from bar where c_foo < 1000 and bummer > 4"). But what if c_foo is a not unique (like I want to do something with foo for all keys appearing in some other dataset not necessarily from a database)? If I want to count the number of c_foo occurrences, a hash is of course the way to go: the count is the value to the key c_foo. However I quite often have to deal with foo for a slightly different query: "select c_foo from bar where c_foo < 1000 and bummer > 4 order by glome" where the order matters and c_foo is not unique. In this case, an array is the way to go. Of course one can store the data in a hash and keep the index in the value or - for easier perusal, store the index in the key and the key in the value, but that is very counterintuitive. Hence I do not think the first two are moot in this discussion: it is all about when to decide to use a hash or an array. In most cases both can be used but not always as efficient. That means both ways: efficient to maintain (and read 6 months later) and efficient in performance or memory.


        Enjoy, Have FUN! H.Merijn
Re^2: An irrational coding choice found
by raybies (Chaplain) on Mar 21, 2012 at 14:16 UTC
    one nitpick (probably too obvious to mention, but I'm a simpleton...), You can save time using a hash if you need to access a single element of a hash, instead of searching the array for your elements. Also hash insertion's easy, as opposed to splicing into a sorted array (assuming you had to keep an array sorted). It really comes down to how you intend to use your data as to whether it saves time.

      New (or rather, newly identified) software pattern: Simpleton Object.

      one nitpick (probably too obvious to mention ...

      It's not so much "too obvious" as not really applicable to the OPs question.

      I wasn't trying to define the absolute criteria for when to use an array or a hash.

      Just clarify the reasoning for trading the habitual use of hashes for arrays where that is otherwise a practical proposition.

      In order for that trade to even begin to be a practical proposition, the "natural keys" to the data have to be small, near contiguous integers; or otherwise be simply and quickly transformable into a range of small, near contiguous integers. If that is not the case, no such trade is possible and the OPs question is moot.

      You can save time using a hash if you need to access a single element of a hash, instead of searching the array for your elements.

      In order to use the lookup abilities of an associative array, you must start with the key and either:

      • be looking for the associated value;
      • checking for the existence of that key.

      For the very specific subset of cases where the array-instead-of-hash trade is possible, it would mean that you would be starting with an index into the array. And either looking to retrieve the value at that index; or test if there is any value at that index.

      In both cases, if the initial trade is feasible then the lookup in the array will be substantially faster than lookup in the hash.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://960336]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (8)
As of 2014-04-18 23:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (473 votes), past polls