Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re^2: An irrational coding choice found

by Tux (Monsignor)
on Mar 19, 2012 at 07:05 UTC ( #960373=note: print w/ replies, xml ) Need Help??


in reply to Re: An irrational coding choice found
in thread An irrational coding choice found

True but do not forget the distribution of the keys and the size of your data set. Size may matter too! Note the following example, one single entry with a simple key

$ perl -MDevel::Size=total_size -E'$key=123;$foo{$key}=1;$foo[$key]=1; +say total_size\%foo;say total_size\@foo' 205 1080 $ perl -MDevel::Size=total_size -E'$key=123456;$foo{$key}=1;$foo[$key] +=1;say total_size\%foo;say total_size\@foo' 208 987744 $ perl -MDevel::Size=total_size -E'$key=123456789;$foo{$key}=1;$foo[$k +ey]=1;say total_size\%foo;say total_size\@foo' 211 987654408

(I consider 123 to be small). Making the "simple" key smaller might be possible in many cases, but the calculation method to getting it to "smallish" will probably defeat the gain over hashes.

To me the most important reasons to use arrays are:

  • Data must stay in original order
  • Data is not guaranteed to be unique
  • The "target" API works only with lists/arrays

Enjoy, Have FUN! H.Merijn


Comment on Re^2: An irrational coding choice found
Download Code
Re^3: An irrational coding choice found
by BrowserUk (Pope) on Mar 19, 2012 at 07:33 UTC
    Making the "simple" key smaller might be possible in many cases, but the calculation method to getting it to "smallish" will probably defeat the gain over hashes.

    The subtraction of a simple constant makes almost no difference:

    $t=time; $a[ $_-123 ] = $_ for 123 .. 123+1e6; print time() - $t; prin +t total_size \@a;; 0.252718925476074 32388792 $t=time; $h{ $_ } = $_ for 123 .. 123+1e6; print time() - $t; print to +tal_size \%h;; 0.625 112278277
    To me the most important reasons to use arrays are: 1) Data must stay in original order 2) Data is not guaranteed to be unique 3) The "target" API works only with lists/arrays

    Of those, the first two are moot. If the data can be stored in an array, then it can also be stored in a hash whilst meeting both of those criteria.

    That is, it is the keys of a hash that must be unique, and that is easily achieved by incrementing a integer variable as you build the hash. And once you've done that retrieval in insertion order is just a matter of iterating the keys.

    For your third criteria, if the APIs don't accept hashes, then there is no choice.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      I quite often have data sets where I have "keys" but no values. Only the fact that a "key" is present is important. If this set gets me unique keys, and the keys are all numeric and smallish, the choice is easy ("select c_foo from bar where c_foo < 1000 and bummer > 4"). But what if c_foo is a not unique (like I want to do something with foo for all keys appearing in some other dataset not necessarily from a database)? If I want to count the number of c_foo occurrences, a hash is of course the way to go: the count is the value to the key c_foo. However I quite often have to deal with foo for a slightly different query: "select c_foo from bar where c_foo < 1000 and bummer > 4 order by glome" where the order matters and c_foo is not unique. In this case, an array is the way to go. Of course one can store the data in a hash and keep the index in the value or - for easier perusal, store the index in the key and the key in the value, but that is very counterintuitive. Hence I do not think the first two are moot in this discussion: it is all about when to decide to use a hash or an array. In most cases both can be used but not always as efficient. That means both ways: efficient to maintain (and read 6 months later) and efficient in performance or memory.


      Enjoy, Have FUN! H.Merijn
        where the order matters and c_foo is not unique. In this case, an array is the way to go

        In that case, the "'keys' to your data are integers, contiguous, and smallish" -- 0 .. n-1. I see no conflict between the needs of your example and my criteria.

        I think that you are also losing sight of the OP; and the question to which I responded.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://960373]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (17)
As of 2014-07-28 18:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (205 votes), past polls