http://www.perlmonks.org?node_id=1085557

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I was wondering if there is a difference between:
my %hash_data=();

and
my %hash_data;

Is there a more "correct" one? Which one should I use?

Replies are listed 'Best First'.
Re: Is there a difference in this declaration?
by kcott (Archbishop) on May 09, 2014 at 08:37 UTC

    There's no difference. Both declare a hash with zero key/value pairs:

    $ perl -Mstrict -Mwarnings -E '
        my %x;
        say scalar keys %x;
        say scalar values %x;
        my %y = ();
        say scalar keys %y;
        say scalar values %y;
    '
    0
    0
    0
    0

    Including an assignment has some overhead: typically negligible but may be significant in looping code.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Benchmark qw{cmpthese};

    cmpthese -1 => {
        no_assignment => sub { my %hash },
        assignment    => sub { my %hash = () },
    };

    Output:

                        Rate    assignment no_assignment
    assignment     6672755/s            --          -60%
    no_assignment 16770827/s          151%            --

    I wouldn't necessarily consider one form to be "more correct" than the other.

    I generally use the "my %hash_data;" form.

    [Minor Update: I removed "use autodie;" from the benchmark code as it wasn't necessary (it was an artefact from the last use of this script which I often rework for example code); retested; much the same results.]

    -- Ken

      but may be significant in looping code

      No, not really. You've fallen for the classic fallacy that Benchmark's overblown attempts to "eliminate overhead" can often lead to. The huge values in the "rate" column are a good indicator.

      Let's test your theory by actually writing looping code and seeing how "significant" this difference can be. We need a loop that declares a hash usefully inside it, can complete something close to 6 million iterations each second, and yet gets enough useful work done that almost no other code is required to produce a useful result (since any other code will further dilute the relative speed-up and thus reduce its significance).

      When talking about a Perl operation that can happen 6 million times each second, it is pretty much impossible to make such a single operation be a non-trivial percentage of a useful script's run time. This is classic "micro optimization", a fool's errand.

      So, for a declaration of a hash to be useful, surely you have to insert something into the hash. Since it is a fresh declaration, you're also going to need to use the hash or else you'll be building up close to 6 million new hashes each second and will quickly run out of memory. And this needs to somewhat simulate useful code as speeding up useless code is not "significant", it is theory at best and more often just pointless. :)

      So, here is looping code that does nothing but add two entries to the hash. It isn't useful, but it is pretty darn minimal. Truly useful code is surely going to have to do more than this for the hash declaration to be a useful part of it.

      #!/usr/bin/perl

      use strict;
      use warnings;

      use Benchmark qw{cmpthese};

      cmpthese( -1 => {
          no_assignment => sub {
              for( 1..10_000 ) {
                  my %hash;
                  $hash{$_}  = $_;
                  $hash{-$_} = -$_;
              }
          },
          assignment => sub {
              for( 1..10_000 ) {
                  my %hash = ();
                  $hash{$_}  = $_;
                  $hash{-$_} = -$_;
              }
          },
      } );
      __END__
                       Rate    assignment no_assignment
      assignment     99.4/s            --           -8%
      no_assignment   108/s            9%            --

      Above is a typical result from a run of the script. In my experience, a 10% speed-up would be characterized as "something I'm quite unlikely to even notice" which falls a long way from "significant".

      The speed difference is small enough that I even got this result when I ran the script a few times to verify that my first results weren't atypical:

                       Rate no_assignment    assignment
      no_assignment  96.6/s            --           -3%
      assignment     99.4/s            3%            --

      Note that the "with assignment" code is the one that ran faster that time.

      Finally, a quick demonstration of why I think Benchmark.pm's attempts to "eliminate overhead" are overblown. With all of the insertions commented out, a typical result is:

                       Rate    assignment no_assignment
      assignment     1068/s            --          -37%
      no_assignment  1685/s           58%            --
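
      (To be explicit about what was timed there: it is the script above with both hash insertions commented out in each sub. This is a reconstruction from that description, not the verbatim code that was run:)

          no_assignment => sub {
              for( 1..10_000 ) {
                  my %hash;
                  # $hash{$_}  = $_;    # insertion commented out
                  # $hash{-$_} = -$_;   # insertion commented out
              }
          },
          assignment => sub {
              for( 1..10_000 ) {
                  my %hash = ();
                  # $hash{$_}  = $_;    # insertion commented out
                  # $hash{-$_} = -$_;   # insertion commented out
              }
          },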

      While your original code on my computer gives:

                           Rate    assignment no_assignment
      assignment     11967704/s            --          -49%
      no_assignment  23642004/s           98%            --

      ...and takes noticeably longer to run. Benchmark has to try running the code in a tight loop over and over again with increasing repetition counts, because it keeps getting back time measurements that are too close to "the time it takes to run empty code" for the result to be considered meaningful enough to even be reported.

      When that happens, the results are nearly guaranteed to have no practical value.

      Note that none of this is meant as much of a criticism of what you wrote. Based on the numbers you got, it certainly might have been possible to have a significant impact. Your statement was quite conservative. But my experience led me to doubt that such could happen, so I did a quick test to verify it.

      This case is actually rather close to the edge of where it is possible for a real, useful script to end up 20% faster (the minimum to be noticeable, IME) from this change alone (though such a script would likely still be rather contrived). Certainly it is extremely unlikely.

      The speed difference certainly looks to be insignificant to me.

      - tye        

        "You've fallen for the classic fallacy ..."

        Utter rubbish! I've "fallen" for no such thing.

        Before posting, I'd assumed the assignment incurred some overhead but also considered that an optimisation might have been applied to negate this. I chose to check it.

        The benchmark code indicated the overhead did exist: I posted the code and results to show this. I made no inferences nor offered any conclusions about the benchmark results.

        I wrote that the overhead was "typically negligible". I see that you excluded that from your opening quote.

        Anything, no matter how small, when multiplied enough times will become a bigger thing: that bigger thing "may be significant".
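
        (As a worked illustration of that point, using the roughly 90 ns per-declaration overhead implied by my benchmark above; the billion-iteration count is of course a hypothetical:)

            # ~90 ns of assignment overhead, repeated a (hypothetical) billion times:
            printf "%.0f seconds\n", 0.000_000_09 * 1_000_000_000;    # prints: 90 seconds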

        -- Ken

      Ah OK,
      now I know! Thank you very much!
      So the assignment takes about 0.00000009 seconds (90 nanoseconds), if this is precise (which isn't likely given the minuteness).
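
      (That figure can be checked against kcott's benchmark rates directly: the per-call cost is the reciprocal of each rate, and the overhead is the difference. A quick sketch using the numbers from the post:)

          # Per-call overhead implied by the rates in kcott's first benchmark:
          my $with_assignment    = 1 / 6_672_755;     # seconds per "my %hash = ()"
          my $without_assignment = 1 / 16_770_827;    # seconds per "my %hash"

          printf "%.9f seconds\n", $with_assignment - $without_assignment;
          # prints: 0.000000090 -- about 90 nanoseconds per declaration
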
Re: Is there a difference in this declaration?
by sundialsvc4 (Abbot) on May 09, 2014 at 15:48 UTC

    The only difference that I could possibly see between the two is one of clarity to the (human) reader. You are expressly saying that the hash-variable is defined and that it contains nothing. (Saying what is already true, but ... saying it.) Does that make a difference to the computer? Pragmatically, no. Might it make a difference of understanding to the team? Perhaps; either way, whether or not I saw it to be the coding team's standard practice, I would not tell them that they ought to change it.

    Micro-optimization of code is a very annoying practice, in my view, because it tends to sacrifice the one thing that really does matter in the long run: simple clarity. It can prompt you to do things in the name of microseconds that seriously impede the future maintenance of that code. And it can distract the team into chasing a rabbit that's a scrawny hunk-of-fur at best. In nearly (but not quite) every situation, the real thing that eats your lunch is I/O ... either directly, or implicitly in the form of page-faults. (The latter is much less significant these days since memory has become gigantic.) Quibbling over microseconds, as Benchmark encourages one to do, is almost (but not quite) always not worth wasting, uhhh, hours over.

      Regarding team/environment:

      If the OP's team is exposed to / used to C-style programs where initialization must be done or BadStuffHappens(TM), then I'd go for the explicit initialization in order to reduce mental context switching.

      If the OP's team is used to Perl or similar languages then I'd go for brevity and skip the explicit initialization, since it leads to shorter code. ('short' as in 'as short as necessary', not as in 'as short as possible')

      Or one could argue that you (generic you) should always go for a perlish mindset and remove anything non-perlish. Pick an option that's suitable to you and your team and apply it consistently - TMTOWTDI...

Re: Is there a difference in this declaration?
by wjw (Priest) on May 13, 2014 at 10:49 UTC
    I think the answer to your question is dependent on context: Correctness usually is ...

    If you're writing a program that is likely to run for hours on an expensive resource, then the interesting benchmarking discussed by other posters probably counts for something. After all: if it is costing lots of $$/minute to run the program, it might be worth it to spend the $$ up front and have really efficient code from a hardware performance and resource point of view. (But if that were the case, would you be programming in Perl?)

    If you're writing code under the vast majority of circumstances, then the 'team' aspect is probably more important, as wading through code written for brevity, compared to code written for readability/maintainability, can be costly. Personally, I prefer code which, when read by someone other than the author, does not assume that the reader knows much about the implicit characteristics of how the code compiles or runs. It boils down to a potential trade-off between some extra hardware resources applied occasionally versus wetware resources applied fairly regularly. On the other hand, that does not mean that every line needs a comment either, as some level of skill on the part of the code reader/modifier should be assumed. Obviously a balancing act...

    In my experience, correctness is in the realm of those who think the world should reflect the way they like things to be, as opposed to, for example, the way I like things to be. :-) Am I correct or are they? I am, of course!

    The point is: take a look around at code written by others who appear to be better than you are, find someone whose code you like, and emulate that until you have a valid reason not to. If your goal is to be correct, you are going to spend an awful lot of time defending your version of correct. Do what works for you and does not get in the way of others, and you will be about as correct as you're likely to ever be...

    Incidentally, I like the 'my %hash = ();' just because it is so explicit. Am I correct? I doubt it... but it works for me, and most others really don't give a damn that I spent an extra 5 keystrokes.... :-)


    ...the majority is always wrong, and always the last to know about it...
    Insanity: Doing the same thing over and over again and expecting different results...

      As we speak I have a script running which is consuming *two* expensive resources - CPU and disk I/O. It started running on April the 21st. It is now the 14th of May and I expect it to finish in the wee small hours of the 15th. It even spends a lot of that time in some tight loops, as it is producing summaries of a very large data-set.

      However, "optimizations" like worrying about whether I initialize my variables are foolish. I have loads of far better optimizations. For example, my script runs exactly as parallel as is most efficient, all the time. This means that it uses all the CPU cores available while minimizing conflicts over resources.

      If I care to optimize it further I will minimize disk I/O. But the time taken for disk I/O is only about 20% of the time taken by the process, so we're already getting into the stage where the amount of time it would take to optimize the code or the money it would take to invest in more memory (so less use of disk for intermediate results) or an SSD (just as much disk I/O but contention matters less) isn't really worth it.

        Thanks for that! It validates most of what I thought and answers the question: (would it be written in Perl?).

        Clearly it would and is.

        I have to imagine it must be pretty satisfying to come to work each day and see that code chunking away!

        Again, thanks!

        ...the majority is always wrong, and always the last to know about it...
        Insanity: Doing the same thing over and over again and expecting different results...