Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

'%hash = ()' is slower than 'undef %hash'

by rsFalse (Friar)
on May 18, 2018 at 10:13 UTC ( #1214832=perlquestion: print w/replies, xml ) Need Help??
rsFalse has asked for the wisdom of the Perl Monks concerning the following question:

Hello.

Today I've found, that my code with %hash = () is slower than undef %hash about 1.2 times. Perl 5.20.1. My hash contained simple values, not a Hash of Hashes or smth.

I think that %hash = () should be aliased to undef %hash. What is your opinion?

Replies are listed 'Best First'.
Re: '%hash = ()' is slower than 'undef %hash'
by dave_the_m (Monsignor) on May 18, 2018 at 12:37 UTC
    For anything but small hashes, the total time will be dominated by the time to free all the hash elements (which uses the same code path for both methods). Other than that, the assign will generally have a slightly higher overhead as it has to create an empty list first. In perl 5.26.0, the actual list assign operator has been optimised so that it handles an empty RHS lists more efficiently.

    Dave.

Re: '%hash = ()' is slower than 'undef %hash'
by shmem (Chancellor) on May 18, 2018 at 11:17 UTC

    The assignment of an empty list erases the content of %hash, but undef %hash erases the container. So these are different operations to choose from for the task at hand.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
      What is the biggest difference we can obtain by using these two? I really don't understand :(
        What is the biggest difference we can obtain by using these two?

        The difference can be seen with Devel::Peek:

        #/usr/bin/perl use Devel::Peek; use feature 'say'; my %assign = ( foo => 1234, bar => 5678, ); my %undef = ( foo => 1234, bar => 5678, ); %assign = (); undef %undef; say "%assign (",\%assign,") after assignment:"; Dump(%assign); say "%undef (",\%undef,") after undef:"; Dump(%undef); __END__ %assign (HASH(0x195f890)) after assignment: SV = PVHV(0x193fe60) at 0x195f890 REFCNT = 1 FLAGS = (PADMY,SHAREKEYS) ARRAY = 0x19a1d70 KEYS = 0 FILL = 0 MAX = 7 %undef (HASH(0x195f860)) after undef: SV = PVHV(0x193fe80) at 0x195f860 REFCNT = 1 FLAGS = (PADMY,SHAREKEYS) ARRAY = 0x0 KEYS = 0 FILL = 0 MAX = 7

        The difference lies in the ARRAY element of the above output. By assignment of an empty list the container of the key/value pairs is preserved, whereas undef erases it and sets it to a NULL pointer.

        It is obviously cheaper to just wipe the container (and its content) instead of iterating through it and re-arranging its internal structure. - edit - see below, what dave_the_m writes.

        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
        The effect on the hash is indistinguishable, as far as I can say. But the two expressions have different values: %hash = () returns an empty hash, whereas undef %hash returns undef. This isn't a big deal, but in my opinion big enough to keep the two distinct.
        I think defined %hash used to show the difference . Unfortunately it's deprecated now.
Re: '%hash = ()' is slower than 'undef %hash'
by roboticus (Chancellor) on May 18, 2018 at 14:20 UTC

    rsFalse:

    For many applications, that's a false economy. If you're going to reuse the hash container, then it seems to me to be cheaper to clear the hash (%h = ()) instead of destroying the hash container and then recreating it (undef %hash):

    $ perl pm_hash_clear_vs_undef_1214851.pl Checking container differences between delete and clear: Hash size (initial): 0/8 Hash size (after fill): 62/64 Hash size (after clear): 0/64 Hash size (after delete): 0/8 Rate delete reuse overwrite delete 38049/s -- -3% -39% reuse 39358/s 3% -- -37% overwrite 62375/s 64% 58% --

    Overwriting the hash is obviously the fastest, as you needn't clear or destroy the container. Of course for many applications you'd have the added headache of ensuring that old data and current data don't mix.

    Clearing the hash container allows you to reuse the hash without mixing old and current data, but might appear to be slower than simply deleting the hash container.

    Deleting the hash container might appear to be faster until you also account for the time it takes to recreate the hash container when you use it. It might matter more than it appears, though: I'd expect that clearing the hash may leave the container at the same size, so re-using the hash in the case there are many keys may be significantly faster than clearing/recreating it because perl could avoid the multiple container resize operations as it adds the keys. I was going to test that, but for some reason, I don't see how to make perl give the "used/total" buckets value for a hash any longer. (Funny, when I was a perl novice, I was getting that frequently, but now that I want it, I can't seem to make it happen. I guess I'll have to hit the documentation and see if I can suss it out. If so, I'll try to remember to update this node.)

    Deleting the hash container might appear to be faster until you also account for the time it takes to recreate the hash container when you use it. It matters a little more than it appears, though: clearing the hash leaves the container the same size, so re-using the hash is slightly faster than clearing/recreating it because perl can avoid many of the container resize operations as it adds the keys. On the positive side, though, clearing the container may allow your application to reclaim some memory in the event that some datasets may have significantly more keys than are ordinarily needed. (Although I expect that would be as insignificant as the savings from the resizes just mentioned.)

    At least that's how I see it... I'm providing the benchmark so you can point out what I may be missing...

    $ cat pm_hash_clear_vs_undef_1214851.pl use strict; use warnings; use Benchmark ':all'; use Hash::Util 'bucket_stats'; my @some_keys = ('A' .. 'Z', 'a' .. 'z', '0' .. '9'); print "Checking container differences between delete and clear:\n"; my %gh; print "Hash size (initial): ", old_hash_stats(\%gh), "\n"; fill_hash(\%gh); print "Hash size (after fill): ", old_hash_stats(\%gh), "\n"; %gh=(); print "Hash size (after clear): ", old_hash_stats(\%gh), "\n"; undef %gh; print "Hash size (after delete): ", old_hash_stats(\%gh), "\n"; print "\n"; sub old_hash_stats { my $hr = shift; my @hash_stats = bucket_stats($hr); return "$hash_stats[0]/$hash_stats[1]"; } sub fill_hash { my $hr = shift; @{$hr}{@some_keys} = (0) x @some_keys; } cmpthese(500000, { 'overwrite' => sub { my %h; fill_hash(\%h); fill_hash(\%h); }, 'delete' => sub { my %h; fill_hash(\%h); undef %h; fill_hash(\%h); }, 'reuse' => sub { my %h; fill_hash(\%h); %h = (); fill_hash(\%h); }, });

    Update: It seems that the old behavior of scalar(%h) changed in version 5.25.3 from displaying "buckets used/bucket count" to simply "buckets used". (A poor idea, in my opinion.) Anyway, with the Hash::Util function bucket_stats we can still get the information. I've edited the text and benchmark accordingly, and rearranged things a little for readability.

    Update 2: Added the bit about destroying the container allows you to reclaim memory as a possible benefit.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: '%hash = ()' is slower than 'undef %hash'
by ikegami (Pope) on May 19, 2018 at 01:04 UTC

    I see the opposite. Where's your test? I never trust a benchmark I don't see; they're too easy to get wrong.

    >perl a.pl Rate test2 test1 test2 225043/s -- -5% test1 236071/s 5% -- >perl a.pl Rate test2 test1 test2 233861/s -- -5% test1 247248/s 6% -- >perl a.pl Rate test2 test1 test2 224170/s -- -8% test1 243265/s 9% --
    $ perl a.pl Rate test2 test1 test2 194855/s -- -10% test1 217061/s 11% -- $ perl a.pl Rate test2 test1 test2 200313/s -- -5% test1 210977/s 5% -- $ perl a.pl Rate test2 test1 test2 195321/s -- -4% test1 203538/s 4% --
    use strict; use warnings; use Benchmark qw( cmpthese ); sub test1 { my %h = 'a'..'z'; %h = (); } sub test2 { my %h = 'a'..'z'; undef %h; } cmpthese(-3, { test1 => \&test1, test2 => \&test2, });
Re: '%hash = ()' is slower than 'undef %hash'
by afoken (Abbot) on May 18, 2018 at 17:35 UTC
    %hash = () is slower than undef %hash about 1.2 times [...] What is your opinion?

    It seems you are micro-optimizing. Unless %hash=() resp. undef %hash takes significant amounts of time (say a minute or more per run), I would simply ignore it. Very likely, other places of your code need much more time and optimizing them gains much more. Hint: Try Devel::NYTProf.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: '%hash = ()' is slower than 'undef %hash'
by kcott (Bishop) on May 19, 2018 at 06:55 UTC

    G'day rsFalse,

    "What is your opinion?"

    Unfortunately, you've provided very little information upon which an opinion might be based. Answers to the following would have been useful in your OP.

    • How did you declare %hash? my? our? no strict 'vars'; and just let it pop into existence? something else?
    • In what context are you using %hash?
    • Why are you attempting to get rid of its contents (with undef or =())?
    • Where's your benchmark code?
    • Where's your benchmark results? [Sorry, but just saying "%hash = () is slower than undef %hash about 1.2 times" is not sufficient.]

    Given all those unknowns, my opinion is that you should consider code like this:

    { my %hash = (...); # operations on %hash here } # %hash doesn't exist here. Does that solve your problem?

    Of course, with some answers to those questions, a better response might be forthcoming.

    — Ken

Re: '%hash = ()' is slower than 'undef %hash'
by LanX (Archbishop) on May 18, 2018 at 10:30 UTC
    My opinion is they are doing different things.

    I don't think oranges should be aliased to apples.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Wikisyntax for the Monastery

Re: '%hash = ()' is slower than 'undef %hash'
by rsFalse (Friar) on May 19, 2018 at 12:02 UTC
    Thank you very much for discussion and benchmarks!

    I was simply upsolving a competitive programming problem which had 1000 ms time limit and big tests with undirected graphs (tries). After my solution with %hash = () got a verdict time-limit-exceeded, I tried to find how to optimise it. And it was surprising for me, when I found that my program is running ~20% faster when I change to undef %hash, got verdict accepted in 850 ms on max tests. Both variations didn't influence a correctness of an algorithm, only running time. I used %hash as global (here it didn't influence solving time comparing with declaring with 'my').
    I'm happy that in newer versions optimisation took place, as dave_the_m have mension. I thought maybe it is unknown or unoptimized thing.
    edited
      I haven't done benchmark by myself, only compared running times of tests cases on CP platform. Problem. Submitions: >1000 ms, 800 ms (maybe not readable, but here only 16th line differs). Perl 5.20.1.
      Probably determined by whether-or-not garbage collection had been triggered.
        Perl does reference counting, it's always triggered, sundial

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1214832]
Approved by haukex
Front-paged by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2019-03-23 06:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    How do you Carpe diem?





    Results (114 votes). Check out past polls.

    Notices?