It all started out innocently enough. I dig hash slices. They offer a compact syntax for representing bulk parallel transfers in hashes. If you don't know what hash slices are, here is a quick example. Consider the difference between:
    #! /usr/bin/perl -w
    use strict;

    my %slice;
    @slice{ qw/alpha bravo/ } = ( 'foo', 'bar' );

    my %assign;
    $assign{alpha} = 'foo';
    $assign{bravo} = 'bar';
These two assignment blocks to %slice and %assign are equivalent. My intuition told me that the hash slice form would be faster. Benchmarks showed me otherwise...
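The bench-hslice script itself isn't reproduced here, but a minimal sketch of what such a harness might look like, using the standard Benchmark module, would be something like this (sub names and the default iteration count are my guesses, not the original script):

```perl
#! /usr/bin/perl -w
# Hypothetical reconstruction of a bench-hslice-style harness.
use strict;
use Benchmark qw(timethese);

my $count = shift || 100_000;

timethese($count, {
    # bulk parallel transfer via a hash slice
    Slice_02 => sub {
        my %h;
        @h{ qw/alpha bravo/ } = ( 'foo', 'bar' );
    },
    # the equivalent one-key-at-a-time flat assignments
    Assign_02 => sub {
        my %h;
        $h{alpha} = 'foo';
        $h{bravo} = 'bar';
    },
});
```

Scaling this up to 16, 32 and 64 elements is just a matter of longer qw// lists and value lists in each sub.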
I first created a benchmark with a two-element assignment and a 16-element assignment, with varying-length scalars. I was shocked to see that the hash slice method was 50% slower! I then expanded the tests to include 32-element and 64-element assignments, and each time the hash slice assignment was 50% slower.
    % perl bench-hslice 100000
    Benchmark: timing 100000 iterations of Assign_02, Assign_16, Assign_32,
        Assign_64, Slice_02, Slice_16, Slice_32, Slice_64...
    Assign_02:  1 wallclock secs ( 1.24 usr +  0.00 sys =  1.24 CPU)
    Assign_16:  7 wallclock secs ( 6.41 usr +  0.00 sys =  6.41 CPU)
    Assign_32: 14 wallclock secs (13.55 usr +  0.00 sys = 13.55 CPU)
    Assign_64: 26 wallclock secs (26.34 usr +  0.00 sys = 26.34 CPU)
    Slice_02:   2 wallclock secs ( 1.80 usr +  0.00 sys =  1.80 CPU)
    Slice_16:  10 wallclock secs ( 9.47 usr +  0.01 sys =  9.48 CPU)
    Slice_32:  19 wallclock secs (18.56 usr +  0.01 sys = 18.57 CPU)
    Slice_64:  39 wallclock secs (38.43 usr +  0.00 sys = 38.43 CPU)
Creating the 32- and 64-element hash slice assignments brought to light the whole reason why I started meditating on this. The fact is, for a small number of elements in a hash slice assignment, creative use of whitespace lets you line up the assigner and the assignee, which leads to better comprehension and stamps out a possible source of errors. Consider:
    my %slice;
    @slice{ qw/ alpha   bravo  charlie  delta  / } =
              ( $omega, $x,    $brown,  $plane );
The big flaw with this approach is that it breaks down as soon as the assignment list grows longer than your line-length limit (around 110 characters in my case; other shops might cap it at 80).
What would be really nice is to allow some kind of syntax along the lines of:
    my %slice;
    @slice{
        'alpha' = $omega,
        'bravo' = $x,
        # ...
    };
But that doesn't compile ("Can't modify constant item in scalar assignment"). But wait! That code looks awfully like a standard hash constructor. What if we used the => fat comma?
    my %slice;
    @slice{
        alpha => $omega,
        bravo => $x,
        # ...
    };
That actually compiles, although it doesn't do what we want: perl warns "Useless use of hash slice in void context" and leaves %slice untouched. Bummer. And that's the crux of the problem: if you consider notational clarity the most important issue, hash slice assignments do not scale up well beyond half a dozen or so elements. A minor issue is that they're also slower, but I don't really care about that, for I perceive that the notational benefits outweigh the runtime cost.
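For what it's worth, one workaround I can imagine (my own suggestion, not something established above) is to write the pairs in ordinary hash-constructor notation in a scratch hash, then bulk-assign through keys/values, which Perl guarantees to return in matching order. The variable values here are purely illustrative:

```perl
#! /usr/bin/perl -w
use strict;

# Hypothetical example values, just for illustration.
my ( $omega, $x ) = ( 'last', 'unknown' );

# The target hash; pre-existing entries must survive the slice assignment.
my %slice = ( existing => 'untouched' );

# Readable, constructor-style pair list ...
my %pairs = (
    alpha => $omega,
    bravo => $x,
);

# ... fed into a slice assignment; keys() and values() walk the
# hash in the same internal order, so the pairing is preserved.
@slice{ keys %pairs } = values %pairs;
```

You lose the fixed left-to-right ordering of the pairs at run time, but for a plain bulk transfer that doesn't matter.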
But I was curious as to why it was that much slower. My first step was a step in the dark, and somewhat of a step backwards to boot, for I replaced @h{ qw/alpha bravo/ } = ... with my @h = qw/alpha bravo/; @h{ @h } = .... (Farewell to clarity). Surprisingly, the hash slice assignment became faster, about as fast as the flat assignment approach.
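Spelled out, the kluge looks like this: hoist the key list into an array first, then slice with the array, so the qw// list is only built once.

```perl
#! /usr/bin/perl -w
use strict;

# The key list lives in its own array; note that the array @h and
# the hash %h are distinct variables despite sharing a name.
my @h = qw/alpha bravo/;

my %h;
@h{ @h } = ( 'foo', 'bar' );    # slice %h using the keys in @h
```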
    % perl bench-hslice2 100000
    Benchmark: timing 100000 iterations of Assign_02, Assign_16, Assign_32,
        Assign_64, Slice_02, Slice_16, Slice_32, Slice_64...
    Assign_02:  1 wallclock secs ( 1.22 usr +  0.00 sys =  1.22 CPU)
    Assign_16:  7 wallclock secs ( 6.61 usr +  0.00 sys =  6.61 CPU)
    Assign_32: 13 wallclock secs (13.11 usr +  0.00 sys = 13.11 CPU)
    Assign_64: 25 wallclock secs (26.82 usr +  0.00 sys = 26.82 CPU)
    Slice_02:   0 wallclock secs ( 1.33 usr +  0.00 sys =  1.33 CPU)
    Slice_16:   7 wallclock secs ( 6.57 usr +  0.00 sys =  6.57 CPU)
    Slice_32:  13 wallclock secs (12.76 usr +  0.00 sys = 12.76 CPU)
    Slice_64:  26 wallclock secs (26.11 usr +  0.00 sys = 26.11 CPU)
I was doing all this with perl 5.005_03. I ran the same code on 5.6.1, and oh joy! the hash slice assignment (without the kludgey @array) ran about as fast as the flat assignment approach.
    % perl hslice 100000
    Benchmark: timing 100000 iterations of Assign_02, Assign_16, Assign_32,
        Assign_64, Slice_02, Slice_16, Slice_32, Slice_64...
    Assign_02:  1 wallclock secs ( 0.68 usr +  0.00 sys =  0.68 CPU) @ 147058.82/s (n=100000)
    Assign_16:  3 wallclock secs ( 3.99 usr +  0.00 sys =  3.99 CPU) @  25062.66/s (n=100000)
    Assign_32:  9 wallclock secs ( 8.40 usr +  0.01 sys =  8.41 CPU) @  11890.61/s (n=100000)
    Assign_64: 17 wallclock secs (16.61 usr +  0.03 sys = 16.64 CPU) @   6009.62/s (n=100000)
    Slice_02:   1 wallclock secs ( 0.70 usr +  0.00 sys =  0.70 CPU) @ 142857.14/s (n=100000)
    Slice_16:   4 wallclock secs ( 4.21 usr +  0.00 sys =  4.21 CPU) @  23752.97/s (n=100000)
    Slice_32:   7 wallclock secs ( 8.30 usr +  0.00 sys =  8.30 CPU) @  12048.19/s (n=100000)
    Slice_64:  18 wallclock secs (16.93 usr +  0.01 sys = 16.94 CPU) @   5903.19/s (n=100000)
To see what was going on, it was time to look at the opcodes perl was generating. I'm not an expert in perlguts, but it is pretty straightforward to figure things out. For the flat assignment, the code is roughly the same between the two versions of perl used, but for hash slices, it is a different story:
    % cat opcode-hslice
    #! /usr/bin/perl -w
    use strict;
    my %h;
    @h{ qw/alpha bravo/ } = ( 'foo', 'bar' );

    % perl -MO=Terse opcode-hslice    # 5.6.1
    BINOP (0x819f8e0) aassign [2]
        UNOP (0x81b3dc0) null [141]
            OP (0x81a7868) pushmark
            SVOP (0x819c720) const PV (0x8131440) "foo"
            SVOP (0x819c740) const PV (0x8131434) "bar"
        UNOP (0x81b3e00) null [141]
            OP (0x81a37e8) pushmark
            LISTOP (0x8126360) hslice
                OP (0x81a7850) pushmark
                LISTOP (0x8126060) list
                    OP (0x81a79e8) pushmark
                    SVOP (0x8126240) const PV (0x812eb8c) "alpha"
                    SVOP (0x81262a0) const PV (0x8124224) "bravo"
                OP (0x819d128) padhv [1]

    % perl -MO=Terse opcode-hslice    # 5.005_03
    BINOP (0x80d4f30) pp_aassign [3]
        UNOP (0x80d4e00) pp_null [141]
            OP (0x80d4e28) pp_pushmark
            SVOP (0x8144de8) pp_const PV (0x80d1f30) "foo"
            SVOP (0x80d4de0) pp_const PV (0x80d1f48) "bar"
        UNOP (0x80d4ee8) pp_null [141]
            OP (0x80d4f10) pp_pushmark
            LISTOP (0x8144da0) pp_hslice
                OP (0x8144dc8) pp_pushmark
                LISTOP (0x80d5520) pp_split [2]
                    PMOP (0x81360b8) pp_pushre /\s+/
                    SVOP (0x80d54e0) pp_const PVIV (0x80d1f90) "alpha bravo"
                    SVOP (0x80d5500) pp_const IV (0x80d1f0c) 0
                OP (0x80d54c0) pp_padhv [1]
(Some output elided for brevity).
Here, the slowdown observed in 5.005_03 is easily explained. When the interpreter encounters @slice{ qw/alpha bravo/ } = ( 'foo', 'bar' );, it emits a run-time opcode that splits the qw/alpha bravo/ quoted-word list on /\s+/, each and every time the statement executes. In 5.6.1, the list is built once at compile time, hence the speed gain.
I'm just hoping now that in Perl 6, there'll be one single mother-of-all-opcodes that will take two lists (the hslice lvalues, and the assignment rvalues) and do the whole mess at C speeds. Then we will see a dramatic performance increase. And a nice notation for arbitrary length hash slices would be nice too.