PerlMonks  

Re: Efficient bit-twiddling in Perl.

by Tux (Monsignor)
on Feb 28, 2013 at 16:58 UTC


in reply to Efficient bit-twiddling in Perl.

Just to test whether it would bring any speed gain (which I did not expect), I fiddled with pack and unpack. Before doing the hard part, I looked at the speeds halfway. It is not worth pursuing.

I also looked at Inline::C, and it is slower than your Perl code (and I had already optimized the stack handling).

$ cat test.pl
use 5.016;
use warnings;
use Inline "C";
use Benchmark qw( cmpthese );

my $n = 0x80061861;
my $x = 10000;

sub andshift {
    my ($top14, $nxt6, $mid6, $bot6);
    for (1 .. $x) {
        $top14 = ($n & 0xfffc0000) >> 18;
        $nxt6  = ($n & 0x0003f000) >> 12;
        $mid6  = ($n & 0x00000fc0) >>  6;
        $bot6  = ($n & 0x0000003f);
        }
    return ($top14, $nxt6, $mid6, $bot6);
    } # andshift

sub packunpack {
    my ($top14, $nxt6, $mid6, $bot6);
    for (1 .. $x) {
        ($top14, $nxt6, $mid6, $bot6) = unpack "A14 A6 A6 A6",
            unpack "B32" => pack "N" => $n;
        }
    return ($top14, $nxt6, $mid6, $bot6);
    } # packunpack

sub inlined {
    my ($top14, $nxt6, $mid6, $bot6);
    for (1 .. $x) {
        ($top14, $nxt6, $mid6, $bot6) = split14666 ($n);
        }
    return ($top14, $nxt6, $mid6, $bot6);
    } # inlined

say for andshift ();
say for packunpack ();
say for split14666 ($n);

cmpthese (-1, {
    andshift   => \&andshift,
    bitstrings => \&packunpack,
    inline_c   => \&inlined,
    });

__END__
__C__

void split14666 (int n)
{
    Inline_Stack_Vars;
    Inline_Stack_Reset;
    EXTEND (sp, 4);
    mPUSHi (((unsigned int)n & 0xfffc0000) >> 18);
    mPUSHi (((unsigned int)n & 0x0003f000) >> 12);
    mPUSHi (((unsigned int)n & 0x00000fc0) >>  6);
    mPUSHi (((unsigned int)n & 0x0000003f)      );
    Inline_Stack_Done;
    } /* split14666 */

$ perl test.pl
8193
33
33
33
10000000000001
100001
100001
100001
8193
33
33
33
             Rate bitstrings   inline_c   andshift
bitstrings  121/s         --       -52%       -74%
inline_c    250/s       108%         --       -47%
andshift    470/s       290%        88%         --

Enjoy, Have FUN! H.Merijn


Re^2: Efficient bit-twiddling in Perl.
by BrowserUk (Pope) on Feb 28, 2013 at 17:28 UTC
    Before doing the hard part, I looked at the speeds halfway. It is not worth pursuing.

    Indeed. My initial thought for this was to use unpack, and I came up with this unholy construct:

    unpack 'vccc', join '', map { pack 'b*', $_ }
        unpack 'a14a6a6a6', unpack 'B32', pack 'N', $n;

    Which does the job but (not unexpectedly) is an order of magnitude slower than my second attempt in the OP.
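    For anyone picking the construct apart, here is a hypothetical annotated version (my comments, not BrowserUk's); it reads right to left, and the trick is that pack 'b*' consumes bits LSB-first, so the 14-bit field comes back as a little-endian 16-bit word ('v') and each 6-bit field as a single byte ('c'):

    ```perl
    my $n = 0x80061861;
    my @fields = unpack 'vccc',   # 16-bit LE word + three signed bytes
        join '',
        map { pack 'b*', $_ }     # each bit string -> bytes, LSB-first
        unpack 'a14 a6 a6 a6',    # slice into 14+6+6+6 characters
        unpack 'B32',             # 32-char MSB-first bit string
        pack 'N', $n;             # 4 big-endian bytes
    print "@fields\n";            # 8193 33 33 33
    ```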

    That was a pretty good saving, but whenever I have asked these types of questions in the past, someone has always come up with a better solution.

    And indeed, salva has; albeit only a further 20% saving. I also tried using separate lookup tables to save a dereference to no avail:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

my @lookup; $#lookup = 0x3ffff;
$lookup[ $_ ] = [
    ( $_ & 0x3f000 ) >> 12, ( $_ & 0xfc0 ) >> 6, $_ & 0x3f
] for 0 .. 0x3ffff;

my( @nxt, @mid, @bot );
$nxt[ $_ ] = ( $_ & 0x3f000 ) >> 12,
$mid[ $_ ] = ( $_ & 0xfc0 ) >> 6,
$bot[ $_ ] = $_ & 0x3f for 0 .. 0x3ffff;

sub stuff{
#    print "@_";
}

our $ITERS //= 10e6;
my $n = 0x80061861;

my $start = time;
for ( 1 .. $ITERS ) {
    stuff(
        ( $n & 0xfffc0000 ) >> 18,
        ( $n & 0x0003f000 ) >> 12,
        ( $n & 0x00000fc0 ) >>  6,
        ( $n & 0x0000003f )
    );
}
printf "Shift&and took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for ( 1 .. $ITERS ) {
    stuff( $n >> 18, @{ $lookup[ $n & 0x3ffff ] } );
}
printf "  Lookup took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for ( 1 .. $ITERS ) {
    my $i = $n & 0x3ffff;
    stuff( $n >> 18, $nxt[ $i ], $mid[ $i ], $bot[ $i ] );
}
printf " Lookup2 took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for ( 1 .. $ITERS ) {
    stuff( unpack 'vccc', join '', map { pack 'b*', $_ }
        unpack 'a14a6a6a6', unpack 'B32', pack 'N', $n );
}
printf "(un)pack took: %.12f seconds\n", ( time() - $start ) / $ITERS;

__END__
C:\test>1021064
Shift&and took: 0.000000482421 seconds
  Lookup took: 0.000000386419 seconds
 Lookup2 took: 0.000000547556 seconds
(un)pack took: 0.000005933478 seconds

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I'll qualify this by saying that salva's result doesn't necessarily buy you anything if you are only doing this a few million times. On my machine, I get the timings
      Generation took: 0.218580069542 seconds
       Shift&and took: 0.000000471000 seconds
          Lookup took: 0.000000395095 seconds
      Where generation is calculated with
my $start = time;
for (1..100) {
    my @lookup; $#lookup = 0x3ffff;
    $lookup[ $_ ] = [
        ( $_ & 0x3f000 ) >> 12, ( $_ & 0xfc0 ) >> 6, $_ & 0x3f
    ] for 0 .. 0x3ffff;
}
printf "Generation took: %.12f seconds\n", ( time() - $start )/100;
      I ran the transforms for 10^7 iterations, though frankly all metrics were still a little volatile for my taste. This puts breakeven at 2.9 million iterations and 5% speed up at 4.2 million. YMMV.
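      The breakeven figure follows directly from the timings quoted above: the one-off table build cost divided by the per-call saving of Lookup over Shift&and. A quick sketch (numbers taken from the measurements above):

      ```perl
      # Breakeven: one-off generation cost / per-call saving of Lookup.
      my $gen_cost = 0.218580069542;                   # table build, seconds
      my $saving   = 0.000000471000 - 0.000000395095;  # seconds saved per call
      printf "Breakeven at %.1f million calls\n", $gen_cost / $saving / 1e6;
      # prints "Breakeven at 2.9 million calls"
      ```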

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        The lookup table generation is 2/10ths of a second, a one-off cost at startup.

        Completely insignificant compared to the cost of the inner-loop lookup code, which could theoretically be executed 4 billion times. (Once for each 32-bit number.) More typically, mid hundreds to low thousands of millions of times.

        A small saving overall, but this is only one part of the inner-loop code I am optimising. The runtime is already down from a couple of weeks to ~4 days.



      Bit::Vector, as you might guess, isn't any better...

use strict;
use warnings;
use Time::HiRes qw[ time ];
use Bit::Vector;

my @lookup; $#lookup = 0x3ffff;
$lookup[ $_ ] = [
    ( $_ & 0x3f000 ) >> 12, ( $_ & 0xfc0 ) >> 6, $_ & 0x3f
] for 0 .. 0x3ffff;

my( @nxt, @mid, @bot );
$nxt[ $_ ] = ( $_ & 0x3f000 ) >> 12,
$mid[ $_ ] = ( $_ & 0xfc0 ) >> 6,
$bot[ $_ ] = $_ & 0x3f for 0 .. 0x3ffff;

sub stuff{
#    print "@_";
}

our $ITERS //= 10e6;
my $n = 0x80061861;

my $start = time;
for ( 1 .. $ITERS ) {
    stuff(
        ( $n & 0xfffc0000 ) >> 18,
        ( $n & 0x0003f000 ) >> 12,
        ( $n & 0x00000fc0 ) >>  6,
        ( $n & 0x0000003f )
    );
}
printf "  Shift&and took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for ( 1 .. $ITERS ) {
    stuff( $n >> 18, @{ $lookup[ $n & 0x3ffff ] } );
}
printf "     Lookup took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for ( 1 .. $ITERS ) {
    my $i = $n & 0x3ffff;
    stuff( $n >> 18, $nxt[ $i ], $mid[ $i ], $bot[ $i ] );
}
printf "    Lookup2 took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for ( 1 .. $ITERS ) {
    stuff( unpack 'vccc', join '', map { pack 'b*', $_ }
        unpack 'a14a6a6a6', unpack 'B32', pack 'N', $n );
}
printf "   (un)pack took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$n = "$n";
my $vector;
$start = time;
for (1 .. $ITERS ) {
    $vector = Bit::Vector->new_Hex(32, $n);
    stuff($vector->Chunk_Read(14, 18),
          $vector->Chunk_Read( 6, 12),
          $vector->Chunk_Read( 6,  6),
          $vector->Chunk_Read( 6,  0) );
}
printf "Bit::Vector took: %.12f seconds\n", ( time() - $start ) / $ITERS;

__END__
C:\test>perl bv.pl
  Shift&and took: 0.000000551174 seconds
     Lookup took: 0.000000272709 seconds
    Lookup2 took: 0.000000335823 seconds
   (un)pack took: 0.000003393702 seconds
Bit::Vector took: 0.000005372898 seconds
        The Bit::Vector code needs new_Dec, not new_Hex. Here are some additions to your code showing Data::BitStream::XS and two alternatives using Bit::Vector. Neither is an improvement over the fast simple methods, but they do show that both are at least reasonable solutions.
$start = time;
my $stream = Data::BitStream::XS->new;
# $stream->write(32, $n) for 1 .. $ITERS;   # Below is faster
$stream->put_raw(pack("N",$n) x $ITERS, 32*$ITERS);
$stream->rewind_for_read;
for (1 .. $ITERS ) {
    stuff($stream->read(14), $stream->read(6), $stream->read(6), $stream->read(6));
}
printf "   D::B::XS took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for (1 .. $ITERS ) {
    my $vector = Bit::Vector->new_Dec(32, $n);
    stuff($vector->Chunk_Read(14, 18),
          $vector->Chunk_Read( 6, 12),
          $vector->Chunk_Read( 6,  6),
          $vector->Chunk_Read( 6,  0) );
}
printf "Bit::Vector took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
my $vector = Bit::Vector->new(32*$ITERS);
$vector->Word_List_Store(($n) x $ITERS);
for (0 .. $ITERS-1 ) {
    stuff($vector->Chunk_Read(14, 32*$_+18),
          $vector->Chunk_Read( 6, 32*$_+12),
          $vector->Chunk_Read( 6, 32*$_+ 6),
          $vector->Chunk_Read( 6, 32*$_+ 0) );
}
printf "Bit::Vector took: %.12f seconds\n", ( time() - $start ) / $ITERS;
        I get on my computer:
          Shift&and took: 0.000000193217 seconds
             Lookup took: 0.000000180829 seconds
            Lookup2 took: 0.000000215928 seconds
           (un)pack took: 0.000001851533 seconds
           D::B::XS took: 0.000000643722 seconds
        Bit::Vector took: 0.000001739788 seconds
        Bit::Vector took: 0.000000957755 seconds
        It really depends on what the usage model is. Both Data::BitStream and Bit::Vector are going to suffer if you insist on making a new object for each word. They will do much better if you let them create a single object and fill it, then pull the data out in arbitrary sized chunks. If the OP had all the values in one chunk or array then this would be more appropriate.
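        The "fill one object, pull chunks out" model can be illustrated without either module, using plain pack/unpack on a single buffer (a hypothetical sketch, not code from this thread; the word values are the OP's $n repeated):

        ```perl
        # Pack all words into one buffer up front, then slice each record
        # out and split it, instead of constructing an object per word.
        my @words  = (0x80061861) x 4;
        my $buffer = pack 'N*', @words;    # one buffer, filled once

        for my $i (0 .. $#words) {
            my $n = unpack 'N', substr $buffer, 4 * $i, 4;
            my @f = ($n >> 18, ($n >> 12) & 0x3f, ($n >> 6) & 0x3f, $n & 0x3f);
            print "@f\n";                  # 8193 33 33 33
        }
        ```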

        One could also write a new coding role for Data::BitStream to handle retrieving the values using code similar to one of the lookup methods, but there would have to be some real reason to want a stream vs. just doing it using one of the lookup methods.

      I think you can shave off a lot in XS (or Inline::C) if you switch to prebound variables, à la bind_columns () in DBI or Text::CSV_XS, as much of the XS overhead is stack-work. That said, it would then still be function calls, so the stack will be involved. If you pre-bind both in- AND out-parameters, you don't need to push variables onto the stack and pop return values off the stack. That should speed things up quite a bit.
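      For readers unfamiliar with the pre-binding idea, here is a hypothetical pure-Perl sketch of the calling convention (the split_bound name and reference-based binding are illustrative only, not actual XS): the splitter writes straight into pre-registered scalars instead of returning a list.

      ```perl
      # Pre-bound output variables: register the targets once, then each
      # call fills them in place instead of returning four values.
      my ($top14, $nxt6, $mid6, $bot6);
      my @bound = (\$top14, \$nxt6, \$mid6, \$bot6);

      sub split_bound {
          my ($n) = @_;
          ${$bound[0]} = ($n & 0xfffc0000) >> 18;
          ${$bound[1]} = ($n & 0x0003f000) >> 12;
          ${$bound[2]} = ($n & 0x00000fc0) >>  6;
          ${$bound[3]} =  $n & 0x0000003f;
      }

      split_bound(0x80061861);
      print "$top14 $nxt6 $mid6 $bot6\n";   # 8193 33 33 33
      ```

      In XS the same scheme would let the C side store into the bound SVs directly, which is where the stack-work saving would come from.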

      The next step could be looking into op-hooks, so there are no real perl-level function calls anymore, but looking at what you already have, I wonder whether it would be worth the effort (other than to learn from it).

      The lookup solutions still have array-index ops under the hood. I am stunned by the 20% gain you get, and wonder whether that would differ per perl version and/or architecture.

      Anyway, the solution that is fastest on the machine the problem is solved on is most important, even if another method would be four times faster on z/OS with perl4 :)


      Your benchmark is not very realistic. As $n doesn't vary, you are completely ignoring the effect of the CPU cache misses that the lookup table may introduce.

      If you use a random $n, you will see that the table approach actually becomes quite a bit slower than the simple one.

      I have tried encoding the table in other ways, but have not been able to find one that is good enough:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

my @lookup; $#lookup = 0x3ffff;
$lookup[ $_ ] = [
    ( $_ & 0x3f000 ) >> 12, ( $_ & 0xfc0 ) >> 6, $_ & 0x3f
] for 0 .. 0x3ffff;

my( @nxt, @mid, @bot );
$nxt[ $_ ] = ( $_ & 0x3f000 ) >> 12,
$mid[ $_ ] = ( $_ & 0xfc0 ) >> 6,
$bot[ $_ ] = $_ & 0x3f for 0 .. 0x3ffff;

my (@lookup3); $#lookup3 = 0x3ffff;
$lookup3[$_ << 6] = [$_ >> 6, $_ & 0x3f] for 0 .. 0xfff;

my $lookup4 = 'x' x (3 * (1<<18));
$lookup4 = '';
$lookup4 .= pack CCC => $_ >> 12, ($_>>6) & 0x3f, $_ & 0x3f for 0..0x3ffff;

my $lookup6 = 'x' x (2 * (1<<12));
$lookup6 = '';
$lookup6 .= pack CC => $_ >> 6, $_ & 0x3f for 0..0xfff;

print "tables generated";

our $ITERS //= 10e6;
my @n = map int(rand(1<<18)), 1..$ITERS;
print "sample data generated";

sub stuff{
#    print "@_";
}

my $start = time;
for my $n (@n) {
    stuff( ( $n ) >> 18,
           ( $n & 0x0003f000 ) >> 12,
           ( $n & 0x00000fc0 ) >>  6,
           ( $n & 0x0000003f ) );
}
printf "Shift&and took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for my $n (@n) {
    stuff( $n >> 18, @{ $lookup[ $n & 0x3ffff ] } );
}
printf "  Lookup took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for my $n (@n) {
    stuff( $n >> 18, @{ $lookup3[$n & 0x3ffc0] }, $n & 0x3f );
}
printf " Lookup3 took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for my $n (@n) {
    stuff( $n >> 18, unpack CCC => substr($lookup4, 3 * ($n & 0x3ffff), 3));
}
printf " Lookup4 took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for my $n (@n) {
    stuff( $n >> 18, unpack 'x'.(3 * ($n & 0x3ffff)).'CCC' => $lookup4);
}
printf " Lookup5 took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for my $n (@n) {
    stuff( $n >> 18, unpack(CC => substr($lookup6, ($n & 0x3ffc0) >> 5, 3)), $n & 0x3f);
}
printf " Lookup6 took: %.12f seconds\n", ( time() - $start ) / $ITERS;

$start = time;
for my $n (@n) {
    stuff( $n >> 18, unpack('x'.(($n & 0x3ffc0) >> 5).'CC', $lookup6), $n & 0x3f);
}
printf " Lookup7 took: %.12f seconds\n", ( time() - $start ) / $ITERS;

__END__
Shift&and took: 0.000000783860 seconds
  Lookup took: 0.000001267049 seconds
 Lookup3 took: 0.000001018672 seconds
 Lookup4 took: 0.000001903985 seconds
 Lookup5 took: 0.000002110766 seconds
 Lookup6 took: 0.000001607903 seconds
 Lookup7 took: 0.000001791258 seconds
        Your benchmark is not very realistic. As $n doesn't vary, you are completely ignoring the effect of the CPU cache misses that the lookup table may introduce.

        The version of my benchmark I posted was the one I use -- in conjunction with -N=1 and uncommenting the print "@_"; -- to check that the output from all the methods is correct.

        For timing runs, I operate on the loop counter $_, which, when -N=10e6, is enough to exercise the full range of the table 38 times.

        And I still get 20%+ improvement from the AoAs lookup:

        Results

        C:\test>1021064
        Shift&and took: 0.000000606833 seconds
          Lookup took: 0.000000485629 seconds
         Lookup2 took: 0.000000669586 seconds
        (un)pack took: 0.000007564075 seconds

        In general, I've found it very difficult to detect, much less measure, any discernible effect from CPU caching in Perl code on my machine. It shows up readily in tight loops in C code, but Perl's memory allocation is so cache-unfriendly that it rarely seems to come into play.

        I would be interested to see the output from the above on your machine if you have a couple of minutes.


