Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Re: Bitwise operators on binary objects

by Anonyrnous Monk (Hermit)
on Jan 31, 2011 at 21:31 UTC ( [id://885368]=note: print w/replies, xml ) Need Help??


in reply to Bitwise operators on binary objects

Newer processors have a popcnt ("population count") instruction, which computes the number of 1-bits (for up to 64 bits) with a single fast CPU instruction.  In other words, if you want real speed at the cost of portability, you could hack a little assembly1,2.

A quick test shows that this is roughly 10 times as fast as the above mentioned unpack "%32b*".

#!/usr/bin/perl -wl use strict; use Inline C => Config => LDDLFLAGS => '-shared ../../../popcount.o'; use Inline "C"; use Benchmark "cmpthese"; my $s = pack "C*", map { int rand 256 } 1 .. 2**18; # 256k #print unpack q/%32b*/, $s; #print count_bits($s); cmpthese (-2, { unpack => sub { my $n = unpack q/%32b*/, $s; }, popcnt => sub { my $n = count_bits($s); }, }); __END__ __C__ extern unsigned long popcount(char *, int64_t); // the asm routine unsigned long count_bits (SV* bitstr) { // XS wrapper STRLEN l; char *s = SvPV(bitstr, l); return popcount(s, l); }
Rate unpack popcnt unpack 1252/s -- -91% popcnt 13581/s 984% --

Tested with

$ cat /proc/cpuinfo | grep name | head -1 model name : AMD Phenom(tm) 9600 Quad-Core Processor

___

1 The assembly code (nasm syntax; 64 bit; gcc register subroutine calling convention — compile with nasm -f elf64 popcount.asm ):

bits 64 global popcount section .text ; prototype: unsigned long popcount(char*, int64_t); popcount: push rbp mov rbp, rsp mov rcx, rsi shr rcx, 3 mov rsi, rdi xor rax, rax .cnt popcnt rbx, [rsi] add rax, rbx add rsi, 8 loop .cnt mov rsp, rbp pop rbp ret

(Caveat: as this is just proof of concept code, the string length is assumed to be a multiple of 8 byte/64 bit, so you might need to zero-pad it)

2 In theory, there is also the gcc __builtin_popcount function, but in practice it turns out to be less portable than a piece of assembly code.

Replies are listed 'Best First'.
Re^2: Bitwise operators on binary objects
by PeterCordes (Initiate) on Sep 22, 2017 at 09:31 UTC

    The x86 `loop` instruction is slow; never use it. See this Stack Overflow post about LOOP being slow: https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently

    Write your function in C, using GNU C __builtin_popcountll, and compile with -march=native (to use the popcnt instruction if your CPU supports it)..

    Or if you really want NASM, your function only works if used with a whole number of 64-bit chunks, so I changed the signature to int64_t*. The count is still a byte-count, though.

    ; Untested (check for off-by-one array indexing) ; Probably about 3x as fast as the above for large arrays ; on Intel Sandybridge-family. Unrolling by 2 or so would help. ; prototype: unsigned long popcount(int64_t* rdi, int64_t rsi); ; requires byte count in rsi >= 8, and a multiple of 8. ALIGN 16 global popcount popcount: add rdi, rsi ; rdi = end of buffer neg rsi ; negative index counts up to zero xor rax, rax ;do { .cnt mov rdx, [rdi+rsi] ; mov-load breaks the dependency popcnt rdx, rdx ; on rdx from last iteration. add rax, rdx add rsi, 8 ; index+=8 jl .cnt ;}while(index<0) ret

    popcnt on Intel Sandybridge-family has a false dependency on the destination. Using a separate load avoids creating a loop-carried dependency on the 3-cycle latency of popcnt.

    Using an indexed addressing mode saves an instruction inside the loop. Checking the count and using an unrolled loop would gain speed, especially on AMD Ryzen where popcnt has 4 per clock throughput (so this will bottleneck on the front-end instead of on 2 loads per clock).

    It's 4 fused-domain uops on Intel Sandybridge-family, though (since add / jl can macro-fuse), so it can run at about 1 per clock. It should be at least 3x faster than the old version, which bottlenecks on the popcnt false dependency on Sandybridge-family, or on the LOOP instruction front-end throughput. (Although the old version would actually run ok on AMD Bulldozer-family, where LOOP is fast and POPCNT has no false dependency).

    If you didn't know all this stuff, you should have written your function in C and let the compiler get it right. Or read Agner Fog's optimization guide, instruction tables, and microarch guide (http://agner.org/optimize/). Also, Clang would auto-vectorize for you with AVX2, using vpshufb to count faster (on Haswell and later) than you can with popcnt. By all means look at the C compiler output to make sure you got what you wanted, though.

    Peter Cordes (https://stackoverflow.com/users/224132/peter-cordes)

Re^2: Bitwise operators on binary objects
by albert (Monk) on Feb 01, 2011 at 11:35 UTC
    Assembly is beyond me. This is a cool suggestion, but the processor on my machine isn't new enough to have this call. Plus, I'm fairly sure that I have other significant bottle necks in my code, so not sure I'd benefit nearly so well from moving part to assembly. Still, a very nice idea.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://885368]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (6)
As of 2025-03-27 17:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    When you first encountered Perl, which feature amazed you the most?










    Results (70 votes). Check out past polls.

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.