Re^10: Bit vector fiddling with Inline C

Just guessing, but maybe the library method might prove to be quicker if it operates on words rather than bytes.

I thought that using bigger, and particularly register-sized, chunks might make some difference, given that loading and using sub-register-sized operands is generally considered to be more expensive. However, I tried addressing the string as an array of both 32-bit and 64-bit ints:

int mytest2(SV* sv_vec, unsigned int bit) {

    STRLEN vecbytes; // Length of vector in bytes
    unsigned int *vec = (unsigned int*) SvPV(sv_vec, vecbytes);

    if( bit / 8 >= vecbytes) return 0; // Check in range

    vec[ bit / 32 ] |= ( 1U << ( bit % 32 ) ); // Set bit (CHANGES $vector)

    return 1;
}


int mytest3(SV* sv_vec, unsigned int bit) {

    STRLEN vecbytes; // Length of vector in bytes
    unsigned __int64 *vec = (unsigned __int64 *) SvPV(sv_vec, vecbytes);

    if( bit / 8 >= vecbytes) return 0; // Check in range

    vec[ bit / 64] |= ( 1ULL << ( bit % 64 ) ); // Set bit (CHANGES $vector)

    return 1;
}
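One caveat with the wider variants: the `bit / 8 >= vecbytes` check only guarantees that the target byte is in range, not the whole 32- or 64-bit word that gets read and written. Also, the word-sized versions set the same bit as the byte-wise one only on a little-endian machine. Here is a minimal sketch in plain C on a raw buffer (no Perl API), assuming hypothetical helper names `set_bit_bytes`, `set_bit_dwords` and `is_little_endian` that are not part of the code above. It uses `memcpy` so that it makes no alignment assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise reference: set bit `bit` in a buffer of `len` bytes.
   Returns 1 on success, 0 if the bit is out of range. */
int set_bit_bytes(unsigned char *buf, size_t len, unsigned int bit) {
    assert(buf != NULL);
    if (bit / 8 >= len) return 0;              /* target byte must exist */
    buf[bit / 8] |= (unsigned char)(1u << (bit % 8));
    return 1;
}

/* 32-bit variant (hypothetical helper). The range check covers the
   whole 4-byte word that is read and written, not just one byte. */
int set_bit_dwords(unsigned char *buf, size_t len, unsigned int bit) {
    size_t word = bit / 32;
    uint32_t w;
    assert(buf != NULL);
    if ((word + 1) * sizeof w > len) return 0; /* whole word must fit */
    memcpy(&w, buf + word * sizeof w, sizeof w);
    w |= (uint32_t)1 << (bit % 32);
    memcpy(buf + word * sizeof w, &w, sizeof w);
    return 1;
}

/* Returns 1 when the lowest-addressed byte of an integer holds its
   least significant bits, i.e. on a little-endian machine. */
int is_little_endian(void) {
    uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

With a 6-byte buffer, for example, bit 40 is accepted by the byte-wise check but rejected by the word-wise one, because the second 32-bit word would extend two bytes past the end of the buffer.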

Whatever difference it made, if any, was entirely lost in the noise of benchmarking. The relative ordering of bytes/dwords/qwords changes randomly with every run:

C:\test>903727.pl
         Rate qwords  bytes dwords    vec
qwords 3.05/s     --    -2%    -2%   -25%
bytes  3.13/s     2%     --    -0%   -23%
dwords 3.13/s     2%     0%     --   -23%
vec    7.70/s   151%   149%   147%     --

C:\test>903727.pl
         Rate qwords  bytes dwords    vec
dwords 3.10/s     --    -0%    -1%   -60%
bytes  3.11/s     0%     --    -0%   -60%
qwords 3.13/s     1%     0%     --   -59%
vec    7.69/s   148%   147%   146%     --

Of course, for the best possible performance, you would need to ensure that the pointer is register-size aligned. Perl does allocate strings starting with such alignment, though the SvOOK optimisation can mean that the pointer you receive from SvPV isn't so aligned if the scalar has been fiddled with after allocation. That is not the case in this benchmark.
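If you wanted to guard against a misaligned pointer rather than assume alignment, a check along these lines would do. This is only a sketch: `is_aligned` is a hypothetical helper name, and it would be applied to whatever pointer SvPV hands back before taking the word-sized path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `p` is aligned to a multiple of `align` bytes, else 0.
   `align` must be a non-zero power of two (e.g. 4 or 8). */
int is_aligned(const void *p, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

A caller could fall back to the byte-wise code whenever `is_aligned(vec, sizeof(unsigned int))` returns 0.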

My guess is that optimising compilers already generate code to "unroll the loop" for such accesses, and so this attempt at manual optimisation is unnecessary. I'd like to verify that by having the compiler produce the assembler, but every attempt I've made to pass additional compiler options to Inline C causes it to fail to build. Even with CCFLAGS => '', it still fails, whereas without that option it succeeds :(

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.
