Might be that I'm totally wrong, but I think that as long as you're trying to consider

1. Incrementing haystack by 1-bit.

an important primary op, there's something wrong.

I'd basically try to first find out how the needle would need to be shifted in order to match the char_aligned haystack, and leave the haystack as it is - always.

My somewhat naive approach would be to start with building the set of 256 possible 'needle heads' (going for char operations all over). A 'needle head' could be say 8 chars long (or some other convenient value).

Then, look for the occurences of the needle_heads within haystack. For each needle_head, you know the offset (needed shift) associated with it. (Yes, that means 256 searches instead of 1, but it still might be better than shift-ing things around for every next bit-position.)

If needlehead is found, go on with a full, char-wise compare, with haystack's chars untouched and every next needle-char obtained via a proper shift from the next two chars of the needle.

End of string is bit-masked as needed to comply with the known offset.

Maybe you've tried something like that already?

Krambambuli
---
• Comment on Re^5: [OT] The interesting problem of comparing bit-strings. (More info.)

Replies are listed 'Best First'.
by BrowserUk (Pope) on Mar 25, 2015 at 13:34 UTC
Might be that I'm totally wrong, but I think that as long as you're trying to consider 1. Incrementing haystack by 1-bit. an important primary op, there's something wrong.

Sorry. I should have pointed out that Incrementing haystack by 1-bit isn't quite what it sounds like.

I used the expression to match up to the strstr() implementation I posted.

When in that algorithm they increment the haystack pointer (char*), so as to compare the needle at the next byte offset, I need to effectively increment the haystack by 1 bit.

What that actually entails is a single machine code instruction that shifts 1 64-bit register left by one bit whilst bringing in a new bit from another 64-bit register containing the (preloaded) next (to the right) quad from the haystack.

That's the __shiftleft128() intrinsic which is implemented as a single shld instruction.

But, I also need to detect when I transit a 64-bit boundary so that I can preload the next quad, and reset the shift counter. The whole thing looks like this (very close to what I had above):

```U64 nextBit( U64 **p, U8 *o ) {                  // p: pointer to poin
+ter to current quad; o: pointer to current offset
if( ++( *o ) == 64 ) {                       // increment the offs
+et and detect transitions across quad boundaries.
*o = 0;                                  // reset the offset.
+ pointer.
}
// Get the 'current' (unshifted) quad value; and the next
return __shiftleft128( *( *p + 1 ), **p, *o );         // return t
+he value from the higher location with *o bits from lo shifted in fro
+m the right.
}

Which, despite that it looks like it does two 64-bit loads from memory every time, once inlined and optimised, it only does one 64-bit load every 64 calls, with the compiler seeing fit to keep the values in registers across calls. In fact, even the shift-left gets optimised away/folded into the same shift left that is used for the haystack in the inner loop.

It's hard to follow (if your unfamiliar with x64 code), but the shld instruction that should appear at the bottom of the outer loop (just above the label \$LN14@bvSearch:, has been folded into the shld that appears near the top of the inner loop, due to the C code h = nextQuad( &hp, oHay );, by virtue of the rather complex addressing mode used for the register loads:

```    mov    rax, QWORD PTR [r11+r9+16]
mov    r8, QWORD PTR [r11+r9+8]

The relevant part of the optimised assembler output:

```; 41   :     while( hay < eHay ) {                        // until we
+reach the end of the haystack

cmp    rcx, r12
jae    SHORT \$LN5@bvSearch
mov    r11, rcx
sub    r11, r9
\$LL6@bvSearch:

; 42   :         hp = hay;                                // copy of t
+he haystack pointer
; 43   :         np = ndl;                                // copy of t
+he needle pointer

mov    r9, r13
\$LL4@bvSearch:

; 44   : //TAG;
; 45   :         do {
; 46   :             // end of needle; match completed; return current
+ value of haystack pointer
; 47   :             if( np >= eNdl ) {

cmp    r9, rdi
jae    SHORT \$LN19@bvSearch

; 55   :             }
; 56   :             h = nextQuad( &hp, oHay );           // get the n

mov    rax, QWORD PTR [r11+r9+16]
mov    r8, QWORD PTR [r11+r9+8]

; 57   :             n = nextQuad( &np, oNdl );           // get the n

mov    rdx, QWORD PTR [r9+8]
movzx    ecx, sil
shld    r8, rax, cl
mov    rax, QWORD PTR [r9+8]
movzx    ecx, bpl
shld    rdx, rax, cl

; 58   : //printf( "\rhp:%p->h:%16I64x[%s] np:%p->n:%16I64x[%s]\t", hp
+, h, toBin( h ), np, n, toBin( n ) );
; 59   :         } while( h == n );                       // while the

cmp    r8, rdx
je    SHORT \$LL4@bvSearch

; 60   :         nextBit( &hay, &oHayCopy );              // Got here
+means a mismatch; get the next bit from the haystack

inc    bl
cmp    bl, 64                    ; 00000040H
jne    SHORT \$LN14@bvSearch
xor    bl, bl
; 41   :     while( hay < eHay ) {                        // until we
+reach the end of the haystack

cmp    rcx, r12
jae    SHORT \$LN5@bvSearch
mov    r11, rcx
sub    r11, r9
\$LL6@bvSearch:

; 42   :         hp = hay;                                // copy of t
+he haystack pointer
; 43   :         np = ndl;                                // copy of t
+he needle pointer

mov    r9, r13
\$LL4@bvSearch:

; 44   : //TAG;
; 45   :         do {
; 46   :             // end of needle; match completed; return current
+ value of haystack pointer
; 47   :             if( np >= eNdl ) {

cmp    r9, rdi
jae    SHORT \$LN19@bvSearch

; 55   :             }
; 56   :             h = nextQuad( &hp, oHay );           // get the n

mov    rax, QWORD PTR [r11+r9+16]
mov    r8, QWORD PTR [r11+r9+8]

; 57   :             n = nextQuad( &np, oNdl );           // get the n

mov    rdx, QWORD PTR [r9+8]
movzx    ecx, sil
shld    r8, rax, cl
mov    rax, QWORD PTR [r9+8]
movzx    ecx, bpl
shld    rdx, rax, cl

; 58   : //printf( "\rhp:%p->h:%16I64x[%s] np:%p->n:%16I64x[%s]\t", hp
+, h, toBin( h ), np, n, toBin( n ) );
; 59   :         } while( h == n );                       // while the

cmp    r8, rdx
je    SHORT \$LL4@bvSearch

; 60   :         nextBit( &hay, &oHayCopy );              // Got here
+means a mismatch; get the next bit from the haystack

inc    bl
cmp    bl, 64                    ; 00000040H
jne    SHORT \$LN14@bvSearch
xor    bl, bl
\$LN14@bvSearch:

; 40   :
; 41   :     while( hay < eHay ) {                        // until we
+reach the end of the haystack

cmp    r10, r12
jb    SHORT \$LL6@bvSearch
\$LN5@bvSearch:

; 61   :     }