in reply to [OT] The interesting problem of comparing bit-strings.

BrowserUk:

Are you familiar with the Boyer-Moore string search? I bet you could use the same basic idea of building a state machine of the next address to check. But rather than analyzing the needle byte by byte, do it in parallel for each shifted variant. Something like the below. Note: I'm going to pretend the byte size is 4 bits here, and I also don't know boyer-moore by heart, so I'm winging it to give you the gist of it. If you like the idea, you can hash out the annoying details (heh!):

Given Needle : 1010 0011 1101 1100 Haystack: 0001 0010 1101 1100 1010 1101 0001 1110 1110 0111 1) Make three more copies of given, each bit-shifted Needle A: 1010 0011 1101 1100 Needle B: x101 0001 1110 1110 0xxx Needle C: xx10 1000 1111 0111 00xx Needle D: xxx1 0100 0111 1011 100x Column D C B A 2) Build the BM skip table based on the last complete byte D C B A 0000 ---- ---- ---- 4444 0001 ---W -M-- ---- ---- 0010 --X- ---- ---- ---- 0011 ---W M--- ---- ---- 0100 ---- ---M ---- ---- 0101 -Y-W ---- ---- ---- 0110 --X- ---- ---- ---- 0111 ---W ---- ---M --M- 1000 ---- --M- ---- ---- 1001 ---W ---- ---- ---- 1010 Z-X- ---- ---- ---- 1011 ---W ---- ---- ---M 1100 ---- ---- ---- M--- 1101 -Y-W ---- M--- ---- 1110 --X- ---- -M-- -M-- 1111 ---W ---- --M- ---- - : an entry I didn't try to determine yet M : Matched byte position, match previous needle char to previous table column. Z : Perfect match found Y : Perfect match if byte after position A matches remnant B X : Perfect match if byte after position A matches remnant C W : Perfect match if byte after position A matches remnant D 1-4: mismatch, skip 1..4 bytes and restart scan at position A

Geh! I've got to get to work now. I'll update this at lunch time. Anyway, the state table will show how far you can skip ahead so you don't need to examine every byte in the haystack. Read the [no such wiki, Boyer-Moor string search] page, and you'll see what I'm talking about. I've drawn four state machines here by having 4 output edges based on which needle we're tracking, but in the final version, we'd combine the tables into simpler entries. If you like the idea, we can investigate that a bit. I've got some time this evening, so we could code something up.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: [OT] The interesting problem of comparing bit-strings.
by BrowserUk (Pope) on Mar 24, 2015 at 13:01 UTC
    I bet you could use the same basic idea of building a state machine of the next address to check

    Um. Boyer Moore is an optimisation of the basic string search isn't it?

    Right now I'm stuck just trying to implement the basic search taking into account the bit-level offsets. Optimisations can come later if required.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      No Boyer-Moore, but you can convert your bitstring search into a regular expression (in the CS sense, not the Perl one), and then compile it into an efficient state-machine.

      For instance, in order to look for bitstring 010101, you have to look for character strings matching any of...

      s1: 010101xx s2: x010101x s3: xx010101 s4: xxx01010 1xxxxxxx s5: xxxx0101 01xxxxxx s6: xxxxx010 101xxxxx s7: xxxxxx01 0101xxxx s8: xxxxxxx0 10101xxx
      You can write the regular expression matching s1 as a character set $s1 = qr/[\x54\x55\x56\x57]/, and similarly for s2-s8, and then combine all of them as qr/$s1|$s2|...|$s8/.

      Then you would need some module able to convert that regular expression into a state machine, and use it for matching the vector... I am sure there is already some RegExp backend on CPAN that does that.

      The downside is that the resulting state-machines may be huge.

        As you are the third respondent that has suggested bit-masks of one form or another, either:

        1. My question/description is unclear.
        2. Or I'm missing something fundamental.

        Both distinct possibilities.

        As I see it, my problem isn't how to isolate the bits of the strings that I need to compare -- __shiftleft128() does that easily and efficiently.

        The problem is one of how to construct/combine the nested loops in order to perform the iteration of the bits and quads in the proper way.

        Looking up a basic (ie. not the gcc version which incredibly complicated) strstr() function I found this:

        char *strstr( char *haystack, char *needle ) { char *hp; char *np; if( !*needle ) return haystack; // null needle; retur +n haystack (strange POSIX specification requirement!) while( *haystack ) { // until we reach the + end of the haystack hp = haystack; // copy of the haysta +ck pointer np = needle; // copy of the needle + pointer do { if( !*np ) return haystack; // end of needle; mat +ch completed; return current value of haystack pointer } while( *hp++ == *np++ ); // increment both poi +nter (copies) while the chars match ++haystack; // Got here means a m +ismatch; base pointer to next character. } return 0; // Got here means we +hit the terminating null of the haystack; no match; return null. }

        If my bitstrings where always 64-bit aligned, I could just substitute U64 for char everywhere, and that would be that; but of course it isn't.

        Essentially, there are five operations I need to encapsulate:

        1. Incrementing haystack by 1-bit.
        2. Incrementing haystack by 64-bits (1 quad + offset).
        3. Incrementing needle by 64-bits (1 quad + offset).
        4. Detecting EOS of the haystack.
        5. Detecting EOS of the needle.

        For 2 & 3 I thought to write a function:

        U64 nextQuad( U64 **p, U8 o ) { // p: pointer to poin +ter to the 'current' quad; o: value of the unused bits at the beginni +ng. U64 hi = *( ++*p ) // increment the poin +ter at *p, and set hi to the value it points at. , lo = *( *p+1 ); // and lo to the valu +e + 1 it points at return __shiftleft128( lo, hi, o ); // return the value f +rom the higher location with o bits from lo shifted in from the right +. }

        I think that would work, but for efficiency, you'd want to keep hi and lo in registers between calls, and you can't take the address of regiters, so I'd be relying on the compiler to recognise the importance of doing so.

        For one, things get more complicated. Now I need a pointer to the offset as well:

        U64 nextBit( U64 **p, U8 *o ) { // p: pointer to poin +ter to current quad; o: pointer to current offset U64 hi, lo; if( ++( *o ) == 64 ) { // increment the offs +et and detect transitions across quad boundaries. *o = 0; // reset the offset. ++*p; // increment the quad + pointer. } hi = **p; // Get the 'current' + (unshifted) quad value lo = *( *p + 1 ); // And the next return __shiftleft128( lo, hi, *o ); // return the value +from the higher location with *o bits from lo shifted in from the rig +ht. }

        For 4 & 5, I need to calculate whether I've reached the end from the current values of the quad pointer & associated offset.

        Putting it all together and getting it to compile is defeating me at the moment.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        DFA space requirement and construction are both O(m*s) where m is needle length and s the alphabet size. Search is O(n) (linear to text size). In practice, one would handle say 8 bits (a byte) at a time. Even then, the space requirement hardly matters with small patterns..