BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

The interesting problem of comparing bit-strings.

Perl's vec and bitwise string operators are very useful and powerful for manipulating (large) bitstrings; but they are limited in several ways.

Namely, there are no left & right shift operators for them; and the bitwise string operators operate with 8-bit boundaries.

One common operation needed for programming with bitstrings is the ability to search one bitstring for another bitstring, looking for matches that occur even at non-byte or integer-sized alignments.

So graphically, using 8-bit units instead of 64 for brevity, the problem is this:

Find (the first occurance of):

needle = x1001010 10101001 10100xxx (x represents bits we are not + interested in)

in:

haystk = xxxxx001 10001101 01110010 10101010 01101001 01100011 100 +11001 00010001 .....

The specification of needle requires three values:

Similarly, the haystack requires 3 values:

The function required thus takes six arguments:

U64 bsSearch( U64 *hay, U8 oHay, U64 lHay, U64 *ndl, U8 oNdl, U64 +lNdl ) { ... }

Return value

In an ideal world (Perl) the function would return the unit number(or address), and the offset into that unit, where the match is found (or 0).

As C doesn't permit multiple return values, I've opted for a single 64-bit value indicating the offset from the start of the first unit of the haystack (or -1?).

Found:

haystk = xxxxx001 10001101 01110010 10101010 01101001 01100011 100 +11001 00010001 ..... needle = x10010 10101010 0110100x xx (note the + alignment change!) return = 01234567 89012345 6789 = 19th bit = unit 2/ bit 3

There are 128-bit shift instructions available on Intel/AMD 64-bit processors (and probably most others):

REX.W + 0F A5 SHLD r/m64, r64, CL B Valid N.E. Shift + r/m64 to left CL places while shifting bits from r64 in from the rig +ht. REX.W + 0F AD SHRD r/m64, r64, CL B Valid N.E. Shift + r/m64 to right CL places while shifting bits from r64 in from the le +ft.

which are available on the MS compiler as intrinsics:

unsigned __int64 __shiftleft128( unsigned __int64 LowPart, unsigne +d __int64 HighPart, unsigned char Shift ); unsigned __int64 __shiftright128( unsigned __int64 LowPart, unsign +ed __int64 HighPart, unsigned char Shift );

and probably in gcc (though I couldn't locate the names).

Using these, you can load two registers with two units of one of the bitstrings and when shifting the first register, bits from the second are shifted in.

Thus to process through the bitstrings 1 bit at a time, you use loops something like:

for( i = 0; i < ( lHay % 64 )+1; ++i ) { r1 = *( hay + i ); r2 = *( hay + i + 1 ); for( b1 = 0; b1 < 64; ++b1 ){ __shiftleft128( r2, r1, 1 ); } }

and:

for( j = 0; j < ( lNdl % 64 )+1; ++j ) { r3 = *( ndl + j ); r4 = *( ndl +j + 1 ); for( b2 = 0; b2 < 64; ++b2 ) { __shiftleft128( r4, r3, 1 ); } }

Except that:

  1. the terminating conditions of the outer loops need work; (especially in light of b below).
  2. the initialisation/repopulation of the two registers needs to take account of the bit offsets oHay & oNdl respectively.
  3. the two loops need to run concurrently; with the second loop resetting to the starting condition everytime r1 != r3;
  4. We might successfully match & advance through several units before hitting a mismatch.

    At which point we not only need to reset the needle (inner pair?) loop;

    we also need to reset the hay loop by the distance we successfully advanced before the failure, then advance one bit.

What have I tried so far?

In answer to the traditional question: You've just pretty much read it!

The function definition and a few local variables:

U64 bsSearch( U64 *hay, U8 oHay, U64 lHay, U64 *ndl, U8 oNdl, U64 +lNdl ) { register U64 r1, r2, r3, r4; U64 i, j, rv; U8 b1, b2; ... return rv; }

And the two separate loops above, and little else, despite many hours of though and scribbling on post-its.

I suspect that the right approach might include another subroutine that compares the needle to the haystack at a given, fixed offset; but I'm stuck for how to code that such that it doesn't repeatedly shift one or both by an increasing number of bits each time it is called?

I also thought that it might make sense to copy the needle to an aligned chunk of memory, to eliminate its offset from the problem; but it could be a very large piece of memory, which would be expensive.

Update:I suspect one or even two gotos may be needed; though I'd avoid them if I can.

What am I looking for?

Cluebats; pseudo-code; real-code of any flavour.

Basically anything that will make the penny drop as to how to code this beast?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Replies are listed 'Best First'.
Re: [OT] The interesting problem of comparing bit-strings.
by roboticus (Chancellor) on Mar 24, 2015 at 11:41 UTC

    BrowserUk:

    Are you familiar with the Boyer-Moore string search? I bet you could use the same basic idea of building a state machine of the next address to check. But rather than analyzing the needle byte by byte, do it in parallel for each shifted variant. Something like the below. Note: I'm going to pretend the byte size is 4 bits here, and I also don't know boyer-moore by heart, so I'm winging it to give you the gist of it. If you like the idea, you can hash out the annoying details (heh!):

    Given Needle : 1010 0011 1101 1100 Haystack: 0001 0010 1101 1100 1010 1101 0001 1110 1110 0111 1) Make three more copies of given, each bit-shifted Needle A: 1010 0011 1101 1100 Needle B: x101 0001 1110 1110 0xxx Needle C: xx10 1000 1111 0111 00xx Needle D: xxx1 0100 0111 1011 100x Column D C B A 2) Build the BM skip table based on the last complete byte D C B A 0000 ---- ---- ---- 4444 0001 ---W -M-- ---- ---- 0010 --X- ---- ---- ---- 0011 ---W M--- ---- ---- 0100 ---- ---M ---- ---- 0101 -Y-W ---- ---- ---- 0110 --X- ---- ---- ---- 0111 ---W ---- ---M --M- 1000 ---- --M- ---- ---- 1001 ---W ---- ---- ---- 1010 Z-X- ---- ---- ---- 1011 ---W ---- ---- ---M 1100 ---- ---- ---- M--- 1101 -Y-W ---- M--- ---- 1110 --X- ---- -M-- -M-- 1111 ---W ---- --M- ---- - : an entry I didn't try to determine yet M : Matched byte position, match previous needle char to previous table column. Z : Perfect match found Y : Perfect match if byte after position A matches remnant B X : Perfect match if byte after position A matches remnant C W : Perfect match if byte after position A matches remnant D 1-4: mismatch, skip 1..4 bytes and restart scan at position A

    Geh! I've got to get to work now. I'll update this at lunch time. Anyway, the state table will show how far you can skip ahead so you don't need to examine every byte in the haystack. Read the [no such wiki, Boyer-Moor string search] page, and you'll see what I'm talking about. I've drawn four state machines here by having 4 output edges based on which needle we're tracking, but in the final version, we'd combine the tables into simpler entries. If you like the idea, we can investigate that a bit. I've got some time this evening, so we could code something up.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      I bet you could use the same basic idea of building a state machine of the next address to check

      Um. Boyer Moore is an optimisation of the basic string search isn't it?

      Right now I'm stuck just trying to implement the basic search taking into account the bit-level offsets. Optimisations can come later if required.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
        No Boyer-Moore, but you can convert your bitstring search into a regular expression (in the CS sense, not the Perl one), and then compile it into an efficient state-machine.

        For instance, in order to look for bitstring 010101, you have to look for character strings matching any of...

        s1: 010101xx s2: x010101x s3: xx010101 s4: xxx01010 1xxxxxxx s5: xxxx0101 01xxxxxx s6: xxxxx010 101xxxxx s7: xxxxxx01 0101xxxx s8: xxxxxxx0 10101xxx
        You can write the regular expression matching s1 as a character set $s1 = qr/[\x54\x55\x56\x57]/, and similarly for s2-s8, and then combine all of them as qr/$s1|$s2|...|$s8/.

        Then you would need some module able to convert that regular expression into a state machine, and use it for matching the vector... I am sure there is already some RegExp backend on CPAN that does that.

        The downside is that the resulting state-machines may be huge.

Re: [OT] The interesting problem of comparing bit-strings.
by marinersk (Priest) on Mar 24, 2015 at 10:59 UTC

    Quick thought, as I won't have time to dedicate to code-example before tonight, but in case it sparks a line of thought for you -- This is specifically to address the "I don't want it to keep rotating the bits over and over again" problem.

    Might a combination of rotation and mask help?

    Have an initializer which sets up all possible aligned bitmask searches, and then a routine which loops through all the pre-built masks looking for a match?

    In this short example which will likely be fraught with errors, assuming we are looking for 0b1010 in 0b00110100101. For readability's sake, this example plays on an 8-bit alignment requirement; simply increase the work of the initializer and searcher routines to take advantage of 64-bit alignment capabilities.

    The initializer sets up the following array of pre-rotated bitmasks:

    mask[0] = 0b11110000; mask[1] = 0b01111000; mask[2] = 0b00111100; mask[3] = 0b00011110; mask[4] = 0b00001111;

    I'm running on light air here, I think you'll need to also set up an array of pre-rotated comparison bitstrings:

    seabits[0] = 0b10100000; seabits[1] = 0b01010000; seabits[2] = 0b00101000; seabits[3] = 0b00010100; seabits[4] = 0b00001010;

    Then the main comparison routine iterates across the main bit buffer using these pre-built masks and values to ascertain if there is a match.

    Multilingual psuedocode for the main comparison routine:

    for (chunkcol = 0; chunkcol < (BitBufferLength / CHUNKSIZE); chunk +col++) { foreach $masknum (@mask) { if ( ( BitBuffer[chunkcol] & mask[$masknum] ) == $seabit +s[$masknum]) { return ( ( chunkcol * CHUNKSIZE) + $masknum ); +# FIXME: Check for fenceposting error(s) } } } return ($RET_NOTFOUND);

    I should think a static accumulator can track which bit column you find the match in and simply exit the loop returning that value?

    I started coding a proof-of-concept in Perl but as soon as I got into the guts of it I knew this was going to take me longer than I have left this morning.

    Hope this got my point across and hope it either yields some inspiration or you already thought of it and have moved on. :-)


    Update: Fixed bitmask and searchbits arrays.
    Update: Added psuedocode.
    Update: Updated chunking in psuedocode.

Re: [OT] The interesting problem of comparing bit-strings.
by hdb (Monsignor) on Mar 24, 2015 at 20:29 UTC

    Low level C programming is not my strength, so I stay with bit strings in Perl. Based on a block size of 8 bits, you would have 8 alignments of the needle. Each of this 8 alignments has a middle part of complete 8 bit blocks and a left and right partial block. I would use something like index to search for the middle block in the haystack bit string and then check whether or not the left and right partial blocks match or not.

    Apologies if anyone's proposal above is materially the same but the fact eluded me.

      The idea is to provide the functionality to perl through a I::C module.

      I've done similar arrangements -- using bytes and shifts for the unaligned parts and 32 or 64-bit compares for the aligned parts -- but for large bitstrings -- mine are frequently > 1/4gb in size -- it is very slow.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        Ahhh....   the bitstrings are very large?   How about the typical length of the patterns for which you search within such strings?

        If the length of the string were longer than 3 * 64 = 192 bits long, and especially if it were longer, then you would have a definite “crib,” in the form of a 64-bit quantity ... there are exactly 64 possibilities ... that must occur in the pattern at a quadword boundary.   This would be a “huge win™,” well worth the extra effort of creating a special-case for, in anticipation that it would actually pay-off most of the time.

        I-F the string being sought was at least 192 bits long, then you are able to calculate, with certainty, the 64 possible values that exist for “the 64-bit ‘word in the middle.’”   You can now search for each of these in turn, using an extremely efficient REPNE SCASD (i86 ...) assembler-language construct, and then explore the surrounding doublewords in each case to determine if the bit-string occurs.   (If the run being sought is “even longer,” then a REPNE CMPSD loop can be applied to check all of the “subsequent doublewords,” the values of all of which once again can be determined with certainty.   Only the partial-doublewords at the beginning and at the end, if any, remain to be verified.)

        Easily the most efficient implementation would be to loop through all 64 of the pre-computed possibilities, using the assembler instructions aforementioned to locate all occurrences within the buffer.   (Of course, and as you of all Monks well know, you will need to use a “sliding window” scan through the compass of the entire file, to avoid missing any bitstrings that serependitiously fell across a boundary, but, I digress.)

        Each and every time you stumble-upon any of these 64 values, the odds that you have, in fact, “found another answer” are rather-astronomically high.   In fact, in such a case you might even wish to “take a calculated risk” that in fact you need look no further.   If you actually discover a matching 64-bit quantity within the buffer, then the odds are “64 in 2^64” that the entire value-being-sought is present.   Certainly, if the REPNE CMPSD succeeds with regard to a location found by REPNE SCASD, then the odds of being wrong become absurdly (and dismissably) small:   “It is a match.   It has to be.”   At this point, the actual odds of committing a “type-2 error” are . . . zero.

        No, this is not a solution to your entire problem:   if the bit-string being sought turns out to be shorter, it will not help at all.   But, for longer strings, it will cause the problem to “virtually disappear.”

          A reply falls below the community's threshold of quality. You may see it by logging in.
Re: [OT] The interesting problem of comparing bit-strings.
by VinsWorldcom (Prior) on Mar 25, 2015 at 15:53 UTC

    This is way over my head, but as a proof of concept, I tried to use Bit::Vector to compare the needle and haystack as strings. Probably not very efficient for the large sets you're talking about.

    #!perl use strict; use warnings; use Bit::Vector; # 101001 my $nedInHay = Bit::Vector->new_Dec(6, 41); # 101000 my $nedNOTInHay = Bit::Vector->new_Dec(6, 40); # 100101001001000011111010 my $hay = Bit::Vector->new_Dec(24, 9736442); print $nedInHay->to_Bin; if ((my $idx = index ($hay->to_Bin, $nedInHay->to_Bin )) != -1) { print " Found. Starts at position - $idx\n" } print $nedNOTInHay->to_Bin; if (my $idx = index ($hay->to_Bin, $nedNOTInHay->to_Bin ) != -1) { print " Found. Starts at position - $idx\n" } print "\n"

    ... and run ...

    VinsWorldcom@C:\Users\VinsWorldcom\tmp> test.pl 101001 Found. Starts at position - 3 101000
Re: [OT] The interesting problem of comparing bit-strings.
by Ea (Chaplain) on Mar 25, 2015 at 09:32 UTC
    My first thought was that the problem is similar to what you might do in bioinformatics, as mentioned by Anonymous earlier. Although there are are likely significant differences between bitstrings and character strings that I can't help you with, I was listening to Natural Language Processing lectures on Coursera and the Edit Distance lectures attacked the problem with Dynamic Programming. Maybe that helps you with the coding strategy?

    Sometimes I can think of 6 impossible LDAP attributes before breakfast.

      Bioinformatics is one use (actually many uses) of large bitstring searches.

      Others include:

      1. Finding patterns of usage in temporal data.

        Each bit represents a time period; 1s represent activity within that time period.

      2. Steganography.

        Both application of, and detection of.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: [OT] The interesting problem of comparing (long) bit-strings.
by Anonymous Monk on Mar 28, 2015 at 01:39 UTC

    You'll likely want a search that is strongly sub-linear, ie better than O(n). There are string search algorithms that approach O(log(m)*n/m), on the average. With long needles (length m), that'll be a blast. However, you don't get anything for free: preprocessing requirements are O(m) in time and space.

    Google for "Turbo Reverse Factor" and "Alpha Skip Search". Wikipedia seems to lack pages for these. With alpha skip, you create a trie (or a table) where you push the offsets of all substrings of length log(m). Bitstrings are just strings on size=2 alphabet. Which algorithm or variant thereof will prove most suitable for you is something hardly anyone here could guess.

      You'll likely want a search that is strongly sub-linear, ie better than O(n). There are string search algorithms that approach O(log(m)*n/m), on the average.

      On modern hardware, the single most expensive avoidable* 'operation' is a cache miss. With instruction cache misses more expensive data cache misses due to pipelining.

      *context switches are more expensive, but are, at least in the long term, unavoidable.

      The problem with (all?) the analysies of the various sub-linear algorithms; is that they ignore (or were done before), the costs & benefits of 2 and 3 levels of on-chip instruction & data cache, became a significant factor in instrumenting this type of algorithm.

      My 7 y/o cpu, with only 2-level caching, can perform upwards of 1000 instructions in the time it takes to fulfill a data-cache miss. The effect for an instruction cache miss is even greater.

      Sub-linear algorithms exacerbate both problems, in 3 different ways:

      1. They require extra memory in the form of lookup/lookahead tables.

        That space and frequent access to it, directly hits the effectiveness of the data cache's ability to keep the relevant chunks of the haystack and needle in cache.

      2. The algorithms call for look-aheads, look-behinds and/or probes; that upsets the caching algorithms ability to keep caches primed with the required data.

        This is especially important, 64 times as important, for a bitstring search, where each bit needs to be compared 64 times each as a part of 64 different groups of 64 bits.

        This is the single most important factor that has come out of my testing of real code, on a real machine. Modern hardware is optimised to scream through memory in a linear fashion; and every single time you do anything that upsets that relentless, mechanical (electronic) forward (or reverse) flow, you pay a huge hit.

      3. They also require many more instructions and branch points.

        The number of instructions is significant, because if instructions early in the loop get flushed by instructions later in the loop, then you not only get a stall for the instruction cache when the loop iterates; it also causes a pipeline flush; and invalidates branch predictions. All together; and a very expensive situation.

        The number of branch points is significant because they compound exponentially in their effects on failed branch predictions; with the knock-on effect of instruction cache stalls if you breach the cache size cliff.

      The beauty of well-written brute force algorithms, is that they derive absolute maximum benefit from the cache effect, by doing as much work as possible on each cache line before allowing it to get flushed thus minimising the need for reloads:

      1. The inner loops are very small with only 1 or 2 branch points required.

        That ensures no instruction cache misses and fullest possible benefits from both branch predictions and pipelining.

      2. Because the data is processed strictly serially; and all the work required on each piece is done with just one from-memory load; they work with both the hardware branch prediction and caching algorithms, to best effect.

      In the following table of benchmark timings from my implementation:

      • The horizontal list of numbers at the top is the length of the needle in bits.
      • The major vertical tick numbers are the distance, in quads, into the 128MB/1Gib haystack at which the needle is found.
      • The minor vertical tick numbers are the offset, in bits, into the quad, where the needle is found.

      Summary. The algorithm is very linear in terms of the distance into the buffer; and sub-linear in terms of the length of the needle. And it is very fast.

      Can it be beaten? Maybe, but I very much doubt it will be done using Boyer Moore or similar, jump-around-the-data algorithms. Time will tell :)


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      Alright. Here's a demo of alpha skip search on bitstrings. The needle size is fixed, but it should be trivial to follow this with a longer compare. A possibility exists that some off-by-one errors have crept in. If that's the case, just deduct from my paycheck. :-)

      Alpha skip is quadratic in the worst case and probably not the best choice when nasty matches are expected. Anyway, I'm curious how this might hold up against the optimized brute-force routines.

      Note: to change the needle size, just edit the M and logM values. (Compile with -fno-strict-aliasing, or fix the type-punning).

        Making minimal changes to your code to allow it compile here, and it simply never finds anything(very quickly), no matter how simple I make the search criteria:

        C:\test\C>cl /W3 bitsearch-oiskuu.c Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64 Copyright (C) Microsoft Corporation. All rights reserved. bitsearch-oiskuu.c bitsearch-oiskuu.c(49) : warning C4267: '=' : conversion from 'size_t' + to 'U16', possible loss of data bitsearch-oiskuu.c(78) : warning C4267: '=' : conversion from 'size_t' + to 'char', possible loss of data bitsearch-oiskuu.c(80) : warning C4267: '+=' : conversion from 'size_t +' to 'char', possible loss of data Microsoft (R) Incremental Linker Version 9.00.21022.08 Copyright (C) Microsoft Corporation. All rights reserved. /out:bitsearch-oiskuu.exe bitsearch-oiskuu.obj C:\test\C>bitsearch-oiskuu Result -1 Took 1.837746e-004 secs

        I've wasted enough time on it; and won't be expending any more, but here is the modified code that compiles clean (apart from warnings that originate from your posted code):

        I'd suggest you enter it in an obfuscated C competition; could be a winner.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
        'm curious how this might hold up against the optimized brute-force routines.

        It's 3 orders of magnitude slower.

      I'll attempt a brief description of the Alpha Skip Search. Hopefully, this may shed some light on the basic operation of the algorithm.

      Consider a pattern of length M. The approach is to quickly locate any fractional substring within the pattern, wherever it may occur. Substrings of length F take offsets from 0 .. M-F, inclusive. Thus there is a set of M-F+1 needle fragments. The preprocessing stage creates an index of those fragments. A perl implementation might read as follows:

      push @{$index{ substr($needle, $_, $F) }}, $_     for 0 .. $M-$F;

      Search phase proceeds by inspecting the haystack one window at a time. Window shift is the same figure as above, M-F+1. Each time, a substring is extracted at the end of the window. The index now yields all possible alignments where this fragment will match in the pattern.

      Theoretical optimum for F is logsM (log based on alphabet size). That amounts to one entry per index bucket, on the average. For random data, there will be one alignment to test per window, and this is likely to fail at the very start.

      In the case of bit-string matching, the substrings are nothing but small integers below (1<<F), which suggests a very simple array-based indexing scheme (instead of trie). Here's a sample implementation in C (just a rehash of the previous demo, main difference being this one handles variable length needles).

      // A demonstration of alpha skip search on bitstrings // // dd if=/dev/urandom bs=1k count=99k > garbage // ./a.out [length] [offset] < garbage // // Feed the program with random needle and haystack. Two arguments may + be given. // Haystack offset says where to hide the needle. Needle length defaul +ts to 65536. #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <string.h> #include <unistd.h> // F = log(length) is theoretically optimal. Mmax is our maximum needl +e size. // if logM is changed, be sure to also check the bt_off_t below enum { logM = 32, Mmax = 1UL << logM, Mnil = Mmax - 1UL, Mdflt = 65536 + }; enum { F = 16, Ftop = 1UL << F, Fmask = Ftop - 1UL }; typedef uint32_t bt_off_t; // must hold 0..Mnil typedef struct { bt_off_t head[Ftop]; bt_off_t next[]; } bt_index; // extract F bits at bit offset i. assert(F <= 25) // this is non-portable static inline size_t fbits(char *v, size_t i) { uint32_t *p = (uint32_t *)&v[i >> 3]; return Fmask & (*p >> (i & 7)); } static size_t xcmp(char *y, size_t j, char *x, size_t w) { size_t i, diff = fbits(y, j) - fbits(x, 0); for (i = F; i < w && !diff; i += F) { diff = fbits(y, i + j) - fbits(x, i); } if (!diff) diff = fbits(y, w-1 + j) - fbits(x, w-1); return diff; } static bt_index *prep(char *x, size_t w) { bt_index *Z = malloc(sizeof(*Z) + w * sizeof(*Z->next)); size_t i, f; for (i = 0; i < Ftop; i++) { Z->head[i] = Mnil; } for (i = 0; i < w; i++) { f = fbits(x, i); Z->next[i] = Z->head[f]; Z->head[f] = i; } printf("bt_index %zu bytes\n", sizeof(*Z) + w * sizeof(*Z->next)); return Z; } // y of length n is haystack; x is the needle size_t search(char *y, size_t n, char *x, size_t m) { size_t w = m - F + 1; // shift size size_t i, j, f; bt_index *Z = prep(x, w); for (i = -1; (i += w) <= n - F;) { f = fbits(y, i); for (j = Z->head[f]; j != Mnil; j = Z->next[j]) { if (!xcmp(y, i-j, x, w)) return i-j; } } return -1; } // one-bit rotate void rot1(char *v, size_t n) { unsigned char *p = (unsigned char *)v; size_t i, c; for (c = i = 0; i < n; c >>= 8, i++) { p[i] = c += p[i] << 1; } p[0] += c; } int main(int argc, char *argv[]) { static char buf[(99<<20) + 1]; size_t n = read(0, buf, sizeof(buf) - 1); size_t t = argc > 1 ? strtoul(argv[1], NULL, 0) : 0; size_t m = (t < F || t - 1 > Mnil) ? Mdflt : t; char *x = buf, *y = buf + (m+7)/8; if ((long)n < 0 || (long)n < 2*(y-x)) { printf("read %ld bytes, need at least %ld\n", (long)n, (long)2 +*(y-x)); return 1; } n = 8 * (n - (y-x)); printf("haystack=%zu needle=%zu (bits)\n", n, m); if (argc > 2) { size_t r, i = strtoul(argv[2], NULL, 0); i += (long)i < 0 ? (n-m) : 0; i = i < (n-m) ? i : (n-m); memcpy(y + i/8, x, (y-x)); // stuff the needle and rotate for (r = i & 7; r; r--) rot1(y + i/8, (y-x) + 1); printf("needle at %zu (byte offset %zu, rot %zu)\n", i, i/8, i +%8); } printf("search=%ld\n", (long)search(y, n, x, m)); return 0; }

      Tested with gcc/clang under linux; my apologies if the code is too obscure, or otherwise unworkable.

      Update. Simplified above code. Note that it is sometimes difficult to put a name on a string search routine, especially with bit-strings. Minor modification can turn one search into another, and the variants abound. Here's a version that might be called Horspool.

      // A demonstration of Horspool variant on bitstrings // // dd if=/dev/urandom bs=1k count=99k > garbage // ./a.out [length] [offset] < garbage // // Feed the program with random needle and haystack. Two arguments may + be given. // Haystack offset says where to hide the needle. Needle length defaul +ts to 65536. #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <string.h> #include <unistd.h> // if logM is changed, be sure to also check the bt_off_t below enum { logM = 32, Mmax = 1UL << logM, Mnil = Mmax - 1UL, Mdflt = 65536 + }; enum { F = 16, Ftop = 1UL << F, Fmask = Ftop - 1UL }; typedef uint32_t bt_off_t; // must hold 0..Mnil typedef struct { bt_off_t head[Ftop]; } bt_index; // extract F bits at bit offset i. assert(F <= 25) // this is non-portable static inline size_t fbits(char *v, size_t i) { uint32_t *p = (uint32_t *)&v[i >> 3]; return Fmask & (*p >> (i & 7)); } static size_t xcmp(char *y, size_t j, char *x, size_t w) { size_t i, diff = fbits(y, j) - fbits(x, 0); for (i = F; i < w && !diff; i += F) { diff = fbits(y, i + j) - fbits(x, i); } if (!diff) diff = fbits(y, w-1 + j) - fbits(x, w-1); return diff; } static bt_index *prep(char *x, size_t w) { static bt_index Z[1]; size_t i, f; for (i = 0; i < Ftop; i++) { Z->head[i] = Mnil; } for (i = 0; i < w; i++) { f = fbits(x, i); Z->head[f] = i; } printf("bt_index %zu bytes\n", sizeof(*Z)); return Z; } // y of length n is haystack; x is the needle size_t search(char *y, size_t n, char *x, size_t m) { size_t w = m - F + 1; // shift size size_t i, j, f; bt_index *Z = prep(x, w); for (i = -1; (i += w) <= n - F;) { f = fbits(y, i); if ((j = Z->head[f]) != Mnil) { i -= j; if (!xcmp(y, i, x, w)) return i; } } return -1; } // one-bit rotate void rot1(char *v, size_t n) { unsigned char *p = (unsigned char *)v; size_t i, c; for (c = i = 0; i < n; c >>= 8, i++) { p[i] = c += p[i] << 1; } p[0] += c; } int main(int argc, char *argv[]) { static char buf[(99<<20) + 1]; size_t n = read(0, buf, sizeof(buf) - 1); size_t t = argc > 1 ? strtoul(argv[1], NULL, 0) : 0; size_t m = (t < F || t - 1 > Mnil) ? Mdflt : t; char *x = buf, *y = buf + (m+7)/8; if ((long)n < 0 || (long)n < 2*(y-x)) { printf("read %ld bytes, need at least %ld\n", (long)n, (long)2 +*(y-x)); return 1; } n = 8 * (n - (y-x)); printf("haystack=%zu needle=%zu (bits)\n", n, m); if (argc > 2) { size_t r, i = strtoul(argv[2], NULL, 0); i += (long)i < 0 ? (n-m) : 0; i = i < (n-m) ? i : (n-m); memcpy(y + i/8, x, (y-x)); // stuff the needle and rotate for (r = i & 7; r; r--) rot1(y + i/8, (y-x) + 1); printf("needle at %zu (byte offset %zu, rot %zu)\n", i, i/8, i +%8); } printf("search=%ld\n", (long)search(y, n, x, m)); return 0; }

        Despite saying I wouldn't look at this further; since you updated, I did. But I didn't get very far.

        Once I commented out a couple (non-existent on my system) unix-y headers -- no idea if they are necessary or not:

        #include <stdio.h> //#include <stdint.h> #include <stdlib.h> #include <string.h> //#include <unistd.h>

        I got this:

        c:\test\c\oiskuu-search.h(9) : warning C4293: '<<' : shift count negat +ive or too big, undefined behavior

        Which I fixed by changing 1L to 1ull:

        enum { logM = 32, Mmax = 1ull << logM, Mdflt = 65536, Mnil = Mmax - 1 +};

        Which got me to:

        c:\test\c\oiskuu-search.h(10) : warning C4341: 'Mmax' : signed value i +s out of range for enum constant c:\test\c\oiskuu-search.h(10) : warning C4309: 'initializing' : trunca +tion of constant value c:\test\c\oiskuu-search.h(14) : error C2143: syntax error : missing ') +' before '(' c:\test\c\oiskuu-search.h(14) : error C2091: function returns function c:\test\c\oiskuu-search.h(14) : error C2059: syntax error : ')' c:\test\c\oiskuu-search.h(14) : error C2085: 'bt_off_t' : not in forma +l parameter list

        From these 3 lines:

        // F = logM is theoretically optimal. Mmax is our maximum needle size. // if logM/Mmax is changed, be sure to also check the bt_off_t below enum { logM = 32, Mmax = 1ull << logM, Mdflt = 65536, Mnil = Mmax - 1 +}; enum { F = 16, Ftop = 1 << F, Fmask = Ftop - 1 }; //typedef uint32_t bt_off_t; // must hold 0..Mnil enum { BT_OFF_MAX = Mnil } __attribute__((packed)) typedef bt_off_t;

        I hope you'll understand that I'm not being unfair when I say I have simply no clue how to fix that...but I did try.

        (Quite why you need to invent your own special integer types all over the place is beyond my understanding.)


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      so instead of brute forcing a search problem, the solution is to run out of memory ... brilliant :P
        Boyer-Moore is O(1) in memory usage.

        Hm? Boyer-Moore prep is the same, O(m) in time and space. A good-shift table of length m is used. It might be more compact in practice.

        If the problem is too large, divide it! Code a tuned search routine for fixed length (sub)needles, such that all tables fit into L2. My guess is KB needles can be done. Now the search becomes two-level. Top-level "alphabet" are the subneedles.

        As for brute-force. The "Boyer-Moore beating assembler versions that utilize microcoded string search instructions" is a decades-old lesson. Tried to google for some edutaining links, but sadly, google has become ****.

Re: [OT] The interesting problem of comparing bit-strings.
by sundialsvc4 (Abbot) on Mar 25, 2015 at 20:04 UTC

    “Bit vector” problems such as this one actually have many different analogs.   (Graphics processing, in particular, comes to mind.)   These problems become expensive and difficult to solve if the strings are short (less than the native word-size of the machine), and/or if there is positively no “crib” that can be meaningfully applied to cut the search-space.

    For instance, if you are routinely looking for bit-strings that are (say) only 20-30 bits long, then there really isn’t any alternative but to pedantically growl through the buffer.   If you are looking for a 20-bit-long quantity in a world of 64-bit words, that quantity could be in any one of 64 positions, and within that space of 128 total bits, 108 of them are entropy.   Strategies such as the one which I previously suggested would be of no real value, because the shorter the “window” is, the less likely it is that “a hit” implies the existence of “a solution.”

    Even if you try to press strategies such as Moore into service, you find that they are not nearly so effective in this situation, as they are in the text-search applications for which they were intended.   Why?   Because uncompressed text actually has only a very small number of possible bit-patterns within each byte.   The entropy of the data is very small.   If you consider text “one character at a time,” you get a familiar letter-frequency distribution:   ETAOIN....   In that context, Moore’s algorithm might well save some time.   Whereas, if the data is a bit-string, the distribution of byte-wise values is much more likely to be flat, simply because the data is not aligned to any sort of incremental boundary.

    Hence, the necessity to “divide and conquer.”   To find some rune by which you can quickly eliminate all places where the data could not be.

      “Bit vector” problems such as this one actually have many different analogs. (Graphics processing, in particular, comes to mind.)

      Arrays of rgb or rgba (or CMK or HSA) are not bit vectors.

      For instance, if you are routinely looking for bit-strings that are (say) only 20-30 bits long, then there really isn’t any alternative but to pedantically growl through the buffer.

      There isn't any alternative. Full stop. Regardless of the length of the needles.

      But when a sensible algorithm can locate 1 million bits close to the end of a 1/4 billion bits in 0.25 seconds, who cares if you have to "pedantically growl".

      Strategies such as the one which I previously suggested would be of no real value,

      It is of no value. Full stop.

      Even if you try to press strategies such as Moore into service, you find that they are not nearly so effective in this situation

      if by "Moore" you mean Boyer Moore; you're wrong again. It requires some modifications, but it can be very useful. Just not with microcoded scan loops.

      Hence, the necessity to “divide and conquer.” To find some rune by which you can quickly eliminate all places where the data could not be.

      There's no necessity. And anything that takes longer than 1/4 of a second is pointless.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
        But when a sensible algorithm can locate 1 million bits close to the end of a 1/4 billion bits in 0.25 seconds, who cares if you have to "pedantically growl" ..... And anything that takes longer than 1/4 of a second is pointless.

        Out of curiosity, I tested the gmp library's capability with the above scenario - under the assumption that "1/4 billion" is "250 million".
        The haystack was pseudo-randomly generated, and the needle starts 19 bits from the left hand end of the haystack.
        I used a combination of Math::GMPz and Inline::C.

        It took 3 seconds to find the needle in the haystack (starting from the right hand end of the haystack) on Windows 7 64 bit.

        In some parts of the world "1/4 billion" is "25 million" and the program then, of course, finds the needle 10 times quicker.

        I'm not at all impressed with the code that I wrote in order to check this. But I was impressed that a general usage library could provide such a speedy implementation of the approach.
        Unfortunately, it's not quick enough to be pointful ;-)
        use strict; use warnings; use Math::GMPz qw(:mpz); use Time::HiRes; use Inline C => Config => USING => 'ParseRegExp', BUILD_NOISY => 1, INC => '-IC:/MinGW/msys/1.0/local/include', LIBS => '-LC:/MinGW/msys/1.0/local/lib -lgmp', TYPEMAPS => './typemap', # ships with Math::GMPz ; use Inline C => <<'EOC'; #include <gmp.h> int start(mpz_t * haystack, int haystack_size, mpz_t * needle, int needle_size) { int h_bit, n_bit, nless1 = needle_size -1; for(h_bit = nless1; h_bit < haystack_size; h_bit++) { for(n_bit = 0; n_bit < needle_size; n_bit++) { if(mpz_tstbit(*haystack, h_bit - needle_size + n_bit) != mpz_tstbit(needle, n_bit)) break; if(n_bit == nless1) return h_bit; } } return 0; } EOC my ($t1, $t2, $match); my ($haystack_size, $needle_size) = (250e6, 1e6); # generate the haystack - takes a while my $h_string = rstring($haystack_size); # generate a needle that's guaranteed to match , starting # at 19 bits from the lh end - ie at bit 249999981 my $n_string = substr($h_string, 19, $needle_size); # create the haystack gmp vector. (We need an additional # leading "1" bit to ensure that leading zero bits aren't ignored.) my $haystack = Math::GMPz->new( "1" . $h_string , 2); # create the needle gmp vector. (We need an additional # leading "1" bit to ensure that leading zero bits aren't ignored.) my $needle = Math::GMPz->new("1" . $n_string, 2); # Check that we've got the right number of bits. print Rmpz_sizeinbase($haystack, 2), " ", Rmpz_sizeinbase($needle, 2), "\n"; $t1 = Time::HiRes::time(); $match = start($haystack, $haystack_size, $needle, $needle_size); $t2 = Time::HiRes::time(); print $match, " ", $t2 - $t1, "\n"; sub rstring { die "haystack size must be a multiple of 10" if $_[0] % 10; my $ret; my $its = $_[0] / 10; $ret .= sprintf("%010b",int rand 1024) for 1 .. $its; if(length($ret) != $_[0] || $ret =~ /[^01]/) { die "Error in sub rstring"; } return $ret; } # Outputs: # 250000001 1000001 # 249999981 3.16680598258972
        Cheers,
        Rob
A reply falls below the community's threshold of quality. You may see it by logging in.
A reply falls below the community's threshold of quality. You may see it by logging in.