You'll likely want a search that is strongly sub-linear, ie better than O(n). There are string search algorithms that approach O(log(m)*n/m), on the average. With long needles (length m), that'll be a blast. However, you don't get anything for free: preprocessing requirements are O(m) in time and space.

Google for "Turbo Reverse Factor" and "Alpha Skip Search". Wikipedia seems to lack pages for these. With alpha skip, you create a trie (or a table) where you push the offsets of all substrings of length log(m). Bitstrings are just strings on size=2 alphabet. Which algorithm or variant thereof will prove most suitable for you is something hardly anyone here could guess.

• Comment on Re: [OT] The interesting problem of comparing (long) bit-strings.

Replies are listed 'Best First'.
Re^2: [OT] The interesting problem of comparing (long) bit-strings.
by BrowserUk (Pope) on Mar 28, 2015 at 10:40 UTC
You'll likely want a search that is strongly sub-linear, ie better than O(n). There are string search algorithms that approach O(log(m)*n/m), on the average.

On modern hardware, the single most expensive avoidable* 'operation' is a cache miss. With instruction cache misses more expensive data cache misses due to pipelining.

*context switches are more expensive, but are, at least in the long term, unavoidable.

The problem with (all?) the analysies of the various sub-linear algorithms; is that they ignore (or were done before), the costs & benefits of 2 and 3 levels of on-chip instruction & data cache, became a significant factor in instrumenting this type of algorithm.

My 7 y/o cpu, with only 2-level caching, can perform upwards of 1000 instructions in the time it takes to fulfill a data-cache miss. The effect for an instruction cache miss is even greater.

Sub-linear algorithms exacerbate both problems, in 3 different ways:

1. They require extra memory in the form of lookup/lookahead tables.

That space and frequent access to it, directly hits the effectiveness of the data cache's ability to keep the relevant chunks of the haystack and needle in cache.

2. The algorithms call for look-aheads, look-behinds and/or probes; that upsets the caching algorithms ability to keep caches primed with the required data.

This is especially important, 64 times as important, for a bitstring search, where each bit needs to be compared 64 times each as a part of 64 different groups of 64 bits.

This is the single most important factor that has come out of my testing of real code, on a real machine. Modern hardware is optimised to scream through memory in a linear fashion; and every single time you do anything that upsets that relentless, mechanical (electronic) forward (or reverse) flow, you pay a huge hit.

3. They also require many more instructions and branch points.

The number of instructions is significant, because if instructions early in the loop get flushed by instructions later in the loop, then you not only get a stall for the instruction cache when the loop iterates; it also causes a pipeline flush; and invalidates branch predictions. All together; and a very expensive situation.

The number of branch points is significant because they compound exponentially in their effects on failed branch predictions; with the knock-on effect of instruction cache stalls if you breach the cache size cliff.

The beauty of well-written brute force algorithms, is that they derive absolute maximum benefit from the cache effect, by doing as much work as possible on each cache line before allowing it to get flushed thus minimising the need for reloads:

1. The inner loops are very small with only 1 or 2 branch points required.

That ensures no instruction cache misses and fullest possible benefits from both branch predictions and pipelining.

2. Because the data is processed strictly serially; and all the work required on each piece is done with just one from-memory load; they work with both the hardware branch prediction and caching algorithms, to best effect.

In the following table of benchmark timings from my implementation:

• The horizontal list of numbers at the top is the length of the needle in bits.
• The major vertical tick numbers are the distance, in quads, into the 128MB/1Gib haystack at which the needle is found.
• The minor vertical tick numbers are the offset, in bits, into the quad, where the needle is found.

Summary. The algorithm is very linear in terms of the distance into the buffer; and sub-linear in terms of the length of the needle. And it is very fast.

Can it be beaten? Maybe, but I very much doubt it will be done using Boyer Moore or similar, jump-around-the-data algorithms. Time will tell :)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^2: [OT] The interesting problem of comparing (long) bit-strings.
by oiskuu (Hermit) on Mar 29, 2015 at 09:10 UTC

Alright. Here's a demo of alpha skip search on bitstrings. The needle size is fixed, but it should be trivial to follow this with a longer compare. A possibility exists that some off-by-one errors have crept in. If that's the case, just deduct from my paycheck. :-)

Alpha skip is quadratic in the worst case and probably not the best choice when nasty matches are expected. Anyway, I'm curious how this might hold up against the optimized brute-force routines.

Note: to change the needle size, just edit the M and logM values. (Compile with -fno-strict-aliasing, or fix the type-punning).

Making minimal changes to your code to allow it compile here, and it simply never finds anything(very quickly), no matter how simple I make the search criteria:

```C:\test\C>cl /W3 bitsearch-oiskuu.c
Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64

bitsearch-oiskuu.c
bitsearch-oiskuu.c(49) : warning C4267: '=' : conversion from 'size_t'
+ to 'U16', possible loss of data
bitsearch-oiskuu.c(78) : warning C4267: '=' : conversion from 'size_t'
+ to 'char', possible loss of data
bitsearch-oiskuu.c(80) : warning C4267: '+=' : conversion from 'size_t
+' to 'char', possible loss of data
Microsoft (R) Incremental Linker Version 9.00.21022.08

/out:bitsearch-oiskuu.exe
bitsearch-oiskuu.obj

C:\test\C>bitsearch-oiskuu
Result           -1     Took 1.837746e-004 secs

I've wasted enough time on it; and won't be expending any more, but here is the modified code that compiles clean (apart from warnings that originate from your posted code):

I'd suggest you enter it in an obfuscated C competition; could be a winner.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
'm curious how this might hold up against the optimized brute-force routines.

It's 3 orders of magnitude slower.

More or less, yeah.

```[         0.033410] needle 512, haystack 134217216 bytes
[         0.000030] rot 5 at 8000007; bitoff=64000061
[         0.000017] needle stuffed!
[         0.000644] search=64000061
I see your table lists "2.483719e-001" for the corresponding case. Brute force appears to be >300 times slower (0.2483719/0.000644).

Looks like { M = 16384, logM = 14 } case is the fastestfaster, with tables weighing in at 64KB. Beyond that, slowdown. (0.2784286/0.000236 == ~1180). Even bigger needles/tables may be quicker still.

Re^2: [OT] The interesting problem of ... bit-strings (alpha skip search)
by oiskuu (Hermit) on Apr 01, 2015 at 10:18 UTC

I'll attempt a brief description of the Alpha Skip Search. Hopefully, this may shed some light on the basic operation of the algorithm.

Consider a pattern of length M. The approach is to quickly locate any fractional substring within the pattern, wherever it may occur. Substrings of length F take offsets from 0 .. M-F, inclusive. Thus there is a set of M-F+1 needle fragments. The preprocessing stage creates an index of those fragments. A perl implementation might read as follows:

push @{\$index{ substr(\$needle, \$_, \$F) }}, \$_     for 0 .. \$M-\$F;

Search phase proceeds by inspecting the haystack one window at a time. Window shift is the same figure as above, M-F+1. Each time, a substring is extracted at the end of the window. The index now yields all possible alignments where this fragment will match in the pattern.

Theoretical optimum for F is logsM (log based on alphabet size). That amounts to one entry per index bucket, on the average. For random data, there will be one alignment to test per window, and this is likely to fail at the very start.

In the case of bit-string matching, the substrings are nothing but small integers below (1<<F), which suggests a very simple array-based indexing scheme (instead of trie). Here's a sample implementation in C (just a rehash of the previous demo, main difference being this one handles variable length needles).

Tested with gcc/clang under linux; my apologies if the code is too obscure, or otherwise unworkable.

Update. Simplified above code. Note that it is sometimes difficult to put a name on a string search routine, especially with bit-strings. Minor modification can turn one search into another, and the variants abound. Here's a version that might be called Horspool.

Despite saying I wouldn't look at this further; since you updated, I did. But I didn't get very far.

Once I commented out a couple (non-existent on my system) unix-y headers -- no idea if they are necessary or not:

```#include <stdio.h>
//#include <stdint.h>
#include <stdlib.h>
#include <string.h>
//#include <unistd.h>

I got this:

```c:\test\c\oiskuu-search.h(9) : warning C4293: '<<' : shift count negat
+ive or too big, undefined behavior

Which I fixed by changing 1L to 1ull:

```enum { logM = 32, Mmax = 1ull << logM, Mdflt = 65536, Mnil = Mmax - 1
+};

Which got me to:

```c:\test\c\oiskuu-search.h(10) : warning C4341: 'Mmax' : signed value i
+s out of range for enum constant
c:\test\c\oiskuu-search.h(10) : warning C4309: 'initializing' : trunca
+tion of constant value
c:\test\c\oiskuu-search.h(14) : error C2143: syntax error : missing ')
+' before '('
c:\test\c\oiskuu-search.h(14) : error C2091: function returns function
c:\test\c\oiskuu-search.h(14) : error C2059: syntax error : ')'
c:\test\c\oiskuu-search.h(14) : error C2085: 'bt_off_t' : not in forma
+l parameter list

From these 3 lines:

```// F = logM is theoretically optimal. Mmax is our maximum needle size.
// if logM/Mmax is changed, be sure to also check the bt_off_t below
enum { logM = 32, Mmax = 1ull << logM, Mdflt = 65536, Mnil = Mmax - 1
+};
enum { F = 16, Ftop = 1 << F, Fmask = Ftop - 1 };

//typedef uint32_t bt_off_t;        // must hold 0..Mnil
enum { BT_OFF_MAX = Mnil } __attribute__((packed)) typedef bt_off_t;

I hope you'll understand that I'm not being unfair when I say I have simply no clue how to fix that...but I did try.

(Quite why you need to invent your own special integer types all over the place is beyond my understanding.)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^2: [OT] The interesting problem of comparing (long) bit-strings.
by Anonymous Monk on Mar 28, 2015 at 08:46 UTC
so instead of brute forcing a search problem, the solution is to run out of memory ... brilliant :P
Boyer-Moore is O(1) in memory usage.

Hm? Boyer-Moore prep is the same, O(m) in time and space. A good-shift table of length m is used. It might be more compact in practice.

If the problem is too large, divide it! Code a tuned search routine for fixed length (sub)needles, such that all tables fit into L2. My guess is KB needles can be done. Now the search becomes two-level. Top-level "alphabet" are the subneedles.

As for brute-force. The "Boyer-Moore beating assembler versions that utilize microcoded string search instructions" is a decades-old lesson. Tried to google for some edutaining links, but sadly, google has become ****.

As for brute-force. The "Boyer-Moore beating assembler versions that utilize microcoded string search instructions" is a decades-old lesson

For string search, decades ago (before cpus acquired high-speed, on-chip caches) Boyer Moore was the business, but since the 486 circa. 1990, on-chip caches have tipped the balance again.

If you won't take my words for it; and aren't equipped to try it for yourself, take Bjarne Stroustrup's word for it. View http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style and skip ahead to timecode 44:40.

A few seconds in he presents a question (on a slide titled "Vector vs. Linked List"), that goes like this:

The task comes in two parts:

1. Generate N random integers and insert them into a sequence so that each is inserted in its proper position in numerical order.
2. Remove the elements one at a time by picking a random position in the sequence and removing the element there.

And the question is: For which N does it become faster to use a linked list than a flat array?

The decades old wisdom on this is that insertions/deletions into/from an array are O(N2), but for a linked list they are O(N log N) (with slightly higher constant factors), so N only need grow to a couple of hundred or a 1000 or so for the linked-list to start outperforming the flat array. Right? Wrong!

Watch for a few more seconds (and wonder in amazement at his non-existent graph) as he shows that the answer is "Never". At no point does the linked list become faster than the flat array. In fact the bigger N gets, the bigger the gap between the two becomes, in favour of the flat array.

As counter-intuitive as it may be, the benefits of the caching on the sequential accesses made by the flat array, far outweigh the costs of moving more data. The non-sequential, scattered around memory nature of linked-lists means they never catch up, let alone surpass the performance of the cache-friendly contiguous memory accesses of the flat array.

Times change; and a lot of the "knowledge" CS students acquired, despite being tortured within an inch of their lives by Knuth; is simply out-of-date. Whilst the proofs of correctness still hold; the analysies of operational complexity -- done in terms of the imaginative, but imaginary MIX (later MMIX) -- simply do not consider the effects of caching, pipelining, branch prediction, memory stalls et al.

Stroustrup concludes that there are "3 orders of magnitude performance" to be gained.

And, the relevance of string search analysis to bit-string search, has to my knowledge, never been demonstrated. They are quite different animals.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked