You'll likely want a search that is strongly sub-linear, i.e. better than O(n). There are string-search algorithms that approach O(log(m)*n/m) on average.
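As a concrete example of that family of algorithms, here is a minimal sketch of Boyer-Moore-Horspool (my own illustration, not code from this thread): it precomputes a 256-entry shift table from the needle so that, on average, it can skip up to m bytes of the haystack per probe.

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: average-case sub-linear byte-string search.
   Returns a pointer to the first match, or NULL. */
static const char *horspool(const char *hay, size_t n,
                            const char *needle, size_t m)
{
    size_t shift[256];                    /* the lookup table item 1 below talks about */

    if (m == 0) return hay;
    if (m > n)  return NULL;

    for (size_t i = 0; i < 256; i++)      /* default: skip a whole needle-length */
        shift[i] = m;
    for (size_t i = 0; i + 1 < m; i++)    /* smaller skips for chars inside the needle */
        shift[(unsigned char)needle[i]] = m - 1 - i;

    /* Jump through the haystack, steered by the last byte of each window. */
    for (size_t pos = 0; pos + m <= n;
         pos += shift[(unsigned char)hay[pos + m - 1]]) {
        if (memcmp(hay + pos, needle, m) == 0)
            return hay + pos;
    }
    return NULL;
}
```

Note the two properties the rest of this post is about: the extra table, and the data-dependent jumps through the haystack.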

On modern hardware, the single most expensive avoidable* 'operation' is a cache miss; and instruction-cache misses are more expensive than data-cache misses, due to pipelining.

*context switches are more expensive, but are, at least in the long term, unavoidable.

The problem with (all?) the analyses of the various sub-linear algorithms is that they ignore (or were done before) the point at which the costs and benefits of 2 and 3 levels of on-chip instruction and data cache became a significant factor in implementing this type of algorithm.

My 7 y/o CPU, with only 2-level caching, can execute upwards of 1000 instructions in the time it takes to service a data-cache miss. The effect of an instruction-cache miss is even greater.

Sub-linear algorithms exacerbate both kinds of cache miss, in 3 different ways:

  1. They require extra memory in the form of lookup/lookahead tables.

    That space, and the frequent accesses to it, directly reduce the data cache's ability to keep the relevant chunks of the haystack and needle in cache.

  2. The algorithms call for look-aheads, look-behinds and/or probes; these upset the caching algorithm's ability to keep the caches primed with the required data.

    This is especially important (64 times as important) for a bit-string search, where each 64-bit word of the haystack must be compared 64 times, once as part of each of the 64 differently-aligned groups of 64 bits that overlap it.

    This is the single most important factor that has come out of my testing of real code, on a real machine. Modern hardware is optimised to scream through memory in a linear fashion, and every single time you do anything that upsets that relentless, mechanical (electronic) forward (or reverse) flow, you pay a huge hit.

  3. They also require many more instructions and branch points.

    The number of instructions is significant because, if instructions early in the loop get flushed by instructions later in the loop, then you not only get a stall on the instruction cache when the loop iterates; you also cause a pipeline flush and invalidate the branch predictions. Taken together, a very expensive situation.

    The number of branch points is significant because they compound exponentially in their effects on failed branch predictions, with the knock-on effect of instruction-cache stalls if you breach the cache-size cliff.
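To make point 2 concrete for the bit-string case: because the needle can start at any of 64 bit offsets, extracting even one aligned 64-bit window requires combining two adjacent haystack words. A hypothetical helper (my own illustration, using an MSB-first bit-numbering convention; not code from this thread) might look like this:

```c
#include <stdint.h>
#include <stddef.h>

/* Extract the 64 bits starting at bit_off (MSB-first: bit 0 is the top
   bit of hay[0]).  Every haystack word ends up participating in 64
   distinct windows like this one.  Caller must ensure hay[word + 1] is
   readable whenever bit_off is not word-aligned. */
static uint64_t window_at(const uint64_t *hay, size_t bit_off)
{
    size_t   word = bit_off >> 6;      /* which 64-bit word              */
    unsigned sh   = bit_off & 63;      /* bit offset within that word    */
    uint64_t w    = hay[word] << sh;   /* high part, from the first word */
    if (sh)                            /* avoid UB of shifting by 64     */
        w |= hay[word + 1] >> (64 - sh);
    return w;
}
```

Every call that crosses a word boundary touches two words; repeat that for all 64 alignments and the access pattern is anything but a simple linear scan.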

The beauty of well-written brute-force algorithms is that they derive the absolute maximum benefit from the cache effect, by doing as much work as possible on each cache line before allowing it to be flushed, thus minimising the need for reloads:

  1. The inner loops are very small with only 1 or 2 branch points required.

    That ensures no instruction cache misses and fullest possible benefits from both branch predictions and pipelining.

  2. Because the data is processed strictly serially, and all the work required on each piece is done with just one from-memory load, they work with both the hardware branch prediction and the caching algorithms, to best effect.
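A brute-force bit-string search in that style might be sketched as follows (my own illustration of the strictly-serial approach, not my actual implementation; it handles needles of up to 64 bits, left-aligned in the needle argument, MSB-first). Each pair of adjacent haystack words is combined once and all 64 alignments within it are tested before moving forward, so every word is loaded from memory at most twice:

```c
#include <stdint.h>
#include <stddef.h>

/* Find a needle of nbits <= 64 bits (occupying the TOP nbits of `needle`)
   in a haystack of nwords 64-bit words, scanning strictly forward.
   Returns the bit offset of the first match, or -1. */
static long long find_bits(const uint64_t *hay, size_t nwords,
                           uint64_t needle, unsigned nbits)
{
    if (nbits == 0 || nbits > 64 || nwords == 0) return -1;

    uint64_t mask = (nbits == 64) ? ~0ull : ~0ull << (64 - nbits);
    uint64_t pat  = needle & mask;

    for (size_t w = 0; w < nwords; w++) {
        uint64_t cur = hay[w];                          /* one load...      */
        uint64_t nxt = (w + 1 < nwords) ? hay[w + 1] : 0;
        for (unsigned sh = 0; sh < 64; sh++) {          /* ...64 alignments */
            uint64_t win = cur;
            if (sh) win = (cur << sh) | (nxt >> (64 - sh));
            if (w * 64 + sh + nbits > nwords * 64)      /* stay inside the  */
                break;                                  /* bit string       */
            if (((win ^ pat) & mask) == 0)
                return (long long)(w * 64 + sh);
        }
    }
    return -1;
}
```

The inner loop is tiny, has two branch points, and never looks anywhere but the current word-pair, which is exactly the shape the hardware rewards.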

In the following table of benchmark timings from my implementation:

Summary. The algorithm is very linear in terms of the distance into the buffer; and sub-linear in terms of the length of the needle. And it is very fast.

Can it be beaten? Maybe, but I very much doubt it will be done using Boyer-Moore or similar jump-around-the-data algorithms. Time will tell :)


In reply to Re^2: [OT] The interesting problem of comparing (long) bit-strings. by BrowserUk
in thread [OT] The interesting problem of comparing bit-strings. by BrowserUk
