You'll likely want a search that is strongly sub-linear, i.e. better than O(n). There are string-search algorithms that approach O(log(m) * n/m) on average.
On modern hardware, the single most expensive avoidable* 'operation' is a cache miss, with instruction-cache misses more expensive than data-cache misses due to pipelining.
*context switches are more expensive, but are, at least in the long term, unavoidable.
The problem with (all?) the analyses of the various sub-linear algorithms is that they ignore (or predate) the point at which the costs and benefits of 2 and 3 levels of on-chip instruction and data cache became a significant factor in implementing this type of algorithm.
My 7 y/o CPU, with only 2-level caching, can execute upwards of 1000 instructions in the time it takes to service a data-cache miss. The effect of an instruction-cache miss is even greater.
Sub-linear algorithms exacerbate both problems, in 3 different ways:
The extra space these algorithms require for their skip tables, and the frequent access to it, directly hit the data cache's ability to keep the relevant chunks of the haystack and needle in cache.
This is especially important (64 times as important) for a bit-string search, where each bit needs to be compared 64 times, once in each of the 64 different 64-bit groups it belongs to.
This is the single most important factor to come out of my testing of real code on a real machine. Modern hardware is optimised to scream through memory in a linear fashion, and every single time you do anything that upsets that relentless, mechanical (electronic) forward (or reverse) flow, you pay a huge penalty.
The number of instructions is significant because, if instructions early in the loop get flushed by instructions later in the loop, you not only get an instruction-cache stall when the loop iterates; you also cause a pipeline flush and invalidate branch predictions. Taken together, a very expensive situation.
The number of branch points is significant because their failed branch predictions compound, with the knock-on effect of instruction-cache stalls if you breach the cache-size cliff.
The beauty of a well-written brute-force algorithm is that it derives the absolute maximum benefit from the cache, by doing as much work as possible on each cache line before allowing it to be flushed, thus minimising the need for reloads:
That ensures no instruction-cache misses and the fullest possible benefit from both branch prediction and pipelining.
In the following table of benchmark timings from my implementation:
Summary. The algorithm is very linear in terms of the distance into the buffer, sub-linear in terms of the length of the needle, and very fast.
Can it be beaten? Maybe, but I very much doubt it will be done using Boyer-Moore or a similar jump-around-the-data algorithm. Time will tell :)