But when a sensible algorithm can locate 1 million bits close to the end of a 1/4 billion bits in 0.25 seconds, who cares if you have to "pedantically growl" ..... And anything that takes longer than 1/4 of a second is pointless.

Out of curiosity, I tested the gmp library's capability with the above scenario - under the assumption that "1/4 billion" is "250 million".
The haystack was pseudo-randomly generated, and the needle starts 19 bits from the left hand end of the haystack.
I used a combination of Math::GMPz and Inline::C.

It took 3 seconds to find the needle in the haystack (starting from the right hand end of the haystack) on Windows 7 64 bit.

In some parts of the world "1/4 billion" is "25 million" and the program then, of course, finds the needle 10 times quicker.

I'm not at all impressed with the code that I wrote in order to check this. But I was impressed that a general usage library could provide such a speedy implementation of the approach.
Unfortunately, it's not quick enough to be pointful ;-)
```use strict;
use warnings;
use Math::GMPz qw(:mpz);
use Time::HiRes;

use Inline C => Config =>
USING => 'ParseRegExp',
BUILD_NOISY => 1,
INC =>  '-IC:/MinGW/msys/1.0/local/include',
LIBS => '-LC:/MinGW/msys/1.0/local/lib -lgmp',
TYPEMAPS => './typemap', # ships with Math::GMPz
;

use Inline C => <<'EOC';

#include <gmp.h>

int start(mpz_t * haystack, int haystack_size,
mpz_t * needle, int needle_size) {

int h_bit, n_bit, nless1 = needle_size -1;

for(h_bit = nless1; h_bit < haystack_size; h_bit++) {
for(n_bit = 0; n_bit < needle_size; n_bit++) {
if(mpz_tstbit(*haystack, h_bit - needle_size + n_bit) !=
mpz_tstbit(needle, n_bit)) break;
if(n_bit == nless1) return h_bit;
}
}
return 0;

}

EOC

my (\$t1, \$t2, \$match);
my (\$haystack_size, \$needle_size) = (250e6, 1e6);

# generate the haystack - takes a while
my \$h_string = rstring(\$haystack_size);

# generate a needle that's guaranteed to match , starting
# at 19 bits from the lh end - ie at bit 249999981
my \$n_string = substr(\$h_string, 19, \$needle_size);

# create the haystack gmp vector. (We need an additional
# leading "1" bit to ensure that leading zero bits aren't ignored.)
my \$haystack = Math::GMPz->new( "1" . \$h_string , 2);

# create the needle gmp vector. (We need an additional
# leading "1" bit to ensure that leading zero bits aren't ignored.)
my \$needle = Math::GMPz->new("1" . \$n_string, 2);

# Check that we've got the right number of bits.
print Rmpz_sizeinbase(\$haystack, 2), " ",
Rmpz_sizeinbase(\$needle, 2), "\n";

\$t1 = Time::HiRes::time();
\$match = start(\$haystack, \$haystack_size, \$needle, \$needle_size);
\$t2 = Time::HiRes::time();
print \$match, " ", \$t2 - \$t1, "\n";

sub rstring {
die "haystack size must be a multiple of 10"
if \$_[0] % 10;
my \$ret;
my \$its = \$_[0] / 10;
\$ret .= sprintf("%010b",int rand 1024) for 1 .. \$its;
if(length(\$ret) != \$_[0] || \$ret =~ /[^01]/) {
die "Error in sub rstring";
}
return \$ret;
}

# Outputs:
# 250000001 1000001
# 249999981 3.16680598258972
Cheers,
Rob

Replies are listed 'Best First'.
Re^4: [OT] The interesting problem of comparing bit-strings.
by BrowserUk (Pope) on Apr 04, 2015 at 01:46 UTC
under the assumption that "1/4 billion" is "250 million".

You picked out a quote from a reply to a "non-combatant"; someone for whom the differences between 'GB', 'Gb' and 'GiB' would be lost.

For the sake of precision, "1/4 billion bits", in this case is, 1024**3/4 == 268435456 bits == 33554432 bytes == 32 MiB.

BTW: searching from the end is an intriguing possibility...but gains play losses if the needle is to be found at the beginning (left-most; least-most; lower) position in the haystack.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked