in reply to [OT] The interesting problem of comparing bit-strings.

“Bit vector” problems such as this one actually have many different analogs.   (Graphics processing, in particular, comes to mind.)   These problems become expensive and difficult to solve if the strings are short (less than the native word-size of the machine), and/or if there is positively no “crib” that can be meaningfully applied to cut the search-space.

For instance, if you are routinely looking for bit-strings that are (say) only 20-30 bits long, then there really isn’t any alternative but to pedantically growl through the buffer.   If you are looking for a 20-bit-long quantity in a world of 64-bit words, that quantity could be in any one of 64 positions, and within that space of 128 total bits, 108 of them are entropy.   Strategies such as the one which I previously suggested would be of no real value, because the shorter the “window” is, the less likely it is that “a hit” implies the existence of “a solution.”

Even if you try to press strategies such as Moore into service, you find that they are not nearly so effective in this situation, as they are in the text-search applications for which they were intended.   Why?   Because uncompressed text actually has only a very small number of possible bit-patterns within each byte.   The entropy of the data is very small.   If you consider text “one character at a time,” you get a familiar letter-frequency distribution:   ETAOIN....   In that context, Moore’s algorithm might well save some time.   Whereas, if the data is a bit-string, the distribution of byte-wise values is much more likely to be flat, simply because the data is not aligned to any sort of incremental boundary.

Hence, the necessity to “divide and conquer.”   To find some rune by which you can quickly eliminate all places where the data could not be.

  • Comment on Re: [OT] The interesting problem of comparing bit-strings.

Replies are listed 'Best First'.
Re^2: [OT] The interesting problem of comparing bit-strings.
by BrowserUk (Pope) on Mar 25, 2015 at 20:32 UTC
    “Bit vector” problems such as this one actually have many different analogs. (Graphics processing, in particular, comes to mind.)

    Arrays of rgb or rgba (or CMK or HSA) are not bit vectors.

    For instance, if you are routinely looking for bit-strings that are (say) only 20-30 bits long, then there really isn’t any alternative but to pedantically growl through the buffer.

    There isn't any alternative. Full stop. Regardless of the length of the needles.

    But when a sensible algorithm can locate 1 million bits close to the end of a 1/4 billion bits in 0.25 seconds, who cares if you have to "pedantically growl".

    Strategies such as the one which I previously suggested would be of no real value,

    It is of no value. Full stop.

    Even if you try to press strategies such as Moore into service, you find that they are not nearly so effective in this situation

    if by "Moore" you mean Boyer Moore; you're wrong again. It requires some modifications, but it can be very useful. Just not with microcoded scan loops.

    Hence, the necessity to “divide and conquer.” To find some rune by which you can quickly eliminate all places where the data could not be.

    There's no necessity. And anything that takes longer than 1/4 of a second is pointless.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      But when a sensible algorithm can locate 1 million bits close to the end of a 1/4 billion bits in 0.25 seconds, who cares if you have to "pedantically growl" ..... And anything that takes longer than 1/4 of a second is pointless.

      Out of curiosity, I tested the gmp library's capability with the above scenario - under the assumption that "1/4 billion" is "250 million".
      The haystack was pseudo-randomly generated, and the needle starts 19 bits from the left hand end of the haystack.
      I used a combination of Math::GMPz and Inline::C.

      It took 3 seconds to find the needle in the haystack (starting from the right hand end of the haystack) on Windows 7 64 bit.

      In some parts of the world "1/4 billion" is "25 million" and the program then, of course, finds the needle 10 times quicker.

      I'm not at all impressed with the code that I wrote in order to check this. But I was impressed that a general usage library could provide such a speedy implementation of the approach.
      Unfortunately, it's not quick enough to be pointful ;-)
      use strict; use warnings; use Math::GMPz qw(:mpz); use Time::HiRes; use Inline C => Config => USING => 'ParseRegExp', BUILD_NOISY => 1, INC => '-IC:/MinGW/msys/1.0/local/include', LIBS => '-LC:/MinGW/msys/1.0/local/lib -lgmp', TYPEMAPS => './typemap', # ships with Math::GMPz ; use Inline C => <<'EOC'; #include <gmp.h> int start(mpz_t * haystack, int haystack_size, mpz_t * needle, int needle_size) { int h_bit, n_bit, nless1 = needle_size -1; for(h_bit = nless1; h_bit < haystack_size; h_bit++) { for(n_bit = 0; n_bit < needle_size; n_bit++) { if(mpz_tstbit(*haystack, h_bit - needle_size + n_bit) != mpz_tstbit(needle, n_bit)) break; if(n_bit == nless1) return h_bit; } } return 0; } EOC my ($t1, $t2, $match); my ($haystack_size, $needle_size) = (250e6, 1e6); # generate the haystack - takes a while my $h_string = rstring($haystack_size); # generate a needle that's guaranteed to match , starting # at 19 bits from the lh end - ie at bit 249999981 my $n_string = substr($h_string, 19, $needle_size); # create the haystack gmp vector. (We need an additional # leading "1" bit to ensure that leading zero bits aren't ignored.) my $haystack = Math::GMPz->new( "1" . $h_string , 2); # create the needle gmp vector. (We need an additional # leading "1" bit to ensure that leading zero bits aren't ignored.) my $needle = Math::GMPz->new("1" . $n_string, 2); # Check that we've got the right number of bits. print Rmpz_sizeinbase($haystack, 2), " ", Rmpz_sizeinbase($needle, 2), "\n"; $t1 = Time::HiRes::time(); $match = start($haystack, $haystack_size, $needle, $needle_size); $t2 = Time::HiRes::time(); print $match, " ", $t2 - $t1, "\n"; sub rstring { die "haystack size must be a multiple of 10" if $_[0] % 10; my $ret; my $its = $_[0] / 10; $ret .= sprintf("%010b",int rand 1024) for 1 .. $its; if(length($ret) != $_[0] || $ret =~ /[^01]/) { die "Error in sub rstring"; } return $ret; } # Outputs: # 250000001 1000001 # 249999981 3.16680598258972
      Cheers,
      Rob
        under the assumption that "1/4 billion" is "250 million".

        You picked out a quote from a reply to a "non-combatant"; someone for whom the differences between 'GB', 'Gb' and 'GiB' would be lost.

        For the sake of precision, "1/4 billion bits", in this case is, 1024**3/4 == 268435456 bits == 33554432 bytes == 32 MiB.

        BTW: searching from the end is an intriguing possibility...but gains play losses if the needle is to be found at the beginning (left-most; least-most; lower) position in the haystack.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked