in reply to [OT] The interesting problem of comparing bit-strings.

Low level C programming is not my strength, so I stay with bit strings in Perl. Based on a block size of 8 bits, you would have 8 alignments of the needle. Each of this 8 alignments has a middle part of complete 8 bit blocks and a left and right partial block. I would use something like index to search for the middle block in the haystack bit string and then check whether or not the left and right partial blocks match or not.

Apologies if anyone's proposal above is materially the same but the fact eluded me.

Replies are listed 'Best First'.
Re^2: [OT] The interesting problem of comparing bit-strings.
by BrowserUk (Pope) on Mar 24, 2015 at 20:58 UTC

    The idea is to provide the functionality to perl through a I::C module.

    I've done similar arrangements -- using bytes and shifts for the unaligned parts and 32 or 64-bit compares for the aligned parts -- but for large bitstrings -- mine are frequently > 1/4gb in size -- it is very slow.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      Ahhh....   the bitstrings are very large?   How about the typical length of the patterns for which you search within such strings?

      If the length of the string were longer than 3 * 64 = 192 bits long, and especially if it were longer, then you would have a definite “crib,” in the form of a 64-bit quantity ... there are exactly 64 possibilities ... that must occur in the pattern at a quadword boundary.   This would be a “huge win™,” well worth the extra effort of creating a special-case for, in anticipation that it would actually pay-off most of the time.

      I-F the string being sought was at least 192 bits long, then you are able to calculate, with certainty, the 64 possible values that exist for “the 64-bit ‘word in the middle.’”   You can now search for each of these in turn, using an extremely efficient REPNE SCASD (i86 ...) assembler-language construct, and then explore the surrounding doublewords in each case to determine if the bit-string occurs.   (If the run being sought is “even longer,” then a REPNE CMPSD loop can be applied to check all of the “subsequent doublewords,” the values of all of which once again can be determined with certainty.   Only the partial-doublewords at the beginning and at the end, if any, remain to be verified.)

      Easily the most efficient implementation would be to loop through all 64 of the pre-computed possibilities, using the assembler instructions aforementioned to locate all occurrences within the buffer.   (Of course, and as you of all Monks well know, you will need to use a “sliding window” scan through the compass of the entire file, to avoid missing any bitstrings that serependitiously fell across a boundary, but, I digress.)

      Each and every time you stumble-upon any of these 64 values, the odds that you have, in fact, “found another answer” are rather-astronomically high.   In fact, in such a case you might even wish to “take a calculated risk” that in fact you need look no further.   If you actually discover a matching 64-bit quantity within the buffer, then the odds are “64 in 2^64” that the entire value-being-sought is present.   Certainly, if the REPNE CMPSD succeeds with regard to a location found by REPNE SCASD, then the odds of being wrong become absurdly (and dismissably) small:   “It is a match.   It has to be.”   At this point, the actual odds of committing a “type-2 error” are . . . zero.

      No, this is not a solution to your entire problem:   if the bit-string being sought turns out to be shorter, it will not help at all.   But, for longer strings, it will cause the problem to “virtually disappear.”

        REPNE SCASD

        You cannot search for bits using byte(nor word, nor double-word)-wise instructions unless your happy to miss 75%(87%,94%) of possible hits.

        The rest is regurgitated garbage. Probably cribbed from the 'net and then sundial'd into an unreadable mess. Please stop!


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
        A reply falls below the community's threshold of quality. You may see it by logging in.