in reply to Re^2: [OT] The interesting problem of comparing bit-strings.
in thread [OT] The interesting problem of comparing bit-strings.

Ahhh....   the bitstrings are very large?   How about the typical length of the patterns for which you search within such strings?

If the length of the string were longer than 3 * 64 = 192 bits long, and especially if it were longer, then you would have a definite “crib,” in the form of a 64-bit quantity ... there are exactly 64 possibilities ... that must occur in the pattern at a quadword boundary.   This would be a “huge win™,” well worth the extra effort of creating a special-case for, in anticipation that it would actually pay-off most of the time.

I-F the string being sought was at least 192 bits long, then you are able to calculate, with certainty, the 64 possible values that exist for “the 64-bit ‘word in the middle.’”   You can now search for each of these in turn, using an extremely efficient REPNE SCASD (i86 ...) assembler-language construct, and then explore the surrounding doublewords in each case to determine if the bit-string occurs.   (If the run being sought is “even longer,” then a REPNE CMPSD loop can be applied to check all of the “subsequent doublewords,” the values of all of which once again can be determined with certainty.   Only the partial-doublewords at the beginning and at the end, if any, remain to be verified.)

Easily the most efficient implementation would be to loop through all 64 of the pre-computed possibilities, using the assembler instructions aforementioned to locate all occurrences within the buffer.   (Of course, and as you of all Monks well know, you will need to use a “sliding window” scan through the compass of the entire file, to avoid missing any bitstrings that serependitiously fell across a boundary, but, I digress.)

Each and every time you stumble-upon any of these 64 values, the odds that you have, in fact, “found another answer” are rather-astronomically high.   In fact, in such a case you might even wish to “take a calculated risk” that in fact you need look no further.   If you actually discover a matching 64-bit quantity within the buffer, then the odds are “64 in 2^64” that the entire value-being-sought is present.   Certainly, if the REPNE CMPSD succeeds with regard to a location found by REPNE SCASD, then the odds of being wrong become absurdly (and dismissably) small:   “It is a match.   It has to be.”   At this point, the actual odds of committing a “type-2 error” are . . . zero.

No, this is not a solution to your entire problem:   if the bit-string being sought turns out to be shorter, it will not help at all.   But, for longer strings, it will cause the problem to “virtually disappear.”

  • Comment on Re^3: [OT] The interesting problem of comparing bit-strings.

Replies are listed 'Best First'.
Re^4: [OT] The interesting problem of comparing bit-strings.
by BrowserUk (Pope) on Mar 25, 2015 at 00:52 UTC
    REPNE SCASD

    You cannot search for bits using byte(nor word, nor double-word)-wise instructions unless your happy to miss 75%(87%,94%) of possible hits.

    The rest is regurgitated garbage. Probably cribbed from the 'net and then sundial'd into an unreadable mess. Please stop!


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      You cannot search for bits using byte-wise instructions

      But the thinkg is that, actually, you can! Only that you will have to run eight searches instead of one.

        But the think thing is that, actually, you can! Only that you will have to run eight searches instead of one.

        Hm. Pedantically, that still searching for bytes, but no matter, it still doesn't solve the problem. What if what you are looking for spans a byte boundary?

        Eg. The needle below appears twice in the haystack, but no combination of 8 (or any number) of repne scasb instructions will ever find either:

        needle: 00100100 haystack: + * * * * 1 2 3 4 5 6 +7 8 9 0 1 2 0123456789012345678901234567890123456789012345678901234567890123456789 +0123456789012345678901234567890123456789012345678901234567 1111000101001010110110101100101011101001010100010111100010011001010100 +0111111111001000000011100100100111011100100100111000110111

        Because, repne scasb scans the string looking for a fixed byte value, but the needle doesn't appear as a byte anywhere in the string.

        Both occurrences span byte boundaries, thus to find them, you need to do two compares, using two different masks on two adjacent bytes, against two different byte values, neither of which are or include the needle.

        Nor can you pad the needle (on both ends) to 16-bits and use repne scabw to find it, because again, a masking operation is required to extract the relevant 7 bits from the surrounding 16.

        The same is true (and then some) for my use of 64-bit comparisons, but I didn't suggest I was going to microcode loop instructions.

        So, I'll say it again, with a minor addition, that (I hope) will satisfy you; and will go right over his head:

        You cannot search for bits using byte-wise instructions alone

        What he pontificated about was utterly pointless; and wrong.

        And that before we get into:

        1. I'm coding in C not assembler.

          And my compiler doesn't support inline assembler.

        2. the whole guff about: "If you actually discover a matching 64-bit quantity within the buffer, then the odds are 64 in 2^64 that the entire value-being-sought is present".

          We already know how completely screwed his sens eof probabilities are: The odds of an actual collision for a 7-digit random number are so astronomically small that I quite frankly would not bother to check for it.

          I mentioned that my haystacks are frequently >1/4GB in size"; that's 33 million quads; the chances that two the same appear somewhere are (by the Birthday Paradox) not so slim in totally random data; but in structured, correlated, meaningful data, quite likely to be much much higher.

          And that's before you consider that the patterns I'll be searching for are not quad-aligned, thus every bit in that 1/4GB (bar the first and last 63 bit) is actually a part of 63 (potentially) entirely different quads. Ie. that 33 million just became 2.1 billion.

        Beyond that, using bytes is a bad idea for this in any number of ways:

        • Byte aligned data is the most expensive size to load/store for a 64 bit processor.

          Effectively 64-bits is moved across the data bus, but only 8-bits are used. Even reading from cache, every byte except the low byte of each quad has to be manipulated into the low 8 bits of a register before it can be used.

        • <l>Byte operators effectively waste 7/8ths of each register used to hold the byte.

          Most of them use AL or CL leaving the rest of the register useless for the duration.

        If it is of interest, this is my currently incomplete code (a few edge cases to handle) locating a 1Mi bit needle near the end of a 1GiB buffer of random bits:

        C:\test\C>bvSearch 1745337664 1048576 Found it at 67108800 Took 0.255063793859 secs

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
    A reply falls below the community's threshold of quality. You may see it by logging in.