Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)

by BrowserUk (Patriarch)
on Apr 05, 2015 at 11:16 UTC ( [id://1122468]=perlmeditation: print w/replies, xml ) Need Help??

The basic premise of (the titled) fast string matching algorithms, is that you preprocess the needle and construct a table of shifts or skips that allow you to skip over chunks of the haystack and thus speed things up.

The following, I hope clear, explanation (they are few and far between) describes a variant called Quick Search.

You start with a haystack and needle:

0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 000010100101010000011010

You inspect the needle and build the skip table:

00001010 01010100 00011010 00001010 shift 24 01010100 shift 16 00011010 shift 8 xxxxxxxx shift 32 (all other patterns not found in the needle allow ma +ximum shift)

And apply it. Try the pattern at the start of the haystack; doesn't match, so look up the next 8 bits of the haystack in the table, and find the shift = 32:

000011101001110010111011 01110001 111010111100111110111111100000001000 +100100001010010101000001101000110010011000101101010110011011000000 000010100101010000011010 ???????? shift = 32

So, apply the shift and try the needle at the new position. No match, lookup the next 8 bits of the haystack to find a shift of 32:

0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 000010100101010000011010???????? shift + = 32

Apply the shift, try the needle at the new position. No match, look up the next 8 bits, get the shift of 8:

0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 000010 +100101010000011010???????? shift = 8

Apply the shift, try the match, And success. Needle found.

0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + 000010100101010000011010 Found at 72

Now lets try that with the same haystack but another needle:

10000101 00101010 00001101 10000101 shift 24 00101010 shift 16 00001101 shift 8 xxxxxxxx shift 32 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 100001010010101000001101???????? shift 32 100001010010101000001101???????? shift + 32 100001 +010010101000001101???????? shift 32 + 100001010010101000001101???????? shift 32 + 10000101001 +0101000001101 >>> Not found.

Great! Four compares & four skips to discover the needle isn't in the haystack.

Except that it is!

0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + 100001010010101000001101

And that's why Boyer-Moore, Horspool, Alpha-Skip, Quick Search et. al. don't work for bit-strings.

It doesn't matter what size you make the chunks of bits -- above I used 8 -- the skip table will only ever consider aligned groups of bits, but bit-strings inherently consist entirely of unaligned bits!

(Don't believe me; just try it for yourself.)

And the point of this meditation?

To prove my instincts were right?

Maybe, but mostly because I wanted to be wrong. I wanted there to be a better than brute force mechanism. And I'm still hoping that someone will point me at one.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
  • Comment on Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
  • Select or Download Code

Replies are listed 'Best First'.
Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by Eily (Monsignor) on Apr 05, 2015 at 15:01 UTC

    I had trouble understanding your shift values, but that's because you're looking at the next byte instead of trying to gain information from the current position of the needle. For me 00011010 would mean: 0 bits shift, check on the left of the string for the first mismatch.

    Anyway, your mismatch table is wrong: if you have 00001101 that does not mean a 32 bits shift, that's a 9 bits shift, because this sequence of bit is present in your needle. And 00000000 is a 28 bits shift, because its rightmost 4 bits match the beginning of the needle.

    Edit: I implemented a subset of Boyer-Moore for a string of bits below. Except since I was lazy, I used string of '0' and '1', so I realized that's nearly just the "normal" Boyer-Moore algorithm that works on characters. So the implementation has no further interest than the description of the algorithm on wikipedia. The point though is that instead of having a table of shifts required for a given byte, you can have a table of shifts required for the position of the rightmost mismatch. This is a little less efficient for small needles, but far easier to compute, and still better than brute force.

    For longer needles, you have either a table of 2**N (with N the length of the suffix you want to check) possible values and the shift required, or a table that can easily be of the same length as the needle, that holds a shift value for each possible mismatch position (using the rightmost one).

      The point though is that instead of having a table of shifts required for a given byte, you can have a table of shifts required for the position of the rightmost mismatch.

      The right mismatching what? Single bit? Group of 8 bits? Or 13 bits? 1000001 bits?

      My point is that a table of single bits 0 or 1 doesn't help; but any larger than that, and you're into the "groups of aligned bits" problem I described above.

      Whatever alignment you choose, byte/word/other; and whatever group of bits from the mismatched position in the haystack you use as your index, shift the needle 1-bit either way from that alignment and that group is no longer relevant.

      I don't know how to better describe it; but until you've tried to implement what you are describing -- using bits; not bytes pretending to be bits -- you will not appreciate the problem.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
        How about making the bits pretend to be bytes? =)
Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by RichardK (Parson) on Apr 05, 2015 at 13:12 UTC

    My gut feeling on this one is that brute force is the only way, _but_ processors are really good at bit-ops so it should be possible to do well.

    However I think that to do it efficiently you'll need to get closer to the processor then the high level languages will allow, so you will have to write it in assembler. On x86 there are the MMX instructions that let you operate on 64 bits at a time, which might help.

    Another thought: GPUs must be really good at shifting bits strings and doing bit-ops, so maybe some smart code running on your graphics card will do the job :). I'm thinking about opengl / cuda etc. But as I have other things to do today, I trying desperately NOT to fall down that rabbit hole right now! When I get time I'll have a read around the subject, it sounds really interesting.

      On x86 there are the MMX instructions that let you operate on 64 bits at a time, which might help.

      I already posted a very efficient brute-force algorithm that uses 128-bit shifts to good effect.

      The main purpose of this meditation was to try and discover if there are any smart bit-string search algorithms that I haven't discovered; as well as demonstrate that the fast string search algorithm don't work.

      GPUs must be really good at shifting bits strings and doing bit-ops

      That's the kind of thing that is a possibility. Thanks.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by Anonymous Monk on Apr 05, 2015 at 17:34 UTC

    The likely cause of misapprehension here is the persistent use of "character" in any description of a string search algorithm one is likely to come across. Which is easy to confuse with a character type, or byte, as programmers might call it. A symbol might be more appropriate, because those algorithms operate on abstract alphabets.

    If one were to implement unmodified B-M on bitstrings, the "characters" would be individual bits. The bad character shift table would have two slots, for '0' and for '1'. Practical value of such arrangement is nil. But the algorithms can be augmented to handle multiple "characters" at a time, and this is when they become lucrative.

      The bad character shift table would have two slots, for '0' and for '1'. Practical value of such arrangement is nil.

      Exactly! (Glad that someone, even an anonymous someone :) also sees the problem.

      But the algorithms can be augmented to handle multiple "characters" at a time, and this is when they become lucrative.

      Clue sticks and references cordially invited?

      (Though my immediate reaction is that compound characters simply reduce to a different alphabet with more bits -- ie. (ab) becomes a single symbol that requires twice as many bits -- and the problem remains the same, or is compounded because the lookup table(s) need to be bigger.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        BrowserUk:

        Update 2: I couldn't leave it alone (damned OCD), so I think it works now. I definitely had a few code problems, and the new version looks pretty decent to me. (The variable names could be better, but I'm a bit too tired to mess with it.

        It looks like one of my previous editing passes removed the reason that I coded it in Python. At $work, I'm having to use Python and JavaScript to write some stuff. I chose the lesser of two evals.

        (Believed) correct code:

        #!/usr/bin/env python import pprint import random import sys # Set up your needle, haystack and alphabet size haystack = "0000111010011100101110110111000111101011110011111011111110 +000000100010010000101001010100000110100011001001100010110101011001101 +1000000" hay_len = len(haystack) ruler = "" cases = [ { 'NDL': "100001010010101000001101", 'syms':3, 'sz':8 }, { 'NDL': "100001010010101000001101", 'syms':4, 'sz':6 }, { 'NDL': "100001010010101000001101", 'syms':6, 'sz':4 }, { 'NDL': "100001010010101000001101", 'syms':8, 'sz':3 }, { 'NDL': "000010100101010000011010", 'syms':3, 'sz':8 }, { 'NDL': "000010100101010000011010", 'syms':4, 'sz':6 }, { 'NDL': "000010100101010000011010", 'syms':6, 'sz':4 }, { 'NDL': "000010100101010000011010", 'syms':8, 'sz':3 }, { 'NDL': "1000010100101010000011010", 'syms':5, 'sz':5 } ] case = 0 try: case = int(sys.argv[1]) except: pass needle = cases[case]['NDL'] num_syms = cases[case]['syms'] symbol_size = cases[case]['sz'] def main(): global ruler print "\n{0}\nHaystack: {1}\n Needle: {2}\n".format(syms(70, '-') +, haystack, needle) # Use brute force to find all matches positions = brute_force(needle, haystack) if len(positions)>0: ruler = "".join([ "V" if x in positions else " " for x in rang +e(0, len(haystack)) ]) print "Brute force:\n{0}\n{1}".format(ruler, haystack) print "{0}{1}\n".format(spaces(positions[0]), needle) else: print "Brute force: not found" if not find_needle(haystack, needle, num_syms, symbol_size): print "**** Case failed! ****" def find_needle(haystack, needle, num_syms, symbol_size): sym_marker = syms(symbol_size, '^') needle_syms = [ needle[i*symbol_size:(i+1)*symbol_size] for i in r +ange(0, num_syms)] max_shift = (num_syms-1)*symbol_size print "Modified BM search: {0} syms of {1} bits: {2}".format( num_syms, symbol_size, ", ".join(needle_syms)) # Compute delta table dd1 = { } for i in reversed(range(0, (1<<symbol_size))): sym = to_binstr(i, symbol_size) tmp = brute_force(sym, needle) if len(tmp)>0: bitofs = tmp[-1] shift = max_shift - bitofs ttt = ", ".join(map(str,tmp)) dd1[sym] = shift #bitofs # Uncomment to see the delta table #print pprint.pprint(dd1) # Search loop print "Searching for a needle in the haystack" needle_ptr = 0 while needle_ptr + len(needle) <= len(haystack): cmp_ptr = needle_ptr + max_shift found = haystack[cmp_ptr:cmp_ptr+symbol_size] print "\n{0}\n{1}".format(ruler, haystack) print "{0}{2} (pos={1})".format(spaces(cmp_ptr), cmp_ptr, sym_ +marker) print "{0}{1}".format(spaces(cmp_ptr), found) print "{0}{1}".format(spaces(needle_ptr), needle) if found in dd1: if dd1[found] == 0: fl_match = True for t in range(0, num_syms): if needle_syms[t] != haystack[needle_ptr + t*symbo +l_size:needle_ptr + (t+1)*symbol_size]: fl_match = False if fl_match: print "MATCH FOUND!" print "\n{0}\n{1}\n{2}{3}\n".format(haystack,ruler +,spaces(needle_ptr),needle) return True else: print "Byte matched, string mismatch, trying next +position" needle_ptr += 1 else: print "found {0}, delta=>{1}".format(found, dd1[found] +) needle_ptr += dd1[found] else: print "not found, big shift" needle_ptr += len(needle)-symbol_size return False def to_binstr(n,len): bv = [ "1" if n & (1<<x) else "0" for x in range(0, len)] return "".join(bv) def brute_force(needle, haystack): """Brute-force find all matching positions of needle in haystack"" +" positions = [] for i in range(0, len(haystack)-len(needle)+1): if needle == haystack[i:i+len(needle)]: positions.append(i) return positions def syms(n, sym): return "".join([ sym for x in range(0, n)]) def spaces(n): return syms(n, ' ') if __name__ == '__main__': main()

        I should probably recode it in perl at some point. Old code & such below in readmore tags, should you be interested.

        (BTW: pun was intended.)

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

        Given the needle:

        00001010 01010100 00011010
        
        The bad shift table for handling eight bits at a time:
        00001010                        needle offset 0
         0001010 0                      needle offset 1
          001010 01                     needle offset 2
           01010 010                    needle offset 3
            1010 0101                   needle offset 4
             010 01010                  needle offset 5
              10 010101                 needle offset 6
               0 0101010                needle offset 7
                 01010100               needle offset 8
                  1010100 0             needle offset 9
                   010100 00            needle offset 10
                    10100 000           needle offset 11
                     0100 0001          needle offset 12
                      100 00011         needle offset 13
                       00 000110        needle offset 14
                        0 0001101       needle offset 15
                          00011010      needle offset 16
        
        The shift follows from needle offset. If some 8-bit value repeats, the smallest shift is used.

Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by Anonymous Monk on Apr 05, 2015 at 22:37 UTC

    Although I don't have enough time at the moment to even play with a solution - I wish I did - I'll throw my initial thought out there anyway: This sounds like something that could be handled very well by an FPGA. Sure, it would require a specialized hardware solution with code written in VHDL or Verilog by someone who knows that field, but nowadays there are extremely powerful FPGAs out there that can access loads of external RAM at incredibly high speeds (e.g. the Virtex family from Xilinx). So admittedly, unless you know VHDL/Verilog and happen to have an eval board lying around, that may only be worth the investment if this is one of your large-scale "save the customer millions of pounds" type projects :-)

      So just to continue a back-of-the-envelope sketch here using just one example, to try and give an idea of the capabilities of FPGAs (and leaving the Perl world entirely now)... the Xilinx Virtex-7 FPGA VC707 Evaluation Kit (http://www.xilinx.com/products/boards-and-kits/ek-v7-vc707-g.html) features the XC7VX485T device, which contains, among other things, about 300,000 6-input LUTs (http://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf), meaning you could build logic to run comparisons of thousands of bits in parallel. One limiting factor would probably be that the more bits you compare at once, the logic cascades would probably limit the clock speed; the design tools would tell you that (and I suspect an experienced FPGA designer would know some tricks to improve the speed).

      Anyway, implementing access to external RAM is something you can get off-the-shelf (http://www.xilinx.com/products/technology/memory-interfacing.html); this board includes 1GB DDR3 RAM. It also includes a pretty fast PCI Express interface, again there should be reference designs available. That would hopefully give one fast enough access to the external data.

      Once one has access to the data, perhaps loaded into the onboard RAM, one would read the blocks of data into the FPGA's registers. From there, my initial thought would be two design possibilities: either one massive shift register through which the data is shifted and compared on each shift, or running the comparisons in parallel through multiples of the comparison logic. Which is best will depend on a timing comparison and of course on how long the bit strings are that you are seeking.

      As you can tell I think FPGAs are really cool and I wish I got to work with them more often... well, at least I get to work with Perl ;-)

Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by mr_mischief (Monsignor) on Apr 06, 2015 at 16:09 UTC

    To clarify, my understanding is that Boyer-Moore should work in that it would produce correct output, but the benefit over brute force diminishes toward disappearance as the alphabet gets smaller. Is that what you're seeing?

      the benefit over brute force diminishes toward disappearance as the alphabet gets smaller

      As Anonymous Monk so succinctly and accurately put it:

      If one were to implement unmodified B-M on bitstrings, the "characters" would be individual bits. The bad character shift table would have two slots, for '0' and for '1'. Practical value of such arrangement is nil.

      So yes. The two character alphabet of bit-strings renders Boyer-Moore (and other similar algorithms) ineffective.

      But further, the problem is that most of those advocating these algorithms have proposed (and implemented) solutions that treat aligned groups of 8-bits as the alphabet. Ie. Given the needle:

      010010110101011110101110

      Build the shift table in terms of the three groups of 8-bits:

      01001011 01010111 10101110

      Giving a table of 3 effective entries and jumps in steps of multiples of 8; which simply fails to work in 7 out of 8 cases.

      The solution anonymonk proposes is to artificially extend the alphabet from 2 to 256 by using unaligned groups of 8-bits as he describes in Re^3: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?); which appears to work.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      Yes, it is implied that the smaller the set of the alphabet (and for that matter the needle) the smaller the shift potential, the smaller the shift potential the less optimized from linear any of those algos is.
Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by Anonymous Monk on Apr 06, 2015 at 01:23 UTC
    I guess I would just use a trie with short circuit and step through bit by bit (ony actually searching i+n bits on the needle if the previous bits match and jumping forward 1 bit on a fail) or modified aho-corasick that is constructed with offset needles where you bit compare and shift in larger blocks but with more compares per block -- backtracking if the constructed needle offsets match. This second option may have much better performance as you are able to use the alignment of the bit stream in memory and the cpu's register to your advantage. Its a lot of compares but in a way that is completely optimized by the system.
Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by sundialsvc4 (Abbot) on Apr 05, 2015 at 20:10 UTC

    I think that all such algorithms are rather-fundamentally inapplicable to your particular situation because they, by definition, are working at the smallest possible unit of storage:   the byte.   Whereas, today’s microprocessor architectures are both extremely wide (64 bits and counting), and extremely pipelined.   Data is grabbed into the processor in “cache lines” which are several times larger.   The architecture of the chip bears only a superficial resemblance to the way that it is described to even the assembly-language programmer.

    I know that you don’t like to hear from me – that you emphatically don’t want to hear from me – that some folks think that I am talking to hear my head rattle and could not possibly have anything to contribute to your ongoing battle with “unaligned bit strings.”   But maybe, instead, you (folks) should be listening to what I have to say on this point.

    You are, as I understand it, always looking for very long bitstrings, in even longer buffers, and always in a machine that has enough RAM to contain a single copy without paging.   Therefore, I suggest that you should strictly go for an algorithm, as I have suggested in the past, which tries to gulp the data in the largest chunks that the native-CPU can directly support, and that while doing so you avoid conditional logic.   You want the pipeline to become full and to remain full, with no branch-conditions to predict and no pipelines to break.   Please .. just try this:

    • There are exactly 64 possibilities for 64-bit words [2..n-1].   In a loop that considers all 64 “shifts” one at a time, search for any one without using if.
    • If there is a probability of lots of consecutive 64-bit words or much-less frequent ones, search for those instead.   You can search for a word in any offset in the buffer (other than the first or the last).
    • Confirm your suspicion either by sampling a few words or simply by testing the entire group of 64-bit words.   The logic is an easy combination of rol and shift and bitwise-ands, none of which requires conditional logic, all of which can be done in registers.   This, too, is important.
    • Only when words [2..n-1] all match do you finally test the first and the last 64-bit words, at the rotation specified, masking off the “don’t care” bits on either side.
    • What superficially appears to be a “brutish” approach, offered by a vainglorious Monk who likes to hear himself talk, actually is designed to play into the hands of the CPU.   The microprocessor’s physical implementation will be optimized even further, taking full advantage of whole cache-lines and more.   The key is that, instead of trying to find an unshifted bitstring in a buffer in which that string can occur in any of 64 alignments, you are searching for one of 64 shifted bitstrings ... and doing it with pipelined, un-conditional, raw speed.

      But maybe, instead, you (folks) should be listening to what I have to say on this point.

      Why would I consider what you say this time; when it is the exact opposite of what you said last time.

      In both cases, your words show a complete misunderstanding of the problem; and offer completely meaningless description of completely unworkable approaches to a completely different problem.

      Someone recently described you as "The PerlMonks Village Idiot"; but that just unfair to village idiots, cos they can't help it.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://1122468]
Approved by Corion
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2024-04-19 13:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found