in reply to Re: [OT] The interesting problem of comparing (long) bit-strings.
in thread [OT] The interesting problem of comparing bit-strings.

I'll attempt a brief description of the Alpha Skip Search. Hopefully, this may shed some light on the basic operation of the algorithm.

Consider a pattern of length M. The approach is to quickly locate any fractional substring within the pattern, wherever it may occur. Substrings of length F take offsets from 0 .. M-F, inclusive. Thus there is a set of M-F+1 needle fragments. The preprocessing stage creates an index of those fragments. A perl implementation might read as follows:

push @{$index{ substr($needle, $_, $F) }}, $_     for 0 .. $M-$F;

Search phase proceeds by inspecting the haystack one window at a time. Window shift is the same figure as above, M-F+1. Each time, a substring is extracted at the end of the window. The index now yields all possible alignments where this fragment will match in the pattern.

Theoretical optimum for F is logsM (log based on alphabet size). That amounts to one entry per index bucket, on the average. For random data, there will be one alignment to test per window, and this is likely to fail at the very start.

In the case of bit-string matching, the substrings are nothing but small integers below (1<<F), which suggests a very simple array-based indexing scheme (instead of trie). Here's a sample implementation in C (just a rehash of the previous demo, main difference being this one handles variable length needles).

Tested with gcc/clang under linux; my apologies if the code is too obscure, or otherwise unworkable.

Update. Simplified above code. Note that it is sometimes difficult to put a name on a string search routine, especially with bit-strings. Minor modification can turn one search into another, and the variants abound. Here's a version that might be called Horspool.

Replies are listed 'Best First'.
Re^3: [OT] The interesting problem of ... bit-strings (alpha skip search)
by BrowserUk (Pope) on Apr 01, 2015 at 11:47 UTC

    Despite saying I wouldn't look at this further; since you updated, I did. But I didn't get very far.

    Once I commented out a couple (non-existent on my system) unix-y headers -- no idea if they are necessary or not:

    #include <stdio.h> //#include <stdint.h> #include <stdlib.h> #include <string.h> //#include <unistd.h>

    I got this:

    c:\test\c\oiskuu-search.h(9) : warning C4293: '<<' : shift count negat +ive or too big, undefined behavior

    Which I fixed by changing 1L to 1ull:

    enum { logM = 32, Mmax = 1ull << logM, Mdflt = 65536, Mnil = Mmax - 1 +};

    Which got me to:

    c:\test\c\oiskuu-search.h(10) : warning C4341: 'Mmax' : signed value i +s out of range for enum constant c:\test\c\oiskuu-search.h(10) : warning C4309: 'initializing' : trunca +tion of constant value c:\test\c\oiskuu-search.h(14) : error C2143: syntax error : missing ') +' before '(' c:\test\c\oiskuu-search.h(14) : error C2091: function returns function c:\test\c\oiskuu-search.h(14) : error C2059: syntax error : ')' c:\test\c\oiskuu-search.h(14) : error C2085: 'bt_off_t' : not in forma +l parameter list

    From these 3 lines:

    // F = logM is theoretically optimal. Mmax is our maximum needle size. // if logM/Mmax is changed, be sure to also check the bt_off_t below enum { logM = 32, Mmax = 1ull << logM, Mdflt = 65536, Mnil = Mmax - 1 +}; enum { F = 16, Ftop = 1 << F, Fmask = Ftop - 1 }; //typedef uint32_t bt_off_t; // must hold 0..Mnil enum { BT_OFF_MAX = Mnil } __attribute__((packed)) typedef bt_off_t;

    I hope you'll understand that I'm not being unfair when I say I have simply no clue how to fix that...but I did try.

    (Quite why you need to invent your own special integer types all over the place is beyond my understanding.)


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked