[OT] The interesting problem of comparing bit-strings.

by BrowserUk (Patriarch)
on Mar 24, 2015 at 08:08 UTC

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

The interesting problem of comparing bit-strings.

Perl's vec and bitwise string operators are very useful and powerful for manipulating (large) bitstrings, but they are limited in several ways.

Namely, there are no left and right shift operators for them, and the bitwise string operators only operate on whole bytes (8-bit boundaries).

One common operation needed for programming with bitstrings is the ability to search one bitstring for another bitstring, looking for matches that occur even at non-byte or integer-sized alignments.

So graphically, using 8-bit units instead of 64 for brevity, the problem is this:

Find (the first occurrence of):

needle = x1001010 10101001 10100xxx   (x represents bits we are not interested in)

in:

haystk = xxxxx001 10001101 01110010 10101010 01101001 01100011 10011001 00010001 .....

The specification of needle requires three values:

  • *ndl : points to the first unit (8-bits above, 64-bits in reality) of the bitstring to find.
  • oNdl : the offset, in bits, into the first unit, of the needle.
  • lNdl : the length in bits of what we are looking for.

Similarly, the haystack requires 3 values:

  • *hay : the address of the first unit of the bitstring to search.
  • oHay : the offset, in bits, into the first unit of the haystack.
  • lHay : the length, in bits, of the haystack.

The function required thus takes six arguments:

U64 bsSearch( U64 *hay, U8 oHay, U64 lHay, U64 *ndl, U8 oNdl, U64 lNdl ) { ... }

Return value

In an ideal world (Perl) the function would return the unit number (or address), and the offset into that unit, where the match is found (or 0).

As C doesn't permit multiple return values, I've opted for a single 64-bit value indicating the offset from the start of the first unit of the haystack (or -1?).

Found:

haystk = xxxxx001 10001101 01110010 10101010 01101001 01100011 10011001 00010001 .....
needle =                     x10010 10101010 0110100x xx   (note the alignment change!)
return = 01234567 89012345 6789 = 19th bit = unit 2/ bit 3

There are 128-bit shift instructions available on Intel/AMD 64-bit processors (and probably most others):

REX.W + 0F A5   SHLD r/m64, r64, CL   B   Valid   N.E.   Shift r/m64 to left CL places while shifting bits from r64 in from the right.
REX.W + 0F AD   SHRD r/m64, r64, CL   B   Valid   N.E.   Shift r/m64 to right CL places while shifting bits from r64 in from the left.

which are available on the MS compiler as intrinsics:

unsigned __int64 __shiftleft128(  unsigned __int64 LowPart, unsigned __int64 HighPart, unsigned char Shift );
unsigned __int64 __shiftright128( unsigned __int64 LowPart, unsigned __int64 HighPart, unsigned char Shift );

and probably in gcc (though I couldn't locate the names).

Using these, you can load two registers with two units of one of the bitstrings and when shifting the first register, bits from the second are shifted in.
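For compilers that lack those intrinsics, the same double-register shift can be written in portable C. A minimal sketch (the function name is my own invention, and it is only valid for shift counts strictly between 0 and 64):

```c
#include <stdint.h>

/* Portable stand-in for MSVC's __shiftleft128 (name invented here):
   shift 'high' left by 'shift' bits, pulling the vacated low-order
   bits in from the top of 'low'.  Only valid for 0 < shift < 64,
   because shifting a 64-bit value by 64 is undefined behaviour in C. */
static uint64_t shl128(uint64_t low, uint64_t high, unsigned char shift) {
    return (high << shift) | (low >> (64 - shift));
}
```

Good x64 compilers will typically recognise this pattern and emit a single SHLD instruction.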

Thus to process through the bitstrings 1 bit at a time, you use loops something like:

for( i = 0; i < ( lHay % 64 )+1; ++i ) {
    r1 = *( hay + i );
    r2 = *( hay + i + 1 );
    for( b1 = 0; b1 < 64; ++b1 ) {
        __shiftleft128( r2, r1, 1 );
    }
}

and:

for( j = 0; j < ( lNdl % 64 )+1; ++j ) {
    r3 = *( ndl + j );
    r4 = *( ndl + j + 1 );
    for( b2 = 0; b2 < 64; ++b2 ) {
        __shiftleft128( r4, r3, 1 );
    }
}

Except that:

  1. the terminating conditions of the outer loops need work (especially in light of 2 below).
  2. the initialisation/repopulation of the two registers needs to take account of the bit offsets oHay & oNdl respectively.
  3. the two loops need to run concurrently; with the second loop resetting to the starting condition every time r1 != r3;
  4. We might successfully match & advance through several units before hitting a mismatch.

    At which point we not only need to reset the needle (inner pair?) loop;

    we also need to reset the hay loop by the distance we successfully advanced before the failure, then advance one bit.

What have I tried so far?

In answer to the traditional question: You've just pretty much read it!

The function definition and a few local variables:

U64 bsSearch( U64 *hay, U8 oHay, U64 lHay, U64 *ndl, U8 oNdl, U64 lNdl ) {
    register U64 r1, r2, r3, r4;
    U64 i, j, rv;
    U8 b1, b2;
    ...
    return rv;
}

And the two separate loops above, and little else, despite many hours of thought and scribbling on post-its.

I suspect that the right approach might include another subroutine that compares the needle to the haystack at a given, fixed offset; but I'm stuck on how to code that so that it doesn't repeatedly shift one or both by an increasing number of bits each time it is called.
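For what it's worth, one possible shape for that fixed-offset compare, sketched with an invented getBits helper (LSB-first bit numbering within units, no bounds checking, and completely untuned):

```c
#include <stdint.h>

typedef uint64_t U64;
typedef uint8_t  U8;

/* Invented helper: fetch up to 64 bits starting at absolute bit offset
   'bit' into 'p', using LSB-first numbering within each unit.  No bounds
   checking: the caller must guarantee one readable unit of padding. */
static U64 getBits(const U64 *p, U64 bit) {
    U64 q = bit >> 6, r = bit & 63;
    U64 v = p[q] >> r;
    if (r) v |= p[q + 1] << (64 - r);
    return v;
}

/* Compare lNdl bits of the needle (starting oNdl bits in) against the
   haystack at absolute bit offset 'at'.  Returns 1 on match, 0 otherwise.
   A sketch only: it re-reads units rather than caching them in registers. */
static int bsCmpAt(const U64 *hay, U64 at, const U64 *ndl, U8 oNdl, U64 lNdl) {
    U64 done = 0;
    while (done < lNdl) {
        U64 n = lNdl - done;
        if (n > 64) n = 64;
        U64 mask = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
        if ((getBits(hay, at + done) ^ getBits(ndl, oNdl + done)) & mask)
            return 0;
        done += n;
    }
    return 1;
}
```

The search proper would then call bsCmpAt once per candidate bit offset; the shifting cost stays constant per call instead of growing with the offset.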

I also thought that it might make sense to copy the needle to an aligned chunk of memory, to eliminate its offset from the problem; but it could be a very large piece of memory, which would be expensive.

Update: I suspect one or even two gotos may be needed; though I'd avoid them if I can.

What am I looking for?

Cluebats; pseudo-code; real-code of any flavour.

Basically anything that will make the penny drop as to how to code this beast?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Re: [OT] The interesting problem of comparing bit-strings.
by roboticus (Chancellor) on Mar 24, 2015 at 11:41 UTC

    BrowserUk:

    Are you familiar with the Boyer-Moore string search? I bet you could use the same basic idea of building a state machine of the next address to check. But rather than analyzing the needle byte by byte, do it in parallel for each shifted variant. Something like the below. Note: I'm going to pretend the byte size is 4 bits here, and I also don't know boyer-moore by heart, so I'm winging it to give you the gist of it. If you like the idea, you can hash out the annoying details (heh!):

    Given
        Needle  : 1010 0011 1101 1100
        Haystack: 0001 0010 1101 1100 1010 1101 0001 1110 1110 0111

    1) Make three more copies of given, each bit-shifted

        Needle A: 1010 0011 1101 1100
        Needle B: x101 0001 1110 1110 0xxx
        Needle C: xx10 1000 1111 0111 00xx
        Needle D: xxx1 0100 0111 1011 100x
        Column       D    C    B    A

    2) Build the BM skip table based on the last complete byte

              D    C    B    A
        0000 ---- ---- ---- 4444
        0001 ---W -M-- ---- ----
        0010 --X- ---- ---- ----
        0011 ---W M--- ---- ----
        0100 ---- ---M ---- ----
        0101 -Y-W ---- ---- ----
        0110 --X- ---- ---- ----
        0111 ---W ---- ---M --M-
        1000 ---- --M- ---- ----
        1001 ---W ---- ---- ----
        1010 Z-X- ---- ---- ----
        1011 ---W ---- ---- ---M
        1100 ---- ---- ---- M---
        1101 -Y-W ---- M--- ----
        1110 --X- ---- -M-- -M--
        1111 ---W ---- --M- ----

        -   : an entry I didn't try to determine yet
        M   : Matched byte position, match previous needle char to previous table column.
        Z   : Perfect match found
        Y   : Perfect match if byte after position A matches remnant B
        X   : Perfect match if byte after position A matches remnant C
        W   : Perfect match if byte after position A matches remnant D
        1-4 : mismatch, skip 1..4 bytes and restart scan at position A

    Geh! I've got to get to work now. I'll update this at lunch time. Anyway, the state table will show how far you can skip ahead so you don't need to examine every byte in the haystack. Read the Boyer-Moore string search page, and you'll see what I'm talking about. I've drawn four state machines here by having 4 output edges based on which needle we're tracking, but in the final version, we'd combine the tables into simpler entries. If you like the idea, we can investigate that a bit. I've got some time this evening, so we could code something up.
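    In the meantime, for a baseline: plain byte-level Horspool (the simplified Boyer-Moore skip) looks like the sketch below; the bit-level scheme above would build one such table per shifted needle variant.

```c
#include <stddef.h>

/* Plain byte-level Boyer-Moore-Horspool, for reference only; the
   bit-level scheme would build one skip table per shifted variant
   of the needle.  Returns the byte offset of the first match, or -1. */
static long bmh(const unsigned char *hay, size_t n,
                const unsigned char *ndl, size_t m) {
    size_t skip[256], i, j;
    if (m == 0 || n < m) return -1;
    for (i = 0; i < 256; i++) skip[i] = m;          /* default shift */
    for (i = 0; i + 1 < m; i++) skip[ndl[i]] = m - 1 - i;
    for (i = 0; i + m <= n; i += skip[hay[i + m - 1]]) {
        for (j = 0; j < m && hay[i + j] == ndl[j]; j++)
            ;
        if (j == m) return (long)i;                 /* match at byte i */
    }
    return -1;
}
```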

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      I bet you could use the same basic idea of building a state machine of the next address to check

      Um. Boyer-Moore is an optimisation of the basic string search, isn't it?

      Right now I'm stuck just trying to implement the basic search taking into account the bit-level offsets. Optimisations can come later if required.


        No Boyer-Moore, but you can convert your bitstring search into a regular expression (in the CS sense, not the Perl one), and then compile it into an efficient state-machine.

        For instance, in order to look for bitstring 010101, you have to look for character strings matching any of...

        s1: 010101xx
        s2: x010101x
        s3: xx010101
        s4: xxx01010 1xxxxxxx
        s5: xxxx0101 01xxxxxx
        s6: xxxxx010 101xxxxx
        s7: xxxxxx01 0101xxxx
        s8: xxxxxxx0 10101xxx
        You can write the regular expression matching s1 as a character set $s1 = qr/[\x54\x55\x56\x57]/, and similarly for s2-s8, and then combine all of them as qr/$s1|$s2|...|$s8/.

        Then you would need some module able to convert that regular expression into a state machine, and use it for matching the vector... I am sure there is already some RegExp backend on CPAN that does that.

        The downside is that the resulting state-machines may be huge.

Re: [OT] The interesting problem of comparing bit-strings.
by marinersk (Priest) on Mar 24, 2015 at 10:59 UTC

    Quick thought, as I won't have time to dedicate to code-example before tonight, but in case it sparks a line of thought for you -- This is specifically to address the "I don't want it to keep rotating the bits over and over again" problem.

    Might a combination of rotation and mask help?

    Have an initializer which sets up all possible aligned bitmask searches, and then a routine which loops through all the pre-built masks looking for a match?

    In this short example which will likely be fraught with errors, assuming we are looking for 0b1010 in 0b00110100101. For readability's sake, this example plays on an 8-bit alignment requirement; simply increase the work of the initializer and searcher routines to take advantage of 64-bit alignment capabilities.

    The initializer sets up the following array of pre-rotated bitmasks:

    mask[0] = 0b11110000;
    mask[1] = 0b01111000;
    mask[2] = 0b00111100;
    mask[3] = 0b00011110;
    mask[4] = 0b00001111;

    I'm running on light air here, I think you'll need to also set up an array of pre-rotated comparison bitstrings:

    seabits[0] = 0b10100000;
    seabits[1] = 0b01010000;
    seabits[2] = 0b00101000;
    seabits[3] = 0b00010100;
    seabits[4] = 0b00001010;

    Then the main comparison routine iterates across the main bit buffer using these pre-built masks and values to ascertain if there is a match.

    Multilingual pseudocode for the main comparison routine:

    for (chunkcol = 0; chunkcol < (BitBufferLength / CHUNKSIZE); chunkcol++)
    {
        foreach $masknum (@mask)
        {
            if ( ( BitBuffer[chunkcol] & mask[$masknum] ) == $seabits[$masknum])
            {
                return ( ( chunkcol * CHUNKSIZE) + $masknum );  # FIXME: Check for fenceposting error(s)
            }
        }
    }
    return ($RET_NOTFOUND);
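    A literal C rendering of that pseudocode, using the 4-bit needle 0b1010 and 8-bit chunks from the example (names are illustrative only; like the pseudocode, it does not yet find matches that straddle a chunk boundary):

```c
#include <stdint.h>

/* C rendering of the pre-shifted mask scheme for the 4-bit needle
   0b1010 in 8-bit chunks (all names illustrative).  Like the
   pseudocode above, it misses matches straddling a chunk boundary.
   Returns the bit offset of the first match, or -1. */
static int maskSearch(const uint8_t *buf, int nBytes) {
    static const uint8_t mask[5]    = { 0xF0, 0x78, 0x3C, 0x1E, 0x0F };
    static const uint8_t seabits[5] = { 0xA0, 0x50, 0x28, 0x14, 0x0A };
    int i, s;
    for (i = 0; i < nBytes; i++)
        for (s = 0; s < 5; s++)
            if ((buf[i] & mask[s]) == seabits[s])
                return i * 8 + s;       /* bit offset within the buffer */
    return -1;
}
```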

    I should think a static accumulator can track which bit column you find the match in and simply exit the loop returning that value?

    I started coding a proof-of-concept in Perl but as soon as I got into the guts of it I knew this was going to take me longer than I have left this morning.

    Hope this got my point across and hope it either yields some inspiration or you already thought of it and have moved on. :-)


    Update: Fixed bitmask and searchbits arrays.
    Update: Added pseudocode.
    Update: Updated chunking in pseudocode.

Re: [OT] The interesting problem of comparing bit-strings.
by hdb (Monsignor) on Mar 24, 2015 at 20:29 UTC

    Low-level C programming is not my strength, so I stay with bit strings in Perl. Based on a block size of 8 bits, you would have 8 alignments of the needle. Each of these 8 alignments has a middle part of complete 8-bit blocks and a left and right partial block. I would use something like index to search for the middle block in the haystack bit string and then check whether or not the left and right partial blocks match.

    Apologies if anyone's proposal above is materially the same but the fact eluded me.

      The idea is to provide the functionality to Perl through an Inline::C module.

      I've done similar arrangements -- using bytes and shifts for the unaligned parts and 32 or 64-bit compares for the aligned parts -- but for large bitstrings -- mine are frequently > 1/4GB in size -- it is very slow.



        Ahhh....   the bitstrings are very large?   How about the typical length of the patterns for which you search within such strings?

        If the length of the string were at least 3 * 64 = 192 bits, then you would have a definite “crib,” in the form of a 64-bit quantity ... there are exactly 64 possibilities ... that must occur in the pattern at a quadword boundary.   This would be a “huge win™,” well worth the extra effort of creating a special-case for, in anticipation that it would actually pay-off most of the time.

        I-F the string being sought was at least 192 bits long, then you are able to calculate, with certainty, the 64 possible values that exist for “the 64-bit ‘word in the middle.’”   You can now search for each of these in turn, using an extremely efficient REPNE SCASD (i86 ...) assembler-language construct, and then explore the surrounding doublewords in each case to determine if the bit-string occurs.   (If the run being sought is “even longer,” then a REPNE CMPSD loop can be applied to check all of the “subsequent doublewords,” the values of all of which once again can be determined with certainty.   Only the partial-doublewords at the beginning and at the end, if any, remain to be verified.)

        Easily the most efficient implementation would be to loop through all 64 of the pre-computed possibilities, using the assembler instructions aforementioned to locate all occurrences within the buffer.   (Of course, and as you of all Monks well know, you will need to use a “sliding window” scan through the compass of the entire file, to avoid missing any bitstrings that serendipitously fell across a boundary, but, I digress.)

        Each and every time you stumble-upon any of these 64 values, the odds that you have, in fact, “found another answer” are rather-astronomically high.   In fact, in such a case you might even wish to “take a calculated risk” that in fact you need look no further.   If you actually discover a matching 64-bit quantity within the buffer, then the odds are “64 in 2^64” that the entire value-being-sought is present.   Certainly, if the REPNE CMPSD succeeds with regard to a location found by REPNE SCASD, then the odds of being wrong become absurdly (and dismissably) small:   “It is a match.   It has to be.”   At this point, the actual odds of committing a “type-2 error” are . . . zero.

        No, this is not a solution to your entire problem:   if the bit-string being sought turns out to be shorter, it will not help at all.   But, for longer strings, it will cause the problem to “virtually disappear.”

Re: [OT] The interesting problem of comparing bit-strings.
by VinsWorldcom (Prior) on Mar 25, 2015 at 15:53 UTC

    This is way over my head, but as a proof of concept, I tried to use Bit::Vector to compare the needle and haystack as strings. Probably not very efficient for the large sets you're talking about.

    #!perl
    use strict;
    use warnings;
    use Bit::Vector;

    # 101001
    my $nedInHay = Bit::Vector->new_Dec(6, 41);
    # 101000
    my $nedNOTInHay = Bit::Vector->new_Dec(6, 40);
    # 100101001001000011111010
    my $hay = Bit::Vector->new_Dec(24, 9736442);

    print $nedInHay->to_Bin;
    if ((my $idx = index ($hay->to_Bin, $nedInHay->to_Bin )) != -1) {
        print " Found. Starts at position - $idx\n"
    }
    print $nedNOTInHay->to_Bin;
    if ((my $idx = index ($hay->to_Bin, $nedNOTInHay->to_Bin )) != -1) {
        print " Found. Starts at position - $idx\n"
    }
    print "\n"

    ... and run ...

    VinsWorldcom@C:\Users\VinsWorldcom\tmp> test.pl
    101001 Found. Starts at position - 3
    101000
Re: [OT] The interesting problem of comparing bit-strings.
by Ea (Chaplain) on Mar 25, 2015 at 09:32 UTC
    My first thought was that the problem is similar to what you might do in bioinformatics, as mentioned by Anonymous earlier. Although there are likely significant differences between bitstrings and character strings that I can't help you with, I was listening to Natural Language Processing lectures on Coursera and the Edit Distance lectures attacked the problem with Dynamic Programming. Maybe that helps you with the coding strategy?

    Sometimes I can think of 6 impossible LDAP attributes before breakfast.

      Bioinformatics is one use (actually many uses) of large bitstring searches.

      Others include:

      1. Finding patterns of usage in temporal data.

        Each bit represents a time period; 1s represent activity within that time period.

      2. Steganography.

        Both application of, and detection of.


Re: [OT] The interesting problem of comparing (long) bit-strings.
by Anonymous Monk on Mar 28, 2015 at 01:39 UTC

    You'll likely want a search that is strongly sub-linear, i.e. better than O(n). There are string search algorithms that approach O(log(m)*n/m) on average. With long needles (length m), that'll be a blast. However, you don't get anything for free: preprocessing requirements are O(m) in time and space.

    Google for "Turbo Reverse Factor" and "Alpha Skip Search". Wikipedia seems to lack pages for these. With alpha skip, you create a trie (or a table) into which you push the offsets of all substrings of length log(m). Bitstrings are just strings over a size-2 alphabet. Which algorithm or variant thereof will prove most suitable for you is something hardly anyone here could guess.

      You'll likely want a search that is strongly sub-linear, ie better than O(n). There are string search algorithms that approach O(log(m)*n/m), on the average.

      On modern hardware, the single most expensive avoidable* 'operation' is a cache miss. With instruction cache misses more expensive than data cache misses, due to pipelining.

      *context switches are more expensive, but are, at least in the long term, unavoidable.

      The problem with (all?) the analyses of the various sub-linear algorithms is that they ignore the costs & benefits of 2 and 3 levels of on-chip instruction & data cache (or were done before those became a significant factor for this type of algorithm).

      My 7 y/o cpu, with only 2-level caching, can perform upwards of 1000 instructions in the time it takes to fulfill a data-cache miss. The effect for an instruction cache miss is even greater.

      Sub-linear algorithms exacerbate both problems, in 3 different ways:

      1. They require extra memory in the form of lookup/lookahead tables.

        That space, and the frequent access to it, directly hit the data cache's ability to keep the relevant chunks of the haystack and needle in cache.

      2. The algorithms call for look-aheads, look-behinds and/or probes; that upsets the caching algorithm's ability to keep caches primed with the required data.

        This is especially important (64 times as important) for a bitstring search, where each bit needs to be compared 64 times, as part of 64 different groups of 64 bits.

        This is the single most important factor that has come out of my testing of real code, on a real machine. Modern hardware is optimised to scream through memory in a linear fashion; and every single time you do anything that upsets that relentless, mechanical (electronic) forward (or reverse) flow, you pay a huge hit.

      3. They also require many more instructions and branch points.

        The number of instructions is significant because, if instructions early in the loop get flushed by instructions later in the loop, then you not only get a stall for the instruction cache when the loop iterates; it also causes a pipeline flush and invalidates branch predictions. All together, a very expensive situation.

        The number of branch points is significant because they compound exponentially in their effects on failed branch predictions; with the knock-on effect of instruction cache stalls if you breach the cache size cliff.

      The beauty of well-written brute-force algorithms is that they derive the absolute maximum benefit from the cache effect, by doing as much work as possible on each cache line before allowing it to get flushed, thus minimising the need for reloads:

      1. The inner loops are very small with only 1 or 2 branch points required.

        That ensures no instruction cache misses and fullest possible benefits from both branch predictions and pipelining.

      2. Because the data is processed strictly serially, and all the work required on each piece is done with just one from-memory load, they work with both the hardware branch-prediction and caching algorithms to best effect.

      In the following table of benchmark timings from my implementation:

      • The horizontal list of numbers at the top is the length of the needle in bits.
      • The major vertical tick numbers are the distance, in quads, into the 128MB/1Gib haystack at which the needle is found.
      • The minor vertical tick numbers are the offset, in bits, into the quad, where the needle is found.

      Summary. The algorithm is very linear in terms of the distance into the buffer; and sub-linear in terms of the length of the needle. And it is very fast.

      Can it be beaten? Maybe, but I very much doubt it will be done using Boyer Moore or similar, jump-around-the-data algorithms. Time will tell :)



      Alright. Here's a demo of alpha skip search on bitstrings. The needle size is fixed, but it should be trivial to follow this with a longer compare. A possibility exists that some off-by-one errors have crept in. If that's the case, just deduct from my paycheck. :-)

      Alpha skip is quadratic in the worst case and probably not the best choice when nasty matches are expected. Anyway, I'm curious how this might hold up against the optimized brute-force routines.

      Note: to change the needle size, just edit the M and logM values. (Compile with -fno-strict-aliasing, or fix the type-punning).

        Making minimal changes to your code to allow it to compile here, it simply never finds anything (very quickly), no matter how simple I make the search criteria:

        C:\test\C>cl /W3 bitsearch-oiskuu.c
        Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64
        Copyright (C) Microsoft Corporation.  All rights reserved.

        bitsearch-oiskuu.c
        bitsearch-oiskuu.c(49) : warning C4267: '=' : conversion from 'size_t' to 'U16', possible loss of data
        bitsearch-oiskuu.c(78) : warning C4267: '=' : conversion from 'size_t' to 'char', possible loss of data
        bitsearch-oiskuu.c(80) : warning C4267: '+=' : conversion from 'size_t' to 'char', possible loss of data
        Microsoft (R) Incremental Linker Version 9.00.21022.08
        Copyright (C) Microsoft Corporation.  All rights reserved.

        /out:bitsearch-oiskuu.exe
        bitsearch-oiskuu.obj

        C:\test\C>bitsearch-oiskuu
        Result           -1     Took 1.837746e-004 secs

        I've wasted enough time on it; and won't be expending any more, but here is the modified code that compiles clean (apart from warnings that originate from your posted code):

        #include <stdio.h>
        #include <stdlib.h>
        //#include <stdint.h>
        #include <string.h>
        //#include <unistd.h>

        //typedef uint32_t U32;
        //typedef uint16_t U16;
        typedef unsigned __int64 U64;
        typedef unsigned __int32 U32;
        typedef unsigned __int16 U16;
        typedef unsigned __int8  U8;

        enum { M = 8192, logM = 13, Mmask = M - 1, Mskip = M - logM + 1, Mnil };

        typedef struct {
            U16 head[M], next[M];   // index chains
        } skiptab;

        // extract logM bits at bit offset i. assert(logM <= 16)
        static __inline size_t fbits(char *v, size_t i) {
            U32 *x = (U32 *)&v[2 * (i >> 4)];
            return Mmask & (*x >> (i & 15));
        }

        // 'tis rotten. should make shifted copies of x and memcmp the middle.
        static size_t xcmp(char *y, size_t j, char *x) {
            size_t i, diff = fbits(y, j) - fbits(x, 0);
            for (i = M % logM; i < M && !diff; i += logM) {
                diff = fbits(y, i + j) - fbits(x, i);
            }
            return diff;
        }

        static void prep(char *x, skiptab *Z) {
            size_t i, f;
            for (i = 0; i < M; i++) {
                Z->head[i] = Mnil;
            }
            for (i = 0; i < Mskip; i++) {
                f = fbits(x, i);
                Z->next[i] = Z->head[f];
                Z->head[f] = i;
            }
        }

        // y of length n is haystack; x is the needle
        size_t search(char *y, size_t n, char *x, size_t m) {
            skiptab Z[1];
            size_t i, j;
            // assert(m == M);
            prep(x, Z);
            n -= logM;
            for (i = -1; (i += Mskip) <= n;) {
                j = fbits(y, i);
                for (j = Z->head[j]; j != Mnil; j = Z->next[j]) {
                    if (!xcmp(y, i-j, x)) return i-j;
                }
            }
            return -1;
        }

        // one-bit rotate
        void rot1(char *v, size_t n) {
            size_t i, c;
            for (c = i = 0; i < n; c >>= 8, i++) {
                v[i] = c += *(unsigned char *)(v+i) << 1;
            }
            v[0] += c;
        }

        int main( int argc, char **argv ) {
            U64 hay[1] = { 0xdeadbeefdeadbeef };
            U64 ndl[1] = { 0xdeadbeefdeadbeef };
            U64 start, end, found = 0;

            start = __rdtsc();
            found = search( (char*)hay, sizeof(hay)*8, (char*)ndl, sizeof(ndl)*8 );
            _ReadBarrier();
            end = __rdtsc();

            printf( "Result %12d\t", found );
            printf( "Took %9e secs\n", (double)( end - start ) / 2394046359.0 );
            return 0;
        }

        I'd suggest you enter it in an obfuscated C competition; could be a winner.


        I'm curious how this might hold up against the optimized brute-force routines.

        It's 3 orders of magnitude slower.

      I'll attempt a brief description of the Alpha Skip Search. Hopefully, this may shed some light on the basic operation of the algorithm.

      Consider a pattern of length M. The approach is to quickly locate any fractional substring within the pattern, wherever it may occur. Substrings of length F take offsets from 0 .. M-F, inclusive. Thus there is a set of M-F+1 needle fragments. The preprocessing stage creates an index of those fragments. A perl implementation might read as follows:

      push @{$index{ substr($needle, $_, $F) }}, $_     for 0 .. $M-$F;

      Search phase proceeds by inspecting the haystack one window at a time. Window shift is the same figure as above, M-F+1. Each time, a substring is extracted at the end of the window. The index now yields all possible alignments where this fragment will match in the pattern.

      Theoretical optimum for F is logsM (log based on alphabet size). That amounts to one entry per index bucket, on the average. For random data, there will be one alignment to test per window, and this is likely to fail at the very start.

      In the case of bit-string matching, the substrings are nothing but small integers below (1<<F), which suggests a very simple array-based indexing scheme (instead of trie). Here's a sample implementation in C (just a rehash of the previous demo, main difference being this one handles variable length needles).

      Tested with gcc/clang under linux; my apologies if the code is too obscure, or otherwise unworkable.

      Update. Simplified above code. Note that it is sometimes difficult to put a name on a string search routine, especially with bit-strings. Minor modification can turn one search into another, and the variants abound. Here's a version that might be called Horspool.

        Despite saying I wouldn't look at this further, since you updated, I did. But I didn't get very far.

        Once I commented out a couple of (non-existent on my system) unix-y headers -- no idea if they are necessary or not:

        #include <stdio.h>
        //#include <stdint.h>
        #include <stdlib.h>
        #include <string.h>
        //#include <unistd.h>

        I got this:

        c:\test\c\oiskuu-search.h(9) : warning C4293: '<<' : shift count negative or too big, undefined behavior

        Which I fixed by changing 1L to 1ull:

        enum { logM = 32, Mmax = 1ull << logM, Mdflt = 65536, Mnil = Mmax - 1 };

        Which got me to:

        c:\test\c\oiskuu-search.h(10) : warning C4341: 'Mmax' : signed value is out of range for enum constant
        c:\test\c\oiskuu-search.h(10) : warning C4309: 'initializing' : truncation of constant value
        c:\test\c\oiskuu-search.h(14) : error C2143: syntax error : missing ')' before '('
        c:\test\c\oiskuu-search.h(14) : error C2091: function returns function
        c:\test\c\oiskuu-search.h(14) : error C2059: syntax error : ')'
        c:\test\c\oiskuu-search.h(14) : error C2085: 'bt_off_t' : not in formal parameter list

        From these 3 lines:

        // F = logM is theoretically optimal. Mmax is our maximum needle size.
        // if logM/Mmax is changed, be sure to also check the bt_off_t below
        enum { logM = 32, Mmax = 1ull << logM, Mdflt = 65536, Mnil = Mmax - 1 };
        enum { F = 16, Ftop = 1 << F, Fmask = Ftop - 1 };

        //typedef uint32_t bt_off_t; // must hold 0..Mnil
        enum { BT_OFF_MAX = Mnil } __attribute__((packed)) typedef bt_off_t;

        I hope you'll understand that I'm not being unfair when I say I have simply no clue how to fix that...but I did try.

        (Quite why you need to invent your own special integer types all over the place is beyond my understanding.)


      so instead of brute forcing a search problem, the solution is to run out of memory ... brilliant :P
        Boyer-Moore is O(1) in memory usage.

        Hm? Boyer-Moore prep is the same, O(m) in time and space. A good-shift table of length m is used. It might be more compact in practice.

        If the problem is too large, divide it! Code a tuned search routine for fixed length (sub)needles, such that all tables fit into L2. My guess is KB needles can be done. Now the search becomes two-level. Top-level "alphabet" are the subneedles.

        As for brute-force. The "Boyer-Moore beating assembler versions that utilize microcoded string search instructions" is a decades-old lesson. Tried to google for some edutaining links, but sadly, google has become ****.

Re: [OT] The interesting problem of comparing bit-strings.
by sundialsvc4 (Abbot) on Mar 25, 2015 at 20:04 UTC

    “Bit vector” problems such as this one actually have many different analogs. (Graphics processing, in particular, comes to mind.) These problems become expensive and difficult to solve if the strings are short (less than the native word-size of the machine), and/or if there is positively no “crib” that can be meaningfully applied to cut the search-space.

    For instance, if you are routinely looking for bit-strings that are (say) only 20-30 bits long, then there really isn’t any alternative but to pedantically growl through the buffer. If you are looking for a 20-bit-long quantity in a world of 64-bit words, that quantity could be in any one of 64 positions, and within that space of 128 total bits, 108 of them are entropy. Strategies such as the one which I previously suggested would be of no real value, because the shorter the “window” is, the less likely it is that “a hit” implies the existence of “a solution.”

    Even if you try to press strategies such as Moore into service, you find that they are not nearly so effective in this situation as they are in the text-search applications for which they were intended. Why? Because uncompressed text actually has only a very small number of possible bit-patterns within each byte. The entropy of the data is very small. If you consider text “one character at a time,” you get a familiar letter-frequency distribution: ETAOIN.... In that context, Moore’s algorithm might well save some time. Whereas, if the data is a bit-string, the distribution of byte-wise values is much more likely to be flat, simply because the data is not aligned to any sort of incremental boundary.

    Hence, the necessity to “divide and conquer.” To find some rune by which you can quickly eliminate all places where the data could not be.

      “Bit vector” problems such as this one actually have many different analogs. (Graphics processing, in particular, comes to mind.)

      Arrays of rgb or rgba (or CMYK or HSV) are not bit vectors.

      For instance, if you are routinely looking for bit-strings that are (say) only 20-30 bits long, then there really isn’t any alternative but to pedantically growl through the buffer.

      There isn't any alternative. Full stop. Regardless of the length of the needles.

      But when a sensible algorithm can locate 1 million bits close to the end of a 1/4 billion bits in 0.25 seconds, who cares if you have to "pedantically growl".

      Strategies such as the one which I previously suggested would be of no real value,

      It is of no value. Full stop.

      Even if you try to press strategies such as Moore into service, you find that they are not nearly so effective in this situation

      If by "Moore" you mean Boyer-Moore, you're wrong again. It requires some modifications, but it can be very useful. Just not with microcoded scan loops.

      Hence, the necessity to “divide and conquer.” To find some rune by which you can quickly eliminate all places where the data could not be.

      There's no necessity. And anything that takes longer than 1/4 of a second is pointless.


        But when a sensible algorithm can locate 1 million bits close to the end of a 1/4 billion bits in 0.25 seconds, who cares if you have to "pedantically growl" ..... And anything that takes longer than 1/4 of a second is pointless.

        Out of curiosity, I tested the gmp library's capability with the above scenario - under the assumption that "1/4 billion" is "250 million".
        The haystack was pseudo-randomly generated, and the needle starts 19 bits from the left hand end of the haystack.
        I used a combination of Math::GMPz and Inline::C.

        It took 3 seconds to find the needle in the haystack (starting from the right hand end of the haystack) on Windows 7 64 bit.

        In some parts of the world "1/4 billion" is "25 million" and the program then, of course, finds the needle 10 times quicker.

        I'm not at all impressed with the code that I wrote in order to check this. But I was impressed that a general usage library could provide such a speedy implementation of the approach.
        Unfortunately, it's not quick enough to be pointful ;-)
        use strict;
        use warnings;
        use Math::GMPz qw(:mpz);
        use Time::HiRes;

        use Inline C => Config =>
            USING => 'ParseRegExp',
            BUILD_NOISY => 1,
            INC => '-IC:/MinGW/msys/1.0/local/include',
            LIBS => '-LC:/MinGW/msys/1.0/local/lib -lgmp',
            TYPEMAPS => './typemap', # ships with Math::GMPz
            ;

        use Inline C => <<'EOC';
        #include <gmp.h>

        int start(mpz_t * haystack, int haystack_size, mpz_t * needle, int needle_size) {
            int h_bit, n_bit, nless1 = needle_size - 1;
            for(h_bit = nless1; h_bit < haystack_size; h_bit++) {
                for(n_bit = 0; n_bit < needle_size; n_bit++) {
                    if(mpz_tstbit(*haystack, h_bit - needle_size + n_bit)
                       != mpz_tstbit(*needle, n_bit)) break;
                    if(n_bit == nless1) return h_bit;
                }
            }
            return 0;
        }
        EOC

        my ($t1, $t2, $match);
        my ($haystack_size, $needle_size) = (250e6, 1e6);

        # generate the haystack - takes a while
        my $h_string = rstring($haystack_size);

        # generate a needle that's guaranteed to match, starting
        # at 19 bits from the lh end - ie at bit 249999981
        my $n_string = substr($h_string, 19, $needle_size);

        # create the haystack gmp vector. (We need an additional
        # leading "1" bit to ensure that leading zero bits aren't ignored.)
        my $haystack = Math::GMPz->new("1" . $h_string, 2);

        # create the needle gmp vector. (We need an additional
        # leading "1" bit to ensure that leading zero bits aren't ignored.)
        my $needle = Math::GMPz->new("1" . $n_string, 2);

        # Check that we've got the right number of bits.
        print Rmpz_sizeinbase($haystack, 2), " ",
              Rmpz_sizeinbase($needle, 2), "\n";

        $t1 = Time::HiRes::time();
        $match = start($haystack, $haystack_size, $needle, $needle_size);
        $t2 = Time::HiRes::time();

        print $match, " ", $t2 - $t1, "\n";

        sub rstring {
            die "haystack size must be a multiple of 10" if $_[0] % 10;
            my $ret;
            my $its = $_[0] / 10;
            $ret .= sprintf("%010b", int rand 1024) for 1 .. $its;
            if(length($ret) != $_[0] || $ret =~ /[^01]/) {
                die "Error in sub rstring";
            }
            return $ret;
        }

        # Outputs:
        # 250000001 1000001
        # 249999981 3.16680598258972
        Cheers,
        Rob

Node Type: perlquestion [id://1121101]
Approved by karlgoethebier
Front-paged by karlgoethebier