
Re^18: [OT] The interesting problem of comparing (long) bit-strings.

by salva (Canon)
on Apr 03, 2015 at 09:06 UTC


in reply to Re^17: [OT] The interesting problem of comparing (long) bit-strings.
in thread [OT] The interesting problem of comparing bit-strings.

Answers to questions:

oiskuu says: re: bitstrstr. That looks like Horspool. (Most B-M simplifications have dropped the "good-suffix" shift and kept the "bad-character(s)" shift).

Yes, it is actually Boyer-Moore-Horspool.
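
For context, a classic byte-wise Horspool search (the scheme this generalizes to the bit level) looks roughly like the sketch below. This is a generic textbook version, not code from the repository; the bit-level variant replaces the 256-entry bad-character table with the windowed, compressed delta table discussed further down.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Classic byte-wise Boyer-Moore-Horspool: bad-character shift only,
       O(1) extra memory (a 256-entry shift table).                      */
    static const uint8_t *bmh_search(const uint8_t *hay, size_t hlen,
                                     const uint8_t *pin, size_t plen) {
        size_t shift[256], i;
        if (plen == 0 || plen > hlen) return NULL;
        for (i = 0; i < 256; i++) shift[i] = plen;     /* default: skip the whole needle */
        for (i = 0; i + 1 < plen; i++) shift[pin[i]] = plen - 1 - i;
        for (i = 0; i + plen <= hlen; i += shift[hay[i + plen - 1]]) {
            if (memcmp(hay + i, pin, plen) == 0) return hay + i;
        }
        return NULL;
    }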

I still have to come up with a way to implement the good-suffix tables without incurring a 32*m (or 64*m) memory usage, which I consider unacceptable. Using the delta compression would reduce it to 8*m. Maybe it can be done at the byte level, and then it would come down to 1*m... the thing is that I like the O(1) memory consumption of B-M-H.
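
To make those figures concrete (my reading of them, counting bits of table space for an m-bit needle; the sizes below assume a hypothetical 1 Mbit needle):

    #include <stdint.h>

    enum { M_BITS = 1 << 20 };                 /* hypothetical 1 Mbit needle */

    /* Classic good-suffix table, one 32-bit shift per needle bit: 32*m bits (4 MB here). */
    static uint32_t gs_full[M_BITS];
    /* One 8-bit index per needle bit into a shared delta table: 8*m bits (1 MB here).    */
    static uint8_t  gs_index[M_BITS];
    /* One 8-bit index per needle *byte*: 1*m bits (128 KB here).                         */
    static uint8_t  gs_byte[M_BITS / 8];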

Where does the delta compression idea come from?

It is my own; the aim is to keep all the delta information in the cache.

Currently, the GitHub repo has three variants of the algorithm: the "master" branch, which tries to work at byte boundaries when looking for the bad-character shift; the "simplify" branch, which works at the bit level; and the "caching" branch (implementing the delta compression), which tries to be cache friendly.

I still don't know if there will be any effect on performance. I think there would be edge cases where having a precise delta would help, but I don't know if those are likely to appear in real-life data.

The same question arises when deciding whether to run the bad-character test at byte boundaries first or just work at the bit level. The former removes a memory load (very likely served from the processor cache) and a shift operation.

I was considering another variation: working at the byte level, and then performing 8 parallel bitstring comparisons when delta < 8 bits (or even working with uint16_t units, and performing 16 comparisons in parallel).
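
A rough sketch of that last idea (hypothetical helper names, nothing taken from the repository): the byte-level scan nominates a candidate byte position, and the needle is then verified at each of the 8 possible bit offsets inside that byte.

    #include <stdint.h>

    /* Naive bit-level compare: does pin (lpin bits) match hay starting at
       absolute bit offset bitpos?  Bits are numbered MSB-first per byte.  */
    static int bits_equal(const uint8_t *hay, uint64_t bitpos,
                          const uint8_t *pin, uint64_t lpin) {
        uint64_t i;
        for (i = 0; i < lpin; i++) {
            uint64_t h = bitpos + i;
            int hb = (hay[h >> 3] >> (7 - (h & 7))) & 1;
            int pb = (pin[i >> 3] >> (7 - (i & 7))) & 1;
            if (hb != pb) return 0;
        }
        return 1;
    }

    /* After the byte-level bad-character scan stops at byte_pos, the needle
       may still start at any of the 8 bit offsets within that byte; check
       them one by one (with uint16_t units it would be 16).                 */
    static int64_t verify_candidate(const uint8_t *hay, uint64_t byte_pos,
                                    const uint8_t *pin, uint64_t lpin) {
        unsigned o;
        for (o = 0; o < 8; o++) {
            uint64_t bitpos = byte_pos * 8 + o;
            if (bits_equal(hay, bitpos, pin, lpin)) return (int64_t)bitpos;
        }
        return -1;
    }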

BrowserUk says: did you start with a (simple byte-wise) Boyer-Moore implementation cribbed from somewhere?

No, I started from scratch.

BrowserUk says: I'm confused as to the difference between needle_offset & needle_prefix?

needle_prefix is just a hack for testing byte-unaligned needles.

BrowserUk says: If you have a brief explanation of the values in the delta table it might help. I tried looking at them in hex & binary to understand their purpose; but nothing leaps off the page at me.

It is (mostly) an exponential sequence used to reduce the size of the B-M-H delta table to something that fits into the L1 cache. The script used to generate it is also in the repository.

The jump table contains indexes (uint8_t) into the delta table (uint32_t). That way, for a 14-bit window size, the function uses 1*(1<<14) + 256*4 bytes = 17KB of working memory, which fits in the 32KB L1 cache of current x86/x86_64 CPUs.
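
A rough sketch of how such a pair of tables could be laid out (the names, the growth factor and the initialisation are my guesses, not the repository's exact code):

    #include <stdint.h>
    #include <stdio.h>

    #define WINDOW_BITS 14
    #define DELTA_SLOTS 256

    static uint32_t delta[DELTA_SLOTS];        /* compressed shift values (uint32_t)  */
    static uint8_t  jump[1u << WINDOW_BITS];   /* per-window-value index into delta[] */

    /* Fill delta[] with a roughly exponential sequence so 256 slots can span
       shifts from 1 bit up to ~100K bits; the growth factor is a guess.      */
    static void init_delta(void) {
        double v = 1.0;
        int i;
        for (i = 0; i < DELTA_SLOTS; i++) {
            delta[i] = (uint32_t)v;
            v *= 1.046;                                      /* 1.046^255 ~ 1e5    */
            if ((uint32_t)v <= delta[i]) v = delta[i] + 1.0; /* keep it increasing */
        }
    }

    int main(void) {
        init_delta();
        /* Working set: 1 byte per window value + 4 bytes per delta slot. */
        printf("%zu bytes\n", sizeof jump + sizeof delta);   /* 16384 + 1024 = 17408 */
        return 0;
    }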


Replies are listed 'Best First'.
Re^19: [OT] The interesting problem of comparing (long) bit-strings.
by BrowserUk (Patriarch) on Apr 04, 2015 at 09:14 UTC

    Okay. I succeeded in getting the caching variant to compile, but it crashes for needles > 20 bits. (Note: that is almost certainly down to changes I had to make to get it to compile here (LLP64 vs. LP64 changes), but since I do not understand the code, and there's nothing to reference it against, introducing bugs was almost inevitable.)

    But 2 days banging my head is enough. I don't need to prove your point for you. Here is a simple version of my brute-force bitstrnstrn(); if you're motivated to finish your code and compare, be my guest.

    Note: the eclectic ordering of the parameters, bitstrnstrn( &hay, &pin, bit_offset_hay, bit_offset_pin, bit_len_hay, bit_len_pin ), means that the values most needing to be in registers automatically are, as a consequence of the x64 calling convention (for my compiler).

    /* SL(): read a 64-bit window starting o bits into v[0], pulling the
       spill-over bits from v[1] (MSVC __shiftleft128 intrinsic).         */
    #define SL( v, o ) __shiftleft128( *((v)+1), *((v)), o )

    /* Compare lpin bits of pin (from bit offset opin) against hay (from
       bit offset ohay); returns 1 on match, 0 otherwise.                 */
    __forceinline U8 cmpBits( const register U64 *hay, const register U64 *pin,
                              U8 register ohay, U8 register opin, U64 lpin ) {
        U64 register i;
        for( i = 0; i < lpin / 64; ++i, ++hay, ++pin ) {
            if( SL( hay, ohay ) != SL( pin, opin ) ) return 0;
        }
        {
            U8  lastbits = lpin % 64;
            U64 mask = ( ( 1ull << lastbits ) - 1 ) << ( 64 - lastbits );
            if( ( SL( hay, ohay ) & mask ) != ( SL( pin, opin ) & mask ) ) return 0;
        }
        return 1;
    }

    /* Brute force: slide the needle one bit at a time over the haystack. */
    U64 bitstrnstrn( const register U64 *hay, const register U64 *pin,
                     U8 register ohay, U8 register opin, U64 lhay, U64 lpin ) {
        U64 register i;
        for( i = ohay; i < lhay + ohay; ++i ) {
            if( cmpBits( hay + ( i / 64 ), pin, i % 64, opin, lpin ) ) return i;
        }
        return -1;
    }
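
    For anyone porting that away from MSVC, a plain-C equivalent of the SL() window read (sketched here, not taken from the post) would be:

    #include <stdint.h>

    /* 64-bit window starting o bits (0..63) into v[0], pulling the spill-over
       bits from v[1]; equivalent to __shiftleft128( *(v+1), *v, o ).          */
    static inline uint64_t sl_portable( const uint64_t *v, unsigned o ) {
        return o ? ( v[0] << o ) | ( v[1] >> ( 64 - o ) ) : v[0];
    }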

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^19: [OT] The interesting problem of comparing (long) bit-strings.
by oiskuu (Hermit) on Apr 04, 2015 at 00:43 UTC

    I don't know if I'd call it B-M-H either. B-M-H-S, perhaps. :-)

    There are three crucial differences from Horspool, as far as I can see:

    • Single-bad-character-shift is extended to multi-character, or rather, multi-bit in this case.
      That much is pretty obvious when working with bit-strings.
    • The shift values are compressed. Whether the extra work is offset by the space savings, I cannot tell.
    • A window limit is enforced. In this instance, the biggest shift is (delta[255] = 106688) bits.

    A larger exponent base for the (tail end of the) delta[] could cover the full range, so the last two points are separate issues.

    The window limit actually makes a lot of sense. In a random run of characters, they'll all be seen soon enough. There won't be many deltas above 5*(1<<BITS) #1; any further preprocessing would be wasted effort (on random patterns). Or, to put it another way, the limit acknowledges the hardware constraints (register, cache, memory sizes).

    The same limit can apply to other searches (B-M, skip search, etc.). With the limit in place, they all become O(1) in preprocessing. Effectively, you only search for a part of the pattern. The slow compare, however, is done on the full pattern.

    #1   exp(-5) = .0067
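
    (A sketch of where #1 comes from, assuming uniformly random haystack bits: the chance that one particular BITS-bit value never appears among n window positions is about

        (1 - 1/(1<<BITS))^n ≈ exp( -n/(1<<BITS) );  with n = 5*(1<<BITS), that gives exp(-5) ≈ .0067,

    so shifts longer than the window limit would almost never be usable anyway.)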

      Working code trumps theoretical efficiencies every time.

      And by "works" I can only mean: 'works here'!
