Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Locating a specified number of contiguous 0 bits within a large bitstring efficiently.

by BrowserUk (Pope)
on Jun 06, 2013 at 14:41 UTC ( #1037467=perlquestion: print w/ replies, xml ) Need Help??
BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Given a bitstring $vec upto 64MB long, how to quickly locate a run of (at least) N contiguous, unset bits regardless of alignment?

N might be any number from 1 to 100s

My current thinking is that for N > 8, B = int( N /8 ), b = N % 8; Search for B contiguous 0 bytes, then check the preceding and following bytes for at least b adjacent 0 bits.

For N < 8; for each N, only a subset of the 256 bit patterns in a byte have N unset contiguous bits. Ie.

  • 7 contiguous bits there are 2 byte values: 1 & 128
  • 6 contiguous bits (not including the above), there are 5 byte values: 2, 3, 64, 129, 192.
  • 5 contiguous bits (not including the above), there are 12 byte values.
  • 4 contiguous bits (not including the above), there are 28 byte values.
  • 3 contiguous bits (not including the above), there are 59 byte values.
  • 2 contiguous bits (not including the above), there are 94 byte values
  • 1 contiguous bits (not including the above), there are 54 byte values.
  • 0 contiguous bits, there is 1 byte value.

With a lookup table mapping the number of bits required to the bit patterns that can provide them, I have a way to locate these smaller runs.

Except that:

  1. Any one of the run lengths > 2 can be provided by many combinations of 2 adjacent bytes.
  2. The whole process is becoming both decision rich and search heavy.

So then I thought about having 8 more bit strings, where each bit represents a byte in the primary bitstring, one for each of the 8 sets above, where a set bit indicates that this byte in the primary bitstring can provide the requisite number of bits.

And when those bits (in the primary bitstring) are utilised, the one bit is toggled (off) in the secondary bitstring where is was found; and one bit is set in the secondary bitstring that represents the maximum number of unset bits it can now provide.

This reduces the searching to picking the appropriate secondary bitstring and looking for a non-zero byte. (It doesn't yet deal with the adjacent bytes with adjacent, contiguous bits problem.).

It also helps (??) with the large N problem, in that instead of needing to search the primary bitstring for N/8 contiguous 0 bytes; I can search the '7-bits' secondary bitstring for N/8 contiguous bits, potentially cutting the search time to 1/8th.

But of course, the same alignment problem that existed with the primary bit string for which the secondary bitstrings are used to solve, now manifests itself again.

  1. Do I implement a tertiary set of bitstrings (See A.)

Thoughts, comments, better solutions?

(For background and because someone is going to ask though it doesn't affect the problem or possible solutions. The primary bitstring represents freespace in an allocator. (Could be disk or memory.))


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
Download Code
Re: Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
by dave_the_m (Parson) on Jun 06, 2013 at 16:25 UTC
    You could do it fairly efficiently with a regex, giving 8 alternations which represent the 8 possible start positions within a byte. For example, matching 12 contiguous zeros might be done using something along the lines of
    / \x00 [\x00\x10\x20\x30..\xf0] | [\x00\\x01] [\x00\x20\x40\x60..\xe0] ... [\x00-\x7f] \x00 [\x00\x08\x10\x18..\xf8] /x
    (That's approx off the top of my head. It probably includes errors and endian mistakes. You'd want to write a function that generates it automatically.)

    Dave.

      Any thoughts on how to convert $-[0] byte position at which the match is found to a bit position?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I guess just see how many high bits are zero in the byte at position $-[0] ?

        Dave.

Re: Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
by FloydATC (Hermit) on Jun 06, 2013 at 16:48 UTC
    I have no idea how they perform, but wouldn't the Set::IntSpan::* family of modules have to do something like this in order to implement the "holes" and "cover" methods?

    -- FloydATC

    Time flies when you don't know what you're doing

      Set::IntSpan::* family of modules have to do something like this in order to implement the "holes" and "cover"

      This is a interesting idea. Thanks.

      However, Set::IntSpan that has those interesting methods seems to be both memory hungry and slow.

      The neither of the faster, more memory lean modules, Set::IntSpan::Fast & Set::IntSpan::Fast::XS do :(


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
by Grimy (Monk) on Jun 06, 2013 at 17:05 UTC
    use strict; use warnings; # $leading_0s[$n] is a regex matching any byte that begins with at lea +st $n '0' bits my @leading_0s = (qr//, map { my $max = chr(0xFF >> $_); qr/[\0-$max]/; } 1..8); # $trailing_0s[$n] is a regex matching any byte that ends with at leas +t $n '0' bits my @trailing_0s = (qr//, map { my $step = 1 << $_; my @ok = map { chr($step * $_) } 0..0xFF >> $_; local $" = q[]; qr/[\Q@ok\E]/ } 1..8); # Test if a string contains $n consecutive '0' bits. # If yes, returns the bit offset of the match. # Otherwise, returns -1. sub match_0s { my $n = shift; # number of consecutive '0' bits to be matched local $_ = shift || $_; # string to match against for my $trailing (0..8) { my $full = $n - $trailing >> 3; my $leading = $n - $trailing & 7; if (/$trailing_0s[$trailing]\0{$full}$leading_0s[$leading]/) { return $-[0] << 3 | -$trailing & 7; } } return -1; } $_ = "ABC`\0abc"; print match_0s(14), $/; # 27 print match_0s(15), $/; # -1 print match_0s(24, 0.0.0), $/; # 0

    Tested, but not benchmarked. It may or may not be faster that the straightforward approach using unpack. To change endianness, just switch @leading_0s and @trailing_0s.

    EDIT: now works even when the string is made entirely of null bytes.

    EDIT: doesn't work for $n < 8. This case is left as an exercise to the reader. (:</code>

      This works perfectly, and fairly quickly. Thankyou.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I don't see how this can work for $N<8; in fact, it doesn't:
        match_0s(1,"\201\201\201\201") => -1
Re: Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
by johngg (Abbot) on Jun 06, 2013 at 23:23 UTC

    To avoid questions of byte boundary alignment you could perhaps convert the vector to a character string and use regex matching with @- to find your offset. This seems to work quite quickly unless you look for a contiguous block of a length that doesn't actually exist. On my fairly old laptop that will take a little while before it decides it can't find the block.

    use strict; use warnings; use 5.014; use List::Util qw{ max }; use Time::HiRes qw{ gettimeofday tv_interval }; my $t0 = [ gettimeofday() ]; srand 1234567; my $vec = q{}; vec( $vec, 536_870_911, 1 ) = 0; vec( $vec, int rand 536_870_911, 1 ) = 1 for 1 .. 1e7; my $t1 = [ gettimeofday() ]; say qq{Creating vector - @{ [ tv_interval( $t0, $t1 ) ] }}; my $bitStr = unpack q{b*}, $vec; my $t2 = [ gettimeofday() ]; say qq{Unpacking bitstring - @{ [ tv_interval( $t1, $t2 ) ] }}; say qq{Longest contiguous block of zeros is }, max map length, $bitStr =~ m{(0+)}g, q{ bits long}; my $t3 = [ gettimeofday() ]; say qq{Finding longest block - @{ [ tv_interval( $t2, $t3 ) ] }}; for my $numZeros ( 25, 78, 307, 599, 943 ) { my $ts = [ gettimeofday() ]; say qq{At least $numZeros contiguous 0s }, $bitStr =~ m{(0{$numZeros,})} ? qq{found at offset $-[ 0 ], length @{ [ length $1 ] }} : q{could not be found}; say qq{ Search took - @{ [ tv_interval( $ts, [ gettimeofday() +] ) ] }}; }

    The output.

    Creating vector - 6.612482 Unpacking bitstring - 2.121185 Longest contiguous block of zeros is 843 Finding longest block - 9.848668 At least 25 contiguous 0s found at offset 0, length 37 Search took - 1.685613 At least 78 contiguous 0s found at offset 134, length 85 Search took - 0.725265 At least 307 contiguous 0s found at offset 31289, length 343 Search took - 0.702597 At least 599 contiguous 0s found at offset 5476471, length 625 Search took - 0.82269 At least 943 contiguous 0s could not be found Search took - 12.095307

    I don't know whether this method will be fast enough for your purposes but I think it is likely to be simpler that juggling byte boundaries. I hope this will be of use.

    Cheers,

    JohnGG

      Whilst this certainly works; the economics of either converting the bitvector to a bytevector for every search; or just storing the bytevector and dropping the bitvector just don't work.

      A 64MB bit vector can map (at least) a 4GB space, for an administration overhead of 1.5%.

      Using 1/2GB to map the 4GB raises that to 12.5%.

      And I'm hoping for much faster than 3/4 of a second average search time. Part of the advantage of using a bitvector is only having 1/8th of the string to search.

      Whether I can capitalise on that is still an open question :)


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Using index looking for contiguous NULL bytes in the vector seems to be much faster.

        use strict; use warnings; use 5.014; use List::Util qw{ max }; use Time::HiRes qw{ gettimeofday tv_interval }; my $t0 = [ gettimeofday() ]; srand 1234567; my $vec = q{}; vec( $vec, 536_870_911, 1 ) = 0; vec( $vec, $_ , 1 ) = 1 for 1 .. 21143; vec( $vec, int rand 536_870_911, 1 ) = 1 for 1 .. 1e7; my $t1 = [ gettimeofday() ]; say qq{Creating vector - @{ [ tv_interval( $t0, $t1 ) ] }}; my $bitStr = unpack q{b*}, $vec; my $t2 = [ gettimeofday() ]; say qq{Unpacking bitstring - @{ [ tv_interval( $t1, $t2 ) ] }}; say qq{Longest contiguous block of zeros is }, max map length, $bitStr =~ m{(0+)}g, q{ bits long}; my $t3 = [ gettimeofday() ]; say qq{Finding longest block - @{ [ tv_interval( $t2, $t3 ) ] }}; say qq{\nSearch using regex}; for my $numZeros ( 25, 10, 78, 3, 943, 307, 5, 599, 19, 345 ) { my $ts = [ gettimeofday() ]; say qq{At least $numZeros contiguous 0s }, $bitStr =~ m{(0{$numZeros,})} ? qq{found at offset $-[ 0 ], length @{ [ length $1 ] }} : q{could not be found}; say qq{ Search took - @{ [ tv_interval( $ts, [ gettimeofday() +] ) ] }}; } say qq{\nSearch using index}; for my $numZeros ( 25, 10, 78, 3, 943, 307, 5, 599, 19, 345 ) { my $ts = [ gettimeofday() ]; if ( $numZeros < 23 ) { my $buffer = q{}; my $offset = -1; my $bufStart = 0; my $lookFor = q{0} x $numZeros; while ( $bufStart < length $vec ) { $buffer = unpack q{b*}, substr $vec, $bufStart, 131; do { say qq{At least $numZeros contiguous 0s found at }, $bufStart * 8 + $offset; last; } unless ( $offset = index $buffer, $lookFor ) == -1; $bufStart += 128; } say qq{At least $numZeros contiguous 0s could not be found} if $offset == -1; } else { my $wholeBytes = int( ( $numZeros - 7 ) / 8 ); my $lookFor = qq{\0} x $wholeBytes; my $offset = -1; my $zerosToTheLeft = 0; my $zerosToTheRight = 0; while ( ( $offset = index $vec, $lookFor, $offset ) > -1 ) { $zerosToTheLeft = zerosToTheLeft( $offset ); $zerosToTheRight = zerosToTheRight( $offset, $wholeBytes ) +; last if ( $wholeBytes * 8 + $zerosToTheLeft + $zerosToTheR +ight ) >= $numZeros; $offset += $wholeBytes; } if ( $offset == -1 ) { say qq{At least $numZeros contiguous 0s could not be found +}; } else { say qq{At least $numZeros contiguous 0s found at }, $offset * 8 - $zerosToTheLeft; } } say qq{ Search took - @{ [ tv_interval( $ts, [ gettimeofday() +] ) ] }}; } sub zerosToTheLeft { my $offset = shift; return 0 unless $offset; my $byteStr = unpack q{b*}, substr $vec, $offset - 1, 1; return 0 unless $byteStr =~ m{(0+)$}; return length $1; } sub zerosToTheRight { my( $offset, $wholeBytes ) = @_; return 0 if ( $offset + $wholeBytes ) == length $vec; my $byteStr = unpack q{b*}, substr $vec, $offset + $wholeBytes, 2; return 0 unless $byteStr =~ m{^(0+)}; return length $1; }

        The output.

        Creating vector - 6.651795 Unpacking bitstring - 2.116776 Longest contiguous block of zeros is 843 Finding longest block - 9.871085 Search using regex At least 25 contiguous 0s found at offset 21144, length 65 Search took - 1.684111 At least 10 contiguous 0s found at offset 21144, length 65 Search took - 0.737168 At least 78 contiguous 0s found at offset 21302, length 94 Search took - 0.701232 At least 3 contiguous 0s found at offset 21144, length 65 Search took - 0.704558 At least 943 contiguous 0s could not be found Search took - 12.084963 At least 307 contiguous 0s found at offset 31289, length 343 Search took - 0.658849 At least 5 contiguous 0s found at offset 21144, length 65 Search took - 0.702935 At least 599 contiguous 0s found at offset 5476471, length 625 Search took - 0.822874 At least 19 contiguous 0s found at offset 21144, length 65 Search took - 0.702927 At least 345 contiguous 0s found at offset 70112, length 351 Search took - 0.703438 Search using index At least 25 contiguous 0s found at 21144 Search took - 6.5e-05 At least 10 contiguous 0s found at 21144 Search took - 0.000149 At least 78 contiguous 0s found at 21302 Search took - 4.1e-05 At least 3 contiguous 0s found at 21144 Search took - 0.00014 At least 943 contiguous 0s could not be found Search took - 0.959815 At least 307 contiguous 0s found at 31289 Search took - 8e-05 At least 5 contiguous 0s found at 21144 Search took - 0.00015 At least 599 contiguous 0s found at 5476471 Search took - 0.009874 At least 19 contiguous 0s found at 21144 Search took - 0.000154 At least 345 contiguous 0s found at 70112 Search took - 0.000122

        This looks a lot more encouraging, I hope it can be adapted for your needs.

        Cheers,

        JohnGG

Re: Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
by wrog (Monk) on Jun 07, 2013 at 00:16 UTC
    some observations, mainly about the N>=8 case and if you're going to do this with regexps:
    1. regexps are designed for searching forward; while there are backtracking/"look-behind" operators, they're rather ad hoc and sometimes a bad idea to use,

    2. the byte-classes you need for for "ends with a sequence of exactly n zero bits", that, because of (1), you'll want to use for anchoring your regexps, are much easier to construct if your bytes are taken to be little-endian (i.e., the high order bits of byte k are taken to be adjacent to the low order bits of byte k+1).
      E.g., the range for "ends with a sequence of exactly 1 zero bit" is 0x40-0x7f, "...2 zero bits" is 0x20-0x3f and so on.
      And while this messes up the corresponding expressions/classes for "begins with at least n zero bits", the "at least" vs. "exactly" makes a difference in that you now only have to check the zeroness of the first (N - n) mod 8 bits, so I think things are still easier in the little-endian world.
    Putting this all together we need a |-join of 8 regexps (n = 0..7 and m = N - n mod 8) of the form
    [2^(7-n) .. (2^(7-n+1)-1)] \0{N/8} (?: ($ch mod 2^m == 0) | "\0+" )
    modulo this not being quite correct perl-regexp code but you know how to fix that.
      Oh wait, this is simpler than I thought. We don't even need that last \0+

      Fleshing this out a bit more:

      # $pre[n] matches any character that ends with exactly n zero bits @pre = map { qr{[@{[chr(1<<(7-$_))]}-@{[chr((1<<(8-$_))-1)]}]} } 0 .. +7; $pre[0] = qr(^|$pre[0]); # $suf[m] matches if we have at least m zero bits following my @suf = ('', #done map { my $m=$_; qr{[@{[join '', map { chr($_<<$m) } 0..((1<<(8-$m))-1)]}]} } 1..7 ); sub zero_bit_regexp { my $N = shift; my $p = join '|', map { qr{$pre[$_]\0{@{[($N-$_)>>3]}}$suf[($N-$_)&0 +x7]} } 0..7; return qr{$p}; }
        except for the small matter that regexp quoting issues are turning out to be deadly, so better go with this instead
        # $pre[n] matches any character that ends with exactly n zero bits my @pre = map { sprintf "[\\x%02x-\\x%02x]", 1<<(7-$_), (1<<(8-$_))-1 +} 0 .. 7; $pre[0] = "(?:^|$pre[0])"; # $suf[m] matches any character that begins with at least m zero bits my @suf = ('', #m=0 -> match anything map { my $m=$_; '[' . join('', map { sprintf "\\x%02x", $_<<$m } 0..((1<<(8-$m))-1)) . ']' } 1..7); sub zero_bit_regexp { my $N = shift; my $p = join '|', map { "$pre[$_]\\0{@{[($N-$_)>>3]}}$suf[($N-$_)&0x +7]" } 0..7; return qr{$p}; }

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1037467]
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (18)
As of 2014-07-24 16:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (162 votes), past polls