Think about Loose Coupling PerlMonks

### Locating a specified number of contiguous 0 bits within a large bitstring efficiently.

by BrowserUk (Pope)
 on Jun 06, 2013 at 14:41 UTC Need Help??
BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Given a bitstring \$vec upto 64MB long, how to quickly locate a run of (at least) N contiguous, unset bits regardless of alignment?

N might be any number from 1 to 100s

My current thinking is that for N > 8, B = int( N /8 ), b = N % 8; Search for B contiguous 0 bytes, then check the preceding and following bytes for at least b adjacent 0 bits.

For N < 8; for each N, only a subset of the 256 bit patterns in a byte have N unset contiguous bits. Ie.

• 7 contiguous bits there are 2 byte values: 1 & 128
• 6 contiguous bits (not including the above), there are 5 byte values: 2, 3, 64, 129, 192.
• 5 contiguous bits (not including the above), there are 12 byte values.
• 4 contiguous bits (not including the above), there are 28 byte values.
• 3 contiguous bits (not including the above), there are 59 byte values.
• 2 contiguous bits (not including the above), there are 94 byte values
• 1 contiguous bits (not including the above), there are 54 byte values.
• 0 contiguous bits, there is 1 byte value.

With a lookup table mapping the number of bits required to the bit patterns that can provide them, I have a way to locate these smaller runs.

Except that:

1. Any one of the run lengths > 2 can be provided by many combinations of 2 adjacent bytes.
2. The whole process is becoming both decision rich and search heavy.

So then I thought about having 8 more bit strings, where each bit represents a byte in the primary bitstring, one for each of the 8 sets above, where a set bit indicates that this byte in the primary bitstring can provide the requisite number of bits.

And when those bits (in the primary bitstring) are utilised, the one bit is toggled (off) in the secondary bitstring where is was found; and one bit is set in the secondary bitstring that represents the maximum number of unset bits it can now provide.

This reduces the searching to picking the appropriate secondary bitstring and looking for a non-zero byte. (It doesn't yet deal with the adjacent bytes with adjacent, contiguous bits problem.).

It also helps (??) with the large N problem, in that instead of needing to search the primary bitstring for N/8 contiguous 0 bytes; I can search the '7-bits' secondary bitstring for N/8 contiguous bits, potentially cutting the search time to 1/8th.

But of course, the same alignment problem that existed with the primary bit string for which the secondary bitstrings are used to solve, now manifests itself again.

1. Do I implement a tertiary set of bitstrings (See A.)

Thoughts, comments, better solutions?

(For background and because someone is going to ask though it doesn't affect the problem or possible solutions. The primary bitstring represents freespace in an allocator. (Could be disk or memory.))

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
• Comment on Locating a specified number of contiguous 0 bits within a large bitstring efficiently.

Replies are listed 'Best First'.
Re: Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
by Grimy (Pilgrim) on Jun 06, 2013 at 17:05 UTC
```use strict;
use warnings;

# \$leading_0s[\$n] is a regex matching any byte that begins with at lea
+st \$n '0' bits
my @leading_0s = (qr//, map {
my \$max = chr(0xFF >> \$_);
qr/[\0-\$max]/;
} 1..8);

# \$trailing_0s[\$n] is a regex matching any byte that ends with at leas
+t \$n '0' bits
my @trailing_0s = (qr//, map {
my \$step = 1 << \$_;
my @ok = map { chr(\$step * \$_) } 0..0xFF >> \$_;
local \$" = q[];
qr/[\Q@ok\E]/
} 1..8);

# Test if a string contains \$n consecutive '0' bits.
# If yes, returns the bit offset of the match.
# Otherwise, returns -1.
sub match_0s {
my \$n = shift; # number of consecutive '0' bits to be matched
local \$_ = shift || \$_; # string to match against

for my \$trailing (0..8) {
my \$full = \$n - \$trailing >> 3;
my \$leading = \$n - \$trailing & 7;
return \$-[0] << 3 | -\$trailing & 7;
}
}

return -1;
}

\$_ = "ABC`\0abc";
print match_0s(14), \$/;        # 27
print match_0s(15), \$/;        # -1
print match_0s(24, 0.0.0), \$/; # 0

Tested, but not benchmarked. It may or may not be faster that the straightforward approach using unpack. To change endianness, just switch @leading_0s and @trailing_0s.

EDIT: now works even when the string is made entirely of null bytes.

EDIT: doesn't work for \$n < 8. This case is left as an exercise to the reader. (:

This works perfectly, and fairly quickly. Thankyou.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I don't see how this can work for \$N<8; in fact, it doesn't:
```match_0s(1,"\201\201\201\201")
=> -1
Re: Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
by dave_the_m (Prior) on Jun 06, 2013 at 16:25 UTC
You could do it fairly efficiently with a regex, giving 8 alternations which represent the 8 possible start positions within a byte. For example, matching 12 contiguous zeros might be done using something along the lines of
```/
\x00        [\x00\x10\x20\x30..\xf0]
|   [\x00\\x01] [\x00\x20\x40\x60..\xe0]
...
[\x00-\x7f] \x00 [\x00\x08\x10\x18..\xf8]
/x
(That's approx off the top of my head. It probably includes errors and endian mistakes. You'd want to write a function that generates it automatically.)

Dave.

Any thoughts on how to convert \$-[0] byte position at which the match is found to a bit position?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I guess just see how many high bits are zero in the byte at position \$-[0] ?

Dave.

Re: Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
by johngg (Abbot) on Jun 06, 2013 at 23:23 UTC

To avoid questions of byte boundary alignment you could perhaps convert the vector to a character string and use regex matching with @- to find your offset. This seems to work quite quickly unless you look for a contiguous block of a length that doesn't actually exist. On my fairly old laptop that will take a little while before it decides it can't find the block.

```use strict;
use warnings;

use 5.014;
use List::Util  qw{ max };
use Time::HiRes qw{ gettimeofday tv_interval };

my \$t0 = [ gettimeofday() ];
srand 1234567;

my \$vec = q{};
vec( \$vec, 536_870_911, 1 ) = 0;
vec( \$vec, int rand 536_870_911, 1 ) = 1 for 1 .. 1e7;

my \$t1 = [ gettimeofday() ];
say qq{Creating vector - @{ [ tv_interval( \$t0, \$t1 ) ] }};

my \$bitStr = unpack q{b*}, \$vec;

my \$t2 = [ gettimeofday() ];
say qq{Unpacking bitstring - @{ [ tv_interval( \$t1, \$t2 ) ] }};

say
qq{Longest contiguous block of zeros is },
max map length, \$bitStr =~ m{(0+)}g,
q{ bits long};

my \$t3 = [ gettimeofday() ];
say qq{Finding longest block - @{ [ tv_interval( \$t2, \$t3 ) ] }};

for my \$numZeros ( 25, 78, 307, 599, 943 )
{
my \$ts = [ gettimeofday() ];
say qq{At least \$numZeros contiguous 0s },
\$bitStr =~ m{(0{\$numZeros,})}
? qq{found at offset \$-[ 0 ], length @{ [ length \$1 ] }}
: q{could not be found};
say qq{     Search took - @{ [ tv_interval( \$ts, [ gettimeofday()
+] ) ] }};
}

The output.

```Creating vector - 6.612482
Unpacking bitstring - 2.121185
Longest contiguous block of zeros is 843
Finding longest block - 9.848668
At least 25 contiguous 0s found at offset 0, length 37
Search took - 1.685613
At least 78 contiguous 0s found at offset 134, length 85
Search took - 0.725265
At least 307 contiguous 0s found at offset 31289, length 343
Search took - 0.702597
At least 599 contiguous 0s found at offset 5476471, length 625
Search took - 0.82269
At least 943 contiguous 0s could not be found
Search took - 12.095307

I don't know whether this method will be fast enough for your purposes but I think it is likely to be simpler that juggling byte boundaries. I hope this will be of use.

Cheers,

JohnGG

Whilst this certainly works; the economics of either converting the bitvector to a bytevector for every search; or just storing the bytevector and dropping the bitvector just don't work.

A 64MB bit vector can map (at least) a 4GB space, for an administration overhead of 1.5%.

Using 1/2GB to map the 4GB raises that to 12.5%.

And I'm hoping for much faster than 3/4 of a second average search time. Part of the advantage of using a bitvector is only having 1/8th of the string to search.

Whether I can capitalise on that is still an open question :)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Using index looking for contiguous NULL bytes in the vector seems to be much faster.

```use strict;
use warnings;

use 5.014;
use List::Util  qw{ max };
use Time::HiRes qw{ gettimeofday tv_interval };

my \$t0 = [ gettimeofday() ];
srand 1234567;

my \$vec = q{};
vec( \$vec, 536_870_911, 1 ) = 0;
vec( \$vec, \$_         , 1 ) = 1 for 1 .. 21143;
vec( \$vec, int rand 536_870_911, 1 ) = 1 for 1 .. 1e7;

my \$t1 = [ gettimeofday() ];
say qq{Creating vector - @{ [ tv_interval( \$t0, \$t1 ) ] }};

my \$bitStr = unpack q{b*}, \$vec;

my \$t2 = [ gettimeofday() ];
say qq{Unpacking bitstring - @{ [ tv_interval( \$t1, \$t2 ) ] }};

say
qq{Longest contiguous block of zeros is },
max map length, \$bitStr =~ m{(0+)}g,
q{ bits long};

my \$t3 = [ gettimeofday() ];
say qq{Finding longest block - @{ [ tv_interval( \$t2, \$t3 ) ] }};

say qq{\nSearch using regex};
for my \$numZeros ( 25, 10, 78, 3, 943, 307, 5, 599, 19, 345 )
{
my \$ts = [ gettimeofday() ];

say
qq{At least \$numZeros contiguous 0s },
\$bitStr =~ m{(0{\$numZeros,})}
? qq{found at offset \$-[ 0 ], length @{ [ length \$1 ] }}
: q{could not be found};

say qq{     Search took - @{ [ tv_interval( \$ts, [ gettimeofday()
+] ) ] }};
}

say qq{\nSearch using index};
for my \$numZeros ( 25, 10, 78, 3, 943, 307, 5, 599, 19, 345 )
{
my \$ts = [ gettimeofday() ];

if ( \$numZeros < 23 )
{
my \$buffer   = q{};
my \$offset   = -1;
my \$bufStart = 0;
my \$lookFor  = q{0} x \$numZeros;
while ( \$bufStart < length \$vec )
{
\$buffer = unpack q{b*}, substr \$vec, \$bufStart, 131;
do {
say
qq{At least \$numZeros contiguous 0s found at },
\$bufStart * 8 + \$offset;
last;
} unless ( \$offset = index \$buffer, \$lookFor ) == -1;

\$bufStart += 128;
}
say qq{At least \$numZeros contiguous 0s could not be found}
if \$offset == -1;
}
else
{
my \$wholeBytes      = int( ( \$numZeros - 7 ) / 8 );
my \$lookFor         = qq{\0} x \$wholeBytes;
my \$offset          = -1;
my \$zerosToTheLeft  = 0;
my \$zerosToTheRight = 0;

while ( ( \$offset = index \$vec, \$lookFor, \$offset ) > -1 )
{
\$zerosToTheLeft  = zerosToTheLeft( \$offset );
\$zerosToTheRight = zerosToTheRight( \$offset, \$wholeBytes )
+;
last if ( \$wholeBytes * 8 + \$zerosToTheLeft + \$zerosToTheR
+ight )
>= \$numZeros;

\$offset += \$wholeBytes;
}

if ( \$offset == -1 )
{
say qq{At least \$numZeros contiguous 0s could not be found
+};
}
else
{
say
qq{At least \$numZeros contiguous 0s found at },
\$offset * 8 - \$zerosToTheLeft;
}
}

say qq{     Search took - @{ [ tv_interval( \$ts, [ gettimeofday()
+] ) ] }};
}

sub zerosToTheLeft
{
my \$offset = shift;
return 0 unless \$offset;
my \$byteStr = unpack q{b*}, substr \$vec, \$offset - 1, 1;
return 0 unless \$byteStr =~ m{(0+)\$};
return length \$1;
}

sub zerosToTheRight
{
my( \$offset, \$wholeBytes ) = @_;
return 0 if ( \$offset + \$wholeBytes ) == length \$vec;
my \$byteStr = unpack q{b*}, substr \$vec, \$offset + \$wholeBytes, 2;
return 0 unless \$byteStr =~ m{^(0+)};
return length \$1;
}

The output.

```Creating vector - 6.651795
Unpacking bitstring - 2.116776
Longest contiguous block of zeros is 843
Finding longest block - 9.871085

Search using regex
At least 25 contiguous 0s found at offset 21144, length 65
Search took - 1.684111
At least 10 contiguous 0s found at offset 21144, length 65
Search took - 0.737168
At least 78 contiguous 0s found at offset 21302, length 94
Search took - 0.701232
At least 3 contiguous 0s found at offset 21144, length 65
Search took - 0.704558
At least 943 contiguous 0s could not be found
Search took - 12.084963
At least 307 contiguous 0s found at offset 31289, length 343
Search took - 0.658849
At least 5 contiguous 0s found at offset 21144, length 65
Search took - 0.702935
At least 599 contiguous 0s found at offset 5476471, length 625
Search took - 0.822874
At least 19 contiguous 0s found at offset 21144, length 65
Search took - 0.702927
At least 345 contiguous 0s found at offset 70112, length 351
Search took - 0.703438

Search using index
At least 25 contiguous 0s found at 21144
Search took - 6.5e-05
At least 10 contiguous 0s found at 21144
Search took - 0.000149
At least 78 contiguous 0s found at 21302
Search took - 4.1e-05
At least 3 contiguous 0s found at 21144
Search took - 0.00014
At least 943 contiguous 0s could not be found
Search took - 0.959815
At least 307 contiguous 0s found at 31289
Search took - 8e-05
At least 5 contiguous 0s found at 21144
Search took - 0.00015
At least 599 contiguous 0s found at 5476471
Search took - 0.009874
At least 19 contiguous 0s found at 21144
Search took - 0.000154
At least 345 contiguous 0s found at 70112
Search took - 0.000122

This looks a lot more encouraging, I hope it can be adapted for your needs.

Cheers,

JohnGG

Re: Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
by FloydATC (Deacon) on Jun 06, 2013 at 16:48 UTC
I have no idea how they perform, but wouldn't the Set::IntSpan::* family of modules have to do something like this in order to implement the "holes" and "cover" methods?

-- FloydATC

Time flies when you don't know what you're doing

Set::IntSpan::* family of modules have to do something like this in order to implement the "holes" and "cover"

This is a interesting idea. Thanks.

However, Set::IntSpan that has those interesting methods seems to be both memory hungry and slow.

The neither of the faster, more memory lean modules, Set::IntSpan::Fast & Set::IntSpan::Fast::XS do :(

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Locating a specified number of contiguous 0 bits within a large bitstring efficiently.
by wrog (Friar) on Jun 07, 2013 at 00:16 UTC
some observations, mainly about the N>=8 case and if you're going to do this with regexps:
1. regexps are designed for searching forward; while there are backtracking/"look-behind" operators, they're rather ad hoc and sometimes a bad idea to use,

2. the byte-classes you need for for "ends with a sequence of exactly n zero bits", that, because of (1), you'll want to use for anchoring your regexps, are much easier to construct if your bytes are taken to be little-endian (i.e., the high order bits of byte k are taken to be adjacent to the low order bits of byte k+1).
E.g., the range for "ends with a sequence of exactly 1 zero bit" is 0x40-0x7f, "...2 zero bits" is 0x20-0x3f and so on.
And while this messes up the corresponding expressions/classes for "begins with at least n zero bits", the "at least" vs. "exactly" makes a difference in that you now only have to check the zeroness of the first (N - n) mod 8 bits, so I think things are still easier in the little-endian world.
Putting this all together we need a |-join of 8 regexps (n = 0..7 and m = N - n mod 8) of the form
```[2^(7-n) .. (2^(7-n+1)-1)] \0{N/8} (?: (\$ch mod 2^m == 0) | "\0+" )
modulo this not being quite correct perl-regexp code but you know how to fix that.
Oh wait, this is simpler than I thought. We don't even need that last \0+

Fleshing this out a bit more:

```# \$pre[n] matches any character that ends with exactly n zero bits
@pre = map { qr{[@{[chr(1<<(7-\$_))]}-@{[chr((1<<(8-\$_))-1)]}]} } 0 ..
+7;
\$pre[0] = qr(^|\$pre[0]);

# \$suf[m] matches if we have at least m zero bits following
my @suf = ('', #done
map { my \$m=\$_;
qr{[@{[join '', map { chr(\$_<<\$m) }
0..((1<<(8-\$m))-1)]}]} }
1..7
);

sub zero_bit_regexp {
my \$N = shift;
my \$p = join '|', map { qr{\$pre[\$_]\0{@{[(\$N-\$_)>>3]}}\$suf[(\$N-\$_)&0
+x7]} } 0..7;
return qr{\$p};
}

except for the small matter that regexp quoting issues are turning out to be deadly, so better go with this instead
```# \$pre[n] matches any character that ends with exactly n zero bits
my @pre = map { sprintf "[\\x%02x-\\x%02x]", 1<<(7-\$_), (1<<(8-\$_))-1
+} 0 .. 7;
\$pre[0] = "(?:^|\$pre[0])";

# \$suf[m] matches any character that begins with at least m zero bits
my @suf = ('', #m=0 -> match anything
map { my \$m=\$_;
'[' . join('', map { sprintf "\\x%02x", \$_<<\$m }
0..((1<<(8-\$m))-1))
. ']'
} 1..7);

sub zero_bit_regexp {
my \$N = shift;
my \$p = join '|', map { "\$pre[\$_]\\0{@{[(\$N-\$_)>>3]}}\$suf[(\$N-\$_)&0x
+7]" } 0..7;
return qr{\$p};
}

Create A New User
Node Status?
node history
Node Type: perlquestion [id://1037467]
Front-paged by Arunbear
help
Chatterbox?
 [Corion]: A good daypart!

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (7)
As of 2018-03-21 10:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
When I think of a mole I think of:

Results (265 votes). Check out past polls.

Notices?