Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Efficient selection mechanism?

by BrowserUk (Pope)
on Jan 14, 2014 at 14:21 UTC ( #1070558=perlquestion: print w/ replies, xml ) Need Help??
BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I have a AoA containing a couple of thousand arrays each of which contains 4 non-duplicate, small integers (say 0..19 ). eg.

my @AoA = ( [ 2, 13, 3, 16 ], [ 10, 1, 11, 6 ], [ 0, 10, 11, 19 ], [ 6, 1, 19, 15 ], [ 17, 6, 18, 12 ], ... );

Given that I have selected one of these subarrays (say, the first [ 2, 13, 3, 16 ]),

I then need to iterate the entire AoA to find all those that do not contain any of the integers in the selected subarray.

This needs to be done many times, so I'm looking for a reasonably efficient way to do so. Thoughts?

(I'm thinking about indexing; but how to do so efficiently without consuming huge amounts of space?)


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on Efficient selection mechanism?
Select or Download Code
Re: Efficient selection mechanism?
by choroba (Abbot) on Jan 14, 2014 at 14:36 UTC
    Just an idea using bit vectors: create a bit vector for each small integer, 1 means the integer is present in the current quadruple. Then just do logical OR an the four vectors corresponding to the selected integers and look for zeroes.
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Working code:
      #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; use constant MAX => 20; use constant EMPTY => "\0" x ((MAX / 8) + 1); my @AoA = map [ map int rand 1 + MAX, 1 .. 4 ], 0 .. 1000; my @vectors = (EMPTY) x (MAX + 1); for my $num (0 .. MAX) { for my $pos (0 .. $#AoA) { vec($vectors[$num], $pos, 1) = grep($num == $_, @{ $AoA[$pos] +}) ? 1 : 0; } } print "@$_; " for @AoA; print "\n"; my $subarray = $AoA[0]; say "Selecting: @$subarray."; my $result = EMPTY; for my $num (@$subarray) { $result |= $vectors[$num]; } for my $i (0 .. $#AoA) { say "@{ $AoA[$i] }" unless vec($result, $i, 1); }

      Update: Runs under 0.3s for 10_000 quadruples.

      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Efficient selection mechanism?
by Corion (Pope) on Jan 14, 2014 at 14:40 UTC

    This is a brute force approach:

    I would pack the four small integers as bits in a 32-bit integer instead of using an array. Then the problem reduces to scanning that array and checking each bit against the bitmask.

    If you sort the 32-bit integers, you get a quick way of eliminating half of the search space by looking at the highest bit.

Re: Efficient selection mechanism?
by salva (Monsignor) on Jan 14, 2014 at 14:41 UTC
    Supposing the number of distinct small numbers is really going to stay small and the number of entries on the array on the thousands:
    # untested! my %bitmap; for my $ix (0..$#AoA) { for my $sn (@{$AoA[$ix]}) { vec($bitmap{$sn}, $ix, 1) = 1; } } my @nope = (2, 13, 3, 16); my $bad = '' for my $bit (@nope) { $bad ||= $bitmap{$bit}; } my @ok = grep { !vec($bad, $_, 1) } 0..$#AoA
    That should use no more than a few KB and be O(N).
Re: Efficient selection mechanism?
by kcott (Abbot) on Jan 14, 2014 at 15:16 UTC

    G'day BrowserUk,

    Using your 5 subarrays, I selected the middle one ([ 0, 10, 11, 19 ]): that should find the first and last subarrays as being the only ones not containing any of the integers: 0, 10, 11, 19.

    I don't believe this should consume huge amounts of memory:

    #!/usr/bin/env perl -l use strict; use warnings; my @AoA = ( [ 2, 13, 3, 16 ], [ 10, 1, 11, 6 ], [ 0, 10, 11, 19 ], [ 6, 1, 19, 15 ], [ 17, 6, 18, 12 ], ); my $selected = $AoA[2]; my %unique = map { $_ => 1 } @$selected; print "@$_" for grep { ! map { $_ ? $_ : () } @unique{@$_} } @AoA;

    Output:

    2 13 3 16 17 6 18 12

    -- Ken

      Hi, this is also the approach that I was thinking to suggest. Using a hash to store the selected array is the easiest way. I do nt know whether it will be the fastest, but it will certainly be fairly fast.
Re: Efficient selection mechanism?
by hdb (Parson) on Jan 14, 2014 at 15:31 UTC

    I'm far too late, my approach is already covered in the posts above. However.

    I am translating the integers into a binary representation, pushing it onto each element (just for convenience) and the do a logical bitwise "and" to eliminate the unwanted elements:

    use strict; use warnings; use List::Util 'sum'; use Data::Dumper; sub asBinary { sum map { 2**$_ } @_ } my @AoA = ( [ 2, 13, 3, 16 ], [ 10, 1, 11, 6 ], [ 0, 10, 11, 19 ], [ 6, 1, 19, 15 ], [ 17, 6, 18, 12 ], ); push @$_, asBinary( @$_ ) for @AoA; my $mask = asBinary( @{$AoA[0]}[0..3] ); my @good = grep { ~$mask & $_->[4] } @AoA; print Dumper \@good;

    Clearly, this has an issue if your integers get larger than 31 or 63 depending on your system.

Re: Efficient selection mechanism?
by davido (Archbishop) on Jan 14, 2014 at 16:56 UTC

    Convert this AoA:

    my @control = ( [ 2, 13, 3, 16 ], [ 10, 1, 11, 6 ], );

    To this...

    # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 <-- Indicies rep +resent original ints. my @control = ( [ 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, ], [ 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ], );

    So that you can take a test array, "@test = ( 1, 4, 7, 9 )" and ask if sum(@{control[$n]}[@test]) == 0

    If you turn off warnings for undefined values in addition, you won't even need to worry about placing zeros; only place '1' where needed. ;)

    That becomes an O(n) solution that benefits from the fact that array slices are fast, and List::Util::sum is XS. Wrap it in a grep, and it should fly:

    my @valid_row_ix = grep { sum( @{$control[$_]}[@test] ) == 0 } 0 .. $# +control;

    Update:

    I'm not benchmarking, but it might be useful instead to try:

    my @valid_row_ix = grep { List::MoreUtils::none { $_ == 1 } @{$control +[$_]}[@test] } 0 .. $#control;

    ...since it will short-circuit out of the "none" loop as soon as a 1 is detected, whereas 'sum' will look at all elements in your test array. However, the "$_==1" portion drops from XS back into a MULTICALL pure-Perl sub, which is more computationally expensive per iteration than a simple sum. Since the array you're testing seems to be small, my bet is with the 'sum' solution.


    Dave

Re: Efficient selection mechanism?
by ikegami (Pope) on Jan 14, 2014 at 17:13 UTC
    Not indexed, but a regex should search an 8k string pretty fast.
    my $AoA = pack 'C*', map @$_, @AoA; my $omit = $AoA[0]; my $pat = '[^'.( join '', map quotemeta, pack 'C*', @$omit ).']{4}'; my $re = qr/\G(?:.{4})*?($pat)/s; my @matches = map [ unpack 'C*', $_ ], $AoA =~ /$re/g;

    The following avoids recreating the original arrays at the expense of two bytes per element of @AoA.

    my $AoA = pack '(C4S)*', map { @{ $AoA[$_] }, $_ } 0..$#AoA; my $omit = $AoA[0]; my $pat = '[^'.( join '', map quotemeta, pack 'C*', @$omit ).']{4}'; my $re = qr/\G(?:.{6})*?$pat(..)/s; my @matches = @AoA[ map unpack 'S', $AoA =~ /$re/g ];
Re: Efficient selection mechanism?
by Lennotoecom (Monk) on Jan 14, 2014 at 19:53 UTC
    about regexpes,as ikegami offered,
    my humble imput, turn every number into a letter from a to t
    0 - 19 and match the original line upon every:
    $v[$i++] = $_ for 'a'..'t'; @a = qw/2 13 3 16/; $a = join '', map {$v[$_]} @a; #generating 1 000 000 base of numbers for (0..1000000){ %rnd = (); while(keys %rnd < 4){$rnd{$v[int(rand(20))]} = undef;} push @b, join '', keys %rnd; } #doing search for (@b){ $counter++ if !/[$a]/; } print "$counter\n";
    on my pc
    1 000 000 base generated in 4 seconds,
    search upon it done in 1 second
    around ~370 000 unique numbers.
Re: Efficient selection mechanism?
by oiskuu (Pilgrim) on Jan 14, 2014 at 23:01 UTC
    Here's a small bench to highlight the speed difference in some of the chosen strategies.
    #! /usr/bin/perl -wl use Benchmark 'cmpthese'; my (@v, @b); @v[0 .. 19] = 'a' .. 'z'; for (1..1e6) { my %r; undef $r{$v[rand @v]} until keys %r == 4; push @b, join '', keys %r; } my @AoA = map [map ord()-97, split //,$_], @b; my @vectors; for my $t (@v) { push @vectors, pack "b*", pack "c*", map !!/$t/, @b; } my $AoA = pack 'C*', map @$_, @AoA; cmpthese -5, { bitmap => sub { my $res = ''; $res |= $vectors[$_] for @{$AoA[0]}; int(@AoA) - unpack "%32b*", $res; }, regex1 => sub { my $omit = $AoA[0]; my $pat = '[^'.( join '', map quotemeta, pack 'C*', @$omit ).' +]{4}'; my $re = qr/\G(?:.{4})*?($pat)/s; int(() = $AoA =~ /$re/g); }, regex2 => sub { my $pat = qr/[$b[0]]/; my $cnt = 0; $cnt += !/$pat/ for @b; $cnt; }, };
Re: Efficient selection mechanism?
by Anonymous Monk on Jan 14, 2014 at 23:48 UTC
    How many is "many?"

      What does it matter to you?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Efficient selection mechanism? (Thank you all)
by BrowserUk (Pope) on Jan 15, 2014 at 11:48 UTC

    Thank you all for your suggestions.

    As oiskuu demonstrated, the bit mapped index -- as suggested by Corion, Salva, Choroba & hdb -- is hands down winner in the performance stakes.

    By using vec and string-wise boolean operations (Salva,Choroba) rather than numeral ops, I don't have to worry about the size of the small integers outgrowing the platform integer size, which is a slight but possible consideration.

    Mixing code from various solutions, this is what I'm using:

    use constant MAX => 20; use constant BLANK => chr(0) x ((MAX / 8) + 1); ... my $key = BLANK; vec( $key, $_, 1 ) = 1 for @$_; $cache{ $key } = $_; ... my @subsel = grep{ ( $_ & $mask ) eq BLANK } keys %cache;

    And that's it. Thank you all.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I learnt the craft of vec from you here on PerlMonks.
      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        Sometimes, you just need to see your problem through someone else's eyes.

        I'd been faffing around with multi-level hashes so that given the first quad's numbers I could use them to index down four levels to just that subset that didn't contain those four numbers. Which ought to work, but proved to be clumsy and produced a huge, unwieldy data structure.

        But the real problem is that once I've found the second non-overlapping quad I then want to find a third that doesn't overlap either of the first two; then a fourth that doesn't overlap any of the first three. And that requires two more, deeper, multi-level data structures be built.

        With the vec solution I just OR the masks and grep the previous subset again to produce the next.

        Simple once you've seen it, but I was so locked in on my multi-level hash approach that using bitmasks didn't cross my mind.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1070558]
Approved by hdb
Front-paged by rminner
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (10)
As of 2014-07-25 03:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (167 votes), past polls