Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much

boolean calculation with very large data

by llancet (Friar)
on Sep 24, 2009 at 07:01 UTC ( #797136=perlquestion: print w/replies, xml ) Need Help??
llancet has asked for the wisdom of the Perl Monks concerning the following question:

In an recent work, I have to compare two splitted regions, and found the intersection (or substraction, etc) of it. To achieve this, I used string masking system like below:
my $region1=[[1,2],[5,7]]; my $region2=[[2,6]]; my $mask1=&toMask($region1); # 1100111 my $mask2=&toMask($region2); # 0111110 my $mask3=&maskAND($region1,$region2); # 0100110 sub toMask { my @regions=@_; my @result; foreach my $curr (@regions) { for (my $i=$curr->[0];$i<=$curr->[1];$i++) { $result[$i-1]=1; } } for (my $i=0;$i<@result;$i++) { $result[$i]=0 if ($result[$i]!=1); } return join "",@result; }
However, when the region is big, the masking string becomes critically large and caused memory problem. So how can I pack the region AoA to some binary thing directly?

Replies are listed 'Best First'.
Re: boolean calculation with very large data
by BrowserUk (Pope) on Sep 24, 2009 at 08:23 UTC

    See vec. By using bitstrings instead of bytestrings, your masks will be 1/8th the size (and the boolean operations a lot faster).

    With a million random generated intervals, using bytestrings consumed nearly 2 GB and 80+ seconds; the same intervals using bitstrings only 0.6GB and 40 seconds.

    C:\test>797136-a -INTERVALS=1e6 Check mem: 1.975 GB ( 82 seconds ) C:\test>797136-b -INTERVALS=1e6 Check mem: 0.616 GB ( 44 seconds )

    BTW: Yourcodewouldbealoteasiertoreadandeditifitcontained a few spaces!

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: boolean calculation with very large data
by Corion (Pope) on Sep 24, 2009 at 07:15 UTC

    If your area is "sparse", that is, the number of regions is far smaller than the number of points in your space, I think using arithmetic instead of a bitmask will be much faster.

    The two regions can be ordered by the numerical order of their lower bound, and if the two are equal, by their upper bound. Two regions then overlap iff the lower bound of the upper region is smaller than the upper bound of the lower region.

    Alternatively, if you have many one-dimensional regions, or regions with holes in them, and a large space, like Unicode has, Inversion Lists (also greping big numbers) might be a good approach to do fast set operations.

Re: boolean calculation with very large data
by moritz (Cardinal) on Sep 24, 2009 at 07:25 UTC
Re: boolean calculation with very large data
by ccn (Vicar) on Sep 24, 2009 at 07:26 UTC
    #!/usr/bin/perl -l my $region1=[[1,2],[5,7]]; my $region2=[[2,6]]; sub max { return $_[0] > $_[1] ? $_[0] : $_[1]; } sub min { return $_[0] < $_[1] ? $_[0] : $_[1]; } sub getIntersect { my ($reg1, $reg2) = @_; return if $reg1->[1] < $reg2->[0]; return if $reg1->[0] > $reg2->[1]; return (max($reg1->[0], $reg2->[0]), min($reg1->[1], $reg2->[1])); } foreach my $r1 (@$region1) { foreach my $r2 (@$region2) { if( my @is = getIntersect($r1, $r2) ) { print "[$is[0], $is[1]]"; } } }

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://797136]
Approved by moritz
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (3)
As of 2018-02-24 04:55 GMT
Find Nodes?
    Voting Booth?
    When it is dark outside I am happiest to see ...

    Results (310 votes). Check out past polls.