There's more than one way to do things PerlMonks

### Set intersection problem

by baxy77bax (Deacon)
 on May 23, 2023 at 08:05 UTC Need Help??

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Dear all

I am trying to compute the number of intersecting array elements for multiple arrays (essentially elements belonging to separate parts of venn diagrams). Initially and naively I thought this should be an easy thing to do and for 2 arrays it is:

1. Make a hash table for one array with keys being its elements
2. Loop through another array and count if an element exists in the previously created hash table
3. Subtract the number of shared elements from the total number of elements of each array.

But when the number of arrays increases the number of possible intersections between respective arrays increases (for A,B,C arrays we have A/B, A/C, B/C, A/B/C). So the problem I am having is how to scale the above algorithm.

Does anyone know a solution to this problem even if a limited number of sets can only be included (I think this is an NP problem thus any algorithm that can produce a solution for more than 7 arrays would be acceptable)

Thank you

Replies are listed 'Best First'.
Re: Set intersection problem
by hv (Prior) on May 23, 2023 at 12:45 UTC

I think a possible approach to this would look roughly like: create an individual hash lookup for each array, and an additional results hash containing all the elements; then for each element of the results hash, check its presence in each of the individual hashes and build up a representation (such as a bitvector) to reflect that. If you need to group elements by the type of intersection, you can store a result by the bitvector in yet another hash.

```my \$grouped = intersector([1..5], [3..7], [2, 6]);
print join(' ', @\$_), "\n" for values %\$grouped;
# output (in some order):
# 2
# 6
# 1
# 5 3 4
# 7
exit 0;

sub intersector {
# accept a list of arrayrefs
my @ar = @_;
my(%all, @single);
# initialize the results hash %all and all the individual hashes in
+@single
for my \$i (0 .. \$#ar) {
my \$a = \$ar[\$i];
@all{@\$a} = (undef) x @\$a;
\$single[\$i] = { map +(\$_ => undef), @\$a };
}
# now for each element, find its signature
for my \$el (keys %all) {
my \$sig = '';
for my \$i (0 .. \$#ar) {
vec(\$sig, \$i, 1) = 1 if exists \$single[\$i]{\$el};
}
# store the final signature in the results array if needed
\$all{\$el} = \$sig;
# store the element by its signature if needed
push @{ \$grouped{\$sig} }, \$el;
}
# now analyse further, or return a result
return \%grouped;
}

A space saving can be made by combining the two main loops, removing the need for @single; I'll leave that as an exercise for the reader.

Re: Set intersection problem
by salva (Canon) on May 23, 2023 at 14:02 UTC
I am not sure I completely understand your problem, so maybe the following idea doesn't make sense, but, have you consider inverting the approach?

For every element make an array with the names of the sets containing it. Once that is done, you can reverse it again, converting the list of sets into a key and use it to group the elements:

```# untested

my %sets = { A => [1,2,3], B => [3, 4], C => [1, 3, 4] }

my %in;

for my \$set (keys %sets) {
for my \$element (@{\$sets{\$set}}) {
push @{\$in[\$element] //= []}, \$set;
}
}

my %parts;
for my \$element (keys %in) {
my \$key = join("/", sort @{\$in{\$element}});
push @{\$parts{\$key} //= []}, \$element
}

Dump \%parts;
Re: Set intersection problem
by Corion (Patriarch) on May 23, 2023 at 08:12 UTC

I think you can reduce the problem to the problem of overlapping "axis aligned bounding boxes" ("AABB"). There are many algorithms to improve "collision detection" for these AABBs as the problem occurs often in computer graphics. Depending on whether this is a one-time setup or whether the arrays change over time, it can pay off to use a more complicated structure (quad-tree) over brute force.

Re: Set intersection problem
by kcott (Archbishop) on May 23, 2023 at 15:36 UTC

G'day baxy77bax,

"any algorithm that can produce a solution for more than 7 arrays would be acceptable"

Major Update: On reviewing my post, I noticed some of the intersections were missing: A-B-F, A-D, and so on. There was a major logic flaw in the 'for my \$elem (keys %temp) { ... }' loop. I've completely rewritten that loop, made a couple of minor changes elsewhere, and reposted the output. The previous code and output has been stricken and is in the spoiler for any who wish to review the changes.

```#!/usr/bin/env perl

use strict;
use warnings;

my %data = (
A => [qw{ant bee cat dog eel fly gnu hog}],
B => [qw{ant cat eel gnu}],
C => [qw{bee dog fly hog}],
D => [qw{cat fly ant}],
E => [qw{ant dog gnu}],
F => [qw{gnu bee}],
G => [qw{eel hog}],
H => [qw{zoo}],
I => [qw{}],
);

my %temp;

for my \$set (keys %data) {
my @elems = @{\$data{\$set}};
@elems = ('') unless @elems;

for my \$elem (@elems) {
push @{\$temp{\$elem}}, \$set;
}
}

my (%venn, %seen);

for my \$elem (sort keys %temp) {
my @sets = @{\$temp{\$elem}};

my \$glob_expr = '{' . join(',', @sets) . '}';

for my \$i (1 .. \$#sets) {
\$glob_expr .= '-{,' . join(',', @sets[\$i .. \$#sets]) . '}';
}

for my \$key (glob(\$glob_expr)) {
next if \$key =~ /([^-]+)(?=.*?\1)/;
\$key =~ y/-/-/s;
\$key =~ s/(?:^-|-\$)//g;
\$key = join '-', sort split /-/, \$key;
push @{\$venn{\$key}}, \$elem unless \$seen{\$key}{\$elem}++;
}
}

# DEMO: Output
my \$fmt = "%-7s  %s  %s\n";

{
no warnings 'qw';
printf \$fmt, qw{Sets # Elements};
printf \$fmt, qw{---- - --------};
}

printf \$fmt, \$_, 0+@{\$venn{\$_}}, "@{\$venn{\$_}}" for sort keys %venn;

Output:

```Sets     #  Elements
----     -  --------
A        8  ant bee cat dog eel fly gnu hog
A-B      4  ant cat eel gnu
A-B-D    2  ant cat
A-B-D-E  1  ant
A-B-E    2  ant gnu
A-B-E-F  1  gnu
A-B-F    1  gnu
A-B-G    1  eel
A-C      4  bee dog fly hog
A-C-D    1  fly
A-C-E    1  dog
A-C-F    1  bee
A-C-G    1  hog
A-D      3  ant cat fly
A-D-E    1  ant
A-E      3  ant dog gnu
A-E-F    1  gnu
A-F      2  bee gnu
A-G      2  eel hog
B        4  ant cat eel gnu
B-D      2  ant cat
B-D-E    1  ant
B-E      2  ant gnu
B-E-F    1  gnu
B-F      1  gnu
B-G      1  eel
C        4  bee dog fly hog
C-D      1  fly
C-E      1  dog
C-F      1  bee
C-G      1  hog
D        3  ant cat fly
D-E      1  ant
E        3  ant dog gnu
E-F      1  gnu
F        2  bee gnu
G        2  eel hog
H        1  zoo
I        1

— Ken

Re: Set intersection problem
by kikuchiyo (Hermit) on May 23, 2023 at 10:03 UTC
Re: Set intersection problem
by tybalt89 (Monsignor) on May 23, 2023 at 15:00 UTC

I'm not sure I understand your problem, but my guess goes something like this:

```#!/usr/bin/perl

use strict; # https://perlmonks.org/?node_id=11152381
use warnings;

my %sets = ( A => [1,2,3], B => [3, 4], C => [1, 3, 4] );
use Data::Dump 'dd'; dd 'original sets', \%sets;

my %intersection;
for my \$setname ( keys %sets )
{
\$intersection{\$_}{\$setname}++ for @{ \$sets{\$setname} };
}
use Data::Dump 'dd'; dd 'intersection', \%intersection;

print "items in more than one set:\n";
for my \$setname ( sort keys %intersection )
{
if( 1 < keys %{ \$intersection{\$setname} } )
{
my \$where = join ' and ', map "'\$_'",
sort keys %{ \$intersection{\$setname} };
print "    '\$setname' is in \$where\n";
}
}

Outputs (complete with intermediate calculations):

```(
"original sets",
{ A => [1, 2, 3], B => [3, 4], C => [1, 3, 4] },
)
(
"intersection",
{
1 => { A => 1, C => 1 },
2 => { A => 1 },
3 => { A => 1, B => 1, C => 1 },
4 => { B => 1, C => 1 },
},
)
items in more than one set:
'1' is in 'A' and 'C'
'3' is in 'A' and 'B' and 'C'
'4' is in 'B' and 'C'
Re: Set intersection problem
by tybalt89 (Monsignor) on May 23, 2023 at 15:30 UTC

If you don't care about which sets an item belongs to, and there are no duplicates within any set:

```#!/usr/bin/perl

use strict; # https://perlmonks.org/?node_id=11152381
use warnings;

my %sets = ( A => [1,2,3], B => [3, 4], C => [1, 3, 4] );

my %seen;
my @items = sort grep \$seen{\$_}++ == 1, map @\$_, values %sets;

print "Items in more than one set: @items\n";

Outputs:

```Items in more than one set: 1 3 4
Re: Set intersection problem
by Fletch (Bishop) on May 23, 2023 at 11:21 UTC

I have a possible solution in mind but so as to not muddy the waters with something irrelevant could you give a couple samples what your arrays look like and what you’re considering an intersection? E.g. if A were [3,4,5] and B [23,5,17,42] is the intersection you want [5] ? I think I follow but moar concrete data would help see if I’m on a track in the same neighborhood. . .

Edit: The post below by hv is along the lines I was thinking of actually so maybe not completely irrelevant . . .

The cake is a lie.
The cake is a lie.
The cake is a lie.

Re: Set intersection problem
by karlgoethebier (Abbot) on May 23, 2023 at 18:44 UTC

List::Compare might be worth a try.

«The Crux of the Biscuit is the Apostrophe»

Re: Set intersection problem
by Anonymous Monk on May 23, 2023 at 09:20 UTC
It's 2**n at worst and inductive:

You get the solution S(i) for i sets by combining set s_i with all members of S(i-1). For i=2..n and S(1)={s_1}

This can be further optimized - no point in trying (A/B)/C if A/C is empty - but 2**7 is only 128.

To speed up intersection you can use hash slices in Perl or specialized CPAN modules.

Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11152381]
Approved by Corion
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (5)
As of 2024-08-14 06:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
When will the AI bubble burst?

Results (29 votes). Check out past polls.

Notices?
 • erzuuli ‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.