Perl Monk, Perl Meditation PerlMonks

Finding all connected nodes in an all-against-all comparison

 on May 06, 2010 at 19:43 UTC Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have a dataset that consists of a bunch of elements (DNA sequences) that are compared in an all-against-all manner. In case it is relevant, the matches are not necessarily symmetric, so just because A matches B does not guarantee that B will match A. The output looks like this, where each line indicates a match between the sequence in column A and the sequence in column B based on my search criteria:

```Contig1  Contig2
Contig1  Contig3
Contig2  Contig1
Contig2  Contig3
Contig3  Contig1
Contig3  Contig2
Contig3  Contig4
Contig4  Contig3
Contig4  Contig5
Contig6  Contig7
Contig7  Contig6
Contig8  Contig9
Contig9  Contig10
Contig10 Contig8
Contig10 Contig11
Contig11 Contig10

I want to divide this list into groups where a group consists of all the elements that are connected by at least one edge (including non-reciprocal edges). So in the example data, I would have three groups:

```Group1: Contig1 Contig2 Contig3 Contig4 Contig5
Group2: Contig6 Contig7
Group3: Contig8 Contig9 Contig10 Contig11

I am at a lost to figure out an efficient way to do this in Perl, and any help would be greatly appreciated.

Replies are listed 'Best First'.
Re: Finding all connected nodes in an all-against-all comparison
by lamprecht (Friar) on May 06, 2010 at 22:00 UTC
Hi,

take a look at the Graph module:

```use warnings;
use strict;
use Data::Dumper;
use Graph;

my \$g = Graph->new(directed => 1);

while (my \$line = <DATA>){
my @nodes = split( /\s+/,\$line );
}
print Dumper \$g->weakly_connected_components;

__DATA__
Contig1  Contig2
Contig1  Contig3
Contig2  Contig1
Contig2  Contig3
Contig3  Contig1
Contig3  Contig2
Contig3  Contig4
Contig4  Contig3
Contig4  Contig5
Contig6  Contig7
Contig7  Contig6
Contig8  Contig9
Contig9  Contig10
Contig10 Contig8
Contig10 Contig11
Contig11 Contig10
Cheers, Christoph
UPDATE: ARGH sorry this reply was meant for the OP!

Thats really not trivial and more complicated than I thought because of edge cases (not covered in your sample code)

So I can fully recommend using an already tested and optimized graph module because all solutions I had so far need to be rewritten for performance reasons...

Cheers Rolf

Re: Finding all connected nodes in an all-against-all comparison
by rubasov (Friar) on May 06, 2010 at 21:26 UTC
I don't really know what you mean by "non-reciprocal edges", but if I get it right, then you want want to find all the connected components in a directed graph, so try this:
```use strict;
use warnings;
use Data::Dump qw( pp );

sub find_parts {
my %graph = %{ shift() };

my %seen;
my @parts;
my \$i = 0;

my \$helper;
\$helper = sub {
my \$start = shift;
return if \$seen{\$start}++;
push @{ \$parts[\$i] }, \$start;
\$helper->(\$_) for @{ \$graph{\$start} };
};

for ( keys %graph ) {
\$helper->(\$_);
\$i = \$#parts + 1;
}
undef \$helper;

return @parts;
}

my %graph;

while (<DATA>) {
my ( \$src, \$dst ) = split;
push @{ \$graph{\$src} }, \$dst;
}

pp \%graph;
pp find_parts( \%graph );

__DATA__
Contig1  Contig2
Contig1  Contig3
Contig2  Contig1
Contig2  Contig3
Contig3  Contig1
Contig3  Contig2
Contig3  Contig4
Contig4  Contig3
Contig4  Contig5
Contig6  Contig7
Contig7  Contig6
Contig8  Contig9
Contig9  Contig10
Contig10 Contig8
Contig10 Contig11
Contig11 Contig10
foo bar
bar foo
quux quux
I haven't tested all the edge cases, but this will give you the idea.

Hope that helps.

update: LanX is right, my code above does not work for the case he has shown. Something like this would have been better to implement: Tarjan's strongly connected components algorithm

Re: Finding all connected nodes in an all-against-all comparison
by BrowserUk (Pope) on May 06, 2010 at 22:52 UTC

```#! perl -slw
use strict;
use Data::Dump qw[ pp ];

my %h;
while( <DATA> ) {
chomp;
my( \$k, \$v ) = split;
push @{ \$h{ \$k } }, \$v;
push @{ \$h{ \$v } }, \$k;
}

my @keys =  sort{ substr( \$a, 6 ) <=> substr( \$b, 6 ) } keys %h;
my \$n = 0;
my %offsets = map{ \$_ => \$n++ } @keys;

for my \$k ( @keys ) {
vec( \$masks{ \$k }, \$offsets{ \$_ }, 1 ) = 1 for \$k, @{ \$h{ \$k } };
}

for my \$i ( 0 .. \$#keys ) {
for my \$j ( 0 .. \$#keys ) {
if( ( \$masks{ \$keys[ \$i ] } & \$masks{ \$keys[ \$j ] } ) ne chr(0
+)x2 ) {
}
}
}

my %uniq; \$uniq{ \$_ } = 1 for values %masks;

\$n = 0;
for my \$group ( keys %uniq ) {
printf "Group %d : ", ++\$n;
print join ' ', map{
\$keys[ \$_ ]
} grep{
vec( \$group, \$_, 1 )
} 0 .. \$#keys;
}

__DATA__
Contig1  Contig2
Contig1  Contig3
Contig2  Contig1
Contig2  Contig3
Contig3  Contig1
Contig3  Contig2
Contig3  Contig4
Contig4  Contig3
Contig4  Contig5
Contig6  Contig7
Contig7  Contig6
Contig8  Contig9
Contig9  Contig10
Contig10 Contig8
Contig10 Contig11
Contig11 Contig10

Gives:

```c:\test>838787
Group 1 : Contig8 Contig9 Contig10 Contig11
Group 2 : Contig6 Contig7
Group 3 : Contig1 Contig2 Contig3 Contig4 Contig5

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

```C12 Contig11
C12 Contig5

to the test data prints:

```Group 1 : Contig6 Contig7
Group 2 : C12 Contig3 Contig4 Contig5 Contig8 Contig9 Contig10 Contig1
+1
Group 3 : C12 Contig1 Contig2 Contig3 Contig4 Contig5 Contig8 Contig9
+Contig10 Contig11

where something like:

```Group 1 : Contig6 Contig7
Group 2 : C12 Contig1 Contig2 Contig3 Contig4 Contig5 Contig8 Contig9
+Contig10 Contig11

is expected, at least according to my understanding of the OP's 'connected by at least one edge (including non-reciprocal edges)' criteria.

True laziness is hard work

Indeed. I need to |= the sets both ways:

(Note: I've changed your C12 to Config12 because it was easier than re-writing the sort Which isn't really necessary anyway, but makes the output nicer.)

```#! perl -slw
use strict;
use Data::Dump qw[ pp ];

my %h;
while( <DATA> ) {
chomp;
my( \$k, \$v ) = split;
push @{ \$h{ \$k } }, \$v;
push @{ \$h{ \$v } }, \$k;
}

my @keys =  sort{ substr( \$a, 6 ) <=> substr( \$b, 6 ) } keys %h;
my \$n = 0;
my %offsets = map{ \$_ => \$n++ } @keys;

for my \$k ( @keys ) {
vec( \$masks{ \$k }, \$offsets{ \$_ }, 1 ) = 1 for \$k, @{ \$h{ \$k } };
}

for my \$i ( 0 .. \$#keys ) {
for my \$j ( 0 .. \$#keys ) {
if( ( \$masks{ \$keys[ \$i ] } &  \$masks{ \$keys[ \$j ] } ) ne chr(
+0)x2 ) {
}
}
}

my %uniq; \$uniq{ \$_ } = 1 for values %masks;

\$n = 0;
for my \$group ( keys %uniq ) {
printf "Group %d : ", ++\$n;
print join ' ', map{
\$keys[ \$_ ]
} grep{
vec( \$group, \$_, 1 )
} 0 .. \$#keys;
}

__DATA__
Contig1  Contig2
Contig1  Contig3
Contig2  Contig1
Contig2  Contig3
Contig3  Contig1
Contig3  Contig2
Contig3  Contig4
Contig4  Contig3
Contig4  Contig5
Contig6  Contig7
Contig7  Contig6
Contig8  Contig9
Contig9  Contig10
Contig10 Contig8
Contig10 Contig11
Contig11 Contig10
Contig12 Contig11
Contig12 Contig5

Gives:

```c:\test>838787
Group 1 : Contig6 Contig7
Group 2 : Contig1 Contig2 Contig3 Contig4 Contig5 Contig8 Contig9 Cont
+ig10 Contig11 Contig12

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding all connected nodes in an all-against-all comparison
by GrandFather (Sage) on May 08, 2010 at 02:24 UTC

This is a somewhat tricky problem because later data may require that separate groups that were formed by earlier data need to be merged. Where the code is a little tricky it's worth being a little long winded, use explicit identifier names and comment each case. Consider:

```use strict;
use List::Util;

my %nodes;
my @groups;

while (<DATA>) {
chomp;
my (\$node1, \$node2) = split;

if (! exists \$nodes{\$node1} && ! exists \$nodes{\$node2}) {
# New group
\$nodes{\$node1} = @groups;
\$nodes{\$node2} = @groups;
push @groups, [\$node1, \$node2];
next;
}

if (! exists \$nodes{\$node1}) {
# node1 is part of node2's group
push @{\$groups[\$nodes{\$node2}]}, \$node1;
\$nodes{\$node1} = \$nodes{\$node2};
}

if (! exists \$nodes{\$node2}) {
# node2 is part of node1's group
push @{\$groups[\$nodes{\$node1}]}, \$node2;
\$nodes{\$node2} = \$nodes{\$node1};
}

next # Already met this pairing
if \$nodes{\$node1} == \$nodes{\$node2};

# node1 and node2 are in different groups. Merge the groups
my (\$group, \$nulGroup) = (\$nodes{\$node1}, \$nodes{\$node2});

push @{\$groups[\$group]}, @{\$groups[\$nulGroup]};
\$nodes{\$_} = \$group for @{\$groups[\$nulGroup]};
\$groups[\$nulGroup] = undef;
}

@groups = grep {defined} @groups;

for my \$group (0 .. \$#groups) {
print 'Group', \$group + 1, ': ', join (', ', @{\$groups[\$group]}),
+"\n";
}

__DATA__
Contig1  Contig2
Contig1  Contig3
Contig2  Contig1
Contig2  Contig3
Contig3  Contig1
Contig3  Contig2
Contig3  Contig4
Contig4  Contig3
Contig4  Contig5
Contig6  Contig7
Contig7  Contig6
Contig8  Contig9
Contig9  Contig10
Contig10 Contig8
Contig10 Contig11
Contig11 Contig10
C12 Contig11
C12 Contig5

Prints:

```Group1: Contig6, Contig7
Group2: Contig8, Contig9, Contig10, Contig11, C12, Contig1, Contig2, C
+ontig3, Contig4, Contig5
True laziness is hard work
Re: Finding all connected nodes in an all-against-all comparison
by LanX (Bishop) on May 07, 2010 at 14:52 UTC
So which solution do you expect with this data?
```A C
B C

There is no path from A to B, so do you consider them in the same or different groups?

Or do you have a guaranty that these intersections are not possible?

Cheers Rolf

Create A New User
Node Status?
node history
Node Type: perlquestion [id://838787]
Approved by Corion
help
Chatterbox?
 [choroba]: Hasn't someone forgot to tell the weather about it?

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (4)
As of 2018-03-21 08:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
When I think of a mole I think of:

Results (265 votes). Check out past polls.

Notices?