counting pairwise incidences

reubs85 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I wonder if anyone could be so kind as to help me with a bit of a problem that has been frustrating the hell out of me for a while now! I'm sure it's relatively simple, but I am quite new to perl and my brain is frazzled...

The root of the problem is pretty basic; I have a data file that looks like this:

tomD gly4 phil
aesG tomD gly4 phil  
aesG phil 
aesG tomD gly4 
etc...
[download]

and what I would like to extract is the number of times along each line that each pairwise combination of names occurs, ie the number of times each pair of names occurs, summed over the whole file (apologies if this is a bit confusing...).

To clarify by example: for line 1 tomD_gly4 occurs once, tomD_phil occurs once and gly4_phil also occurs once, where an underscore between the two names simply indicates the relevant pairwise combo. Similarly, for line 2 aesG_tomD = 1, aesG_gly4 = 1, aesG_phil = 1, tomD_gly4 = 1, tomD_phil = 1 and gly4_phil = 1. So if the file were just these two lines then I would like an output that looked something like:

aesG_tomD = 1
aesG_gly4 = 1
aesG_phil = 1
tomD_gly4 = 2
tomD_phil = 2
gly4_phil = 2
[download]

thereby counting the number of pairwise occurrences of each possible combination across the file. To give it a bit of context, I'm counting the number of genes that are shared between two genomes. I've been wrestling with hash counts and what not but I can't think of a way to do it without having to declare a separate variable for each pairwise combination, a la:

my $tomD_phil;
etc...

foreach (@line_of_input_file) {

  if (($_ =~ /tomD/) and ($_ =~ /phil/)) { $tomD_phil++ }
  etc...  
}
[download]

but as the number of things increases the number of combo's increases exponentially and this becomes really repetitive and unfeasable (not to mention probably totally unnecessary and very amateurish...).

Any help would be fantastically appreciated, and I can give more info if it would help!

Cheers!

Comment on counting pairwise incidences Select or Download Code

Replies are listed 'Best First'.
Re: counting pairwise incidences by moritz (Cardinal) on May 17, 2011 at 14:03 UTC
When you count strings, think of a hash. I'm not sure if I understood what pairs you want counted. If a line has the entries a, b, c, d, do you want the pairs ab, ac, ad, bc, bd, cd? If yes, this code should do what you want: `use strict; use warnings; my %pairs; while (<DATA>) { my @names = split; for my $first (0..$#names - 1) { for my $second ($first + 1 .. $#names) { my $p = join '_', @names[$first, $second]; $pairs{$p}++; } } } while (my ($k, $v) = each %pairs) { print "$k: $v\n"; } __DATA__ tomD gly4 phil aesG tomD gly4 phil aesG phil aesG tomD gly4` [download] If not, you need to explain what pairs you want to count. Perl 6 - second systems done right	[reply] [d/l]
Re^2: counting pairwise incidences by ikegami (Patriarch) on May 17, 2011 at 17:26 UTC
Looks like ab ac ad to me, in which case it simplifies to: `use strict; use warnings; my %pairs; while (<DATA>) { my ($first, @names) = split; for my $second (@names) { my $p = join '_', $first, $second; ++$pairs{$p}; } } while (my ($k, $v) = each %pairs) { print "$k: $v\n"; } __DATA__ tomD gly4 phil aesG tomD gly4 phil aesG phil aesG tomD gly4` [download]	[reply] [d/l]
Re^2: counting pairwise incidences by LanX (Saint) on May 17, 2011 at 14:52 UTC
IMHO you should sort @names. apart from this it's practically the same algorith I use. :) Cheers Rolf	[reply]
Re: counting pairwise incidences by anonymized user 468275 (Curate) on May 17, 2011 at 14:20 UTC
The challenge is only to iterate all the 2-member subsets of the set on a per line basis. For example: `use Data::Dumper; my %pairs = (); while (<>) { chomp; my @found = split; for my $k1 ( @found ); for my $k2 ( @found ) { if ( $k2 gt $k1 ) { # ensuring each pair is counted once o +nly for the line $pairs{ $k1 }{ $k2 }++; } } } } print Dumper \%pairs;` [download] (updated thanks to syntax catch by Wind) One world, one people	[reply] [d/l]
Re: counting pairwise incidences by LanX (Saint) on May 17, 2011 at 14:18 UTC
If I understand you right, you need to count each pairwise combination in a hash. To avoid confusion between the combinations ("A","B") and ("B","A") you need to "normalize" i.e. in this case to sort the key "vector" into a unique "representative" (here ("A","B")). Now you can either use a HoH `$h{A}{B}++` or use this old multi-dim feature from Perl4 by separting keys by comma `$h{A,B}++`! I prefer the latter. Code example to follow...š UPDATE: IIRC comma as separator is more secure than using an arbitrary separator like "_" but since ~~I can't find the docs for multi dim keys and~~˛ your input already uses space as separator, you should stick with it. I.e. changing the code below to `$count{"@keys[$a,$b]"}++` `{ "aesG gly4" => 2, "aesG phil" => 2, "aesG tomD" => 2, "gly4 phil" => 2, "gly4 tomD" => 3, "phil tomD" => 2, }` [download] Cheers Rolf 1) `use Data::Dump "pp"; my %count; while(<DATA>) { chomp; @keys= sort split; for my $a (0..$#keys){ for my $b ($a+1.. $#keys) { # pp [$keys[$a],$keys[$b]]; $count{$keys[$a],$keys[$b]}++ } } } pp \%count; __DATA__ tomD gly4 phil aesG tomD gly4 phil aesG phil aesG tomD gly4` [download] prints `{ "aesG\34gly4" => 2, "aesG\34phil" => 2, "aesG\34tomD" => 2, "gly4\34phil" => 2, "gly4\34tomD" => 3, "phil\34tomD" => 2, }` [download] 2) see $;	[reply] [d/l] [select]
Re: counting pairwise incidences by jpl (Monk) on May 17, 2011 at 14:19 UTC
Or, as a minor variant of the code posted by moritz `use strict; use warnings; my %pairs; while (<DATA>) { my @names = split; # my @names = sort split; # if order shouldn't matter while ( @names > 1 ) { my $first = shift(@names); map { ++$pairs{ join '_', $first, $_ } } @names; } } while ( my ( $k, $v ) = each %pairs ) { print "$k: $v\n"; } __DATA__ tomD gly4 phil aesG tomD gly4 phil aesG phil aesG tomD gly4` [download]	[reply] [d/l]
Re^2: counting pairwise incidences by reubs85 (Acolyte) on May 19, 2011 at 13:33 UTC
Thanks everyone for their help! The order doesn't matter so the variant of the code supplied by moritz and jpl works a treat. Highly impressed by the speed of answering btw; I just checked back on the site I never expected replies to be so quick... long live the brotherhood! Cheers! Reuben	[reply]

Back to Seekers of Perl Wisdom