http://www.perlmonks.org?node_id=905267

reubs85 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I wonder if anyone could be so kind as to help me with a bit of a problem that has been frustrating the hell out of me for a while now! I'm sure it's relatively simple, but I am quite new to perl and my brain is frazzled...

The root of the problem is pretty basic; I have a data file that looks like this:

tomD gly4 phil aesG tomD gly4 phil aesG phil aesG tomD gly4 etc...

and what I would like to extract is the number of times along each line that each pairwise combination of names occurs, ie the number of times each pair of names occurs, summed over the whole file (apologies if this is a bit confusing...).

To clarify by example: for line 1 tomD_gly4 occurs once, tomD_phil occurs once and gly4_phil also occurs once, where an underscore between the two names simply indicates the relevant pairwise combo. Similarly, for line 2 aesG_tomD = 1, aesG_gly4 = 1, aesG_phil = 1, tomD_gly4 = 1, tomD_phil = 1 and gly4_phil = 1. So if the file were just these two lines then I would like an output that looked something like:

aesG_tomD = 1 aesG_gly4 = 1 aesG_phil = 1 tomD_gly4 = 2 tomD_phil = 2 gly4_phil = 2

thereby counting the number of pairwise occurrences of each possible combination across the file. To give it a bit of context, I'm counting the number of genes that are shared between two genomes. I've been wrestling with hash counts and what not but I can't think of a way to do it without having to declare a separate variable for each pairwise combination, a la:

my $tomD_phil; etc... foreach (@line_of_input_file) { if (($_ =~ /tomD/) and ($_ =~ /phil/)) { $tomD_phil++ } etc... }

but as the number of things increases the number of combo's increases exponentially and this becomes really repetitive and unfeasable (not to mention probably totally unnecessary and very amateurish...).

Any help would be fantastically appreciated, and I can give more info if it would help!

Cheers!

Replies are listed 'Best First'.
Re: counting pairwise incidences
by moritz (Cardinal) on May 17, 2011 at 14:03 UTC
    When you count strings, think of a hash.

    I'm not sure if I understood what pairs you want counted. If a line has the entries a, b, c, d, do you want the pairs ab, ac, ad, bc, bd, cd?

    If yes, this code should do what you want:

    use strict; use warnings; my %pairs; while (<DATA>) { my @names = split; for my $first (0..$#names - 1) { for my $second ($first + 1 .. $#names) { my $p = join '_', @names[$first, $second]; $pairs{$p}++; } } } while (my ($k, $v) = each %pairs) { print "$k: $v\n"; } __DATA__ tomD gly4 phil aesG tomD gly4 phil aesG phil aesG tomD gly4

    If not, you need to explain what pairs you want to count.

      Looks like ab ac ad to me, in which case it simplifies to:
      use strict; use warnings; my %pairs; while (<DATA>) { my ($first, @names) = split; for my $second (@names) { my $p = join '_', $first, $second; ++$pairs{$p}; } } while (my ($k, $v) = each %pairs) { print "$k: $v\n"; } __DATA__ tomD gly4 phil aesG tomD gly4 phil aesG phil aesG tomD gly4
      IMHO you should sort @names.

      apart from this it's practically the same algorith I use. :)

      Cheers Rolf

Re: counting pairwise incidences
by anonymized user 468275 (Curate) on May 17, 2011 at 14:20 UTC
    The challenge is only to iterate all the 2-member subsets of the set on a per line basis. For example:
    use Data::Dumper; my %pairs = (); while (<>) { chomp; my @found = split; for my $k1 ( @found ); for my $k2 ( @found ) { if ( $k2 gt $k1 ) { # ensuring each pair is counted once o +nly for the line $pairs{ $k1 }{ $k2 }++; } } } } print Dumper \%pairs;
    (updated thanks to syntax catch by Wind)

    One world, one people

Re: counting pairwise incidences
by LanX (Saint) on May 17, 2011 at 14:18 UTC
    If I understand you right, you need to count each pairwise combination in a hash.

    To avoid confusion between the combinations ("A","B") and ("B","A") you need to "normalize" i.e. in this case to sort the key "vector" into a unique "representative" (here ("A","B")).

    Now you can either use a HoH $h{A}{B}++ or use this old multi-dim feature from Perl4 by separting keys by comma $h{A,B}++!

    I prefer the latter.

    Code example to follow...¹

    UPDATE: IIRC comma as separator is more secure than using an arbitrary separator like "_" but since I can't find the docs for multi dim keys and² your input already uses space as separator, you should stick with it. I.e. changing the code below to $count{"@keys[$a,$b]"}++

    { "aesG gly4" => 2, "aesG phil" => 2, "aesG tomD" => 2, "gly4 phil" => 2, "gly4 tomD" => 3, "phil tomD" => 2, }

    Cheers Rolf

    1)

    use Data::Dump "pp"; my %count; while(<DATA>) { chomp; @keys= sort split; for my $a (0..$#keys){ for my $b ($a+1.. $#keys) { # pp [$keys[$a],$keys[$b]]; $count{$keys[$a],$keys[$b]}++ } } } pp \%count; __DATA__ tomD gly4 phil aesG tomD gly4 phil aesG phil aesG tomD gly4

    prints

    { "aesG\34gly4" => 2, "aesG\34phil" => 2, "aesG\34tomD" => 2, "gly4\34phil" => 2, "gly4\34tomD" => 3, "phil\34tomD" => 2, }

    2) see $;

Re: counting pairwise incidences
by jpl (Monk) on May 17, 2011 at 14:19 UTC
    Or, as a minor variant of the code posted by moritz
    use strict; use warnings; my %pairs; while (<DATA>) { my @names = split; # my @names = sort split; # if order shouldn't matter while ( @names > 1 ) { my $first = shift(@names); map { ++$pairs{ join '_', $first, $_ } } @names; } } while ( my ( $k, $v ) = each %pairs ) { print "$k: $v\n"; } __DATA__ tomD gly4 phil aesG tomD gly4 phil aesG phil aesG tomD gly4

      Thanks everyone for their help! The order doesn't matter so the variant of the code supplied by moritz and jpl works a treat. Highly impressed by the speed of answering btw; I just checked back on the site I never expected replies to be so quick... long live the brotherhood!

      Cheers!

      Reuben