|
BhariD has asked for the
wisdom of the Perl Monks concerning the following question:
I have the following input file. The input file include pairs of two id's separated by tab. Each line with a pair is a combination and has relevance.
DATA
YP_01 NP_02
NP_02 YP_01
YP_01 NP_03
NP_03 YP_01
NP_02 NP_03
NP_03 NP_02
NP_04 NP_05
.....
Here is the code which I am using to get the reciprocal best hit pair from this input file. For example, in above input file, YP_01\tNP_02 and NP_02\tYP_01, is reciprocal best hit pair. And NP_04\tNP_05 is not reciprocal hit pair as it lacks the reciprocal combination in the file i.e. NP_05\tNP_04.
open (DATA, $results) || die "couldn't open the file!";
my %unique = ();
while( <DATA> ){
chomp;
$unique{ join( "\t", sort split /\t/ ) } ++;
}
my @pairs = ();
my @not_pairs = ();
for my $item (sort keys %unique ){
if( $unique{$item} > 1 ){
push @pairs, $item;
}else{
push @not_pairs, $item;
}
}
print join( "\n", @pairs ), "\n";
This gives the following result file
YP_01 NP_02
YP_01 NP_03
NP_02 NP_03
The problem is now I need to modify the code such that for above result file it produces the following result file instead
YP_01 NP_02
YP_01 NP_03
3
The idea is if YP_01 is paired with NP_02, and NP_03 and if NP_02 is paired with NP_03, then I do not need the pair NP_02 and NP_03 printed in the output file as their info already there in combination with YP_01. But I do need to know whether that pair NP_02 and NP_03 existed or not and that is why I need towards the end a #. As shown above I have "3" here in the end representing that the three best hit pairs are present (YP_01 and NP_02; YP_01 and NP_03; NP_02 and NP_03). If NP_02 and NP_03 was not present, then the number should change to 2 showing that out of 3 only 2 comparison are there.
could anyone help me with this. please let me know if you need further clarification
Re: help needed in modifying the code for counting possible combinations by Anonymous Monk on Oct 30, 2009 at 19:15 UTC |
| [reply] [d/l] |
Re: help needed in modifying the code for counting possible combinations by gman (Pilgrim) on Oct 30, 2009 at 20:16 UTC |
ok, maybe way off here but I'll give it a try
#!/usr/bin/perl
#
use Data::Dumper;
my $datafile = "test.dat";
my %finalQ = ();
open(FH, "<$datafile") || die ;
my %combined = ();
while( <FH> ){
chomp;
my ($key,$value) = sort (split(/\t/,$_));
$combined{$key}=$value;
}
my %count = ();
my %reverse = reverse(%combined);
foreach my $key (sort %combined) {
print "$key => $combined{$key}\n";
if ( exists $reverse{$key} ) {
$count{"$key\t$reverse{$key}"}+=2;
print "$key => $reverse{$key} :: match\n";
} else {
$count{"$key\t$combined{$key}"}++;
}
}
print Dumper(%count);
output
$VAR1 = ' ';
$VAR2 = 4;
$VAR3 = 'NP_05 NP_04';
$VAR4 = 2;
$VAR5 = 'YP_01 NP_03';
$VAR6 = 2;
$VAR7 = 'NP_03 NP_02';
$VAR8 = 4;
$VAR9 = 'NP_02 NP_03';
$VAR10 = 1;
$VAR11 = 'NP_04 NP_05';
$VAR12 = 1;
| [reply] [d/l] |
Re: help needed in modifying the code for counting possible combinations by GrandFather (Bishop) on Oct 30, 2009 at 22:34 UTC |
use strict;
use warnings;
my %pairs;
while (<DATA>) {
chomp;
next if ! length;
my @pair = sort split;
++$pairs{$pair[0]}{$pair[1]};
}
my $totalHits = 0;
my %seconds;
for my $first (sort keys %pairs) {
for my $second (sort keys %{$pairs{$first}}) {
next if $pairs{$first}{$second} <= 1;
++$totalHits;
my $previous =
grep {exists $pairs{$_}{$first} && exists $pairs{$_}{$seco
+nd}}
keys %{$seconds{$first}};
++$seconds{$second}{$first};
next if $previous;
print "$first $second\n";
}
}
print "$totalHits\n";
__DATA__
NP_01 NP_02
NP_02 NP_01
NP_01 NP_03
NP_03 NP_01
NP_02 NP_03
NP_03 NP_02
NP_04 NP_05
Prints:
NP_01 NP_02
NP_01 NP_03
3
Note that I changed YP_01 to NP_01 to avoid inconsistencies between your reported results and the actual results.
True laziness is hard work
| [reply] [d/l] [select] |
|
This is awesome! Thank you so much GrandFather. I tried with the following data:
__DATA__
NP_01 NP_02
NP_02 NP_01
NP_01 NP_03
NP_03 NP_01
NP_02 NP_03
NP_03 NP_02
NP_04 NP_05
NP_06 NP_07
NP_07 NP_06
This prints:
NP_01 NP_02
NP_01 NP_03
NP_06 NP_07
4
I want the output to be like this instead:
NP_01 NP_02
NP_01 NP_03
3
NP_06 NP_07
1
3 for the presence of all three NP_01-NP_02, NP_01-NP_03, NP_02-NP_03 possible pairs in the file. Lets say, if NP_02-NP_03 combination was not present in the file then the number should become 2 showing that NP_02-NP_03 combo does not exist in the file. Any suggestion how can I get this from your code.
Example in case when NP_02-NP_03 reciprocal pair does not exist in the file and the required output
__DATA__
NP_01 NP_02
NP_02 NP_01
NP_01 NP_03
NP_03 NP_01
NP_04 NP_05
NP_06 NP_07
NP_07 NP_06
prints:
NP_01 NP_02
NP_01 NP_03
2 [not present in NP_02 NP_03]
NP_06 NP_07
1
| [reply] [d/l] [select] |
|
That makes it more interesting. If you discover any more interesting cases however you'd better tell us what the application actually is and give us the bigger picture.
use strict;
use warnings;
my %pairs;
while (<DATA>) {
chomp;
next if ! length;
my @pair = sort split;
++$pairs{$pair[0]}{$pair[1]};
}
for my $first (sort keys %pairs) {
my @hits;
my $count = 0;
my @implied;
for my $second (keys %{$pairs{$first}}) {
next if $pairs{$first}{$second} < 2;
push @implied, $second;
++$count;
if (exists $pairs{$second}{$first}) {
delete $pairs{$second}{$first};
++$count;
}
print "$first $second\n";
}
next if ! $count;
@implied = sort @implied;
if (@implied == 2 && $pairs{$implied[0]}{$implied[1]}) {
++$count;
delete $pairs{$implied[0]}{$implied[1]};
@implied = ();
}
print "$count";
print " [not present in @implied]" if @implied == 2;
print "\n\n"
}
__DATA__
AP_01 AP_02
AP_02 AP_01
AP_01 AP_03
AP_03 AP_01
NP_01 NP_02
NP_02 NP_01
NP_01 NP_03
NP_03 NP_01
NP_02 NP_03
NP_03 NP_02
NP_04 NP_05
NP_06 NP_07
NP_07 NP_06
Prints:
AP_01 AP_02
AP_01 AP_03
2 [not present in AP_02 AP_03]
NP_01 NP_03
NP_01 NP_02
3
NP_06 NP_07
1
I added the AP_dd set to make it clearer which groups were which and to provide data to demonstrate the third case.
True laziness is hard work
| [reply] [d/l] [select] |
|
Back to
Seekers of Perl Wisdom
|