Others have addressed your immediate problem, but there is plenty of other help to be provided. Consider the following:
#!/usr/bin/perl
use strict;
use warnings;
my $tax2locus_file;    # must be set to the path of the tax2locus file before the open below
my %tax2loc;
open my $in, '<', $tax2locus_file or die "Can't open $tax2locus_file: $!\n";
while (<$in>) {
    chomp;
    my ($taxid, $locus) = split /\t/;
    $tax2loc{$locus} = $taxid;
}
close ($in);
print "there are\t" . keys (%tax2loc) . "\tlocus_ids as key in hash\n";
############### Now read in sharedTab file with pairwise overlap info
my $sharedTab_file = $ARGV[0];
my $outfile = "$sharedTab_file.hostinfo";
open my $out, '>', $outfile or die "Can't create $outfile: $!\n";
open $in, '<', $sharedTab_file or die "Can't open $sharedTab_file: $!\n";
print $out "#prophageA\tprophageB\thostA\ttaxidA\thostB\ttaxidB\tjacc\n";
while (<$in>) {
    chomp;
    next if (/^#/);    # ignore comments
    my @columns = split (/\t/, $_);
    # look up both prophages of the pair; the second is assumed to be in column 1
    my ($prophageA, $hostA, $taxidA) = getTaxId($columns[0], \%tax2loc);
    my ($prophageB, $hostB, $taxidB) = getTaxId($columns[1], \%tax2loc);
    print $out join ("\t",
        $prophageA, $prophageB, $hostA, $taxidA, $hostB, $taxidB, $columns[5]),
        "\n";
}
sub getTaxId {
    my ($prophage, $lu) = @_;
    my ($host, $PFnum) = split /\./, $prophage;
    ## for wgs genomes just match first 7 characters as only NZ_XXXX000000 are
    ## in tax2locus
    $host =~ s/^(NZ.{5}).*/$1/;
    # \Q...\E stops any regex metacharacters in $host being interpreted
    my @matches = grep {$_ =~ /\Q$host\E/} keys %$lu;
    die "Expected exactly one match for $host. Got " . scalar(@matches) . "\n"
        if @matches != 1;
    # the hash maps locus => taxid, so return the value stored for the match
    return $prophage, $host, $lu->{$matches[0]};
}
The code is completely untested, so it may suffer from typos and egregious errors of all sorts. The points to note are:
- use three-parameter open with lexical file handles and check the result
- declare variables at their first point of use so their lifetime and scope are clear
- use indentation to make flow control and other code structures clear
- avoid duplication of code
- check that assumptions made by the code are correct
This code doesn't check that the input data are correctly formatted, as I'm not entirely sure what the format ought to be. "Production" code would ensure that sensible values are passed into getTaxId, for $prophage for example.
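As a rough illustration of that kind of check, here is a minimal sketch. It assumes a prophage id has the shape "host.suffix", which is only inferred from the split /\./ in getTaxId and may not match the real data. The helper name checkProphageId and the sample ids are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a description of the problem, or undef if the
# id looks usable. The "host.suffix" shape is an assumption, not a known format.
sub checkProphageId {
    my ($prophage) = @_;

    return "missing prophage id" if !defined $prophage || $prophage eq '';

    # A limit of -1 keeps trailing empty fields, so "host." is reported as bad
    my @parts = split /\./, $prophage, -1;
    return "expected 'host.suffix', got '$prophage'"
        if @parts != 2 || grep { $_ eq '' } @parts;

    return;    # nothing wrong (undef in scalar context)
}

# Usage: report bad ids instead of passing them on to the lookup
for my $id ('NZ_ABCD01000001.PF3', 'no_dot_here', '') {
    my $problem = checkProphageId($id);
    print defined $problem ? "bad id: $problem\n" : "ok: $id\n";
}
```

In the main loop you would call something like this on $columns[0] and $columns[1] before getTaxId, and either skip the line with a warning or die, depending on how much you trust the input.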
True laziness is hard work