Dear PerlMonks,
I am working on an alignment file. I am trying to print the number of occurrence of individual bases (a, c, g, t) at each position. I seek some help to print it for the multiple alignment lines.
Test file (test.dat) has single alignment column and data file has 2 alignment column.
Output from test.dat is given below.
Test.dat file
CLUSTAL O(1.2.1) multiple sequence alignment
gnl|hbvcds|AB014370_PreC_P-A -----------------------------------
+-------------------------
gnl|hbvcds|AB064314_PreC_P-A -----------------------------------
+-------------------------
gnl|hbvcds|AB014384_C_P-C -----------------------------------
+-------------------------
gnl|hbvcds|AB014385_C_P-C -----------------------------------
+-------------------------
gnl|hbvcds|AB048701_PreS1_P-D atggggcagaatctttccaccagcaatcctctggg
+attctttcccgaccatcagttggat
gnl|hbvcds|AB078031_PreS1_P-D atggggcagaatctttccaccagcaaccctctggg
+attctttcccgaccaccagttggat
gnl|hbvcds|AB030513_S_P-A -----------------------------------
+-------------------------
gnl|hbvcds|AB064314_S_P-A -----------------------------------
+-------------------------
gnl|hbvcds|AB194947_PreS2_P-E -----------------------------------
+-------------------------
gnl|hbvcds|AB194948_PreS2_P-E -----------------------------------
+-------------------------
#!/usr/bin/perl
use strict;
use warnings;
my ($rna,$rnalen,$posi, $pos, $base, $dibase,$percent);
my (%bases, %occur, %count); my($max_position, $sum, $tot) = 0;
my @dinucleotide = qw(A C G T);#change the number of bases
my $file = $ARGV[0] or die "Please provide the input file\n";
my $column = $ARGV[1];
open (X, $file) || die "Check the input $file\n";
while (<X>) {
# chomp $_;
if ($_=~/^#/) { next; }
if ($_=~/^\s+/) { next; }
if ($_=~/^CLUSTAL/) { next; }
if ($_=~/\*/) { next; }
my @temp = split (/\s+/, $_);
$rna = $temp[$column];
$rna = uc $rna;
$rnalen = length($rna);
my @nucleotides = split (//,$rna);
$tot++;
foreach my $key(@dinucleotide) {
$bases{$key}=1;
$sum++;
}
$max_position = $rnalen if($rnalen > $max_position);
for (my $position=0;$position<=$rnalen-1;$position++) {
$dibase = substr($rna, $position, 1); #change the splitting number
foreach my $dinucleotide(@dinucleotide) {
if ($dibase eq $dinucleotide){
$posi = $position+1;
$occur{$posi,$dibase}++;
# print "$posi\t$dibase\n";
}
}
}
}
for($pos=1;$pos<=$max_position;$pos++) {
print "$pos\t";
foreach $base(sort keys %bases) {
print "$base\t";
if(exists $occur{$pos,$base}) {
my $percent = sprintf ("%0.1f", (($occur{$pos,$base})/$tot)*10
+0);
# print "$occur{$pos,$base}\t";
print "$percent\t";
}
else {
print "0\t";
}
}
print "\n";
}
close (X);
Result from test.dat
1 A 20.0 C 0 G 0 T 0
2 A 0 C 0 G 0 T 20.0
3 A 0 C 0 G 20.0 T 0
4 A 0 C 0 G 20.0 T 0
5 A 0 C 0 G 20.0 T 0
6 A 0 C 0 G 20.0 T 0
7 A 0 C 20.0 G 0 T 0
8 A 20.0 C 0 G 0 T 0
9 A 0 C 0 G 20.0 T 0
10 A 20.0 C 0 G 0 T 0
11 A 20.0 C 0 G 0 T 0
12 A 0 C 0 G 0 T 20.0
13 A 0 C 20.0 G 0 T 0
14 A 0 C 0 G 0 T 20.0
15 A 0 C 0 G 0 T 20.0
16 A 0 C 0 G 0 T 20.0
17 A 0 C 20.0 G 0 T 0
18 A 0 C 20.0 G 0 T 0
19 A 20.0 C 0 G 0 T 0
20 A 0 C 20.0 G 0 T 0
21 A 0 C 20.0 G 0 T 0
22 A 20.0 C 0 G 0 T 0
23 A 0 C 0 G 20.0 T 0
24 A 0 C 20.0 G 0 T 0
25 A 20.0 C 0 G 0 T 0
26 A 20.0 C 0 G 0 T 0
27 A 0 C 10.0 G 0 T 10.0
28 A 0 C 20.0 G 0 T 0
29 A 0 C 20.0 G 0 T 0
30 A 0 C 0 G 0 T 20.0
31 A 0 C 20.0 G 0 T 0
32 A 0 C 0 G 0 T 20.0
33 A 0 C 0 G 20.0 T 0
34 A 0 C 0 G 20.0 T 0
35 A 0 C 0 G 20.0 T 0
36 A 20.0 C 0 G 0 T 0
37 A 0 C 0 G 0 T 20.0
38 A 0 C 0 G 0 T 20.0
39 A 0 C 20.0 G 0 T 0
40 A 0 C 0 G 0 T 20.0
41 A 0 C 0 G 0 T 20.0
42 A 0 C 0 G 0 T 20.0
43 A 0 C 20.0 G 0 T 0
44 A 0 C 20.0 G 0 T 0
45 A 0 C 20.0 G 0 T 0
46 A 0 C 0 G 20.0 T 0
47 A 20.0 C 0 G 0 T 0
48 A 0 C 20.0 G 0 T 0
49 A 0 C 20.0 G 0 T 0
50 A 20.0 C 0 G 0 T 0
51 A 0 C 10.0 G 0 T 10.0
52 A 0 C 20.0 G 0 T 0
53 A 20.0 C 0 G 0 T 0
54 A 0 C 0 G 20.0 T 0
55 A 0 C 0 G 0 T 20.0
56 A 0 C 0 G 0 T 20.0
57 A 0 C 0 G 20.0 T 0
58 A 0 C 0 G 20.0 T 0
59 A 20.0 C 0 G 0 T 0
60 A 0 C 0 G 0 T 20.0
Data file
CLUSTAL O(1.2.1) multiple sequence alignment
1
+ 60
gnl|hbvcds|AB014370_PreC_P-A -----------------------------------
+-------------------------
gnl|hbvcds|AB064314_PreC_P-A -----------------------------------
+-------------------------
gnl|hbvcds|AB014384_C_P-C -----------------------------------
+-------------------------
gnl|hbvcds|AB014385_C_P-C -----------------------------------
+-------------------------
gnl|hbvcds|AB048701_PreS1_P-D atggggcagaatctttccaccagcaatcctctggg
+attctttcccgaccatcagttggat
gnl|hbvcds|AB078031_PreS1_P-D atggggcagaatctttccaccagcaaccctctggg
+attctttcccgaccaccagttggat
gnl|hbvcds|AB030513_S_P-A -----------------------------------
+-------------------------
gnl|hbvcds|AB064314_S_P-A -----------------------------------
+-------------------------
gnl|hbvcds|AB194947_PreS2_P-E -----------------------------------
+-------------------------
gnl|hbvcds|AB194948_PreS2_P-E -----------------------------------
+-------------------------
61
+ 120
gnl|hbvcds|AB014370_PreC_P-A tagagtctcctgagcattgctcacctcaccatact
+gcactcaggcaagccattctctgct
gnl|hbvcds|AB064314_PreC_P-A tagagtctcctgagcattgctcacctcaccatacg
+gcactcaggcaagccattctctgct
gnl|hbvcds|AB014384_C_P-C tagagtctccggaacattgttcacctcaccataca
+gcactcaggcaagctattctgtgtt
gnl|hbvcds|AB014385_C_P-C tagagtctccggaacattgttcacctcaccataca
+gcactcaggcaagctattctgtgtt
gnl|hbvcds|AB048701_PreS1_P-D gggtttttcttgttgacaagaatcctcacaatacc
+gcagagtctagactcgtggtggact
gnl|hbvcds|AB078031_PreS1_P-D gggtttttcttgttgacaagaatcctcacaatacc
+gcagagtctagactcgtggtggact
gnl|hbvcds|AB030513_S_P-A gggtttttcttgttgacaagaatcctcacaatacc
+gcagagtctagactcgtggtggact
gnl|hbvcds|AB064314_S_P-A gggtttttcttgttgacaagaatcctcacaatacc
+gcagagtctagactcgtggtggact
gnl|hbvcds|AB194947_PreS2_P-E gggtttttcttgttgacaaaaatcctcacaatacc
+gcagagtctagactcgtggtggact
gnl|hbvcds|AB194948_PreS2_P-E gggtttttcttgttgacaaaaatcctcacaatacc
+gcagagtctagactcgtggtggact
* * ** * * ****** ****
+*** * * * *
I am trying to get an result with all the position and occurrence of nucleotides similar to the test.dat result. Any help is greatly acknowledged
Regards
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.