Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Alignment and matrix generation

by newtoperlprog (Sexton)
on Mar 27, 2015 at 18:49 UTC ( [id://1121554]=perlquestion: print w/replies, xml ) Need Help??

newtoperlprog has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks,

I am working on an alignment file. I am trying to print the number of occurrence of individual bases (a, c, g, t) at each position. I seek some help to print it for the multiple alignment lines.

Test file (test.dat) has single alignment column and data file has 2 alignment column.

Output from test.dat is given below.

Test.dat file

CLUSTAL O(1.2.1) multiple sequence alignment gnl|hbvcds|AB014370_PreC_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB064314_PreC_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB014384_C_P-C ----------------------------------- +------------------------- gnl|hbvcds|AB014385_C_P-C ----------------------------------- +------------------------- gnl|hbvcds|AB048701_PreS1_P-D atggggcagaatctttccaccagcaatcctctggg +attctttcccgaccatcagttggat gnl|hbvcds|AB078031_PreS1_P-D atggggcagaatctttccaccagcaaccctctggg +attctttcccgaccaccagttggat gnl|hbvcds|AB030513_S_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB064314_S_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB194947_PreS2_P-E ----------------------------------- +------------------------- gnl|hbvcds|AB194948_PreS2_P-E ----------------------------------- +-------------------------
#!/usr/bin/perl use strict; use warnings; my ($rna,$rnalen,$posi, $pos, $base, $dibase,$percent); my (%bases, %occur, %count); my($max_position, $sum, $tot) = 0; my @dinucleotide = qw(A C G T);#change the number of bases my $file = $ARGV[0] or die "Please provide the input file\n"; my $column = $ARGV[1]; open (X, $file) || die "Check the input $file\n"; while (<X>) { # chomp $_; if ($_=~/^#/) { next; } if ($_=~/^\s+/) { next; } if ($_=~/^CLUSTAL/) { next; } if ($_=~/\*/) { next; } my @temp = split (/\s+/, $_); $rna = $temp[$column]; $rna = uc $rna; $rnalen = length($rna); my @nucleotides = split (//,$rna); $tot++; foreach my $key(@dinucleotide) { $bases{$key}=1; $sum++; } $max_position = $rnalen if($rnalen > $max_position); for (my $position=0;$position<=$rnalen-1;$position++) { $dibase = substr($rna, $position, 1); #change the splitting number foreach my $dinucleotide(@dinucleotide) { if ($dibase eq $dinucleotide){ $posi = $position+1; $occur{$posi,$dibase}++; # print "$posi\t$dibase\n"; } } } } for($pos=1;$pos<=$max_position;$pos++) { print "$pos\t"; foreach $base(sort keys %bases) { print "$base\t"; if(exists $occur{$pos,$base}) { my $percent = sprintf ("%0.1f", (($occur{$pos,$base})/$tot)*10 +0); # print "$occur{$pos,$base}\t"; print "$percent\t"; } else { print "0\t"; } } print "\n"; } close (X);

Result from test.dat

1 A 20.0 C 0 G 0 T 0 2 A 0 C 0 G 0 T 20.0 3 A 0 C 0 G 20.0 T 0 4 A 0 C 0 G 20.0 T 0 5 A 0 C 0 G 20.0 T 0 6 A 0 C 0 G 20.0 T 0 7 A 0 C 20.0 G 0 T 0 8 A 20.0 C 0 G 0 T 0 9 A 0 C 0 G 20.0 T 0 10 A 20.0 C 0 G 0 T 0 11 A 20.0 C 0 G 0 T 0 12 A 0 C 0 G 0 T 20.0 13 A 0 C 20.0 G 0 T 0 14 A 0 C 0 G 0 T 20.0 15 A 0 C 0 G 0 T 20.0 16 A 0 C 0 G 0 T 20.0 17 A 0 C 20.0 G 0 T 0 18 A 0 C 20.0 G 0 T 0 19 A 20.0 C 0 G 0 T 0 20 A 0 C 20.0 G 0 T 0 21 A 0 C 20.0 G 0 T 0 22 A 20.0 C 0 G 0 T 0 23 A 0 C 0 G 20.0 T 0 24 A 0 C 20.0 G 0 T 0 25 A 20.0 C 0 G 0 T 0 26 A 20.0 C 0 G 0 T 0 27 A 0 C 10.0 G 0 T 10.0 28 A 0 C 20.0 G 0 T 0 29 A 0 C 20.0 G 0 T 0 30 A 0 C 0 G 0 T 20.0 31 A 0 C 20.0 G 0 T 0 32 A 0 C 0 G 0 T 20.0 33 A 0 C 0 G 20.0 T 0 34 A 0 C 0 G 20.0 T 0 35 A 0 C 0 G 20.0 T 0 36 A 20.0 C 0 G 0 T 0 37 A 0 C 0 G 0 T 20.0 38 A 0 C 0 G 0 T 20.0 39 A 0 C 20.0 G 0 T 0 40 A 0 C 0 G 0 T 20.0 41 A 0 C 0 G 0 T 20.0 42 A 0 C 0 G 0 T 20.0 43 A 0 C 20.0 G 0 T 0 44 A 0 C 20.0 G 0 T 0 45 A 0 C 20.0 G 0 T 0 46 A 0 C 0 G 20.0 T 0 47 A 20.0 C 0 G 0 T 0 48 A 0 C 20.0 G 0 T 0 49 A 0 C 20.0 G 0 T 0 50 A 20.0 C 0 G 0 T 0 51 A 0 C 10.0 G 0 T 10.0 52 A 0 C 20.0 G 0 T 0 53 A 20.0 C 0 G 0 T 0 54 A 0 C 0 G 20.0 T 0 55 A 0 C 0 G 0 T 20.0 56 A 0 C 0 G 0 T 20.0 57 A 0 C 0 G 20.0 T 0 58 A 0 C 0 G 20.0 T 0 59 A 20.0 C 0 G 0 T 0 60 A 0 C 0 G 0 T 20.0

Data file

CLUSTAL O(1.2.1) multiple sequence alignment 1 + 60 gnl|hbvcds|AB014370_PreC_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB064314_PreC_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB014384_C_P-C ----------------------------------- +------------------------- gnl|hbvcds|AB014385_C_P-C ----------------------------------- +------------------------- gnl|hbvcds|AB048701_PreS1_P-D atggggcagaatctttccaccagcaatcctctggg +attctttcccgaccatcagttggat gnl|hbvcds|AB078031_PreS1_P-D atggggcagaatctttccaccagcaaccctctggg +attctttcccgaccaccagttggat gnl|hbvcds|AB030513_S_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB064314_S_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB194947_PreS2_P-E ----------------------------------- +------------------------- gnl|hbvcds|AB194948_PreS2_P-E ----------------------------------- +------------------------- 61 + 120 gnl|hbvcds|AB014370_PreC_P-A tagagtctcctgagcattgctcacctcaccatact +gcactcaggcaagccattctctgct gnl|hbvcds|AB064314_PreC_P-A tagagtctcctgagcattgctcacctcaccatacg +gcactcaggcaagccattctctgct gnl|hbvcds|AB014384_C_P-C tagagtctccggaacattgttcacctcaccataca +gcactcaggcaagctattctgtgtt gnl|hbvcds|AB014385_C_P-C tagagtctccggaacattgttcacctcaccataca +gcactcaggcaagctattctgtgtt gnl|hbvcds|AB048701_PreS1_P-D gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl|hbvcds|AB078031_PreS1_P-D gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl|hbvcds|AB030513_S_P-A gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl|hbvcds|AB064314_S_P-A gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl|hbvcds|AB194947_PreS2_P-E gggtttttcttgttgacaaaaatcctcacaatacc +gcagagtctagactcgtggtggact gnl|hbvcds|AB194948_PreS2_P-E gggtttttcttgttgacaaaaatcctcacaatacc +gcagagtctagactcgtggtggact * * ** * * ****** **** +*** * * * *

I am trying to get an result with all the position and occurrence of nucleotides similar to the test.dat result. Any help is greatly acknowledged

Regards

Replies are listed 'Best First'.
Re: Alignment and matrix generation
by choroba (Cardinal) on Mar 28, 2015 at 01:19 UTC
    Do you expect the output like the following?
    1 A 20.0 C 0.0 G 0.0 T 0.0 2 A 0.0 C 0.0 G 0.0 T 20.0 3 A 0.0 C 0.0 G 20.0 T 0.0 4 A 0.0 C 0.0 G 20.0 T 0.0 5 A 0.0 C 0.0 G 20.0 T 0.0 ... 115 A 0.0 C 0.0 G 0.0 T 100.0 116 A 0.0 C 20.0 G 80.0 T 0.0 117 A 0.0 C 0.0 G 60.0 T 40.0 118 A 60.0 C 0.0 G 40.0 T 0.0 119 A 0.0 C 80.0 G 0.0 T 20.0 120 A 0.0 C 0.0 G 0.0 T 100.0

    It was generated by the following script:

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Dear Choroba,

      Thank you for your time and helping me with the code to develop matrix for the data.

      However, the data file should not be having number 1 , 60 , this was provided by me for the clarification that its 60 column wide.

      Actual data file is like this:

      CLUSTAL O(1.2.1) multiple sequence alignment + gnl|hbvcds|AB014370_PreC_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB064314_PreC_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB014384_C_P-C ----------------------------------- +------------------------- gnl|hbvcds|AB014385_C_P-C ----------------------------------- +------------------------- gnl|hbvcds|AB048701_PreS1_P-D atggggcagaatctttccaccagcaatcctctggg +attctttcccgaccatcagttggat gnl|hbvcds|AB078031_PreS1_P-D atggggcagaatctttccaccagcaaccctctggg +attctttcccgaccaccagttggat gnl|hbvcds|AB030513_S_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB064314_S_P-A ----------------------------------- +------------------------- gnl|hbvcds|AB194947_PreS2_P-E ----------------------------------- +------------------------- gnl|hbvcds|AB194948_PreS2_P-E ----------------------------------- +------------------------- + + gnl|hbvcds|AB014370_PreC_P-A tagagtctcctgagcattgctcacctcaccatact +gcactcaggcaagccattctctgct gnl|hbvcds|AB064314_PreC_P-A tagagtctcctgagcattgctcacctcaccatacg +gcactcaggcaagccattctctgct gnl|hbvcds|AB014384_C_P-C tagagtctccggaacattgttcacctcaccataca +gcactcaggcaagctattctgtgtt gnl|hbvcds|AB014385_C_P-C tagagtctccggaacattgttcacctcaccataca +gcactcaggcaagctattctgtgtt gnl|hbvcds|AB048701_PreS1_P-D gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl|hbvcds|AB078031_PreS1_P-D gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl|hbvcds|AB030513_S_P-A gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl|hbvcds|AB064314_S_P-A gggtttttcttgttgacaagaatcctcacaatacc +gcagagtctagactcgtggtggact gnl|hbvcds|AB194947_PreS2_P-E gggtttttcttgttgacaaaaatcctcacaatacc +gcagagtctagactcgtggtggact gnl|hbvcds|AB194948_PreS2_P-E gggtttttcttgttgacaaaaatcctcacaatacc +gcagagtctagactcgtggtggact * * ** * * ****** **** +*** * * * *

      Please let me know where I can modify the code to incorporate the changes.

      Regards

        Just few changes:
        --- 1.pl 2015-03-30 22:58:28.434954333 +0200 +++ 2.pl 2015-03-30 22:58:11.261833250 +0200 @@ -3,16 +3,18 @@ use warnings; use Syntax::Construct qw{ // }; -my ($startpos, $endpos, $count); +my $endpos = 0; +my ($startpos, $count); my %occurrences; while (<DATA>) { - if (/^\s+ ([0-9]+) \s+ ([0-9]+) \s*$/x) { - ($startpos, $endpos) = ($1, $2); + if (/^ +$/) { + $startpos = $endpos + 1; $count = 0; } elsif (/\s+ ([-actg]+) \s*$/x) { ++$count; my @nucleotides = split //, $1; + $endpos = $endpos + length $1 if $startpos == $endpos + 1; for my $pos (0 .. $#nucleotides) { ++$occurrences{ $nucleotides[$pos] }[$startpos + $pos] unless '-' eq $nucleotides[$pos];
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1121554]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (5)
As of 2024-04-16 19:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found