http://www.perlmonks.org?node_id=1007821

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow Monks! I need your help to discover the bug in my script... It should something really silly, but I can't seem to be able to detect it..
So, suppose you have a file like the following:
LA5 ATGAAAAAGAC--AGCTATCGCGATTGCAGTGGCACTGGCTGGTTTCGCTAC----CGTAGCG +----CAGGCCG----------CTCCGAAAGATAACACCTGGTACGCTGGTGCT-----AAACTGGGCTG +GTCTCAGTACCATGACACCGGCTTCATTCACAATGATGGCCCGACTCATGAAAACCAACTGGGCG-CAG +GTGCTTTTGGTGGTTACCAGGTTAACCCGTATGTTGGCTTTGAAATGGGCTACGACTGGTTAGGCCGTA +TGCCGTACAAAGGCGACAACATCAATGGCGCTTATAAAGCTCAGGGCGTTCAGTTGACCGCTAAACTGG +GTTATCCAATCACTGACGATCTGGACG--TTTATACCCGTCTGGGTGGTATGGTATGGCGTG-CAGACA +CCAAGTCTAACGTCCCTGGC------GGCCCGTCTACTAAAGACCACGACACCGGCGTTTCCCCGGTAT +TCGCGGGCGGTATCGAGTATGCCATCACCCCTGAAATCGCAACCCGTCTGGAATACCAGTG----GACT +AACAACATCGGTGATGCCAACACCATCGGCACCCGTCCGGACAACGGCCTGCTGAGCGTAGGTGTTTCC +TACCGTTTCGGCCAGCAAGAAGCTGCTC-CGGTAGTAGCTCCGGCACCGGCTCCGGCTCCGGAAGTA-- +CAG---ACCAAGCACTTCACTCT-GAAGTCTGACGTACTGTTCAACTTCAACAAATCTACCCTGAAG-- +CCGGAAGGCCAGCAGGCT-CTGGATCAGCTGTACAGCCAGCTGAGCAACCTGGATCCGAAAGACGGTTC +CGTTGTCGTTCTGGGCTTCACTGACCGTATCGGTTCTGACGC-TTACAACCAGGGTCTGT-CCGAGAAA +CGTGCTCAGTCTGTTGTTGATTACCTGATCTCCAAAGGTATTCCGTCTGACAAAATCTCCGCACGTGGT +ATGGGCGAATCTAACCCGGTTACCGGCAACACCTGTGACAACGTGAAACCTCGCGCTGCCCTG---ATC +GATTGCCTGGCT-CCGGATCGTCGCGTAGAGATCGAAGTTAAAG--GCGTTAAAGACGTGGTAACTCAG +CCGCAGGCT- RKS5078 ATGAAAAAGAC--AGCTATCGCGATTGCAGTGGCACTGGCTGGTTTCGCTAC----CGT +AGCG----CAGGCCG----------CTCCGAAAGATAACACCTGGTACGCTGGTGCT-----AAACTGG +GCTGGTCTCAGTACCATGACACCGGCTTCATTCACAATGATGGCCCGACTCATGAAAACCAACTGGGCG +-CAGGTGCTTTTGGTGGTTACCAGGTTAACCCGTATGTTGGCTTTGAAATGGGCTACGACTGGTTAGGC +CGTATGCCGTACAAAGGCGACAACATCAATGGCGCTTATAAAGCTCAGGGCGTTCAGTTGACCGCTAAA +CTGGGTTATCCAATCACTGACGATCTGGACG--TTTATACCCGTCTGGGTGGTATGGTATGGCGTG-CA +GACACCAAGTCTAATGTCCCTGGC------GGCCCGTCTACTAAAGACCACGACACCGGCGTTTCCCCG +GTATTCGCGGGCGGTATCGAGTATGCCATCACCCCTGAAATCGCAACCCGTCTGGAATACCAGTG---- +GACTAACAACATCGGTGATGCCAACACCATCGGCACCCGTCCGGACAACGGCCTGCTGAGCGTAGGTGT +TTCCTACCGTTTCGGCCAGCAAGAAGCTGCTC-CGGTAGTAGCTCCGGCACCGGCTCCGGCTCCGGAAG +TA--CAG---ACCAAGCACTTCACTCT-GAAGTCTGACGTACTGTTCAACTTCAACAAATCTACCCTGA +AG--CCGGAAGGCCAGCAGGCT-CTGGATCAGCTGTACAGCCAGCTGAGCAACCTGGATCCGAAAGACG +GTTCCGTTGTCGTTCTGGGCTTCACTGACCGTATCGGTTCTGACGC-TTACAACCAGGGTCTGT-CCGA +GAAACGTGCTCAGTCTGTTGTTGATTACCTGATCTCCAAAGGTATTCCGTCTGACAAAATCTCCGCACG +TGGTATGGGCGAATCTAACCCGGTTACCGGCAACACCTGTGACAACGTGAAACCTCGCGCTGCCCTG-- +-ATCGATTGCCTGGCT-CCGGATCGTCGCGTAGAGATCGAAGTTAAAG--GCGTTAAAGACGTGGTAAC +TCAGCCGCAGGCT- 06-0676 ATGAAAAAGAC--AGCTATCGCGATTGCAGTGGCACTGGCTGGTTTCGCTAC----CGT +AGCG----CAGGCCG----------CTCCGAAAGATAACACCTGGTACGCTGGTGCT-----AAACTGG +GCTGGTCTCAGTACCATGACACCGGCTTCATTCACAATGATGGCCCGACTCATGAAAACCAACTGGGCG +-CAGGTGCTTTTGGTGGTTACCAGGTTAACCCGTATGTTGGCTTTGAAATGGGCTACGACTGGTTAGGC +CGTATGCCGTACAAAGGCGACAACATCAATGGCGCTTATAAAGCTCAGGGCGTTCAGTTGACCGCTAAA +CTGGGTTATCCAATCACTGACGATCTGGACG--TTTATACCCGTCTGGGTGGTATGGTATGGCGTG-CA +GACACCAAGTCTAACGTCCCTGGC------GGCCCGTCTACTAAAGACCACGACACCGGCGTTTCCCCG +GTATTCGCGGGCGGTATCGAGTATGCCATCACCCCTGAAATCGCAACCCGTCTGGAATACCAGTG---- +GACTAACAACATCGGTGATGCCAACACCATCGGCACCCGTCCGGACAACGGCCTGCTGAGCGTAGGTGT +TTCCTACCGTTTCGGCCAGCAAGAAGCTGCTC-CGGTAGTAGCTCCGGCACCGGCTCCGGCTCCGGAAG +TA--CAG---ACCAAGCACTTCACTCT-GAAGTCTGACGTACTGTTCAACTTCAACAAATCTACCCTGA +AG--CCGGAAGGCCAGCAGGCT-CTGGATCAGCTGTACAGCCAGCTGAGCAACCTGGATCCGAAAGACG +GTTCCGTTGTCGTTCTGGGCTTCACTGACCGTATCGGTTCTGACGC-TTACAACCAGGGTCTGT-CCGA +GAAACGTGCTCAGTCTGTTGTTGATTACCTGATCTCCAAAGGTATTCCGTCTGACAAAATCTCCGCACG +TGGTATGGGCGAATCTAACCCGGTTACCGGCAACACCTGTGACAACGTGAAACCTCGCGCTGCCCTG-- +-ATCGATTGCCTGGCT-CCGGATCGTCGCGTAGAGATCGAAGTTAAAG--GCGTTAAAGACGTGGTAAC +TCAGCCGCAGGCT- 58-6482 ATGAAAAAGAC--AGCTATCGCGATTGCAGTGGCACTGGCTGGTTTCGCTAC----CGT +AGCG----CAGGCCG----------CTCCGAAAGATAACACCTGGTACGCTGGTGCT-----AAACTGG +GCTGGTCTCAGTACCATGACACCGGCTTCATTCACAATGATGGCCCGACTCATGAAAACCAACTGGGCG +-CAGGTGCTTTTGGTGGTTACCAGGTTAACCCGTATGTTGGCTTTGAAATGGGCTACGACTGGTTAGGC +CGTATGCCGTACAAAGGCGACAACATCAATGGCGCTTATAAAGCTCAGGGCGTTCAGTTGACCGCTAAA +CTGGGTTATCCAATCACTGACGATCTGGACG--TTTATACCCGTCTGGGTGGTATGGTATGGCGTG-CA +GACACCAAGTCTAACGTCCCTGGC------GGCCCGTCTACTAAAGACCACGACACCGGCGTTTCCCCG +GTATTCGCGGGCGGTATCGAGTATGCCATCACCCCTGAAATCGCAACCCGTCTGGAATACCAGTG---- +GACTAACAACATCGGTGATGCCAACACCATCGGCACCCGTCCGGACAACGGCCTGCTGAGCGTAGGTGT +TTCCTACCGTTTCGGCCAGCAAGAAGCTGCTC-CGGTAGTAGCTCCGGCACCGGCTCCGGCTCCGGAAG +TA--CAG---ACCAAGCACTTCACTCT-GAAGTCTGACGTACTGTTCAACTTCAACAAATCTACCCTGA +AG--CCGGAAGGCCAGCAGGCT-CTGGATCAGCTGTACAGCCAGCTGAGCAACCTGGATCCGAAAGACG +GTTCCGTTGTCGTTCTGGGCTTCACTGACCGTATCGGTTCTGACGC-TTACAACCAGGGTCTGT-CCGA +GAAACGTGCTCAGTCTGTTGTTGATTACCTGATCTCCAAAGGTATTCCGTCTGACAAAATCTCCGCACG +TGGTATGGGCGAATCTAACCCGGTTACCGGCAACACCTGTGACAACGTGAAACCTCGCGCTGCCCTG-- +-ATCGATTGCCTGGCT-CCGGATCGTCGCGTAGAGATCGAAGTTAAAG--GCGTTAAAGACGTGGTAAC +TCAGCCGCAGGCT- 648905 ATGAAAAAGAC--AGCTATCGCGATTGCAGTGGCACTGGCTGGTTTCGCTAC----CGTA +GCG----CAGGCCG----------CTCCGAAAGATAACACCTGGTACGCTGGTGCT-----AAACTGGG +CTGGTCTCAGTACCATGACACCGGCTTCATTCACAATGATGGCCCGACTCATGAAAACCAACTGGGCG- +CAGGTGCTTTTGGTGGTTACCAGGTTAACCCGTATGTTGGCTTTGAAATGGGCTACGACTGGTTAGGCC +GTATGCCGTACAAAGGCGACAACATCAATGGCGCTTATAAAGCTCAGGGCGTTCAGTTGACCGCTAAAC +TGGGTTATCCAATCACTGACGATCTGGACG--TTTATACCCGTCTGGGTGGTATGGTATGGCGTG-CAG +ACACCAAGTCTAACGTCCCTGGC------GGCCCGTCTACTAAAGACCACGACACCGGCGTTTCCCCGG +TATTCGCGGGCGGTATCGAGTATGCCATCACCCCTGAAATCGCAACCCGTCTGGAATACCAGTG----G +ACTAACAACATCGGTGATGCCAACACCATCGGCACCCGTCCGGACAACGGCCTGCTGAGCGTAGGTGTT +TCCTACCGTTTCGGCCAGCAAGAAGCTGCTC-CGGTAGTAGCTCCGGCACCGGCTCCGGCTCCGGAAGT +A--CAG---ACCAAGCACTTCACTCT-GAAGTCTGACGTACTGTTCAACTTCAACAAATCTACCCTGAA +G--CCGGAAGGCCAGCAGGCT-CTGGATCAGCTGTACAGCCAGCTGAGCAACCTGGATCCGAAAGACGG +TTCCGTTGTCGTTCTGGGCTTCACTGACCGTATCGGTTCTGACGC-TTACAACCAGGGTCTGT-CCGAG +AAACGTGCTCAGTCTGTTGTTGATTACCTGATCTCCAAAGGTATTCCGTCTGACAAAATCTCCGCACGT +GGTATGGGCGAATCTAACCCGGTTACCGGCAACACCTGTGACAACGTGAAACCTCGCGCTGCCCTG--- +ATCGATTGCCTGGCT-CCGGATCGTCGCGTAGAGATCGAAGTTAAAG--GCGTTAAAGACGTGGTAACT +CAGCCGCAGGCT- 8b-1 ATGAAAAAGAC--AGCTATCGCGATTGCAGTGGCACTGGCTGGTTTCGCTAC----CGTAGC +G----CAGGCCG----------CTCCGAAAGATAACACCTGGTACGCTGGTGCT-----AAACTGGGCT +GGTCTCAGTACCATGACACCGGCTTCATTCACAATGATGGCCCGACTCATGAAAACCAACTGGGCG-CA +GGTGCTTTTGGTGGTTACCAGGTTAACCCGTATGTTGGCTTTGAAATGGGCTACGACTGGTTAGGCCGT +ATGCCGTACAAAGGCGACAACATCAATGGCGCTTATAAAGCTCAGGGCGTTCAGTTGACCGCTAAACTG +GGTTATCCAATCACTGACGATCTGGACG--TTTATACCCGTCTGGGTGGTATGGTATGGCGTG-CAGAC +ACCAAGTCTAACGTCCCTGGC------GGCCCGTCTACTAAAGACCACGACACCGGCGTTTCCCCGGTA +TTCGCGGGCGGTATCGAGTATGCCATCACCCCTGAAATCGCAACCCGTCTGGAATACCAGTG----GAC +TAACAACATCGGTGATGCCAACACCATCGGCACCCGTCCGGACAACGGCCTGCTGAGCGTAGGTGTTTC +CTACCGTTTCGGCCAGCAAGAAGCTGCTC-CGGTAGTAGCTCCGGCACCGGCTCCGGCTCCGGAAGTA- +-CAG---ACCAAGCACTTCACTCT-GAAGTCTGACGTACTGTTCAACTTCAACAAATCTACCCTGAAG- +-CCGGAAGGCCAGCAGGCT-CTGGATCAGCTGTACAGCCAGCTGAGCAACCTGGATCCGAAAGACGGTT +CCGTTGTCGTTCTGGGCTTCACTGACCGTATCGGTTCTGACGC-TTACAACCAGGGTCTGT-CCGAGAA +ACGTGCTCAGTCTGTTGTTGATTACCTGATCTCCAAAGGTATTCCGTCTGACAAAATCTCCGCACGTGG +TATGGGCGAATCTAACCCGGTTACCGGCAACACCTGTGACAACGTGAAACCTCGCGCTGCCCTG---AT +CGATTGCCTGGCT-CCGGATCGTCGCGTAGAGATCGAAGTTAAAG--GCGTTAAAGACGTGGTAACTCA +GCCGCAGGCT- 22510-1 ATGAAAAAGAC--AGCTATCGCGATTGCAGTGGCACTGGCTGGTTTCGCTAC----CGT +AGCG----CAGGCCG----------CTCCGAAAGATAACACCTGGTACGCTGGTGCT-----AAACTGG +GCTGGTCTCAGTACCATGACACCGGCTTCATTCACAATGATGGCCCGACTCATGAAAACCAACTGGGCG +-CAGGTGCTTTTGGTGGTTACCAGGTTAACCCGTATGTTGGCTTTGAAATGGGCTACGACTGGTTAGGC +CGTATGCCGTACAAAGGCGACAACATCAATGGCGCTTATAAAGCTCAGGGCGTTCAGTTGACCGCTAAA +CTGGGTTATCCAATCACTGACGATCTGGACG--TTTATACCCGTCTGGGTGGTATGGTATGGCGTG-CA +GACACCAAGTCTAACGTCCCTGGC------GGCCCGTCTACTAAAGACCACGACACCGGCGTTTCCCCG +GTATTCGCGGGCGGTATCGAGTATGCCATCACCCCTGAAATCGCAACCCGTCTGGAATACCAGTG---- +GACTAACAACATCGGTGATGCCAACACCATCGGCACCCGTCCGGACAACGGCCTGCTGAGCGTAGGTGT +TTCCTACCGTTTCGGCCAGCAAGAAGCTGCTC-CGGTAGTAGCTCCGGCACCGGCTCCGGCTCCGGAAG +TA--CAG---ACCAAGCACTTCACTCT-GAAGTCTGACGTACTGTTCAACTTCAACAAATCTACCCTGA +AG--CCGGAAGGCCAGCAGGCT-CTGGATCAGCTGTACAGCCAGCTGAGCAACCTGGATCCGAAAGACG +GTTCCGTTGTCGTTCTGGGCTTCACTGACCGTATCGGTTCTGACGC-TTACAACCAGGGTCTGT-CCGA +GAAACGTGCTCAGTCTGTTGTTGATTACCTGATCTCCAAAGGTATTCCGTCTGACAAAATCTCCGCACG +TGGTATGGGCGAATCTAACCCGGTTACCGGCAACACCTGTGACAACGTGAAACCTCGCGCTGCCCTG-- +-ATCGATTGCCTGGCT-CCGGATCGTCGCGTAGAGATCGAAGTTAAAG--GCGTTAAAGACGTGGTAAC +TCAGCCGCAGGCT-

and you want to count, per position, how many A, T, G, C or - you have.
I have written this:
@all_seqs = (); while(<>) { if($_=~/(.*)\t(.*)/) { $id=$1; $seq=$2; push @all_seqs, $seq; } } for ($i=0; $i<=$#all_seqs; $i++) { $seq_to_examine=$all_seqs[$i]; @split_seq_to_examine=split(//, $seq_to_examine); for($j=0; $j<=1108; $j++) { if ($split_seq_to_examine[$j] eq 'A') {$count_A++;} elsif ($split_seq_to_examine[$j] eq 'T') {$count_T++;} elsif ($split_seq_to_examine[$j] eq 'C') {$count_C++;} elsif ($split_seq_to_examine[$j] eq 'G') {$count_G++;} elsif ($split_seq_to_examine[$j] eq '-') {$count_non++;} print $j."\t".$count_A."\t".$count_T."\t".$count_C."\t".$count_G." +\n"; } }

but it keeps increasing the counters, and not reporting position-by-position. I think I must somewhere set the counters to 0, but the positions I tried just made the script worse...

Replies are listed 'Best First'.
Re: What is wrong in this code???
by moritz (Cardinal) on Dec 07, 2012 at 20:47 UTC
    I think I must somewhere set the counters to 0, but the positions I tried just made the script worse...

    If you want to count the number of symbols per line or record, you must reset the counters before counting the symbols in each line/record.

      Yes but where, can you tell me? I tried printing outside the loop, but printed statistics only for the last position (all sequences have the same number of letters, and I want to keep statistics for each position, but for all the sequences. So for example, since I have 200 sequences in my file, for position 56, I could have 150 A, 40 C, 25 G and 35 -.
Re: What is wrong in this code???
by Lotus1 (Vicar) on Dec 07, 2012 at 22:01 UTC

    tr/// can be used to count characters in a string. No need to loop through.

    use warnings; use strict; print "LATCG-\n"; while(<DATA>) { print $. . tr/A// . tr/T// . tr/C// . tr/G// . tr/-// . "\n"; } __DATA__ ATGAAAAAGAC AGCTATCGCGA TTGCAGTGGCA CTGGCTGGTTT CGCTACCGTAG CGC-A---TAG

    Output:

    LATCG- 171120 232330 323240 405240 522430 621224
Re: What is wrong in this code???
by Kenosis (Priest) on Dec 07, 2012 at 20:56 UTC

    Try the following:

    use strict; use warnings; my %sequences; while (<>) { for my $char ( split '', (split)[1] ) { $sequences{$char}++; } } print "$_ => $sequences{$_}", "\n" for keys %sequences;

    Output on your data set:

    - => 406 A => 1708 T => 1639 C => 2057 G => 1946
      Thanks, but that is not what I need. What I must have, is the number of A, C, T, G and - per POSITION. So the output would be something like:
      POS #1 A:150 | T:50 | G:0 | C:20 | -:30 POS 21 A:80 | T:60 | G:40 | C:20 | -:0 etc

      So the numbers must add up the number of sequences, each time...

        My apologies; I did misunderstand. The following should work:

        use strict; use warnings; my %sequences; while (<>) { my $i = 1; for my $char ( split '', (split)[1] ) { $sequences{ $i++ }{$char}++; } } for my $pos ( sort { $a <=> $b } keys %sequences ) { my @charCount; while ( my ( $char, $count ) = each %{ $sequences{$pos} } ) { push @charCount, "$char:$count"; } print "POS #$pos " . ( join ' | ', @charCount ) . "\n"; }

        This is the output:

        POS #1 a:4 POS #2 q:1 | b:3 POS #3 c:4 POS #4 x:1 | d:2 | z:1 POS #5 e:4 POS #6 f:4 POS #7 c:1 | a:1 | b:1 | d:1 POS #8 h:4 POS #9 i:4 POS #10 j:4

        Run on this data:

        12 abcdefahij 12 abcdefbhij 12 aqcxefchij 12 abczefdhij
        use strict; use warnings; while (<>) { my %sequences; for my $char ( split '', (split)[1] ) { $sequences{$char}++; } print "$_ => $sequences{$_}", "\n" for keys %sequences; }

        Note how I declared the hash inside the loop. Scoping variables correctly is the easiest way to reset them.

Re: What is wrong in this code???
by thundergnat (Deacon) on Dec 07, 2012 at 21:45 UTC
    use strict; use warnings; my @gene; while (<>) { my $i = 0; for my $char ( split '', (split)[1] ) { $gene[$i++]{$char}++; } } my $i = 0; for my $pos (@gene){ print "\nPOS: ", $i++; for my $key ( sort keys %$pos){ print " | $key => ", $pos->{$key}; } };
Re: What is wrong in this code???
by CountZero (Bishop) on Dec 08, 2012 at 11:28 UTC
    What about this?
    use Modern::Perl; use Data::Dump qw /dump/; my @results; while (<DATA>) { my @codon = split '', (split /\s+/)[1]; while (my ($seq, $char) = each @codon) { $results[$seq]{$char}++; } } say dump(@results); __DATA__ first ACGATT--GAT second AGTTAAC-TTT third TTAGCAGG-TA
    Output:
    ( { A => 2, T => 1 }, { C => 1, G => 1, T => 1 }, { A => 1, G => 1, T => 1 }, { A => 1, G => 1, T => 1 }, { A => 1, C => 1, T => 1 }, { A => 2, T => 1 }, { "-" => 1, "C" => 1, "G" => 1 }, { "-" => 2, "G" => 1 }, { "-" => 1, "G" => 1, "T" => 1 }, { A => 1, T => 2 }, { A => 1, T => 2 }, )
    Alternatively you can write the while loop more compact (but less easy to read)
    while (<DATA>) { my $seq; $results[$seq++]{$_}++ for split '', (split /\s+/)[1]; }

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: What is wrong in this code???
by brap (Pilgrim) on Dec 07, 2012 at 21:08 UTC

    Hey anonymous,

    I could be way off base here, but it almost sounds like you want an array of counters, where each element of the array represents the position of the line you're counting. Something along the order of (untested):

    if ($split_seq_to_examine[$j] eq 'A') {$count_A[$j]++;} elsif ($split_seq_to_examine[$j] eq 'T') {$count_T[$j]++;} elsif ($split_seq_to_examine[$j] eq 'C') {$count_C[$j]++;} elsif ($split_seq_to_examine[$j] eq 'G') {$count_G[$j]++;} elsif ($split_seq_to_examine[$j] eq '-') {$count_non[$j]++;}

    To find out what you have in, say, position 56, look at $count_A[56], $count_T[56], $count_G[56], and $count_C[56].

Re: What is wrong in this code???
by BillKSmith (Monsignor) on Dec 08, 2012 at 20:02 UTC

    A much shorter solution (probably slower) uses a regexp to copy all the A's to a temp array and then uses the size of that array as the count. Repeat for other letters.

    use strict; use warnings; use Data::Dumper qw(Dumper); my @letters = ('A', 'T', 'G', 'C', '-'); while (my $seq = <>){ print $seq; my @temp; my %cnts = map {($_, scalar ( @temp = $seq =~ /($_)/g))} @letters; print Dumper \%cnts; }
    Bill