2d Array - making an array for the column then adding the values

hansoffate has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

I have a 2d array that has all the values filled in correctly which was checked with Data::Dumper. I then used map to create a slice to make an array of just a column. Then, I pass this array to another function to add up the counts for the column and return the total.

The error occurs when trying to add all the values in each column. I have the "total" on the right hand side of each print. As you can see, the columns with only 0's some how add up to 60 and yet the column with actual values that should be well over 300, is only at 36. Does anyone have any ideas how to solve this?

Thanks for the help,

-Hans

Problem Code expected in the score_count function on line 135

 
#!/usr/bin/perl -w
use strict;
use POSIX;
use Getopt::Std;
use Data::Dumper;

my $usage = "\nusage: $0 [-flags] <fasta file> <output file> (must be 
+all 1 type. ie all fasta NT, all fastq, all fasta fna)\n".
            "The script is able to detect if fasta or fastq files have
+ been suplied. Depending on the flags, it produces different output.\
+n".
            "It will produce a histogram, a vector distribution, and a
+ NT/AA distribution. For fastq files, you must use a flag to specify 
+which version\n".
            "of fastq the file is formatted with.\n\n".
            "-l #          Y or N to specify Solexa/Illumina < 1.3 Fas
+tq: Defaults N\n".
            "-s #          Y or N to specify Sanger Fastq: Defaults N\
+n".
            "-u #          Y or N to specify Illumina >= 1.3 Fastq: De
+faults N\n";

our($opt_l, $opt_s, $opt_u, $opt_b); # histogram interval
getopts('l:s:u:b:') or die $usage;
if (!defined($opt_l)) {$opt_l = 'N';}
if (!defined($opt_s)) {$opt_s = 'N';}
if (!defined($opt_u)) {$opt_u = 'N';}
if (!defined($opt_b)) {$opt_u = 'N';}

if( ( $#ARGV + 1 ) < 1 ) {
    die $usage;
}

my $infile = shift or die $usage;
my $outfile = shift or die $usage;
open OUT, ">$outfile" or die $usage;
open STDERR, ">/dev/null";

# Read in sequences from one or more fasta files

my $Id;
my $nextline;
my $qual;
my @phred_means;
my @phred1scr;
my $numseqs = 0;
my @total;
my @count;
my @min;
my @max;
my @stddev;
my @matrix;
my @Q1, my @Q2;

for(my $j = 0; $j < 70; $j++) {
  $total[$j] = 0;
  $count[$j] = 0;
  $min[$j] = 100;
  $max[$j] = 0;
}

for(my $i = 0; $i < 40; $i++) {
    for(my $j = 0; $j < 60; $j++) {
    $matrix[$i][$j] = 0;
    }
}

#automatic-1fasta or fastq detection
  open(FASTA, $infile) or die "Can't open file $infile\n";
  while (<FASTA>) {
      ##if Fastq format
    if(/^@(.+)\s/) {
      $Id = $1;
      my $nextline = <FASTA>;
      if($nextline =~ /^(\S+)$/) {
        my $nextline = <FASTA>;
        if($nextline =~ /^\+(.*)\s/) {
          my $nextline = <FASTA>;
          if($nextline =~ /^(\S+)$/) {
            $qual = $1;
            $numseqs += 1;
    ##returns an array with quality raw scores
            @phred1scr = fastq_processing($qual);
            my $i = 0;
    ##adds each returned array score to the correct $total[position]
            foreach my $score (@phred1scr) {
              $score = round($score);
              $total[$i] += $score;
          $count[$i] += 1;

          if($score > $max[$i]) { $max[$i] = $score; }
          if($score < $min[$i]) { $min[$i] = $score; }
          if($score >= 0 and $score <= 40) { $matrix[$score][$i]++; }
              $i++;
            }
          }
        }
      }
    }
  }
close (FASTA);

##calculates the means
for my $j (0..69) {
  last if($count[$j] == 0);
  my $avg = $total[$j] / $count[$j];
  push (@phred_means, $avg);
}

#print Dumper(\@matrix);

##calculate the stddevs and quartiles
#if(@val60) {
#  my $total2 = 0;
#  foreach my $num (@val60) {
#    $total2 += ($phred_means[59]-$num)**2;
#  }
#  my $mean2 = $total2 / (scalar @val60);
#  my $std_dev = sqrt($mean2);
#  push(@stddev, $std_dev);
#}


fastq_dist(\@phred_means);
print OUT "\n\nScore\t\tCount\n";
#my @col = get_col(\@matrix, 38);
#print Dumper(\@col);
for my $i (0..$#matrix) {
    my @col = map { $_->[$i] } \@matrix;
    print Dumper(\@col);
    my $score = score_count(@col);
#  print OUT "$i\t\t$score\n";
}

###########################################
##### Sub Processing Routines follow ######
###########################################
sub round {
    my($number) = shift;
    return int($number + .5);
}

sub score_count {
    my @arr = shift;
    my $total = 0;
    foreach my $score (@arr) {
      print "@$score ";
    $total += @$score;
    }
    print "$total\n";
    return $total;
}

sub fastq_processing {
  my $raw_qual = shift;
  my $i = 1;
  my @phred1seq;

  my @bases = split(//, $raw_qual);

  #different phred score calculations depending on fastq version
  foreach my $byte (@bases) {
    if($opt_l eq 'Y') {
      my $phred = (10 * log(1 + 10 ** ((ord($byte) - 64) / 10.0)) / lo
+g(10));
      push (@phred1seq, $phred);
    }
    if($opt_s eq 'Y') {
      my $phred = ord($byte) - 33;
      push (@phred1seq, $phred);
    }
    if($opt_u eq 'Y') {
      my $phred = ord($byte) - 64;
      push (@phred1seq, $phred);
    }
  }
  return @phred1seq;
}

sub fastq_dist {
  my @means = shift;
  my $i;
  my $text;
  if($opt_l eq 'Y') {
    $text = "Solexa/Illumina < 1.3 scoring";
  } elsif ($opt_s eq 'Y') {
    $text = "Sanger scoring";
  } elsif ($opt_u eq 'Y') {
    $text = "Illumina >= 1.3 scoring";
  }
  print OUT "Base\nPos\tCount\tMin\tMax\tStd Dev\tMean Phred\tHistogra
+m - $text\n";

  my $lrgstpos = 0;
  my $mean_arr = $means[0];

  foreach my $mean (@$mean_arr) {
    ##get's the number of phred values per sequence
      if($mean > 0) {
        $lrgstpos += 1;
      }
    }

    #loops through the total number of postions and gets the mean for 
+each
    for my $i (0 .. ($lrgstpos - 1)) {
      my $mean = $$mean_arr[$i];
      my $pos = $i + 1;
      print OUT "$pos\t$count[$i]\t";
#      printf OUT "%.2f\t%.2f\t%.2f\t%.2f\t", $min[$i], $max[$i], $std
+dev[$i], $mean;
      my $histogram = "|" x $mean;
      print OUT $histogram."\n";
    }
    print OUT "\n";
}
[download]

inputfile

@SRR006331.1 201WGAAXX:1:1:782:676 length=36
TTTTAGGTCAAAATCAGTTCTATTAGCAACTCCTAA
+SRR006331.1 201WGAAXX:1:1:782:676 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR006331.2 201WGAAXX:1:1:821:689 length=36
TCTTAAGTTTTTTGCATATTCATAATTATATAAATC
+SRR006331.2 201WGAAXX:1:1:821:689 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR006331.3 201WGAAXX:1:1:845:648 length=36
TTTTATTGCCATTTTTGTCAACAAAATGTATTCCAT
+SRR006331.3 201WGAAXX:1:1:845:648 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR006331.4 201WGAAXX:1:1:863:661 length=36
TATGAAGTTACTAAAAAAATAATTATTGATAAGATT
+SRR006331.4 201WGAAXX:1:1:863:661 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR006331.5 201WGAAXX:1:1:521:222 length=36
TAATAATAACTAGTTCATCTTTAGCCCTAGTTATAG
+SRR006331.5 201WGAAXX:1:1:521:222 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR006331.6 201WGAAXX:1:1:774:423 length=36
TTAACTAAATCCTCTTGATTTATAGAGGCTTTTTTT
+SRR006331.6 201WGAAXX:1:1:774:423 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR006331.7 201WGAAXX:1:1:658:111 length=36
TACTTTATCACCATTAAAATAAAGCTGATTTTCTGA
+SRR006331.7 201WGAAXX:1:1:658:111 length=36
AIIIIIIIIIIIIIIAIIIIIIIIIIIIIIIIIIII
@SRR006331.8 201WGAAXX:1:1:413:670 length=36
TTTATCACATTTATCTAATTCTTCATATTGTGCTTT
+SRR006331.8 201WGAAXX:1:1:413:670 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR006369.8 201WGAAXX:1:1:413:670 length=36
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+SRR006369.8 201WGAAXX:1:1:413:670 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[download]

With the input file named test.txt, run the script as

perl fastq_qualquartileBIN.pl -s Y test.txt testing.out
[download]

Comment on 2d Array - making an array for the column then adding the values Select or Download Code

Replies are listed 'Best First'.
Re: 2d Array - making an array for the column then adding the values by desemondo (Hermit) on Sep 21, 2009 at 02:23 UTC
sigh... Its all good - we've all been here and done something silly at some point or other :) What your score_count sub is doing is summing the element counts of the arrays rather than the contents of them... Try this: `sub score_count { my @arr = shift; my $total = 0; foreach my $score (@arr) { print "@$score "; foreach my $i (@$score){ $total += $i; } } print "$total\n"; return $total; }` [download]	[reply] [d/l]
Re^2: 2d Array - making an array for the column then adding the values by ikegami (Patriarch) on Sep 21, 2009 at 02:31 UTC
Good catch. What follows is a more detailed explanation. The original code `$total += @$score;` [download] adds an array to `$total`. What's the value of an array? Well, arrays don't naturally have a value. It turns out it's very convenient for arrays to return the number of elements it contains as its value. It allows us to replace calls to (non-existent) function `array_length(@array)` with just `@array`, a very common operation. If you want to sum the values of the array's elements, you'll have to sum the values of the array's elements. I showed two ways of doing that in my earlier post.	[reply] [d/l] [select]
Re^2: 2d Array - making an array for the column then adding the values by hansoffate (Novice) on Sep 21, 2009 at 02:37 UTC
Thanks for your great explanation and code! This fixed the problem.	[reply]
Re^3: 2d Array - making an array for the column then adding the values by desemondo (Hermit) on Sep 21, 2009 at 02:48 UTC
Glad I could help ;)	[reply]
Re: 2d Array - making an array for the column then adding the values by ikegami (Patriarch) on Sep 21, 2009 at 02:06 UTC
You say your problem is with extracting a column and summing it. That only takes one line (three without a module), so why did you present so much code? `use List::Util qw( sum ); my $sum = sum map { $array[$_][$col] } 0..$#array;` [download] `my $sum = 0; for my $row (0..$#array) { $sum += $array[$row][$col]; }` [download] Please locate your problem, then come back if you can't fix it once you locate it. By the way, `my $nextline = <FASTA>; if($nextline =~ /^(\S+)$/) { my $nextline = <FASTA>; if($nextline =~ /^\+(.*)\s/) { my $nextline = <FASTA>; if($nextline =~ /^(\S+)$/) { $qual = $1;` [download] causes lines to be skipped if an "`if`" is ever false. You want redo. And the unused captures needlessly slow down your matches. `my $nextline = <FASTA>; redo if $nextline !~ /^\S+$/; $nextline = <FASTA>; redo if $nextline !~ /^\+/; $nextline = <FASTA>; redo if $nextline !~ /^\S+$/; chomp( my $qual = $nextline );` [download]	[reply] [d/l] [select]
Re^2: 2d Array - making an array for the column then adding the values by hansoffate (Novice) on Sep 21, 2009 at 02:30 UTC
Thanks for your help. My PI doesn't want me to use any other modules other then the default ones that come installed with perl, so I can't use List::Util. I already have the the column assigned to a 1D array on line 121, but I believe it is all referenced. If I try to use this sub with a passed in @col, the total is calculated incorrectly. `sub score_count { my @arr = shift; my $total = 0; foreach my $score (@arr) { print "@$score "; ##prints actual values; If @ taken out, +values printed are like ARRAY(0x95533e0) $total += @$score; ##if @ is taken out the summation is somewhe +re ~140915576; however, with the @, the values are still incorrect. } print "$total\n"; return $total; }` [download] Are you suggesting to not even use the column slice so it's a 1D array? Thanks for letting me know about if construct issue but I do want to skip the line if it is false so that code should be correct. Thanks for the help	[reply] [d/l]
Re^3: 2d Array - making an array for the column then adding the values by ikegami (Patriarch) on Sep 21, 2009 at 02:35 UTC
List::Util does come with Perl. Thanks for isolating the problem. I was explaining it here while you were writing your post. The only thing I didn't cover is that references evaluate to the address of the referenced item when they are treated as a number.	[reply]
Re^3: 2d Array - making an array for the column then adding the values by toolic (Bishop) on Sep 21, 2009 at 03:03 UTC
My PI doesn't want me to use any other modules other then the default ones that come installed with perl, so I can't use List::Util. Firstly, List::Util is a core module, which means that it is one of "the default ones that come installed with perl". Feel free to use it, just as you are using Getopt::Std. Update: I didn't see ikegami's comment relating to this. Secondly, even if it were not a core module, you could reuse some of the module's source code. So, tell your PI (whatever that is) to put that in his pipe and smoke it!	[reply]
Re: 2d Array - making an array for the column then adding the values by ELISHEVA (Prior) on Sep 21, 2009 at 06:01 UTC
Whittling down code to the bare minimum needed to show a problem is an essential part of the art of programming. Learning this art will save you oodles of time debugging time, teach you a great deal about how code and data interconnect, and make it much easier for you to get help on problems that really have you stuck. There are no cookbook recipes but a few guidelines may help you in the future whether you are posting a problem here or just trying to find out a solution on your own: Work on a copy of the original code. This way you will have less fear of messing things up. Be experimental! Remove things, add them back. See what happens. Continue to remove things until you stop having the problem or run into compilation errors. Put them back until the problem reappears. Repeat this process over and over until you find the right combination of things to remove. Get rid of the command line processing. Unless the interpretation of command line arguments is the problem, you can reduce the amount of code by replacing command line/ui processing code with code that assigns hard coded values to the variables set via the command line or interactive user interface. Using hard coded values also makes it easier for others to reproduce the problem, should you decide to post. In your case, that would mean assigning fixed values to `$inputfile`, `$outputfile` and the `$opt_...` variables. Reduce your input data to a single line (or two). Usually you can demonstrate the problem using just one or two lines of data. In your case you needed just two lines that should have had a total non-zero score but didn't. Follow your code flow. In your reduced sample you want only the code that actually impacts the problem output. This means you will have to visually trace the flow of your code. This can be hard to do if you are a new programmer because efficiently following code flow requires some fluency in reading code. Fortunately, the more you do the faster you will get and the less awkward it will feel. Experimenting by commenting out and adding back in code can help confirm your guesses about how each bit of code is connected to the next. In your case, your program contains both score totals and summary statistics (min, max, means), but your problem is only with the totals, so you would keep only the code needed to calculate and display totals and eliminate all code that contributes only to min,max, or means. Best, beth	[reply] [d/l] [select]
Re: 2d Array - making an array for the column then adding the values by toolic (Bishop) on Sep 21, 2009 at 03:32 UTC
Completely unrelated to your problem... If you are interested in something more WYSIWYG than your `$usage`, you could try using a HERE DOC to eliminate all the quoting and newlines: `my $usage = <<EOF usage: $0 [-flags] <fasta file> <output file> (must be all 1 type. ie all fasta NT, all fastq, all fasta fna) etc... EOF` [download] Or better yet, use POD and the Pod::Usage core module. `if( ( $#ARGV + 1 ) < 1 ) { die $usage; }` [download] Another way to code this is: `die $usage unless @ARGV;` [download]	[reply] [d/l] [select]
Re: 2d Array - making an array for the column then adding the values by Marshall (Canon) on Sep 21, 2009 at 05:08 UTC
I would like to make suggestions for this code. I just started with the UI. A small problem is that $0 is not a portable name for the current program name. I show below how to make $0 portable. I think a bigger problem is this use of opts. The normal command line usage for an option like "-l" means that I want the "l" option. If I don't mention -l, that means "no". This idea of -l Y or -l N doesn't make sense, just -l means "yes, I want -l" and no mention of -l means "no -l". Below is a standard design pattern for a command line I/F. #!/usr/bin/perl -w use strict; use Getopt::Std; use Data::Dumper; use File::Basename; my $prog_name = basename($0); my $usage = <<"USAGE"; Usage: $prog_name [-flags] <fasta file> <output file> Files must be all one type. ie (all fasta NT, all fastq, all fasta fn +a). The script is able to detect if fasta or fastq files have been suplie +d. Depending on the flags, it produces different output. $prog_name will produce a histogram, a vector distribution, and a NT/AA distribution. For fastq files, you must use a flag to specify which version. of fastq the file is formatted with. -l to specify Solexa/Illumina < 1.3 Fastq -s to specify Sanger Fastq -u to specify Illumina >= 1.3 Fastq only one option is allowed! USAGE ###### "code" starts here ########### my %opts; die "$usage" if !getopts ("lsu", \%opts); # above removes any -lu or -lus options from command line # all that is left in @ARGV are the two file paths... # above will cause $usage to be printed if unknown option, say # "-x" is encountered. die "must have 2 files: fasta infile and an output file!\n$usage" if (my ($inflile, $outfile) = @ARGV) !=2; foreach my $opt (keys %opts) #just for debugging.... { print "$opt= $opts{$opt}\n"; } # here is where you put stuff that rules out combos of options. # I am guessing that more than one option is not right die "only one option allowed!\n$usage" if (keys %opts >1); #ie only 0 or 1 options are allowed #now open the input file and the output files # #error should say "can't open $infile for read", etc.... #we are past the usage stuff... [download] Definitely DO NOT DO THIS! `open STDERR, ">/dev/null";` [download] That sends error output from like "die" to the "bit bucket"!	[reply] [d/l] [select]
Re: 2d Array - making an array for the column then adding the values by Sandy (Curate) on Sep 21, 2009 at 02:33 UTC
Hello, Please, next time, try to post only the code that exposes your problem. The problem that you directed us to is in the following `sub score_count { my @arr = shift; my $total = 0; foreach my $score (@arr) { print "@$score "; $total += @$score; } print "$total\n"; return $total; }` [download] Although there are other issues, lets look at this code. `@arr` is being fed a single element (via the `shift` command). This makes @arr an array of an array Your code `print @$score` I believe misled you to believe that use `@$score` will give you all the values one at a time. It does only because print processes an array in list mode. In the next line, the code `$total += @$score;` processes the array in SCALAR context, which will return the TOTAL number of elements in your array (which is 60), NOT the sum of each value. What you need is: `foreach my $score (@arr) { foreach my $i (@$score) { $total += $i; } }` [download] Or, better yet (notice `$arr` instead of `@arr`): `sub score_count { my $arr = shift; my $total = 0; foreach my $score (@$arr) { $total+= $score; } print "total: $total\n"; return $total; }` [download]	[reply] [d/l] [select]