http://www.perlmonks.org?node_id=1200730

TJCooper has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that essentially extracts DNA sequences at specific locations and piles them up to calculate the frequency of each letter (A/G/C/T):
use strict; use warnings; use Bio::SeqIO; use Getopt::Long; use File::Basename; use List::Util qw(all); my $scriptname = basename($0); #Obtain script-name my $outext = '.txt'; #Output .file-extension my ($peaks, $fasta, $width, $mode, $exclusion, %list, $chr, $min, $max +, %sequences); my $usage = "Usage: $scriptname -i <HistogramFile> -r <ReferenceFASTA> + -w <Width> -m <Mode: Pos/Freq> -e <ExclusionList (Optional)>"; + #Error/usage message GetOptions('i=s' => \$peaks, 'r=s' => \$fasta, 'e=s' => \$exclusion, 'w=i' => \$width, 'm=s' => \$mode) or die("\n$usage\n"); die("\nError: Arguments or -flags are missing and/or incorrectly speci +fied.\n\n$usage\n\n") unless all {defined} $peaks, $fasta, $width, $m +ode; my @files = glob($peaks."/*.txt"); my $chk = scalar(@files); print "\nFailed to detect any valid files within the current directory +.\n\n" if $chk == 0; exit if $chk == 0; if (defined $exclusion) { open my $IN, '<', $exclusion or die "$!"; while (<$IN>) { chomp $_; ($chr, $min, $max) = split('\s+', $_); push @{$list{$chr}}, [$min,$max]; } } my $seqio = Bio::SeqIO->new(-file => $fasta); while(my $seqobj = $seqio->next_seq) { my $id = $seqobj->display_id; my $seq = $seqobj->seq; $sequences{$id} = $seq; } for my $file (@files) { open my $IN2, '<', $file or die "$!"; (my $sample = basename($file)) =~ s/.txt//; my $dir = dirname($file); my $outfile = $dir."/".$sample."_SeqBias_"."$mode".$outext; open my $OUT, '>', $outfile or die "$!"; my (@flankseq, $seqEX); sub seqstore { my @rcd = @_; if ($mode eq "Pos") { push (@flankseq, $rcd[0]); } elsif ($mode eq "Freq") { push (@flankseq, $rcd[0]) for (1..$rcd[1]); } return; } <$IN2> for (1..1); while (<$IN2>) { chomp $_; my(@F) = split("\t", $_); next if grep {$F[1] >= $_->[0] && $F[1] <= $_->[1]} @{ $list{$ +F[0]}}; next if $F[1]-$width < 1; if ($F[3] > 0) { $seqEX = substr($sequences{$F[0]},$F[1]-1-$width,$width) # +Extracts Xbp upstream (not incl. 5' end) .substr($sequences{$F[0]},$F[1]-1,$width+1); $seqEX =~ tr/GATC/CTAG/; $seqEX = reverse($seqEX); seqstore($seqEX,$F[3]); } if ($F[2] > 0) { $seqEX = substr($sequences{$F[0]},$F[1]-1-$width,$width) # +Extracts Xbp upstream (not incl. 5' end) .substr($sequences{$F[0]},$F[1]-1,$width+1); seqstore($seqEX,$F[2]); } } my %freq; my $inc = 1 / @flankseq; for my $FS (@flankseq) { $freq{substr $FS, $_, 1}[$_] += $inc for 0..length($FS)-1; } printf($OUT "%s\t"."%s\t" x ((keys %freq)-1)."%s"."\n", "Pos",sort + keys %freq); my $label = -$width-1; for my $pos (1..length $flankseq[0]) { $label++; printf($OUT "%d\t"."%7.4f\t" x ((keys %freq)-1)."%7.4f"."\n", +$label,map{$freq{$_}[$pos-1] // 0} sort keys %freq); } } my $run_time = time() - $^T; print "\n-------------------------------------"; print "\nAnalysis Complete\n"; print "Execution Time: $run_time Seconds\n"; print "-------------------------------------\n\n";

I've recently tried to add batch processing into the script so it can handle more than 1 input file at a time. However, i've run into a problem that I cannot figure out:

Illegal division by zero at line 80, <$IN2>

Line 80 is:

my $inc = 1 / @flankseq;

When handling the first file, @flankseq is populated as expected (I push the extracted sequences into an array).

However when it comes to opening the second file - which is an identical copy of the first - and repeating the process, @flankseq remains empty leading to the error. Everything else seems to work up until that point and all variables such as $seqEX look as expected.

I have no idea why! Any suggestions?

Replies are listed 'Best First'.
Re: Array ending up empty when processing more than a single file
by Eily (Monsignor) on Oct 05, 2017 at 13:02 UTC

    The sub keyword will only define the subroutine once, and to be able to write on @flankseq it will hold a reference to the variable (because otherwise, @flankseq being lexical, it could be destroyed although the function always needs it). So while you create a new @flankseq in each loop, the function always uses the same one. This is demonstrated by:

    $\ = $/; for (1..2) { my $var = 1; print \$var, " created"; sub function { print \$var; } function(); }
    SCALAR(0x825d70) created SCALAR(0x825d70) SCALAR(0x96ec08) created SCALAR(0x825d70)
    The solution is to define seqstore outside of the loop (it's not actually required, it's just better practice) and pass the array as a paramater (as a reference, so that it is passed all at once).
    sub seqstore { my ($flankseq, $mode, @rcd) = @_; if ($mode eq "Pos") { push (@$flankseq, $rcd[0]); } elsif ($mode eq "Freq") { push (@$flankseq, $rcd[0]) for (1..$rcd[1]); } return; }
    Now you have to call the function like that: seqstore(\@flankseq, $mode, $seqEX,$F[2]);

    More info on closures, and on references

      The sub keyword will only define the subroutine once

      This is interesting and, to me at least, unexpected. Why does this only warn once instead of twice?

      perl -wle 'for(1,2){ sub moo{} }; sub moo{}' Subroutine moo redefined at -e line 1.

        Well for the same reason as my example, the inner sub is only applied once, so only the second sub is a redefinition.

        I didn't find where the fact that a given definition of a sub (except anonymous of course) will only be applied once, no matter how often the scope of its definition is entered (actually I don't think the place where a function is defined matters at all, except for namespace and during compilation), but Lexical Subroutines explains how to create what you might expect (lexically scoped function, with redefinition each time), and so implies that the normal behaviour is different.

Re: Array ending up empty when processing more than a single file
by choroba (Cardinal) on Oct 05, 2017 at 13:02 UTC
    sub seqstore is declared inside a loop. I fear that @flankseq it closes around is the variable created in the first iteration of the loop, not the next ones. Declare your named subs at the topmost level unless you really know what you're doing, and send references to the arrays you want to change as arguments to them.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,