comment on

Dear Monks,

I seek your help in identifying the best and fastest algorithm for my bioinformatics problem. Please allow me to explain (got deleted when I edited my original post)

For every run, I have 2 input files, each with up to a few million DNA sequences (sometimes much fewer). From input file 1, I will infer the length distribution of all sequences i.e. how many times a certain length is observed. GOAL: To extract from input file 2, as many DNA sequences and with an identical length distribution as in input file 1. The Perl script I have written is quick ONLY for small input file pairs. IT does not scale up at all for large inputs it seems. That is the help I seek from you. Thank you very much!

UPDATE1: Script edited to improve readability

UPDATE2: I checked my script, and I have to admit, I do not see changes / errors compared to the previous time that I ran it. So the question is how can I modify it to make it much faster, because I can only imagine that the large input file sizes need to be compensated for by a speedier version. So modifications in logic / syntax etc will be MOST welcome. Thank you!

#! usr/bin/perl
# Length_dstrbtn_seq_extractor.pl

# This PERL script accepts two input files - the first input is a mult
+ifasta file containing alternating lines of fasta headers and sequenc
+es generated using using the Perl script fix_multifasta0.pl.
# This Perl scripts calculates the frequency distribution of sequences
+ of all sizes in this input file.
# The second input file is the user chosen multifasta sequence file fr
+om which sequences of the same frequency distribution of lengths will
+ be extracted.
# The only output is multifasta sequence file. The fasta headers speci
+fy source sequence, start coordinate and stop coordinate.

# Syntax : perl Length_dstrbtn_seq_extractor.pl <input1> <input2-sourc
+e4extraction>

#********************#
#PROCESS INPUT FILES
#********************#
use strict;
use warnings;

my $start_time = time;

my $input1 = shift @ARGV;
my $input2 = shift @ARGV;
my (@lengths, @source, @distrbtn, @output);

open(IN1, '<', $input1) or die "Can't read source file $input1 : $!\n"
+;

while(<IN1>)        {
chomp;    
    if ($_=~ m/\>/)        {                        
#looks for match to the '>' character in the header line
# if match to fasta header, does nothing
        }
    elsif ($_!~ m/\>/)        {                        
#looks for non-match to the '>' character in the sequence line
    push @lengths, length($_);                
# if match to fasta sequence, calculates and collects length of sequen
+ce
        }
}
close IN1;


open(IN2, '<', $input2) or die "Can't read source file $input2 : $!\n"
+;
while(<IN2>)        {
chomp;    
    if ($_=~ m/\>/)        {                        
#looks for match to the '>' character in the header line
    push @source, $_;                                                
# if match to fasta header, includes in array
        }
    elsif ($_!~ m/\>/)        {                        
#looks for non-match to the '>' character in the sequence line
    push @source, $_;                
# if match to fasta sequence, includes sequence in array
        }
}
close IN2;

#********************#
# CALCULATE LENGTH DISTRIBUTION FROM INPUT FILE #1
#********************#
my @sorted = sort {$a <=> $b}@lengths;

my %seen = ();
my @uniques = grep { !$seen{$_}++ } @sorted;

foreach my $len(@uniques)        {
push @distrbtn, $len;
my $index = 0;
my $count = 0;
    while($index <= $#sorted)        {
        if($len == $sorted[$index])
        {
        $count++;
        }
        $index++;
        }
    push @distrbtn, $count;
}

my %dstrbtn_hash = @distrbtn; # hash of predicted sORF length (key) an
+d number of times (value) that size is observed in the multifasta inp
+ut file #1
# print @distrbtn, "\n"; # works thus far

#********************#
# EXTRACT SEQUENCES (RANDOMLY) FROM INPUT2, WITH IDENTICAL LENGTH DIST
+RIBUTION OF INPUT 1
#********************#

my $header_count = 1;
foreach my $key (keys %dstrbtn_hash) {

my $size = $key - 3; # the sORF size is calculated only for coding reg
+ion with the stop codon length (3 nucleotides - TAA, TGA or TAG) remo
+ved to obtain JUST the coding sequence's length
my $freq = $dstrbtn_hash{$key}; # the number of times that a certain s
+ORF size is seen in the input file #1, as calculated by the earlier p
+ortion of this Perl script

my $iteration = 1;
while ($iteration <= $freq) {
    
    my ($temp_source_seq, $temp_source_seq_len);
    
EXTRACT: 
# choose a random sequence ONLY if it is as long or longer 
# than the length of sequence that needs to be extracted out

    {
        my $chosen=int(rand($#source));
        $chosen++ if(($chosen%2) == 0);
# even numbered array index is for fasta header, and therefore not of 
+interest, 
# instead, we are interested in odd numbered array index number for th
+e fasta sequence 
# which precedes or follows the header
        
        $temp_source_seq = $source[$chosen];
        $temp_source_seq_len = length ($temp_source_seq);
        
        redo EXTRACT if($temp_source_seq_len < $size); 
    }

    
        START:
# choose a random start coordinate ONLY if the substring starting at t
+hat position
# falls within the end coord of the sequence that it is part of

            {
                my $random_start_coord = int(rand($temp_source_seq_len
+));
                
                    redo START if(($random_start_coord + $size) > $tem
+p_source_seq_len);
                
            my $extracted_seq = substr($temp_source_seq, $random_start
+_coord, $size);
                        push @output,">".$header_count."extracted_seq"
+, "\n";
                        push @output, $extracted_seq, "\n";
                        $header_count++;
            }

    $iteration++;
        }
}

#********************#
#             WRITE TO OUTPUT FILE, REPORT TIME TAKEN FOR CALCULATION
#********************#

    my $filename = $input1.$input2."_extracted_seqs.fasta";
    open (OUT, '>', $filename) or die "Can't write to file $filename :
+ $!\n";
    print OUT @output;
my $end_time = time;
my $duration = ($end_time - $start_time)/60;
print "Thank you for your patience, this Perl script operation has com
+pleted, it took $duration minutes, good bye!", " \n";
    close OUT;

#********************#
[download]

In reply to Speeding up stalled script by onlyIDleft

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


No such thing as a small change
	PerlMonks