Dear Monks,
I seek your help in identifying the best and fastest algorithm for my bioinformatics problem. Please allow me to explain (got deleted when I edited my original post)
For every run, I have 2 input files, each with up to a few million DNA sequences (sometimes much fewer). From input file 1, I will infer the length distribution of all sequences i.e. how many times a certain length is observed. GOAL: To extract from input file 2, as many DNA sequences and with an identical length distribution as in input file 1. The Perl script I have written is quick ONLY for small input file pairs. IT does not scale up at all for large inputs it seems. That is the help I seek from you. Thank you very much!
UPDATE1: Script edited to improve readability
UPDATE2: I checked my script, and I have to admit, I do not see changes / errors compared to the previous time that I ran it. So the question is how can I modify it to make it much faster, because I can only imagine that the large input file sizes need to be compensated for by a speedier version. So modifications in logic / syntax etc will be MOST welcome. Thank you!
#! usr/bin/perl
# Length_dstrbtn_seq_extractor.pl
# This PERL script accepts two input files - the first input is a mult
+ifasta file containing alternating lines of fasta headers and sequenc
+es generated using using the Perl script fix_multifasta0.pl.
# This Perl scripts calculates the frequency distribution of sequences
+ of all sizes in this input file.
# The second input file is the user chosen multifasta sequence file fr
+om which sequences of the same frequency distribution of lengths will
+ be extracted.
# The only output is multifasta sequence file. The fasta headers speci
+fy source sequence, start coordinate and stop coordinate.
# Syntax : perl Length_dstrbtn_seq_extractor.pl <input1> <input2-sourc
+e4extraction>
#********************#
#PROCESS INPUT FILES
#********************#
use strict;
use warnings;
my $start_time = time;
my $input1 = shift @ARGV;
my $input2 = shift @ARGV;
my (@lengths, @source, @distrbtn, @output);
open(IN1, '<', $input1) or die "Can't read source file $input1 : $!\n"
+;
while(<IN1>) {
chomp;
if ($_=~ m/\>/) {
#looks for match to the '>' character in the header line
# if match to fasta header, does nothing
}
elsif ($_!~ m/\>/) {
#looks for non-match to the '>' character in the sequence line
push @lengths, length($_);
# if match to fasta sequence, calculates and collects length of sequen
+ce
}
}
close IN1;
open(IN2, '<', $input2) or die "Can't read source file $input2 : $!\n"
+;
while(<IN2>) {
chomp;
if ($_=~ m/\>/) {
#looks for match to the '>' character in the header line
push @source, $_;
# if match to fasta header, includes in array
}
elsif ($_!~ m/\>/) {
#looks for non-match to the '>' character in the sequence line
push @source, $_;
# if match to fasta sequence, includes sequence in array
}
}
close IN2;
#********************#
# CALCULATE LENGTH DISTRIBUTION FROM INPUT FILE #1
#********************#
my @sorted = sort {$a <=> $b}@lengths;
my %seen = ();
my @uniques = grep { !$seen{$_}++ } @sorted;
foreach my $len(@uniques) {
push @distrbtn, $len;
my $index = 0;
my $count = 0;
while($index <= $#sorted) {
if($len == $sorted[$index])
{
$count++;
}
$index++;
}
push @distrbtn, $count;
}
my %dstrbtn_hash = @distrbtn; # hash of predicted sORF length (key) an
+d number of times (value) that size is observed in the multifasta inp
+ut file #1
# print @distrbtn, "\n"; # works thus far
#********************#
# EXTRACT SEQUENCES (RANDOMLY) FROM INPUT2, WITH IDENTICAL LENGTH DIST
+RIBUTION OF INPUT 1
#********************#
my $header_count = 1;
foreach my $key (keys %dstrbtn_hash) {
my $size = $key - 3; # the sORF size is calculated only for coding reg
+ion with the stop codon length (3 nucleotides - TAA, TGA or TAG) remo
+ved to obtain JUST the coding sequence's length
my $freq = $dstrbtn_hash{$key}; # the number of times that a certain s
+ORF size is seen in the input file #1, as calculated by the earlier p
+ortion of this Perl script
my $iteration = 1;
while ($iteration <= $freq) {
my ($temp_source_seq, $temp_source_seq_len);
EXTRACT:
# choose a random sequence ONLY if it is as long or longer
# than the length of sequence that needs to be extracted out
{
my $chosen=int(rand($#source));
$chosen++ if(($chosen%2) == 0);
# even numbered array index is for fasta header, and therefore not of
+interest,
# instead, we are interested in odd numbered array index number for th
+e fasta sequence
# which precedes or follows the header
$temp_source_seq = $source[$chosen];
$temp_source_seq_len = length ($temp_source_seq);
redo EXTRACT if($temp_source_seq_len < $size);
}
START:
# choose a random start coordinate ONLY if the substring starting at t
+hat position
# falls within the end coord of the sequence that it is part of
{
my $random_start_coord = int(rand($temp_source_seq_len
+));
redo START if(($random_start_coord + $size) > $tem
+p_source_seq_len);
my $extracted_seq = substr($temp_source_seq, $random_start
+_coord, $size);
push @output,">".$header_count."extracted_seq"
+, "\n";
push @output, $extracted_seq, "\n";
$header_count++;
}
$iteration++;
}
}
#********************#
# WRITE TO OUTPUT FILE, REPORT TIME TAKEN FOR CALCULATION
#********************#
my $filename = $input1.$input2."_extracted_seqs.fasta";
open (OUT, '>', $filename) or die "Can't write to file $filename :
+ $!\n";
print OUT @output;
my $end_time = time;
my $duration = ($end_time - $start_time)/60;
print "Thank you for your patience, this Perl script operation has com
+pleted, it took $duration minutes, good bye!", " \n";
close OUT;
#********************#
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.