Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Dear Monks,

I seek your help in identifying the best and fastest algorithm for my bioinformatics problem. Please allow me to explain (got deleted when I edited my original post)

For every run, I have 2 input files, each with up to a few million DNA sequences (sometimes much fewer). From input file 1, I will infer the length distribution of all sequences i.e. how many times a certain length is observed. GOAL: To extract from input file 2, as many DNA sequences and with an identical length distribution as in input file 1. The Perl script I have written is quick ONLY for small input file pairs. IT does not scale up at all for large inputs it seems. That is the help I seek from you. Thank you very much!

UPDATE1: Script edited to improve readability

UPDATE2: I checked my script, and I have to admit, I do not see changes / errors compared to the previous time that I ran it. So the question is how can I modify it to make it much faster, because I can only imagine that the large input file sizes need to be compensated for by a speedier version. So modifications in logic / syntax etc will be MOST welcome. Thank you!

#! usr/bin/perl # Length_dstrbtn_seq_extractor.pl # This PERL script accepts two input files - the first input is a mult +ifasta file containing alternating lines of fasta headers and sequenc +es generated using using the Perl script fix_multifasta0.pl. # This Perl scripts calculates the frequency distribution of sequences + of all sizes in this input file. # The second input file is the user chosen multifasta sequence file fr +om which sequences of the same frequency distribution of lengths will + be extracted. # The only output is multifasta sequence file. The fasta headers speci +fy source sequence, start coordinate and stop coordinate. # Syntax : perl Length_dstrbtn_seq_extractor.pl <input1> <input2-sourc +e4extraction> #********************# #PROCESS INPUT FILES #********************# use strict; use warnings; my $start_time = time; my $input1 = shift @ARGV; my $input2 = shift @ARGV; my (@lengths, @source, @distrbtn, @output); open(IN1, '<', $input1) or die "Can't read source file $input1 : $!\n" +; while(<IN1>) { chomp; if ($_=~ m/\>/) { #looks for match to the '>' character in the header line # if match to fasta header, does nothing } elsif ($_!~ m/\>/) { #looks for non-match to the '>' character in the sequence line push @lengths, length($_); # if match to fasta sequence, calculates and collects length of sequen +ce } } close IN1; open(IN2, '<', $input2) or die "Can't read source file $input2 : $!\n" +; while(<IN2>) { chomp; if ($_=~ m/\>/) { #looks for match to the '>' character in the header line push @source, $_; # if match to fasta header, includes in array } elsif ($_!~ m/\>/) { #looks for non-match to the '>' character in the sequence line push @source, $_; # if match to fasta sequence, includes sequence in array } } close IN2; #********************# # CALCULATE LENGTH DISTRIBUTION FROM INPUT FILE #1 #********************# my @sorted = sort {$a <=> $b}@lengths; my %seen = (); my @uniques = grep { !$seen{$_}++ } @sorted; foreach my $len(@uniques) { push @distrbtn, $len; my $index = 0; my $count = 0; while($index <= $#sorted) { if($len == $sorted[$index]) { $count++; } $index++; } push @distrbtn, $count; } my %dstrbtn_hash = @distrbtn; # hash of predicted sORF length (key) an +d number of times (value) that size is observed in the multifasta inp +ut file #1 # print @distrbtn, "\n"; # works thus far #********************# # EXTRACT SEQUENCES (RANDOMLY) FROM INPUT2, WITH IDENTICAL LENGTH DIST +RIBUTION OF INPUT 1 #********************# my $header_count = 1; foreach my $key (keys %dstrbtn_hash) { my $size = $key - 3; # the sORF size is calculated only for coding reg +ion with the stop codon length (3 nucleotides - TAA, TGA or TAG) remo +ved to obtain JUST the coding sequence's length my $freq = $dstrbtn_hash{$key}; # the number of times that a certain s +ORF size is seen in the input file #1, as calculated by the earlier p +ortion of this Perl script my $iteration = 1; while ($iteration <= $freq) { my ($temp_source_seq, $temp_source_seq_len); EXTRACT: # choose a random sequence ONLY if it is as long or longer # than the length of sequence that needs to be extracted out { my $chosen=int(rand($#source)); $chosen++ if(($chosen%2) == 0); # even numbered array index is for fasta header, and therefore not of +interest, # instead, we are interested in odd numbered array index number for th +e fasta sequence # which precedes or follows the header $temp_source_seq = $source[$chosen]; $temp_source_seq_len = length ($temp_source_seq); redo EXTRACT if($temp_source_seq_len < $size); } START: # choose a random start coordinate ONLY if the substring starting at t +hat position # falls within the end coord of the sequence that it is part of { my $random_start_coord = int(rand($temp_source_seq_len +)); redo START if(($random_start_coord + $size) > $tem +p_source_seq_len); my $extracted_seq = substr($temp_source_seq, $random_start +_coord, $size); push @output,">".$header_count."extracted_seq" +, "\n"; push @output, $extracted_seq, "\n"; $header_count++; } $iteration++; } } #********************# # WRITE TO OUTPUT FILE, REPORT TIME TAKEN FOR CALCULATION #********************# my $filename = $input1.$input2."_extracted_seqs.fasta"; open (OUT, '>', $filename) or die "Can't write to file $filename : + $!\n"; print OUT @output; my $end_time = time; my $duration = ($end_time - $start_time)/60; print "Thank you for your patience, this Perl script operation has com +pleted, it took $duration minutes, good bye!", " \n"; close OUT; #********************#

In reply to Speeding up stalled script by onlyIDleft

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (3)
As of 2024-04-26 00:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found