Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

finding open reading frames

by ic23oluk (Sexton)
on Jun 05, 2017 at 14:37 UTC ( [id://1192179]=perlquestion: print w/replies, xml ) Need Help??

ic23oluk has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks, I try to find all possible open reading frames in a sequence and print them out. I have already written a working program, but I want an alternative that is more efficient (I currently have two loops). The idea is to store the position of starts, store the position of the stops and then look for pairs that are in frame. The latter point is what makes it difficult to me. How can I look for in-frame pairs? here's the code for my first solution that works:

my $sequence = "sequence.fa"; my @starts; open (FASTA, "$sequence") || die "Cannot open $sequence: $!.\n"; chomp (my @seq = <FASTA>); close FASTA; shift @seq; $sequence = join ('', @seq); @seq = split ('', $sequence); for (my $i=0; $i<=$#seq-5; $i++){ ## -5 könnte man weglassen # start codon: ATG # stopp codon: TAG, TAA, TGA # multiple of 3 between start and stop if ($seq[$i] eq 'A' && $seq[$i+1] eq 'T' && $seq[$i+2] eq 'G') +{ push (@starts, $i); for (my $j=$i+3; $j<=$#seq-2; $j=$j+3){ if (($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2 +] eq 'A') || ($seq[$j] eq 'T' && $seq[$j+1] eq 'G' && $seq[$j+2] eq + 'A') || ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq + 'G')){ print "ORF: $i-", ($j+2), "\n"; last; ##lasts the j loop } } } }

I would be very pleased if anyone wrote an alternative algorithm according to my aforementioned demands. Thanks in advance!

Replies are listed 'Best First'.
Re: finding open reading frames
by tybalt89 (Monsignor) on Jun 05, 2017 at 15:16 UTC

    Lacking a test case, I'm guessing a little...

    #!/usr/bin/perl # http://perlmonks.org/?node_id=1192179 use strict; use warnings; my @seq = map qw( A C G T )[rand 4], 1..1e2; # fake for missing test c +ase $_ = join '', @seq; print "$_\n\n"; print "ORF: $-[0]-", $+[1] - 1, " $&$1\n" while /ATG(?=((?:.{3})*?(?:TAG|TAA|TGA)))/g;

    I'm also printing a little extra for checking purposes.

Re: finding open reading frames
by thanos1983 (Parson) on Jun 05, 2017 at 14:53 UTC

    Hello ic23oluk,

    Can we have a sample of your data (couple of lines) not the whole file so we can replicate the script and see how we can improve it?

    Update: I see from your script $sequence = join ('', @seq); @seq = split ('', $sequence); it looks you are joining and then splitting the array. I am not sure what are you doing there this is why I need to see your input.

    Update2: Another possible approach instead of the complex if conditions that you have you could use grep but I need to practice with your code to see what is the expected output as you said that the script works.

    Update3: From a quick search online I think you should be looking into getting familiar with BioPerl.

    Update4: I think you should be looking for something like this Bio::SeqIO? You can also find some relevant information from similar question Parse DNA fasta file.

    Update5: I noticed that you have not provided us yet with a sample of input data and desired output, so I can not really try tackle and help you. But as I was thinking about your problem, I think that you might be interested in using the index function. It will return the sub string position withing your string of data. I case you want to check for multiple sub strings just use the last location and after as a position. Read more about it on the documentation.

    Looking forward to your update.

    Hope this helps.

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      the input file in Fasta format has always a one line header followed by a sequence, i.e.:

      >NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA

      the join and split operations provide an array in which every Nucleotide (letter) is an element (instead of every line).
      thanks

        Hello ic23oluk,

        I do not have that much time to modify and experiment with your code but a few minor modifications can make your script faster:

        #!usr/bin/perl use strict; use warnings; my @starts = (); while (<>) { chomp; next if $. < 2; # Skip first line my @seq = split '', $_; for (0..$#seq-5){ if ($seq[$_] eq 'A' && $seq[$_+1] eq 'T' && $seq[$_+2] eq 'G') { push (@starts, $_); for (my $j=$_+3; $j<=$#seq-2; $j=$j+3){ if ( ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'A +') || ($seq[$j] eq 'T' && $seq[$j+1] eq 'G' && $seq[$j+2] eq 'A +') || ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'G +') ) { print "ORF: $_-", ($j+2), "\n"; last; ##lasts the j loop } } } } } continue { close ARGV if eof; # reset $. } __END__ $ perl bio_test.pl sequence.fa ORF: 16-75 ORF: 50-136 ORF: 133-357 ORF: 232-357 ORF: 252-290 ORF: 287-334 ORF: 305-334 ORF: 363-380 ORF: 394-450 ORF: 489-623 ORF: 575-724 ORF: 651-797 ORF: 689-724 ORF: 724-936 ORF: 854-859 ORF: 859-936 ORF: 954-959 ORF: 1051-1200 ORF: 1145-1255 ORF: 1228-1275 ORF: 1249-1275

        Update: Another possible way, this is how I would approach it. This is 3/4 of the solution, I do not want to give you the full solution as this looks like school work. It should not be really difficult to continue from here.

        As I said before use index for multiple occurrences in the string.

        #!usr/bin/perl use strict; use warnings; use Data::Dumper; my $substring = 'ATG'; while (<>) { chomp; next if $. < 2; # Skip first line my @seq = split ' ', $_; # print Dumper \@seq; # Remove hashtag and see how the input looks for (@seq) { my $found = index($_, $substring); while ($found != -1) { # check for multiple occurrences in the sam +e string print "Found $substring at $found\n"; # Do your calculations h +ere my $offset = $found + 1; # keep track of offset in your counte +r as you do on your primary script and should have it $found = index($_, $substring, $offset); } } } continue { close ARGV if eof; # reset $. } __END__ $ perl bio_test.pl sequence.fa Found 'ATG' at 16 Found 'ATG' at 50 Found 'ATG' at 62 Found 'ATG' at 19 Found 'ATG' at 39 Found 'ATG' at 3 Found 'ATG' at 21 Found 'ATG' at 8 Found 'ATG' at 39 Found 'ATG' at 63 Found 'ATG' at 7 Found 'ATG' at 12 Found 'ATG' at 50 Found 'ATG' at 14 Found 'ATG' at 2 Found 'ATG' at 7 Found 'ATG' at 31 Found 'ATG' at 20 Found 'ATG' at 50 Found 'ATG' at 57 Found 'ATG' at 9 Found 'ATG' at 21 Found 'ATG' at 42 Found 'ATG' at 65

        Notice that the index is going back to zero, keep track of it and it should give you the same results as your primary script. Let us know if there is something that is not clear.

        Update2: If the space in between the sequences in normal you do not need to split and you can loop through like this:

        #!usr/bin/perl use strict; use warnings; my $substring = 'ATG'; while (<>) { chomp; next if $. < 2; # Skip first line my $found = index($_, $substring); while ($found != -1) { # check for multiple occurrences in the sam +e string print "Found $substring at $found\n"; # Do your calculations here my $offset = $found + 1; # keep track of offset in your counter as + you do on your primary script and should have it $found = index($_, $substring, $offset); } } continue { close ARGV if eof; # reset $. } __END__ $ perl bio.pl sequence.fa Found ATG at 16 Found ATG at 50 Found ATG at 133 Found ATG at 232 Found ATG at 252 Found ATG at 287 Found ATG at 305 Found ATG at 363 Found ATG at 394 Found ATG at 489 Found ATG at 575 Found ATG at 651 Found ATG at 689 Found ATG at 724 Found ATG at 854 Found ATG at 859 Found ATG at 954 Found ATG at 1014 Found ATG at 1044 Found ATG at 1051 Found ATG at 1145 Found ATG at 1228 Found ATG at 1249 Found ATG at 1272

        Now you just need to apply an if condition and voila. :D

        Hope this helps.

        Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: finding open reading frames
by Anonymous Monk on Jun 05, 2017 at 15:20 UTC
    Ok, so it looks like the rules are:
    • ATG can be be a start codon, or it can code for Methionine.
    • * TAG, TAA, and TGA are always end codons.
    • * A paired start and end have to be separated by a multiple of three bases.
    Here's a one-pass solution:
    my @start = ([],[],[]); my $sequence = 'ATGATGAATGAATAGTAGATAG'; for my $i (0 .. length($sequence)-3) { my $triplet = substr($sequence, $i, 3); if ($triplet eq 'ATG') { push @{$start[$i%3]}, $i; } elsif ($triplet eq 'TAG' || $triplet eq 'TAA' || $triplet eq 'TGA') + { for my $j (@{$start[$i%3]}) { print $j, '..', $i+2, "\n"; } @{$start[$i%3]} = (); } }
    Output:
    0..14 3..14 7..21
    Explanation of the output:
    ATGATGAATGAATAGTAGATAG STA123123123END STA123123END STA123123123END
Re: finding open reading frames
by glasswalk3r (Friar) on Jun 06, 2017 at 13:57 UTC

    I'll not even try to understand what your code does (and other already provided some material for you on that), but here are some general guidelines:

    1. You should run a profiler over it: see Devel::NYTProf and identify the areas that are using more time/CPU so you can start with them.
    2. Since you are working inside loops, check out if you can get ride of them. If not, check some algorithms like Schwartzian_transform.
    3. Benchmark your code before and after the changes you implement to see if they are effective. See Dumbbench.
    Alceu Rodrigues de Freitas Junior
    ---------------------------------
    "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
Re: finding open reading frames
by johngg (Canon) on Jun 06, 2017 at 18:30 UTC

    A one loop solution that uses regexen to find the positions of all starters and stoppers then, in a loop over starters, uses List::Util::first() to find the next stopper that fits the criteria.

    use strict; use warnings; use 5.018; use List::Util qw{ first }; open my $fastaFH, q{<}, \ <<EOD or die $!; >NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA EOD chomp( my @seqLines = <$fastaFH> ); close $fastaFH or die $!; my $sequence = join q{}, @seqLines[ 1 .. $#seqLines ]; my @starts; push @starts, pos $sequence while $sequence =~ m{(?=ATG)}g; my @stops; push @stops, pos $sequence while $sequence =~ m{(?=(?:TAG|TAA|TGA))}g; foreach my $start ( @starts ) { my $firstStop = first { $_ > ( $start + 3 ) and not ( ( $_ - $start ) % 3 ) } @stops; next unless $firstStop; say qq{ORF: $start - $firstStop}; shift @stops while $stops[ 0 ] < $start; }

    The output:

    ORF: 16 - 133 ORF: 50 - 218 ORF: 132 - 390 ORF: 229 - 328 ORF: 249 - 390 ORF: 283 - 328 ORF: 301 - 328 ORF: 358 - 373 ORF: 389 - 845 ORF: 483 - 498 ORF: 567 - 840 ORF: 642 - 840 ORF: 680 - 845 ORF: 714 - 840 ORF: 842 - 944 ORF: 847 - 961 ORF: 941 - 1193 ORF: 1037 - 1193 ORF: 1211 - 1256 ORF: 1232 - 1256

    I hope this is of interest.

    Cheers,

    JohnGG

      Clever, but it's still a nested-loop solution. You just hid one of the loops in a function call. An efficient solution could be constructed along these lines, but as it is, it also chokes on
      $sequence = 'ATG' x 1e6;
Re: finding open reading frames
by Anonymous Monk on Jun 05, 2017 at 14:43 UTC
    It would help us understand what you are talking about if you made it clear that you are talking about DNA sequences.
Re: finding open reading frames
by thanos1983 (Parson) on Jun 06, 2017 at 23:58 UTC

    Hello again ic23oluk,

    I spend some time and I got even more confused. I got different results from your scrip:

    #!usr/bin/perl use strict; use warnings; my %hash; my $substring = 'ATG'; my @tags = ('TAG', 'TAA', 'TGA'); while (<>) { chomp; next if $. < 2; # Skip first line my $found = index($_, $substring); while ($found != -1) { foreach my $tag (@tags) { my $position = index($_, $tag, $found + 2); $hash{$position} = $tag if ($position != -1); } my $offset = $found + 1; $found = index($_, $substring, $offset); } } continue { close ARGV if eof; # reset $. } foreach my $position (sort {$a <=> $b} keys %hash) { print "Position: $position String: $hash{$position}\n"; } __END__ $ perl bio.pl sequence.fa Position: 29 String: TAA Position: 48 String: TGA Position: 73 String: TAA Position: 134 String: TGA Position: 201 String: TAA Position: 221 String: TGA Position: 248 String: TAA Position: 288 String: TGA Position: 309 String: TAG Position: 324 String: TGA Position: 332 String: TAA Position: 378 String: TAA Position: 381 String: TAG Position: 395 String: TGA Position: 408 String: TGA Position: 448 String: TAA Position: 505 String: TGA Position: 512 String: TAA Position: 621 String: TGA Position: 700 String: TGA Position: 722 String: TAA Position: 743 String: TAA Position: 752 String: TGA Position: 857 String: TAA Position: 864 String: TAA Position: 957 String: TAG Position: 974 String: TAA Position: 1002 String: TGA Position: 1019 String: TAA Position: 1052 String: TGA Position: 1198 String: TGA Position: 1210 String: TAG Position: 1253 String: TGA Position: 1265 String: TAA

    At this point I do not know if your script is giving you the correct output. I am not sure if mine also works perfectly but I tried to confirm that by using grep and the tags e.g. TAA to check if I should get an output as my script does and yours skips.

    For example:

    $ cat sequence.fa | grep TAA GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT + GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGA +AA ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGAT +TCTG TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACT +GGTTTA GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGA +TTACTATC TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAA +TCCGTACGTT TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAAC +CGAAGATGATTT CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTC +GCTGCGTTGAGGCT TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTT +GAGTTTATTGCTGCCG TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCAT +GGAAGGCGCTGAATTTAC GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGT +TCGCGTTTACCTTGCGTGTA CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGT +CAAAAATTACGTGCGGAAGGAG TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGT +CGTCCGCAGCCGTTGCGAGGTACT AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCA +ACAATTTTAATTGCAGGGGCTTCGGC CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCG +CCGAGCGTATGCCGCATGACCTTTCCCA TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACC +ATTTCAACTACTCCGGTTATCGCTGGCGAC TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTC +TCCATTGCGTCGTGGCCTTGCTATTGACTCTA CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTC +ACGTTTATGGTGAACAGTGGATTAAGTTCATGAA $ cat sequence.fa | grep -o TAA | wc -l 21

    Based on that I can tell that I am getting confident that your script is not calculating something correctly.

    Update: I took a look again on the bash search (remember to remove first line from file!):

    $ cat sequence.fa | grep -b -o TAA 29:TAA 73:TAA 106:TAA 201:TAA 248:TAA 332:TAA 378:TAA 385:TAA 448:TAA 480:TAA 512:TAA 722:TAA 743:TAA 857:TAA 864:TAA 870:TAA 934:TAA 974:TAA 1008:TAA 1019:TAA 1265:TAA

    The output of my script:

    #!usr/bin/perl use strict; use warnings; my $substring = 'TAA'; while (<>) { chomp; next if $. < 2; # Skip first line my $found = index($_, $substring); while ($found != -1) { # check for multiple occurrences in the sam +e string print "Found $substring at $found\n"; my $offset = $found + 1; $found = index($_, $substring, $offset); } } continue { close ARGV if eof; # reset $. } __END__ $ perl bio_2.pl sequence.fa Found TAA at 29 Found TAA at 73 Found TAA at 106 Found TAA at 201 Found TAA at 248 Found TAA at 332 Found TAA at 378 Found TAA at 385 Found TAA at 448 Found TAA at 480 Found TAA at 512 Found TAA at 722 Found TAA at 743 Found TAA at 857 Found TAA at 864 Found TAA at 870 Found TAA at 934 Found TAA at 974 Found TAA at 1008 Found TAA at 1019 Found TAA at 1265

    Update2: If you check also the ATG

    $ perl bio_2.pl sequence.fa Found ATG at 16 Found ATG at 50 Found ATG at 133 Found ATG at 232 Found ATG at 252 Found ATG at 287 Found ATG at 305 Found ATG at 363 Found ATG at 394 Found ATG at 489 Found ATG at 575 Found ATG at 651 Found ATG at 689 Found ATG at 724 Found ATG at 854 Found ATG at 859 Found ATG at 954 Found ATG at 1014 Found ATG at 1044 Found ATG at 1051 Found ATG at 1145 Found ATG at 1228 Found ATG at 1249 Found ATG at 1272 tinyos@tinyOS:~/Monks$ cat sequence.fa | grep -b -o ATG 16:ATG 50:ATG 133:ATG 232:ATG 252:ATG 287:ATG 305:ATG 363:ATG 394:ATG 489:ATG 575:ATG 651:ATG 689:ATG 724:ATG 854:ATG 859:ATG 954:ATG 1014:ATG 1044:ATG 1051:ATG 1145:ATG 1228:ATG 1249:ATG 1272:ATG

    Now check one by one the index and the final output of my script:

    $ perl bio.pl sequence.fa Position: 29 String: TAA Position: 48 String: TGA Position: 73 String: TAA Position: 134 String: TGA Position: 201 String: TAA Position: 221 String: TGA Position: 248 String: TAA Position: 288 String: TGA Position: 309 String: TAG Position: 324 String: TGA Position: 332 String: TAA Position: 378 String: TAA Position: 381 String: TAG Position: 395 String: TGA Position: 408 String: TGA Position: 448 String: TAA Position: 505 String: TGA Position: 512 String: TAA Position: 621 String: TGA Position: 700 String: TGA Position: 722 String: TAA Position: 743 String: TAA Position: 752 String: TGA Position: 857 String: TAA Position: 864 String: TAA Position: 957 String: TAG Position: 974 String: TAA Position: 1002 String: TGA Position: 1019 String: TAA Position: 1052 String: TGA Position: 1198 String: TGA Position: 1210 String: TAG Position: 1253 String: TGA Position: 1265 String: TAA

    Observe that the script is searching after the ATG as you requested (check indexes). It looks correct to me.

    Take a look and give it a try to verify it or not.

    Hope this helps instead of making things more confusing.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: finding open reading frames
by thanos1983 (Parson) on Jun 07, 2017 at 09:27 UTC

    Hello ic23oluk,

    Ok this is the final and correct solution all the rest of my solutions are wrong.

    #!usr/bin/perl use strict; use warnings; use Data::Dumper; my %HoH; my @AoH; my $substring = 'ATG'; sub get_first_index { my %hash; my ($found, $string) = @_; my @tags = ('TAG', 'TAA', 'TGA'); my @indexes; foreach my $tag (@tags) { my $position = index($string, $tag, $found + 2); push @indexes, $position if ($position != -1); } # hash slice @hash{@indexes} = @tags; # sort has based on the lowest key first my @sorted = (sort {$a <=> $b} keys %hash); # remove the rest of the keys as we only want first occurence my $array_size = @sorted; delete $hash{$_} for @sorted [1..$array_size - 1]; return \%hash; } while (<>) { chomp; next if $. < 2; # Skip first line my $found = index($_, $substring); while ($found != -1) { my $hash_result = get_first_index( $found, $_ ); # choose one or the other what ever you prefer $HoH{"Found $substring at $found"} = $hash_result if (%$hash_resul +t); push @AoH, "Found $substring at $found" ,$hash_result if (%$hash_r +esult); my $offset = $found + 1; $found = index( $_, $substring, $offset ); } } continue { close ARGV if eof; # reset $. } my @keys = keys %HoH; print scalar @keys . "\n"; # print Dumper \@AoH; # print Dumper \%HoH; __END__ $ perl bio.pl sequence.fa 23

    Why this is the correct one? Because:

    $ cat sequence.fa | grep -bo ATG 16:ATG 50:ATG 133:ATG 232:ATG 252:ATG 287:ATG 305:ATG 363:ATG 394:ATG 489:ATG 575:ATG 651:ATG 689:ATG 724:ATG 854:ATG 859:ATG 954:ATG 1014:ATG 1044:ATG 1051:ATG 1145:ATG 1228:ATG 1249:ATG 1272:ATG tinyos@tinyOMN:~/Monks$ cat sequence.fa | grep -bo ATG | wc -l 24

    As you can see from the sample above, we have 24 matches with the key word ATG. But why we get 23 results from the script above? Simply because there is nothing matching from the data file after the last ATG.

    Update2: Since you said you want maximum speed you can replace the foreach loop with a while loop.

    #!usr/bin/perl use say; use strict; use warnings; use Data::Dumper; my %HoH; my @AoH; my $substring = 'ATG'; sub get_first_index { my %hash; my ($found, $string) = @_; my @tags = ('TAG', 'TAA', 'TGA'); my @indexes; while (my $tag = shift @tags) { my $position = index($string, $tag, $found + 2); push @indexes, $position if ($position != -1); } # hash slice, we destroy the array above so we need to replace it @hash{@indexes} = ('TAG', 'TAA', 'TGA'); # sort has based on the lowest key first my @sorted = (sort {$a <=> $b} keys %hash); # remove the rest of the keys as we only want first occurence my $array_size = @sorted; delete $hash{$_} for @sorted [1..$array_size - 1]; return \%hash; } while (<>) { chomp; next if $. < 2; # Skip first line my $found = index($_, $substring); while ($found != -1) { my $hash_result = get_first_index( $found, $_ ); # choose one or the other what ever you prefer $HoH{"Found $substring at $found"} = $hash_result if (%$hash_resul +t); push @AoH, "Found $substring at $found" ,$hash_result if (%$hash_r +esult); my $offset = $found + 1; $found = index( $_, $substring, $offset ); } } continue { close ARGV if eof; # reset $. } my @keys = keys %HoH; print scalar @keys . "\n"; # print Dumper \@AoH; # print Dumper \%HoH;

    Hope this helps and that you are still following your own question :D

    Seeking for Perl wisdom...on the process of learning...not there...yet!

      first of all: thanks you spend that much effort on this. I need some time to go through this :D

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1192179]
Approved by sundialsvc4
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (6)
As of 2024-04-18 08:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found