ic23oluk has asked for the wisdom of the Perl Monks concerning the following question:
Dear PerlMonks,
I try to find all possible open reading frames in a sequence and print them out.
I have already written a working program, but I want an alternative that is more efficient (I currently have two loops).
The idea is to store the position of starts, store the position of the stops and then look for pairs that are in frame. The latter point is what makes it difficult to me. How can I look for in-frame pairs?
here's the code for my first solution that works:
my $sequence = "sequence.fa";
my @starts;
open (FASTA, "$sequence") || die "Cannot open $sequence: $!.\n";
chomp (my @seq = <FASTA>);
close FASTA;
shift @seq;
$sequence = join ('', @seq);
@seq = split ('', $sequence);
for (my $i=0; $i<=$#seq-5; $i++){ ## -5 könnte man weglassen
# start codon: ATG
# stopp codon: TAG, TAA, TGA
# multiple of 3 between start and stop
if ($seq[$i] eq 'A' && $seq[$i+1] eq 'T' && $seq[$i+2] eq 'G')
+{
push (@starts, $i);
for (my $j=$i+3; $j<=$#seq-2; $j=$j+3){
if (($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2
+] eq 'A') ||
($seq[$j] eq 'T' && $seq[$j+1] eq 'G' && $seq[$j+2] eq
+ 'A') ||
($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq
+ 'G')){
print "ORF: $i-", ($j+2), "\n";
last; ##lasts the j loop
}
}
}
}
I would be very pleased if anyone wrote an alternative algorithm according to my aforementioned demands. Thanks in advance!
Re: finding open reading frames
by tybalt89 (Monsignor) on Jun 05, 2017 at 15:16 UTC
|
Lacking a test case, I'm guessing a little...
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1192179
use strict;
use warnings;
my @seq = map qw( A C G T )[rand 4], 1..1e2; # fake for missing test c
+ase
$_ = join '', @seq;
print "$_\n\n";
print "ORF: $-[0]-", $+[1] - 1, " $&$1\n"
while /ATG(?=((?:.{3})*?(?:TAG|TAA|TGA)))/g;
I'm also printing a little extra for checking purposes.
| [reply] [d/l] |
Re: finding open reading frames
by thanos1983 (Parson) on Jun 05, 2017 at 14:53 UTC
|
Hello ic23oluk,
Can we have a sample of your data (couple of lines) not the whole file so we can replicate the script and see how we can improve it?
Update: I see from your script $sequence = join ('', @seq); @seq = split ('', $sequence); it looks you are joining and then splitting the array. I am not sure what are you doing there this is why I need to see your input.
Update2: Another possible approach instead of the complex if conditions that you have you could use grep but I need to practice with your code to see what is the expected output as you said that the script works.
Update3: From a quick search online I think you should be looking into getting familiar with BioPerl.
Update4: I think you should be looking for something like this Bio::SeqIO? You can also find some relevant information from similar question Parse DNA fasta file.
Update5: I noticed that you have not provided us yet with a sample of input data and desired output, so I can not really try tackle and help you. But as I was thinking about your problem, I think that you might be interested in using the index function. It will return the sub string position withing your string of data. I case you want to check for multiple sub strings just use the last location and after as a position. Read more about it on the documentation.
Looking forward to your update.
Hope this helps.
Seeking for Perl wisdom...on the process of learning...not there...yet!
| [reply] [d/l] [select] |
|
the input file in Fasta format has always a one line header followed by a sequence, i.e.:
>NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT
GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA
ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG
TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA
GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC
TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT
TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT
CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT
TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG
TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC
GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA
CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG
TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT
AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC
CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA
TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC
TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA
CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA
the join and split operations provide an array in which every Nucleotide (letter) is an element (instead of every line).
thanks
| [reply] |
|
#!usr/bin/perl
use strict;
use warnings;
my @starts = ();
while (<>) {
chomp;
next if $. < 2; # Skip first line
my @seq = split '', $_;
for (0..$#seq-5){
if ($seq[$_] eq 'A' && $seq[$_+1] eq 'T' && $seq[$_+2] eq 'G') {
push (@starts, $_);
for (my $j=$_+3; $j<=$#seq-2; $j=$j+3){
if ( ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'A
+') ||
($seq[$j] eq 'T' && $seq[$j+1] eq 'G' && $seq[$j+2] eq 'A
+') ||
($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'G
+') ) {
print "ORF: $_-", ($j+2), "\n";
last; ##lasts the j loop
}
}
}
}
} continue {
close ARGV if eof; # reset $.
}
__END__
$ perl bio_test.pl sequence.fa
ORF: 16-75
ORF: 50-136
ORF: 133-357
ORF: 232-357
ORF: 252-290
ORF: 287-334
ORF: 305-334
ORF: 363-380
ORF: 394-450
ORF: 489-623
ORF: 575-724
ORF: 651-797
ORF: 689-724
ORF: 724-936
ORF: 854-859
ORF: 859-936
ORF: 954-959
ORF: 1051-1200
ORF: 1145-1255
ORF: 1228-1275
ORF: 1249-1275
Update: Another possible way, this is how I would approach it. This is 3/4 of the solution, I do not want to give you the full solution as this looks like school work. It should not be really difficult to continue from here.
As I said before use index for multiple occurrences in the string.
#!usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $substring = 'ATG';
while (<>) {
chomp;
next if $. < 2; # Skip first line
my @seq = split ' ', $_;
# print Dumper \@seq; # Remove hashtag and see how the input looks
for (@seq) {
my $found = index($_, $substring);
while ($found != -1) { # check for multiple occurrences in the sam
+e string
print "Found $substring at $found\n"; # Do your calculations h
+ere
my $offset = $found + 1; # keep track of offset in your counte
+r as you do on your primary script and should have it
$found = index($_, $substring, $offset);
}
}
} continue {
close ARGV if eof; # reset $.
}
__END__
$ perl bio_test.pl sequence.fa
Found 'ATG' at 16
Found 'ATG' at 50
Found 'ATG' at 62
Found 'ATG' at 19
Found 'ATG' at 39
Found 'ATG' at 3
Found 'ATG' at 21
Found 'ATG' at 8
Found 'ATG' at 39
Found 'ATG' at 63
Found 'ATG' at 7
Found 'ATG' at 12
Found 'ATG' at 50
Found 'ATG' at 14
Found 'ATG' at 2
Found 'ATG' at 7
Found 'ATG' at 31
Found 'ATG' at 20
Found 'ATG' at 50
Found 'ATG' at 57
Found 'ATG' at 9
Found 'ATG' at 21
Found 'ATG' at 42
Found 'ATG' at 65
Notice that the index is going back to zero, keep track of it and it should give you the same results as your primary script. Let us know if there is something that is not clear.
Update2: If the space in between the sequences in normal you do not need to split and you can loop through like this:
#!usr/bin/perl
use strict;
use warnings;
my $substring = 'ATG';
while (<>) {
chomp;
next if $. < 2; # Skip first line
my $found = index($_, $substring);
while ($found != -1) { # check for multiple occurrences in the sam
+e string
print "Found $substring at $found\n"; # Do your calculations here
my $offset = $found + 1; # keep track of offset in your counter as
+ you do on your primary script and should have it
$found = index($_, $substring, $offset);
}
} continue {
close ARGV if eof; # reset $.
}
__END__
$ perl bio.pl sequence.fa
Found ATG at 16
Found ATG at 50
Found ATG at 133
Found ATG at 232
Found ATG at 252
Found ATG at 287
Found ATG at 305
Found ATG at 363
Found ATG at 394
Found ATG at 489
Found ATG at 575
Found ATG at 651
Found ATG at 689
Found ATG at 724
Found ATG at 854
Found ATG at 859
Found ATG at 954
Found ATG at 1014
Found ATG at 1044
Found ATG at 1051
Found ATG at 1145
Found ATG at 1228
Found ATG at 1249
Found ATG at 1272
Now you just need to apply an if condition and voila. :D
Hope this helps.
Seeking for Perl wisdom...on the process of learning...not there...yet!
| [reply] [d/l] [select] |
|
|
|
|
Re: finding open reading frames
by Anonymous Monk on Jun 05, 2017 at 15:20 UTC
|
Ok, so it looks like the rules are:
- ATG can be be a start codon, or it can code for Methionine.
- * TAG, TAA, and TGA are always end codons.
- * A paired start and end have to be separated by a multiple of three bases.
Here's a one-pass solution:
my @start = ([],[],[]);
my $sequence = 'ATGATGAATGAATAGTAGATAG';
for my $i (0 .. length($sequence)-3) {
my $triplet = substr($sequence, $i, 3);
if ($triplet eq 'ATG') {
push @{$start[$i%3]}, $i;
}
elsif ($triplet eq 'TAG' || $triplet eq 'TAA' || $triplet eq 'TGA')
+ {
for my $j (@{$start[$i%3]}) {
print $j, '..', $i+2, "\n";
}
@{$start[$i%3]} = ();
}
}
Output:
0..14
3..14
7..21
Explanation of the output:
ATGATGAATGAATAGTAGATAG
STA123123123END
STA123123END
STA123123123END
| [reply] [d/l] [select] |
Re: finding open reading frames
by glasswalk3r (Friar) on Jun 06, 2017 at 13:57 UTC
|
I'll not even try to understand what your code does (and other already provided some material for you on that), but here are some general guidelines:
- You should run a profiler over it: see Devel::NYTProf and identify the areas that are using more time/CPU so you can start with them.
- Since you are working inside loops, check out if you can get ride of them. If not, check some algorithms like Schwartzian_transform.
- Benchmark your code before and after the changes you implement to see if they are effective. See Dumbbench.
Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
| [reply] |
Re: finding open reading frames
by johngg (Canon) on Jun 06, 2017 at 18:30 UTC
|
A one loop solution that uses regexen to find the positions of all starters and stoppers then, in a loop over starters, uses List::Util::first() to find the next stopper that fits the criteria.
use strict;
use warnings;
use 5.018;
use List::Util qw{ first };
open my $fastaFH, q{<}, \ <<EOD or die $!;
>NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT
GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA
ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG
TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA
GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC
TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT
TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT
CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT
TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG
TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC
GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA
CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG
TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT
AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC
CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA
TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC
TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA
CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA
EOD
chomp( my @seqLines = <$fastaFH> );
close $fastaFH or die $!;
my $sequence = join q{}, @seqLines[ 1 .. $#seqLines ];
my @starts;
push @starts, pos $sequence
while $sequence =~ m{(?=ATG)}g;
my @stops;
push @stops, pos $sequence
while $sequence =~ m{(?=(?:TAG|TAA|TGA))}g;
foreach my $start ( @starts )
{
my $firstStop =
first {
$_ > ( $start + 3 )
and not
( ( $_ - $start ) % 3 )
} @stops;
next unless $firstStop;
say qq{ORF: $start - $firstStop};
shift @stops while $stops[ 0 ] < $start;
}
The output:
ORF: 16 - 133
ORF: 50 - 218
ORF: 132 - 390
ORF: 229 - 328
ORF: 249 - 390
ORF: 283 - 328
ORF: 301 - 328
ORF: 358 - 373
ORF: 389 - 845
ORF: 483 - 498
ORF: 567 - 840
ORF: 642 - 840
ORF: 680 - 845
ORF: 714 - 840
ORF: 842 - 944
ORF: 847 - 961
ORF: 941 - 1193
ORF: 1037 - 1193
ORF: 1211 - 1256
ORF: 1232 - 1256
I hope this is of interest.
| [reply] [d/l] [select] |
|
Clever, but it's still a nested-loop solution. You just hid one of the loops in a function call. An efficient solution could be constructed along these lines, but as it is, it also chokes on
$sequence = 'ATG' x 1e6;
| [reply] [d/l] |
Re: finding open reading frames
by Anonymous Monk on Jun 05, 2017 at 14:43 UTC
|
It would help us understand what you are talking about if you made it clear that you are talking about DNA sequences. | [reply] |
Re: finding open reading frames
by thanos1983 (Parson) on Jun 06, 2017 at 23:58 UTC
|
#!usr/bin/perl
use strict;
use warnings;
my %hash;
my $substring = 'ATG';
my @tags = ('TAG', 'TAA', 'TGA');
while (<>) {
chomp;
next if $. < 2; # Skip first line
my $found = index($_, $substring);
while ($found != -1) {
foreach my $tag (@tags) {
my $position = index($_, $tag, $found + 2);
$hash{$position} = $tag if ($position != -1);
}
my $offset = $found + 1;
$found = index($_, $substring, $offset);
}
} continue {
close ARGV if eof; # reset $.
}
foreach my $position (sort {$a <=> $b} keys %hash) {
print "Position: $position String: $hash{$position}\n";
}
__END__
$ perl bio.pl sequence.fa
Position: 29 String: TAA
Position: 48 String: TGA
Position: 73 String: TAA
Position: 134 String: TGA
Position: 201 String: TAA
Position: 221 String: TGA
Position: 248 String: TAA
Position: 288 String: TGA
Position: 309 String: TAG
Position: 324 String: TGA
Position: 332 String: TAA
Position: 378 String: TAA
Position: 381 String: TAG
Position: 395 String: TGA
Position: 408 String: TGA
Position: 448 String: TAA
Position: 505 String: TGA
Position: 512 String: TAA
Position: 621 String: TGA
Position: 700 String: TGA
Position: 722 String: TAA
Position: 743 String: TAA
Position: 752 String: TGA
Position: 857 String: TAA
Position: 864 String: TAA
Position: 957 String: TAG
Position: 974 String: TAA
Position: 1002 String: TGA
Position: 1019 String: TAA
Position: 1052 String: TGA
Position: 1198 String: TGA
Position: 1210 String: TAG
Position: 1253 String: TGA
Position: 1265 String: TAA
At this point I do not know if your script is giving you the correct output. I am not sure if mine also works perfectly but I tried to confirm that by using grep and the tags e.g. TAA to check if I should get an output as my script does and yours skips.
For example:
$ cat sequence.fa | grep TAA
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT
+ GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGA
+AA ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGAT
+TCTG TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACT
+GGTTTA GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGA
+TTACTATC TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAA
+TCCGTACGTT TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAAC
+CGAAGATGATTT CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTC
+GCTGCGTTGAGGCT TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTT
+GAGTTTATTGCTGCCG TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCAT
+GGAAGGCGCTGAATTTAC GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGT
+TCGCGTTTACCTTGCGTGTA CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGT
+CAAAAATTACGTGCGGAAGGAG TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGT
+CGTCCGCAGCCGTTGCGAGGTACT AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCA
+ACAATTTTAATTGCAGGGGCTTCGGC CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCG
+CCGAGCGTATGCCGCATGACCTTTCCCA TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACC
+ATTTCAACTACTCCGGTTATCGCTGGCGAC TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTC
+TCCATTGCGTCGTGGCCTTGCTATTGACTCTA CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTC
+ACGTTTATGGTGAACAGTGGATTAAGTTCATGAA
$ cat sequence.fa | grep -o TAA | wc -l
21
Based on that I can tell that I am getting confident that your script is not calculating something correctly.
Update: I took a look again on the bash search (remember to remove first line from file!):
$ cat sequence.fa | grep -b -o TAA
29:TAA
73:TAA
106:TAA
201:TAA
248:TAA
332:TAA
378:TAA
385:TAA
448:TAA
480:TAA
512:TAA
722:TAA
743:TAA
857:TAA
864:TAA
870:TAA
934:TAA
974:TAA
1008:TAA
1019:TAA
1265:TAA
The output of my script:
#!usr/bin/perl
use strict;
use warnings;
my $substring = 'TAA';
while (<>) {
chomp;
next if $. < 2; # Skip first line
my $found = index($_, $substring);
while ($found != -1) { # check for multiple occurrences in the sam
+e string
print "Found $substring at $found\n";
my $offset = $found + 1;
$found = index($_, $substring, $offset);
}
} continue {
close ARGV if eof; # reset $.
}
__END__
$ perl bio_2.pl sequence.fa
Found TAA at 29
Found TAA at 73
Found TAA at 106
Found TAA at 201
Found TAA at 248
Found TAA at 332
Found TAA at 378
Found TAA at 385
Found TAA at 448
Found TAA at 480
Found TAA at 512
Found TAA at 722
Found TAA at 743
Found TAA at 857
Found TAA at 864
Found TAA at 870
Found TAA at 934
Found TAA at 974
Found TAA at 1008
Found TAA at 1019
Found TAA at 1265
Update2: If you check also the ATG
$ perl bio_2.pl sequence.fa
Found ATG at 16
Found ATG at 50
Found ATG at 133
Found ATG at 232
Found ATG at 252
Found ATG at 287
Found ATG at 305
Found ATG at 363
Found ATG at 394
Found ATG at 489
Found ATG at 575
Found ATG at 651
Found ATG at 689
Found ATG at 724
Found ATG at 854
Found ATG at 859
Found ATG at 954
Found ATG at 1014
Found ATG at 1044
Found ATG at 1051
Found ATG at 1145
Found ATG at 1228
Found ATG at 1249
Found ATG at 1272
tinyos@tinyOS:~/Monks$ cat sequence.fa | grep -b -o ATG
16:ATG
50:ATG
133:ATG
232:ATG
252:ATG
287:ATG
305:ATG
363:ATG
394:ATG
489:ATG
575:ATG
651:ATG
689:ATG
724:ATG
854:ATG
859:ATG
954:ATG
1014:ATG
1044:ATG
1051:ATG
1145:ATG
1228:ATG
1249:ATG
1272:ATG
Now check one by one the index and the final output of my script:
$ perl bio.pl sequence.fa
Position: 29 String: TAA
Position: 48 String: TGA
Position: 73 String: TAA
Position: 134 String: TGA
Position: 201 String: TAA
Position: 221 String: TGA
Position: 248 String: TAA
Position: 288 String: TGA
Position: 309 String: TAG
Position: 324 String: TGA
Position: 332 String: TAA
Position: 378 String: TAA
Position: 381 String: TAG
Position: 395 String: TGA
Position: 408 String: TGA
Position: 448 String: TAA
Position: 505 String: TGA
Position: 512 String: TAA
Position: 621 String: TGA
Position: 700 String: TGA
Position: 722 String: TAA
Position: 743 String: TAA
Position: 752 String: TGA
Position: 857 String: TAA
Position: 864 String: TAA
Position: 957 String: TAG
Position: 974 String: TAA
Position: 1002 String: TGA
Position: 1019 String: TAA
Position: 1052 String: TGA
Position: 1198 String: TGA
Position: 1210 String: TAG
Position: 1253 String: TGA
Position: 1265 String: TAA
Observe that the script is searching after the ATG as you requested (check indexes). It looks correct to me.
Take a look and give it a try to verify it or not.
Hope this helps instead of making things more confusing.
Seeking for Perl wisdom...on the process of learning...not there...yet!
| [reply] [d/l] [select] |
Re: finding open reading frames
by thanos1983 (Parson) on Jun 07, 2017 at 09:27 UTC
|
#!usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %HoH;
my @AoH;
my $substring = 'ATG';
sub get_first_index {
my %hash;
my ($found, $string) = @_;
my @tags = ('TAG', 'TAA', 'TGA');
my @indexes;
foreach my $tag (@tags) {
my $position = index($string, $tag, $found + 2);
push @indexes, $position if ($position != -1);
}
# hash slice
@hash{@indexes} = @tags;
# sort has based on the lowest key first
my @sorted = (sort {$a <=> $b} keys %hash);
# remove the rest of the keys as we only want first occurence
my $array_size = @sorted;
delete $hash{$_} for @sorted [1..$array_size - 1];
return \%hash;
}
while (<>) {
chomp;
next if $. < 2; # Skip first line
my $found = index($_, $substring);
while ($found != -1) {
my $hash_result = get_first_index( $found, $_ );
# choose one or the other what ever you prefer
$HoH{"Found $substring at $found"} = $hash_result if (%$hash_resul
+t);
push @AoH, "Found $substring at $found" ,$hash_result if (%$hash_r
+esult);
my $offset = $found + 1;
$found = index( $_, $substring, $offset );
}
} continue {
close ARGV if eof; # reset $.
}
my @keys = keys %HoH;
print scalar @keys . "\n";
# print Dumper \@AoH;
# print Dumper \%HoH;
__END__
$ perl bio.pl sequence.fa
23
Why this is the correct one? Because:
$ cat sequence.fa | grep -bo ATG
16:ATG
50:ATG
133:ATG
232:ATG
252:ATG
287:ATG
305:ATG
363:ATG
394:ATG
489:ATG
575:ATG
651:ATG
689:ATG
724:ATG
854:ATG
859:ATG
954:ATG
1014:ATG
1044:ATG
1051:ATG
1145:ATG
1228:ATG
1249:ATG
1272:ATG
tinyos@tinyOMN:~/Monks$ cat sequence.fa | grep -bo ATG | wc -l
24
As you can see from the sample above, we have 24 matches with the key word ATG. But why we get 23 results from the script above? Simply because there is nothing matching from the data file after the last ATG.
Update2: Since you said you want maximum speed you can replace the foreach loop with a while loop.
#!usr/bin/perl
use say;
use strict;
use warnings;
use Data::Dumper;
my %HoH;
my @AoH;
my $substring = 'ATG';
sub get_first_index {
my %hash;
my ($found, $string) = @_;
my @tags = ('TAG', 'TAA', 'TGA');
my @indexes;
while (my $tag = shift @tags) {
my $position = index($string, $tag, $found + 2);
push @indexes, $position if ($position != -1);
}
# hash slice, we destroy the array above so we need to replace it
@hash{@indexes} = ('TAG', 'TAA', 'TGA');
# sort has based on the lowest key first
my @sorted = (sort {$a <=> $b} keys %hash);
# remove the rest of the keys as we only want first occurence
my $array_size = @sorted;
delete $hash{$_} for @sorted [1..$array_size - 1];
return \%hash;
}
while (<>) {
chomp;
next if $. < 2; # Skip first line
my $found = index($_, $substring);
while ($found != -1) {
my $hash_result = get_first_index( $found, $_ );
# choose one or the other what ever you prefer
$HoH{"Found $substring at $found"} = $hash_result if (%$hash_resul
+t);
push @AoH, "Found $substring at $found" ,$hash_result if (%$hash_r
+esult);
my $offset = $found + 1;
$found = index( $_, $substring, $offset );
}
} continue {
close ARGV if eof; # reset $.
}
my @keys = keys %HoH;
print scalar @keys . "\n";
# print Dumper \@AoH;
# print Dumper \%HoH;
Hope this helps and that you are still following your own question :D
Seeking for Perl wisdom...on the process of learning...not there...yet!
| [reply] [d/l] [select] |
|
| [reply] |
|
|