Re: look for substrings and getting their location
by PodMaster (Abbot) on May 09, 2004 at 09:47 UTC
Excellent explanation of three functions that anyone doing biological research in Perl should know. :)
You should also look at the BioPerl modules for sequence input and output. They will make your program general enough to work with many different sequence formats, not just FASTA. Here's a quick example:
use Bio::SeqIO;
my $filename = 'test.seq';
my $format = 'fasta';
my $seqio = Bio::SeqIO->new( -file => $filename, -format => $format );
while ( my $seqobj = $seqio->next_seq() ) {
    my $raw_sequence = $seqobj->seq;
    # do your searching on this raw sequence
}
Hope this helps. :)
Re: look for substrings and getting their location
by ozone (Friar) on May 09, 2004 at 11:55 UTC
A fun one is something like:
my $string = 'GUAUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGA';
my @pieces = split(/GUAUG/,$string,-1);
print "count [", @pieces - 1, "]\n";
There you have the count and with a bit of math you can figure out the locations :-D
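The "bit of math" could look something like the sketch below. It assumes non-overlapping matches (which is all split can see), and the short test string and the variable names $pattern and @positions are made up for illustration. Each match starts where the running total of piece lengths and earlier match lengths ends.

```perl
use strict;
use warnings;

my $pattern = 'GUAUG';
my $string  = 'GUAUGAAGUAUG';    # illustrative input, not the original data

# Keep trailing empty fields (-1) so a match at the very end is counted.
my @pieces = split /$pattern/, $string, -1;

my @positions;
my $offset = 0;
for my $i (0 .. $#pieces - 1) {
    $offset += length $pieces[$i];    # skip the text before this match
    push @positions, $offset;         # the match starts here
    $offset += length $pattern;       # skip the match itself
}

print "count [", scalar(@positions), "] at [@positions]\n";
```

For this input the pieces are '', 'AA' and '', so the matches land at offsets 0 and 7.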
Re: look for substrings and getting their location
by CombatSquirrel (Hermit) on May 09, 2004 at 12:08 UTC
And, in the spirit of TIMTOWTDI, here's one without RegExes:
#!perl
use strict;
use warnings;
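my $seq
    = 'GUAUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGA';
my $search = 'UUUAA';
# Last offset at which $search can still start.
my $max = length($seq) - length($search);
my $i = 0;
my @results;
# Use <= so that a match starting exactly at $max is not skipped.
while ($i <= $max) {
    $i = index($seq, $search, $i);
    if ($i == -1) {
        last;
    } else {
        push @results, $i;
    }
    ++$i;    # advance one character so overlapping matches are found too
}
print @results . " match(es)\n\n";
# Four leading spaces line the sequence up with the "NN: " prefix below.
print "    $seq\n";
for (@results) {
    printf "%02d: %s%s\n", $_, ' ' x $_, $search;
}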
It might be faster or not ;-).
Hope this helped.
CombatSquirrel.
Entropy is the tendency of everything going to hell.
Re: look for substrings and getting their location
by Not_a_Number (Prior) on May 09, 2004 at 14:25 UTC
how do I count the number of times a specific string like "GUAUG" occurs in each line
I have a question:
For a string in your file that looks like GUAUGUAUGUAUG, how many matches do you want to count?
dave
all of them + location of the beginning of each one
/G(?=UAUG)/g
That way the "pointer" is advanced one character at a time rather than one keyword at a time, thus allowing for overlapping matches.
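You can find the position of each match by running the match in scalar context (for example in a while loop). There, pos gives the offset where the next search will start, and the special variables @- and @+ give the start and end offsets of the last match. See perlvar for a description of them.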
On the other hand, if you want to find only one match in cases where the sequence appears to overlap, you'll have to define whether you want the left-side of the overlap to match, or the right side.
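As a rough illustration of the look-ahead pattern, here is a minimal sketch run against the overlapping example string from the question above. The variable names are made up. It loops over the match in scalar context and records $-[0], the start offset of each match.

```perl
use strict;
use warnings;

my $seq = 'GUAUGUAUGUAUG';

my @starts;
# Only the leading G is consumed; the look-ahead checks for UAUG without
# consuming it, so the next search resumes one character further on.
while ($seq =~ /G(?=UAUG)/g) {
    push @starts, $-[0];    # start offset of this match
}

print scalar(@starts), " match(es) at @starts\n";    # 3 match(es) at 0 4 8
```

The third GUAUG shares its first G with the end of the second one, and it is still counted.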
Re: look for substrings and getting their location
by ambrus (Abbot) on May 09, 2004 at 14:34 UTC
$count= @{[$string=~ /GUAUG/g]};
A bit inefficient if there are a lot of matches; using a /g regexp in scalar context would be better.
Update: The locations too? Then you'll need a /g regexp in scalar context, reading $-[0] after each match.
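A minimal sketch of that scalar-context approach, with an illustrative input string and made-up variable names, might be:

```perl
use strict;
use warnings;

my $string = 'GUAUGUAUGUAUG';    # illustrative input

my $count = 0;
my @offsets;
# In scalar context each /g match resumes where the previous one ended,
# so no intermediate list of matches is built.
while ($string =~ /GUAUG/g) {
    $count++;
    push @offsets, $-[0];    # start offset of the match just found
}

print "$count match(es) at @offsets\n";    # 2 match(es) at 0 8
```

Because the plain pattern consumes all five characters, the overlapping occurrence at offset 4 is not reported here.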
In Perl 6 that's just...
by TheDamian (Vicar) on May 10, 2004 at 20:29 UTC
$seq ~~ m:overlap/ GUAUG /;
say "Found ", +@$0, " at:";
say "\t", $_.pos for @$0;
So in Perl 5.8 and above you could write:
use Perl6::Rules;
use Perl6::Say;
$seq =~ m:overlap/ GUAUG /;
say "Found ", scalar(@$0), " at:";
say "\t", $_->pos for @$0;
And, as the modifier suggests, it correctly handles overlaps in the data.
It can. I used the longer form because I was trying to keep the Perl 5 version as similar as possible to the Perl 6 version.
Damian
Re: look for substrings and getting their location
by Anonymous Monk on May 29, 2004 at 03:14 UTC
It seems that a number of the posts changed the original question somewhat and consequently did not give full and thorough solutions. For instance, the original question states that the data are in the following format:
YBL027W
GUAUGUUUAACAGU...
Yet, a couple of the solutions begin by setting $var = 'GUAUGUUUAACAGU...'
How does one get the line name from the solution above? A solution which leaves the data in the original format and gives the line name, number of matches, and their zero-based offsets is as follows:
#!/usr/bin/perl
use warnings;
use strict;
my $pat = 'GUAUG';
my ($line, $times, @at);
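while (<DATA>) {
    if (/^[CGUA]+$/) {
        # Count the matches by forcing list context on the /g match.
        $times = () = m/$pat/g;
        if ($times) {
            # Build a regex with one capture group per match;
            # @- then holds the start offset of each group.
            eval('/^' . ('.*?($pat)' x $times) . '.*?$/; @at = @-;');
            shift @at;    # drop $-[0], the start of the whole match
        }
    } else {
        ($line) = /^(\w+)$/;
    }
    if ($line and $times) {
        print "$line: $times match", $times > 1 ? 'es' : '', " at @at\n";
        $line = $times = 0;
    }
}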
__DATA__
YBL027W
GUAUGUUUAACAGUGAUAGUAUGUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGA
BBL111C
UAUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAGUAUGGUAUGAAUAUGUUAUGAG
ABC456T
AUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGAGU
DEF789U
UGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGAGUA
GHI012V
GUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGUAUGU
Perl was created to manipulate text. A solution to a problem such as this should be compact and easy to understand.
I made a few assumptions:
• All DNA sequences comprise CGUA. (I thought it was CGAT. I am not a scientist but I play one on TV.)
• The search strings do NOT overlap.
• The line name has at least one character that is not C, G, U, or A.
• All lines alternate between line name and DNA sequence with the former before the latter.