Re: BioInformatics - polyA tail search
by halley (Prior) on Sep 02, 2003 at 17:52 UTC
|
We are not all bioinformatics specialists, but we know Perl. Could you please let us know (1) what your data files contain, (2) what a polyA tail would look like in that data, (3) what you've tried so far.
How to ask questions the smart way.
-- [ e d @ h a l l e y . c c ] | [reply] |
|
I do apologize for the sparse details.
A polyA tail can be defined as a string of length 10 or greater containing only 'A' or 'N'.
For example,
AAAANAAAAANNNAANANANN or
AAAAAAAAAAAAAAAANNAAA
The files to search contain 80 character lines terminated with a control-M, as follows :
ACGGAAAATCGGATCTGAATGTCTAGAGGGGTTCTCTCCCTTGGTGTGAGTCTAGCCCTGAAAGTTGCCACTCATTGAGC^M
CGTTGCCGACTGAGGCTTTGGACTCCAAGGGTAAGGAGCAGACGATGGAGGACGATTTGCTTTGGGGCATCACGCAACCA^M
TCCCACTCTCGCGAAGCCAAATTTGTCGAGAGTACTCTGGGGGGAAGAGATCAGAATTGTGCAGACTAATCCGTAACTGC^M
CAAGTACTATTGGCCCTGTTCCAACCATCTAACCTCCTTATGATAACCATGCCACTAAATGGGTTCCTGGATCTGCACCT^M
CATTCGCTCGCCTTATGGCCTCGGCTCTCTGCGTATCCACCCTCCTCGTCACCGCCATGCCCTTCGACCTTCAGCGGGGG^M
Thus far, I have used only grep, but I think the control-M may be imbedded in some of the poly A stretches.
Thank you again!
| [reply] |
|
This is completely untested, but you can use this as a start. Warning: This could be memory intensive for large files!
# Open the file and slurp the contents to a string.
open FILE, "File_To_Read" || die "Cannot open 'File_To_Read' for readi
+ng: $!\n";
my $file = do { $\ = undef; <FILE> };
close FILE;
# Remove all the characters we don't care about.
$file =~ s/[^ANGTC]//g;
# Walk through the string, looking for matches.
while ($file =~ /[AN]{10}/g)
{
print "$1\n";
}
You're going to have to add the loop around the files, add any letters you want to be allowed into the substitution, etc. You're also going to have to add handling if you don't want to see overlapping sequences. Good luck!
------ We are the carpenters and bricklayers of the Information Age. The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6 Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified. | [reply] [d/l] |
|
|
|
|
| [reply] |
|
|
|
|
|
Are you sure that's what you want? You know your problem-space, so I'm just checking. There are three issues I see:
- Fasta doesn't have to be 80 characters per-line
- Fasta doesn't have to be ctrl-M delimited
- Are you sure you don't want to check strand identity (e.g. look for poly-T sequences)?
Either way, the simple answer is to parse the file using SeqIO (BioPerl). Extract the sequence, then run a reg-ex to look for what you want. In your case it's $seq =~ /[AN]{10,}/ I think.
| [reply] [d/l] |
|
If only there was a bioinformatics monks!
| [reply] |
Re: BioInformatics - polyA tail search
by blokhead (Monsignor) on Sep 02, 2003 at 19:32 UTC
|
my $tail_length = 10;
sub contains_polytail {
my $file = shift;
open my $fh, "<$file" or die "Couldn't open $file: $!";
## only read the last 10 + 2 bytes
seek $fh, -( $tail_length + 2), 2;
my $tail = do { local $/; <$fh> };
## ignore newlines and ^M's
$tail =~ s/[\r\n]//g;
return ($tail =~ /[AN]{$tail_length}$/);
}
print contains_polytail('test.dna') ? "yes\n" : "no\n";
Important note: this won't tell you the entire polyA tail, just whether there is one. The polyA tail might be 600 characters, it might be 15. If you need the entire tail's contents, you either have to keep reading backwards through the file, or else simply slurp the entire file into memory and use one of the other solutions in this thread. But now you only need to slurp the entire file if you know it contains a polyA tail.
blokhead | [reply] [d/l] |
|
I wrote a simple program for something similar once. Seemed like you needed to account for inexact matches, if it was a real world and not an academic problem.
You might want to look at the CPAN module String::Approx. It aint a speed demon, but it seemed to work.
http://search.cpan.org/author/JHI/String-Approx-3.20/Approx.pm
Please forgive if I am wrong, I know a little Perl and even less Genetics.
| [reply] |
Re: BioInformatics - polyA tail search
by biosysadmin (Deacon) on Sep 03, 2003 at 10:00 UTC
|
Dear Perlmonks, I have not yet had time to delve into BioPerl (although I can't wait!).
Here's a brief introduction to Bioperl for you:
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my @files = <*.seq>;
my $pattern = '[AN]{10}';
foreach my $file (@files) {
my $seqio = Bio::SeqIO->new( -file => $file );
while ( my $seqobj = $seqio->next_seq() ) {
my $raw_seq = $seqobj->seq();
if ( $raw_seq =~ /$pattern/o ) {
print "Found a polyA tail in $file.\n";
}
}
}
This seems to work for me.
If you have any questions about how it works, just ask. I made a few assumptions:
Also, is there community interest in an Intro to Bioperl Tutorial? The documentation at doc.bioperl.org seems to be very geared towards developers and other experienced Bioperl users. If there is interest I'd be happy to write it, alhough I might need some help (I'm a long time reader of PM, short time member). | [reply] [d/l] [select] |
|
There *is* a BioPerl tutorial bptutorial.pl, along with a huge variety of sample scripts in the BioPerl distribution. I think additional BioPerl tutorial material is a great idea, but don't put it here: contribute it back to the main project. The core developers there are too overloaded to do this kind of stuff, I think.
| [reply] [d/l] |
Re: BioInformatics - polyA tail search
by halley (Prior) on Sep 02, 2003 at 18:54 UTC
|
Does a single file have a single sequence? You said these polyA tails are at (or near?) the end.
If the files have just one sequence and are small enough to fit in memory, slurp into a scalar, strip any irrelevant characters, and the regular expression qr/[AN]{10,}/ would find the tails.
If the files are too large, or have more than one sequence, you'll have to work a little harder to fit in memory. Reading a couple lines at a time, searching the tailing line(s) for a matching tail.
-- [ e d @ h a l l e y . c c ] | [reply] [d/l] |
Re: BioInformatics - polyA tail search
by johndageek (Hermit) on Sep 03, 2003 at 22:38 UTC
|
This may help. Not polished or thoroughly tested, but should get you to a starting point.
enjoy!
dageek
#!/usr/bin/perl -w
use strict;
## set minimum length for polytail to search for
my $polytail_min_len = 10;
my $polytail_curr_len = $polytail_min_len;
my $workline;
my $polya_str_base;
## read data record
for (<DATA>)
{
## remove exteranious characters
s/[\r\n]//g;
# add new data to end of working data string
$workline = $workline . $_;
while (length($workline) > $polytail_curr_len)
{
$polya_str_base = substr($workline,0,$polytail_curr_len);
## remove desired characters from string
$polya_str_base =~ s/[AN]//g;
## no characters left = all characters were in desired character set
if (length($polya_str_base) == 0)
{
## add a character from data set to string to test (ok - bump subscrip
+t)
$polytail_curr_len++;
}
else
{
## a polytail of at least minimum length was found
if ($polytail_curr_len > $polytail_min_len)
{
print substr($workline,0,$polytail_curr_len-1) . "\n";
## trim characters of found polytail from working string
$workline = substr($workline,$polytail_curr_len);
## reset length of string to test
$polytail_curr_len = $polytail_min_len;
}
## trim lead character from working field
$workline = substr($workline,1);
}
}
}
__DATA__
123456789abcdefghijklmnopqrstuvwxyz
ACGGAAAAAAAAAAATCGGATCTGAATGTCTAGAGGGGTTCTCTCCCTTGGTGTGAGTCTAGCCCTGAAA
+GTTGCANANAN
NANANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+AAAAAAAAAAAA
NTCCCACTCTANCGCGAAGCCAAATTTGTCGAGAGTACTCTGGGGGGAAGAGATCAGAATTGTGCAGACT
+AATCCGTAACTGC
CAAGTACTATTGGCCCTGTTCCAACCATCTAACCTCCTTATGATAACCATGCCACTAAATGGGTTCCTGG
+ATCTGCACCT
CATTCGCTCGCCTTATGGCCTCGGCTCTCTGCGTATCCACCCTCCTCGTCACCGCCATGCCCTTCGACCT
+TCAGCGGGGG
| [reply] [d/l] |