http://www.perlmonks.org?node_id=288362

MiamiGenome has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perlmonks, I have not yet had time to delve into BioPerl (although I can't wait!). I am seeking help with creating a perl script to identify sequence files containing polyA tails. I have about 2000 files to process. Thank you all for such a wonderful place to learn and write Perl! MiamiGenome

Replies are listed 'Best First'.
Re: BioInformatics - polyA tail search
by halley (Prior) on Sep 02, 2003 at 17:52 UTC
    We are not all bioinformatics specialists, but we know Perl. Could you please let us know (1) what your data files contain, (2) what a polyA tail would look like in that data, (3) what you've tried so far.

    How to ask questions the smart way.

    --
    [ e d @ h a l l e y . c c ]

      I do apologize for the sparse details.

      A polyA tail can be defined as a string of length 10 or greater containing only 'A' or 'N'.

      For example,

      AAAANAAAAANNNAANANANN

      or

      AAAAAAAAAAAAAAAANNAAA

      The files to search contain 80 character lines terminated with a control-M, as follows :

      ACGGAAAATCGGATCTGAATGTCTAGAGGGGTTCTCTCCCTTGGTGTGAGTCTAGCCCTGAAAGTTGCCACTCATTGAGC^M CGTTGCCGACTGAGGCTTTGGACTCCAAGGGTAAGGAGCAGACGATGGAGGACGATTTGCTTTGGGGCATCACGCAACCA^M TCCCACTCTCGCGAAGCCAAATTTGTCGAGAGTACTCTGGGGGGAAGAGATCAGAATTGTGCAGACTAATCCGTAACTGC^M CAAGTACTATTGGCCCTGTTCCAACCATCTAACCTCCTTATGATAACCATGCCACTAAATGGGTTCCTGGATCTGCACCT^M CATTCGCTCGCCTTATGGCCTCGGCTCTCTGCGTATCCACCCTCCTCGTCACCGCCATGCCCTTCGACCTTCAGCGGGGG^M

      Thus far, I have used only grep, but I think the control-M may be imbedded in some of the poly A stretches.

      Thank you again!

        This is completely untested, but you can use this as a start. Warning: This could be memory intensive for large files!
        # Open the file and slurp the contents to a string. open FILE, "File_To_Read" || die "Cannot open 'File_To_Read' for readi +ng: $!\n"; my $file = do { $\ = undef; <FILE> }; close FILE; # Remove all the characters we don't care about. $file =~ s/[^ANGTC]//g; # Walk through the string, looking for matches. while ($file =~ /[AN]{10}/g) { print "$1\n"; }

        You're going to have to add the loop around the files, add any letters you want to be allowed into the substitution, etc. You're also going to have to add handling if you don't want to see overlapping sequences. Good luck!

        ------
        We are the carpenters and bricklayers of the Information Age.

        The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

        Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

        Color me confused, but

        a) I thought that genome sequences consisted of ACGT.

        b) Your example sequences do not contain any 'N's.

        ?


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
        If I understand your problem, I can solve it! Of course, the same can be said for you.

        Are you sure that's what you want? You know your problem-space, so I'm just checking. There are three issues I see:

        1. Fasta doesn't have to be 80 characters per-line
        2. Fasta doesn't have to be ctrl-M delimited
        3. Are you sure you don't want to check strand identity (e.g. look for poly-T sequences)?

        Either way, the simple answer is to parse the file using SeqIO (BioPerl). Extract the sequence, then run a reg-ex to look for what you want. In your case it's $seq =~ /[AN]{10,}/ I think.

      If only there was a bioinformatics monks!
Re: BioInformatics - polyA tail search
by blokhead (Monsignor) on Sep 02, 2003 at 19:32 UTC
    If the files are very large, you don't want to load the whole file into memory. A more efficient solution would be to only read the last (10 + epsilon) characters of the file (the epsilon "fudge factor" is just in case there is a \r or \r\n within the last 10 characters.

    Here is some code I came up with to do this search efficiently:

    my $tail_length = 10; sub contains_polytail { my $file = shift; open my $fh, "<$file" or die "Couldn't open $file: $!"; ## only read the last 10 + 2 bytes seek $fh, -( $tail_length + 2), 2; my $tail = do { local $/; <$fh> }; ## ignore newlines and ^M's $tail =~ s/[\r\n]//g; return ($tail =~ /[AN]{$tail_length}$/); } print contains_polytail('test.dna') ? "yes\n" : "no\n";
    Important note: this won't tell you the entire polyA tail, just whether there is one. The polyA tail might be 600 characters, it might be 15. If you need the entire tail's contents, you either have to keep reading backwards through the file, or else simply slurp the entire file into memory and use one of the other solutions in this thread. But now you only need to slurp the entire file if you know it contains a polyA tail.

    blokhead

      I wrote a simple program for something similar once. Seemed like you needed to account for inexact matches, if it was a real world and not an academic problem.

      You might want to look at the CPAN module String::Approx. It aint a speed demon, but it seemed to work.

      http://search.cpan.org/author/JHI/String-Approx-3.20/Approx.pm

      Please forgive if I am wrong, I know a little Perl and even less Genetics.
Re: BioInformatics - polyA tail search
by biosysadmin (Deacon) on Sep 03, 2003 at 10:00 UTC
    Dear Perlmonks, I have not yet had time to delve into BioPerl (although I can't wait!).

    Here's a brief introduction to Bioperl for you:
    #!/usr/bin/perl -w use strict; use Bio::SeqIO; my @files = <*.seq>; my $pattern = '[AN]{10}'; foreach my $file (@files) { my $seqio = Bio::SeqIO->new( -file => $file ); while ( my $seqobj = $seqio->next_seq() ) { my $raw_seq = $seqobj->seq(); if ( $raw_seq =~ /$pattern/o ) { print "Found a polyA tail in $file.\n"; } } }
    This seems to work for me.

    If you have any questions about how it works, just ask. I made a few assumptions:
    • If there were a sequence file with 4 sequences, two of which had poly-A tails, the program would print:
      Found a polyA tail in myfile Found a polyA tail in myfile
    • I'm assuming that you're using some sort of format that Bioperl supports.
    Also, is there community interest in an Intro to Bioperl Tutorial? The documentation at doc.bioperl.org seems to be very geared towards developers and other experienced Bioperl users. If there is interest I'd be happy to write it, alhough I might need some help (I'm a long time reader of PM, short time member).
      There *is* a BioPerl tutorial bptutorial.pl, along with a huge variety of sample scripts in the BioPerl distribution. I think additional BioPerl tutorial material is a great idea, but don't put it here: contribute it back to the main project. The core developers there are too overloaded to do this kind of stuff, I think.
Re: BioInformatics - polyA tail search
by halley (Prior) on Sep 02, 2003 at 18:54 UTC
    Does a single file have a single sequence? You said these polyA tails are at (or near?) the end.

    If the files have just one sequence and are small enough to fit in memory, slurp into a scalar, strip any irrelevant characters, and the regular expression qr/[AN]{10,}/ would find the tails.

    If the files are too large, or have more than one sequence, you'll have to work a little harder to fit in memory. Reading a couple lines at a time, searching the tailing line(s) for a matching tail.

    --
    [ e d @ h a l l e y . c c ]

Re: BioInformatics - polyA tail search
by johndageek (Hermit) on Sep 03, 2003 at 22:38 UTC
    This may help. Not polished or thoroughly tested, but should get you to a starting point.

    enjoy!
    dageek

    #!/usr/bin/perl -w use strict; ## set minimum length for polytail to search for my $polytail_min_len = 10; my $polytail_curr_len = $polytail_min_len; my $workline; my $polya_str_base; ## read data record for (<DATA>) { ## remove exteranious characters s/[\r\n]//g; # add new data to end of working data string $workline = $workline . $_; while (length($workline) > $polytail_curr_len) { $polya_str_base = substr($workline,0,$polytail_curr_len); ## remove desired characters from string $polya_str_base =~ s/[AN]//g; ## no characters left = all characters were in desired character set if (length($polya_str_base) == 0) { ## add a character from data set to string to test (ok - bump subscrip +t) $polytail_curr_len++; } else { ## a polytail of at least minimum length was found if ($polytail_curr_len > $polytail_min_len) { print substr($workline,0,$polytail_curr_len-1) . "\n"; ## trim characters of found polytail from working string $workline = substr($workline,$polytail_curr_len); ## reset length of string to test $polytail_curr_len = $polytail_min_len; } ## trim lead character from working field $workline = substr($workline,1); } } } __DATA__ 123456789abcdefghijklmnopqrstuvwxyz ACGGAAAAAAAAAAATCGGATCTGAATGTCTAGAGGGGTTCTCTCCCTTGGTGTGAGTCTAGCCCTGAAA +GTTGCANANAN NANANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA +AAAAAAAAAAAA NTCCCACTCTANCGCGAAGCCAAATTTGTCGAGAGTACTCTGGGGGGAAGAGATCAGAATTGTGCAGACT +AATCCGTAACTGC CAAGTACTATTGGCCCTGTTCCAACCATCTAACCTCCTTATGATAACCATGCCACTAAATGGGTTCCTGG +ATCTGCACCT CATTCGCTCGCCTTATGGCCTCGGCTCTCTGCGTATCCACCCTCCTCGTCACCGCCATGCCCTTCGACCT +TCAGCGGGGG