Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re: Re: BioInformatics - polyA tail search

by MiamiGenome (Sexton)
on Sep 02, 2003 at 18:17 UTC ( #288371=note: print w/replies, xml ) Need Help??


in reply to Re: BioInformatics - polyA tail search
in thread BioInformatics - polyA tail search

I do apologize for the sparse details.

A polyA tail can be defined as a string of length 10 or greater containing only 'A' or 'N'.

For example,

AAAANAAAAANNNAANANANN

or

AAAAAAAAAAAAAAAANNAAA

The files to search contain 80 character lines terminated with a control-M, as follows :

ACGGAAAATCGGATCTGAATGTCTAGAGGGGTTCTCTCCCTTGGTGTGAGTCTAGCCCTGAAAGTTGCCACTCATTGAGC^M CGTTGCCGACTGAGGCTTTGGACTCCAAGGGTAAGGAGCAGACGATGGAGGACGATTTGCTTTGGGGCATCACGCAACCA^M TCCCACTCTCGCGAAGCCAAATTTGTCGAGAGTACTCTGGGGGGAAGAGATCAGAATTGTGCAGACTAATCCGTAACTGC^M CAAGTACTATTGGCCCTGTTCCAACCATCTAACCTCCTTATGATAACCATGCCACTAAATGGGTTCCTGGATCTGCACCT^M CATTCGCTCGCCTTATGGCCTCGGCTCTCTGCGTATCCACCCTCCTCGTCACCGCCATGCCCTTCGACCTTCAGCGGGGG^M

Thus far, I have used only grep, but I think the control-M may be imbedded in some of the poly A stretches.

Thank you again!

  • Comment on Re: Re: BioInformatics - polyA tail search

Replies are listed 'Best First'.
Re3: BioInformatics - polyA tail search
by dragonchild (Archbishop) on Sep 02, 2003 at 18:24 UTC
    This is completely untested, but you can use this as a start. Warning: This could be memory intensive for large files!
    # Open the file and slurp the contents to a string. open FILE, "File_To_Read" || die "Cannot open 'File_To_Read' for readi +ng: $!\n"; my $file = do { $\ = undef; <FILE> }; close FILE; # Remove all the characters we don't care about. $file =~ s/[^ANGTC]//g; # Walk through the string, looking for matches. while ($file =~ /[AN]{10}/g) { print "$1\n"; }

    You're going to have to add the loop around the files, add any letters you want to be allowed into the substitution, etc. You're also going to have to add handling if you don't want to see overlapping sequences. Good luck!

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      In the definition of a ployA tail it says "as a string of length 10 or greater containing only 'A' or 'N' if you erase all unwanted characters then some 'A's and 'N's that weren't together before might come together. Also note that you probably want to match against [AN]{10,} so that if there are more than 10 A's or N's in a row the match does not fail. Also, MiamiGenome wanted the filenames. This modified version of your code might work a little better:
      # if your file extension is not .txt change it to whatever is approria +te while (my $filename=<*.txt>){ # Open the file and slurp the contents to a string. open FILE, $filename || die "Cannot open '$filename' for reading: $! +\n"; my $file = do { $\ = undef; <FILE> }; close FILE; # If a 'polyA' sequence is found print the file name. if ($file =~ /[AN]{10,}/) { print "$filename has a polyA tail sequence\n"; } }
        Also note that you probably want to match against [AN]{10,} so that if there are more than 10 A's or N's in a row the match does not fail.
        If there are more than ten, then {10} will match just fine.
Re: Re: Re: BioInformatics - polyA tail search
by BrowserUk (Pope) on Sep 02, 2003 at 18:27 UTC

    Color me confused, but

    a) I thought that genome sequences consisted of ACGT.

    b) Your example sequences do not contain any 'N's.

    ?


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

      You are correct, but when the automated sequencers can not unambiguously choose the A,C,T,or G, they assign 'N'.

      My chosen example sequences were from the beginning of a sequence file. PolyA stretches are found at the end.

      Cheers, and thank you in advance!

        Something like this will get you started.

        perl -nle" print "$ARGV:($./$+[0]): $1" if m[([AN]{10,}]g;" file*

        This will print lines like

        filename:(10/50): ANNNANANAAN

        where the first number is theline in the file and the second is the offset within the line.

        For unix you need to swap "s for 's, and under Win32 you would need to add BEGIN{ @ARGV=map{ glob } @ARGV } to expand the wildcard filespec supplied on the comand line.

        See perlrun for the switches used, and perlre for the regex.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
        If I understand your problem, I can solve it! Of course, the same can be said for you.

      Genome sequence do consist of A's, C's, G's and T's, but sequencing a section of DNA can sometimes give ambiguous results, especially near the end of a sequence. If you can't tell what nucleotide occured at what position in the sequence, it's common practice to use N to denote an ambiguous base.

      Lately, sequencing has advanced to the point that while there is still ambiguity in sequencing, the sequencer can narrow down the possible bases. There is a way of denoting this ambiguity with a one-letter code from the IUB ambiguity codes table.

      So, like anything in biology there is ambiguity (sometimes). One might even go as far as to say that there is ambiguity about when there is ambiguity, because most sequences that I work on don't contain ambiguous bases. However, the concept of ambiguity can be very useful in a laboratory setting for reasons that are way beyond the scope of this discussion.

      Hopefully this cleared up some of the confusion. :)

        Another important use of ambiguity is that sometimes you search sequence ambiguously. So even though the raw DNA only contains {GCAT}, sometimes you are looking for sequences that, at some point, can have restricted sets like {GC} or {AT}.

Re: Re: Re: BioInformatics - polyA tail search
by Anonymous Monk on Sep 03, 2003 at 12:18 UTC

    Are you sure that's what you want? You know your problem-space, so I'm just checking. There are three issues I see:

    1. Fasta doesn't have to be 80 characters per-line
    2. Fasta doesn't have to be ctrl-M delimited
    3. Are you sure you don't want to check strand identity (e.g. look for poly-T sequences)?

    Either way, the simple answer is to parse the file using SeqIO (BioPerl). Extract the sequence, then run a reg-ex to look for what you want. In your case it's $seq =~ /[AN]{10,}/ I think.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://288371]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (8)
As of 2020-05-26 13:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    If programming languages were movie genres, Perl would be:















    Results (150 votes). Check out past polls.

    Notices?