
Re: Recognize DNA and amino acid sequence

by InfiniteSilence (Curate)
on Apr 23, 2011 at 17:22 UTC (#900983)

in reply to Recognize DNA and amino acid sequence

Solution in three easy steps:

  • Uno: You will need to describe clearly what you mean by a sequence. For argument's sake, I'll say you mean a run of capital letters drawn from AGCTURYKMSWBDHVN (the IUPAC nucleotide codes), one after another, followed by a single whitespace character (I borrowed this from here).
  • Dos: You may run into problems using Perl with extremely large files. Read up on this so you can divide up your problem, for example by splitting the files themselves or by rewriting some parts in C and calling them through XS. A really simple example using the file format from the previous link is here:
    use strict;
    use warnings;
    my $seqNum = 0;
    my %sequences = ();
    open(my $fh, '<', $ARGV[0]) or die $!;
    while (<$fh>) {
        # collect each run of sequence letters on this line
        while (m/\b([AGCTURYKMSWBDHVN]+)\b/g) {
            $sequences{++$seqNum} = $1;
        }
    }
    close($fh);
    # print in the order found
    for (sort { $a <=> $b } keys %sequences) {
        print qq|$_\t$sequences{$_}\n|;
    }
  • Tres: Here is the kicker. Just because a string of these letters satisfies the regex doesn't mean it is a valid sequence. You will need to compare the candidates against a sequence database using a search tool like BLAST. There are Perl modules for running such searches, but you should first become acquainted with BioPerl, a suite of tools built specifically for these kinds of problems.
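
To make the point in Tres concrete, here is a minimal sketch of how easily the pattern picks up things that are not sequences. The helper names `is_nucleotide_token` and `extract_candidates` are made up for illustration; they are not part of the script above or of any module. The only assumption carried over is the letter set from Uno.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $token consists only of the letters
# from step Uno (the IUPAC nucleotide codes).
sub is_nucleotide_token {
    my ($token) = @_;
    return $token =~ /\A[AGCTURYKMSWBDHVN]+\z/ ? 1 : 0;
}

# Hypothetical helper: return every candidate on one line, using the
# same pattern as the script in step Dos.
sub extract_candidates {
    my ($line) = @_;
    my @found = $line =~ /\b([AGCTURYKMSWBDHVN]+)\b/g;
    return @found;
}

# Real-looking sequences are found, as expected.
print join(',', extract_candidates("seq1 AGCTTAGC NNNN xyz GATTACA\n")), "\n";
# prints: AGCTTAGC,NNNN,GATTACA

# Ordinary capitalized words built from the same letters match too.
print join(',', extract_candidates("A BAD DAY")), "\n";
# prints: A,BAD,DAY
```

The second call is the reason the regex alone cannot be the whole answer: A, B, D and Y are all legal codes, so plain words pass the test and only a comparison against real sequence data can weed them out.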

Celebrate Intellectual Diversity
