Re: Recognize DNA and amino acid sequence

by InfiniteSilence (Curate)
on Apr 23, 2011 at 17:22 UTC

in reply to Recognize DNA and amino acid sequence

Solution in three easy steps:

  • Uno: You first need to define clearly what you mean by a sequence. For argument's sake I'll assume you mean something like this: a run of capital letters (AGCTURYKMSWBDHVN), one after another, followed by a single whitespace character (I borrowed this from here).
  • Dos: You may run into problems using Perl with extremely large files. Read up on this so you can divide up the problem (for example by splitting the files themselves, or by rewriting some parts in C and using XS). A really simple example using the file format from the previous link is here:
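    use strict;
    use warnings;

    my $seqNum    = 0;
    my %sequences = ();

    # Read the file named on the command line, one line at a time.
    open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    while (<$fh>) {
        # Collect every whole-word run of the sequence letters on this line.
        while (m/\b([AGCTURYKMSWBDHVN]+)\b/g) {
            $sequences{++$seqNum} = $1;
        }
    }
    close($fh);

    # Print the sequences in the order they were found.
    for (sort { $a <=> $b } keys %sequences) {
        print qq|$_\t$sequences{$_}\n|;
    }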
  • Tres: Here is the kicker. Just because these letters satisfy the regex doesn't mean they are valid sequences. You will need to compare them against a sequence database using a powerful search tool like BLAST. There are Perl modules for performing such searches, but you should first get acquainted with BioPerl, a suite of tools built specifically for these kinds of problems.
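
Since the original question is about telling DNA from amino acid sequences, here is a rough sketch of a first-pass classifier built on the same regex idea. The sub name classify_sequence and the protein alphabet (the 20 standard one-letter amino acid codes) are my own assumptions, not something from the linked format. It is only a heuristic: a short peptide such as ACGT uses letters found in both alphabets and would be reported as nucleotide, which is exactly why step Tres still applies.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the type of a sequence from its alphabet alone.
# Assumption: nucleotide sequences use only the letters from the regex above;
# protein sequences use only the 20 standard one-letter amino acid codes.
sub classify_sequence {
    my ($seq) = @_;
    return 'empty'      if !defined $seq || $seq eq '';
    # Test the nucleotide alphabet first, since most of its letters
    # are also valid amino acid codes.
    return 'nucleotide' if $seq =~ /\A[AGCTURYKMSWBDHVN]+\z/;
    return 'protein'    if $seq =~ /\A[ACDEFGHIKLMNPQRSTVWY]+\z/;
    return 'unknown';
}

for my $seq (qw(GATTACA MKVLPE HELLO123)) {
    printf "%s\t%s\n", $seq, classify_sequence($seq);
}
```

Anything that comes back as 'unknown' can be thrown out right away; the rest still needs a real database search before you trust it.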

