Furor has asked for the wisdom of the Perl Monks concerning the following question:
Hello,
been working myself in into Perl since about a week. I need it to write scripts for manipulating DNA-sequences. In
first instance, I need to cut off part of a string. This might be a trivial thing to do for an experienced
programmer but currently, I don't know where to go from this point ...
I've heard of BioPerl, but since I'm quite new to Perl AND programming in general, I get this huge avalanche of
information
over me and I really don't know where to look first. Plus I need to get some results rather quickly.
So, what I want it to do is read in a file (fasta format) and compare (the beginning of) every sequence to some
specified strings (called a primer) and when there's a match, remove the matching string from the sequence.
The remaining trimmed sequences should be stored in a new file, and perhaps as a control, store non-matched
sequences in another file. This leads to another question. Do you have to store the processed data in an array
before writing it to an output-file, or can this be done directly?
Normally, the primers to be compared start at the beginning of the sequence, but, to exclude possible errors, it
might be useful to delete anything before the primer.
(btw, each sequence is preceded by a unique identifier key (indicated by ">"), so these should always remain
together)
Also, I was wondering if it'd make a difference in speed if it'd check the entire file (couple of thousand rows)
separately for each specific primer, or if you go row by row and check every primer against it (so only going
through the file once). Does this make sense? :o)
Don't mind the comments too much :)
Thanks in advance!
#! C:/Perl/bin
use strict;
use warnings;
use File::Path;
# This script processes a fasta file containing DNA sequences
# Part 1: declare variables, constants, ...
# forward (F) barcodes
my @forward = ("AGCCTAAGCT",
"TCAAGTTAGC",
"AGCCTGGCAT",
"ACGGTCCATG",
"ACTTGCCGAT",
"ACGGTGGATC",
"ATCCGCCTAG",
"ATGGCGGTAC");
# reverse (R) barcodes
my @reverse = ("AGCTTAGGCT",
"TAGCCTAAGC",
"AGCTTGCCAT",
"ACGTTCAATG",
"ACTGGCGGAT",
"ACGTTGAATC",
"ATCGGCAAGT",
"ATGCCGTTAC");
# primers used for Variable Region 1 (V1) and Variable Region 3 (V3) o
+f 16S rRNA
# forward primer (V1 region)
my $V1 = 'AGAGTTTGATCCTGGCTCAG';
# reverse primer (V3 region)
my $V3 = 'GTATTACCGCGGCTGCTGGCA';
# locate the import-file with data
my $input_file = "C:/../input.txt";
# name the filehandler: FASTA_IN
open (FASTA_IN, $input_file);
# import data (fasta formatted style) as array to read all sequences
my @raw_DNA = <FASTA_IN>;
#test imported data
#print "@raw_DNA\n";
# close the import-file
close FASTA_IN;
# Part 3: start processing sequences
# 3.1 Create arrays to hold processed results
my @Processed_Sequences = ();
my @Rejected_Sequences = ();
# 3.2 concatenate each barcode with apropriate primer
for my $current_barcode(0..$#forward)
{
my $F = "$forward[$current_barcode]$V1\n";
#test concatenation
# print $F;
#test current concatenated barcode.primer against sequences and i
+f match,
#remove the barcode and primer
# =~ m/$F/;
#if match print match $F
}
Re: remove part of string (DNA)
by moritz (Cardinal) on Apr 12, 2011 at 12:19 UTC
|
Since reading and writing is probably the slowest part of your program, it makes sense to read the file only once. And because the file might be big, you shouldn't read it all at once into memory, but rather line by line.
Here's something to get you started. Since you didn't include any example input and expected output data, I can't really help you with the processing step, but maybe it helps anyway:
#! C:/Perl/bin
use strict;
use warnings;
use File::Path;
# This script processes a fasta file containing DNA sequences
# Part 1: declare variables, constants, ...
# forward (F) barcodes
my @forward = ("AGCCTAAGCT",
"TCAAGTTAGC",
"AGCCTGGCAT",
"ACGGTCCATG",
"ACTTGCCGAT",
"ACGGTGGATC",
"ATCCGCCTAG",
"ATGGCGGTAC");
# reverse (R) barcodes
my @reverse = ("AGCTTAGGCT",
"TAGCCTAAGC",
"AGCTTGCCAT",
"ACGTTCAATG",
"ACTGGCGGAT",
"ACGTTGAATC",
"ATCGGCAAGT",
"ATGCCGTTAC");
# primers used for Variable Region 1 (V1) and Variable Region 3 (V3) o
+f 16S rRNA
# forward primer (V1 region)
my $V1 = 'AGAGTTTGATCCTGGCTCAG';
# reverse primer (V3 region)
my $V3 = 'GTATTACCGCGGCTGCTGGCA';
# locate the import-file with data
my $input_file = "C:/../input.txt";
# concatenate primer to bar codes:
my @processed = map { $_ . $V1 } @forwards;
# construct a regex to search for
my $search_for = join '|', @processed;
# compile it:
$search_for = qr{$search_for};
open my $FASTA_IN, '<', $input_file or die $!;
open my $MATCHED_OUT, '>', 'matched.txt' or die $!
open my $NOT_MATCHED_OUT, '>', 'notmatched.txt' or die $'
# don't read all the data at once,
# rather process it line by line
while (my $line = <$FASTA_IN>) {
if ($line =~ $search_for) {
print $MATCHED_OUT $line;
} else {
print $NOT_MATCHED_OUT $line;
}
}
close $FASTA_IN;
close $MATCHED_OUT;
close $NOT_MATCHED_OUT;
| [reply] [d/l] |
|
my $search_for = '('.join( '|', @processed).')';
i.e. add round brackets around the pattern group, right?
| [reply] [d/l] |
Re: remove part of string (DNA)
by tospo (Hermit) on Apr 12, 2011 at 16:39 UTC
|
I would recommend to use BioPerl for reading sequences. Even a simple format like FASTA has a few caveats and you make your life easier in the long run if you use those libraries.
This is the canonical way of readin sequences from a FASTA file:
use Bio::SeqIO;
my $seqio_obj = Bio::SeqIO->new(-file => "sequence.fasta", -format =>
+"fasta" );
while (my $seq_obj = $seqio_obj->next_seq){
print "Sequence id: ".$seq_obj->display_id."\n";
print "Sequence: "$seq_obj->seq."\n";
}
An important question is whether you need both primer/adapters (forward and reverse) to match (presumably either side of a sequenced insert)? If so, you will also need to reverse-complement one of the primers or you have to check both in both orientations unless you are sequencing from the forward primer.
To match and remove a string you can use s//. Here is a Perl one-liner that illsutrates how you can test if something matches and remove it at the same time:
perl -e '$s="get_rid_of_some_string";print "remaining string: $s\n" if
+ $s=~s/^get_rid_//;'
Just run the above line in a terminal to see it in action.
It will reduce "get_rid_of_some_string" to "of_some_string" if "get_rid_" can be found (and removed) from teh beginning.
To remove anything before the pattern as well (e.g. any sequence before a primer), you can do it like this:
perl -e '$s="get_rid_of_some_string";print "remaining string: $s\n" if
+ $s=~s/^.*rid_//;'
This will have the same result as above but the pattern was just "rid_" and the regular expression says: remove everything from teh beginning of the string, optionally followed by some characters, then followed by "rid_", with nothing (i.e. just strip it off).
Hope this helps to get you started.
| [reply] [d/l] [select] |
|
oh and I forgot to mention: I wouldn't worry too much about memory and efficiency of the algorithm unless you plan to implement this on a web server or something where it will run all the time. It's more likely to be a one-off and, especially as a beginner, you can easily spend much more time optimising it than you will ever save. If it takes a bit longer, use the time to brew a nice cup of coffee or go home and find the shiny new results waiting on your disk for you in the morning... :-)
| [reply] |
|
Thanks to all for the replies! I'll check them out under close scrutiny. I'm sure it'll give me a boost.
(to be continued ;) )
| [reply] |
|
Re: remove part of string (DNA)
by Generoso (Prior) on Apr 12, 2011 at 15:58 UTC
|
#!/usr/bin/perl
#
use strict;
use warnings;
# forward (F) barcodes
my @forward = ("AGCCTAAGCT",
"TCAAGTTAGC",
"AGCCTGGCAT",
"ACGGTCCATG",
"ACTTGCCGAT",
"ACGGTGGATC",
"ATCCGCCTAG",
"ATGGCGGTAC");
# reverse (R) barcodes
my @reverse = ("AGCTTAGGCT",
"TAGCCTAAGC",
"AGCTTGCCAT",
"ACGTTCAATG",
"ACTGGCGGAT",
"ACGTTGAATC",
"ATCGGCAAGT",
"ATGCCGTTAC");
my @dnsstr1 = ();
my $fr= 0;
#read the first header >
$dnsstr1[0] = <DATA>;
foreach my $line(<DATA>) {
#Read rest of lines until nest header
if ($line =~ /^>/) {
print "Revese\n" if $fr == 0;
print @dnsstr1,"\n";
@dnsstr1 = ();
$fr = 0;
push @dnsstr1,$line;
}
else {push @dnsstr1,$line;
# check if dna string is in the Forward ones
foreach my $f (@forward){
if ($line =~ $f) { $fr =1; print "Forward\n";}}
}
#print $line;
}
#print the last dna string
print "Revese\n" if $fr == 0;
print @dnsstr1,"\n";
__DATA__
>HSBGPG Human gene for bone gla protein (BGP)
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAG
+AGGGT
ATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAG
+ACACC
ATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGT
+GCCCC
CACCTCCCCTCAGGCCGCATTGCAGTGGGGGCTGAGAGGAGGAAGCACCATGGCCCACCTCTTCTCACCC
+CTTTG
GCTGGCAGTCCCTTTGCAGTCTAACCACCTTGTTGCAGGCTCAATCCATTTGCCCCAGCTCTGCCCTTGC
+AGAGG
GAGAGGAGGGAAGAGCAAGCTGCCCGAGACGCAGGGGAAGGAGGATGAGGGCCCTGGGGATGAGCTGGGG
+TGAAC
CAGGCTCCCTTTCCTTTGCAGGTGCGAAGCCCAGCGGTGCAGAGTCCAGCAAAGGTGCAGGTATGAGGAT
+GGACC
TGATGGGTTCCTGGACCCTCCCCTCTCACCCTGGTCCCTCAGTCTCATTCCCCCACTCCTGCCACCTCCT
+GTCTG
GCCATCAGGAAGGCCAGCCTGCTCCCCACCTGATCCTCCCAAACCCAGAGCCACCTGATGCCTGCCCCTC
+TGCTC
CACAGCCTTTGTGTCCAAGCAGGAGGGCAGCGAGGTAGTGAAGAGACCCAGGCGCTACCTGTATCAATGG
+CTGGG
GTGAGAGAAAAGGCAGAGCTGGGCCAAGGCCCTGCCTCTCCGGGATGGTCTGTGGGGGAGCTGCAGCAGG
+GAGTG
GCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGG
+AGAGG
AGGGATGGGCATTTTGCACGGGGGCTGATGCCACCACGTCGGGTGTCTCAGAGCCCCAGTCCCCTACCCG
+GATCC
CCTGGAGCCCAGGAGGGAGGTGTGTGAGCTCAATCCGGACTGTGACGAGTTGGCTGACCACATCGGCTTT
+CAGGA
GGCCTATCGGCGCTTCTACGGCCCGGTCTAGGGTGTCGCTCTGCTGGCCTGGCCGGCAACCCCAGTTCTG
+CTCCT
CTCCAGGCACCCTTCTTTCCTCTTCCCCTTGCCCTTGCCCTGACCTCCCAGCCCTATGGATGTGGGGTCC
+CCATC
ATCCCAGCTGCTCCCAAATAAACTCCAGAAG
>HSGLTH1 Human theta 1-globin gene
CCACTGCACTCACCGCACCCGGCCAATTTTTGTGTTTTTAGTAGAGACTAAATACCATATAGTGAACACC
+TAAGA
CGGGGGGCCTTGGATCCAGGGCGATTCAGAGGGCCCCGGTCGGAGCTGTCGGAGATTGAGCGCGCGCGGT
+CCCGG
GATCTCCGACGAGGCCCTGGACCCCCGGGCGGCGAAGCTGCGGCGCGGCGCCCCCTGGAGGCCGCGGGAC
+CCCTG
GCCGGTCCGCGCAGGCGCAGCGGGGTCGCAGGGCGCGGCGGGTTCCAGCGCGGGGATGGCGCTGTCCGCG
+GAGGA
CCGGGCGCTGGTGCGCGCCCTGTGGAAGAAGCTGGGCAGCAACGTCGGCGTCTACACGACAGAGGCCCTG
+GAAAG
GTGCGGCAGGCTGGGCGCCCCCGCCCCCAGGGGCCCTCCCTCCCCAAGCCCCCCGGACGCGCCTCACCCA
+CGTTC
CTCTCGCAGGACCTTCCTGGCTTTCCCCGCCACGAAGACCTACTTCTCCCACCTGGACCTGAGCCCCGGC
+TCCTC
ACAAGTCAGAGCCCACGGCCAGAAGGTGGCGGACGCGCTGAGCCTCGCCGTGGAGCGCCTGGACGACCTA
+CCCCA
CGCGCTGTCCGCGCTGAGCCACCTGCACGCGTGCCAGCTGCGAGTGGACCCGGCCAGCTTCCAGGTGAGC
+GGCTG
CCGTGCTGGGCCCCTGTCCCCGGGAGGGCCCCGGCGGGGTGGGTGCGGGGGGCGTGCGGGGCGGGTGCAG
+GCGAG
TGAGCCTTGAGCGCTCGCCGCAGCTCCTGGGCCACTGCCTGCTGGTAACCCTCGCCCGGCACTACCCCGG
+AGACT
TCAGCCCCGCGCTGCAGGCGTCGCTGGACAAGTTCCTGAGCCACGTTATCTCGGCGCTGGTTTCCGAGTA
+CCGCT
GAACTGTGGGTGGGTGGCCGCGGGATCCCCAGGCGACCTTCCCCGTGTTTGAGTAAAGCCTCTCCCAGGA
+GCAGC
CTTCTTGCCGTGCTCTCTCGAGGTCAGGACGCGAGAGGAAGGCGC
>seq1
AGCCTAAGCTASTPGHTIIYEAVCLHNDRTTIP
>seq2 optional comment
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES
Results
perl "D:\Citi Midleware\db\perl\perl5data.pl" stocks 10 5
Process started >>>
Revese
>HSBGPG Human gene for bone gla protein (BGP)
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAG
+AGGGT
ATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAG
+ACACC
ATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGT
+GCCCC
CACCTCCCCTCAGGCCGCATTGCAGTGGGGGCTGAGAGGAGGAAGCACCATGGCCCACCTCTTCTCACCC
+CTTTG
GCTGGCAGTCCCTTTGCAGTCTAACCACCTTGTTGCAGGCTCAATCCATTTGCCCCAGCTCTGCCCTTGC
+AGAGG
GAGAGGAGGGAAGAGCAAGCTGCCCGAGACGCAGGGGAAGGAGGATGAGGGCCCTGGGGATGAGCTGGGG
+TGAAC
CAGGCTCCCTTTCCTTTGCAGGTGCGAAGCCCAGCGGTGCAGAGTCCAGCAAAGGTGCAGGTATGAGGAT
+GGACC
TGATGGGTTCCTGGACCCTCCCCTCTCACCCTGGTCCCTCAGTCTCATTCCCCCACTCCTGCCACCTCCT
+GTCTG
GCCATCAGGAAGGCCAGCCTGCTCCCCACCTGATCCTCCCAAACCCAGAGCCACCTGATGCCTGCCCCTC
+TGCTC
CACAGCCTTTGTGTCCAAGCAGGAGGGCAGCGAGGTAGTGAAGAGACCCAGGCGCTACCTGTATCAATGG
+CTGGG
GTGAGAGAAAAGGCAGAGCTGGGCCAAGGCCCTGCCTCTCCGGGATGGTCTGTGGGGGAGCTGCAGCAGG
+GAGTG
GCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGG
+AGAGG
AGGGATGGGCATTTTGCACGGGGGCTGATGCCACCACGTCGGGTGTCTCAGAGCCCCAGTCCCCTACCCG
+GATCC
CCTGGAGCCCAGGAGGGAGGTGTGTGAGCTCAATCCGGACTGTGACGAGTTGGCTGACCACATCGGCTTT
+CAGGA
GGCCTATCGGCGCTTCTACGGCCCGGTCTAGGGTGTCGCTCTGCTGGCCTGGCCGGCAACCCCAGTTCTG
+CTCCT
CTCCAGGCACCCTTCTTTCCTCTTCCCCTTGCCCTTGCCCTGACCTCCCAGCCCTATGGATGTGGGGTCC
+CCATC
ATCCCAGCTGCTCCCAAATAAACTCCAGAAG
Revese
>HSGLTH1 Human theta 1-globin gene
CCACTGCACTCACCGCACCCGGCCAATTTTTGTGTTTTTAGTAGAGACTAAATACCATATAGTGAACACC
+TAAGA
CGGGGGGCCTTGGATCCAGGGCGATTCAGAGGGCCCCGGTCGGAGCTGTCGGAGATTGAGCGCGCGCGGT
+CCCGG
GATCTCCGACGAGGCCCTGGACCCCCGGGCGGCGAAGCTGCGGCGCGGCGCCCCCTGGAGGCCGCGGGAC
+CCCTG
GCCGGTCCGCGCAGGCGCAGCGGGGTCGCAGGGCGCGGCGGGTTCCAGCGCGGGGATGGCGCTGTCCGCG
+GAGGA
CCGGGCGCTGGTGCGCGCCCTGTGGAAGAAGCTGGGCAGCAACGTCGGCGTCTACACGACAGAGGCCCTG
+GAAAG
GTGCGGCAGGCTGGGCGCCCCCGCCCCCAGGGGCCCTCCCTCCCCAAGCCCCCCGGACGCGCCTCACCCA
+CGTTC
CTCTCGCAGGACCTTCCTGGCTTTCCCCGCCACGAAGACCTACTTCTCCCACCTGGACCTGAGCCCCGGC
+TCCTC
ACAAGTCAGAGCCCACGGCCAGAAGGTGGCGGACGCGCTGAGCCTCGCCGTGGAGCGCCTGGACGACCTA
+CCCCA
CGCGCTGTCCGCGCTGAGCCACCTGCACGCGTGCCAGCTGCGAGTGGACCCGGCCAGCTTCCAGGTGAGC
+GGCTG
CCGTGCTGGGCCCCTGTCCCCGGGAGGGCCCCGGCGGGGTGGGTGCGGGGGGCGTGCGGGGCGGGTGCAG
+GCGAG
TGAGCCTTGAGCGCTCGCCGCAGCTCCTGGGCCACTGCCTGCTGGTAACCCTCGCCCGGCACTACCCCGG
+AGACT
TCAGCCCCGCGCTGCAGGCGTCGCTGGACAAGTTCCTGAGCCACGTTATCTCGGCGCTGGTTTCCGAGTA
+CCGCT
GAACTGTGGGTGGGTGGCCGCGGGATCCCCAGGCGACCTTCCCCGTGTTTGAGTAAAGCCTCTCCCAGGA
+GCAGC
CTTCTTGCCGTGCTCTCTCGAGGTCAGGACGCGAGAGGAAGGCGC
Forward
>seq1
AGCCTAAGCTASTPGHTIIYEAVCLHNDRTTIP
Revese
>seq2 optional comment
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES
<<< Process finished.
================ READY ================
| [reply] [d/l] [select] |
|
#!/usr/bin/perl
#
use strict;
use warnings;
# forward (F) barcodes
my @forward = ("AGCCTAAGCT",
"TCAAGTTAGC",
"AGCCTGGCAT",
"ACGGTCCATG",
"ACTTGCCGAT",
"ACGGTGGATC",
"ATCCGCCTAG",
"ATGGCGGTAC");
# reverse (R) barcodes
my @reverse = ("AGCTTAGGCT",
"TAGCCTAAGC",
"AGCTTGCCAT",
"ACGTTCAATG",
"ACTGGCGGAT",
"ACGTTGAATC",
"ATCGGCAAGT",
"ATGCCGTTAC");
my @dnsstr1 = ();
my $fr= 0;
# read first header
$dnsstr1[0] = <DATA>;
foreach my $line(<DATA>) {
#read rest of lines in the dan sequece
if ($line =~ /^>/) {
# it is not a foward or a revers itentify as other
print "Other\n" if $fr == 0;
print @dnsstr1,"\n";
@dnsstr1 = ();
$fr = 0;
push @dnsstr1,$line;
}
else {push @dnsstr1,$line;
#check if it is a forward
foreach my $f (@forward){
if ($line =~ $f) { $fr =1; print "Forward\n";}}
#check if it is a reverse
foreach my $f (@reverse){
if ($line =~ $f) { $fr =2; print "Reverse\n";}}
}
}
# print last dns string
print "Other\n" if $fr == 0;
print @dnsstr1,"\n";
__DATA__
>HSBGPG Human gene for bone gla protein (BGP)
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAG
+AGGGT
AGCTTAGGCTCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAG
+ACACC
ATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGT
+GCCCC
CACCTCCCCTCAGGCCGCATTGCAGTGGGGGCTGAGAGGAGGAAGCACCATGGCCCACCTCTTCTCACCC
+CTTTG
GCTGGCAGTCCCTTTGCAGTCTAACCACCTTGTTGCAGGCTCAATCCATTTGCCCCAGCTCTGCCCTTGC
+AGAGG
GAGAGGAGGGAAGAGCAAGCTGCCCGAGACGCAGGGGAAGGAGGATGAGGGCCCTGGGGATGAGCTGGGG
+TGAAC
CAGGCTCCCTTTCCTTTGCAGGTGCGAAGCCCAGCGGTGCAGAGTCCAGCAAAGGTGCAGGTATGAGGAT
+GGACC
TGATGGGTTCCTGGACCCTCCCCTCTCACCCTGGTCCCTCAGTCTCATTCCCCCACTCCTGCCACCTCCT
+GTCTG
GCCATCAGGAAGGCCAGCCTGCTCCCCACCTGATCCTCCCAAACCCAGAGCCACCTGATGCCTGCCCCTC
+TGCTC
CACAGCCTTTGTGTCCAAGCAGGAGGGCAGCGAGGTAGTGAAGAGACCCAGGCGCTACCTGTATCAATGG
+CTGGG
GTGAGAGAAAAGGCAGAGCTGGGCCAAGGCCCTGCCTCTCCGGGATGGTCTGTGGGGGAGCTGCAGCAGG
+GAGTG
GCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGG
+AGAGG
AGGGATGGGCATTTTGCACGGGGGCTGATGCCACCACGTCGGGTGTCTCAGAGCCCCAGTCCCCTACCCG
+GATCC
CCTGGAGCCCAGGAGGGAGGTGTGTGAGCTCAATCCGGACTGTGACGAGTTGGCTGACCACATCGGCTTT
+CAGGA
GGCCTATCGGCGCTTCTACGGCCCGGTCTAGGGTGTCGCTCTGCTGGCCTGGCCGGCAACCCCAGTTCTG
+CTCCT
CTCCAGGCACCCTTCTTTCCTCTTCCCCTTGCCCTTGCCCTGACCTCCCAGCCCTATGGATGTGGGGTCC
+CCATC
ATCCCAGCTGCTCCCAAATAAACTCCAGAAG
>HSGLTH1 Human theta 1-globin gene
CCACTGCACTCACCGCACCCGGCCAATTTTTGTGTTTTTAGTAGAGACTAAATACCATATAGTGAACACC
+TAAGA
CGGGGGGCCTTGGATCCAGGGCGATTCAGAGGGCCCCGGTCGGAGCTGTCGGAGATTGAGCGCGCGCGGT
+CCCGG
GATCTCCGACGAGGCCCTGGACCCCCGGGCGGCGAAGCTGCGGCGCGGCGCCCCCTGGAGGCCGCGGGAC
+CCCTG
GCCGGTCCGCGCAGGCGCAGCGGGGTCGCAGGGCGCGGCGGGTTCCAGCGCGGGGATGGCGCTGTCCGCG
+GAGGA
CCGGGCGCTGGTGCGCGCCCTGTGGAAGAAGCTGGGCAGCAACGTCGGCGTCTACACGACAGAGGCCCTG
+GAAAG
GTGCGGCAGGCTGGGCGCCCCCGCCCCCAGGGGCCCTCCCTCCCCAAGCCCCCCGGACGCGCCTCACCCA
+CGTTC
CTCTCGCAGGACCTTCCTGGCTTTCCCCGCCACGAAGACCTACTTCTCCCACCTGGACCTGAGCCCCGGC
+TCCTC
ACAAGTCAGAGCCCACGGCCAGAAGGTGGCGGACGCGCTGAGCCTCGCCGTGGAGCGCCTGGACGACCTA
+CCCCA
CGCGCTGTCCGCGCTGAGCCACCTGCACGCGTGCCAGCTGCGAGTGGACCCGGCCAGCTTCCAGGTGAGC
+GGCTG
CCGTGCTGGGCCCCTGTCCCCGGGAGGGCCCCGGCGGGGTGGGTGCGGGGGGCGTGCGGGGCGGGTGCAG
+GCGAG
TGAGCCTTGAGCGCTCGCCGCAGCTCCTGGGCCACTGCCTGCTGGTAACCCTCGCCCGGCACTACCCCGG
+AGACT
TCAGCCCCGCGCTGCAGGCGTCGCTGGACAAGTTCCTGAGCCACGTTATCTCGGCGCTGGTTTCCGAGTA
+CCGCT
GAACTGTGGGTGGGTGGCCGCGGGATCCCCAGGCGACCTTCCCCGTGTTTGAGTAAAGCCTCTCCCAGGA
+GCAGC
CTTCTTGCCGTGCTCTCTCGAGGTCAGGACGCGAGAGGAAGGCGC
>seq1
AGCCTAAGCTASTPGHTIIYEAVCLHNDRTTIP
>seq2 optional comment
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES
Results
perl "D:\Citi Midleware\db\perl\perl5data.pl" stocks 10 5
Process started >>>
Reverse
>HSBGPG Human gene for bone gla protein (BGP)
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAG
+AGGGT
AGCTTAGGCTCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAG
+ACACC
ATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGT
+GCCCC
CACCTCCCCTCAGGCCGCATTGCAGTGGGGGCTGAGAGGAGGAAGCACCATGGCCCACCTCTTCTCACCC
+CTTTG
GCTGGCAGTCCCTTTGCAGTCTAACCACCTTGTTGCAGGCTCAATCCATTTGCCCCAGCTCTGCCCTTGC
+AGAGG
GAGAGGAGGGAAGAGCAAGCTGCCCGAGACGCAGGGGAAGGAGGATGAGGGCCCTGGGGATGAGCTGGGG
+TGAAC
CAGGCTCCCTTTCCTTTGCAGGTGCGAAGCCCAGCGGTGCAGAGTCCAGCAAAGGTGCAGGTATGAGGAT
+GGACC
TGATGGGTTCCTGGACCCTCCCCTCTCACCCTGGTCCCTCAGTCTCATTCCCCCACTCCTGCCACCTCCT
+GTCTG
GCCATCAGGAAGGCCAGCCTGCTCCCCACCTGATCCTCCCAAACCCAGAGCCACCTGATGCCTGCCCCTC
+TGCTC
CACAGCCTTTGTGTCCAAGCAGGAGGGCAGCGAGGTAGTGAAGAGACCCAGGCGCTACCTGTATCAATGG
+CTGGG
GTGAGAGAAAAGGCAGAGCTGGGCCAAGGCCCTGCCTCTCCGGGATGGTCTGTGGGGGAGCTGCAGCAGG
+GAGTG
GCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGG
+AGAGG
AGGGATGGGCATTTTGCACGGGGGCTGATGCCACCACGTCGGGTGTCTCAGAGCCCCAGTCCCCTACCCG
+GATCC
CCTGGAGCCCAGGAGGGAGGTGTGTGAGCTCAATCCGGACTGTGACGAGTTGGCTGACCACATCGGCTTT
+CAGGA
GGCCTATCGGCGCTTCTACGGCCCGGTCTAGGGTGTCGCTCTGCTGGCCTGGCCGGCAACCCCAGTTCTG
+CTCCT
CTCCAGGCACCCTTCTTTCCTCTTCCCCTTGCCCTTGCCCTGACCTCCCAGCCCTATGGATGTGGGGTCC
+CCATC
ATCCCAGCTGCTCCCAAATAAACTCCAGAAG
Other
>HSGLTH1 Human theta 1-globin gene
CCACTGCACTCACCGCACCCGGCCAATTTTTGTGTTTTTAGTAGAGACTAAATACCATATAGTGAACACC
+TAAGA
CGGGGGGCCTTGGATCCAGGGCGATTCAGAGGGCCCCGGTCGGAGCTGTCGGAGATTGAGCGCGCGCGGT
+CCCGG
GATCTCCGACGAGGCCCTGGACCCCCGGGCGGCGAAGCTGCGGCGCGGCGCCCCCTGGAGGCCGCGGGAC
+CCCTG
GCCGGTCCGCGCAGGCGCAGCGGGGTCGCAGGGCGCGGCGGGTTCCAGCGCGGGGATGGCGCTGTCCGCG
+GAGGA
CCGGGCGCTGGTGCGCGCCCTGTGGAAGAAGCTGGGCAGCAACGTCGGCGTCTACACGACAGAGGCCCTG
+GAAAG
GTGCGGCAGGCTGGGCGCCCCCGCCCCCAGGGGCCCTCCCTCCCCAAGCCCCCCGGACGCGCCTCACCCA
+CGTTC
CTCTCGCAGGACCTTCCTGGCTTTCCCCGCCACGAAGACCTACTTCTCCCACCTGGACCTGAGCCCCGGC
+TCCTC
ACAAGTCAGAGCCCACGGCCAGAAGGTGGCGGACGCGCTGAGCCTCGCCGTGGAGCGCCTGGACGACCTA
+CCCCA
CGCGCTGTCCGCGCTGAGCCACCTGCACGCGTGCCAGCTGCGAGTGGACCCGGCCAGCTTCCAGGTGAGC
+GGCTG
CCGTGCTGGGCCCCTGTCCCCGGGAGGGCCCCGGCGGGGTGGGTGCGGGGGGCGTGCGGGGCGGGTGCAG
+GCGAG
TGAGCCTTGAGCGCTCGCCGCAGCTCCTGGGCCACTGCCTGCTGGTAACCCTCGCCCGGCACTACCCCGG
+AGACT
TCAGCCCCGCGCTGCAGGCGTCGCTGGACAAGTTCCTGAGCCACGTTATCTCGGCGCTGGTTTCCGAGTA
+CCGCT
GAACTGTGGGTGGGTGGCCGCGGGATCCCCAGGCGACCTTCCCCGTGTTTGAGTAAAGCCTCTCCCAGGA
+GCAGC
CTTCTTGCCGTGCTCTCTCGAGGTCAGGACGCGAGAGGAAGGCGC
Forward
>seq1
AGCCTAAGCTASTPGHTIIYEAVCLHNDRTTIP
OTher
>seq2 optional comment
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES
<<< Process finished.
================ READY ================
| [reply] [d/l] [select] |
Re: remove part of string (DNA)
by educated_foo (Vicar) on Apr 15, 2011 at 16:31 UTC
|
A general FASTA-processing loop looks like this:
# Always read up to the next ">" at the beginning of a line:
$/ = "\n>";
getc(INPUT); # skip the first ">"
while (<INPUT>) {
# remove "\n>"
chomp;
# first line is the ID, the rest is the sequence
my ($id, $seq) = /^([^\n]*)\n(.*)/s;
# remove whitespace in the sequence
$seq =~ s/\s+//g;
# do something with $id and $seq
...
# output the result in 72-column format
print OUTPUT ">$id\n";
print OUTPUT "$1\n" while $seq =~ /(.{1,72})/g;
}
You'll want to read and write the data as you go instead of storing it in an array, and check for all the primers in one pass. | [reply] [d/l] |
A reply falls below the community's threshold of quality. You may see it by logging in. |
|
|