>ENSG00000010072
MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVEVYHTFHDEVDEYR
RHWWRCNGPCQHRPPYYGYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGK
GKAKLGKEPVLAAENKGTFVYILLIFM*
>ENSG00000067082
Sequence unavailable
>ENSG00000147724
MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA
CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS
EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
>ENSG00000010072
MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV
TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY
GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD
KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI
KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP
KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF
FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS
IKVKSEESL*
This is the input file, but without the spaces between the sequences.
The output should have the same structure. For each ID I need to print the longest sequence (each ID can have from one to 60 different sequences). I already wrote the code that selects the longest one, and it works. I am stuck on the previous part, where I store the sequences (of the same ID) in the array. I think there is a problem in the way I collect the sequences in the array, because I checked the data and they are not correct...
It would help if you wrapped your sample data in <code> or <pre> tags, so we can see where lines actually break.
Update:
Thanks for making your data easier to read. I'm still not sure whether by "spaces between the sequences" you mean the sequences are really on a single line, instead of broken into multiple lines as you have them here, but it doesn't matter for my solution.
Instead of reading the file line by line, trying to determine which lines are IDs and which are sequences, and (possibly) concatenating the sections of sequences together, I think it's much simpler to change the input record separator from the default newline to what's actually separating your records: the > character. Then you've got a pretty standard key-value layout, making it easy to break each record into its two parts and take out anything that shouldn't be in the second part (like newlines).
And as Kenosis pointed out, if you only want the longest sequence for each ID, there's no need to build a hash of arrays and find the longest ones later. Just compare lengths as you go, and replace the stored sequence when you find a longer one. Like so:
#!/usr/bin/env perl
use Modern::Perl;
my %seqs;
$/ = '>';                      # break lines on this instead of newline
while(my $line = <DATA>){
    chomp $line;               # remove any trailing >
    next unless $line;         # skip leading blank record before first >
    my($id, $seq) = split /\s+/, $line, 2;
    $seq =~ s/[\r\n]//g;       # strip newlines and/or carriage returns from sequence
    unless($seqs{$id} and length($seqs{$id}) > length($seq)){
        $seqs{$id} = $seq;     # save it if it's a new ID or a longer sequence
    }
}
say ">$_ $seqs{$_}" for keys %seqs;
__DATA__
>ENSG00000010072
MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVEVYHTFHDEVDEYR
RHWWRCNGPCQHRPPYYGYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGK
GKAKLGKEPVLAAENKGTFVYILLIFM*
>ENSG00000067082
Sequence unavailable
>ENSG00000147724
MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA
CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS
EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
>ENSG00000010072
MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV
TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY
GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD
KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI
KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP
KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF
FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS
IKVKSEESL*
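
If you do want to keep every sequence for each ID (the hash-of-arrays approach described in the question), a minimal sketch might look like the following. The sub names collect_sequences and longest and the tiny sample string are hypothetical, made up for illustration; the sketch also assumes that a > only ever appears at the start of a header line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA-like text into a hash of arrays,
# with one array of sequences per ID.
sub collect_sequences {
    my ($text) = @_;
    my %by_id;
    # Split into records wherever a line starts with '>'
    for my $record (split /^>/m, $text) {
        next unless $record =~ /\S/;          # skip the empty piece before the first '>'
        my ($id, $seq) = split /\s+/, $record, 2;
        $seq = '' unless defined $seq;        # an ID with nothing after it
        $seq =~ s/[\r\n]//g;                  # join the wrapped sequence lines
        push @{ $by_id{$id} }, $seq;          # autovivifies the array on first use
    }
    return \%by_id;
}

# Hypothetical helper: return the longest string in an array reference.
sub longest {
    my ($seqs) = @_;
    my $best = '';
    for my $s (@$seqs) {
        $best = $s if length($s) > length($best);
    }
    return $best;
}

# Made-up sample input: ID "A" appears twice, ID "B" once.
my $sample = ">A\nMDD\nLEA\n>B\nMSE\n>A\nMDDDLM\nLEAVEV\nKW\n";
my $by_id  = collect_sequences($sample);
print ">$_\n", longest($by_id->{$_}), "\n" for sort keys %$by_id;
```

The key line is the push onto @{ $by_id{$id} }: each ID gets its own array, so sequences from different IDs cannot end up mixed together. This uses more memory than comparing lengths as you go, so it is probably only worth it if you need the shorter sequences later.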