PerlMonks  

Re^3: Saving different values for the same key by using Hash of Arrays

by aaron_baugher (Deacon)
on May 07, 2012 at 01:54 UTC (#969184)


in reply to Re^2: Saving different values for the same key by using Hash of Arrays
in thread Saving different values for the same key by using Hash of Arrays

It would help if you wrapped your sample data in <code> or <pre> tags, so we can see where lines actually break.

Update:

Thanks for making your data easier to read. I'm still not sure whether by "spaces between the sequences" you mean the sequences are really on a single line, rather than broken into multiple lines as you have them here, but it doesn't matter for my solution. Instead of reading the file line by line, trying to work out which lines are IDs and which are sequences, and (possibly) concatenating the pieces of each sequence, I think it's much simpler to change the input record separator ($/) from the default newline to what's actually separating your records: the > character. Then you've got a pretty standard key-value layout, which makes it easy to break each record into its two parts and take out anything that shouldn't be in the second part (like newlines).

And as Kenosis pointed out, if you only want the longest sequence for each ID, there's no need to build a hash of arrays and find the longest ones later. Just compare lengths as you go, and replace the stored sequence when you find a longer one. Like so:

#!/usr/bin/env perl
use Modern::Perl;

my %seqs;
$/ = '>';                        # break lines on this instead of newline
while(my $line = <DATA>){
    chomp $line;                 # remove any trailing >
    next unless $line;           # skip leading blank record before first >
    my($id, $seq) = split /\s+/, $line, 2;
    $seq =~ s/[\r\n]//g;         # strip newlines and/or carriage returns from sequence
    unless($seqs{$id} and length($seqs{$id}) > length($seq)){
        $seqs{$id} = $seq;       # save it if it's a new ID or a longer sequence
    }
}
say ">$_ $seqs{$_}" for keys %seqs;
__DATA__
>ENSG00000010072
MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVEVYHTFHDEVDEYR
RHWWRCNGPCQHRPPYYGYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGK
GKAKLGKEPVLAAENKGTFVYILLIFM*
>ENSG00000067082
Sequence unavailable
>ENSG00000147724
MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA
CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS
EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
>ENSG00000010072
MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV
TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY
GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD
KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI
KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP
KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF
FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS
IKVKSEESL*
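If you want to reuse this on a real file rather than DATA, one possible variation is to wrap the loop in a sub and set the separator with local, so the change to $/ doesn't leak into the rest of your program. This is only a sketch: the sub name longest_by_id and the tiny sample input are made up for illustration, and it uses plain strict and warnings so it runs without Modern::Perl installed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example): read '>'-separated
# records from a filehandle and return a hash ref of ID => longest sequence.
sub longest_by_id {
    my ($fh) = @_;
    local $/ = '>';                  # separator change is undone when the sub returns
    my %longest;
    while (my $record = <$fh>) {
        chomp $record;               # remove the trailing '>'
        next unless $record =~ /\S/; # skip the empty record before the first '>'
        my ($id, $seq) = split ' ', $record, 2;
        $seq = '' unless defined $seq;   # header with nothing after it
        $seq =~ s/[\r\n]+//g;        # join the wrapped sequence lines
        # keep the new sequence for a new ID, or when the stored one is not longer
        $longest{$id} = $seq
            unless exists $longest{$id}
                and length($longest{$id}) > length($seq);
    }
    return \%longest;
}

# Made-up sample data, read through an in-memory filehandle.
my $sample = ">id1\nAAAA\nCC\n>id2\nGG\n>id1\nTTTTTTTTT\n>id1\nA\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my $longest = longest_by_id($fh);
close $fh;

print ">$_\n$longest->{$_}\n" for sort keys %$longest;
```

To run it against a file, open the file instead of the in-memory string and pass that handle in. The sort is only there to make the output order predictable, since keys returns IDs in no particular order.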

Aaron B.
My Woefully Neglected Blog, where I occasionally mention Perl.

