http://www.perlmonks.org?node_id=969163


in reply to Saving different values for the same key by using Hash of Arrays

Hi!

If you posted a (truncated) sample of your input file, along with the output you expect from that sample, we would be better able to help you...

Re^2: Saving different values for the same key by using Hash of Arrays
by beginner27 (Initiate) on May 06, 2012 at 19:39 UTC
    >ENSG00000010072
    MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
    LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVEVYHTFHDEVDEYR
    RHWWRCNGPCQHRPPYYGYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGK
    GKAKLGKEPVLAAENKGTFVYILLIFM*
    >ENSG00000067082
    Sequence unavailable
    >ENSG00000147724
    MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA
    CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS
    EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
    >ENSG00000010072
    MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
    LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV
    TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY
    GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD
    KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI
    KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP
    KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF
    FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS
    IKVKSEESL*

    This is the input file, but without the spaces between the sequences.

    The output should have the same structure. For each ID I need to print the longest sequence (each ID can have from one up to 60 different sequences). I already wrote the code that selects the longest one, and it works. I am stuck on the previous part, where I store the sequences (of the same ID) in the array. I think there is a problem in the way I collect the sequences in the array, because I checked the data and they are not correct...

      It would help if you wrapped your sample data in <code> or <pre> tags, so we can see where lines actually break.

      Update:

      Thanks for making your data easier to read. I'm still not sure whether by "spaces between the sequences" you mean the sequences are really on a single line, instead of broken into multiple lines as you have them here, but it doesn't matter for my solution.

      Instead of reading the file line by line, trying to determine which lines are IDs and which are sequences, and (possibly) concatenating the pieces of each sequence, I think it's much simpler to change the input record separator ($/) from the default newline to what's actually separating your records: the > character. Then you've got a pretty standard key-value layout, making it easy to break each record into its two parts and take out anything that shouldn't be in the second part (like newlines).

      And as Kenosis pointed out, if you only want the longest sequence for each ID, there's no need to build a hash of arrays and find the longest ones later. Just compare lengths as you go, and replace the stored sequence when you find a longer one. Like so:

      #!/usr/bin/env perl
      use Modern::Perl;

      my %seqs;
      $/ = '>';                  # break lines on this instead of newline
      while(my $line = <DATA>){
        chomp $line;             # remove any trailing >
        next unless $line;       # skip leading blank record before first >
        my($id, $seq) = split /\s+/, $line, 2;
        $seq =~ s/[\r\n]//g;     # strip newlines and/or carriage returns from sequence
        unless($seqs{$id} and length($seqs{$id}) > length($seq)){
          $seqs{$id} = $seq;     # save it if it's a new ID or a longer sequence
        }
      }
      say ">$_ $seqs{$_}" for keys %seqs;
      __DATA__
      >ENSG00000010072
      MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
      LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVEVYHTFHDEVDEYR
      RHWWRCNGPCQHRPPYYGYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGK
      GKAKLGKEPVLAAENKGTFVYILLIFM*
      >ENSG00000067082
      Sequence unavailable
      >ENSG00000147724
      MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA
      CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS
      EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
      >ENSG00000010072
      MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
      LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV
      TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY
      GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD
      KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI
      KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP
      KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF
      FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS
      IKVKSEESL*

      Aaron B.
      My Woefully Neglected Blog, where I occasionally mention Perl.
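
For anyone who does want to keep every sequence for an ID (the hash of arrays from the original question), here is a minimal sketch of how that storage step might look. It is not code from this thread: the subroutine names read_records and longest, and the tiny sample data, are made up for illustration, and it assumes the same '>'-separated layout discussed above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';

# Parse '>'-separated records into a hash of arrays: ID => [ seq, seq, ... ].
# (read_records is a hypothetical helper name, not from the thread.)
sub read_records {
    my ($fh) = @_;
    local $/ = '>';                     # one record per '>' instead of per line
    my %seqs;
    while (my $record = <$fh>) {
        chomp $record;                  # removes a trailing '>' (the current $/)
        next unless $record =~ /\S/;    # skip the empty record before the first '>'
        my ($id, $seq) = split /\s+/, $record, 2;
        $seq //= '';                    # tolerate an ID with no sequence text
        $seq =~ s/[\r\n]//g;            # join wrapped sequence lines
        push @{ $seqs{$id} }, $seq;     # autovivifies the array on first use
    }
    return \%seqs;
}

# Return the longest string in a list (the first one wins on ties).
sub longest {
    my ($best, @rest) = @_;
    for my $s (@rest) {
        $best = $s if length($s) > length($best);
    }
    return $best;
}

# Tiny made-up sample, not the poster's real data.
my $sample = ">ID1\nAAAA\nBB\n>ID2\nCCC\n>ID1\nAAAAAAAA\nBBBB\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $seqs = read_records($fh);
close $fh;

say ">$_\n", longest(@{ $seqs->{$_} }) for sort keys %$seqs;
```

The push onto @{ $seqs{$id} } is the whole trick: Perl creates the inner array the first time an ID is seen, so repeated IDs simply accumulate. If only the longest sequence is ever needed, comparing lengths while reading uses less memory.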