PerlMonks  

Saving different values for the same key by using Hash of Arrays

by beginner27 (Initiate)
on May 06, 2012 at 17:34 UTC (#969160, perlquestion)
beginner27 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am working with a FASTA file that contains multiple IDs (the same ID can be repeated), each with a corresponding sequence. My idea was to create a hash with the IDs as keys, so that I get a list without repetitions, and with an array as each value, containing the sequence or sequences belonging to that ID. Each sequence needs to be one element of the array. My script works as long as it stores the first and second sequences in the array; from the third on, the data are no longer correct...

I hope I have been clear enough! I tried everything but nothing works, and ideally the solution would not use any modules.

Thanks

my $tot = 0;
my $current_ID = 1;
while ( my $line = <IN> ){
    chomp( $line );
    if ( $line =~ /^>/ ) {
        $title= $line;
        $tot++;
    }else{
        $seq.= $line;
        shift @{$hash{$title}};
        push (@{$hash{$title}}, $seq);
    }
    if ($tot > $current_ID) {
        $current_ID ++;
        push (@{$hash{$title}}, $seq);
        $seq = "";
    }
}
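For comparison, here is a minimal sketch of one way the hash of arrays described above could be built without any modules. It is not code from the thread: the helper name parse_fasta and the sample data are made up for illustration. The idea is to start a new, empty array element each time a header line is seen and to append every following sequence line to that last element. Note that the shift in the loop above removes the first element of the array every time a sequence line is read, which is one likely reason earlier sequences go missing.

```perl
use strict;
use warnings;

# Build a hash of arrays: ID => [ seq1, seq2, ... ].
# parse_fasta() is a hypothetical helper name, not something from the thread.
sub parse_fasta {
    my ($fh) = @_;
    my ( %seqs_for, $id );
    while ( my $line = <$fh> ) {
        chomp $line;
        $line =~ s/\r\z//;                    # tolerate Windows line endings
        next if $line eq '';
        if ( $line =~ /^>(\S+)/ ) {
            $id = $1;
            push @{ $seqs_for{$id} }, '';     # start a new, empty sequence for this ID
        }
        elsif ( defined $id ) {
            $seqs_for{$id}[-1] .= $line;      # append to the most recent sequence
        }
    }
    return \%seqs_for;
}

# Usage with an in-memory sample (short, made-up sequences)
my $sample = ">ID1\nAAAA\nCC\n>ID2\nGG\n>ID1\nTTTTTTTT\n";
open( my $fh, '<', \$sample ) or die "Cannot open in-memory file: $!";
my $seqs_for = parse_fasta($fh);
close $fh;

for my $id ( sort keys %$seqs_for ) {
    printf "%s: %d sequence(s)\n", $id, scalar @{ $seqs_for->{$id} };
}
```

Because each header pushes a fresh element, a repeated ID simply grows its array, and multi-line sequences end up joined into a single string per element.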

Re: Saving different values for the same key by using Hash of Arrays
by Not_a_Number (Parson) on May 06, 2012 at 19:03 UTC

    Hi!

    Maybe if you gave some (truncated) stuff from your input file, and what you would expect as output from said data, we would be more able to help you...

      >ENSG00000010072
      MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
      LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVEVYHTFHDEVDEYR
      RHWWRCNGPCQHRPPYYGYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGK
      GKAKLGKEPVLAAENKGTFVYILLIFM*
      >ENSG00000067082
      Sequence unavailable
      >ENSG00000147724
      MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA
      CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS
      EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
      >ENSG00000010072
      MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
      LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV
      TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY
      GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD
      KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI
      KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP
      KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF
      FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS
      IKVKSEESL*

      This is the input file, but without the spaces between the sequences.

      The output should have the same structure. For each ID I need to print the longest sequence (each ID can have from one up to 60 different sequences). I have already written the code that selects the longest one, and it works. I am stuck on the earlier part, where I store the sequences for the same ID in the array. I think the problem is in the way I collect the sequences in the array, because I checked the data and they are not correct...

        It would help if you wrapped your sample data in <code> or <pre> tags, so we can see where lines actually break.

        Update:

        Thanks for making your data easier to read. I'm still not sure whether by "spaces between the sequences" you mean that each sequence is really on a single line, rather than broken into multiple lines as you have them here, but it doesn't matter for my solution.

        Instead of reading the file line by line, trying to determine which lines are IDs and which are sequences, and (possibly) concatenating the sections of sequences together, I think it's much simpler to change the input record separator from the default newline to what's actually separating your records: the > character. Then you've got a pretty standard key-value layout, which makes it easy to break each record into its two parts and take out anything that shouldn't be in the second part (like newlines).

        And as Kenosis pointed out, if you only want the longest sequence for each ID, there's no need to build a hash of arrays and find the longest ones later. Just compare lengths as you go, and replace the stored sequence when you find a longer one. Like so:

        #!/usr/bin/env perl
        use Modern::Perl;

        my %seqs;
        $/ = '>';                     # break lines on this instead of newline
        while(my $line = <DATA>){
            chomp $line;              # remove any trailing >
            next unless $line;        # skip leading blank record before first >
            my($id, $seq) = split /\s+/, $line, 2;
            $seq =~ s/[\r\n]//g;      # strip newlines and/or carriage returns from sequence
            unless($seqs{$id} and length($seqs{$id}) > length($seq)){
                $seqs{$id} = $seq;    # save it if it's a new ID or a longer sequence
            }
        }
        say ">$_ $seqs{$_}" for keys %seqs;
        __DATA__
        >ENSG00000010072
        MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
        LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVEVYHTFHDEVDEYR
        RHWWRCNGPCQHRPPYYGYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGK
        GKAKLGKEPVLAAENKGTFVYILLIFM*
        >ENSG00000067082
        Sequence unavailable
        >ENSG00000147724
        MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA
        CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS
        EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
        >ENSG00000010072
        MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
        LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV
        TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY
        GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD
        KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI
        KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP
        KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF
        FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS
        IKVKSEESL*

        Aaron B.
        My Woefully Neglected Blog, where I occasionally mention Perl.

Re: Saving different values for the same key by using Hash of Arrays
by Anonymous Monk on May 06, 2012 at 19:26 UTC
    Use Data::Dumper to print out what your final data structure actually contains.
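A minimal sketch of what that could look like, assuming the data ends up in a hash of arrays named %hash as in the original post. The keys and sequences below are made up; Data::Dumper ships with Perl, so nothing extra needs to be installed.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # stable key order makes dumps easier to compare
$Data::Dumper::Indent   = 1;    # more compact indentation

# Made-up hash of arrays standing in for the real data structure
my %hash = (
    '>ID1' => [ 'AAAACC', 'TTTTTTTT' ],
    '>ID2' => ['GG'],
);

# Pass a reference so the whole hash is dumped as one nested structure
my $dump = Dumper( \%hash );
print $dump;
```

Dumping the structure right after the read loop shows at a glance whether an ID holds one array element per sequence or something else, such as duplicated or concatenated sequences.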
Re: Saving different values for the same key by using Hash of Arrays
by Kenosis (Priest) on May 06, 2012 at 21:10 UTC

    You've chosen an effective use for a hash, but if you only need to find the longest (or longer or only) sequence for an ID that has one or more sequences, consider the following solution that doesn't use an array:

    use strict;
    use warnings;

    my %FASTAhash;

    open my $file, '<FASTA.txt' or die $!;
    while (<$file>) {
        next if !/(>[^ ]+) /;
        chomp( $FASTAhash{$1} = $' )
          if !$FASTAhash{$1}
          or length $' > length $FASTAhash{$1};
    }
    close $file;

    print "$_ $FASTAhash{$_}\n" for keys %FASTAhash;

    The regex matches the ID, which is placed into $1, leaving the remaining (unmatched) sequence in $'. The hash item whose key is the ID in $1 is assigned the sequence in $' (and then chomped) if that item is not yet set or if the length of $' is greater than the length of what's already there. When done, each ID is paired with its longest sequence. (Is it possible for two sequences of the same ID to be the same length? If so, do you need to code for that?)
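As an illustration of the same keep-the-longest idea, here is a sketch that uses two capture groups instead of $1 plus $'. This is a swapped-in variation, not the code above: the helper name longest_by_id and the sample lines are hypothetical, and it assumes one "ID sequence" record per line. One reason to consider it is that on perls older than 5.20, mentioning $' anywhere in a program slows down every regex match. On the tie question, this version only replaces a stored sequence when the new one is strictly longer, so the first of two equal-length sequences is kept.

```perl
use strict;
use warnings;

# longest_by_id() is a hypothetical helper; it expects one "ID sequence" record per line.
sub longest_by_id {
    my (@lines) = @_;
    my %longest;
    for my $line (@lines) {
        # Two captures: the ID (with its leading '>') and the rest of the line
        my ( $id, $seq ) = $line =~ /^(>\S+)\s+(.*)/
            or next;    # skip lines that do not look like a record

        # Store new IDs; replace only when strictly longer (ties keep the first one seen)
        if ( !exists $longest{$id} or length($seq) > length( $longest{$id} ) ) {
            $longest{$id} = $seq;
        }
    }
    return %longest;
}

# Made-up sample records
my @lines = (
    '>ID1 AAAA',
    '>ID2 GG',
    '>ID1 TTTTTTTT',
    '>ID1 CCCCCCCC',    # same length as the previous ID1 sequence
    'not a record',
);

my %longest = longest_by_id(@lines);
print "$_ $longest{$_}\n" for sort keys %longest;
```

If ties should be handled differently, for example by keeping every sequence of the maximum length, the value would have to become an array again.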

    Output from processing your data:

    >ENSG00000147724 MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
    >ENSG00000010072 MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS IKVKSEESL*
    >ENSG00000067082 Sequence unavailable

    Hope this helps!

      Thanks a lot for your quick and detailed answer, but the script doesn't actually print anything for me! How is this possible?

        It's possible because my regex didn't work with your reformatted FASTA records. :) aaron_baugher's suggestion to repost your records using <code> or <pre> was spot on, and helped with crafting the following new-and-improved solution--after your re-posting:

        use strict;
        use warnings;

        my %FASTAhash;

        {
            local $/ = '>';
            open my $file, '<FASTA.txt' or die $!;
            while (<$file>) {
                next if !/(.*?)\n/;
                chomp( $FASTAhash{$1} = $' )
                  if !$FASTAhash{$1}
                  or length $' > length $FASTAhash{$1};
            }
        }

        print ">$_\n$FASTAhash{$_}" for keys %FASTAhash;

        Within a block, we start by letting perl know that '>' is the new record separator, instead of the default "\n" (so we read the file one FASTA record at a time, instead of one line at a time), and then tweak the regex a bit to grab the ID.

        You'll note that we don't call close $file; when we're done, since the file is automatically closed when my $file falls out of scope (when the block ends).
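To see the scoping behaviour in isolation, here is a small sketch that reads made-up FASTA text from an in-memory filehandle instead of FASTA.txt (an assumption made so the example runs without a file on disk). It shows that local $/ only changes the record separator inside the block, and that the lexical filehandle needs no explicit close.

```perl
use strict;
use warnings;

my $fasta = ">ID1\nAAAA\nCC\n>ID2\nGG\n";    # made-up sample data
my @records;

{
    local $/ = '>';    # read up to each '>' for the duration of this block only
    open( my $fh, '<', \$fasta ) or die "Cannot open in-memory file: $!";
    while ( my $record = <$fh> ) {
        chomp $record;             # with $/ set to '>', chomp strips a trailing '>'
        next if $record eq '';     # the text before the first '>' is empty
        push @records, $record;
    }
}    # $fh goes out of scope here and is closed; $/ reverts to its previous value

print scalar(@records), " records read\n";
print "record separator is back to newline\n" if $/ eq "\n";
```

Each element of @records holds the ID line followed by its sequence lines, newlines included, which is why the regex in the script above splits on the first "\n".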

        Here's the output:

        >ENSG00000147724
        MSEIQGTVEFSVELHKFYNVDLFQRGYYQIRVTLKVSSRIPHRLSASIAGQTESSSLHSA
        CVHDSTVHSRVFQILYRNEEVPINDAVVFRVHLLLGGERMEDALSEVDFQLKVDLHFTDS
        EQQLRDVAGAPMVSSRTLGLHFHPRNGLHHQVP
        >ENSG00000067082
        Sequence unavailable
        >ENSG00000010072
        MDDDLMLALRLQEEWNLQEAERDHAQESLSLVDASWELVDPTPDLQALFVQFNDQFFWGQ
        LEAVEVKWSVRMTLCAGICSYEGKGGMCSIRLSEPLLKLRPRKDLVETLLHEMIHAYLFV
        TNNDKDREGHGPEFCKHMHRINSLTGANITVYHTFHDEVDEYRRHWWRCNGPCQHRPPYY
        GYVKRATNREPSAHDYWWAEHQKTCGGTYIKIKEPENYSKKGKGKAKLGKEPVLAAENKD
        KPNRGEAQLVIPFSGKGYVLGETSNLPSPGKLITSHAINKTQDLLNQNHSANAVRPNSKI
        KVKFEQNGSSKNSHLVSPAVSNSHQNVLSNYFPRVSFANQKAFRGVNGSPRISVTVGNIP
        KNSVSSSSQRRVSSSKISLRNSSKVTESASVMPSQDVSGSEDTFPNKRPRLEDKTVFDNF
        FIKKEQIKSSGNDPKYSTTTAQNSSSSSSQSKMVNCPVCQNEVLESQINEHLDWCLEGDS
        IKVKSEESL*

        Hope this version's helpful!

        Update: After posting the above, just noticed aaron_baugher's solution using $/ = '>' and I think this makes good sense, since this is the FASTA record delimiter.
