PerlMonks  

Number of values for each key in hash

by Sofie (Acolyte)
on Feb 29, 2020 at 11:06 UTC (#11113569, perlquestion)

Sofie has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a file containing two columns, Genes and Genetype. I would like to create a hash where the gene is the key and the genotype is the value. I would then like to count and print out how many genes have a specific genotype. So for example:

    CECR1    protein_coding
    IKBKGP1  pseudogene
    ADA      protein_coding

I would like to have this output:

    protein_coding 2
    pseudogene 1

I have come this far:

    #open the textfile GeneType.txt
    open (GENETYPE, "GeneType.txt") or die "Could not open file";
    while (<GENETYPE>){
        ($GeneName, $GeneType)= split (/\t/, $_);
        $GeneHash{$GeneName} = $GeneType; #create hash
        my $scalar = delete $GeneHash{GeneName}; #removes the first line which is a header
        print (each %GeneHash); #printing out the hash just as a check
    }

I am guessing I should use a for or foreach loop to iterate over the hash, but can't understand how. Thanks /beginner

Replies are listed 'Best First'.
Re: Number of values for each key in hash
by haj (Curate) on Feb 29, 2020 at 12:26 UTC
    There are a few gotchas in your code... let me modify it like this:
        use strict;
        use warnings;

        my %GeneCount = ();

        #open the textfile GeneType.txt
        open (GENETYPE, "GeneType.txt") or die "Could not open file: '$!'";
        my $header = <GENETYPE>; # read the header before entering the loop
        while (<GENETYPE>) {
            chomp;
            my ($GeneName, $GeneType)= split (/\t/, $_);
            $GeneCount{$GeneType}++;
        }
        for my $type (sort keys %GeneCount) {
            print "$type: $GeneCount{$type}\n";
        }

    So what did I change?

    • I started the program with use strict; and use warnings; which is a good habit and will save a lot of time in the long run. The only downside is that I now have to declare my %GeneCount = () before using it.
    • In the open statement I included the reason why it failed in the error message. There's also the opportunity to use the three-parameter form of open and a lexical file handle, which I let pass, because your code is correct (but slightly out of fashion).
    • Instead of removing the header in every line of the loop, I just read the header before even entering the loop.
    • I added chomp which kills the newline which will otherwise be at the end of every gene type you read.
    • Most important for your logic: I changed the hash so that the types are the keys, and the count are the values.

    You might expect warnings about an uninitialized $GeneCount{pseudogene} the first time a type is seen. As far as I know, Perl exempts ++ (like --, += and .=) on an undefined value from that warning, so there is no need to add no warnings "uninitialized" before entering the loop.
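
On the question of warnings from the counting line, a small check can be run directly. This is only a sketch: the hash contents and the warning-collecting handler are illustrative assumptions, not part of the original script. In the Perl versions I know of, ++ on a missing hash element stays silent, while an ordinary addition on a missing element does warn.

```perl
use strict;
use warnings;

# Collect warnings in an array instead of printing them to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my %GeneCount;

# The key does not exist yet; ++ treats the undefined value as 0 without warning.
$GeneCount{pseudogene}++;
my $after_increment = scalar @warnings;

# For contrast: plain addition on a missing element does trigger the warning.
my $sum = $GeneCount{protein_coding} + 1;
my $after_addition = scalar @warnings;

print "count: $GeneCount{pseudogene}\n";
print "warnings after ++: $after_increment\n";
print "warnings after +:  $after_addition\n";
```

So the counting idiom $GeneCount{$GeneType}++ should be safe to use under use warnings as written.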

    And that's it. The rest is just typing out the collected values.

    If you are a beginner in Perl, you might also check out the books at https://learn.perl.org/books/: they are fun to read.

      Possible additional tweaks:

      • There is no need to initialize %GeneCount to (). That is the value it takes on anyway when declared.
      • You may want to cultivate the habit of using lexical variables as file handles (i.e. open my $genetype, ... or die ...). Bareword file handles are global.
      • You may want to cultivate the habit of using three-argument opens (i.e. open my $genetype, '<', 'GeneType.txt' or die ...). This form also lets you specify things like the file encoding as part of the mode.
      • Purely as a style thing, built-ins like open() and split() do not need parentheses, except for precedence. Whoever wrote perlopentut uses parentheses because that author also chose to use the tightly-binding '||' operator rather than the loosely-binding or operator for error checking.

      None of these are required to make the presented script work.
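
Put together, the tweaks above might look like the following sketch. The temporary file and its contents are assumptions added only so the example runs on its own; in the real script the file name would simply be GeneType.txt. The UTF-8 layer is likewise just an illustration of where an encoding would go.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data standing in for GeneType.txt (tab-separated, with a header).
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print {$tmp_fh} "GeneName\tGeneType\n",
                "CECR1\tprotein_coding\n",
                "IKBKGP1\tpseudogene\n",
                "ADA\tprotein_coding\n";
close $tmp_fh or die "Could not close '$filename': $!";

my %GeneCount;    # no "= ()" needed, a freshly declared hash is empty

# Three-argument open with a lexical file handle; the mode carries the encoding layer.
open my $genetype, '<:encoding(UTF-8)', $filename
    or die "Could not open '$filename': $!";

my $header = <$genetype>;    # read the header once, before the loop
while (my $line = <$genetype>) {
    chomp $line;
    my ($GeneName, $GeneType) = split /\t/, $line;
    $GeneCount{$GeneType}++;
}
close $genetype;

for my $type (sort keys %GeneCount) {
    print "$type: $GeneCount{$type}\n";
}
```

With the three sample rows this prints protein_coding: 2 and pseudogene: 1.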

      That works perfectly, thanks!
Re: Number of values for each key in hash
by hippo (Chancellor) on Feb 29, 2020 at 11:51 UTC
Re: Number of values for each key in hash
by bliako (Prior) on Feb 29, 2020 at 12:55 UTC

    The link provided by hippo seems a good place to start and the correspondence to your case can be deduced by:

        ($GeneName, $GeneType)= split (/\t/, $_); # your program
        my ($ip, $size) = split /:/;              # the other program

    Once you practice building and searching the hash, consider this:

    • In a hash the most efficient search is by its keys. If you need to search by value, consider redesigning your hash to use the values as keys (of course this is not always possible, because the keys of a hash must be unique).
    • In a situation where a key can be associated with multiple values (and must absolutely be used as a key, i.e. the hash can't be redesigned), we can use an array to hold all the values. Like: $hash{akey} = ['v1', 'v2', 'v3']; or even another hash like $hash{akey} = {'k1'=>['v1k1','v2k1'], 'k2' => ['v1k2','v2k2']};. There is also the possibility of arrays-of-hashes, arrays-of-arrays etc. With nested data structures the possibilities are endless.

    I mentioned this because in your case I think the key should be the genotype (and not the genename) and the value should be an array of gene names.
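
A minimal sketch of that layout, with the gene type as key and an array of gene names as value. The @rows list and the name %GenesByType are hypothetical stand-ins for lines already read and split from the file.

```perl
use strict;
use warnings;

# Hypothetical rows, already split into [gene name, gene type] pairs.
my @rows = (
    [ 'CECR1',   'protein_coding' ],
    [ 'IKBKGP1', 'pseudogene' ],
    [ 'ADA',     'protein_coding' ],
);

my %GenesByType;    # gene type => reference to an array of gene names
for my $row (@rows) {
    my ($GeneName, $GeneType) = @$row;
    # push creates the inner array automatically the first time a type is seen
    push @{ $GenesByType{$GeneType} }, $GeneName;
}

for my $type (sort keys %GenesByType) {
    my @names = @{ $GenesByType{$type} };
    printf "%s %d (%s)\n", $type, scalar @names, join ', ', @names;
}
```

The count per type is then just the length of each array, so the same structure answers both "how many" and "which ones".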

    Also, my $scalar = delete $GeneHash{GeneName}; does not touch the gene you just added: the bareword GeneName inside the braces is the literal string 'GeneName', so it deletes the header entry (assuming the first column's header is exactly GeneName), and it does so again on every pass through the loop. You probably wanted to skip the first line of the file. Do that just once, before the loop, like open (GENETYPE, "GeneType.txt") or die "Could not open file"; my $header = <GENETYPE>; which reads the first line of the file and saves it in that variable.
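
To see what that delete statement actually touches, here is a small sketch. The hash contents are made up for illustration, and it assumes the header's first column is literally GeneName. A bareword inside the hash braces is treated as a quoted string, which is different from the variable $GeneName.

```perl
use strict;
use warnings;

my %GeneHash = (
    GeneName => 'GeneType',          # the header row, stored like any other line
    CECR1    => 'protein_coding',
);

my $GeneName = 'CECR1';              # the gene most recently read in the loop

# Deletes the entry whose key is the string 'GeneName', not the key held in $GeneName.
my $removed = delete $GeneHash{GeneName};

my @remaining = sort keys %GeneHash;
print "removed value: $removed\n";
print "remaining keys: @remaining\n";
print "current gene still present\n" if exists $GeneHash{$GeneName};
```

Reading the header once before the loop avoids the repeated delete altogether.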

    Finally, shouldn't print (each %GeneHash); be outside the file-reading-hash-generation loop? And this would do just fine: while( my ($k,$v) = each %GeneHash ){ print "$k=>$v\n"; }

    Having said all this, and once you have worked through it to get some experience, I want to mention BioPerl, which is designed for bioinformatics and handles tasks like yours well. Here is something relevant: https://bioperl.org/howtos/Beginners_HOWTO.html#item19 . But even with BioPerl you will need to know your hashes.

    bw, bliako

Re: Number of values for each key in hash
by BillKSmith (Prior) on Feb 29, 2020 at 21:20 UTC
    Use grep:
    my $count = grep /$specific/, values(%GeneHash);
    Bill
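
A runnable sketch of the grep approach, using made-up hash contents. It swaps the regular expression for an exact string comparison with eq, which is my own substitution: a pattern such as /pseudogene/ would also count any longer type name that merely contains that text.

```perl
use strict;
use warnings;

# Hypothetical contents of the original gene => type hash.
my %GeneHash = (
    CECR1   => 'protein_coding',
    IKBKGP1 => 'pseudogene',
    ADA     => 'protein_coding',
);

my $specific = 'protein_coding';

# In scalar context grep returns the number of elements that matched.
my $count = grep { $_ eq $specific } values %GeneHash;

print "$specific $count\n";
```

This scans every value for each type you ask about, so for a full table of counts the counting hash shown earlier in the thread is the better fit.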