PerlMonks  

grep keys in hash and retrieve values

by AWallBuilder (Beadle)
on Mar 16, 2012 at 09:51 UTC (#959948=perlquestion)
AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

I have a hash that contains locus names as the keys and taxids as the values. I also have a list of genomes whose names are partial matches of the locus names, so I want to grep the keys and, if there is a match, retrieve the corresponding value (taxid). At the moment I think my $matching_key is returning the number of keys that match the grep and not the actual hash key. Any help with the code is appreciated.

    open (IN,$tax2locus_file);
    while(<IN>){
        my($taxid,$locus)=split(/\t/,$_);
        $tax2loc{$locus}=$taxid;
    }
    close(IN);
    print "there are\t".scalar(keys %tax2loc)."\tlocus_ids as key in hash\n";

    ############### Now read in sharedTab file with pairwise overlap info
    my $sharedTab_file=$ARGV[0];
    my @columns;
    my $prophageA;
    my $prophageB;
    my $outfile="$sharedTab_file.hostinfo";
    my $hostA;
    my $PFnumA;
    my $hostB;
    my $PFnumB;
    my $regex;
    my $matching_key;
    my $taxidA;
    my $taxidB;

    open (OUT,">$outfile");
    open(IN,$sharedTab_file);
    print OUT "#prophageA\tprophageB\thostA\ttaxidA\thostB\ttaxidB\tjacc\n";
    while(<IN>){
        chomp;
        next if (/^#/); # ignore comments
        @columns=split(/\t/,$_);

        $prophageA=$columns[0];
        ($hostA,$PFnumA)=split(/\./,$prophageA);
        if ($hostA =~ /^NZ/){
            ## for wgs genomes just match first 7 characters as only NZ_XXXX000000 are in tax2locus
            my $hostA=substr $hostA, 0, 7;
        }
        $regex=qr/$hostA/;
        $matching_key=grep { $_ =~ /$regex/ } keys %tax2loc;
        $taxidA=$tax2loc{$matching_key};

        $prophageB=$columns[1];
        ($hostB,$PFnumB)=split(/\./,$prophageB);
        if ($hostB =~ /^NZ/){
            ## for wgs genomes just match first 7 characters as only NZ_XXXX000000 are in tax2locus
            my $hostB=substr $hostB, 0, 7;
        }
        $regex=qr/$hostB/;
        $matching_key=grep { $_ =~ /$regex/ } keys %tax2loc;
        $taxidB=$tax2loc{$matching_key};

        my $jacc=$columns[5];
        print OUT join("\t",$prophageA,$prophageB,$hostA,$taxidA,$hostB,$taxidB,$jacc)."\n";

Re: grep keys in hash and retrieve values
by Anonymous Monk on Mar 16, 2012 at 10:07 UTC

    I think my $matching_key is returning the number of keys that match the grep and not the actual hash key.

    Then you should write a program to check

    $ perl -le " print scalar grep /./, qw/ a b c /; "
    3

    Yup, grep in scalar context returns the number of matches.

    Now in list context it returns a list :)

    $ perl -le " print for grep /./, qw/ a b c /; "
    a
    b
    c

    So to get the first from the list, we use parens

    $ perl -le " ( $foo ) = grep /./, qw/ a b c /; print $foo "
    a
    $ perl -le " print( ( grep /./, qw/ a b c / )[0] ); "
    a

    See Tutorials: Context in Perl: Context tutorial

    And see again Re: help with loop and start making functions, your program is very hard to read :)
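
The three contexts discussed above can be compared side by side in a small script. This is only a sketch: the key names are made up for illustration, and an array is used instead of `keys %tax2loc` so that the order of the matches is predictable.

```perl
use strict;
use warnings;

# Made-up locus names, used only to illustrate context.
my @keys = qw(NC_000913 NC_002695 NZ_ABCD);

# Scalar assignment: grep runs in scalar context and returns a count.
my $count = grep { /^NC_/ } @keys;

# Parenthesised left-hand side: list assignment, the first match is kept.
my ($first) = grep { /^NC_/ } @keys;

# Array assignment: every match is kept.
my @all = grep { /^NC_/ } @keys;

print "count=$count first=$first all=@all\n";
# prints: count=2 first=NC_000913 all=NC_000913 NC_002695
```

With `keys %hash` as the input, the "first" match is whichever matching key the hash happens to return first, so it is only well defined when exactly one key matches.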

      Thank you. Yes, the context tutorial will help. Now I can retrieve the value, but only for exact matches of the keys. I want a match whenever the key contains $hostB. I also tried $regex=qr/$hostB*/ and $regex=qr/^$hostB/. I also read through a few regular expression tutorials and am even more confused.
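
An unanchored /$hostB/ already matches anywhere inside a key, so "only exact matches" suggests the pattern is longer than expected. One possible cause, judging only from the code in the question, is that `my $hostB=substr $hostB, 0, 7;` inside the `if` block declares a second, block-scoped variable and leaves the outer $hostB at full length. Note also that `$hostB*` is not a wildcard: the `*` only lets the last character repeat zero or more times. The sketch below uses hypothetical accession strings to show the difference; `\Q...\E` is added so that any regex metacharacters in the name are taken literally.

```perl
use strict;
use warnings;

# Hypothetical data: one locus key and one full WGS accession.
my %tax2loc = (NZ_ABCD00000000 => 12345);
my $host    = 'NZ_ABCD01000042';

# As in the question: "my" inside the block creates a second variable,
# so the outer $host is left untouched.
if ($host =~ /^NZ/) {
    my $host = substr($host, 0, 7);
}
my ($shadowed) = grep { /\Q$host\E/ } keys %tax2loc;

# Without "my" the outer variable itself is shortened to 7 characters.
$host = substr($host, 0, 7) if $host =~ /^NZ/;
my ($fixed) = grep { /^\Q$host\E/ } keys %tax2loc;

print "shadowed: ", $shadowed // 'no match', "\n";
print "fixed:    ", $fixed    // 'no match', "\n";
```

The first lookup finds nothing because the pattern is still the full 15-character name; the second finds the key because only the 7-character prefix is matched.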

Re: grep keys in hash and retrieve values
by moritz (Cardinal) on Mar 16, 2012 at 10:08 UTC

    If you want the list of matches, use an array:

    my @matches = grep /$regex/, keys %tax2loc;

    This puts grep into list context, thus returning a list of all matches (and not just the number of matches).

Re: grep keys in hash and retrieve values
by GrandFather (Cardinal) on Mar 16, 2012 at 10:49 UTC

    Others have addressed your immediate problem, but there is plenty of other help to be provided. Consider the following:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $tax2locus_file;    # path to the taxid/locus table (left unset here)
    my %tax2loc;

    open my $in, '<', $tax2locus_file or die "Can't open $tax2locus_file: $!\n";
    while (<$in>) {
        chomp;
        my ($taxid, $locus) = split /\t/;
        $tax2loc{$locus} = $taxid;
    }
    close ($in);
    print "there are\t" . keys (%tax2loc) . "\tlocus_ids as key in hash\n";

    ############### Now read in sharedTab file with pairwise overlap info
    my $sharedTab_file = $ARGV[0];
    my $outfile = "$sharedTab_file.hostinfo";

    open my $out, '>', $outfile or die "Can't create $outfile: $!\n";
    open $in, '<', $sharedTab_file or die "Can't open $sharedTab_file: $!\n";
    print $out "#prophageA\tprophageB\thostA\ttaxidA\thostB\ttaxidB\tjacc\n";

    while (<$in>) {
        chomp;
        next if (/^#/);    # ignore comments
        my @columns = split (/\t/, $_);
        my ($prophageA, $hostA, $taxidA) = getTaxId($columns[0]);
        my ($prophageB, $hostB, $taxidB) = getTaxId($columns[1]);
        print $out join ("\t",
            $prophageA, $prophageB, $hostA, $taxidA, $hostB, $taxidB, $columns[5]), "\n";
    }

    sub getTaxId {
        my ($prophage, $lu) = @_;
        my ($host, $PFnum) = split /\./, $prophage;

        ## for wgs genomes just match first 7 characters as only NZ_XXXX000000 are
        ## in tax2locus
        $host =~ s/^(NZ.{5}).*/$1/;
        my @matches = grep {$_ =~ /$host/} keys %$lu;
        die "Expected exactly one match for $host. Got " . scalar @matches . "\n"
            if @matches != 1;
        return $prophage, $host, $matches[0];
    }

    Note that the code is completely untested so may suffer from typos and egregious errors of all sorts, however points to note are:

    • use three parameter open with lexical file handles and check the result
    • declare variables at their first point of use so their life time and scope are clear
    • use indentation to make flow control and other code structures clear
    • avoid duplication of code
    • check that assumptions made by the code are correct

    Note that this code doesn't check to ensure the input data are correctly formatted as I'm not entirely sure what the format ought to be, but "production" code would ensure that sensible values were passed into getTaxId for $prophage for example.

    True laziness is hard work

      Thank you, this is great. I thought of using a subroutine but was getting it wrong. But your script isn't working. To me it looks as if you are only passing $prophage to the subroutine, but you must also pass the hash? I tried editing it as follows, but I am receiving an error about passing a string to a hash reference.

       my ($prophageB, $hostB, $taxidB) = getTaxId($columns[1],%tax2loc);

        Err, I did say it was untested didn't I?

        There are two errors related to the hash. The second one is that the return statement needs to be changed to:

        return $prophage, $host, $lu->{$matches[0]};

        to return the value instead of the key.

        The first problem is as you noticed, the hash needs to be passed to the sub, but it needs to be passed by reference because of the way the code in the sub works:

        my ($prophageB, $hostB, $taxidB) = getTaxId($columns[1], \%tax2loc);

        The pass by reference is an optimisation to save passing all the keys and values of the hash in a list which is what would happen otherwise. Note that the point of passing the hash at all is to avoid treating it as a global variable which is generally a "bad thing"™ (although this is such a small program it's not an issue except as a style thing).

        True laziness is hard work
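
The difference between passing the hash and passing a reference to it can be seen by counting the arguments a sub receives. The following is a minimal sketch rather than part of the script above: `count_args` and `lookup` are hypothetical helper names, and the single locus/taxid pair is illustrative data.

```perl
use strict;
use warnings;

my %tax2loc = (NC_000913 => 511145);    # illustrative locus => taxid pair

# Hypothetical helper: reports how many arguments arrived.
sub count_args { return scalar @_ }

# Hypothetical helper: takes a hash reference and dereferences it.
sub lookup {
    my ($locus, $lu) = @_;
    return $lu->{$locus};
}

my $flat  = count_args('NC_000913', %tax2loc);     # hash flattened: 1 + key + value
my $byref = count_args('NC_000913', \%tax2loc);    # 1 + a single reference
my $taxid = lookup('NC_000913', \%tax2loc);

print "flattened: $flat args, by reference: $byref args, taxid: $taxid\n";
# prints: flattened: 3 args, by reference: 2 args, taxid: 511145
```

When the hash is passed without the backslash, the second parameter of the sub receives the first key as a plain string, and dereferencing that string under `use strict` dies with a "Can't use string ... as a HASH ref" error, which matches the message reported above.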
Re: grep keys in hash and retrieve values
by tobyink (Abbot) on Mar 16, 2012 at 11:41 UTC

    Yes, this:

    $matching_key=grep { $_ =~ /$regex/ } keys %tax2loc;

    ... returns the number of matching keys. If you want to get the actual text of the key, you need to call grep in list context:

    @all_matching_keys = grep { $_ =~ /$regex/ } keys %tax2loc;
    $matching_key = $all_matching_keys[0];

    If you only care about the first match, then you can simply parenthesise $matching_key like this:

    ($matching_key) = grep { $_ =~ /$regex/ } keys %tax2loc;

    The parentheses force list context on the assignment, and thus on grep.

    That said, the above code is wasteful. If the very first key happens to match the regexp, then Perl will still waste time checking every key in %tax2loc. Using first from List::Util will probably be more efficient.

    $matching_key = first { $_ =~ /$regex/ } keys %tax2loc;

    Also note that "$_ =~" is totally superfluous here. You could just do:

    $matching_key = first { /$regex/ } keys %tax2loc;
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
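
For completeness, `first` is not a built-in and has to be imported from List::Util before the snippets above will compile. A minimal sketch follows, assuming a two-entry hash and a pattern that are made up for illustration:

```perl
use strict;
use warnings;
use List::Util qw(first);

# Illustrative data; only one key matches the pattern below.
my %tax2loc = (NC_000913 => 511145, NC_002695 => 386585);
my $regex   = qr/^NC_0009/;

# first stops at the first key for which the block is true,
# and returns undef when nothing matches.
my $matching_key = first { /$regex/ } keys %tax2loc;
my $taxid = defined $matching_key ? $tax2loc{$matching_key} : undef;

print defined $taxid ? "$matching_key => $taxid\n" : "no match\n";
# prints: NC_000913 => 511145
```

Checking for undef before the hash lookup avoids an "uninitialized value" warning when no key matches. If several keys could match, which one `first` returns depends on the hash's internal ordering.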
