
Wildcard for key in hash lookup to skip over level

by ZWcarp (Beadle)
on Dec 04, 2012 at 17:33 UTC (#1007127, perlquestion)

ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

Update: switched the order of $s[0] and $s[2] in the push.

Hello monks, I am wondering if there is a way to do what I have posted below. There are two files, and each has gene, sample, and position columns. First, I want to count the unique sample/gene pairs in file2 for a given gene in file1. Then I want to check whether a specific position/gene pair from file1 is present in file2. For the position/gene check the samples do not matter, so I was hoping for some sort of wildcard character in the hash lookup that just tells me whether the pair exists or not, something like this:

if (exists($Pos_overlap{$s[5]}{*}{$s[2]})) {

Can I do this with only one hash, or do I need two separate hashes? Thanks for your help!

use strict;
use Data::Dumper;

my %Gene_overlap;
my %Pos_overlap;

open (MYFILE, $ARGV[0]);
my @file2 = <MYFILE>;
close MYFILE;

open (MYFILE, $ARGV[1]);
my @file1 = <MYFILE>;
close MYFILE;

foreach (@file2) {
    chomp;
    # Splitting the Validations file for gene name and amino acid
    my @s = split (/\t/, $_);
    # Pushes all sample/position/gene combos into a hash
    push(@{$Pos_overlap{$s[5]}{$s[0]}{$s[2]}}, $s[2]);
}

foreach (@file1) {
    chomp;
    # Splitting the file to get the sample / position / gene
    my @s = split (/\t/, $_);
    if (exists($Pos_overlap{$s[5]})) {   # Check to see if this gene is also found in file2
        # Prints how many times the exact combination of gene and a unique sample is seen
        # (sample identity across files does not matter, just how many unique ones there are)
        print $_ . "\t" . (keys %{$Pos_overlap{$s[5]}});

        ### This is the part I can't get to work ###
        # Checks if the exact variant/position combination is present in both files
        if (exists($Pos_overlap{$s[5]}{$s[2]})) {
            # Prints how many times the variant was seen; it would also be acceptable
            # to just print a "1", saying that it does exist across both files
            print "\t" . (keys %{$Pos_overlap{$s[5]}{$s[2]}}) . "\n";
        }
        else { print "\t0\n"; }          # prints 0 if no gene/position found
    }
    else { print $_ . "\t0\t0\n"; }      # if no gene overlap
}

File1 structure

P15    1    17085713    C    S     MST1P9

file2 structure

005    1    17085712    C    S    MST1P9
006    1    17085712    C    S    MST1P9
006    1    17085713    C    S    MST1P9
007    1    17085712    C    S    MST1P9
006    1    17085713    C    S    MST1P9
006    1    17085713    C    S    MST1P9

Replies are listed 'Best First'.
Re: Wildcard for key in hash lookup to skip over level
by roboticus (Chancellor) on Dec 04, 2012 at 17:38 UTC


    There aren't any wildcards, but you can get a similar effect as the one you stated like this:

    my $fl = 0;
    for my $k (keys %{ $Pos_overlap{$s[5]} }) {
        # count the samples under this gene that contain the position
        ++$fl if exists $Pos_overlap{$s[5]}{$k}{$s[2]};
    }
    if ($fl) {

    Not as nice as one would like, but it's the best I can offer at this time.
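For readers who want to try the loop in isolation, here is a minimal sketch that wraps the same idea in a helper. The name `samples_with_position` and the hard-coded rows are assumptions made for illustration; the rows reuse values from the file2 example in the question, reduced to sample, position, and gene.

```perl
use strict;
use warnings;

# Same layout as in the question: gene -> sample -> position -> list.
my %Pos_overlap;
for my $row (
    [ '005', 17085712, 'MST1P9' ],
    [ '006', 17085712, 'MST1P9' ],
    [ '006', 17085713, 'MST1P9' ],
    [ '007', 17085712, 'MST1P9' ],
) {
    my ($sample, $pos, $gene) = @$row;
    push @{ $Pos_overlap{$gene}{$sample}{$pos} }, $pos;
}

# Hypothetical helper: number of samples under $gene that contain $pos.
# The middle (sample) level is scanned, which stands in for the wildcard.
sub samples_with_position {
    my ($tree, $gene, $pos) = @_;
    return 0 unless exists $tree->{$gene};   # also avoids autovivifying the gene
    my $by_sample = $tree->{$gene};
    return scalar grep { exists $by_sample->{$_}{$pos} } keys %$by_sample;
}

print samples_with_position(\%Pos_overlap, 'MST1P9', 17085713), "\n";   # prints 1
print samples_with_position(\%Pos_overlap, 'MST1P9', 17085712), "\n";   # prints 3
print samples_with_position(\%Pos_overlap, 'MST1P9', 1), "\n";          # prints 0
```

The early return matters: testing `exists` on a nested key of a gene that is not in the hash would otherwise create an empty entry for that gene.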


    When your only tool is a hammer, all problems look like your thumb.

Re: Wildcard for key in hash lookup to skip over level
by rjt (Deacon) on Dec 04, 2012 at 18:04 UTC


    I spent a little time refactoring your code so I could make better sense of it. I still don't know exactly how the columns map to your text description. That being said, it looks like the original suggestion, while often helpful, might not work in your case if you will be re-using %Pos_overlap for something else.

    If that's the case, you might indeed be better served with a more complex data structure (which could indeed mean two hashes (trees, actually), or sub-trees of the current hash). It will essentially be a memory/performance tradeoff; storing multiple representations takes more memory, but can reduce operations to ~O(1) that would otherwise be O(n).
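As a rough illustration of the two-hash variant, the sketch below fills both indexes in one pass so that each question becomes a direct lookup. The hash names `%samples_by_gene` and `%rows_by_gene_pos` and both helper subs are made up for this example, and the rows are the file2 example from the question reduced to sample, position, and gene.

```perl
use strict;
use warnings;

# Rows as [sample, position, gene], taken from the file2 example in the question.
my @rows = (
    [ '005', 17085712, 'MST1P9' ],
    [ '006', 17085712, 'MST1P9' ],
    [ '006', 17085713, 'MST1P9' ],
    [ '007', 17085712, 'MST1P9' ],
    [ '006', 17085713, 'MST1P9' ],
    [ '006', 17085713, 'MST1P9' ],
);

my (%samples_by_gene, %rows_by_gene_pos);
for my $row (@rows) {
    my ($sample, $pos, $gene) = @$row;
    $samples_by_gene{$gene}{$sample} = 1;   # unique samples per gene
    $rows_by_gene_pos{$gene}{$pos}++;       # rows per gene/position, sample ignored
}

# Hypothetical helper: number of distinct samples recorded for a gene.
sub unique_samples {
    my ($index, $gene) = @_;
    return 0 unless exists $index->{$gene};
    return scalar(keys %{ $index->{$gene} });
}

# Hypothetical helper: number of rows seen for a gene/position pair.
sub position_count {
    my ($index, $gene, $pos) = @_;
    return 0 unless exists $index->{$gene};
    return $index->{$gene}{$pos} // 0;
}

print unique_samples(\%samples_by_gene, 'MST1P9'), "\n";             # prints 3
print position_count(\%rows_by_gene_pos, 'MST1P9', 17085713), "\n";  # prints 3
print position_count(\%rows_by_gene_pos, 'MST1P9', 99), "\n";        # prints 0
```

The second hash costs extra memory, but the position check no longer has to visit every sample under the gene.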

    It's looking like memory is probably not a concern, as you already read and store the complete contents of both files in memory, in addition to the hash. If you convert your @var = (<FILE>) loops to while (<FILE>) { ... } loops, you can save a good deal of memory right now, for free.
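A small sketch of the line-by-line reading style, assuming the column order implied by the question ($s[0] is the sample, $s[2] the position, $s[5] the gene). The helper name `load_overlap` is hypothetical, and an in-memory filehandle stands in for the real input file so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: build the gene -> sample -> position tree from an
# open filehandle, holding only one line in memory at a time.
sub load_overlap {
    my ($fh) = @_;
    my %tree;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        # Columns 2, 4 and 5 are not used here, so they are skipped with undef.
        my ($sample, undef, $pos, undef, undef, $gene) = split /\t/, $line;
        push @{ $tree{$gene}{$sample}{$pos} }, $pos;
    }
    return \%tree;
}

# In-memory stand-in for file2; a real script would open $ARGV[0] instead.
my $data = join '', map { join("\t", @$_) . "\n" } (
    [qw(005 1 17085712 C S MST1P9)],
    [qw(006 1 17085713 C S MST1P9)],
);
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my $tree = load_overlap($fh);
close $fh;

print scalar(keys %{ $tree->{MST1P9} }), "\n";   # prints 2 (samples 005 and 006)
```

The lexical handle and three-argument open also remove the shared bareword MYFILE from the original script.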

    The usual way to accomplish something like this is to write your own hashing function. In Perl, this is roughly equivalent to passing your preliminary key through some sort of filter subroutine before you access it. You probably have done something like $names{lc($name)}++ without even thinking about it.

    This has a small cost (depending on how complex your function is), but if your potential wildcard expansion is more than a handful of elements (or even countably infinite...), it's a huge win.
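One way to read the key-filter idea for this problem is sketched below: a subroutine reduces each record to the part of the key that matters for the position check, so the sample never enters the key at all. The name `pos_key` and the tab separator are assumptions for illustration; a tab is a safe separator here only because the fields come from a tab-split line and cannot contain one.

```perl
use strict;
use warnings;

# Hypothetical key filter: keep gene and position, drop the sample.
sub pos_key {
    my ($gene, $pos) = @_;
    return join "\t", $gene, $pos;
}

# Rows as [sample, position, gene], values taken from the question's file2.
my %seen;
for my $row (
    [ '005', 17085712, 'MST1P9' ],
    [ '006', 17085713, 'MST1P9' ],
    [ '006', 17085713, 'MST1P9' ],
) {
    my ($sample, $pos, $gene) = @$row;
    $seen{ pos_key($gene, $pos) }++;   # every sample lands on the same key
}

my $hit  = exists $seen{ pos_key('MST1P9', 17085713) } ? 1 : 0;
my $miss = exists $seen{ pos_key('MST1P9', 42) }       ? 1 : 0;
print "$hit\t$miss\n";   # prints 1, a tab, then 0
```

The lookup is a single hash access regardless of how many samples a gene has.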

    By the way, your code was a bit hard to follow with the 200+ character lines, and it would have made more sense if you had labeled the column names like so:

    my ($foo, $bar, $baz, $qux) = split /\t/;

    (Of course replacing those names with whatever your columns should be called.)

Node Type: perlquestion [id://1007127]
Approved by GrandFather