Re^3: Memory issue with large cancer gene data structure

Here is the next iteration. It now includes code to avoid double counting the same mutation site for the same patient (it looks very similar to your original code...). You can choose between printing the full matrix or only the sites that actually carry a mutation by uncommenting one of the two lines near the end of the code. Hope this is helpful.

use strict;
use warnings;

my %site_length_catch;
my %sites;
my $maxsite = 0;

<DATA>; # skip header
while (<DATA>) {    # read line by line rather than slurping all lines into a list
    chomp;

    # split and give meaningful names
    my( $gene, $patient, $diagnosis, $mut_and_sit, $length ) = split /\s+/;

    # clean the site
    my $sit = $mut_and_sit;
    $sit =~ s/\D//g;

    # store patient to avoid double counting
    $site_length_catch{$gene}{$sit}{$patient} = 1;

    # store all sites with mutations
    $sites{$sit} = 1;
    $maxsite = $sit > $maxsite ? $sit : $maxsite;
}

# now remove double counted patients from the data structure
foreach my $gene ( values %site_length_catch) {
        for my $count ( values %$gene ) {
                $count = keys %$count; # in scalar context you get the number of keys
        }
}

# print table in desired format
# uncomment one of the following two lines
my @sitesprinted = sort { $a <=> $b } keys %sites;   # sparse printing
#my @sitesprinted = 1..$maxsite;                      # full printing

# header first
print "Gene";
print "\tsite $_" for @sitesprinted;
print "\n";

# now the data
foreach my $gene (keys %site_length_catch) {
        print $gene;
        print "\t", $site_length_catch{$gene}{$_} // 0 for @sitesprinted;
        print "\n";
}

__DATA__
Gene Name    Patient ID    Patient Diagnosis    Ammino Acid Mutation and Sit    Protein Length
AAK1    19679    adenocarcinoma    L661I    21265
AAK1    19679    adenocarcinoma    L664T    21265
AAK1    19679    adenocarcinoma    L664T    21265
AAK1    19679    adenocarcinoma    L664T    21265
AAK1    19679    adenocarcinoma    L664T    21265
AAK1    19679    adenocarcinoma    L664T    21265
AAK1    19676    adenocarcinoma    L664T    21265
AAK1    19677    adenocarcinoma    L64F    21265
AAK1    19678    adenocarcinoma    L64R    21265
FKT1    101063    ER-PR-sitive_carcinoma    p.L52R    2773
FKT1    103872    ER-PR-sitive_carcinoma    p.E17K    2773
FKT1    107590    ER-PR-sitive_carcinoma    p.E17K    2773
FKT1    107600    ER-PR-sitive_carcinoma    p.E17K    2773
FKT1    1135911    NS    E17K    2773
TET3    152    chronic_lymocytic_leukaemia    p.R401H    10982
TET3    587220    adenocarcinoma    M935V    10982
TET3    587220    adenocarcinoma    R1534Q    10982
TET3    587256    adenocarcinoma    G1356R    10982
TET3    587338    adenocarcinoma    G1356W    10982

As for your second table, I am optimistic that it needs only a relatively simple modification.

Re^4: Memory issue with large cancer gene data structure by ZWcarp (Beadle) on Jul 30, 2013 at 18:36 UTC
Can't thank you enough, this is awesome!
Re^4: Memory issue with large cancer gene data structure by ZWcarp (Beadle) on Aug 08, 2013 at 15:29 UTC
Thanks again for your help. Would you mind explaining how this section works?

    # now remove double counted patients from the data structure
    foreach my $gene ( values %site_length_catch) {
        for my $count ( values %$gene ) {
            $count = keys %$count; # in scalar context you get the number of keys
        }
    }

I get that you've created a hash of a hash of a hash with `$site_length_catch{$gene}{$sit}{$patient} = 1;` and initialized the bottom-level value to 1, correct? But in this part, how are you accessing the values of the next level? Why would you use `values %site_length_catch` instead of `keys %site_length_catch`? Thanks again for your time. The code works great; I just want to fully understand what's happening.
Re^5: Memory issue with large cancer gene data structure by hdb (Monsignor) on Aug 21, 2013 at 13:06 UTC
Apologies for the late reply; I have been away for a while. To answer your question: `keys` iterates through the keys of a hash, while `values` iterates through the associated values in the same order. So if you find yourself writing code like

    foreach my $key ( keys %hash ) {
        my $val = $hash{$key};
        # do something with $val ...
    }

and you do not use `$key` otherwise, you can write directly

    foreach my $val ( values %hash ) {
        # do something with $val ...
    }

If `$val` is a reference to a hash, as it is for the values of `%site_length_catch`, then `%$val` is a hash and the game can start again for the inner loop. The final line, `$count = keys %$count;`, takes the hash reference `$count`, counts its keys, and overwrites the hash reference with the number of keys, in this case the number of patients. Hope this is helpful even if you have worked it out yourself already...
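
For anyone who wants to see the overwrite happen in isolation, here is a minimal sketch of the same idea on toy data. It relies on the fact that a `foreach` loop variable over `values` is an alias for the stored value, so assigning to it changes the hash itself. The helper name `collapse_to_counts` and the toy hash `%catch` are made up for this illustration and are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): replace each
# innermost patient hash with the number of distinct patients it holds.
sub collapse_to_counts {
    my ($catch) = @_;                       # ref to gene => site => patient => 1
    for my $sites ( values %$catch ) {      # $sites is the per-gene site hash ref
        for my $count ( values %$sites ) {  # $count aliases the stored patient hash ref
            $count = keys %$count;          # overwrite the ref with its key count
        }
    }
    return $catch;
}

# Toy data in the same shape as %site_length_catch (values are illustrative)
my %catch = (
    AAK1 => {
        664 => { 19679 => 1, 19676 => 1 },  # two distinct patients
        661 => { 19679 => 1 },              # one patient
    },
);

collapse_to_counts( \%catch );

# Each site now maps to a plain number instead of a hash reference
print "site $_: $catch{AAK1}{$_}\n"
    for sort { $a <=> $b } keys %{ $catch{AAK1} };
```

Because the patient IDs are hash keys, a patient who appears several times for the same site is stored only once, which is why the key count equals the number of distinct patients.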

