Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re^2: Memory issue with large cancer gene data structure

by ZWcarp (Beadle)
on Jul 25, 2013 at 20:27 UTC ( #1046417=note: print w/ replies, xml ) Need Help??


in reply to Re: Memory issue with large cancer gene data structure
in thread Memory issue with large cancer gene data structure

Thats is extremely close to what I want, thank you so much for your time. The only difference there is another level which is patient ID ..... if two mutations are the same site but they are in the same patient its not important. What is interesting is when 2 mutations are in the same site/gene of two different patients. Pool enough of these and you can identify a drug target to treat the cancer. If the site and the Amino acid change .. ( note L64F vs L64R would be different ) then that is an even better target. Here is the input lines of the data

The file I need to read in is this structure. These 19 lines are taken from the real file which has 5,772,080 lines in the same format :

Gene Name Patient ID Patient Diagnosis Ammino Acid Mutation a +nd Sit Protein Length AAK1 19679 adenocarcinoma L661I 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19676 adenocarcinoma L664T 21265 AAK1 19677 adenocarcinoma L64F 21265 AAK1 19678 adenocarcinoma L64R 21265 FKT1 101063 ER-PR-sitive_carcinoma p.L52R 2773 FKT1 103872 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 107590 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 107600 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 1135911 NS E17K 2773 TET3 152 chronic_lymocytic_leukaemia p.R401H 10982 TET3 587220 adenocarcinoma M935V 10982 TET3 587220 adenocarcinoma R1534Q 10982 TET3 587256 adenocarcinoma G1356R 10982 TET3 587338 adenocarcinoma G1356W 10982
Now I need to count all positions that match in Amino Acid Site (the number but not the letters of the 4th column) but are in different samples. Note : Patient ID19679 and AA mutation L664T only corresponds to a count of 2 because all of them are in the same patient except one in patient 19676.

The out put needs to be in this format, where you have rows as genes and columns are 1-Length(the fifth column above). L is different for every gene. I've listed spans as no1.....no2 just for sake of space, but in the real file all these numbers in between have to be filled with 0's:

1-Largest Gene Length AA site -1 AA site -2 AA site -3 4 +16 AA site -17 18..51etc AA site 52 AA site 64 654 +00 AA site 401 402.660 AA site 661 AA site 664 AA sit +e 935 AA site 1356 AA site 1534 AAK1 0 0 0 0 0 0 0 2 0 1 2 + 0 0 0 FKT1 0 0 0 0 4 0 1 0 0 0 0 + 0 0 0 TET3 0 0 0 0 0 0 0 0 1 0 0 + 1 2 1

I'm simplifying because I also need to calculate a second table but this time with Amino acid position and mutation ( thus numbers and letters of Column 4) matching in different patients. Thats why my script is so elaborate, the $key3=$key4 is to remove the letters etc. I know i've done a poor job scripting it. Any advice would be fantastic!! Thanks so much for helping.


Comment on Re^2: Memory issue with large cancer gene data structure
Select or Download Code
Re^3: Memory issue with large cancer gene data structure
by hdb (Parson) on Jul 26, 2013 at 07:27 UTC

    Here is the next iteration. This now includes some code to avoid double counting of the same site of mutation for the same patient (looks very similar to your original code...). You can chose between printing the full matrix or only the relevant ones by uncommenting one of two lines close to the end of the code. Hope this is helpful.

    use strict; use warnings; my %site_length_catch; my %sites; my $maxsite = 0; <DATA>; # skip header foreach (<DATA>) { chomp; # split and give meaningful names my( $gene, $patient, $diagnosis, $mut_and_sit, $length ) = split /\s ++/; # clean the site my $sit = $mut_and_sit; $sit =~ s/\D//g; # store patient to avoid double counting $site_length_catch{$gene}{$sit}{$patient} = 1; # store all sites with mutations $sites{$sit} = 1; $maxsite = $sit > $maxsite ? $sit : $maxsite; } # now remove double counted patients from the data structure foreach my $gene ( values %site_length_catch) { for my $count ( values %$gene ) { $count = keys %$count; # in scalar context you get the + number of keys } } # print table in desired format # uncomment one of the following two lines my @sitesprinted = sort { $a <=> $b } keys %sites; # sparse printing #my @sitesprinted = 1..$maxsite; # full printing # header first print "Gene"; print "\tsite $_" for @sitesprinted; print "\n"; # now the data foreach my $gene (keys %site_length_catch) { print $gene; print "\t", $site_length_catch{$gene}{$_} // 0 for @sitesprint +ed; print "\n"; } __DATA__ Gene Name Patient ID Patient Diagnosis Ammino Acid Mutation a +nd Sit Protein Length AAK1 19679 adenocarcinoma L661I 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19676 adenocarcinoma L664T 21265 AAK1 19677 adenocarcinoma L64F 21265 AAK1 19678 adenocarcinoma L64R 21265 FKT1 101063 ER-PR-sitive_carcinoma p.L52R 2773 FKT1 103872 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 107590 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 107600 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 1135911 NS E17K 2773 TET3 152 chronic_lymocytic_leukaemia p.R401H 10982 TET3 587220 adenocarcinoma M935V 10982 TET3 587220 adenocarcinoma R1534Q 10982 TET3 587256 adenocarcinoma G1356R 10982 TET3 587338 adenocarcinoma G1356W 10982

    As per your second table, I am optimistic that it is a relatively simple modification only.

      Can't thank you enough this is awesome!
      Thanks again for your help. Would you mind explaining how this section is working?
      # now remove double counted patients from the data structure foreach my $gene ( values %site_length_catch) { for my $count ( values %$gene ) { $count = keys %$count; # in scalar context you get the + number of keys

      I get that you've created a hash of a hash of a hash $site_length_catch{$gene}{$sit}{$patient} = 1; and initialized the bottom value array to 1 ...correct? but then with this part how are you accessing the values of the next level... why would you use  values %site_length_catch instead of keys %site_length_catch

      Thanks again for your time, the code works great, I just want to fully understand whats happening.

        Apologies for the late reply, I have been away for a while.

        To answer your question: keys iterates through the keys of a hash while values iterates through the associated values in the same order. So if you find you are writing code like:

        foreach my $key ( keys %hash ) { my $val = $hash{$key}; # do something with $val ... }

        and not use $key otherwise you can write directly

        foreach my $val ( values %hash ) { # do something with $val ... }

        If $val is a reference to a hash as in %site_length_catch, then %$val is a hash and the game can start again for the inner loop. The final line $count = keys %$count; takes the hash reference $count, counts its keys and overwrites the hash reference with the number of keys, in this case the number of patients.

        Hope this is helpful even if you have worked it out yourself already...

      Thanks again for your help. Would you mind explaining how this section is working?
      # now remove double counted patients from the data structure foreach my $gene ( values %site_length_catch) { for my $count ( values %$gene ) { $count = keys %$count; # in scalar context you get the + number of keys

      I get that you've created a hash of a hash of a hash $site_length_catch{$gene}{$sit}{$patient} = 1; and initialized the bottom value array to 1 ...correct? but then with this part how are you accessing the values of the next level... why would you use  values %site_length_catch instead of keys %site_length_catch

      Thanks again for your time, the code works great, I just want to fully understand whats happening.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1046417]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (8)
As of 2014-08-27 11:27 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (237 votes), past polls