Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Memory issue with large cancer gene data structure

by ZWcarp (Beadle)
on Jul 25, 2013 at 14:23 UTC ( #1046352=perlquestion: print w/ replies, xml ) Need Help??
ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I'm a biologist and novice with perl so bare with me.

I'm running into memory issues with a script I'm trying to run. First I read in a file of genes (5772080 unique gene names) into a HoHoH0A corresponding to push(@{$hash{gene name} { sample ID} {Mutation site (a number)});

Then I read in a second file and initialize a second hash with values that are arrary references to arrays with length of the genes in amino acids. ie. 200AA = 200 cells.

Eventually I get the the point where I am trying to place counts of the sample unique mutations for each gene at the corresponding position in the array eof the second hash for that respective gene.

Heres the main parts of the code
foreach (@file) { chomp; my @r = split (/\t/, $_); push(@{$genes{$r[0]}{$r[3]}{$r[13]}},$r[13]); ## Pus +hing a hash that includes {Gene name}{SampleIdentifier}{AA site} } foreach (@file2){ # Initialize arrays w/ 0's for each gene with siz +e of length for each gene chomp; my @r=split(/\t/,$_); my @zeros = (0) x $r[1]; @{$s_l{$r[0]}} = @zeros; } foreach my $key1 ( sort keys %genes ) { ### this part stores the mutat +ions with and with out Amino acid (string info) the second hash site +is only numbers otherwise ignore this section foreach my $key2 ( sort keys %{$genes{$key1}} ) { foreach my $key3 ( sort keys %{$genes{$key1}{$ +key2}} ) { push(@{$AA{$key1}{$key3}},$key3); my $key4=$key3; $key4=~s/\D\.\D([0-9]+)\D/$1/; $key4=~ +s/(\*|\?|s\d+)//; push (@{$site{$key1}{$key4}},$key4); } } } #this is the problem section foreach my $key1 ( sort keys %AA ) { foreach my $key4 ( sort keys %{$site{$key1}} ) { $site_length_catch{$key1}[$key4] = scalar ( @ +{$site{$key1}{$key4}} ); } }

Comment on Memory issue with large cancer gene data structure
Download Code
Re: Memory issue with large cancer gene data structure
by Anonymous Monk on Jul 25, 2013 at 14:28 UTC

    $site_length_catch{$key1}[$key4]

    Did you mean for that to be an array ref instead of a hash ref for $key4? Also, what values does $key4 take?

    If $key4 happens to be a large number, then you'll blow away all your memory in one step as the array attempts to grow to ludicrous size.

      $key4 = the amino acid position. It is derived from the first file and edited to remove non digits so that its just a numerical position

      cell $key4 is originally initialized as 0 and at this step it needs to be changed to the number of mutations at this site (the array length corresponds to the gene length of gene $key1 so each cell corresponds to an amino acid position in the gene). It is set equal to scalar ( @{$site{$key1}{$key4}} ); because this calculates the number of values at a mutation site, thus the size of this essentially opperates as a count for recurrent positions across different samples. The numbers for this are mostly 0-10 there are a few large like one 35K and a few below that.
Re: Memory issue with large cancer gene data structure
by RichardK (Priest) on Jul 25, 2013 at 15:07 UTC

    Do you really need to keep a list of $key4 ? I think you could just count them, $key4 as both a key and a value looks at bit odd, and you only seem to be counting them by keeping an array of repeated values. So, you could get rid of the useless arrays and replace them with a simple integer.

    #so replace this push (@{$site{$key1}{$key4}},$key4); #with $site{$key1}{$key4}++;

    BTW In the code you posted as the problem area the sorts are doing nothing useful, just taking up time, but maybe you're doing other stuff in there too.

Re: Memory issue with large cancer gene data structure
by kcott (Abbot) on Jul 25, 2013 at 16:24 UTC

    G'day ZWcarp,

    Firstly, you're doing yourself no favours whatsoever by littering your code with meaningless variable names. While you may remember, while you're writing it and it's fresh in your mind, what $r[3] or $key2 refer to, you won't next month when you have to come back to it to make a modification.

    From the code and text you've provided: the keys of %AA are gene names (if I've got that right, $gene_name would be a meaningful name); the keys of %{$site{$gene_name}} are mutation sites (again $mutation_site would be meaningful); and so on throughout your code.

    I don't see any purpose to any of the sorting you're doing in your "problem section" (it's wasted processing and chews up even more memory) and I agree with AM about the [$key4].

    Putting all that together, I think your "problem section" could be written as (untested):

    for my $gene_name (keys %AA) { for (keys %{$site{$gene_name}}) { $site_length_catch{$gene_name}{$_} = @{$site{$gene_name}{$_}}; } }

    And later you can access that data as:

    my $mutation_count = $site_length_catch{$gene_name}{$mutation_site};

    There's other parts of your code that seem dubious (e.g. the my $key4=$key3; assignment) which perhaps will become obvious to you when you apply better names. You're working with very large amounts of data and loops nested three-deep: you need to keep all the code (but especially the innermost loops) as efficient as possible: go through your code and remove unnecessary assignments, sorting and so on.

    -- Ken

Re: Memory issue with large cancer gene data structure
by hdb (Parson) on Jul 25, 2013 at 17:47 UTC

    Not sure if I fully understand but I'd like to venture a guess what you want:

    use strict; use warnings; my %site_length_catch; my $max = 0; foreach (@file) { chomp; my @r = split /\t/; # cleaning from your second loop $r[13] =~ s/\D\.\D([0-9]+)\D/$1/; $r[13] =~ s/(\*|\?|s\d+)//; $site_length_catch{$r[0]}{$r[13]}++; $max = $r[13]>$max?$r[13]:$max; } foreach my $gene (keys %site_length_catch) { print $site_length_catch{$gene}{$_} // 0, "\t" for 1..$max; print "\n"; }

    The hash %site_length_catch is a sparse matrix containing the name of the gene as the first dimension and the site of mutations as the second dimension. Each cell in the matrix contains the number of mutations at that site for that gene.

    When printing the empty spaces are filled with zeros (this is what  // 0 does). I have added the regexes from your second loop as they seem to be applied to the "Mutation site". Just remove them if I have guessed wrongly.

    Feedback would be appreciated, along with a few lines of your input, if possible.

      Thats is extremely close to what I want, thank you so much for your time. The only difference there is another level which is patient ID ..... if two mutations are the same site but they are in the same patient its not important. What is interesting is when 2 mutations are in the same site/gene of two different patients. Pool enough of these and you can identify a drug target to treat the cancer. If the site and the Amino acid change .. ( note L64F vs L64R would be different ) then that is an even better target. Here is the input lines of the data

      The file I need to read in is this structure. These 19 lines are taken from the real file which has 5,772,080 lines in the same format :

      Gene Name Patient ID Patient Diagnosis Ammino Acid Mutation a +nd Sit Protein Length AAK1 19679 adenocarcinoma L661I 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19676 adenocarcinoma L664T 21265 AAK1 19677 adenocarcinoma L64F 21265 AAK1 19678 adenocarcinoma L64R 21265 FKT1 101063 ER-PR-sitive_carcinoma p.L52R 2773 FKT1 103872 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 107590 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 107600 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 1135911 NS E17K 2773 TET3 152 chronic_lymocytic_leukaemia p.R401H 10982 TET3 587220 adenocarcinoma M935V 10982 TET3 587220 adenocarcinoma R1534Q 10982 TET3 587256 adenocarcinoma G1356R 10982 TET3 587338 adenocarcinoma G1356W 10982
      Now I need to count all positions that match in Amino Acid Site (the number but not the letters of the 4th column) but are in different samples. Note : Patient ID19679 and AA mutation L664T only corresponds to a count of 2 because all of them are in the same patient except one in patient 19676.

      The out put needs to be in this format, where you have rows as genes and columns are 1-Length(the fifth column above). L is different for every gene. I've listed spans as no1.....no2 just for sake of space, but in the real file all these numbers in between have to be filled with 0's:

      1-Largest Gene Length AA site -1 AA site -2 AA site -3 4 +16 AA site -17 18..51etc AA site 52 AA site 64 654 +00 AA site 401 402.660 AA site 661 AA site 664 AA sit +e 935 AA site 1356 AA site 1534 AAK1 0 0 0 0 0 0 0 2 0 1 2 + 0 0 0 FKT1 0 0 0 0 4 0 1 0 0 0 0 + 0 0 0 TET3 0 0 0 0 0 0 0 0 1 0 0 + 1 2 1

      I'm simplifying because I also need to calculate a second table but this time with Amino acid position and mutation ( thus numbers and letters of Column 4) matching in different patients. Thats why my script is so elaborate, the $key3=$key4 is to remove the letters etc. I know i've done a poor job scripting it. Any advice would be fantastic!! Thanks so much for helping.

        Here is the next iteration. This now includes some code to avoid double counting of the same site of mutation for the same patient (looks very similar to your original code...). You can chose between printing the full matrix or only the relevant ones by uncommenting one of two lines close to the end of the code. Hope this is helpful.

        use strict; use warnings; my %site_length_catch; my %sites; my $maxsite = 0; <DATA>; # skip header foreach (<DATA>) { chomp; # split and give meaningful names my( $gene, $patient, $diagnosis, $mut_and_sit, $length ) = split /\s ++/; # clean the site my $sit = $mut_and_sit; $sit =~ s/\D//g; # store patient to avoid double counting $site_length_catch{$gene}{$sit}{$patient} = 1; # store all sites with mutations $sites{$sit} = 1; $maxsite = $sit > $maxsite ? $sit : $maxsite; } # now remove double counted patients from the data structure foreach my $gene ( values %site_length_catch) { for my $count ( values %$gene ) { $count = keys %$count; # in scalar context you get the + number of keys } } # print table in desired format # uncomment one of the following two lines my @sitesprinted = sort { $a <=> $b } keys %sites; # sparse printing #my @sitesprinted = 1..$maxsite; # full printing # header first print "Gene"; print "\tsite $_" for @sitesprinted; print "\n"; # now the data foreach my $gene (keys %site_length_catch) { print $gene; print "\t", $site_length_catch{$gene}{$_} // 0 for @sitesprint +ed; print "\n"; } __DATA__ Gene Name Patient ID Patient Diagnosis Ammino Acid Mutation a +nd Sit Protein Length AAK1 19679 adenocarcinoma L661I 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19679 adenocarcinoma L664T 21265 AAK1 19676 adenocarcinoma L664T 21265 AAK1 19677 adenocarcinoma L64F 21265 AAK1 19678 adenocarcinoma L64R 21265 FKT1 101063 ER-PR-sitive_carcinoma p.L52R 2773 FKT1 103872 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 107590 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 107600 ER-PR-sitive_carcinoma p.E17K 2773 FKT1 1135911 NS E17K 2773 TET3 152 chronic_lymocytic_leukaemia p.R401H 10982 TET3 587220 adenocarcinoma M935V 10982 TET3 587220 adenocarcinoma R1534Q 10982 TET3 587256 adenocarcinoma G1356R 10982 TET3 587338 adenocarcinoma G1356W 10982

        As per your second table, I am optimistic that it is a relatively simple modification only.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1046352]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (9)
As of 2014-07-30 06:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (229 votes), past polls