### Memory issue with cancer data (analogy)

on Jul 25, 2013 at 14:56 UTC
ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, and thanks for your help! I am reposting an earlier problem, simplified through an analogy. I have a very complicated biological data set and I am running into memory issues. What is the best way to solve the problem below?

Attempt at a non biology analogy: ... you're in a warehouse with X closets.

Each closet has H tie hangers (H can be different for each closet).

Each tie hanger has M rungs or hooks (M can differ as well). Then, by some process, ties are placed on some of the hooks (there are only as many hooks as ties placed; the only important number is how many ties there are per hanger).

I need to read in a file holding the ties/mutations per hanger/amino_acid_position per closet/gene, and then fill an AoA whose rows are the X closets and whose columns are the hangers H_1..H_n of closet x. Each cell needs to hold M, the number of ties/mutations, i.e. [x][n_x] = M. This structure will allow me to perform the right statistical test.

Replies are listed 'Best First'.
Re: Memory issue with cancer data (analogy)
by BrowserUk (Pope) on Jul 25, 2013 at 16:51 UTC

You are constructing a 3-dimensional array, and you are running out of memory. Assuming $data[ 0..X ][ 0..Y ][ 0..Z ];

What we need from you is:

1. the maximum size of those three dimensions X, Y, Z?
2. Are the dimensions contiguous or sparse?

If sparse, the approximate density?

If one or more of X, Y and Z runs over a range such as 3000 .. 4000, or if instead of using every number between 0 .. m you only use every 10th or 100th, then you can save substantial space by using a hash instead of an array for that dimension of the structure.

3. Do you need to build the entire dataset before you can calculate your statistics?

Could you build (say) all of $data[1][Y][Z] for X=1; calculate the stats; and then discard that before building all $data[2][Y][Z] for X=2?

4. What are you storing in each element of that 3d array?

Is it just a number? If so, how big will that number get?

If, for example, each element of the array held an integer no larger than 255, then you can easily substitute a string for the third-level arrays and save huge amounts of memory.

Eg. This constructs a 100x100x100 3d array of small integers which requires 33MB of memory.

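```
use Devel::Size qw( total_size );   ## total_size comes from Devel::Size

## 100x100x100 array of arrays of arrays of small integers
my @data = map [ map [ map int( rand 256 ), 0..99 ], 0..99 ], 0..99;
print total_size( \@data ), "\n";   ## reported in the post: 33454784
```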

This, on the other hand, constructs a 100x100 2D array of 100-byte strings. It contains the exact same information, but it only requires 1.6MB:

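```
use Devel::Size qw( total_size );   ## total_size comes from Devel::Size

## 100x100 array of 100-byte strings; each byte holds one small integer
my @data = map [ map pack( 'C*', map int( rand 256 ), 0..99 ), 0..99 ], 0..99;
print total_size( \@data ), "\n";   ## reported in the post: 1614784
```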
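Reading and writing individual elements of the packed form stays cheap. The following is only a sketch, with small made-up dimensions and a made-up variable name (`@packed`), of how `vec` and `unpack` could be used to get at the byte-sized values; it is not part of the original reply.

```perl
use strict;
use warnings;

# Hypothetical small dimensions so the example runs instantly.
my ( $X, $Y, $Z ) = ( 4, 5, 6 );

# Packed form: $packed[$x][$y] is a string of $Z zero bytes,
# one byte per element of the third dimension.
my @packed = map { [ ( "\0" x $Z ) x $Y ] } 1 .. $X;

# Store and fetch a single value; the last argument of vec() is the
# element width in bits, so 8 addresses whole bytes.
vec( $packed[2][3], 4, 8 ) = 200;
my $value = vec( $packed[2][3], 4, 8 );

# unpack recovers the whole third dimension at once.
my @row = unpack 'C*', $packed[2][3];

print "value = $value\n";
print "row   = @row\n";
```

This only works while every value fits in one byte (0 to 255); wider values would need a wider `vec` width or a different `pack` template.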

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Sorry, I know my description sucks. I'll try to answer your questions by providing some of the actual data and the output, which I made manually. There will never be set dimensions, as every dataset will change depending on the patients' tumors coming in. The idea is to have a script that can handle dimensions of any size.

The file I need to read in has this structure. These 19 lines are taken from the real file, which has 5,772,080 lines in the same format:

```
Gene Name    Patient ID    Patient Diagnosis    Ammino Acid Mutation and Sit    Protein Length
FKT1    101063    ER-PR-sitive_carcinoma    p.L52R    2773
FKT1    103872    ER-PR-sitive_carcinoma    p.E17K    2773
FKT1    107590    ER-PR-sitive_carcinoma    p.E17K    2773
FKT1    107600    ER-PR-sitive_carcinoma    p.E17K    2773
FKT1    1135911    NS    E17K    2773
TET3    152    chronic_lymocytic_leukaemia    p.R401H    10982
```
Now I need to count all positions that match in amino acid site (the number, but not the letters, of the 4th column) but occur in different samples. Note: Patient ID 19679 and AA mutation L664T only correspond to a count of 2, because all of them are in the same patient except one in patient 19676.

The output needs to be in this format, where the rows are genes and the columns are 1..Length (the fifth column above). The length is different for every gene. I've listed spans as no1.....no2 just for the sake of space, but in the real file all the numbers in between have to be filled with 0's:

```
1-Largest Gene Length    AA site -1    AA site -2    AA site -3    4……16    AA site -17    18…..51etc    AA site 52    AA site 64    65…400    AA site 401    402….660    AA site 661    AA site 664    AA site 935    AA site 1356    AA site 1534
AAK1    0    0    0    0    0    0    0    2        0        1    2    0    0    0
FKT1    0    0    0    0    4    0    1    0        0        0    0    0    0    0
TET3    0    0    0    0    0    0    0    0        1        0    0    1    2    1
```

I'm simplifying, because I also need to calculate a second table, this time with both amino acid position and mutation (thus the numbers and the letters of column 4) matching in different patients. That's why my script is so elaborate; the $key3=$key4 is there to remove the letters, etc. I know I've done a poor job scripting it. Any advice would be fantastic!! Thanks so much for helping.

See how you get on with this:

```
#! perl -sw
use strict;

my %table;      ## gene => position => patient id => undef
my %lengths;    ## gene => protein length
while( <> ) {
    my( $gene, $id, undef, $site, $len ) = split;
    next unless defined $site;             ## skip blank or short lines
    my( $pos ) = $site =~ m[(\d+)];        ## extract the digits from the site
    next unless defined $pos;              ## skip the header line (no digits in column 4)
    undef $table{ $gene }{ $pos }{ $id };  ## adds the id as a key with no value (saves space!)
    $lengths{ $gene } = $len;              ## save the gene lengths for later
}

#print 'output header line here if required';
for my $gene ( sort keys %table ) {
    print "$gene";
    my $p = 1;                             ## next position still to be printed
    for my $pos ( sort{ $a <=> $b } keys %{ $table{ $gene } } ) {
        ## zeros for the gap, then the number of distinct patients at this site
        print "\t0" x ( $pos - $p ), "\t", scalar keys %{ $table{ $gene }{ $pos } };
        $p = $pos + 1;
    }
    ## pad with zeros up to and including the last position
    print "\t0" x ( $lengths{ $gene } - $p + 1 ), "\n";
}
```

Invoke it as thisScript.pl < theInputFile > theOutputFile. It shouldn't take more than a minute or two to run.

It'll probably need tweaking. Like adding an appropriate header line if that is a requirement. I couldn't work out what would be needed as all the output lines will be different lengths, as the genes are different lengths.
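If memory is still tight, the "build one X, calculate, discard" idea from earlier in the thread could be applied here too. The sketch below is not from the original reply: it assumes all lines for a gene are adjacent in the input (which has not been confirmed for the real file), and the names `gene_row` and `stream_rows` and the toy records are invented for illustration.

```perl
use strict;
use warnings;

# Turn one gene's { position => { patient_id => undef } } hash into a
# tab-separated row with a distinct-patient count for positions 1 .. $len.
sub gene_row {
    my ( $gene, $len, $sites ) = @_;
    my @counts = map {
        exists $sites->{$_} ? scalar keys %{ $sites->{$_} } : 0
    } 1 .. $len;
    return join "\t", $gene, @counts;
}

# Read records from $in and print each gene's row to $out as soon as the
# gene changes. ASSUMPTION: all lines of one gene are adjacent in the input.
sub stream_rows {
    my ( $in, $out ) = @_;
    my ( $current, $len, %sites );
    while ( my $line = <$in> ) {
        my ( $gene, $id, undef, $site, $length ) = split ' ', $line;
        next unless defined $site and $site =~ /(\d+)/;   # skips header and blank lines
        my $pos = $1;
        if ( defined $current and $gene ne $current ) {
            print {$out} gene_row( $current, $len, \%sites ), "\n";
            %sites = ();                                   # free the finished gene
        }
        ( $current, $len ) = ( $gene, $length );
        undef $sites{$pos}{$id};                           # key only, no value
    }
    print {$out} gene_row( $current, $len, \%sites ), "\n" if defined $current;
}

# Toy records (hypothetical, short protein lengths so the rows stay readable).
my $sample = <<'END';
Gene Name    Patient ID    Patient Diagnosis    Mutation    Protein Length
G1    p1    dx    p.E3K    5
G1    p2    dx    p.E3K    5
G1    p2    dx    p.E3K    5
G1    p3    dx    p.L5R    5
G2    p1    dx    p.R2H    4
END

open my $in, '<', \$sample or die "cannot open in-memory sample: $!";
stream_rows( $in, \*STDOUT );
```

With this layout only one gene's positions and patient ids are held at a time. If the real file is not grouped by gene, it would have to be sorted on the first column before this approach applies.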

Re: Memory issue with cancer data (analogy)
by hdb (Monsignor) on Jul 25, 2013 at 15:09 UTC

So you want to translate a dataset

```
X  H
1  3
3  1
3  4
2  8
1  1
3  1
etc...
```

into

```
X\H 1  2  3  4  5  6  7  8 etc...
1   1  0  1  0  0  0  0  0
2   0  0  0  0  0  0  0  1
3   2  0  0  1  0  0  0  0
etc...
```

? Just like an Excel pivot table?
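For what it's worth, that kind of pivot is a few lines with a hash of hashes. This is only a sketch using the six X/H pairs from the example above; the variable names are made up, and the real data would of course be read from a file rather than hard-coded.

```perl
use strict;
use warnings;

# The six X/H pairs from the example above.
my @pairs = ( [ 1, 3 ], [ 3, 1 ], [ 3, 4 ], [ 2, 8 ], [ 1, 1 ], [ 3, 1 ] );

# Count occurrences of each (X, H) pair and remember the widest H seen.
my %count;
my $max_h = 0;
for my $pair (@pairs) {
    my ( $x, $h ) = @$pair;
    $count{$x}{$h}++;
    $max_h = $h if $h > $max_h;
}

# One header line, then one line per X with a count (or 0) for every H.
my @lines = join ' ', 'X\H', 1 .. $max_h;
for my $x ( sort { $a <=> $b } keys %count ) {
    push @lines, join ' ', $x, map { $count{$x}{$_} // 0 } 1 .. $max_h;
}
print "$_\n" for @lines;
```

Only the (X, H) combinations that actually occur are stored; the zeros are generated at print time.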

I think it depends on whether your hooks and hangers and ties etc. have 'names', or whether you could reliably track the data by index alone. Could you post a small sample of the real data you're trying to manipulate?

EDIT: Sorry hdb, was aiming for OP.

Only the closets need names: these are the genes. The next two levels are the amino acid positions in each gene of length L (hangers) and the number of mutations at each amino acid position (number of ties on a hanger). I think an index would be sufficient for the last two?
