http://www.perlmonks.org?node_id=1046401


in reply to Memory issue with large cancer gene data structure

I'm not sure I fully understand, but I'd like to venture a guess at what you want:

    use strict;
    use warnings;

    my %site_length_catch;
    my $max = 0;

    foreach (@file) {
        chomp;
        my @r = split /\t/;

        # cleaning from your second loop
        $r[13] =~ s/\D\.\D([0-9]+)\D/$1/;
        $r[13] =~ s/(\*|\?|s\d+)//;

        $site_length_catch{$r[0]}{$r[13]}++;
        $max = $r[13] > $max ? $r[13] : $max;
    }

    foreach my $gene (keys %site_length_catch) {
        print $site_length_catch{$gene}{$_} // 0, "\t" for 1..$max;
        print "\n";
    }

The hash %site_length_catch is a sparse matrix containing the name of the gene as the first dimension and the site of mutations as the second dimension. Each cell in the matrix contains the number of mutations at that site for that gene.
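To make the shape of that structure concrete, here is a minimal sketch. The gene names and site numbers are made up for illustration, and the rows are reduced to already-cleaned (gene, site) pairs instead of the full tab-separated lines from your file:

```perl
use strict;
use warnings;

# Hypothetical, already-cleaned (gene, site) pairs; not taken from real data.
my @rows = ( [ 'GENE_A', 3 ], [ 'GENE_A', 3 ], [ 'GENE_A', 7 ], [ 'GENE_B', 1 ] );

# First key: gene name. Second key: mutation site. Value: number of hits.
my %site_length_catch;
$site_length_catch{ $_->[0] }{ $_->[1] }++ for @rows;

# Only (gene, site) pairs that were actually seen exist as keys.
for my $gene ( sort keys %site_length_catch ) {
    for my $site ( sort { $a <=> $b } keys %{ $site_length_catch{$gene} } ) {
        print "$gene\t$site\t$site_length_catch{$gene}{$site}\n";
    }
}
```

Sites that never occur for a gene are simply absent from the inner hash, which is what should keep memory use down compared with a full gene-by-site table.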

When printing, the empty cells are filled with zeros (this is what // 0 does; the defined-or operator needs Perl 5.10 or later). I have added the regexes from your second loop, as they seem to be applied to the "Mutation site". Just remove them if I have guessed wrongly.
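The zero-filling for one row of the matrix can be sketched in isolation like this. The counts and the value of $max are invented for the example, and I use join here instead of printing a tab after every cell, which is just one way to avoid a trailing tab:

```perl
use strict;
use warnings;

# Hypothetical counts for a single gene: only sites 2 and 5 were seen.
my %count = ( 2 => 3, 5 => 1 );
my $max   = 5;

# A missing key yields undef, which // replaces with 0.
# Reading a missing key this way does not create it in the hash.
my @row = map { $count{$_} // 0 } 1 .. $max;

print join( "\t", @row ), "\n";
```

After this runs, @row holds (0, 3, 0, 0, 1) and %count still has only its two original keys, so the sparse structure is not filled in as a side effect of printing.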

Feedback would be appreciated, along with a few lines of your input, if possible.