Out of Memory when generating large matrix

by cathyyihao (Novice)
on Mar 05, 2018 at 02:41 UTC
cathyyihao has asked for the wisdom of the Perl Monks concerning the following question:

So I'm working on a kmer project, and I need to generate a 4**21 by 197 matrix to record the information I need for each kmer, since I'm doing 21mers right now. However, I get an out-of-memory error when I try to generate the matrix. Is there any way I can get around it?

Here is the script I wrote:

#!/usr/bin/perl
use strict;
use warnings;

#initialize matrix
my @matrix;
my $i; #column
my $j; #row
my @sum;
for ($i=0; $i<197; $i++){
    for ($j=0; $j<4**21; $j++){
        $matrix[$i][$j] = 0;
        if($i==0) {
            $sum[$j] = 0;
        }
    }
}

#import the list of file names
my $fn = $ARGV[0];
chomp $fn;
unless (open(FILE, $fn)){
    print "Can't open file \"$fn\"\n\n";
    exit;
}
chomp(my @line = <FILE>);

my $n;
for ($n=0; $n<scalar @line; $n++){
    #open each kmer file
    unless (open(KMER, $line[$n])){
        print "Can't open file \"$line[$n]\"\n\n";
        exit;
    }
    my @kmers = <KMER>;
    foreach my $kmer (@kmers){
        my @split = split(/\s+/, $kmer); #obtain kmer reference number
        $matrix[$n][$split[0]] = 1;
        #print OUT "$matrix[$n][$split[0]]\t";
        $sum[$split[0]] = $sum[$split[0]] + 1;
    }
    close KMER;
}
close FILE;

#create outfile
my $outfile = $ARGV[1];
unless(open OUT, '>' . $outfile){
    die "\nUnable to create $outfile\n";
}

#sum up
for ($j=0; $j<4**21; $j++){
    print OUT "$sum[$j]\n";
}

Replies are listed 'Best First'.
Re: Out of Memory when generating large matrix
by BrowserUk (Pope) on Mar 05, 2018 at 04:56 UTC

    4**21 = 4,398,046,511,104. If each of those array elements took just 1 byte, that would need 4,000GB (or 4TB) of RAM.

    But each element of a Perl array requires a minimum of 32 bytes on a 64-bit system, so that increases the program's memory requirement to at least 128,000GB (128TB).

    And that's before you multiply by 197!

    Even today, very few computers can access terabytes of memory -- the latest, greatest Skylake will only address 128GB.

    The bottom line is that you are going to have to tackle your problem a different way.

    If you can describe what you are trying to do -- in computer terms, with a minimum of bio-lingo -- there is probably an approach that will allow you to achieve it without needing £432,000 worth of memory and a processor that can address it; if such an animal currently exists, it would be in the $millions price range.



      Hi! Thank you so much for your reply! The goal I'm trying to achieve is this: the 21mer I'm working on is a sequence of 21 letters, and each position can hold one of the 4 letters "AGCT". Therefore, in total, there should be 4**21 combinations. I have 197 kmer files in total, each containing around 8000 of these 21mers, and I'm trying to count how many times each 21mer appears among these 197 files. That's why I need a matrix to do so.

        Therefore, in total, there should be 4**21 combinations.

        Yes. But not all of them -- in fact only a very small percentage -- will actually appear; so rather than reserving space for trillions of counts that will almost all go unused, you can choose to count only those that do appear. That will save huge amounts of space and speed up your processing.

        With 197 files * 8000 lines, there are at most 1.5 million 21mers -- if they are all unique -- so there is no need to reserve space for all 4.3 trillion possibilities: 197*8000 / 4.3e12 * 100 = 0.0000358%.

        I'm trying to count how many times each 21mer appears among these 197 files. That's why I need a matrix to do so.

        Fair enough, but you don't need to count them all in one huge matrix. You can process the 197 files one by one and store only the nonzero counts across files.
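
        A minimal sketch of that idea (assuming, as in your script, that the first command-line argument names a file listing the 197 kmer files, and that the kmer reference number is the first whitespace-separated field of each line):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Read the list of kmer file names from the file named in $ARGV[0].
        open my $list, '<', $ARGV[0] or die "Can't open file list '$ARGV[0]': $!\n";
        chomp(my @files = <$list>);
        close $list;

        # One hash entry per kmer that actually appears -- nothing is
        # reserved for the trillions of kmers that never occur.
        my %count;
        for my $file (@files) {
            open my $fh, '<', $file or die "Can't open kmer file '$file': $!\n";
            while (my $line = <$fh>) {
                my ($ref_num) = split /\s+/, $line;  # first field: kmer reference number
                $count{$ref_num}++;
            }
            close $fh;
        }

        # Report each kmer seen and its total count across all files.
        print "$_\t$count{$_}\n" for sort { $a <=> $b } keys %count;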

        Could you post a small sample (say 20 lines) of what is in your kmer files?



        197 * 8000 = 1,576,000. If you only have 1.5e6 data points, why are you trying to match them against 4**21 possible combinations when all bar an infinitesimal fraction of those won't be present?

        Just compare your 1.5e6 against themselves and mark everything else as zero. No matrices required.
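
        For instance, a one-liner sketch of the whole job (again assuming the kmer reference number is the first whitespace-separated field, and with made-up file names):

        perl -lane '$c{$F[0]}++; END { print "$_\t$c{$_}" for sort { $a <=> $b } keys %c }' kmer_file_1 kmer_file_2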

        For fun, see The 10**21 Problem (Part I) and the subsequent parts.

        (Edit: BrowserUk managed to construct a fuller and therefore better response in the time it took me to deduce and type this. Still, great minds and all that ;-)

Re: Out of Memory when generating large matrix
by alexander_lunev (Scribe) on Mar 05, 2018 at 07:07 UTC

    Hello!

    It's hard to see the whole picture without a sample of the data, but I will try.

    As far as I can see, you have files with data something like this:

    1 the only
    2 significant
    3 part
    4 is
    5 a
    6 number
    7 before
    8 the
    9 first
    10 space
    11 character

    Then you pick the leading number in every line and use it as an index into the array @sum to count the occurrences of that number. I don't see why you even need the @matrix array at this point; maybe you will use it later. Do you really need an array of size 4**21? Will there be 4**21 different numbers in those files? If not -- if your memory can actually hold all the numbers that appear in your files (rather than the whole range of possible numbers) -- then maybe you should use a hash instead of an array. Something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %matrix;
    my %sum;

    #import the list of file names
    my $fn = $ARGV[0];
    chomp $fn;
    open (my $FILE, $fn) or die "Can't open file \"$fn\"\n\n";
    chomp(my @line = <$FILE>);

    for (my $n = 0; $n < scalar @line; $n++){
        #open each kmer file
        open(my $KMER, $line[$n]) or die "Can't open file \"$line[$n]\"\n\n";
        my @kmers = <$KMER>;
        foreach my $kmer (@kmers){
            my @split = split(/\s+/, $kmer); #obtain kmer reference number
            # for now we don't have to populate the big %matrix:
            # $matrix{$n}{$split[0]} = 1;
            $sum{$split[0]}++;
        }
        close $KMER;
    }
    close $FILE;

    #create outfile
    my $outfile = $ARGV[1];
    open (my $OUT, '>' . $outfile) or die "\nUnable to create $outfile\n";

    #sum up
    foreach my $k (keys %sum){
        print $OUT "$sum{$k}\n";
    }

    Still, it's hard to tell whether this will suffice without seeing the data. If it doesn't -- if the count of numbers will not fit in your RAM -- then you will have to use an interim DB, like MySQL or similar. I will use MySQL queries here. First, create the DB and a table in it:

    CREATE DATABASE kmers;
    CREATE TABLE matrix (file int, number int, key (file), key (number));

    Then you will have to populate the matrix table with an SQL query like this:

    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=kmers;host=127.0.0.1',
                           'mysql_user', '');
    my $sth = $dbh->prepare("INSERT INTO matrix VALUES (?,?)");

    for (my $n = 0; $n < scalar @line; $n++){
        #open each kmer file
        open(my $KMER, $line[$n]) or die "Can't open file \"$line[$n]\"\n\n";
        my @kmers = <$KMER>;
        foreach my $kmer (@kmers){
            my @split = split(/\s+/, $kmer); #obtain kmer reference number
            $sth->execute($n, $split[0]);   # one row per kmer occurrence
        }
        close $KMER;
    }
    $sth->finish();

    And after the matrix table is populated, you can query the sums from it:

    my $count_query = "SELECT number, COUNT(number) FROM matrix
                       GROUP BY number ORDER BY number";
    $sth = $dbh->prepare($count_query);

    But still, even with an interim DB, if the count of numbers is greater than your memory, you will have to fetch the results and print them to the file one row at a time:

    #sum up, one row at a time (the in-memory hash loop is no longer needed)
    $sth->execute();
    while (my $res = $sth->fetchrow_arrayref()) {
        print $OUT "$res->[1]\n";
    }

      Hi! The hash actually solves my problem!! I was overcomplicating things; all I actually needed was the reference number and how many times it has appeared!! Thank you so much!

Re: Out of Memory when generating large matrix
by AnomalousMonk (Chancellor) on Mar 05, 2018 at 04:10 UTC
Re: Out of Memory when generating large matrix
by thomas895 (Chaplain) on Mar 06, 2018 at 05:38 UTC

    If your matrix contains many zeros, you may wish to consider a Sparse Matrix.
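
    In Perl, a sparse matrix can be as simple as a hash of hashes that stores only the nonzero cells -- a minimal sketch, with made-up coordinates:

    use strict;
    use warnings;

    # Sparse matrix as a hash of hashes: only the cells you write occupy memory.
    my %sparse;
    $sparse{3}{1_000_000_000} = 1;   # set cell (row 3, column 1e9)
    $sparse{3}{42}            = 7;

    my $stored = 0;
    $stored += keys %{ $sparse{$_} } for keys %sparse;
    my $val = $sparse{5}{99} // 0;   # cells never written read back as zero
    print "cell(5,99) = $val; $stored cells actually stored\n";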

    -Thomas
    "Excuse me for butting in, but I'm interrupt-driven..."
Re: Out of Memory when generating large matrix
by Anonymous Monk on Mar 06, 2018 at 03:07 UTC
    Shamelessly hiding-out behind Anonymous Monk, now, just like everybody else does, you-know-who will simply take this opportunity to remind the entire lot of you that a technical requirement such as this one actually requires no(!) custom programming whatsoever. Any sort of statistics software ... or perhaps even a stats-plugin for whatever might be your favorite spreadsheet ... will scarcely blink at this requirement or this volume of data.

      That's exciting. Since it's so easy, please whip up a working example like the rest of us regularly do.

      –AnonymousMonkey

        I'm not the Anon above, or you-know-who, but I'll gladly oblige!

        Tally up the counts of tokens that appear on field #2.

        $ cat datafiles/* | sort -k2 | uniq -c -f1

        Can we please see your code now? Or maybe yet another diatribe on how the approach is simply wrong, does not scale, requires expensive disk i/o, is inelegant, hazardous, broken, politically incorrect or in some other way unsuitable.

Re: Out of Memory when generating large matrix
by sundialsvc4 (Abbot) on Mar 05, 2018 at 20:08 UTC

    Yet another way to approach this sort of problem is to “do it the way they did it when they were using punched cards.” (For instance, how did Herman Hollerith help the US Census, all those years ago?) The answer in this case would be to first concatenate the contents of all 197 files into a single file, then to sort that file. (All of which you can probably do, so far, with no program-writing at all.)

    I believe that you said that there will be about 197 * 8000 records in that file.   Well, after you have sorted it, all occurrences of any given 21mer will now be consecutive, and you can tally them all up with a trivial program that reads the file line-by-line and notices each time the 21mer-value changes.
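
    Such a tally program might look like this -- a sketch, assuming the sorted 21mers arrive one per line on STDIN (e.g. from sorting the merged file):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Tally consecutive identical 21mers in an already-sorted stream.
    my ($prev, $count) = (undef, 0);
    while (my $kmer = <STDIN>) {
        chomp $kmer;
        if (defined $prev && $kmer ne $prev) {
            print "$prev\t$count\n";   # value changed: emit the finished tally
            $count = 0;
        }
        $prev = $kmer;
        $count++;
    }
    print "$prev\t$count\n" if defined $prev;   # flush the final group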

    Yet another approach, though, is far easier: place all of the data into a database table, then simply use an SQL query! No programming is required to obtain the counts that you need, because the SQL engine will figure out what to do and simply do it. (Spreadsheets can also readily take input from a query-result.)

    SELECT MER, COUNT(*) FROM MERS GROUP BY MER
    (SQLite is an excellent engine to use for such purposes because there is no server and its database is a simple file.)

      The problem is a classic use case for counting with hashes, which others have demonstrated very well.

      Delegating it to sort - be it the shell command or the Perl built-in - is a waste of resources, which doesn't scale well.

      Counting is O(n) but sorting is at best O(n log n); plus you are suggesting doubling the needed disk space, which involves costly disk access time.

      The time needed for a) merging, b) writing the sorted file, and c) reading it again can easily exceed that of just reading once and incrementing a hash.

      Additionally, you are not saving much code, because the OP wants to choose the files dynamically.

      If this were a one-shot problem your approach could be OK'ish, but I doubt that's the case in bioinformatics.

      Last but not least:

      If it's possible to use sort, why not just use wc (word count)?

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Wikisyntax for the Monastery

        Delegating it to sort - be it the shell command or the Perl built-in - is a waste of resources, which doesn't scale well.

        Except of course once the hash has become too large for memory. Then a disk-based mergesort (well, mergecount maybe) is the only thing that will still scale, provided that you have enough disk space to write the merged results.

        Counting is O(n) but sorting is at best O(n log n)

        Herr LanX! I believe you to be mistaken! I have it on the good authority of a 30-year veteran for whom the impossible is easy—and the legendary is been there done that—that sorting is, in point of factual experiences in those trenches of yore, actually and exactly more-or-less O(logn). I have no idea who you think you are to disagree with the wisdom of the ages! Harrumph!! Bully!!!1!

        The OP's problem is counting unique elements. Sorting is implicit in that problem, whether by hashing (sort into buckets according to arbitrary hash function), or straight up sort.

        Generic sort is O(n log n), but there is also Counting Sort, Radix Sort, and other forms of distribution sorting that are of high utility in real-world applications. It is important to understand that hashing is not algorithmically superior to sorting, indeed it is a specific form of sorting in disguise.
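
        To illustrate: a counting sort runs in O(n + k) for n items with integer keys in 0..k (the data below is made up):

        use strict;
        use warnings;

        # Counting sort: tally each key, then replay the tallies in key order.
        my @data = (4, 1, 3, 4, 1, 0, 2, 4);
        my @bucket;
        $bucket[$_]++ for @data;                                       # O(n) tally
        my @sorted = map { ($_) x ($bucket[$_] // 0) } 0 .. $#bucket;  # O(k) replay
        print "@sorted\n";                                             # 0 1 1 2 3 4 4 4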

        For another example of a similar problem, and how sort can save the day, see Perl Hashes in C?.

      first concatenate the contents of all 197 files into a single file, then to sort that file

      In other words, you're unfamiliar with merge sort.

      Hardly surprising.

        Actually, this is a perfectly-boring statistical analysis requirement. The "R" programming language would consider it to be nothing more than an aperitif.

        Are you implying that merge sort can somehow work better when fed with separate (unsorted) inputs instead of one (unsorted) input?

        Now I wonder, who's the one unfamiliar with merge sort?
