Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Re: Out of Memory when generating large matrix

by sundialsvc4 (Abbot)
on Mar 05, 2018 at 20:08 UTC ( #1210376=note: print w/replies, xml ) Need Help??


in reply to Out of Memory when generating large matrix

Yet another way to approach this sort of problem is to “do in the way that they did it when they were using punched cards.”   (For instance, how did Herman Hollerith help the US Census, all those years ago?)   The answer in this case would be to first concatenate the contents of all 197 files into a single file, then to sort that file.   (All of which you can probably do, so far, with no program-writing at all.)  

I believe that you said that there will be about 197 * 8000 records in that file.   Well, after you have sorted it, all occurrences of any given 21mer will now be consecutive, and you can tally them all up with a trivial program that reads the file line-by-line and notices each time the 21mer-value changes.

Yet another approach, though, is far easier:   place all of the data into a database table, then simply use an SQL query!   No programming is required to obtain the counts that you need, because the SQL engine will figure out what to do and simply do it.   (Spreadsheets can also readily take input from a query-result.)

SELECT MER, COUNT(*) FROM MERS GROUP BY MER
(SQLite is an excellent engine to use for such purposes because there is no server and its database is a simple file.)

Replies are listed 'Best First'.
Re^2: Out of Memory when generating large matrix
by LanX (Archbishop) on Mar 06, 2018 at 00:30 UTC
    The problem is a classic use cases for counting with hashes, which others demonstrated very well.

    Delegating it to sort - be it the shell command or the Perl built-in - is a waste of resources, which doesn't scale well.

    Counting is O(n) but sort at best O(n log(n)) plus you suggest doubling the needed disc space, involving costly disc access time.

    The time needed for a) merging and b) reading again and c) writing the sorted file can easily take more time than just reading once and incrementing a hash.

    Additionally you are not saving much code, because the OP wants to chose the files dynamically.

    If this was a one-shot problem your approach could be OK'ish, but I doubt that's the case in Bio Inf.

    Last but not least

    If it's possible to use sort, why not just using wc (word count) ?

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Wikisyntax for the Monastery

      Delegating it to sort - be it the shell command or the Perl built-in - is a waste of resources, which doesn't scale well.

      Except of course once the hash has become too large for memory. Then a disk-based mergesort (well, mergecount maybe) is the only thing that will still scale, provided that you have enough disk space to write the merged results.

        Sure, but the OP is talking about counting thru 197x8000 records, that's at max 1.5 million hash entries when every entry is just counted once, IIRC that'll result in 150 MB RAM (at max, that's a pathological case)

        IF the RAM wasn't even sufficient for counting, presorting a giant file wouldn't help. (wc could help but an output with 1.5 million colons should be avoided...)

        I'd surely opt for a DB like SQLite.

        But I suppose only some thousands of the most frequent K-mers are of interest.

        (I can imagine a solution with Hash Of Hashes for counting. Only the most relevant hashes are kept in memory while the others are swapped out, but I this would extent the scope of this thread.)

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Wikisyntax for the Monastery

      Counting is O(n) but sort at best O(n log(n))

      Herr LanX! I believe you to be mistaken! I have it on the good authority of a 30-year veteran for whom the impossible is easy—and the legendary is been there done that—that sorting is, in point of factual experiences in those trenches of yore, actually and exactly more-or-less O(logn). I have no idea who you think you are to disagree with the wisdom of the ages! Harrumph!! Bully!!!1!

      The OP's problem is counting unique elements. Sorting is implicit in that problem, whether by hashing (sort into buckets according to arbitrary hash function), or straight up sort.

      Generic sort is O(n log n), but there is also Counting Sort, Radix Sort, and other forms of distribution sorting that are of high utility in real-world applications. It is important to understand that hashing is not algorithmically superior to sorting, indeed it is a specific form of sorting in disguise.

      For another example of a similar problem, and how sort can save the day, see Perl Hashes in C?.

        > but there is also Counting Sort, Radix Sort, and other forms of distribution sorting that are of high utility in real-world applications.

        These algorithms are all limited to elements of fixed length (which is the case), but Sundial suggested unix sort which knows nothing about the nature of the input.

        And did you try to look into the space complexity of your suggestions?

        Counting Sort requires 4**21 buckets. Seriously ???

        Radix Sort is still the best in this respect, but has O(l*n) with l length of alphabet (here 21). Even worse identical elements are in worst case only identified in the last step, requiring to be able to hold all elements in memory, while shifting them all l times from bucket to bucket.

        A hash count can go thru different files without keeping them all in memory, because identical elements collapse into one count.

        Last but not least, a hash count is much easier implemented.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Wikisyntax for the Monastery

        > Generic sort is O(n log n), but there is

        The best generic sorts are O(n log n) (worst case matters), SD was seemingly talking about the command line utility sort , which is generic .

        > It is important to understand that hashing is not algorithmically superior to sorting, indeed it is a specific form of sorting in disguise.

        I disagree because as I already said counting by hashing is one pass and is loosing any order information.

        Sort is about ordering and I don't see a way to make this in one pass.

        But I'd be interested to see your evidence about how the information loss from hashing can be compensated ...

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Wikisyntax for the Monastery

        It is important to understand that hashing is not algorithmically superior to sorting, indeed it is a specific form of sorting in disguise.

        What utter baloney! Come out from your cloak and let's argue?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
        In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
        > counting unique elements

        What is that supposed to mean?

        The elements are by definition unique IFF their count is one!

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Wikisyntax for the Monastery

Re^2: Out of Memory when generating large matrix
by Anonymous Monk on Mar 05, 2018 at 20:22 UTC
    first concatenate the contents of all 197 files into a single file, then to sort that file

    In other words, you're unfamiliar with merge sort.

    Hardly surprising.

      Actually, this is a perfectly-boring statistical analysis requirement. The "R" programming language would consider it to be nothing more than an aperitif.

        Sounds neat; and trivial. Let's see the code!

      Are you implying that merge sort can somehow work better when fed with separate (unsorted) inputs instead of one (unsorted) input?

      Now I wonder, who's the one unfamiliar with merge sort?

        > instead of one (unsorted) input?

        The merging doesn't happen at zero cost and you need to calculate the total.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Wikisyntax for the Monastery

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1210376]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (5)
As of 2019-07-16 05:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?