http://www.perlmonks.org?node_id=1210450


in reply to Re^4: Out of Memory when generating large matrix (space complexity)
in thread Out of Memory when generating large matrix

Well, I glimpsed "(space complexity)" in the title and thought there was a glimmer of hope for you yet.

You have identified the problem area, but again deftly avoid seeing the light. Sorting (or deduplicating) is a problem with O(n log n) time complexity. If you have a hash function that distributes the keys evenly, you can cut the problem down and move some of the complexity into the space domain.
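
To make that concrete, here is a minimal sketch in Perl: scatter the keys into k bucket files by a hash of the key, then deduplicate one bucket at a time. Each bucket holds roughly n/k keys, so peak memory shrinks by the same factor. The bucket count, the file names, and the choice of Digest::MD5 as the distributing hash are my assumptions, not taken from your code.

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $buckets = 64;    # k partitions; tune to taste

    # Pass 1: scatter the keys across k bucket files by hash value.
    my @fh;
    for my $i (0 .. $buckets - 1) {
        open $fh[$i], '>', "bucket.$i" or die "bucket.$i: $!";
    }
    while (my $key = <STDIN>) {
        chomp $key;
        my $b = unpack('N', md5($key)) % $buckets;
        print { $fh[$b] } "$key\n";
    }
    close $_ for @fh;

    # Pass 2: deduplicate one bucket at a time; only ~n/k keys
    # are ever held in memory at once.
    for my $i (0 .. $buckets - 1) {
        my %seen;
        open my $in, '<', "bucket.$i" or die "bucket.$i: $!";
        while (my $line = <$in>) {
            print $line unless $seen{$line}++;
        }
        close $in;
    }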

Hashes are O(n) in both time and space complexity (list insertion). A streaming merge is O(1) in space complexity. Partial hashing is possible, too: using a hash table of size k, you can modify the algorithm to achieve O(n log(n/k)) time complexity and O(k) space complexity. The k scales well until you break out of the CPU caches, after which it scales rather poorly. I referenced another thread where someone ran into a brick wall trying to hash just 36M elements; sort|uniq proved greatly superior in that case.
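
A sketch of that partial-hashing variant follows; the capacity k, the run file names, and the closing sort -m step are my assumptions. Count keys in a bounded hash; whenever it reaches k entries, dump the counts as a sorted run and start over. Because every run is sorted, the runs can then be merged as streams in O(1) space.

    use strict;
    use warnings;

    my $k = 1_000_000;    # hash capacity; keep it cache-friendly
    my %count;
    my $runs = 0;

    # Write the current counts out as a sorted run, then empty the hash.
    sub flush_run {
        return unless %count;
        my $file = 'run.' . $runs++;
        open my $out, '>', $file or die "$file: $!";
        print {$out} "$_\t$count{$_}\n" for sort keys %count;
        close $out;
        %count = ();
    }

    while (my $key = <STDIN>) {
        chomp $key;
        $count{$key}++;
        flush_run() if keys(%count) >= $k;
    }
    flush_run();

    # Each run is sorted, so "sort -m run.*" merges them as streams;
    # summing the counts of adjacent identical keys finishes the job.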

So far, you have

By the way, I never argued that a hash count was unsuitable. By all means, ++$count{$key} if that works for you. But you chose to attack a broken clock, forgetting that even a broken clock is right twice a day.
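
For the record, that hash count in its simplest form, with names of my own choosing:

    my %count;
    $count{$_}++ for @keys;    # one pass; O(n) space, as noted above
    my @duplicates = grep { $count{$_} > 1 } keys %count;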