PerlMonks  

Re^4: Out of Memory when generating large matrix (space complexity)

by LanX (Saint)
on Mar 07, 2018 at 01:31 UTC ( [id://1210440]=note )


in reply to Re^3: Out of Memory when generating large matrix
in thread Out of Memory when generating large matrix

> but there is also Counting Sort, Radix Sort, and other forms of distribution sorting that are of high utility in real-world applications.

These algorithms are all limited to elements of fixed length (which is the case here), but Sundial suggested Unix sort, which knows nothing about the nature of the input.

And did you try to look into the space complexity of your suggestions?

Counting Sort requires 4**21 buckets. Seriously ???
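
To put a number on that, here is a quick back-of-the-envelope sketch. The 4-byte counter size is an assumption made only for illustration, and `bucket_count` is a hypothetical helper name; `%.0f` is used so the output does not depend on the integer width of the Perl build.

```perl
use strict;
use warnings;

# Number of buckets a counting sort needs: one per possible key.
sub bucket_count {
    my ($alphabet_size, $key_length) = @_;
    return $alphabet_size ** $key_length;
}

# From the post: 4 symbols, keys of length 21.
my $buckets = bucket_count(4, 21);

# Assumption: one 4-byte counter per bucket.
my $bytes = $buckets * 4;

printf "%.0f buckets, about %.0f TiB of counters\n", $buckets, $bytes / 2**40;
# prints: 4398046511104 buckets, about 16 TiB of counters
```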

Radix Sort is still the best in this respect, but it is O(l*n), with l the key length (here 21). Even worse, identical elements are in the worst case only identified in the last step, so you must be able to hold all elements in memory while shifting them all l times from bucket to bucket.
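
The repeated shifting is easy to see in a minimal sketch of an LSD radix sort for equal-length strings. The alphabet A, C, G, T is an assumption (the post only implies four symbols), and `radix_sort` is a hypothetical name. Note that the whole list lives in memory and every element is moved once per key position.

```perl
use strict;
use warnings;

# LSD radix sort for strings that all have the same length.
# Assumption: the alphabet is A, C, G, T.
sub radix_sort {
    my ($length, @items) = @_;
    my @alphabet = qw(A C G T);

    # One pass per key position, last position first.
    for my $pos (reverse 0 .. $length - 1) {
        my %bucket = map { $_ => [] } @alphabet;
        for my $item (@items) {
            my $symbol = substr $item, $pos, 1;
            exists $bucket{$symbol}
                or die "Unexpected symbol '$symbol' in '$item'";
            push @{ $bucket{$symbol} }, $item;
        }
        # Concatenate the buckets in alphabet order (stable).
        @items = map { @{ $bucket{$_} } } @alphabet;
    }
    return @items;
}

print join(' ', radix_sort(3, qw(GAT ACA TTT ACA CAG))), "\n";
# prints: ACA ACA CAG GAT TTT
```

The two copies of ACA only end up next to each other for good after the final pass, which is the point made above.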

A hash count can go thru different files without keeping them all in memory, because identical elements collapse into one count.
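
A minimal sketch of such a hash count, assuming one key per line; `count_keys` is a hypothetical helper name. Memory grows with the number of distinct keys, not with the total number of lines read.

```perl
use strict;
use warnings;

# Count how often each key occurs across several files,
# reading one line at a time.
sub count_keys {
    my (@files) = @_;
    my %count;
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $key = <$fh>) {
            chomp $key;
            next if $key eq '';    # skip blank lines
            ++$count{$key};        # identical keys collapse into one entry
        }
        close $fh;
    }
    return \%count;
}

# Usage: perl count.pl file1 file2 ...
if (@ARGV) {
    my $count = count_keys(@ARGV);
    print "$_\t$count->{$_}\n" for sort keys %$count;
}
```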

Last but not least, a hash count is much easier to implement.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Wikisyntax for the Monastery

Replies are listed 'Best First'.
Re^5: Out of Memory when generating large matrix (space complexity)
by Anonymous Monk on Mar 07, 2018 at 13:25 UTC

    Well, I glanced at "(space complexity)" in the title and thought there is a glimmer of hope for you, yet.

    You have identified the problem area, but again deftly avoid seeing the light. Sorting (or deduplicating) is a problem with O(n log n) time complexity. If you have a hash function that successfully distributes the keys, you can cut down the problem and move some of the complexity into the space domain.

    Hashes are O(n) both in time and space complexity (list insertion). Streaming merge is O(1) in space complexity. Partial hashing is possible. Using a hash table of size k, you can modify the algorithm to achieve O(n log(n/k)) in time complexity, and O(k) in space complexity. The k scales well until you break out of the CPU caches, after which it scales rather poorly. I referenced another thread where someone ran into a brick wall trying to hash just 36M elements. Sort|uniq proved to be greatly superior in that case.
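
One way to read "partial hashing" is sketched below: count into a hash that never holds more than k distinct keys, flush it as a sorted run whenever it fills up, then combine the runs with a streaming merge. This is an illustration under stated assumptions, not the poster's code: keys come one per line and contain no tabs, the names `flush_run`, `next_record` and `bounded_count` are hypothetical, and the simple merge loop makes no attempt to reach the time bound quoted above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the partial counts as one sorted run; return its handle, rewound.
sub flush_run {
    my ($count) = @_;
    my $fh = tempfile();    # scalar context: handle only, file cleaned up
    print {$fh} "$_\t$count->{$_}\n" for sort keys %$count;
    seek $fh, 0, 0 or die "seek failed: $!";
    return $fh;
}

# Next [key, count] record of a run, or undef at end of file.
sub next_record {
    my ($fh) = @_;
    defined(my $line = <$fh>) or return undef;
    chomp $line;
    return [ split /\t/, $line, 2 ];
}

# Count keys from $in with at most $k distinct keys in the hash at once,
# then merge the runs, calling $emit->($key, $total) in key order.
sub bounded_count {
    my ($in, $k, $emit) = @_;
    my (%count, @runs);
    while (my $key = <$in>) {
        chomp $key;
        ++$count{$key};
        if (keys(%count) >= $k) {
            push @runs, flush_run(\%count);
            %count = ();
        }
    }
    push @runs, flush_run(\%count) if %count;

    # Streaming merge: only the head record of each run is in memory.
    my @head = map { scalar next_record($_) } @runs;
    while (1) {
        my ($min) = sort { $a cmp $b } map { $_->[0] } grep { defined $_ } @head;
        last unless defined $min;
        my $total = 0;
        for my $i (0 .. $#head) {
            next unless defined($head[$i]) && $head[$i][0] eq $min;
            $total += $head[$i][1];
            $head[$i] = next_record($runs[$i]);
        }
        $emit->($min, $total);
    }
    return;
}

# Usage: perl bounded_count.pl 1000000 < keys.txt
if (@ARGV) {
    bounded_count(\*STDIN, $ARGV[0], sub { print "$_[0]\t$_[1]\n" });
}
```

With k at least as large as the number of distinct keys, this degenerates into a plain hash count with a single run; with a small k it trades memory for extra passes over temporary files.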

    So far, you have

    • veered the topic into a discussion of algorithmic complexity, where no one really asked for it.
    • misapplied the big O notation. Big O does not say whether one solution is faster than another. It tells you how a problem scales.
    • made the incorrect statement that a hash based solution would scale better than merge sort. In practice, hashing does not scale beyond memory limits.
    • made a ludicrously inappropriate suggestion of using wc; this suggests you did not invest the cycles necessary to understand the problem, let alone offer a solution.
    • applied a technique (hashing) as a magic bullet, without a fundamental grasp of the subject matter. This is Cargo Cult by definition.

    By the way, I never argued that a hash count was unsuitable. By all means, ++$count{$key} if that works. But you chose to attack a broken clock, and forgot that a broken clock, too, is right two times a day.

      > glimmer of hope for you,

      Wow, you are so humble and your posts so clear and understandable.

      No wonder you prefer to post anonymously.

      (closed :)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Wikisyntax for the Monastery
