Re^5: Out of Memory when generating large matrix

by Anonymous Monk
on Mar 06, 2018 at 16:48 UTC (#1210416)


in reply to Re^4: Out of Memory when generating large matrix
in thread Out of Memory when generating large matrix

Hashing in a nutshell: apply hash function f() to the keys, bucket the data records accordingly. Where a radix sort would use part of the key directly (like a hash function that just masks bits), hashing picks a more complicated function. So there's a tradeoff. Your data is no longer sorted by the key, but by f(key). On the other hand, you get a flat distribution that makes the bucketing work.

Can you truly not see the similarity between distribution sort and hashing?
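
To make the comparison concrete, here is a minimal sketch (in Python for brevity) of the two bucketing styles described above. The function names, the 8-bit key width and the use of `zlib.crc32` as a stand-in for f() are my own assumptions, not anything from the thread.

```python
import zlib

def bucket_by_top_bits(keys, key_bits=8, bucket_bits=2):
    """Radix-style bucketing: use the top bits of the key directly."""
    buckets = [[] for _ in range(1 << bucket_bits)]
    shift = key_bits - bucket_bits
    for k in keys:
        buckets[k >> shift].append(k)
    return buckets

def bucket_by_hash(keys, bucket_bits=2):
    """Hash bucketing: pass the key through f() first, then mask bits."""
    buckets = [[] for _ in range(1 << bucket_bits)]
    mask = (1 << bucket_bits) - 1
    for k in keys:
        h = zlib.crc32(k.to_bytes(8, "little"))  # stand-in for f()
        buckets[h & mask].append(k)
    return buckets

# Skewed 8-bit keys: every one is below 64, so the top two bits are always 0.
keys = [1, 2, 3, 5, 8, 13, 21, 34]
by_radix = bucket_by_top_bits(keys)  # everything lands in bucket 0
by_hash = bucket_by_hash(keys)       # placement depends on f(key), not on key order
```

With the radix-style version, concatenating the buckets keeps the records grouped in key order, but a skewed key set can pile into one bucket. With the hashed version the order follows f(key) instead, which is the tradeoff described above.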

Re^6: Out of Memory when generating large matrix
by BrowserUk (Pope) on Mar 06, 2018 at 18:06 UTC

    Once you move outside of academia and thesis, it isn't the algorithm, but the implementation that is important. A mergesort programmed badly can be much slower than a bubble sort done well.

    And once you recognise that in the real world, implementation is king, any kind of disk based sort is glacial compared to a memory-based hash.

    It isn't the similarities, but the differences that are important.

    A stately home and a plane both have wings, windows and seats, but the differences outweigh those similarities for most practical considerations.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
      I liked this post from BrowserUK and up-voted it.

      Implementation is indeed "king"!
      One problem with theoretical "O-n" notation is "how expensive is an O?"

      I remember one of my first programming assignments on 1960's hardware.
      We were using wire-wrap technology for H/W prototypes. The basic software task was to sort thousands of punch cards and produce an output.

      We had a port of our mainframe code that would run on our lab machine.
      But it took 6 hours to run!
      It used the minimum number of compares between card images, but it was very, very slow.

      Using a bi-directional indexed bubble sort and a fancy merge, I was able to reduce the time from 6 hours to 5 seconds!
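
      The post doesn't show the original code, so the following is only a sketch of what a bidirectional bubble sort (a "cocktail shaker" sort) over an index table might look like. Reading "indexed" as "swap small indices rather than whole card images" is my assumption, the function name is hypothetical, and the merge step is not shown.

```python
def cocktail_index_sort(records, key=lambda r: r):
    """Return indices into `records` in ascending key order.

    Bidirectional bubble sort that swaps entries of an index table
    instead of moving the records themselves.
    """
    idx = list(range(len(records)))
    lo, hi = 0, len(idx) - 1
    while lo < hi:
        # Forward pass: push the largest remaining key towards `hi`.
        last_swap = lo
        for i in range(lo, hi):
            if key(records[idx[i]]) > key(records[idx[i + 1]]):
                idx[i], idx[i + 1] = idx[i + 1], idx[i]
                last_swap = i
        hi = last_swap  # everything above the last swap is already in place
        # Backward pass: push the smallest remaining key towards `lo`.
        last_swap = hi
        for i in range(hi, lo, -1):
            if key(records[idx[i - 1]]) > key(records[idx[i]]):
                idx[i - 1], idx[i] = idx[i], idx[i - 1]
                last_swap = i
        lo = last_swap  # everything below the last swap is already in place
    return idx

cards = ["delta", "alpha", "charlie", "bravo"]
order = cocktail_index_sort(cards)
in_order = [cards[i] for i in order]
```

      Shrinking both ends to the position of the last swap lets each pass skip the part that is already sorted, which helps most when the input arrives nearly in order.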

      That doesn't seem possible, but it was possible.
      These ancient machines with 24K words of memory were slow. My coffee pot probably has a faster processor, albeit with not as much memory?!

      I understood the problem very well.
      My code had no O/S or file system.
      Essentially, I wrote it on the "bare metal".
      Yes, this was a "one trick pony", but it could do its trick very, very well.
      I could calculate partial results as the punch cards were read in, while still allowing the card reader to run at full speed.
      On the output, I could calculate results fast enough so that the ancient shuttle line printer ran at its maximum rate.
      The 5 second number is the "dead time" when no I/O is happening at the max rate.

        I have a very similar, before-the-dawn-of-time story -- that I'm sure I've mentioned here before and probably in response to a previous sundial "system sort" solution.

        (From long ago memory, so the details may be fuzzy.) 60 million records sorted on 7 (or 9) keys took 2 weeks on twin PDP-11/60s.

        Reversing the order of the keys reduced the total time to (I think) less than a day.

        The reason: the way the records were stored, the original key order meant doing a seek for every next record, and for almost every sub sort.

        Reversing the keys meant the first pass read the records sequentially. Having grouped records by that key, subsequent subsorts tended to only reorder within a small group of records that tended to be close to each other; hence far fewer disk/memory cache misses.
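
        A toy model of that effect, under heavy assumptions: three hypothetical key fields (a, b, c), records laid out on "disk" grouped by c, then b, then a, and a "seek" counted whenever a read does not hit the slot right after the previous one. None of these details come from the original system; the sketch only shows how key order changes the access pattern.

```python
from itertools import product

def count_seeks(storage_order, read_order):
    """Count reads that do not hit the slot right after the previous read."""
    slot = {rec: i for i, rec in enumerate(storage_order)}
    seeks, prev = 0, None
    for rec in read_order:
        pos = slot[rec]
        if prev is None or pos != prev + 1:
            seeks += 1
        prev = pos
    return seeks

# 64 hypothetical records with key fields (a, b, c), stored grouped by c, b, a.
records = list(product(range(4), repeat=3))
on_disk = sorted(records, key=lambda r: (r[2], r[1], r[0]))

by_abc = sorted(records, key=lambda r: (r[0], r[1], r[2]))  # "original" key order
by_cba = sorted(records, key=lambda r: (r[2], r[1], r[0]))  # reversed key order

seeks_abc = count_seeks(on_disk, by_abc)  # every read jumps: 64 seeks
seeks_cba = count_seeks(on_disk, by_cba)  # one initial seek, then sequential
```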

        Another big timesaver, applied before the big final mergesort, was to arrange for temporary spill files to be written to "the other" disk pack from whichever one held the file being processed. It applied to pretty much every process, and cut most of their run times in half.

        It's hard to believe now that in my working lifetime it could have taken a month (before both changes) to sort 60 million records. (That was "big data" back then :) )

        It's like something out of a Victorian novel where they describe it taking 3 days from London to Bath and 10 days to York.


        Great, interesting post!

        My coffee pot probably has a faster processor, albeit with not as much memory?!

        I took the screwdriver to our coffee pot, but the wife wasn't too pleased that I was going to rip it apart and compare the memory size to that of some of my microcontrollers. By "wasn't too pleased", I mean she grabbed a hammer and said if I proceed, she's heading up to my lab and going to start doing her own "testing" :D
