PerlMonks  


When you are dealing with “huge” amounts of data, everything depends upon ... memory. Do you have it, or do you not? (And if it’s “virtual,” you don’t have it.)

Many computers these days have truly vast amounts of RAM, and there is a roomful of computers thusly equipped. Under these circumstances, the chances are quite good that a tremendous structure can be built in RAM and that all of its pages will be (and will remain) resident. If that is known to be the case, then “in-memory” solutions work just fine, and yes, they do behave very nicely, as BrowserUK points out (with his characteristic I-love-big-fonts flair ...).

What makes “in-memory” solutions truly insidious, especially those based upon random-access data structures such as hashes, is the case where physical RAM is constrained, such that the entire block of data cannot fit into it without incurring page faults. A hash data structure does not exhibit any locality of reference; quite the opposite. Any reference to that structure could (worst-case) incur a page fault, which suddenly transforms the entire algorithm from what you think is a fast, virtually I/O-free operation into one that hammers your paging device to death and brings the entire system to a screeching halt along with it.
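To make the locality point concrete, here is a minimal sketch (in Python, purely for brevity; it does not model Perl's actual hash implementation). It assumes a hypothetical table of 2**20 slots packed 512 to a page, with BLAKE2 as a stand-in hash function, and counts how many distinct pages get touched when 1,000 consecutive keys are placed by position versus by hash.

```python
import hashlib

SLOTS = 1 << 20          # hypothetical table size (assumption)
SLOTS_PER_PAGE = 512     # hypothetical number of slots per memory page (assumption)

def hashed_slot(key: int) -> int:
    """Map a key to a slot with a stand-in hash (not Perl's real one)."""
    digest = hashlib.blake2b(str(key).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % SLOTS

def pages_touched(slots) -> int:
    """Count the distinct pages that a sequence of slot numbers lands on."""
    return len({slot // SLOTS_PER_PAGE for slot in slots})

keys = range(1000)
array_pages = pages_touched(keys)                          # key i lives in slot i
hash_pages = pages_touched(hashed_slot(k) for k in keys)   # key i lives wherever it hashes

print(f"array layout: {array_pages} pages, hashed layout: {hash_pages} pages")
```

Neighbouring keys share a page in the positional layout, whereas the hashed layout spreads them over most of the table. Every one of those extra pages is a potential page fault once RAM runs short.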

If you plot the elapsed time of a virtual-storage system as the stress placed upon it increases, you will observe a line that rises in a nice, more-or-less linear fashion u-n-t-i-l it “hits the wall,” the so-called thrash point. At that instant, the curve suddenly becomes exponential. And that, as I’ve said before (from Ghostbusters), is “real wrath-of-God stuff.”
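A rough back-of-envelope model shows how abrupt that wall is. Everything here is an assumption made for illustration: accesses are uniformly random over the data, a RAM hit costs 100 ns, a page fault costs 5 ms, and the fault rate is simply the fraction of the data that does not fit.

```python
def avg_access_ns(data_pages, ram_pages, ram_ns=100, fault_ns=5_000_000):
    """Average cost of one uniformly random access under the simple model.

    ram_ns and fault_ns are illustrative guesses, not measurements.
    """
    if data_pages <= ram_pages:
        fault_rate = 0.0                      # everything is resident
    else:
        fault_rate = 1.0 - ram_pages / data_pages
    return (1.0 - fault_rate) * ram_ns + fault_rate * fault_ns

RAM_PAGES = 1000
for data_pages in (500, 1000, 1010, 1100, 2000):
    cost = avg_access_ns(data_pages, RAM_PAGES)
    print(f"{data_pages / RAM_PAGES:5.2f}x RAM -> {cost:12,.0f} ns per access")
```

Under these assumed numbers, a data set only 10% larger than RAM already costs thousands of times more per access than one that fits. The model captures only the knee of the curve; a real system fares worse still, because faults also queue up behind one another at the paging device.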

BrowserUK is therefore entirely correct as long as you are well away from the thrash point. (And today, you might well be able to “throw cheap silicon at it” and thereby avoid the thrash point entirely. There is a reason why we have 64-bit systems now; soon to be 128. Chips are cheap.) But the punishment, when and if it happens, is severe because it is exponential.

In passing ... it is quite interesting that sorting a multi-million-record file should take “ten minutes,” which strikes me as quite inexcusable. There are interesting-looking articles here and also here. Also specifically to our point: A Fresh Look at Efficient Perl Sorting, although it does not concern disk sorts.
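For readers who have not met a disk sort, the general shape is: sort manageable chunks in RAM, write each one out as a sorted run, then merge the runs sequentially. The following is a minimal sketch of that idea in Python, not the method of any of the articles mentioned; the function name and `chunk_size` default are made up for illustration, and it assumes the input lines contain no embedded newlines.

```python
import heapq
import itertools
import os
import tempfile

def external_sort(lines, chunk_size=100_000):
    """Yield lines in sorted order, keeping about chunk_size lines in RAM at once."""
    run_paths = []
    it = iter(lines)
    try:
        # Phase 1: sort each chunk in memory and spill it to its own run file.
        while True:
            chunk = sorted(itertools.islice(it, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix=".run", text=True)
            with os.fdopen(fd, "w") as fh:
                fh.writelines(line + "\n" for line in chunk)
            run_paths.append(path)

        # Phase 2: merge the sorted runs; each run is read strictly sequentially.
        files = [open(path) for path in run_paths]
        try:
            runs = [(line[:-1] for line in fh) for fh in files]
            yield from heapq.merge(*runs)
        finally:
            for fh in files:
                fh.close()
    finally:
        for path in run_paths:
            os.remove(path)

print(list(external_sort(["pear", "apple", "fig", "kiwi", "date"], chunk_size=2)))
```

The point of the design is that both phases touch the disk sequentially, which is exactly the access pattern that a paging hash cannot give you.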

A similar situation can happen with regard to accessing indexed files. Once again we are dealing with a random-access data structure, one which may require some n physical I/O operations to retrieve the data, and which rewards locality of reference by caching recently used index pages in RAM while discarding others. Once again we have the “thrashing” phenomenon, albeit of a different kind and source. Plentiful memory tends to mask the problem once again. (Operating systems will dedicate leftover memory to file buffering when there is no other competition for the space.)
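The reward for locality can be seen in a toy simulation. This sketch assumes a deliberately simplified, single-level index in which each page covers 100 consecutive keys, fronted by a hypothetical LRU cache of 8 pages; real indexed files are more elaborate, but the effect is the same in kind. It counts physical page reads for the same lookups issued in random order and in sorted order.

```python
import random
from collections import OrderedDict

def index_page_reads(keys, keys_per_page=100, cache_pages=8):
    """Count page reads needed to look up keys through a small LRU page cache."""
    cache = OrderedDict()            # page number -> True, oldest entry first
    reads = 0
    for key in keys:
        page = key // keys_per_page
        if page in cache:
            cache.move_to_end(page)  # cache hit: mark as most recently used
        else:
            reads += 1               # cache miss: a physical read
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)   # evict the least recently used page
    return reads

rng = random.Random(42)              # fixed seed so that runs are repeatable
lookups = [rng.randrange(1_000_000) for _ in range(100_000)]

random_reads = index_page_reads(lookups)
sorted_reads = index_page_reads(sorted(lookups))
print(f"random order: {random_reads} reads, sorted order: {sorted_reads} reads")
```

In sorted order each page is read at most once, so the read count cannot exceed the number of pages; in random order nearly every lookup misses the tiny cache. Enlarging `cache_pages` until the whole index fits makes the difference vanish, which is the “plentiful memory masks the problem” effect.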

When and if you hit a thrash-point problem, you will know. The difference can be a matter of many hours, or the difference between a job that finishes and one that does not. “Ten minutes” (or more...) becomes an acceptable price to pay when, for example, you are talking about a massive runs-through-the-night production batch job. And those, really, are the kind of situations I am talking about. Not the size of problem that can be effectively dealt with by buying more chips. Obviously, “if you’ve got the RAM, flaunt it.”


In reply to Re: "Just use a hash": An overworked mantra? by sundialsvc4
in thread "Just use a hash": An overworked mantra? by davido
