Beefy Boxes and Bandwidth Generously Provided by pair Networks vroom
Welcome to the Monastery
 
PerlMonks  

Re^2: sorting type question- space problems

by BrowserUk (Pope)
on Sep 14, 2013 at 14:27 UTC ( #1054115=note: print w/ replies, xml ) Need Help??


in reply to Re: sorting type question- space problems
in thread sorting type question- space problems

That will never work. It will run the machine out of memory.

It will attempt to build 4 huge lists:

  1. a list of integers, one for every record in the file.

    Input to the first map.

  2. a list of anonymous arrays -- each containing two elements -- one for every record in the file.

    Output from the map, input to sort.

  3. Same again, but reordered.

    Output from sort, input to second map.

  4. An ordered list of all the records.

    Output from second map, input to for.

If the OPs records average 64 bytes, that would require:

  • 4 * 20GB = 80GB
  • + 6 * 335544320 * 24 bytes (for the scalars) = 45GB
  • + 335544320 * 232 bytes (for the anonymous arrays ) = 72.5GB
  • + lots of other bits and pieces of overhead within Tie::File and Perl's internals stacks = ??

For a grand total memory requirement of at least 197.5 Gigabytes


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Comment on Re^2: sorting type question- space problems
Re^3: sorting type question- space problems
by kcott (Abbot) on Sep 15, 2013 at 03:36 UTC

    Thankyou for this analysis. I have some points to raise and some questions. At the outset, I'd like to say that this isn't intended as a nitpicking exercise (although it may look like that in parts); instead, I'm more interested in learning from it.

    "It will attempt to build 4 huge lists:"

    Possibly wishful thinking on my part, but I had imagined that as each list was used up, its data would no longer persist and the memory it used could be reclaimed.

    To clarify, the first map generates a list then passes the elements of that list onto sort which, in turn, generates another list which is passed to the second map, and so on. When, say, the second map is being processed, $_ is an alias into the sort list (so sort's data is still in use); however, at this point, the list generated by the first map is not being used by anything and its memory could, at least in theory, be reused without causing problems. Anyway, that was my thinking: I haven't seen any documentation indicating whether this is how it works - I could be entirely wrong.

    "1. a list of integers, one for every record in the file. Input to the first map."

    Quite right. I did think about this when coding it, but clearly misremembered whay I'd previously read. Going back to "perlop: Range Operators", I see "... no temporary array is created when the range operator is used as the expression in foreach loops, ...": I had thought that was more of a general rule for generating lists with the ".." operator (clearly, it isn't). So in retrospect, "for (0 .. $#input_records) {...}", instead of "... 0 .. $#input_records;", would have removed this overhead and been a better choice.

    "2. & 3. a list of anonymous arrays -- each containing two elements ... (map1 to sort and sort to map2)"

    Actually, that's three elements (numbers): integer, string, string. Looking back at it, I can see that changing "substr $_, 3" to "0 + substr $_, 3" would have produced three integers which would've saved some memory.

    "4. An ordered list of all the records. Output from second map, input to for."

    That's not a list of the records, it's a list of integers (i.e. (0 .. $#input_records) after being sorted). Possibly what you meant, but I thought I'd just clear that up in case it wasn't.

    "If the OPs records average 64 bytes, that would require:"

    The OP has indicated "there are abbout 5 mil records (lines) which sum up to 20GB in total." [sic]. I appreciate that was possibly written while you were composing your reply. Anyway, that gives an average closer to 4,295 bytes (20 * 1024**3 / 5e6); and your calculated ~335 million records (instead of the given ~5 million records) will mean the results will be out by a couple of orders of magnitude (where that figure is used).

    In the dot points that followed (with the various calculations), I have some questions regarding where some of the numbers came from. Where appropriate, if you could point me to documentation (e.g. some section in perlguts), that would be preferable as I'd be interested in reading about general cases rather than just specifics relating to this question: that might also be a lot less typing for you.

    • "4 * 20GB = 80GB": What does the 4 refer to? (I've assumed 20GB is the file size.)
    • "+ 6 * 335544320 * 24 bytes (for the scalars) = 45GB": I've guessed the 6 is from the earlier numbered points (1+2+2+1); if so, that should be 8 (1+3+3+1). 335,544,320 is the number of records which, as noted above, should be 5,000,000. I thought 24 might be the size of a scalar, but some of those are integers (I thought they would've been 8), so I'm not sure on this (and it would be one I'd be interested in documentation for).
    • "+ 335544320 * 232 bytes (for the anonymous arrays ) = 72.5GB": 335544320 (number of records, already mentioned). 232 has me stumped (possibly another doco link here).
    • "+ lots of other bits and pieces ...": Agreed. Unknown.

    -- Ken

      (I'd be interested in documentation for ... possibly another doco link here)

      See Perlguts Illustrated. In the light of the OPs later information regarding the number and size of his records; and with careful consideration of the build version of Perl he is using, a more accurate estimate of the total memory requirement can be calculated.

      But the point is simply that even using rough estimates, it is easy to see that unless the OP (or you) have access to some particularly high end hardware; regardless of whether the memory overuse is 5x or 50x more than the typical 4GB to (say) 24GB available on most machines; it ain't gonna fly.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Thanks for the link.

        -- Ken

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1054115]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (9)
As of 2014-04-19 23:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (485 votes), past polls