
Re^12: Strange memory leak using just threads (forks.pm)

by BrowserUk (Pope)
on Sep 22, 2010 at 10:05 UTC


in reply to Re^11: Strange memory leak using just threads (forks.pm)
in thread Strange memory leak using just threads

I was talking about money actually: a 256-core system will be very expensive.

I'm not sure about the relevance of that?

When it is affordable--it's already available; the IBM Power 795 can have 256 cores (and 4 hardware threads per core), giving a 1024-thread processor in a box--a threaded solution will port to it with the change of one number.

Two years or so ago, 4 cores were horribly expensive. I now have one sitting in front of me that cost about the same as a high end smartphone!

Next year will see the release of 16-core commodity processors, some with 2-way hyper-threading. When my next "refresh" cycle comes around in 2012, I'll be looking for 16 cores at a dirt-cheap price. I'll probably have to wait until towards the end of the year, rather than the beginning. I'll also be looking to have a 512/1024-core GPU in the same box for the same price.

I know POE can span clusters, but clusters don't make sense, other than as a stop-gap solution until a multi-core box fitting your aspirations or price becomes available.

For why, there is a wonderful example of the problem cited in an article I read just today. The fourth paragraph (starting "For example") is the crux of the cluster problem (albeit the example there concerns GPUs).

Just as scaling with processes tops out very quickly because of the costs of IPC, so clustering tops out even more quickly because of the even higher costs of inter-box network comms (INC). And as processors become more efficient at processing a given volume of data, the ratio of non-productive IPC and INC to useful CPU work grows.

And the tests are not correct, BTW. The time required for creating a new thread depends on the size of all existing variables.

Yes. That is an annoying detail of the ithreads implementation. But, it is quite easy to avoid; you just spawn your workers early, and have them require rather than use what they (individually) need.
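
For instance, a minimal sketch of that pattern (the module choices here are mine, purely illustrative):

    #!/usr/bin/perl
    # Spawn-early pattern: create worker threads while the process is
    # still small, and let each worker require only what it needs.
    use strict;
    use warnings;
    use threads;

    sub worker {
        # Loaded inside the thread at run time, so nothing already in
        # memory gets needlessly cloned at spawn time.
        require Digest::MD5;
        return Digest::MD5::md5_hex( $_[0] );
    }

    # Spawn while the main thread is still small...
    my @workers = map { threads->create( \&worker, "item $_" ) } 1 .. 4;

    # ...then load whatever heavyweight stuff only the main thread uses.
    require POSIX;

    print $_->join, "\n" for @workers;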

But forks face a similar problem--indeed, the copy problem exists in ithreads specifically because of the attempt to make threading look like fork. If you need to share data between forks, there is still a "duplication penalty", although it is disguised by coming at use-time rather than spawn-time.

COW may appear to avoid the need to duplicate data memory, but it just means it gets duplicated piecemeal on use, rather than in one lump up front. Even if it is "read-only" in the sense of your program, it is often modified by simple "read-only" accesses.

Use $#array and a 4k page of COW'd memory gets copied. Use a regex on a string that alters its pos, and another 4k page gets copied, even if the string is only 4 characters. Use a single scalar, instantiated as a number, in a string context, and another 4k page gets copied. Use each, keys, or values on a hash, and (at least) another 4k gets copied.

These piecemeal requests for another single page of VM, each followed by a 4k copy through ring 0, add up to far more than an up-front single request for as many pages of VM as are required, followed by a single ring 3 rep movs.
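
As a concrete illustration of one of those hidden writes, here is a small sketch of mine (not from the original discussion) using the core Devel::Peek module to show that merely reading a number in string context writes to its SV:

    #!/usr/bin/perl
    # A pure IV scalar acquires a cached PV the first time it is used
    # in string context: a "read-only" access that writes to the SV,
    # which under COW would dirty (and so copy) the page containing it.
    use strict;
    use warnings;
    use Devel::Peek;    # core module; Dump() prints SV guts to STDERR

    my $n = 42;         # created as a plain integer (IV)
    Dump($n);           # FLAGS show IOK only; no PV slot yet

    my $s = "$n";       # merely *reading* $n in string context...
    Dump($n);           # ...FLAGS now include POK, and a PV is attached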


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^13: Strange memory leak using just threads (forks.pm)
by tirwhan (Abbot) on Sep 22, 2010 at 13:25 UTC
    Use $#array and a 4k page of COW'd memory gets copied...and (at least) another 4k gets copied

    This is only true if all these variables reside in different memory pages; otherwise just one page is copied (and afterwards, the copy is modified). As such, this sequence of events is misleading.

    These piecemeal requests for another single page of VM, each followed by a 4k copy through ring 0, add up to far more than an up-front single request for as many pages of VM as are required, followed by a single ring 3 rep movs.

    Umm, when put as the general case, no. Moving stuff around in RAM is expensive. The CPU/MMU operations required to throw a page fault and allocate memory are insignificant compared to the time it takes to actually copy the data in memory. Which means there is only a very slight difference (over the lifetime of a process) between copying the whole process address space in one move and doing so whenever a page is dirtied.

    Also, for most use cases, the process will not have to read/write its entire memory throughout its lifetime. For the far more common scenario, where part of the memory remains untouched and therefore shared, copy-on-write is far more efficient than copy-all-at-once (and that is only talking about speed; it is even more obviously true for process memory consumption).


    All dogma is stupid.
      otherwise just one page is copied (and afterwards, the copy is modified). As such, this sequence of events is misleading.

      The problem with this view is that it kind of implies that (say) iterating a small array will only entail copying one 4k block. The reality is that the AV is likely to be allocated in a different block to the xpvav; which is quite likely to be in a different block to the SV with # magic; which is in a different page to the SV* array containing the allocated elements; which is in a different place to the block containing all the SVs holding the scalars; which may each be in a different block to the xpv, xpviv or xpvnv they point to; which in turn are quite likely to be in different pages to the actual pv data if there is any; etc.

      And that was just an array. Take a close look at a hash, or better still a stash, and see all the separate allocations that go to making up the thing. And remember, perl's allocator tends to allocate like-sized things from the same pools, so the individual elements of any single entity tend to be strewn around over several pages; not all together in one neat lump (page).

      So whilst it isn't (nor was I implying) a 1-for-1, 4k block copied for every internal change, nor do changes to a single entity necessarily involve the copying of only a single block. push, shift, pop or unshift an element to a small array, and several blocks might need to be copied as a result.
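
      To see the scatter for yourself, here is a small sketch (mine, purely illustrative): stringifying a reference exposes the address of the SV behind it, so you can print where the pieces of a small array actually live:

      #!/usr/bin/perl
      # Print the addresses of the SVs backing a small array's elements;
      # they are separate allocations, with no guarantee of sharing a
      # single 4k page with each other or with the AV itself.
      use strict;
      use warnings;

      my @array = ( 'a string', 42, 3.14, 'another string' );

      printf "element %d's SV lives at %s\n", $_, \$array[ $_ ]
          for 0 .. $#array;
      printf "the AV itself lives at %s\n", \@array;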

      The CPU/MMU operations required to throw a page fault and allocate memory are insignificant compared to the time it takes to actually copy the data in memory.

      I'm sorry, but I disagree. A page fault is a hugely expensive operation, because of the ring3-ring0-ring3 transition. Copying (within the auspices of a single process) a 4k block or even a 64k block is less expensive.

      For an example of just how expensive page faults can be, see What could cause excessive page faults?.


Re^13: Strange memory leak using just threads (forks.pm)
by zwon (Abbot) on Sep 22, 2010 at 14:12 UTC
    an up-front single request for as many pages of VM as are required, followed by a single ring 3 rep movs.

    I don't think so. Each variable has to be copied separately, and you have to fix up the references. So it's much more than a single memory allocation.

    Yes. That is an annoying detail of the ithreads implementation. But, it is quite easy to avoid; you just spawn your workers early, and have them require rather than use what they (individually) need.

    Sorry, don't see how this can help. Workers usually all need the same set of modules.

      Sorry, don't see how this can help. Workers usually all need the same set of modules.

      The point of spawning early is to avoid there being much already in memory to cause duplication. Obviously, there's no point in require instead of use if all your threads need everything. (Hence (individually).)

      But, in many scenarios, whilst the threads require the same set of modules as each other, they don't need everything the main thread needs.

      Hence, for example, in a Tk app that uses some background threads for long-running calculations or fetching stuff from the web etc., it makes sense to use the modules needed by the workers, spawn the workers, then require Tk and anything else needed by the main thread. That way, the huge Tk doesn't get needlessly replicated into all the workers.
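
      A minimal sketch of that ordering (the worker task here is my own stand-in):

      #!/usr/bin/perl
      # Load what the workers need and spawn them first; pull Tk in only
      # afterwards, so its large module tree is never cloned into them.
      use strict;
      use warnings;
      use threads;
      use List::Util qw( sum );   # what the workers need, loaded up front

      my @workers = map {
          threads->create( sub { sum( 1 .. $_[ 0 ] ) }, $_ * 1_000_000 )
      } 1 .. 4;

      require Tk;                 # loaded in the main thread only
      my $mw = MainWindow->new;
      $mw->Label( -text => 'workers running...' )->pack;
      $mw->after( 2000, sub {
          print $_->join, "\n" for @workers;
          $mw->destroy;
      } );
      Tk::MainLoop();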

      Similarly, if a threaded app needs to use DBI, it makes sense to spawn a DBI thread that requires DBI internally, and serialise DBI requests through it. That avoids duplicating DBI in all the app's other threads; and it avoids complications with DBs or DB libraries that use PIDs (rather than TIDs) for managing their internal memory.
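
      Sketched roughly (the connection string and table name are hypothetical):

      #!/usr/bin/perl
      # Serialise all database work through one dedicated thread, so DBI
      # is loaded (and its handles live) in that thread alone.
      use strict;
      use warnings;
      use threads;
      use Thread::Queue;

      my $requests = Thread::Queue->new;
      my $results  = Thread::Queue->new;

      my $dbi_thread = threads->create( sub {
          require DBI;    # never cloned into the other threads
          my $dbh = DBI->connect( 'dbi:SQLite:dbname=test.db', '', '' );
          while ( defined( my $sql = $requests->dequeue ) ) {
              my $rows = $dbh->selectall_arrayref( $sql );
              $results->enqueue( scalar @$rows );
          }
          $dbh->disconnect;
      } );

      # Any thread can queue requests; only the DBI thread touches the DB.
      $requests->enqueue( 'SELECT * FROM items' );    # hypothetical table
      print 'got ', $results->dequeue, " rows\n";
      $requests->enqueue( undef );                    # shutdown signal
      $dbi_thread->join;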

      Another example is a threaded app that processes a large volume of work items read in from a file. Spawn the work threads before reading the file; otherwise the data structure holding the file contents gets replicated into all the threads, even though they don't use it directly.
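
      Something like this (the file name is hypothetical):

      #!/usr/bin/perl
      # Spawn first, slurp second: the workers exist before the big data
      # does, so the file contents are never cloned into them.
      use strict;
      use warnings;
      use threads;
      use Thread::Queue;

      my $q = Thread::Queue->new;

      my @workers = map {
          threads->create( sub {
              while ( defined( my $item = $q->dequeue ) ) {
                  # ... process $item ...
              }
          } )
      } 1 .. 4;

      open my $fh, '<', 'workitems.txt' or die $!;    # hypothetical file
      $q->enqueue( $_ ) while <$fh>;       # feed items through the queue
      $q->enqueue( ( undef ) x @workers ); # one 'stop' signal per worker
      $_->join for @workers;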


        The point of spawning early is to avoid there being much already in memory to cause duplication. Obviously, there's no point in require instead of use if all your threads need everything. (Hence (individually).)

        So I can use this optimisation if I need to run a single specific thread, but if I need to start a bunch of identical worker threads it won't work.
