Design advice: Classic boss/worker program memory consumption

shadrack has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Design advice: Classic boss/worker program memory consumption by moritz (Cardinal) on May 20, 2014 at 19:35 UTC
A typical solution is to have the code for the boss and the worker in different executables. When spawning a worker, the boss does a fork, immediately followed by an exec, which is much more memory friendly (on modern Linux/Unix systems, copy-on-write means you need barely extra memory for a fork when neither of the programs modify the data).Then have some kind of IPC to transfer the work unit to the worker. That way the worker only needs modest amounts of memory, and spawning many of them isn't that expensive. No intermediate process either. Perl 6 - the future is here, just unevenly distributed	[reply]
Re^2: Design advice: Classic boss/worker program memory consumption by shadrack (Acolyte) on May 20, 2014 at 20:01 UTC
When spawning a worker, the boss does a fork, immediately followed by an exec, which is much more memory friendly (on modern Linux/Unix systems, copy-on-write means you need barely extra memory for a fork when neither of the programs modify the data). copy-on-write does indeed make a fork more efficient, but you lose all that efficiency the minute you do an exec because the exec'd process starts over from scratch, loading its own separate copy of the perl interpreter, its own separate copy of the script, all the modules, etc. -- all taking up memory which would otherwise have been shared with the parent. This makes it LESS memory-friendly than just forking.	[reply]
Re^3: Design advice: Classic boss/worker program memory consumption by moritz (Cardinal) on May 21, 2014 at 05:47 UTC
You also lose all that efficiency when you start to read a lot of data structures in perl, because reading data ca n increase the reference counters, thus a write actually happens in the background, causing the operating system to copy the pages. So, don't assume stuff is memory unfriendly. Measure it. In your context, where it matters. Also, starting a new perl interpreter takes about 0.017s "cold" on my machine, and 0.002s when cached. Compared to a 1 minute run time of a worker thread, that's like 0.03% (cold) or 0.003% (warm) - nothing to worry about. Perl 6 - the future is here, just unevenly distributed	[reply]
Re^2: Design advice: Classic boss/worker program memory consumption by Laurent_R (Canon) on May 20, 2014 at 21:30 UTC
The memory size is only part of the problem. The problem with spawning too many children also has to do with context switches. I am regularly working with seven separate databases, each having eight sub-databases (and huge amount of data). So, in principle, I could launch 56 parallel processes for extracting data from these 56 data sources. Most of the time, I don't use forks or threads, but simply launch parallel background processes under the shell. And the processes that I am referring to are sometimes in Perl, and sometimes in a variety of other languages (including a proprietary equivalent of PL/SQL), so that using the OS to fork background processes is often the easiest route: the shell and the OS manage the parallel tasks, the program manages the functional/business requirements. Our servers have usually 8 CPUs. We have been doing this for many years and have tried a number of options and configurations, and we have found that the optimal number of processes running in parallel is usually between 8 and 16, depending on the specifics of the individual programs (i.e. some are faster with 8 processes, some better with 12 and some better with 16, depending on what exactly they are doing and how). If we use less than 8 processes, we have an obvious under-utilization of the hardware; if we let 56 processes run in parallel, the overall process takes much longer to execute than when we have 8 to 16 processes, and we strongly believe that this is due to context switches and memory usage. In some cases (when the extraction need heavy DB sorting, for example), the overall process even fails for lack of memory if we have too many processes running in parallel. But in most of the cases, it really seems to be linked to having too many processes running for the number CPUs available, leading to intensive context switches. So what we are doing is to put the 56 processes into a waiting queue, whose role is just to to keep optimal the actual number of processes running in parallel (8 to 16 depending on the program). This is at least the best solution we've found so far. With about 15 years multiplied by 5 or 6 persons of cumulated experience on the subject. Now, if anyone has a better idea, I would gladly take it up and try it out.	[reply]
Re^3: Design advice: Classic boss/worker program memory consumption (context switching is cheap) by tye (Sage) on May 21, 2014 at 00:55 UTC
if we let 56 processes run in parallel, the overall process takes much longer to execute than when we have 8 to 16 processes, and we strongly believe that this is due to context switches and memory usage. Having to swap pages out and back in surely can make things run much, much longer. 56 processes talking to databases isn't really going to add enough overhead from context switching to cause the total run-time to be "much longer", IME. To get context switching to be even a moderate source of overhead, you have to force it to happen extra often by passing tiny messages back and forth way more frequently than your OS's time slice. There are also other resources that can have a significant impact on performance if you try to over-parallelize. Different forms of caching (file system, database, L1, L2) can become much less effective if you have too many simultaneously competing uses for them (having to swap out/in pages of VM is just a specific instance of this general problem). To a lesser extent, you can also lose performance for disk I/O by having reads be done in a less-sequential pattern. But I bet that is mostly "in the noise" compared to reduction in cache efficiency leading to things having to be read more than once each. Poor lock granularity can also lead to greatly reduced performance when you over-parallelize, but I doubt that applies in your situation. - tye	[reply]
Re^4: Design advice: Classic boss/worker program memory consumption (context switching is absolutely not cheap) by BrowserUk (Patriarch) on May 21, 2014 at 02:08 UTC
Re^5: Design advice: Classic boss/worker program memory consumption (cache misses suck) by tye (Sage) on May 21, 2014 at 03:15 UTC
Some notes below your chosen depth have not been shown here
Re^4: Design advice: Classic boss/worker program memory consumption (context switching is cheap) by Laurent_R (Canon) on May 21, 2014 at 06:30 UTC
Re: Design advice: Classic boss/worker program memory consumption by Laurent_R (Canon) on May 20, 2014 at 18:58 UTC
Forking as many as 30 children is probably quite inefficient (although I am not sure, you haven't provided enough detail about your platform). I would suggest that you create a number of children comprised between the number of CPUs (or CPU cores) available on the platform and about twice that number (the exact number will have to be found empirically depending on how CPU and IO intensive the processes are). Then, the boss assigns tasks to the slave tasks giving them only one task at a time (using IPCs, temp files or whatever is suited). Once a slave task has finished its work, it returns the results to the boss and waits for the boss to send another task if any left. If none left, the boss sends a message to the task to die. Oh, and, BTW, there are some CPAN modules to manage this almost automatically with threads, there might be some to do it with forked processes. And, BTW, may be threads would be better than forked processes. Edit, May 20, 2014, 19:07 UTC: I had forgotten that you speak about queuing the tasks in your post. If the same queue is available to all children, my description of a solution is probably useless, the queue is probably just doing that. But the fact that starting 30 (or 50) children is probably overkill leading to inefficiencies remains. Also I don't understand why the children should share the full data structure in memory. You could in principle farm out to each only a small part of it, not requiring the children to hold the full thing in memory. But there is not enough details on the overall process to be sure.	[reply]
Re: Design advice: Classic boss/worker program memory consumption (PipeWorkers) by tye (Sage) on May 20, 2014 at 20:27 UTC
I need to upload IPC::PipeWorkers to CPAN. It handles Solution 2 quite efficiently. One down side is that you can't fork a replacement child when one of your worker children dies. That motivated me to consider exactly your Solution 1. I spent a little time trying to work out the details of that. But I ended up abandoning the idea because of the huge increase in complexity involved. Worker children disappearing turned out to never be much of a problem in practice. Also, it is wise for long-running daemons to periodically be restarted anyway, which means the rate of child death needs to be pretty high for it to end up mattering much. - tye	[reply]
Re^2: Design advice: Classic boss/worker program memory consumption (PipeWorkers) by shadrack (Acolyte) on May 20, 2014 at 20:51 UTC
It handles solution 2 quite efficiently. How? Just in general, what's your strategy?	[reply]
Re: Design advice: Classic boss/worker program memory consumption by sundialsvc4 (Abbot) on May 20, 2014 at 23:34 UTC
Another trick is to reduce the parent’s role strictly to managing the children. Spin-off the process of creating the %dataset and of determining the number of workers to a single child ... call it the project manager. The PM then informs the parent how many children are needed, and the parent spawns them. Worker processes can, say, ask the PM for another unit of work, which the PM doles-out and sends to them. There is also another child process which is the recipient and filer of completed work units. Now, you have a dumb parent who has two gifted children (the PM and the filer), as well as a variable number of hired grunts. No large chunks of memory get duplicated on-fork, and that should eliminate your memory problem. The parent’s only job is to see to it that its children are alive. The children talk among themselves. Like any good bureaucrat, the parent is responsible for everything but basically does nothing . . . :-)
Re^2: Design advice: Classic boss/worker program memory consumption (less async) by tye (Sage) on May 21, 2014 at 13:53 UTC
Having the parent (that doesn't build the data nor do any of the tasks) "dole out" tasks to the children (for the one child) adds a lot of complexity vs. just having a pipe for the children to read from. It also can easily reduce concurrency. The significant increase in complexity comes from this one process trying to manage competing tasks that then need to be done asynchronously. To keep the code simple, you could have the parent just be responsible for spawning children. One child builds the hash and sends out jobs on a pipe that all of the other children read (fixed-length) jobs from. Those children then write their responses that aren't larger than "the system buffer" to a "return" pipe using one syswrite(). This leads to only one tiny bit of "async" fiddling to worry about in only the one child and in a way that can be handled simply in Perl while also only requiring two simple pipes. But to be able to have children pick up tasks from a single pipe, you have to use fixed-length task data (not larger than the system buffer, 4KB on Linux). To be able to use a single pipe for responses from workers, responses need to be written in chunks (with a fixed-length prefix that specifies the length of the chunk that must total no more than the system buffer size) using a single syswrite(). (And not all worker systems require a response to be sent back to the holder of the huge hash.) There might be a bit of complexity added by this "just spawns children" parent needing to keep handles open to both ends of both pipes if we expect it to replace both types of children. I'd probably just have to write the code to be sure of the consequences of that. I'd probably simplify that a bit by having the death of the "build the big hash" child cause the parent to re-initialize by execing itself. There could still be some added complexity just from having one more process that needs to keep handles to one end of each of the two pipes open. But I think that would only impact the "shut down cleanly" code. Again, until I swap all of the little details in (probably by just writing the code), I'm not sure of the full consequences. (Update: Yeah, the complexity comes because, if workers send back responses, you'll need to add a way for the one "build the tasks" child to tell the parent that it is time to shut down or else a way for the parent to tell the one child "all workers are finished".) (And getting this approach to work reliably on Windows is mostly a whole separate problem because you can't just use fork() and because Windows pipes don't have the same guarantees -- though you can use "message read mode", which should work but is just a separate problem to solve.) - tye	[reply]
Re^2: Design advice: Classic boss/worker program memory consumption by shadrack (Acolyte) on May 21, 2014 at 05:09 UTC
Thanks. Although my sense is that your solution as described is actually a bit more complex than solution 1, a variation of the core idea might be the way to go (at least for me). The solution I'm thinking of is identical to the one you've described up to the point where the parent spawns the workers. At this point the PM, instead of hanging around and answering requests from the children, sends the entire %dataset to the parent, then quits. The parent then continues to run the show exactly as it does today (remember, for me, this code is not only written, but well-tested). Some inefficiency will be introduced by the need for the PM to communicate the entire %dataset to the parent via IPC. I'm thinking that, since this only happens once, it probably won't have a huge impact on performance, though obviously, I'll have to test it. At any rate, it's a really good idea. Thanks!	[reply]
Re: Design advice: Classic boss/worker program memory consumption by flexvault (Monsignor) on May 21, 2014 at 15:22 UTC
shadrack, "Does anybody see a solution that's better than either of the above in terms of SIMPLICITY and EFFICIENCY -- preferably one that doesn't require a major overhaul?" After re-reading your post several times, it appears your problem is the expanding size of the '%dataset' hash. One solution is to save the assembled '%dataset' hash to disk and then clear the hash before forking the children. As each child asks for more work, the 'boss' reads the disk for the next unit of work and passes that to the child for processing. You could do this with a database, but that just adds overhead that isn't needed. (Note: I'm assuming the hash(now file) is deleted after the boss completes the run.) Also, you don't say anything about the contents of the data, so you may want to include the size of each unit of work on disk, and read just that amount. Perl takes care of this for you in the hash. This solution is very simple and would be easy to test/verify. Hope this helps! Regards...Ed "Well done is better than well said." - Benjamin Franklin	[reply]
Re^2: Design advice: Classic boss/worker program memory consumption by shadrack (Acolyte) on May 22, 2014 at 02:36 UTC
I thought of that, but I recall reading somewhere that if you allocate a big chunk of memory in perl and then release it (by way of undef'ing the variable or whatever) that memory is not returned to the operating system -- it's still owned by the perl interpreter as far as Linux is concerned. So you can write %dataset to disk and undef it, but w/rt children, forking, and copy-on-write, the operating system will treat that memory as if it's still in use. Thus, you gain nothing.	[reply]
Re^3: Design advice: Classic boss/worker program memory consumption by flexvault (Monsignor) on May 22, 2014 at 11:20 UTC
shadrack, Interesting, I just checked on AIX and the memory was freed for the child, but on Linux(Debian) the memory was still allocated for the child. So it depends on the OS. You could try this program on your system and see if the memory is freed or not? Note: My test results are at the bottom of the script. #!/usr/local/bin/perl use strict; use warnings; my $NAME = ""; my ( $FB_CacheMemoryVSZ, $FB_CacheMemoryRSS ) = &Display_Mem_Usage +($$,$NAME,0); print "Before: Virtual Memory: $FB_CacheMemoryVSZ\tReal Memory: $F +B_CacheMemoryRSS\n"; { my %RCache; for my $key ( 1_000 .. 11_000 ) { $RCache{$key} = chr( $key % 255 ) x 2_048; } ( $FB_CacheMemoryVSZ, $FB_CacheMemoryRSS ) = &Display_Mem_Usag +e($$,$NAME,0); print "After: Virtual Memory: $FB_CacheMemoryVSZ\tReal Memory +: $FB_CacheMemoryRSS\n"; } ( $FB_CacheMemoryVSZ, $FB_CacheMemoryRSS ) = &Display_Mem_Usage($$ +,$NAME,0); print "$$: Virtual Memory: $FB_CacheMemoryVSZ\tReal Memory: $FB_Ca +cheMemoryRSS\n\n"; my $fork = fork(); if ( $fork ) { ( $FB_CacheMemoryVSZ, $FB_CacheMemoryRSS ) = &Display_Mem_Usag +e($$,$NAME,0); print "$$: Virtual Memory: $FB_CacheMemoryVSZ\tReal Memory: $F +B_CacheMemoryRSS\n\n"; } else { ( $FB_CacheMemoryVSZ, $FB_CacheMemoryRSS ) = &Display_Mem_Usa +ge($$,$NAME,0); print "$$: Virtual Memory: $FB_CacheMemoryVSZ\tReal Memory: $F +B_CacheMemoryRSS\n\n"; } exit; sub Display_Mem_Usage { # VSZ is size in KBytes of the virtual memory ( VSZ * 1024 ) # RSS is size in pages of real memory ( 1024 * RSS ) my $cpid = shift; my $name = shift; my $from = shift; my $var = ""; my $fh; my $arg = qq\| -o "vsz rssize" -p $cpid\|; open ( $fh, "-\|", "/bin/ps $arg" ) or die "Display_Mem_Usage: No +t open \'$arg\': $!"; while (<$fh>) { $var .= $_; } close $fh; my $rno = my @ref = split(/\n/,$var); if ( $rno < 2 ) { return ( - +1, -1 ); } my $info = join(" ", split " ", $ref[1]); my ($vmem,$rmem) = ( split(/\ /,$info) ); return ( $vmem , $rmem ); } __END__ AIX:# perl ckmem.cgi Before: Virtual Memory: 776 Real Memory: 852 After: Virtual Memory: 21904 Real Memory: 21980 41030: Virtual Memory: 21908 Real Memory: 21984 40244: Virtual Memory: 424 Real Memory: 404 ## Child 41030: Virtual Memory: 21912 Real Memory: 21900 AIX:# Linux:# perl ckmem.cgi Before: Virtual Memory: 5288 Real Memory: 1928 After: Virtual Memory: 26136 Real Memory: 22804 26362: Virtual Memory: 26128 Real Memory: 22804 26362: Virtual Memory: 26128 Real Memory: 22808 26369: Virtual Memory: 26128 Real Memory: 21600 ## Child Linux:# [download] Regards...Ed "Well done is better than well said." - Benjamin Franklin	[reply] [d/l]
Re^4: Design advice: Classic boss/worker program memory consumption by shadrack (Acolyte) on May 23, 2014 at 09:53 UTC
Re^5: Design advice: Classic boss/worker program memory consumption by flexvault (Monsignor) on May 23, 2014 at 13:29 UTC
Re: Design advice: Classic boss/worker program memory consumption by oiskuu (Hermit) on May 21, 2014 at 16:42 UTC
Basically, you want a database that's not on the parent heap. How about this: `use GDBM_File; tie(my %dataset, 'GDBM_File', 'tmp.db', &GDBM_NEWDB, 0600) or die; ...` [download] I've assumed your dataset needs no serialization.	[reply] [d/l]
Re^2: Design advice: Classic boss/worker program memory consumption by sundialsvc4 (Abbot) on May 21, 2014 at 17:08 UTC
A potential issue with that design is that now you have the issue of multiple read/write access to a single file ... locking, concurrency and all of that. Which could become quite messy. I know that it is somewhat counter-intuitive to have a parent that does nothing but mind the kids, but I actually find it easier to do it that way because of the issue of separation-of-concerns, or in this case, of one-thing-to-wait-on. The parent owns the kids, minds the kids, wipes their little noses, and that’s it. The kids connect to their service-provider older brother, and send results to their service-consumer older sister. Exactly how you IPC that stuff together is really up to you ... “tim toady.” But, now, each process (including the parent) basically has only one concern, one thing to worry about at any particular time, and a very uncomplicated internal structure. The processing will scale-up or scale-down easily, and be quite reliable.
Re^2: Design advice: Classic boss/worker program memory consumption by shadrack (Acolyte) on May 22, 2014 at 02:43 UTC
I've assumed your dataset needs no serialization. Unfortunately, it would. The work units are moderately complex data structures. Assembling them directly to a tied hash isn't really practical	[reply]


more useful options
	PerlMonks