PerlMonks
Design advice: Classic boss/worker program memory consumption

by shadrack (Acolyte)
on May 20, 2014 at 17:50 UTC
shadrack has asked for the wisdom of the Perl Monks concerning the following question:

Monks -

I've got a boss/worker design issue that I'd like to get some suggestions on. I have a script that I've built up over the years, now over 6300 lines (yes, it's an unholy beast). The program's job involves assembling a big hash (%dataset), each member of which is a "work unit" of a few dozen KB. %dataset can contain anywhere from 0 (effectively a no-op) up to about 10,000 work units. Work units are farmed out in queue fashion to a set of child worker processes. Each worker processes one work unit at a time (altering it) and returns it to the boss. Although not directly relevant to the problem, I'll go ahead and mention that the processing of a work unit is I/O bound and typically takes less than a minute (though in extreme cases it can take up to 10). I'd estimate the program runs about 100 times a day, often with a dataset consisting of a single work unit.

The boss process currently uses the following algorithm (yes, this is pseudo-code):

    $max_children = 30
    Create IPC socket
    $SIG{CHLD} = \&reaper
    Assemble %dataset
    $num_children_required = minimum($max_children, num_work_units)
    foreach $num_children (1 .. $num_children_required) {
        fork() a worker
    }
    while ($num_children > 0) {
        listen for children on socket
        get from child on socket: processed work unit (if any)
        if (num_unprocessed_work_units > 0) {
            tell child on socket: process work unit identified by $key
        } else {
            tell child on socket: quit
        }
    }
    sub reaper { $num_children--; }

$max_children was initially 50, but in recent months, I've had to reduce it to 30 as a workaround for the problem I've started experiencing. This problem is memory -- or rather, a lack of memory caused by an increase in the number of work units. Because %dataset is assembled before forking off all the worker processes, they each have a complete copy of the entire dataset. As the boss collects processed work units it updates %dataset accordingly which triggers copy-on-write for the part of %dataset that's being updated.

The obvious solution is to create all the worker processes BEFORE assembling %dataset, however it's only after %dataset is assembled that the boss knows how many worker processes it needs, so I have sort of a chicken-and-egg problem.

So far, I've come up with two solutions, neither of which gives me the warm fuzzies.

SOLUTION 1:
Instead of forking all the workers directly, the boss forks off a single child BEFORE assembling %dataset (we'll call this child the SPAWNER). Then, after assembling %dataset and determining how many workers are required, the boss tells SPAWNER to create that many workers. Communication between the boss and workers is more or less the same as above, however the reaping of children must be handled by SPAWNER which must convey each reap to the boss.
PROBLEM: The addition of an extra process between the boss and worker processes adds COMPLEXITY in a number of areas: the spawning of workers, the reaping of workers, the requirement for the boss to detect and recover if SPAWNER terminates prematurely, and possibly other areas I haven't thought of.

SOLUTION 2:
Fork $max_children number of workers before assembling %dataset, then, after assembling %dataset and determining how many workers are required, kill off any unneeded workers (if any).
PROBLEM: Creates processes we don't need, which is INEFFICIENT.
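For what it's worth, Solution 2 is only a few lines of code. A minimal, untested sketch of the idea (here `worker_loop` and `assemble_dataset` are trivial stand-ins for the real code):

```perl
use strict;
use warnings;

# Placeholder stubs for the real worker and dataset-building code:
sub worker_loop      { return }
sub assemble_dataset { return ( a => 'unit-a', b => 'unit-b', c => 'unit-c' ) }

my $max_children = 30;
my %workers;    # pid => 1

# Fork the full pool BEFORE %dataset exists, so the children share
# only the (still small) parent image.
for ( 1 .. $max_children ) {
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ( $pid == 0 ) { worker_loop(); exit 0 }    # child
    $workers{$pid} = 1;
}

my %dataset = assemble_dataset();                 # now the parent grows

# Kill off any workers we turned out not to need.
my $needed = keys(%dataset) < $max_children ? scalar keys %dataset : $max_children;
my @excess = ( sort { $a <=> $b } keys %workers )[ $needed .. $max_children - 1 ];
kill 'TERM', @excess if @excess;
delete @workers{@excess};
```

The waste is bounded: at most `$max_children - $needed` forks that exit almost immediately.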

Does anybody see a solution that's better than either of the above in terms of SIMPLICITY and EFFICIENCY -- preferably one that doesn't require a major overhaul?

Re: Design advice: Classic boss/worker program memory consumption
by Laurent_R (Parson) on May 20, 2014 at 18:58 UTC
    Forking as many as 30 children is probably quite inefficient (although I am not sure; you haven't provided enough detail about your platform). I would suggest creating a number of children somewhere between the number of CPUs (or CPU cores) available on the platform and about twice that number (the exact figure has to be found empirically, depending on how CPU- and I/O-intensive the processes are). Then the boss assigns tasks to the workers, giving them only one task at a time (using IPC, temp files or whatever is suited). Once a worker has finished its work, it returns the results to the boss and waits for the boss to send another task, if any are left. If none are left, the boss sends a message telling the worker to die.

    Oh, and, BTW, there are CPAN modules to manage this almost automatically with threads, and there might be some that do it with forked processes. And maybe threads would be better than forked processes here.

    Edit, May 20, 2014, 19:07 UTC: I had forgotten that you mention queuing the tasks in your post. If the same queue is available to all children, my description of a solution is probably useless; the queue is probably doing just that. But the point remains that starting 30 (or 50) children is probably overkill, leading to inefficiencies. Also, I don't understand why the children should share the full data structure in memory. You could in principle farm out to each only a small part of it, not requiring the children to hold the full thing in memory. But there aren't enough details on the overall process to be sure.
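    One such CPAN module for the forked-process case is Parallel::ForkManager, which caps the number of live children for you. A rough sketch (assumes the module is installed; `@work_unit_keys` and `process_work_unit` are placeholders), though note each child forked after %dataset is built would still inherit a copy of it:

```perl
use strict;
use warnings;
use Parallel::ForkManager;    # CPAN

my $pm = Parallel::ForkManager->new(8);    # roughly 1-2x the core count

for my $key (@work_unit_keys) {            # placeholder: one entry per work unit
    $pm->start and next;                   # parent: move on to the next key
    process_work_unit($key);               # child: handle one unit
    $pm->finish;                           # child exits here
}
$pm->wait_all_children;
```

    The module's run_on_finish callback can be used to collect each child's result in the parent.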

Re: Design advice: Classic boss/worker program memory consumption
by moritz (Cardinal) on May 20, 2014 at 19:35 UTC

    A typical solution is to have the code for the boss and the worker in different executables. When spawning a worker, the boss does a fork, immediately followed by an exec, which is much more memory-friendly (on modern Linux/Unix systems, copy-on-write means a fork needs barely any extra memory as long as neither process modifies the data). Then have some kind of IPC to transfer the work unit to the worker.

    That way the worker only needs modest amounts of memory, and spawning many of them isn't that expensive. No intermediate process either.
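    The spawn step moritz describes looks roughly like this on the boss side (untested sketch; `worker.pl` is a hypothetical separate worker script):

```perl
# Boss side: spawn one worker as its own small executable.
my $pid = fork();
die "fork: $!" unless defined $pid;
if ( $pid == 0 ) {
    # $^X is the perl running the boss; exec replaces the child's
    # (briefly shared) image with a fresh, small interpreter.
    exec( $^X, 'worker.pl' ) or die "exec: $!";
}
# Parent keeps $pid and talks to the worker over the IPC channel.
```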

      When spawning a worker, the boss does a fork, immediately followed by an exec, which is much more memory friendly (on modern Linux/Unix systems, copy-on-write means you need barely extra memory for a fork when neither of the programs modify the data).

      copy-on-write does indeed make a fork more efficient, but you lose all that efficiency the minute you do an exec because the exec'd process starts over from scratch, loading its own separate copy of the perl interpreter, its own separate copy of the script, all the modules, etc. -- all taking up memory which would otherwise have been shared with the parent. This makes it LESS memory-friendly than just forking.

        You also lose all that efficiency when you start to read a lot of data structures in Perl, because reading data can increase the reference counters; thus a write actually happens in the background, causing the operating system to copy the pages.

        So, don't assume stuff is memory unfriendly. Measure it. In your context, where it matters.

        Also, starting a new perl interpreter takes about 0.017s "cold" on my machine, and 0.002s when cached. Compared to a 1 minute run time of a worker thread, that's like 0.03% (cold) or 0.003% (warm) - nothing to worry about.

      The memory size is only part of the problem. The problem with spawning too many children also has to do with context switches.

      I am regularly working with seven separate databases, each having eight sub-databases (and huge amount of data). So, in principle, I could launch 56 parallel processes for extracting data from these 56 data sources. Most of the time, I don't use forks or threads, but simply launch parallel background processes under the shell. And the processes that I am referring to are sometimes in Perl, and sometimes in a variety of other languages (including a proprietary equivalent of PL/SQL), so that using the OS to fork background processes is often the easiest route: the shell and the OS manage the parallel tasks, the program manages the functional/business requirements.

      Our servers have usually 8 CPUs. We have been doing this for many years and have tried a number of options and configurations, and we have found that the optimal number of processes running in parallel is usually between 8 and 16, depending on the specifics of the individual programs (i.e. some are faster with 8 processes, some better with 12 and some better with 16, depending on what exactly they are doing and how).

      If we use fewer than 8 processes, we have an obvious under-utilization of the hardware; if we let 56 processes run in parallel, the overall process takes much longer to execute than when we have 8 to 16, and we strongly believe that this is due to context switches and memory usage. In some cases (when the extraction needs heavy DB sorting, for example), the overall process even fails for lack of memory if we have too many processes running in parallel. But in most cases it really seems to be linked to having too many processes running for the number of CPUs available, leading to intensive context switching.

      So what we are doing is to put the 56 processes into a waiting queue, whose role is just to keep the actual number of processes running in parallel at the optimum (8 to 16, depending on the program). This is at least the best solution we've found so far, with about 15 years times 5 or 6 people of cumulated experience on the subject. Now, if anyone has a better idea, I would gladly take it up and try it out.

        if we let 56 processes run in parallel, the overall process takes much longer to execute than when we have 8 to 16 processes, and we strongly believe that this is due to context switches and memory usage.

        Having to swap pages out and back in surely can make things run much, much longer. 56 processes talking to databases isn't really going to add enough overhead from context switching to cause the total run-time to be "much longer", IME. To get context switching to be even a moderate source of overhead, you have to force it to happen extra often by passing tiny messages back and forth way more frequently than your OS's time slice.

        There are also other resources that can have a significant impact on performance if you try to over-parallelize. Different forms of caching (file system, database, L1, L2) can become much less effective if you have too many simultaneously competing uses for them (having to swap out/in pages of VM is just a specific instance of this general problem).

        To a lesser extent, you can also lose performance for disk I/O by having reads be done in a less-sequential pattern. But I bet that is mostly "in the noise" compared to reduction in cache efficiency leading to things having to be read more than once each.

        Poor lock granularity can also lead to greatly reduced performance when you over-parallelize, but I doubt that applies in your situation.

        - tye        

Re: Design advice: Classic boss/worker program memory consumption (PipeWorkers)
by tye (Cardinal) on May 20, 2014 at 20:27 UTC

    I need to upload IPC::PipeWorkers to CPAN. It handles Solution 2 quite efficiently.

    One down side is that you can't fork a replacement child when one of your worker children dies. That motivated me to consider exactly your Solution 1. I spent a little time trying to work out the details of that. But I ended up abandoning the idea because of the huge increase in complexity involved.

    Worker children disappearing turned out to never be much of a problem in practice. Also, it is wise for long-running daemons to periodically be restarted anyway, which means the rate of child death needs to be pretty high for it to end up mattering much.

    - tye        

      It handles solution 2 quite efficiently.
      How? Just in general, what's your strategy?
Re: Design advice: Classic boss/worker program memory consumption
by sundialsvc4 (Abbot) on May 20, 2014 at 23:34 UTC

    Another trick is to reduce the parent’s role strictly to managing the children.   Spin off the creation of %dataset and the determination of the number of workers to a single child ... call it the project manager.   The PM then informs the parent how many children are needed, and the parent spawns them.   Worker processes can, say, ask the PM for another unit of work, which the PM doles out and sends to them.   There is also another child process which is the recipient and filer of completed work units.

    Now, you have a dumb parent who has two gifted children (the PM and the filer), as well as a variable number of hired grunts.   No large chunks of memory get duplicated on-fork, and that should eliminate your memory problem.   The parent’s only job is to see to it that its children are alive.   The children talk among themselves.   Like any good bureaucrat, the parent is responsible for everything but basically does nothing . . . :-)

      Thanks. Although my sense is that your solution as described is actually a bit more complex than solution 1, a variation of the core idea might be the way to go (at least for me). The solution I'm thinking of is identical to the one you've described up to the point where the parent spawns the workers. At this point the PM, instead of hanging around and answering requests from the children, sends the entire %dataset to the parent, then quits. The parent then continues to run the show exactly as it does today (remember, for me, this code is not only written, but well-tested).

      Some inefficiency will be introduced by the need for the PM to communicate the entire %dataset to the parent via IPC. I'm thinking that, since this only happens once, it probably won't have a huge impact on performance, though obviously, I'll have to test it.

      At any rate, it's a really good idea. Thanks!
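      Sending the finished %dataset from the PM to the parent in one shot is straightforward with Storable's store_fd/fd_retrieve over a pipe. A minimal runnable sketch (with a trivial stand-in for the real dataset build):

```perl
use strict;
use warnings;
use Storable qw(store_fd fd_retrieve);

# Stand-in for the real, expensive build:
sub assemble_dataset { return { k1 => 'unit1', k2 => 'unit2' } }

pipe( my $rd, my $wr ) or die "pipe: $!";
my $pid = fork();
die "fork: $!" unless defined $pid;

if ( $pid == 0 ) {                 # the PM child
    close $rd;
    my $built = assemble_dataset();
    store_fd( $built, $wr );       # serialize the whole hash down the pipe
    close $wr;
    exit 0;
}

close $wr;
my $dataset = fd_retrieve($rd);    # parent blocks until the PM has sent it all
close $rd;
waitpid( $pid, 0 );
# $dataset is now a hashref; workers forked before this point stay small.
```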

      Having the parent (which neither builds the data nor does any of the tasks) "dole out" tasks to the children on behalf of the one child adds a lot of complexity vs. just having a pipe for the children to read from. It can also easily reduce concurrency. The significant increase in complexity comes from this one process trying to manage competing tasks that then need to be done asynchronously.

      To keep the code simple, you could have the parent just be responsible for spawning children. One child builds the hash and sends out jobs on a pipe that all of the other children read (fixed-length) jobs from. Those children then write their responses that aren't larger than "the system buffer" to a "return" pipe using one syswrite().

      This leads to only one tiny bit of "async" fiddling to worry about in only the one child and in a way that can be handled simply in Perl while also only requiring two simple pipes.

      But to be able to have children pick up tasks from a single pipe, you have to use fixed-length task data (not larger than the system buffer, 4KB on Linux). To be able to use a single pipe for responses from workers, responses need to be written in chunks (with a fixed-length prefix that specifies the length of the chunk that must total no more than the system buffer size) using a single syswrite(). (And not all worker systems require a response to be sent back to the holder of the huge hash.)
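      A sketch of the two-pipe scheme (untested; `$JOB_LEN` and `process` are made-up names, and the atomicity relies on writes staying at or under PIPE_BUF):

```perl
use strict;
use warnings;

# One job pipe shared by every worker. Jobs are fixed-length records no
# larger than PIPE_BUF (4KB on Linux), so each write is atomic and a
# single sysread() can never straddle two jobs.
my $JOB_LEN = 64;    # hypothetical record size

pipe( my $job_rd, my $job_wr ) or die "pipe: $!";
pipe( my $res_rd, my $res_wr ) or die "pipe: $!";

# Called by the child that owns the big hash:
sub send_job {
    my ($key) = @_;
    my $rec = pack( "A$JOB_LEN", $key );    # pad/truncate to fixed length
    syswrite( $job_wr, $rec, $JOB_LEN ) == $JOB_LEN or die "short write";
}

# Each worker competes for records on the shared job pipe:
sub worker_loop {
    while ( sysread( $job_rd, my $rec, $JOB_LEN ) == $JOB_LEN ) {
        my $key    = unpack( "A$JOB_LEN", $rec );
        my $result = process($key);    # hypothetical per-unit work
        # Length-prefixed response in ONE syswrite (total <= PIPE_BUF):
        syswrite( $res_wr, pack( 'N/a*', $result ) );
    }
}
```

      pack's "N/a*" template produces a 4-byte big-endian length prefix followed by the payload, so the reader of the response pipe can first read 4 bytes, then read exactly that many more.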

      There might be a bit of complexity added by this "just spawns children" parent needing to keep handles open to both ends of both pipes if we expect it to replace both types of children. I'd probably just have to write the code to be sure of the consequences of that. I'd probably simplify that a bit by having the death of the "build the big hash" child cause the parent to re-initialize by execing itself.

      There could still be some added complexity just from having one more process that needs to keep handles to one end of each of the two pipes open. But I think that would only impact the "shut down cleanly" code. Again, until I swap all of the little details in (probably by just writing the code), I'm not sure of the full consequences. (Update: Yeah, the complexity comes because, if workers send back responses, you'll need to add a way for the one "build the tasks" child to tell the parent that it is time to shut down or else a way for the parent to tell the one child "all workers are finished".)

      (And getting this approach to work reliably on Windows is mostly a whole separate problem because you can't just use fork() and because Windows pipes don't have the same guarantees -- though you can use "message read mode", which should work but is just a separate problem to solve.)

      - tye        

Re: Design advice: Classic boss/worker program memory consumption
by flexvault (Parson) on May 21, 2014 at 15:22 UTC

    shadrack,

      "Does anybody see a solution that's better than either of the above in terms of SIMPLICITY and EFFICIENCY -- preferably one that doesn't require a major overhaul?"

    After re-reading your post several times, it appears your problem is the expanding size of the '%dataset' hash. One solution is to save the assembled '%dataset' hash to disk and then clear the hash before forking the children.

    As each child asks for more work, the 'boss' reads the next unit of work from disk and passes it to the child for processing. You could do this with a database, but that just adds overhead that isn't needed. (Note: I'm assuming the hash, now a file, is deleted after the boss completes the run.) Also, you don't say anything about the contents of the data, so you may want to record the size of each unit of work on disk and read just that amount; Perl takes care of this for you in the hash.

    This solution is very simple and would be easy to test/verify. Hope this helps!
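    A sketch of the spill-to-disk idea using Storable (the one-file-per-key layout and sub names are made up):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Write each assembled work unit (a reference) to its own file, then
# empty the hash before forking so the children inherit nothing big.
sub spill_dataset {
    my ( $dataset, $dir ) = @_;
    for my $key ( keys %$dataset ) {
        store( $dataset->{$key}, "$dir/$key.unit" );
    }
    %$dataset = ();    # caveat: perl may not return this memory to the OS
}

# Boss reads one unit back on demand as a child asks for work:
sub load_unit {
    my ( $dir, $key ) = @_;
    return retrieve("$dir/$key.unit");
}
```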

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

      I thought of that, but I recall reading somewhere that if you allocate a big chunk of memory in perl and then release it (by way of undef'ing the variable or whatever) that memory is not returned to the operating system -- it's still owned by the perl interpreter as far as Linux is concerned. So you can write %dataset to disk and undef it, but w/rt children, forking, and copy-on-write, the operating system will treat that memory as if it's still in use. Thus, you gain nothing.

        shadrack,

        Interesting, I just checked on AIX and the memory was freed for the child, but on Linux(Debian) the memory was still allocated for the child. So it depends on the OS.

        You could try this program on your system and see whether the memory is freed or not. Note: my test results are at the bottom of the script.

        #!/usr/local/bin/perl
        use strict;
        use warnings;

        my $NAME = "";
        my ( $FB_CacheMemoryVSZ, $FB_CacheMemoryRSS ) = &Display_Mem_Usage($$,$NAME,0);
        print "Before: Virtual Memory: $FB_CacheMemoryVSZ\tReal Memory: $FB_CacheMemoryRSS\n";
        {
            my %RCache;
            for my $key ( 1_000 .. 11_000 ) {
                $RCache{$key} = chr( $key % 255 ) x 2_048;
            }
            ( $FB_CacheMemoryVSZ, $FB_CacheMemoryRSS ) = &Display_Mem_Usage($$,$NAME,0);
            print "After: Virtual Memory: $FB_CacheMemoryVSZ\tReal Memory: $FB_CacheMemoryRSS\n";
        }
        ( $FB_CacheMemoryVSZ, $FB_CacheMemoryRSS ) = &Display_Mem_Usage($$,$NAME,0);
        print "$$: Virtual Memory: $FB_CacheMemoryVSZ\tReal Memory: $FB_CacheMemoryRSS\n\n";

        my $fork = fork();
        if ( $fork ) {
            ( $FB_CacheMemoryVSZ, $FB_CacheMemoryRSS ) = &Display_Mem_Usage($$,$NAME,0);
            print "$$: Virtual Memory: $FB_CacheMemoryVSZ\tReal Memory: $FB_CacheMemoryRSS\n\n";
        }
        else {
            ( $FB_CacheMemoryVSZ, $FB_CacheMemoryRSS ) = &Display_Mem_Usage($$,$NAME,0);
            print "$$: Virtual Memory: $FB_CacheMemoryVSZ\tReal Memory: $FB_CacheMemoryRSS\n\n";
        }
        exit;

        sub Display_Mem_Usage {
            # VSZ is size in KBytes of the virtual memory ( VSZ * 1024 )
            # RSS is size in pages of real memory ( 1024 * RSS )
            my $cpid = shift;
            my $name = shift;
            my $from = shift;
            my $var  = "";
            my $fh;
            my $arg = qq| -o "vsz rssize" -p $cpid|;
            open( $fh, "-|", "/bin/ps $arg" )
                or die "Display_Mem_Usage: Not open \'$arg\': $!";
            while (<$fh>) { $var .= $_; }
            close $fh;
            my $rno = my @ref = split( /\n/, $var );
            if ( $rno < 2 ) { return ( -1, -1 ); }
            my $info = join( " ", split " ", $ref[1] );
            my ( $vmem, $rmem ) = ( split( /\ /, $info ) );
            return ( $vmem, $rmem );
        }
        __END__
        AIX:# perl ckmem.cgi
        Before: Virtual Memory: 776     Real Memory: 852
        After:  Virtual Memory: 21904   Real Memory: 21980
        41030:  Virtual Memory: 21908   Real Memory: 21984
        40244:  Virtual Memory: 424     Real Memory: 404    ## Child
        41030:  Virtual Memory: 21912   Real Memory: 21900
        AIX:#

        Linux:# perl ckmem.cgi
        Before: Virtual Memory: 5288    Real Memory: 1928
        After:  Virtual Memory: 26136   Real Memory: 22804
        26362:  Virtual Memory: 26128   Real Memory: 22804
        26362:  Virtual Memory: 26128   Real Memory: 22808
        26369:  Virtual Memory: 26128   Real Memory: 21600  ## Child
        Linux:#

        Regards...Ed

        "Well done is better than well said." - Benjamin Franklin

Re: Design advice: Classic boss/worker program memory consumption
by oiskuu (Pilgrim) on May 21, 2014 at 16:42 UTC

    Basically, you want a database that's not on the parent heap. How about this:

    use GDBM_File;
    tie( my %dataset, 'GDBM_File', 'tmp.db', &GDBM_NEWDB, 0600 ) or die;
    ...

    I've assumed your dataset needs no serialization.

      A potential issue with that design is that now you have the issue of multiple read/write access to a single file ... locking, concurrency and all of that.   Which could become quite messy.

      I know that it is somewhat counter-intuitive to have a parent that does nothing but mind the kids, but I actually find it easier to do it that way because of the issue of separation-of-concerns, or in this case, of one-thing-to-wait-on.   The parent owns the kids, minds the kids, wipes their little noses, and that’s it.   The kids connect to their service-provider older brother, and send results to their service-consumer older sister.   Exactly how you IPC that stuff together is really up to you ... “tim toady.”   But, now, each process (including the parent) basically has only one concern, one thing to worry about at any particular time, and a very uncomplicated internal structure.   The processing will scale-up or scale-down easily, and be quite reliable.

      I've assumed your dataset needs no serialization.

      Unfortunately, it would. The work units are moderately complex data structures. Assembling them directly into a tied hash isn't really practical.
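      If the structures do round-trip through Storable, the serialization layer over the tied hash is small. A sketch (assuming GDBM_File and Storable are available; the sub names are made up, and MLDBM on CPAN wraps exactly this pattern for you):

```perl
use strict;
use warnings;
use GDBM_File;
use Storable qw(freeze thaw);

tie( my %db, 'GDBM_File', 'tmp.db', &GDBM_NEWDB, 0640 ) or die "tie: $!";

# Work units are nested structures, so freeze/thaw at the DBM boundary:
sub put_unit { my ( $key, $unit ) = @_; $db{$key} = freeze($unit) }
sub get_unit { my ($key) = @_; return thaw( $db{$key} ) }
```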
