You are probably forking too much.
Instead of forking in the innermost loop, do it at the second or third level, so that instead of launching
6**11 little processes, you launch just a few hundred.
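
To make the idea concrete, here is a minimal sketch in Python (chosen only for illustration, on a Unix-like system where `os.fork` is available). The names `run_chunked`, `work` and `n_workers` are hypothetical and not from your code. Each child takes a slice of the outermost loop and runs all the inner loops sequentially, so the number of processes equals `n_workers` no matter how deep the nesting is.

```python
import itertools
import os


def run_chunked(outer_values, inner_values, depth, work, n_workers=4):
    """Fork one child per slice of the outermost loop.

    `work` is a hypothetical callback that receives one tuple of loop
    indices; it stands in for the body of the innermost loop.
    """
    children = []
    for w in range(n_workers):
        # Every n_workers-th value of the outer loop goes to this child.
        my_outer = outer_values[w::n_workers]
        pid = os.fork()
        if pid == 0:
            # Child: run the remaining (depth - 1) loops without forking.
            try:
                for first in my_outer:
                    for rest in itertools.product(inner_values, repeat=depth - 1):
                        work((first,) + rest)
            finally:
                os._exit(0)  # never fall back into the parent's code
        children.append(pid)

    # Parent: wait for every child before returning.
    for pid in children:
        os.waitpid(pid, 0)
    return len(children)
```

With `depth=11` and six values per level this would still cover all 6**11 combinations, but with only `n_workers` forks.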
Update: Oh, sorry, I didn't read your post fully. It seems you have already tried to do that. If your problem is getting the results back, the simplest solution in my experience is to have every process write its part of the computation into a file and then have a last stage where all the partial outputs are merged. In most cases this is also good enough in terms of computational cost.
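
A rough sketch of the file-based approach, again in Python for illustration and assuming a Unix-like system. `compute_part`, `run_with_files` and the `part_N.json` naming are made up for the example, and JSON is just one possible on-disk format. Each child writes its partial result to its own file; the parent waits for all children and then merges the files in a final stage.

```python
import json
import os
import tempfile


def compute_part(chunk):
    # Hypothetical stand-in for the real computation.
    return {str(x): x * x for x in chunk}


def run_with_files(chunks):
    out_dir = tempfile.mkdtemp(prefix="partials_")
    pids = []
    for i, chunk in enumerate(chunks):
        pid = os.fork()
        if pid == 0:
            # Child: write this part of the computation to its own file.
            status = 1
            try:
                path = os.path.join(out_dir, f"part_{i}.json")
                with open(path, "w") as fh:
                    json.dump(compute_part(chunk), fh)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)

    # Parent: wait until every child has finished writing.
    for pid in pids:
        os.waitpid(pid, 0)

    # Last stage: merge all the partial outputs.
    merged = {}
    for name in sorted(os.listdir(out_dir)):
        with open(os.path.join(out_dir, name)) as fh:
            merged.update(json.load(fh))
    return merged
```

The merge stage only needs one partial file in memory at a time, which is what makes this variant attractive when the combined result is large.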
The alternative (which is what your code is really doing under the hood by using the on-finish hooks) is to have each slave process serialize its partial results and pipe them to the master process, which then merges all of them. You avoid the cost of writing and reading the intermediate data to the file system, but on the other hand everything has to be in RAM.
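
For comparison, a minimal sketch of the pipe-based variant, in Python and assuming a Unix-like system. `compute_part` and `run_with_pipes` are hypothetical names, and `pickle` is just one way to serialize. This is not how your on-finish hooks are implemented, only an illustration of the general pattern: each child serializes its partial result into a pipe, and the parent reads and merges them in memory.

```python
import os
import pickle


def compute_part(chunk):
    # Hypothetical stand-in for the real computation.
    return sum(x * x for x in chunk)


def run_with_pipes(chunks):
    readers = []
    for chunk in chunks:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: serialize the partial result into the pipe and exit.
            os.close(r)
            try:
                with os.fdopen(w, "wb") as out:
                    pickle.dump(compute_part(chunk), out)
            finally:
                os._exit(0)
        # Parent: keep only the read end of this child's pipe.
        os.close(w)
        readers.append((pid, r))

    # Parent: read each partial result and merge it, all in RAM.
    total = 0
    for pid, r in readers:
        with os.fdopen(r, "rb") as inp:
            total += pickle.load(inp)
        os.waitpid(pid, 0)
    return total
```

Note that the parent reads from each pipe before waiting on the child; waiting first could block forever if a child's output is larger than the pipe buffer.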