http://www.perlmonks.org?node_id=710237

haidut has asked for the wisdom of the Perl Monks concerning the following question:

OK, another try with tags as suggested :-)

Hi all,

Sorry if this is a dumb or redundant question, but I couldn't find a definitive answer anywhere, so I decided to ask for the collective wisdom.

I am working on a data classification and machine learning project, so I have a large data set that I need to process. The job will run on a multi-core CPU, and since the data set items are independent, the set can be split into multiple units for processing in order to take advantage of all the cores.

Obviously the first things that come to mind are threads and forking. I wrote a version based on forking and it works fine, but it is a RAM hog: when you fork, every new process is a copy of the parent, and the parent in my case is quite large because it loads an AI model that consumes about 1GB of RAM. So each child becomes a 1GB monster, and I run the risk of either thrashing the swap, which kills performance, or running out of RAM altogether if another process kicks in.
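To make the forking approach concrete, here is a minimal sketch of that kind of split. It is not my actual script: `@items`, `$workers`, and `classify()` are hypothetical stand-ins for the real data set, the number of child processes, and the per-item work, and the 1GB model load is left out entirely.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: @items is the data set, classify() is the per-item work.
my @items   = (1 .. 16);
my $workers = 4;

sub classify { my ($item) = @_; return $item * 2 }

my @pids;
for my $w (0 .. $workers - 1) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: handle every $workers-th item, starting at offset $w.
        for (my $i = $w; $i < @items; $i += $workers) {
            classify($items[$i]);
        }
        exit 0;
    }
    push @pids, $pid;    # Parent: remember the child so we can wait for it.
}

# Parent: wait for every child and count the ones that exited cleanly.
my $ok = 0;
for my $pid (@pids) {
    waitpid($pid, 0);
    $ok++ if $? == 0;
}
print "$ok of $workers workers finished cleanly\n";
```

In this sketch each child only computes and exits; getting results back to the parent (pipes, files, and so on) is a separate problem that I have left out.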

With threads it seems that it would be easier, since threads have access to global variables defined in the parent, so all spawned threads would share the same AI model and I wouldn't have multiple 1GB copies of the parent. In the thread case I obviously have to worry about locking, but that's not an issue as I can implement it. The bigger issue is that Perl threads seem to live INSIDE the spawning process, so they don't get scheduled on separate CPUs but simply compete for run time within the spawning process. I ran some tests, and indeed on Linux the "top" command shows only one Perl process running on one of the 8 available CPUs even though I have 8 threads running. So with threads I am not achieving any speedup on multi-core CPUs.

Does anybody know if Perl supports kernel threads that the OS can then schedule on multiple CPUs? I read the Perl thread tutorial, and all it says is that each thread loads a new Perl interpreter. But from what I see, that doesn't result in a new runnable object, separate from the spawning process, that can be scheduled to run on a CPU other than the one used by the spawning process. That said, are there any modules on CPAN that provide true parallelization of tasks so that Perl can take advantage of multiple CPUs? Any help is appreciated. Thanks.