http://www.perlmonks.org?node_id=971356

pawan68923 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have to modify a large and complex system implemented in Perl to use parallel processing. Each child process or thread is expected to use around 1 GB of memory and will run for around 1-3 hours on a Solaris server. The parent's total execution time would be around 12-14 hours.

My question is: which do you recommend, ithreads or multiprocessing using fork()? Hundreds of processes or threads need to be created, with a concurrency limit of around 15 at a time, to loop over a list of 130-150 jobs. Considering the amount of memory it is going to use and the execution duration, which approach would you suggest? Reliability of the overall processing is important since it will run in a production environment.

Thanks!

Best regards, Pawan

Re: ithreads or fork() what you recommend?
by BrowserUk (Patriarch) on May 18, 2012 at 20:47 UTC
    Considering the amount of memory it is going to use and the execution duration, which approach would you suggest?

    These are the wrong criteria. The details you've provided are insufficient for anyone to reach a reasoned conclusion -- though many will try, based upon their own preferences, prejudices and dogmas.

    The amount of memory your subtasks use is irrelevant provided that the total concurrent memory requirement is within the capabilities of your hardware. The same applies whether the subtasks are implemented as threads or processes.

    The duration for which the subtasks run is equally irrelevant. Threads and processes are equally capable of running reliably for long periods.

    The kind of criteria you should be considering:

    Do your subtasks need read-write access to shared data? If not, why are you considering threads?

    Is the 1GB of data per subtask different for each subtask, or shared by all the subtasks? If the latter, then you may well find that the process model is more economical because of copy-on-write; but be aware of subtle internal factors that can cause apparently 'read-only' references to COW data to induce copying.

    For example: if you load shared numeric data from a file, it gets stored initially as strings. If your subprocesses then use that data in a numeric context, the PV will get upgraded to an NV or IV and the 4096-byte page containing that value will be copied. And each time a new subprocess accesses that same value, that page will be copied again. And if every one of your 100+ subprocesses accessed every value -- or just one value on each page -- then the entire shared memory would end up getting copied piecemeal 100+ times. (This scenario can be avoided by touching those numeric strings in a numeric context before spawning the subprocesses, as sketched below; but there are other, more subtle scenarios that can lead to similar effects.)
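    To make that concrete, here is a minimal sketch of the pre-touch fix. (The file name, the child count and the children's 'work' are invented for the example.)

    #! perl -slw
    use strict;

    ## Load the 'shared' data in the parent.
    open my $in, '<', 'bigdata.txt' or die $!;
    my @shared = <$in>;
    close $in;

    ## Touch every value in a numeric context *before* forking, so the
    ## string->number upgrade dirties each page once, in the parent.
    my $touch;
    $touch = 0 + $_ for @shared;

    for ( 1 .. 10 ) {
        my $pid = fork() // die "fork failed: $!";
        next if $pid;                   ## parent: spawn the next child

        ## child: numeric reads of @shared no longer copy COW pages
        my $sum = 0;
        $sum += $_ for @shared;
        exit 0;
    }
    wait for 1 .. 10;                   ## parent reaps all ten children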

    The bottom line is, if your application can be written to use processes without having to jump through hoops to make it work, and you are more comfortable with processes, then use them.

    If, however, you believe your application could benefit from threads, or be easier to write using them, then don't be put off by the threads-are-spelt-F_O_R_K dinosaurs. For many applications, threads are easier to use and reason about. Supply a few more details and we can compare possible solutions.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Hi BrowserUk,

      Basically there is not much shared data between sub tasks, and I can avoid much of what there is. The only thing they need to share (both read and write), via a file, is a count of how many processes of one type are executing -- just a number, e.g. whether there are 5 or 6 active sub tasks of one category. Based on the max concurrency limit for the overall tasks and for each sub task type, the parent can start a new sub task and increment that number in the shared file. The read and write operations on the shared file will be over in a fraction of a second. To avoid any deadlock, I can wait a few seconds (a while loop with an exit condition) in case a process is not able to read or write the shared file when it wants to. For pure read operations on the shared files, I feel there is no need to worry about synchronization between sub tasks; a simple read retry would be sufficient.
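      Something like this rough sketch is what I have in mind -- the counter file name and sub task type are made up, and it assumes the counter file already exists, seeded with a 0:

      #! perl -slw
      use strict;
      use Fcntl qw( :flock :seek );

      ## Atomically add $delta to the count held in $file; returns the new
      ## count. flock() keeps concurrent readers and writers safe.
      sub adjust_count {
          my( $file, $delta ) = @_;
          open my $fh, '+<', $file or die "open $file: $!";
          flock $fh, LOCK_EX or die "flock $file: $!";
          chomp( my $count = <$fh> // 0 );
          $count += $delta;
          seek $fh, 0, SEEK_SET;
          truncate $fh, 0 or die "truncate $file: $!";
          print $fh $count;                 ## -l supplies the newline
          close $fh;                        ## releases the lock too
          return $count;
      }

      adjust_count( 'typeA.count', +1 );    ## sub task of this type started
      ## ... do the work ...
      adjust_count( 'typeA.count', -1 );    ## and finished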

      Based on the detailed explanation you have given, I feel that multiprocessing using fork() would be more appropriate than threads. I thought of using threads for only one reason: it would have avoided significant code change while still giving me the benefit of parallel processing of the sub tasks.

      To answer the query about the system: it is a high-end Sun server with 40+ CPUs and 48 GB RAM, running Solaris 10. The Perl modules use the APIs of an enterprise product to perform various operations related to that product (Perl is handling both the automation and the complex business logic), using an input feed from a CSV file.

      Currently it is a mostly sequential approach, with parallel processing only at the product API level using fork(). I have to change it to end-to-end parallel processing (it is possible to logically group the sub tasks) to reduce the processing time, which depends heavily on the enterprise product. But I see parallel processing giving around a 40-50% (10 hour) reduction in overall processing time, hence this question.

      I have to confess, I learned some really deep things from the answers given to my question! And now I feel that fork() would be the better option in this case, with the only overhead being a lot more code to write to get it enabled :-).

      Thanks a lot to you and the others for the valuable suggestions and the insight into these parallel processing options using threads and fork().

      Best regards, Pawan

        pawan68923,

          ...the only thing they need to share (both read and write), via a file, is a count of how many processes of one type are executing...

        I would have the parent keep these counters in memory. From your description, the parent starts the processes and maintains the counts. Why use a file?
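        For instance, a rough sketch of the idea -- the job list, the two categories and run_job() are invented for the example:

        #! perl -slw
        use strict;
        use POSIX ':sys_wait_h';

        my %MAX_PER_TYPE = ( typeA => 5, typeB => 6 );
        my %running;    ## pid  => type of the job it is running
        my %count;      ## type => number currently running

        my @jobs = map { { id => $_, type => $_ % 2 ? 'typeA' : 'typeB' } } 1 .. 20;

        for my $job ( @jobs ) {
            ## Block until this job's category has a free slot.
            reap( 1 ) while ( $count{ $job->{type} } // 0 ) >= $MAX_PER_TYPE{ $job->{type} };

            my $pid = fork() // die "fork failed: $!";
            if( $pid == 0 ) {                           ## child
                run_job( $job );
                exit 0;
            }
            $running{ $pid } = $job->{type};            ## parent: count in memory
            ++$count{ $job->{type} };
        }
        reap( 1 ) while %running;                       ## wait for the stragglers

        ## Reap finished children; blocks for the first one if $block is true.
        sub reap {
            my( $block ) = @_;
            while( ( my $pid = waitpid( -1, $block ? 0 : WNOHANG ) ) > 0 ) {
                --$count{ delete $running{ $pid } };
                $block = 0;                             ## then just drain the rest
            }
        }

        sub run_job { my( $job ) = @_; sleep 1 + $job->{id} % 3 }   ## stand-in work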

        Some things to consider:

        • Test your theory. That many CPUs may act differently than you (or I) might think. I have many times assumed a solution, only to find that it didn't scale well. You could have dependencies that you don't even know exist.
        • Always lock, even if it's "read only". 'flock' is trivial compared to the time spent fixing a solution that now has to "read and write" even one thing.

        Sounds like a massive undertaking -- should be fun!

        Good Luck!

        "Well done is better than well said." - Benjamin Franklin

        Based on the detailed explanation you have given, I feel that multiprocessing using fork() would be more appropriate than threads. I thought of using threads for only one reason: it would have avoided significant code change while still giving me the benefit of parallel processing of the sub tasks.

        Hm. Nothing in the sparse details you've outlined gives me cause to reach that conclusion; especially if -- as you've suggested -- using fork would require a substantial re-write.

        Let's say the basic structure of your current serial application is something like:

        #! perl -slw
        use strict;

        use constant {
            TOTAL_JOBS => 130,
        };

        for my $job ( 1 .. TOTAL_JOBS ) {
            open my $in, '<', 'bigdata.' . $job or die $!;
            my @localData = <$in>;
            close $in;

            ## do stuff with @localData
        }

        Then converting that to concurrency using threads could be as simple as:

        #! perl -slw
        use strict;
        use threads;

        use constant {
            TOTAL_JOBS     => 130,
            MAX_CONCURRENT => 40,
        };

        for my $job ( 1 .. TOTAL_JOBS ) {
            async {
                open my $in, '<', 'bigdata.' . $job or die $!;
                my @localData = <$in>;
                close $in;

                ## do stuff with @localData
            };

            sleep 1 while threads->list( threads::running ) >= MAX_CONCURRENT;
            $_->join for threads->list( threads::joinable );
        }

        sleep 1 while threads->list( threads::running );
        $_->join for threads->list( threads::joinable );

        But I see parallel processing giving around a 40-50% (10 hour) reduction in overall processing time, hence this question.

        Given the capacity of the hardware you have available, I could well see the above reducing the runtime to less than 5% of the serial version; though the devil is in the details you have not provided.

        Of course, using Parallel::ForkManager should allow a very similar time reduction, using a very similar minor modification of the existing code.
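        For comparison, an (untested) sketch of that Parallel::ForkManager version, reusing the same invented file names as above:

        #! perl -slw
        use strict;
        use Parallel::ForkManager;

        use constant {
            TOTAL_JOBS     => 130,
            MAX_CONCURRENT => 40,
        };

        my $pm = Parallel::ForkManager->new( MAX_CONCURRENT );

        for my $job ( 1 .. TOTAL_JOBS ) {
            $pm->start and next;    ## parent: move straight on to the next job

            ## child: same body as the serial version
            open my $in, '<', 'bigdata.' . $job or die $!;
            my @localData = <$in>;
            close $in;

            ## do stuff with @localData

            $pm->finish;            ## child exits here
        }
        $pm->wait_all_children;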

        Why you feel that using fork should require a substantial re-write is far from obvious from the scant details you've provided. Ditto, the need for file-based counting and locking.

Re: ithreads or fork() what you recommend?
by Eliya (Vicar) on May 18, 2012 at 20:10 UTC

    It would help if you could describe a bit more what the processes or threads need to be doing. For example, do they need to share data, or synchronize themselves somehow? If so, this would be a pro-thread argument, because these things are more difficult to do with independent processes. Otherwise, fork is likely the more solid/stable solution.
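    As a toy example of what such sharing looks like with ithreads -- a counter that all threads update under a lock:

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;

    my $active : shared = 0;

    my @workers = map {
        threads->create( sub {
            { lock $active; ++$active }         ## lock released at block end
            sleep 1;                            ## placeholder work
            { lock $active; --$active }
        } );
    } 1 .. 5;

    $_->join for @workers;
    print "all joined; active = $active";       ## prints 0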

      Thanks Eliya! Best regards, Pawan
Re: ithreads or fork() what you recommend?
by Anonymous Monk on May 18, 2012 at 20:02 UTC

    Depends, which version of perl for which platform?

    In general, based on age and simplicity, fork ought to be more reliable, if it works for your purposes (or if forks works), because if a forked child segfaults it shouldn't affect the parent, whereas if a child thread segfaults, it will likely take the parent down with it.

    Vague, I know :)
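
    Still, a quick sketch demonstrates the process-isolation half of that claim -- here the "crash" is simulated by the child signalling itself:

    #! perl -slw
    use strict;

    my $pid = fork() // die "fork failed: $!";
    if( $pid == 0 ) {
        kill 'SEGV', $$;    ## simulate the child crashing
        exit 0;             ## never reached
    }
    waitpid $pid, 0;
    printf "child died from signal %d; parent is still fine\n", $? & 127;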

      As I have replied below, after analyzing the answers provided to the question, I feel multiprocessing using fork() would be more appropriate for the task.

      Thanks! Best regards, Pawan
Re: ithreads or fork() what you recommend?
by sundialsvc4 (Abbot) on May 20, 2012 at 22:31 UTC

    Another consideration for a long-running problem of this size is ... “okay, we hope that something will not take down one of the runners who are working simultaneously on this problem, of course, but what if something does? How much of the work will be lost ... or maybe a better way of saying it is ... how far will the shards of wreckage fly, and how many unrelated things might they hit?”

    A process is fairly well isolated from all other processes. It is very advantageous to arrange things so that, even though they might well employ read-only access to common data, they have their own address space, their own file-handles, maybe even (on a cluster) their own processor-affinity characteristics. They can be independently started and restarted. A thread, by contrast, clearly shares most resources with its companions. I like to think of a process as the primary container for a big unit of work, which unit of work might be (as its natural common-sense definition may indicate) “multi-threaded” in nature. I like to compartmentalize any “sharing” between processes such that it is read-only as much as possible; otherwise, it is built using pipes and queues, and designed to deal with what we mainframe jocks called an ABEND.

    The model that I like to use for big, long-running activities is one that has been around forever: batch jobs. x jobs are running simultaneously within a management framework that I didn’t write, and the remaining y are queued. Choose your language-of-choice in which to implement them.

      The model that I like to use for big, long-running activities is one that has been around forever: batch jobs. x jobs are running simultaneously within a management framework that I didn’t write, and the remaining y are queued.

      Yes, after analyzing all the advantages and disadvantages of processes and threads, I also felt individual processes are the best option... it cannot take the risk of all jobs failing because one job fails (it even has to avoid a single failure)... huge data is loaded by the main process, which would not be an ideal scenario for starting threads... and batch is a simple, tested concept which works. So I am on my way to getting this done using processes.

      Choose your language-of-choice in which to implement them.

      If it had been from scratch, I would have considered this option, but in this case everything is written in Perl, and written well (except for the multiprocessing part); it has been working reliably for the past 6 years (with agile-mode updates).