Reliable asynchronous processing

by Codon (Friar)
on Jul 07, 2005 at 21:31 UTC
Codon has asked for the wisdom of the Perl Monks concerning the following question:

I am on the eve of designing an extensive application that will need to be extremely efficient. I am going to need to do asynchronous (parallel) processing. The results of these sub-processes will need to be collected and "collated" by the parent process and sent back to the user. I am looking for some technology that will be able to support this.

I cannot use fork() because of the overhead of spawning (and reaping) multiple real processes. Perl threads are not ready for Prime Time. This seems to leave me in a bit of a bind as to how to solve this.

Do any of the monks have suggestions for technology that can support this model?

Update: I am concerned about the overhead of fork() because of the amount of data that I intend to have cached in shared memory. When process reaping occurs, Perl's garbage collection attempts to free memory that is really shared memory. The Linux kernel will then attempt to copy all of the shared memory for Perl so Perl can clear it. The concern is less with the actual fork() as much as it is with the reaping of children.

Ivan Heffner
Sr. Software Engineer, DAS Lead
WhitePages.com, Inc.

Re: Reliable asynchronous processing
by Transient (Hermit) on Jul 07, 2005 at 21:32 UTC
    Java threads work just fine. I have used them in a situation similar to this to make real-time "availability" queries to various sources and combine the input to present the available products back to a user.

    Update: Edited title to remove the suffix "(preferably Perl)" (after OP was edited?) - apparently Java is either not construed as a technology or I'm lying. Go figure.

      Perhaps Java is not construed as reliable or appropriate by the people here, or they think the OP was expecting a Perl-based solution. :)

Re: Reliable asynchronous processing
by shiza (Hermit) on Jul 07, 2005 at 21:40 UTC
    I'm not quite sure if this would be suitable, but would it be possible to separate each sub-process into its own process? Each individual process could then be scheduled to run at certain intervals (or triggered by your app) and store results (in a database or file) that could then be retrieved by your user-facing application.
Re: Reliable asynchronous processing
by eyepopslikeamosquito (Canon) on Jul 07, 2005 at 21:48 UTC

    I cannot use fork() because of the overhead of spawning (and reaping) multiple real processes.
    Why not pre-fork them and leave them up, communicating with them via a shared memory "scoreboard" (a la Apache)?

    Also, you might find POE useful.
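
    The pre-fork idea can be sketched with nothing but core socketpair/fork: spawn workers once at startup, talk to each over its own socketpair, and reap them only at shutdown. This is a hedged, minimal illustration; the worker count and the line-based task format are invented for the example.

```perl
use strict;
use warnings;
use Socket;
use IO::Handle;

# Pre-forked worker pool: workers are created once, up front, and stay
# alive serving tasks over a socketpair until the parent closes its end.
my @workers;
for my $id (1 .. 3) {
    socketpair(my $parent_end, my $child_end, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";
    $parent_end->autoflush(1);
    $child_end->autoflush(1);
    defined(my $pid = fork) or die "fork: $!";
    if ($pid == 0) {                        # worker: serve tasks until EOF
        close $parent_end;
        while (my $task = <$child_end>) {
            chomp $task;
            print $child_end "worker $id did: $task\n";
        }
        exit 0;
    }
    close $child_end;
    push @workers, { pid => $pid, fh => $parent_end };
}

my $fh = $workers[0]{fh};
print $fh "task A\n";                       # dispatch a task to worker 1
my $reply = <$fh>;
print $reply;                               # worker 1 did: task A

close $_->{fh} for @workers;                # shutdown: EOF, then reap once
waitpid $_->{pid}, 0 for @workers;
```

    A real scoreboard (a la Apache) would replace the one-socket-per-task round trip with a shared-memory status table, but the lifecycle (fork at startup, reap at shutdown) is the same.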

Re: Reliable asynchronous processing
by perrin (Chancellor) on Jul 07, 2005 at 21:49 UTC
    If you're on Linux, forking is probably more efficient than you think it is. You'd have to be doing almost nothing in your actual processing for it to have a big impact.

    If you're determined to have the most efficient approach possible, it would most likely be a single process using non-blocking I/O. There are some event modules on CPAN, like Event-Lib. At the point where you are worried about the overhead from forking though, you probably should be looking at C.
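
    The single-process event style can be sketched with core modules only: one loop watches several handles with IO::Select and services whichever is ready. The two pipes below stand in for real sockets or child processes; names are invented for the example.

```perl
use strict;
use warnings;
use IO::Select;
use IO::Handle;

# One process multiplexing several input sources with select(2),
# via the core IO::Select wrapper.
my (%label, @seen);
my $sel = IO::Select->new;
for my $name (qw(alpha beta)) {
    pipe(my $r, my $w) or die "pipe: $!";
    $w->autoflush(1);
    print $w "hello from $name\n";
    close $w;
    $sel->add($r);
    $label{$r} = $name;                     # remember which source this is
}

while ($sel->count) {
    for my $fh ($sel->can_read) {           # blocks until something is ready
        if (defined(my $line = <$fh>)) {
            chomp $line;
            push @seen, "[$label{$fh}] $line";
        } else {
            $sel->remove($fh);              # EOF: stop watching this handle
            close $fh;
        }
    }
}
print "$_\n" for sort @seen;
```

    Event-Lib, Event, and POE all wrap the same underlying idea with callback registration instead of an explicit loop.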

      Event processing is an efficient and time-tested approach to asynch processing. There is an excellent discussion of this in the Event module.

      Remember: There's always one more bug.
Re: Reliable asynchronous processing
by jdhedden (Deacon) on Jul 07, 2005 at 21:51 UTC
    From your analysis, what specifically is it about Perl threads that caused you to drop Perl from consideration in your design?

    Remember: There's always one more bug.
Re: Reliable asynchronous processing
by Corion (Pope) on Jul 07, 2005 at 21:58 UTC

    Depending on how desperate you are, and how much time you can spend on programming instead of simply purchasing more hardware (or time), there are a few other possibilities:

    Use Coro. It implements cooperative multitasking in a much more programmer friendly way than POE. In fact, you will likely not need to rewrite your existing codebase, as it hooks into the IO stuff of Perl as well. It has the slight drawback that DBI queries still block your whole application (and every "thread" in it). There is a (POE) module to hand out long running queries to a separate process, and likely you will be able to abuse that module under Coro too. Coro is very picky about your version of glibc - you need to have version 6 (or higher).

    Use Apache. Yes. Use Apache. Apache has a very extensive multithreading/multitasking framework and can be used to serve data/process/pass data, not only on Port 80. For example, mock wrote a mailserver using the Apache API (I think Apache::SMTP). If you have the time and manpower, using the Apache framework can give you lots of C-powered leverage while you still use Perl.

    Updated: Added "on programming" in the first sentence to clarify where the time is to be allocated, at programming time or runtime

Re: Reliable asynchronous processing
by BrowserUk (Pope) on Jul 07, 2005 at 22:34 UTC
    Perl threads are not ready for Prime Time.

    What does that mean? Who said it? And how will you know unless you try it?

    Does it get easier? Of course, the devil is in the details, but then, you didn't give us any.

    #! perl -slw
    use strict;

    use threads;
    use Thread::Queue;
    use Data::Dumper;

    our $QMAX ||= 1000;
    our $TMAX ||= 3;
    our $N    ||= 1000000;

    my $Q = Thread::Queue->new;

    sub thread {
        my $tid = threads->self->tid;
        for( 1 .. $N ) {
            $Q->enqueue( join ':', $tid, int rand( 10 ) );
            # Throttle the producer if the queue grows too deep
            select undef, undef, undef, 0.01 while $Q->pending > $QMAX;
        }
        $Q->enqueue( undef );
    }

    threads->new( \&thread )->detach for 1 .. $TMAX;

    # Collate the results in the main thread; each undef marks one
    # producer finishing.
    my %collate;
    for ( 1 .. $TMAX ) {
        while( my $data = $Q->dequeue() ) {
            my( $src, $value ) = split ':', $data;
            $collate{ $src }{ $value }++;
        }
    }

    print Dumper \%collate;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      Perl threads are not ready for Prime Time.
      What does that mean? Who said it? And how will you know unless you try it?

      threads::shared says it:

      BUGS

      bless is not supported on shared references. In the current version, bless will only bless the thread local reference and the blessing will not propagate to the other threads. This is expected to be implemented in a future version of Perl.

      Does not support splice on arrays!

      Taking references to the elements of shared arrays and hashes does not autovivify the elements, and neither does slicing a shared array/hash over non-existent indices/keys autovivify the elements.

      share() allows you to share $hashref->{key} without giving any error message. But the $hashref->{key} is not shared, causing the error "locking can only be used on shared values" to occur when you attempt to lock $hashref->{key}.

      DBI says it:

      Threads and Thread Safety (...) Using DBI with perl threads is not yet recommended for production environments. For more information see <http://www.perlmonks.org/index.pl?node_id=288022>

      liz says it on PerlMonks: 288022 (the node referenced by DBI above).

      rt.perl.org says it.

      Considering those references, I wouldn't feel comfortable using Perl ithreads in an application designed to accommodate high transaction volumes. I understand that there have been a lot of fixes to ithreads since 5.8.1 (current at the time of liz's post), but there's still no COW. That alone makes it very undesirable.

      -Colin.

      WHITEPAGES.COM | INC

        Let's take those one at a time.

        • Liz's node says: "Perl's ithreads are not light.".

          That's true, but it could hardly be otherwise.

          It's like saying that trucks weigh more than cars. It's true, but that doesn't stop trucks being useful or usable. It just says that you shouldn't pretend you are driving a car when you're driving a truck.

          Many of the other issues Liz raises are true limitations of iThreads, and indeed, she provides several modules that can be used to make many of these problems less apparent.

          But as you rightly point out, Liz's node was written at the time of 5.8.1, which, as those of us who have been following along know, was the very worst build ever for threads problems. It was worse than its predecessor and was very rapidly superseded by 5.8.2, which went a long way toward fixing many of the problems it introduced.

          Basing your judgment on Liz's post, is like saying "Perl can't do structured data", based on the docs for Perl4--it is wildly out of date.

          Since then, there have been several more builds which have each cleaned up outstanding bugs further until, in my estimation, Perl's threads are stable. That is, they (mostly) comply to their "specification".

          Note: That does not by any means say:

          1. They are perfect--nothing ever is!
          2. They are the easiest thing in the world to use--their specification unfortunately guarantees that they cannot be!
          3. That they are guaranteed bug-free--but nor is anything else in Perl.

          Many of the other issues that Liz raised in that article are non-issues when you stop expecting threads to act like forked processes.

          Threads and forks are different.

          Many of the issues Liz raises come directly from the fact that ithreads have been designed to work in a fork-like manner, i.e. duplicating all existing code and data at the point of thread creation.

          Indeed, if this were not done, many of the issues with ithreads--including the perceived need for COW--would disappear!

          If thread->create( <coderef>, ... ); simply created a new thread running a new interpreter running the coderef supplied, and left the programmer to decide what needed to be loaded into that interpreter and shared with that new interpreter, most of the issues would not exist.

          With the greatest respect to Liz, her Things you need to know before programming Perl ithreads has completely undone ithreads, because it continues to be cited as the de rigueur reason for not using threads, which means no one uses them, which means the issues with them never get addressed and little or nothing happens by way of improvement. It's a vicious circle that leads to the next issue.

        • Using DBI with perl threads is not yet recommended for production environments. For more information see <http://www.perlmonks.org/index.pl?node_id=288022>

          Notice how it references the same, out-of-date information.

          Indeed, in my (admittedly limited) experience, there is no problem using DBI in conjunction with iThreads, provided you use DBI from one thread only.

          The same is true of many other things like Tk.

          And whilst that may seem like a major restriction, in practice, using one thread to take care of your DBI interactions and another thread to maintain your user interface, whether a GUI or HTTP, is a very sane way to structure your application. In fact, even if both Tk and DBI could guarantee thread-safety on all platforms and with all DBs, and all DB interface libraries--which is unlikely to ever be true--I'd still recommend using separate threads for each anyway.
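
          The one-DBI-thread structure described here can be sketched with core Thread::Queue (it needs a threads-enabled perl). This is a hedged illustration: the real DBI handle would be created inside the worker thread, and the fake string result below stands in for an actual query.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# All database work confined to a single worker thread; other threads
# talk to it only through a request queue and a response queue.
my $req = Thread::Queue->new;
my $res = Thread::Queue->new;

my $db_thread = threads->create(sub {
    # my $dbh = DBI->connect(...);   # connect *inside* this thread only
    while (defined(my $sql = $req->dequeue)) {
        $res->enqueue("result for: $sql");  # would be a real query result
    }
});

$req->enqueue('SELECT 1');
$req->enqueue(undef);                       # undef tells the worker to exit
my $got = $res->dequeue;
print "$got\n";                             # result for: SELECT 1
$db_thread->join;
```

          The same request/response-queue shape works for a Tk thread or any other library that insists on living in one thread.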

          And that brings me to:

        • rt.perl.org

          If you look at those outstanding tickets, several of them, including the first half dozen on the list, are issues to do with 5005threads and have literally nothing to do with ithreads.

          It is also possible to produce a list of outstanding bugs for many other areas of Perl, but that isn't stopping those features from being used in production.

        • Finally, there are the limitations described in threads::shared.

          Most of these restrictions could be lifted.

          And if there was more demand for them, they quite probably would have been lifted by now. The skills of the p5p guys are certainly up to the task as most of them are not that complicated, but without demand, there is little incentive for the work to happen.

        Again, none of that means that iThreads is the perfect api or that I wouldn't like to change things--but then there are many other things in Perl5 that are less than perfect. IO state in globals, the object model, syntax inconsistencies etc., but none of these things prevent Perl being usable in production environments to good effect. It simply means that you have to work within and around them.

        I think iThreads, and their restrictions are the same. Work with them and they can greatly simplify many programming problems that are awkward, messy, non-intuitive and a maintenance headache to deal with using the alternatives.


Re: Reliable asynchronous processing
by Ultra (Hermit) on Jul 08, 2005 at 08:04 UTC

    I am concerned about the overhead of fork() because of the amount of data that I intend to have cached in shared memory. When process reaping occurs, Perl's garbage collection attempts to free memory that is really shared memory. The Linux kernel will then attempt to copy all of the shared memory for Perl so Perl can clear it. The concern is less with the actual fork() as much as it is with the reaping of children.

    Maybe I didn't understand your point well, but if you are sharing a big amount of data between children, why not keep a "master" process that feeds the children chunks of data (e.g. using pipes) when they need it, instead of keeping a copy of the cached data in each process's memory?
    Also, do you think that changing some data in one process automagically reflects in the other processes' state? Because that doesn't happen with fork.

    If I were in your shoes, I'd benchmark a small, memory-eating problem using perlthrtut, my implementation with fork, my implementation using select, and an implementation based on a single process (maybe with some forked helpers) using POE.
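
    The master-feeds-children suggestion above can be sketched with two plain pipes: one carrying chunks out to a pool of children, one carrying results back. The chunk contents and counts are invented for illustration; a real application would also worry about chunk sizing.

```perl
use strict;
use warnings;

# A single master holds the data and feeds chunks to a pool of children
# over one shared pipe; a second pipe carries results back to the master.
pipe(my $task_r, my $task_w) or die "pipe: $!";
pipe(my $res_r,  my $res_w)  or die "pipe: $!";

my @pids;
for my $id (1 .. 2) {
    defined(my $pid = fork) or die "fork: $!";
    if ($pid == 0) {                        # child: claim chunks until EOF
        close $task_w; close $res_r;
        while (my $chunk = <$task_r>) {
            chomp $chunk;
            print $res_w "child $id handled $chunk\n";
        }
        exit 0;                             # exit flushes the result pipe
    }
    push @pids, $pid;
}
close $task_r; close $res_w;

{ my $old = select $task_w; $| = 1; select $old; }  # one write per chunk
print $task_w "chunk $_\n" for 1 .. 10;
close $task_w;                              # EOF stops the children

my $done = 0;
$done++ while defined( my $line = <$res_r> );
close $res_r;
waitpid $_, 0 for @pids;
print "$done chunks processed\n";           # 10 chunks processed
```

    Which child claims which chunk is nondeterministic, which is exactly what you want for load spreading; no child ever holds more than the chunks it has claimed.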

    Dodge This!

      The data that would be shared is actually "static" data structures built from DB queries that need to be accessed multiple times per child per request. Trying to communicate over IPC for this sort of look-up would not be efficient for my needs. And when I say "static", I mean this data is to be read at start-up and will not be changed by any of the children (or parent) during the life of all processes / threads. I'm just starting on the architecture for this, so prototyping and benchmarks will consume a good portion of my time over the coming weeks. This was mostly a solicitation for technology suggestions.

      Ivan Heffner
      Sr. Software Engineer, DAS Lead
      WhitePages.com, Inc.
Re: Reliable asynchronous processing
by tphyahoo (Vicar) on Jul 08, 2005 at 09:30 UTC
Re: Reliable asynchronous processing
by SimonClinch (Chaplain) on Jul 08, 2005 at 10:04 UTC
    As has already been hinted at, it isn't necessary to spawn and reap threads during operation, only at startup and shutdown of the service as a whole.

    IPC::Semaphore and IPC::SysV (IPC = interprocess communication) contain all the technology you'll need for a master process to communicate with its living children (semaphores, which are part of the operating system, are the most efficient way to manage locking and triggering with shared memory).

    See also the System V IPC Chapter of Programming Perl by Larry Wall, Tom Christiansen, and Jon Orwant ISBN 0-596-00027-8 for a nice code example of how to use fork with shared memory and semaphores so that the processes can stay alive and remain in contact for control purposes.
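
    A hedged, minimal sketch of that pattern using Perl's built-in SysV calls: a shared memory segment survives fork, so parent and child see the same bytes without any copying at fork time. The segment size and offsets are arbitrary choices for the example; a real program would guard the segment with an IPC::Semaphore lock.

```perl
use strict;
use warnings;
use IPC::SysV qw(IPC_PRIVATE IPC_CREAT IPC_RMID S_IRUSR S_IWUSR);

# Parent creates a SysV shared memory segment and writes into it;
# the forked child reads the same segment and writes a reply back.
my $id = shmget(IPC_PRIVATE, 64, IPC_CREAT | S_IRUSR | S_IWUSR);
defined $id or die "shmget: $!";

shmwrite($id, "hello", 0, 5) or die "shmwrite: $!";

defined(my $pid = fork) or die "fork: $!";
if ($pid == 0) {                            # child: read, transform, reply
    shmread($id, my $buf, 0, 5) or die "shmread: $!";
    shmwrite($id, uc $buf, 32, 5) or die "shmwrite: $!";
    exit 0;
}
waitpid $pid, 0;

shmread($id, my $reply, 32, 5) or die "shmread: $!";
print "child wrote: $reply\n";              # child wrote: HELLO
shmctl($id, IPC_RMID, 0) or die "shmctl: $!";  # remove the segment
```

    Because the segment belongs to the kernel rather than to either process, reaping a child never touches it, which addresses the reaping concern in the original question.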

    One world, one people

Node Type: perlquestion [id://473244]
Approved by gryphon
Front-paged by tlm