What's the best way to fetch data from multiple sources asynchronously?

by xaprb (Scribe)
on Jan 01, 2007 at 18:39 UTC

xaprb has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on making innotop (a MySQL monitor) capable of monitoring multiple MySQL servers simultaneously. One of the bottlenecks is waiting for queries to come back from the servers; it's fine when there's just one server, but if the program is to refresh itself every second and there are many servers, it's a problem. I'd like to fire the queries all at once. There are other processes (parsers) I'd like to do this with too, so the parsers can run on the first query results while the others are still being generated, but that's really kinda optional; if I can get the queries to be asynchronous, I'll be happy enough with that.

My first thought was forking child processes, but (correct me if I'm wrong, as I'm new to this) the child processes can't alter data in the parent process.

Then I looked around the web. Other solutions came to mind: shared memory; opening a pipe from the child to the parent so the child can serialize the result and the parent can re-hydrate it; using Storable and having the child write to a file and the parent read from it; etc. These all strike me as horrific kludges that may not be safe or portable. It also looks like Perl's threads are not portable, according to what I've read online.

There are some CPAN modules that do things like this, but I'm picky: I don't want people to have to install a bunch of arcane modules to run this program (one or two is okay, but not an entire bundle). And it needs to be fast, stable and portable to at least Linux, FreeBSD and Windows. Stop me if I'm asking too much.

I'm willing to go down any of the above-mentioned roads, but I'd love to hear your thoughts on which will be the most fruitful. I'd hate to spend time on something that's not going to work out well in the end.

Your guidance is gratefully received -- thanks for reading this far!


Re: What's the best way to fetch data from multiple sources asynchronously?
by bsdz (Friar) on Jan 01, 2007 at 19:20 UTC
    I've implemented something similar using threads before. You can set up a job queue and have worker threads collect these jobs (SQL queries), execute them, then return the data to some shared data object such as an array or hash. This shared object might then be tied to your output screen (console or GUI?). It will certainly work on Windows and Linux. Which platforms do you wish to port your program to?
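
    A minimal sketch of that pattern, assuming threads, threads::shared and Thread::Queue; the host names, credentials and query are placeholders. Note one worker and one private $dbh per server, since DBI handles shouldn't cross thread boundaries:

        use strict;
        use warnings;
        use threads;
        use threads::shared;
        use Thread::Queue;
        use DBI;

        my %status :shared;    # server name -> latest result, read by the display

        # One job queue and one worker per server.
        my %queue = map { $_ => Thread::Queue->new } qw(db1 db2 db3);
        my @workers;
        for my $name (keys %queue) {
            push @workers, threads->create(sub {
                my $dbh = DBI->connect("DBI:mysql:host=$name", 'monitor', 'secret',
                                       { RaiseError => 1 });
                while (defined(my $sql = $queue{$name}->dequeue)) {
                    my $rows = $dbh->selectall_arrayref($sql);
                    lock %status;
                    $status{$name} = scalar @$rows;    # keep shared values simple
                }
            });
        }

        # Each refresh tick, fire the same query at every server at once:
        $_->enqueue('SHOW GLOBAL STATUS') for values %queue;
        $_->enqueue(undef)                for values %queue;   # shut workers down
        $_->join for @workers;
        { lock %status; print "$_: $status{$_} rows\n" for sort keys %status; }

    In a real monitor you'd keep the workers alive and enqueue a fresh batch on every refresh; the undef "poison pills" are only there to let this toy exit cleanly.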
      Did the thing you implemented use DBD::mysql? It's not very clear whether that module is thread-safe.
        I did use DBD::mysql and it didn't cause any multi-threading problems; at least the module's documentation suggests it is thread-safe. I may have kept a separate DB handle open in each thread, though I am not entirely sure, as it was several years ago.
Re: What's the best way to fetch data from multiple sources asynchronously?
by zentara (Archbishop) on Jan 01, 2007 at 19:22 UTC
    I can only give you general guidance, because I'm not well versed in the workings of MySQL servers, but it sounds like a good bet that you can set up multiple socket connections in threads. I would first look at the POE Cookbook; it probably has a ready-made example.

    Otherwise you have a couple of possibilities: IO::Select and threads. IO::Select will let you add socket filehandles, then loop through them and read from each as it replies. The drawback is that a huge transfer from one server will block the others until it finishes.
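
    For illustration, a sketch of that select loop, with made-up host names; it only shows the multiplexing, since actually speaking the MySQL wire protocol over raw sockets is a separate (large) job:

        use strict;
        use warnings;
        use IO::Select;
        use IO::Socket::INET;

        my %host_of;                       # fileno -> host, for labelling replies
        my @socks = map {
            my $s = IO::Socket::INET->new(PeerAddr => $_, PeerPort => 3306)
                or die "connect to $_: $!";
            $host_of{fileno $s} = $_;
            $s;
        } qw(db1 db2);

        my $sel = IO::Select->new(@socks);
        my %buf;
        while ($sel->count) {
            my @ready = $sel->can_read(10)
                or last;                   # nothing readable for 10s: give up
            for my $s (@ready) {
                if (sysread($s, my $chunk, 8192)) {
                    $buf{ $host_of{fileno $s} } .= $chunk;   # partial reply
                } else {
                    $sel->remove($s);      # EOF or error: this server is done
                    close $s;
                }
            }
        }
        print "$_: ", length $buf{$_}, " bytes\n" for sort keys %buf;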

    An alternative would be to have your main thread connect to the mysql server, then have it pass the socket filehandles off to a separate thread for reading. That way they won't block each other. See FileHandles and threads
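
    The handoff itself can be done by passing the descriptor number rather than the handle. A sketch, with host and port as placeholders (a real reader would speak the server's protocol rather than reading lines):

        use strict;
        use warnings;
        use threads;
        use IO::Socket::INET;

        # The main thread makes the connection...
        my $sock = IO::Socket::INET->new(PeerAddr => 'db1', PeerPort => 3306)
            or die "connect: $!";

        # ...and hands only the file descriptor to the reader thread, which
        # re-opens it, so two threads never fight over one Perl filehandle.
        my $reader = threads->create(sub {
            my $fd = shift;
            open my $fh, '+<&=', $fd or die "fdopen: $!";
            while (defined(my $line = <$fh>)) {
                print 'got ', length $line, " bytes\n";
            }
        }, fileno $sock);

        $reader->join;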


    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: What's the best way to fetch data from multiple sources asynchronously?
by BUU (Prior) on Jan 01, 2007 at 21:53 UTC
    One word: POE. POE is an excessively cool asynchronous framework that makes these sorts of things a breeze. The idea is that everything you do with POE is non-blocking (querying a web page, talking to a DB, and so forth), so you can launch as many of these as you want without the hassles of dealing with IPC or threads. Check out some of the examples in the cookbook for sample applications that demonstrate the principle. Then see CPAN's massive collection of POE::Components.
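
    Since DBI itself blocks, the usual POE trick is to push each query into a child process and collect the output as events. A rough sketch using POE::Wheel::Run, with hosts, credentials and query as placeholders:

        use strict;
        use warnings;
        use POE qw(Wheel::Run);
        use DBI;

        POE::Session->create(
            inline_states => {
                _start => sub {
                    my ($kernel, $heap) = @_[KERNEL, HEAP];
                    for my $host (qw(db1 db2)) {
                        my $wheel = POE::Wheel::Run->new(
                            Program => sub {
                                # Child process: blocking DBI is fine here.
                                my $dbh  = DBI->connect("DBI:mysql:host=$host",
                                                        'monitor', 'secret');
                                my $rows = $dbh->selectall_arrayref(
                                    'SHOW GLOBAL STATUS');
                                print "$host: ", scalar @$rows, " rows\n";
                            },
                            StdoutEvent => 'got_line',
                            CloseEvent  => 'child_done',
                        );
                        $kernel->sig_child($wheel->PID, 'reaped');
                        $heap->{wheel}{ $wheel->ID } = $wheel;
                    }
                },
                got_line   => sub { my $line = $_[ARG0]; print "result: $line\n" },
                child_done => sub { delete $_[HEAP]{wheel}{ $_[ARG0] } },
                reaped     => sub { },    # lets POE reap the children
            },
        );
        POE::Kernel->run;

    In practice you'd likely reach for a ready-made component such as POE::Component::EasyDBI rather than rolling this by hand.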

      I'm interested in this thread because I need to re-write something similar eventually and I want to use the best approach. I was totally sold on (well at least looking into) Thread::Queue until you posted this. Then I recalled talking to one of the POE authors on a #pike IRC channel and he'd sold me on that long ago. If I recall correctly, he was writing POE for pike — which seemed a little unnecessary since pike has call_out()s built in, but the POE features that made it cool even in pike are what made me want to check it out, even though I never did. (I'd link to it, but he either never finished, or it's not called POE in pike.)

      I know there are quite a number of official POE tutorials (some written by famous perl monks, IIRC), but I wish there were a really good POE tutorial here on this site. I bet it would even get tons of feedback as an RFC, because I bet there'd be a lot of interest in it.

      You know, perhaps I'll check out both approaches and see which I enjoy the most. I love perl.

      -Paul

Re: What's the best way to fetch data from multiple sources asynchronously?
by NetWallah (Canon) on Jan 01, 2007 at 19:19 UTC
    I have had excellent success with Thread and Thread::Queue.

    Sample code at This node. (Original code was Win32, but success has been reported on Linux - there is no Windows-specific code in the script).

    Update 1: I acknowledge my knowledge here is dated, and hence this advice is now somewhat obsolete (although it continues to work). Please see BrowserUk's advice below.
    I believe the sample code would work with minimal modifications using ithreads (threads) instead of Thread (the old Perl 5.005 threading module), since both modules are compatible with Thread::Queue. Details are in perlthrtut.

         "A closed mouth gathers no feet." --Unknown

Re: What's the best way to fetch data from multiple sources asynchronously?
by rodion (Chaplain) on Jan 02, 2007 at 04:47 UTC
    At work, we've had good results using select(). We've had code using it that has been running for at least 5 years; we've been able to add to it when we wanted, and we've not had any significant problems. The code has been running on 32- and 64-bit Linux systems, and even on our older BSDI boxes, where the OS thread support is broken.
      The problem is that a DBI database/statement handle is not a pipe or socket, so you can't simply call select() on it. The DBI does not specify a method of executing statements asynchronously, though perhaps drivers might support it. It looks like there is some way of doing it with POE that uses forked processes behind the scenes.
      Quite right. For databases you have to spin off a separate process to deal with the database, which communicates with the selecting process through a socket. Thus all the things you are trying to coordinate become files or sockets.

      I definitely should have made this more explicit. I over-read the OP's statement that "My first thought was forking child processes". Thanks for catching it.

        You would need not just one database process, but one process per database server, and round-robin your select loop over all of them.

        The real problem comes if the queries return largish volumes of data: then you have to squeeze it all through those 8-bit pipes. Of course this is normal when you communicate with a DB server via a socket. But in this scenario, the DB processes are perl scripts using DBI, which means the data received by those processes has already been de-streamed and structured (fetchall_hashref/arrayref etc.), which is a relatively expensive process. But now you need to flatten (serialise) that structure to pass it through the socket back to the parent process, where it then has to be restructured again, with all the duplication of parsing and memory allocation that involves.

        So yes, you could definitely do this using a select loop, Storable freeze & thaw, and one socket & DBI process per DB server, but it ain't gonna be quick or memory efficient. If the required poll rate times the number of DB servers is more than 1 every ~3 or 5 seconds, you ain't gonna keep up.
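
        For concreteness, here's roughly what that select + Storable plumbing looks like (hypothetical hosts and credentials; note that on Windows, select() only works on sockets, so you'd need sockets rather than pipes there):

            use strict;
            use warnings;
            use IO::Select;
            use Storable qw(freeze thaw);
            use DBI;

            # One child per DB server; each freezes its result set down a pipe.
            my (%by_fd, @pids);
            my $sel = IO::Select->new;
            for my $host (qw(db1 db2)) {
                pipe(my $r, my $w) or die "pipe: $!";
                binmode $_ for $r, $w;                  # Storable blobs are binary
                my $pid = fork;
                die "fork: $!" unless defined $pid;
                if ($pid == 0) {                        # child
                    close $r;
                    my $dbh  = DBI->connect("DBI:mysql:host=$host",
                                            'monitor', 'secret',
                                            { RaiseError => 1 });
                    my $rows = $dbh->selectall_arrayref('SHOW GLOBAL STATUS');
                    print {$w} freeze($rows);
                    close $w;
                    exit 0;
                }
                close $w;
                push @pids, $pid;
                $by_fd{fileno $r} = { host => $host, buf => '' };
                $sel->add($r);
            }

            # Parent: multiplex the pipes; thaw each blob once its child closes.
            while ($sel->count) {
                for my $r ($sel->can_read) {
                    my $slot = $by_fd{fileno $r};
                    my $n = sysread($r, $slot->{buf}, 65536, length $slot->{buf});
                    next if $n;                         # more data still coming
                    $sel->remove($r); close $r;         # EOF: blob is complete
                    my $rows = thaw($slot->{buf});
                    print "$slot->{host}: ", scalar @$rows, " rows\n";
                }
            }
            waitpid $_, 0 for @pids;

        Every result set here really does get built, frozen, read, and thawed again, which is exactly the duplication described next.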

        And if the data volumes are anything more than trivial, you're gonna need a machine with a substantial amount of memory. Each byte of data queried (plus all the Perl data structure overhead) will concurrently exist in at least 4, probably 5, places at some point in time: the inbound, Perl-structured version in the DBI process; the outbound, Storable-structured version in the DBI process; the inbound Storable-structured version in the select-loop process; the Perl re-structured version in the select-loop process; and whatever form the final application requirements need it to be in. Actually there would probably be a sixth (partial?) copy in the DBI library buffers as well. And remember, you cannot take advantage of COW for any of this.

        With threads, you'd have at most 3 copies; no (additional) communications latency; and no double deserialisation, reserialisation or restructuring. On top of that, there would be no need to break the application's processing up into a bunch of iddy biddy chunks so as to ensure that your select loop wasn't starved.

        And threads would be more easily scaled. If later you need to monitor another 10 DB servers, you simply spawn another 10 threads (the processing would be identical). With the multi-process and pipes method, you'd probably have to go back and repartition the application processing code, because you'd need to service the select loop with greater frequency.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: What's the best way to fetch data from multiple sources asynchronously?
by Moron (Curate) on Jan 02, 2007 at 15:57 UTC
    Unless you know in advance that there is a reasonable limit to how long a query will take to complete, you might like to consider the traditional approach of splitting the work into submission threads and a separate transaction monitor. The latter is a daemon (detached process) which maintains a matrix of active queries and returns their results to the requesting processes. A submitter program is needed to communicate between the requester and the transaction monitor. The point is that otherwise all functionality needing to operate on the queries would have to stay alive in the same process until all the others were done. With a transaction-monitor architecture, submitters can have any granularity they like, from a single query session up to any number of asynchronous submissions from the same process, without any dependence on each other for completion.

    Both requesting processes and the transaction monitor still need a technical way to manage threads, forks, POE or whatever you choose; this has already been addressed in other posts. A skeleton of the monitor side follows.
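
    As a very rough skeleton (the port number and line-based protocol are invented for illustration), with the actual query execution delegated to whichever mechanism the other replies describe:

        use strict;
        use warnings;
        use IO::Socket::INET;
        use POSIX 'setsid';

        # Detach into a daemon.
        exit if fork;                  # parent returns to the shell
        setsid();                      # become session leader
        exit if fork;                  # can't reacquire a controlling terminal

        my $listener = IO::Socket::INET->new(
            LocalPort => 9876, Listen => 10, ReuseAddr => 1,
        ) or die "listen: $!";

        my %active;                    # id -> { sql => ..., client => ... }
        my $next_id = 1;
        while (my $client = $listener->accept) {
            defined(my $sql = <$client>) or next;   # client sent nothing
            chomp $sql;
            my $id = $next_id++;
            $active{$id} = { sql => $sql, client => $client };
            # ... hand $sql to a worker (threads, fork, or POE as above);
            # when it completes, write the result back on $active{$id}{client}
            # and delete the entry from the matrix.
        }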

    -M

    Free your mind

      Many thanks to all. I love knowing my options.
Re: What's the best way to fetch data from multiple sources asynchronously?
by emazep (Priest) on Jan 04, 2007 at 15:22 UTC
    My first thought was forking child processes, but (correct me if I'm wrong, as I'm new to this) the child processes can't alter data in the parent process. Then I looked around the web. Other solutions came to mind: shared memory, opening a pipe from the child to the parent so the child can serialize the result and the parent can re-hydrate it, using Storable and having the child write to a file and the parent read from it, etc etc. These all strike me as horrific kludges that may not be safe or portable.
    Working with separate processes is certainly more difficult than working with threads, but (real) processes provide some advantages that threads can't provide.

    A multiprocess application is much more fault tolerant, since every process runs in its own separate address space: if a single child process dies, the whole application is not affected and the child process can simply be restarted. By contrast, all the threads in a multithreaded application share the same address space, so a fatal error in a single thread can bring down the whole application.

    Another advantage of (real) processes is that they transparently migrate over an SSI cluster (as long as they don't use shared memory to communicate with each other), while threads don't (at least with the most common SSI cluster implementations available today).

    Also the fork() emulation provided by Perl on Windows works quite well (except in some cases, which are btw avoidable).
    There are some CPAN modules that do things like this, but I'm picky: I don't want people to have to install a bunch of arcane modules to run this program (one or two is okay, but not an entire bundle). And it needs to be fast, stable and portable to at least Linux, FreeBSD and Windows. Stop me if I'm asking too much.
    You are not asking too much: on the contrary, you are worrying too much (and you are probably approaching the problem the wrong way).

    You are providing a complete application, not a module/library, right? So, for the Windows users, what's stopping you from providing a complete bundle (including all the necessary modules and the perl interpreter itself) packaged in a Windows installer, so that they don't have to worry about anything?
    On the other hand, requiring the average Windows user to first manually install perl, then version X of module A from repository A1 applying patch A2, then version Y of module B from repository B1 applying patch B2, and so on, will make him run away from your application.

    If the Windows user already has perl installed, he probably won't worry too much about having a few megabytes duplicated on his hard disk; and in the much more common case that he doesn't have perl and/or the necessary modules already installed, he will be more than happy to have a familiar-looking installer which transparently provides everything needed to run the application.

    If you want to see a working example of a (great) Windows application written in Perl, which packages into a Windows installer all the necessary modules, perl itself and even a bundled web server, have a look at POPFile.
    Update: my friend and fellow monk lucas does the same with his popular free groupware application IGSuite.
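
    One way to build such a bundle (just as an example of the approach, not necessarily what POPFile uses) is PAR::Packer, whose pp tool rolls a script, its module dependencies and perl itself into one executable; the output name and module below are illustrative:

        pp -o innotop.exe innotop
        pp -M DBD::mysql -o innotop.exe innotop   # force-include modules pp's scanner misses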

    On *nix, where by the way you'll probably have far fewer problems, a similar path should be followed anyway: your application should be properly packaged for each distribution (a deb package for Debian-based distros, an rpm package for RH-based distros, etc.), so that the dependencies will be handled by the specific package managers (again, mostly transparently for the end user). If your application becomes popular, this will be handled by the various maintainers/packagers, so you don't even have to worry about it.

    You can provide instructions to manually install everything, even on Windows; nothing prevents you from doing that. But provide the installer as well: nothing prevents you from doing that either.

    Ciao,
    Emanuele.
      OT*: Long live "SCANNER CON THREAD"!

      * A joke within the Italian mongers mailing list, about a thread that lasted more than three months, with many flames that were successfully drowned under liters of wine

      Flavio
      perl -ple'$_=reverse' <<<ti.xittelop@oivalf

      Don't fool yourself.
