http://www.perlmonks.org?node_id=709061

Krambambuli has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perlmonkers,

I'm trying to get your advice about when to choose a 'threads' approach versus a 'forks' approach for solving a problem.

Being quite a bit disappointed right now after trying for days to overcome a memory-leak problem with an threaded approach to solve a mass-insertion DB app, I'll go now the forking and/or POE path anyway.

But in retrospect, I wonder if there would have been some good advices to consider about what to be aware of; so my question is: what would be the main arguments and reasonings for choosing threads over forks or viceversa before trying out both and see what's up the hard way ?

Thanks in advance,

Krambambuli

---
occasionally enjoying Mark Jason Dominus' Higher-Order Perl
  • Comment on When to use forks, when to use threads ...?

Replies are listed 'Best First'.
Re: When to use forks, when to use threads ...?
by tilly (Archbishop) on Sep 04, 2008 at 17:50 UTC
    My personal opinion is that in Perl I would need an extraordinary motivation to use multi-threading. Forking is almost always a better way to go on platforms that natively support it.

    For a mass insertion into a database I would recommend neither approach. Instead I would suggest learning what native tools your database has for mass inserts. I would then use them. If the table you're inserting into has indexes I would strongly suggest dropping all indexes, doing the insert, then re-creating the indexes. (The reason for that is that maintaining indexes during inserts results in a lot of random seeks to disk. Seeking to disk is expensive. Throwing away the index and rebuilding it at the end avoids most of those seeks and is therefore much faster.)

      For a mass insertion into a database I would recommend neither approach. Instead I would suggest learning what native tools your database has for mass inserts. I would then use them.
      ...If the table you're inserting into has indexes I would strongly suggest dropping all indexes,...

      That doesn't work as needed, unfortunately. It's not a simple insertion, but rather an insert-or-update-if-already-there process. So I cannot disable indexes either. Parallelizing the process in some way seemed the most appealing alternative to speed up things.

      Krambambuli
      ---
        If you're using MySQL, the fastest approach is to load the data in a temporary table and then use an INSERT... ON DUPLICATE UPDATE statement to copy it all into the existing table in one shot.
        Those are called upserts and the SQL 03 standard for that (may not be implemented in your database) is called merge. If that is not implemented and no other variants exist you can do an update followed by an insert of everything that is not found.

        In any case variants of Perrin's solution is the standard way to do it. Load a temporary table, then do the update within the database. If there is The primary key index should not be dropped, but all other indexes can be, and then can be re-created. With a good database that should be the most efficient way to go.

Re: When to use forks, when to use threads ...?
by kyle (Abbot) on Sep 04, 2008 at 17:04 UTC
      Blush... I didn't this time ;)

      However, I did now, thanks for the link. And still, reading through the answers, I'm convinced that it was worth re-posing the question.

      Thanks,

      Krambambuli
      ---
Re: When to use forks, when to use threads ...?
by wojtyk (Friar) on Sep 04, 2008 at 17:09 UTC
    Well, they each certainly have their pros and cons. Personally, I opt for forking unless I have something that requires a significant amount of shared interaction (either through required shared memory or producer/consumer queues). I find the extra overhead of forking to be minimal in modern day operating systems, and the gains are significant:

    - fewer memory leaks
    - process isolation
    - fewer race conditions
    - deadlocks/starvation not an issue

    Look to the latest release of Google's Chrome as a perfect example. The current threaded asynchronous model of browsers isn't cutting it, so they're moving to a process-per-tab model for all of the above reasons.

    Of course, this is a personal opinion and therefore flame-worthy, so take all advice with grain of salt :)

      It is possible to deadlock two separate processes that communicate with one another if the protocol between them is not well designed. The "fewer memory leaks" claim isn't necessarily true either, although it's easier to clean up after a small leak because the OS will reclaim the memory once the process exits.
        It is possible to deadlock two separate processes that communicate with one another if the protocol between them is not well designed
        Fair enough, but still seems like a nitpick (in that it is possible, but not probable - whereas in the case of programming threaded apps, it's a common occurrence). Also in the case of separate processes, it's much easier to recognize & debug.
        The "fewer memory leaks" claim isn't necessarily true either
        This one I stand by (in the statistical average case), particularly as the software grows more complex. As you mentioned, process termination is a natural "reset switch" that does wonders in keeping leaks down. But there's a whole host of other reasons that make memory leaks more common in threaded apps:

        - Harder to recognize and pinpoint (and thus fix...), since "memory usage" reported will be a macro-level observation instead of a process-isolated observation
        - More frequent use of resource locks (ie semaphores), reference counting, and shared resources in threaded apps, making it easier to leave resources in a limbo state
        - Maybe a redundant point, but "increased coding complexity" when using threads increases the possibility of a coding error

        As with all things involving coding, _any_ piece of code could be done "right" with no memory leaks, but I believe it's much easier to both introduce leaks AND harder to pinpoint and fix them in threaded apps vs standalone processes.

Re: When to use forks, when to use threads ...?
by zentara (Cardinal) on Sep 04, 2008 at 18:01 UTC
    My rule of thumb is use threads if you need realtime communication between threads, as they are running. If you don't need that, its better to fork, only because memory cleanup is better, and it's easier on the system resources.

    As far as your memory leak goes, there are ways to fix it, but you need to show some code. Usually it involves reusing threads.... don't detach them and expect them to go away.... threads must return or reach the end of their code block to be joined, or self-destruct if detached. Maybe your database connection isn't finishing completely, and the threads are not allowed to destruct.


    I'm not really a human, but I play one on earth Remember How Lucky You Are
      Usually it involves reusing threads.... don't detach them and expect them to go away.... threads must return or reach the end of their code block to be joined, or self-destruct if detached. Maybe your database connection isn't finishing completely

      I'm not sure I correctly understand what you are saying.

      My threads where intended to last practically forever, each thread handling one distinct and 'private' database connection.

      The memory leak was showing up during normal operation of the threads - apparently with each thread-switch - and I checked of course that my code would not 'grow' with a simple equivalent nonthreaded counterpart.

      Krambambuli
      ---
        If you are reusing the threads, each handling one connection, and you are gaining memory, it seems like something in the database connection is causing it. Have you looked at "perldoc -q clear" ? Somewhere at the end of each thread run, you need to clear out some old data or filehandles. You may have some hash or array that is retaining entries.

        I'm not really a human, but I play one on earth Remember How Lucky You Are
Re: When to use forks, when to use threads ...?
by renodino (Curate) on Sep 04, 2008 at 17:41 UTC
    after trying for days to overcome a memory-leak problem

    Care to expand on that ? You are aware that Perl maintains its own heap, and doesn't release memory once it acquires it from the runtime heap ? Which, from external monitoring apps, may make it appear that you've got a memory leak, when in fact Perl is just doing its thing.

    That said, when doing big load jobs into DBMS's, there are any number of potential concurrency pitfalls, some of which will occur in both forked and threaded environments (possibly because they occur in the DBMS itself). YMMV.


    Perl Contrarian & SQL fanboy
      Ah... I just posted a node about Perl releasing memory back to the system when using threads... see OS memory reclamation with threads on linux From my previous experimentation, it was almost certain that Perl would hold onto the memory, but that may have been because I was testing at a "sweet spot" where Perl's calculations of free memory-vs-it's use allowed it to retain the memory. But when it is very high mem usage, Perl will release it. It's a crap shoot, and may depend on Perl versions, thread versions, and even the kernel.

      But the gist is, Perl will release large memory chunks if it's in a thread. And the c guru said that top and ps cannot always be trusted as an accurate measure of mem use. So I would ask, as your memory climbed, did the system slowdown, or did things keep running normally.


      I'm not really a human, but I play one on earth Remember How Lucky You Are
        So I would ask, as your memory climbed, did the system slowdown, or did things keep running normally.

        I watched the RES column from ps's output, first via top, then also using a code snippet taken out from Process::MaxSize like
        [...] my $size; open PIPE, "/bin/ps wwaxo 'pid,rss' |"; while(<PIPE>) { next unless /^\s*$$\s/; s/^\s+//g; chomp; $size = (split(' ', $_))[1]; } close PIPE; return $size; }
        I didn't wait to let the system really slowdown, but stopped when it became obvious that swapping would have to start.


        Krambambuli
        ---
      Care to expand on that ?

      Well, I'm not sure how to do that efficiently, it's a rather convoluted story - my baseline so far is that I will avoid for now and the near future to try using threads for parallelizing conditional-insert-or-updates into Oracle.

      The base problem is something like 'given a (rather big) amount of new data, add it to the existent tables that are holding it, inserting or updating if similar records are already there. Speed up the process so that it gets done as fast as possible.'

      What I seem to have so far is that Oracle's libclntsh.so in conjunction with Perl threads will loose 4 or 8 bytes on every thread-switch. Which thread to use depends on the input record.


      Krambambuli
      ---

        For your specific problem I wouldn't use threads either.

        1. Whilst using DBI from a single thread (within a possibly multithreaded process) will not cause problems. Using DB vendor C libraries from multiple threads is fraught with dangers, regardless of whether you do so from Perl via DBI, or your own C program. Some DB vendor C libraries are not themselves thread-safe because they variously:
          • use process ids as keys to internal structures.

            You can imagine the problems this will cause of you try to run two or more threads concurrently accessing the same DB from the same process.

          • Allocate and deallocate prodigous amounts of heap memory--for the transport of the data--which is allocated by the client program and freed by the vendor library or vice versa.

            Unless both the client program and the vendor libraries are using the same underlying (C runtime) memory management libraries--and those memory management routines are thread-safe--then memory leaks can occur.

            It is easy to see how, with the client program allocating heap memory to hold the data it is giving to the DB and the DB vendor libraries freeing that memory once they've dispatch that data to the DB via a socket or pipe, that unless both the client program and vendor libraries are built against exactly the same version of the underlying C runtime, problems can result.

            Eg. If the client libraries are statically linked to (say) GCC CRT v2.9x but your client program (perl) is statically linked against GCC CRT v3.x, then problems can arise. The same thing with MSVCRT7 versus. MSVCRT8 for example.

        2. More importantly, multiprocessing of large volume inserts into a single DB will quite likely slow things down. Regardless of whether you are using forks or threads!

          Think about what is happening at the DB server when you have multiple clients doing concurrent inserts or updates to the same tables and indexes. Regardless of what mechanisms the DB uses for locking or synchronisation, there is bound to be contention between the work being done by the server threads on behalf of those concurrent clients.

          And if you have indexes and foreign keys etc. then those contentions compound exponentially. Add transactions into the mix and things get much slower very fast.

          For mass updates, using the vendors bulk insertion tool from a single process, preferably on the same box as the DB and via named pipes rather than sockets if that is available, will always win hands down, over trying to multiprocess the same updates. Always.

          For best speed, lock the table for the duration of the insert. If possible, drop all the indexes, perform the insertion and then re-build them.

          If dropping the indexes is not possible (as you've mentioned elsewhere), then consider inserting the data into a non-indexed auxiliary table first, and then using an SQL query that runs wholly internal to the DB to perform the updates to the main table, based on data from that auxiliary table. Again, locking both first for the duration.

          Finally, bend the ear of, or employ, a good (means expensive) DBA to set up your bulk insertions and updates processing for you. A few days of a good DBAs time in setting you up properly, can save you money over and over and over. There is no substitute for the skills of a good DBA. Pick one with at least 5 years of experience of the specific RDBMS you are using. More than most other programming fields, vendor specific knowledge is of prime importance for a DBA.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: When to use forks, when to use threads ...?
by perrin (Chancellor) on Sep 04, 2008 at 17:50 UTC
    It's a big question, so read up with Google and SuperSearch. The gist of it is that if you're on Windows you will usually prefer Perl threads to forking and on Unix it will be the opposite.
      Right, because on windows forking is emulated using threads, because windows doesn't have fork...
        I dunno... any time I've used that OS, I've thought "forking Windows...."
        -Bib
Re: When to use forks, when to use threads ...?
by dHarry (Abbot) on Sep 05, 2008 at 11:46 UTC

    For bulk loads into Oracle (I understand that’s the db that you use) nothing beats SQL*Loader. It’s quite flexible: various load formats, selective loading, multi-table loading etc. etc. The configuration can be a bit tricky but normally that’s where the DBA comes in :-) There are also many third-party (loading) tools which offer additional features. I was addicted to Toad making the life of the Oracle developer a lot easier!

    I have tried doing bulk loads into Oracle with lots of simultaneous connections and it worked very much against me. Especially when you use connection pooling (Oracle 9i + WebSphere 5).

Re: When to use forks, when to use threads ...?
by jdrago_999 (Hermit) on Sep 05, 2008 at 14:06 UTC
    On Windows, use threads; but on Linux, use forks;

    Everything else should be the same.

    Check out forks and threads. In my experience, using multiple processes via the forks module is convenient, simple and maintainable, even when inserting/updating large amounts of records in a database.
Re: When to use forks, when to use threads ...?
by Illuminatus (Curate) on Sep 05, 2008 at 16:04 UTC
    One of the prime motivators for choosing threads over forks used to be 'overhead'. In older versions of Solaris, for example, the context-switch time between threads was about 1/10 of that for processes. It is still the case that process creation is *much* slower than thread creation. The other was easier mechanisms for interaction (ie, not having to rely on IPC for information-sharing), and overall management (SIGCLD handling is less than elegant).

    Now, context-switch is not as big a win as it used to be, but IPC is still more complicated to build and maintain, and is not as portable as pthreads

    Of course, this says nothing about the 'challenges' of perl's implementation of threads.
Re: When to use forks, when to use threads ...?
by pajout (Curate) on Sep 05, 2008 at 16:07 UTC
    I have some experience with mass, remote replication from MySQL, MSSQL and PostgreSQL source databases into one PostgreSQL target database. According to advices above, I would prefer to tune the mass-insertion logic rather than threading or forking. If you can design upserts specifically for every case, following ideas could help you:

    - try to recognize what should be inserted and what updated (IDs, columns storing last change time, ...) generally, not row-by-row
    - use previously advised ways of insert or update
    - use stored procedure upsert_my_table() on the target database
    - try to insert more than one row in one dml command
    - try to take advantage of existing indexes (on both source and target db) - sometimes upserting from properly sorted source table to (properly sorted) target table makes sense