Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Re: When to use forks, when to use threads ...?

by renodino (Curate)
on Sep 04, 2008 at 17:41 UTC ( #709076=note: print w/ replies, xml ) Need Help??


in reply to When to use forks, when to use threads ...?

after trying for days to overcome a memory-leak problem

Care to expand on that ? You are aware that Perl maintains its own heap, and doesn't release memory once it acquires it from the runtime heap ? Which, from external monitoring apps, may make it appear that you've got a memory leak, when in fact Perl is just doing its thing.

That said, when doing big load jobs into DBMS's, there are any number of potential concurrency pitfalls, some of which will occur in both forked and threaded environments (possibly because they occur in the DBMS itself). YMMV.


Perl Contrarian & SQL fanboy


Comment on Re: When to use forks, when to use threads ...?
Re^2: When to use forks, when to use threads ...?
by zentara (Archbishop) on Sep 04, 2008 at 18:13 UTC
    Ah... I just posted a node about Perl releasing memory back to the system when using threads... see OS memory reclamation with threads on linux From my previous experimentation, it was almost certain that Perl would hold onto the memory, but that may have been because I was testing at a "sweet spot" where Perl's calculations of free memory-vs-it's use allowed it to retain the memory. But when it is very high mem usage, Perl will release it. It's a crap shoot, and may depend on Perl versions, thread versions, and even the kernel.

    But the gist is, Perl will release large memory chunks if it's in a thread. And the c guru said that top and ps cannot always be trusted as an accurate measure of mem use. So I would ask, as your memory climbed, did the system slowdown, or did things keep running normally.


    I'm not really a human, but I play one on earth Remember How Lucky You Are
      So I would ask, as your memory climbed, did the system slowdown, or did things keep running normally.

      I watched the RES column from ps's output, first via top, then also using a code snippet taken out from Process::MaxSize like
      [...] my $size; open PIPE, "/bin/ps wwaxo 'pid,rss' |"; while(<PIPE>) { next unless /^\s*$$\s/; s/^\s+//g; chomp; $size = (split(' ', $_))[1]; } close PIPE; return $size; }
      I didn't wait to let the system really slowdown, but stopped when it became obvious that swapping would have to start.


      Krambambuli
      ---
Re^2: When to use forks, when to use threads ...?
by Krambambuli (Deacon) on Sep 05, 2008 at 08:03 UTC
    Care to expand on that ?

    Well, I'm not sure how to do that efficiently, it's a rather convoluted story - my baseline so far is that I will avoid for now and the near future to try using threads for parallelizing conditional-insert-or-updates into Oracle.

    The base problem is something like 'given a (rather big) amount of new data, add it to the existent tables that are holding it, inserting or updating if similar records are already there. Speed up the process so that it gets done as fast as possible.'

    What I seem to have so far is that Oracle's libclntsh.so in conjunction with Perl threads will loose 4 or 8 bytes on every thread-switch. Which thread to use depends on the input record.


    Krambambuli
    ---

      For your specific problem I wouldn't use threads either.

      1. Whilst using DBI from a single thread (within a possibly multithreaded process) will not cause problems. Using DB vendor C libraries from multiple threads is fraught with dangers, regardless of whether you do so from Perl via DBI, or your own C program. Some DB vendor C libraries are not themselves thread-safe because they variously:
        • use process ids as keys to internal structures.

          You can imagine the problems this will cause of you try to run two or more threads concurrently accessing the same DB from the same process.

        • Allocate and deallocate prodigous amounts of heap memory--for the transport of the data--which is allocated by the client program and freed by the vendor library or vice versa.

          Unless both the client program and the vendor libraries are using the same underlying (C runtime) memory management libraries--and those memory management routines are thread-safe--then memory leaks can occur.

          It is easy to see how, with the client program allocating heap memory to hold the data it is giving to the DB and the DB vendor libraries freeing that memory once they've dispatch that data to the DB via a socket or pipe, that unless both the client program and vendor libraries are built against exactly the same version of the underlying C runtime, problems can result.

          Eg. If the client libraries are statically linked to (say) GCC CRT v2.9x but your client program (perl) is statically linked against GCC CRT v3.x, then problems can arise. The same thing with MSVCRT7 versus. MSVCRT8 for example.

      2. More importantly, multiprocessing of large volume inserts into a single DB will quite likely slow things down. Regardless of whether you are using forks or threads!

        Think about what is happening at the DB server when you have multiple clients doing concurrent inserts or updates to the same tables and indexes. Regardless of what mechanisms the DB uses for locking or synchronisation, there is bound to be contention between the work being done by the server threads on behalf of those concurrent clients.

        And if you have indexes and foreign keys etc. then those contentions compound exponentially. Add transactions into the mix and things get much slower very fast.

        For mass updates, using the vendors bulk insertion tool from a single process, preferably on the same box as the DB and via named pipes rather than sockets if that is available, will always win hands down, over trying to multiprocess the same updates. Always.

        For best speed, lock the table for the duration of the insert. If possible, drop all the indexes, perform the insertion and then re-build them.

        If dropping the indexes is not possible (as you've mentioned elsewhere), then consider inserting the data into a non-indexed auxiliary table first, and then using an SQL query that runs wholly internal to the DB to perform the updates to the main table, based on data from that auxiliary table. Again, locking both first for the duration.

        Finally, bend the ear of, or employ, a good (means expensive) DBA to set up your bulk insertions and updates processing for you. A few days of a good DBAs time in setting you up properly, can save you money over and over and over. There is no substitute for the skills of a good DBA. Pick one with at least 5 years of experience of the specific RDBMS you are using. More than most other programming fields, vendor specific knowledge is of prime importance for a DBA.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://709076]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (8)
As of 2014-09-16 07:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (157 votes), past polls