http://www.perlmonks.org?node_id=1062445

scunacc has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks

UPDATE: Interesting. The REST calls I make require authentication. I get an authentication token and then make a 2nd call. I decided to cache the authentication token per thread (which worked - wasn't sure if it would for multiple successive calls) and it dropped *2 hrs* off the AIX run time. So, making the REST calls from AIX is definitely a source of the problem. Not sure why yet. Interestingly, doing the same on the Xeon Intel machine running Linux dropped the runtime to 52 mins from 1.5 hrs. So, not so much of an improvement but definitely better there too.
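A minimal sketch of the per-thread token caching described in the update, not the actual code from the application. `fetch_token`, the 300-second lifetime and the token strings are hypothetical stand-ins for the real authentication REST call. It relies on the fact that unshared Perl variables are cloned per ithread, so a plain lexical behaves as a per-thread cache.

```perl
use strict;
use warnings;

# Unshared lexicals are copied into each ithread, so every thread
# ends up with its own private cache without any locking.
my %token_cache;        # ( token => ..., expires => epoch seconds )
my $fetch_count = 0;    # only here to show how often we really authenticate

# Hypothetical stand-in for the real authentication request.
sub fetch_token {
    $fetch_count++;
    return { token => "token-$fetch_count", expires => time() + 300 };
}

# Return the cached token, re-authenticating only when it is missing or expired.
sub auth_token {
    if ( !defined $token_cache{token} || $token_cache{expires} <= time() ) {
        %token_cache = %{ fetch_token() };
    }
    return $token_cache{token};
}
```

Each data request would then call `auth_token()` instead of authenticating first, which halves the number of round trips while the token is valid.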

Kind regards

Derek

-----------------------------------------

ORIGINAL QUESTION BELOW

Hi folks,

Skip to the bottom for the Q if you don't want the background here:

I have a mature robust multithreaded application that's been running in various guises and various UN*X flavors since 2008. I have just modified it to add some new features and run in a different way. The process can spawn up to several hundred threads at a time which remain running to multiplex and queue multiple parallel client inputs on multiprocessor machines.

In the new version, I hit an external website with a REST query rapidly - multiple times per second with several hundred clients talking to the server process threads sequentially - 20 at a time.

When I run this on an old Pentium D Dual core machine, it takes 2 hrs to run thru' all my clients. On a Xeon quad core, it's an hour and a half. The network response times do dominate a bit - I'm already aware of that. However, on AIX, the thing takes **4 hours** to complete! That.Should.Not.Happen! :-) The AIX machine is a multiprocessor workhorse with partitioning and 64G memory. It beats the pants off of the Xeon-based machine, and certainly the Pentium D machine.

So, my Q is: why would this take so long on AIX? Is it a function of using threads on AIX? Something in the REST client libs (I'm using REST::Client)? Something else odd that anyone is aware of on AIX? Remember - I've been using this software in one form or another on AIX for years (I wrote it - it's about 60k lines of code overall). I only noticed this problem because I had a comparison baseline with other systems on this particular instance of the server, and other uses of it didn't hit the problem. I'm thinking something in the net calls or something in the context switching among threads.

Corollary: I have a way to batch the REST query data. When I do that, on the Pentium D, the data set takes 18 mins (but is less accurate). When I batch it on AIX it takes *15* mins. Yes - right - so - something to do with networking or context switching. I'm pitching to networking.

Corollary 2: The Xeon machine is on the same internal network as the AIX machine. So, not a network connectivity problem per se.

Fellow Perl folks. Your help - as always - is much appreciated.

Kind regards, Derek

Re: Multithreaded process on AIX slow
by flexvault (Monsignor) on Nov 13, 2013 at 21:20 UTC

    scunacc,

    What version of AIX?

    Background: If it's AIX 6.1 or later, you may be working against the AIX dispatcher. Unix and Linux treat cores as CPUs, while AIX knows that the first core of the CPU is the fastest and the last core is the slowest. So as an example, let's say you have 8 CPUs with 6 cores each. The AIX dispatcher will always want the first core of each CPU working the most and then the next level and so on.

    You may want to search on this since I seem to remember that in AIX you may get better performance by limiting the number of active threads, so they execute on the faster cores. Also, I believe I/O bound threads are dispatched on the slower cores and CPU intensive on the faster cores.

    I believe the 'xlc' compiler is designed to help C/C++/Fortran programs, but Perl is on its own. And if your Perl was built with 'gcc' you may be in an even worse situation.

    I don't know if this will help, but maybe the information will give you a different approach to the problem.

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

      Hi Ed,

      Appreciate the observations. Interesting… How does it handle assignments for micropartitioning then since it's virtualized further? I have no control over how that is allotted on this system.

      Also - the way the application is designed, I have multiple threads for input and multiple for output. Each either feeds a Q (input) from a client or reads a Q (output) and communicates back to waiting clients. The clients send info. to the server, then sit waiting (as a reverse "server" if you will) for results. The Q's are used to enable the processing threads (of which there are a considerable number in a hierarchy / intercommunicating community performing different related functions) to consume what they want in parallel. When done, they asynchronously dump the results into the output Q's, shared with the output threads. There is no ongoing connection to clients. That is also asynch, the client connection information being carried from input to output Q as part of the SOAP object. The output Q handling threads then contact the client back (acting as a client to the client acting as a server) saying: "Here's the answer".

      The I/O that's binding me here isn't that though, since that processing has worked fine with some other in-machine operations at breakneck speed maxing things out nicely when required ;-) since 2008.

      The problem seems to be multiple net connections using REST as I mentioned. I can still slam things as fast as they will go with hundreds of clients in sequence if I *batch* my REST data - and - as I say - it completes in nearly the same time as the Xeon-based version then. It's still doing the exact same amount of *other* I/O though. I still start the same # of threads. I still have the same number of clients sending the same amount of data. It's just how much data I then send in each REST request - and, I guess, how many consecutive REST requests I'm making as a result. (1 vs. 50).
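To make the "1 vs. 50" comparison concrete, here is a small sketch of grouping work items into batches so that one REST request carries many items. The `batches` helper and the batch size are illustrative assumptions; how the real application packs its request payload is not shown in the thread.

```perl
use strict;
use warnings;

# Split a list of items into array refs of at most $size items each.
sub batches {
    my ( $size, @items ) = @_;
    die "batch size must be positive\n" if $size < 1;
    my @out;
    push @out, [ splice @items, 0, $size ] while @items;
    return @out;
}
```

With a batch size of 50, 120 items become three requests instead of 120, which is the kind of reduction in consecutive REST calls described above.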

      So, I/O *per se* isn't binding me. I think what I'm wondering is whether the REST::Client or HTTP::Request modules have any known issues on AIX.

      This is AIX 5.3 - can't upgrade this machine. I have a 6.1 machine available that I will have to build an identical Perl on to test with though.

      Hmmm. Let's see (…logging in sounds…) I built this particular Perl instance with gcc. :-/ Ah - there was a reason for that - I also had to build PostgreSQL on this particular machine and have it link dynamically. It was the only way to get them to play nicely with each other.

      Kind regards

      Derek

Re: Multithreaded process on AIX slow
by BrowserUk (Patriarch) on Nov 13, 2013 at 20:32 UTC

    There is only one way you will make progress on this: profile.

    Profile a run on the AIX machine. Profile a run on the xeon machine.

    Compare.

    If you are unfamiliar with profiling, or want someone to cast a second pair of eyes over the output, and you could send me the block, line & sub .csvs produced (by NYTProf) for the main script plus the sources, I'd willingly take a look.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi, appreciate the suggestion but…

      From the docs:

      "Devel::NYTProf is not currently thread safe or multiplicity safe. If you'd be interested in helping to fix that then please get in touch with us. Meanwhile, profiling is disabled when a thread is created, and NYTProf tries to ignore any activity from perl interpreters other than the first one that loaded it."

      So, not really any help there I'm afraid. This application cannot be run - or rewritten to run - without threads. By definition it runs with over 100 threads and already takes a lot of shortcuts to reduce the memory footprint per thread, e.g. doing dynamic module unloading in each thread after startup - which will confuse profiling. Even if it were possible to run the program without threads (it isn't - the rewrite would take several months of development effort), it would then take 10s of hours to run the data.

      Kind regards

      Derek

        "Devel::NYTProf is not currently thread safe or multiplicity safe."

        Sorry. I don't use NYTProf myself, so I've never encountered the limitation. I only mentioned it because it seems to be what everyone else recommends.

        "This application cannot be run - or rewritten to run - without threads."

        There would be no point to doing so as the bottlenecks would undoubtedly move.

        But the core of my advice above remains. You will need to profile.

        I tend to do this manually. I find that profiling every line takes too long and produces far too much output to be useful.

        So, I start coarse-grained at first till I find the points of interest, then more and more fine-grained around those points of interest until I find what I'm looking for.

        So, for a queue consumer thread I will just trace:

        printf STDERR "[%3d] %u %s:%d %s\n", threads->tid, time(), __FILE__, __LINE__, $workitem;

        once for each loop of the queue processing loop.

        That should have a negligible effect on the overall performance of the code, but should give sufficient information to see if one or more of your threads and/or particular work items are significantly slower on the AIX machine than the others.
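A slightly extended sketch of the same manual tracing idea, using sub-second timestamps from the core Time::HiRes module so that per-item durations can be compared between the AIX and Linux runs. The `trace_line` helper and its output format are assumptions for illustration, not code from the application.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);    # replaces time() with a fractional-second version

# Build one trace line: thread id, wall-clock timestamp, label, elapsed seconds.
# Falls back to thread id 0 when the threads module has not been loaded.
sub trace_line {
    my ( $label, $started ) = @_;
    my $tid = $INC{'threads.pm'} ? threads->tid : 0;
    my $now = time();
    return sprintf "[%3d] %.3f %s took %.3fs", $tid, $now, $label, $now - $started;
}

# Typical use inside a queue-processing loop (the work itself is not shown):
#   my $t0 = time();
#   ... process $workitem ...
#   print STDERR trace_line( "item $workitem", $t0 ), "\n";
```

Sorting the resulting log by elapsed time on each machine should show whether the slowdown sits in a few specific steps, such as the REST calls, or is spread evenly across all threads.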

        Once you get some indication of where the differences lie, you can then disable or remove the irrelevant traces and add a few more in the appropriate places to allow you to zoom in on the cause(s).

        One possibility -- based upon something you said in one of your other replies above -- are you using virtualisation on one or both of the systems? If so, how are you configuring the VM that is running this app (i.e. what cores/CPUs/virtual network adapters/etc.)?

        Because historically, if you configure a VM with equal or more of any of those than the actual machine it runs on has, then it can lead to some very bad interactions between the host and guest OS schedulers and lead to diabolically bad performance. Just a thought.


Re: Multithreaded process on AIX slow
by kschwab (Vicar) on Nov 13, 2013 at 23:23 UTC
    Does smtctl say that simultaneous multithreading is on? I think it defaults to on, but it could be off.
      On, yes.
        If it's easy to test, try setting these environment variables before you kick it off:
        export AIXTHREAD_SCOPE=S
        export SPINLOOPTIME=500
        export YIELDLOOPTIME=100
        export MALLOCMULTIHEAP=1
        These are the settings recommended for WebSphere (a Java app server), which is, at a high level, doing something similar. They are mostly targeted at reducing contention on locks, mutexes, etc.
Re: Multithreaded process on AIX slow
by talexb (Chancellor) on Nov 14, 2013 at 13:42 UTC

    Interesting problem. The application without the REST piece works OK -- and the REST piece works OK, but put them together and the whole thing slows down. That's my understanding of the problem.

    I wonder if there's an issue in the middle -- is there a way to limit how many REST requests are in flight? If so, you might see how things go by limiting that number to see if you get better performance. Alternatively, is there some resource that a REST request uses in the application that's in short supply?
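One way to try the "limit how many REST requests are in flight" experiment is a counting semaphore from the core Thread::Semaphore module. This is only a sketch: the limit of 4 and the `with_rest_slot` wrapper are hypothetical, and in a threaded program `use threads;` has to be loaded before this code so the semaphore is really shared between threads.

```perl
use strict;
use warnings;
use Thread::Semaphore;

my $MAX_IN_FLIGHT = 4;    # assumed limit; tune it and compare run times
my $rest_slots    = Thread::Semaphore->new($MAX_IN_FLIGHT);

# Run $code while holding one slot; the slot is released even if $code dies.
sub with_rest_slot {
    my ($code) = @_;
    $rest_slots->down;    # blocks while $MAX_IN_FLIGHT calls are already running
    my $value;
    my $ok  = eval { $value = $code->(); 1 };
    my $err = $@;
    $rest_slots->up;
    die $err unless $ok;
    return $value;
}
```

Each thread would wrap its REST call as `with_rest_slot( sub { ... } )`. If the total run time improves as the limit is lowered, that points to contention around the requests rather than to the network itself.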

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Hi Alex.

      Thanks for the thoughts. Essentially you are correct in your 2nd sentence.

      I do have a standalone version of the very simple REST piece I could try hammering away with as fast as possible in parallel I guess. I should do that to eliminate the cross-fertilization with the threading on AIX I suppose, but, it needs to work in this server, otherwise I'll have to use the Linux port instead which is not ideal in this situation but might be workable. (The server communicates with another similar (but-non-REST-call-making) server to store the results from the REST calls in a database among many other tasks, but that second server currently runs on the AIX box where it needs to sit - and does so very smoothly presently. However, they *are* on the same net and it is all interoperable - just that it means more net interactions between the two machines instead of having both servers on the same machine.)

      And, yes, if I reduce the # of REST calls by batching as I've mentioned, it works very quickly - quicker even than the Linux version.

      The REST resource is not in contention. + I have short-fuse alarm timeout handlers using eval built around the calls so if they do fail it bombs gracefully and moves on to the next call.
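For readers unfamiliar with the pattern, a short-fuse timeout built from `alarm` and `eval` commonly looks like the sketch below. It is a single-threaded illustration with a hypothetical `call_with_timeout` helper, not the handler used in the application, and it does not show how signals behave inside a process with many threads.

```perl
use strict;
use warnings;

# Run $code with a time limit. Returns (1, $value) on success,
# or (0, $error_message) if it died or the alarm fired first.
sub call_with_timeout {
    my ( $seconds, $code ) = @_;
    my $value;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout after ${seconds}s\n" };
        alarm $seconds;
        $value = $code->();
        alarm 0;
        1;
    };
    my $err = $@;
    alarm 0;    # make sure no alarm stays pending if $code died early
    return $ok ? ( 1, $value ) : ( 0, $err || "unknown error\n" );
}
```

A caller can then log the error and move on to the next request when the first return value is false.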

      Kind regards

      Derek

      Is the REST code using some kind of semaphore internally? You've been running this same code, with REST, on other machines and it's only slowing down on one?
        Hi

        Appreciate the further thought. All good to help the juices flow :-)

        Identical code - yes. I guess I need to go look at the actual REST::Client and HTTP::Request CPAN module sources to see what they are doing to answer that.

        I know when I developed the surrounding code back in 2008 I discovered a bug/feature in the Perl threading mechanism and variable sharing (reported back on it) that had me going in circles for a while, but led to some interesting workarounds that ended up making the code more efficient as a result. That isn't related to this though. I suppose I'll just have to dig through those net modules and see as well on this. Oh well :-) All good fun.

        Kind regards

        Derek