Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister

Re^4: Multithreaded Server Crashes under heavy load

by rmahin (Scribe)
on Aug 29, 2012 at 20:40 UTC ( #990561=note: print w/replies, xml ) Need Help??

in reply to Re^3: Multithreaded Server Crashes under heavy load
in thread Multithreaded Server Crashes under heavy load

Ok, so heres the link to the comma delimited CSV file perfmon generated. Did not see how to get it to generate socket information for a specific process though. Also, the thing doesn't show that perl ever reached 100% cpu usage, but using the task manager shows that it spawns multiple perl processes, and that cpu usage is maxed for a while, eventually all but one perl process disappears from the task manager, cpu usage drops back down, and then the program is unresponsive.

Made this log by adding a new counter log, and then adding various process counters. If this was incorrect let me know.

  • Comment on Re^4: Multithreaded Server Crashes under heavy load

Replies are listed 'Best First'.
Re^5: Multithreaded Server Crashes under heavy load
by Anonymous Monk on Aug 29, 2012 at 22:11 UTC

    Posted anonymously because my account is being f**** with (permission denied) -- BrowserUk

    Did not see how to get it to generate socket information for a specific process though.

    If you go into the Performance monitor view, click the green + icon, scroll the top-left list and expand the TCPv4 topic, you can select various counters relating to TCP. The most interesting counters are "Connections Active/passive/established/reset". The trace is most useful if there is not to much other TCP traffic on the box as it isn't process specific.

    Alternatively, you could grab TCPView which is a little more intuitive for viewing the instantaneous state, but not so good for tracing over an extended period.

    For an instantaneous snapshot in a textual form, try netstat -a -b and redirect to a file.

    The log you posted only covers a 3 minute period, and there are already 301 threads running when it starts. None of them apparently finish, so the trace essentially shows a static state. No so useful. I can only assume that by the time that trace is started, everything is hung.

    Without seeing threads being started and ending it is pretty useless. You need to establish whether threads are ever ending? (Perhaps your debug log shows that?) And whether once they end, whether they are actually going away.

    I suspect that either your connections are not being closed properly and are taking a long time to time out. And/or your threads ending, but not going away. I cannot (yet) see any reason from the source code however, and the log doesn't really tell me anything.

    The elapsed time suggests that this snapshot was taken 38 hours into a run? You really need to start PerfMon immediately the server is started and monitor its cpu/threads/memory/sockets at (say) 1 minute intervals until it appears to hang. Running netstat to a file with a similar interval over the same run would give the best information.

      Good luck with your account!

      Ok collected the information again. Sorry, the other 300 threads were from another a perl process running on the machine, but was just not visible on my remote desktop session, should have figured they would show up in the log though, my mistake. Ran this one on another machine, with no other users accessing it, and no other perl programs running. The 38 hours you saw was also for the process. Nonetheless, the log I posted only covered a 3 minute period because thats just how long it takes it to crash under the stress testing I'm doing. We had seen it happen occasionally under our regular use, so to flush out the problems I created some tests to get it to happen quicker, as the time it took for it to happen seemed entirely random, sometimes a day or two. This log also only lasts a few minutes, and I did it over 10 second intervals. Some commands it executes will be as simple as making a directory, and take almost no time at all to process, and others are longer.

      Also, I just cant really see any reason why a thread would not be exiting properly. When I run the program in debug mode, it seems to crash almost instantly which would lead me to believe could have something to do with printing to the screen, but have tried redirecting all output to a file with no change. But what is printed with the debug messages shows that all commands that the server actually receives and executes run, and close successfully. Here's the output of all the commands before it crashed It shows the "DEBUG -- (30) Closing connection" printed out by

      debug("Closing connection"); shutdown($client, 2); close $client; threads->exit;
      at the end of thread subroutine. So, yeah I'm really at a loss.

      Again, thank you for your help

        It is all over after the first 3 minutes. Its not a "hang". It's just running really slowly because it cannot get new sockets for connections.

        You are creating huge numbers of connections -- there are 6,897 open sockets when your program runs, presumably still timing out from previous runs -- but rather than cleanly shutting those connections down, they are going into TIME_WAIT state and then your server has to wait for one of them to time out (900 seconds or some such) before it can establish a new connection.

        MST Elapsed Time Working Set + Established Reset Processor Time Thread Count Ac +tive Passive 18:45:38.609 0 0 0 0 68 +97 6 6062 505 18:45:48.609 5.9687 1 12566528 68 +97 5 6062 505 18:45:58.609 16.0937 15.96875 5 40304640 68 +97 8 6081 505 18:46:08.609 23.4375 25.96875 13 78077952 68 +97 60 6166 505 18:46:18.609 38.9062 35.96875 12 109629440 68 +98 15 6185 505 18:46:28.625 52.1060 45.984375 8 94683136 68 +98 11 6185 505 18:46:38.640 59.5943 56 2 21966848 68 +98 5 6185 505 18:46:48.656 15.6006 66.015625 11 58703872 68 +98 47 6259 505 18:46:58.656 27.5000 76.015625 41 283103232 68 +98 69 6309 506 18:47:08.671 17.0046 86.03125 41 475500544 68 +98 69 6309 506 18:47:18.671 14.8437 96.03125 41 536985600 68 +98 69 6309 506 18:47:28.671 0 106.03125 41 536985600 68 +98 69 6309 506 18:47:38.671 0 116.03125 41 536985600 68 +99 69 6309 506 18:47:48.671 0 126.03125 41 536985600 68 +99 119 6359 506 18:47:58.671 0 136.03125 41 536985600 68 +99 192 6432 506 18:48:08.671 0 146.03125 41 536985600 68 +99 192 6432 506 18:48:18.671 0 156.03125 41 536985600 68 +99 192 6432 506 18:48:28.671 0 166.03125 41 536985600 68 +99 193 6433 506 18:48:38.671 0 176.03125 41 536985600 68 +99 193 6433 506 18:48:48.671 0 186.03125 41 536985600 68 +99 193 6433 506 18:48:58.671 0 196.03125 41 536985600 68 +99 191 6433 508 18:49:08.671 0 206.03125 41 536985600 68 +99 192 6434 508 18:49:18.671 0 216.03125 41 536985600 68 +99 192 6434 508 18:49:28.671 0 226.03125 41 536985600 68 +99 241 6535 508 18:49:38.671 0 236.03125 41 536985600 68 +99 241 6558 508 18:49:48.671 0 246.03125 41 536985600 68 +99 240 6559 509

        (If you are going to throw another set of data at us, how about you expend a little energy to make the csv data readable :)

        There is either:

        • something wrong with the architecture of your server;

          (I haven't spotted it yet, but without running and tracing it can be hard to spot);

          The debug log -- had it not been empty - might have helped.

        • or your clients are not closing their ends of the connections properly;

          Basically, you are in a "dead man's shoes" situation, where your server cannot establish (or probably even accept) a new connection, until one of the existing dying-but-still-to-finally-die connections times out.

        Before you do another run, you need to clean up any existing connections. There is the netsh command that allows you to reset at various levels -- winsock; interface; ipv4; tcp etc. -- but perhaps the simplest is to just reboot the machine.

        You need to work out what is causing the connections to 'linger'. You appear to be using shutdown correctly -- at the server end at least -- and closing the filehandles; but something is preventing them from being reused immediately, despite your ReuseAddr setting on the listener.

        Not a solution, but maybe it will give you some clues about where to start looking and how.

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        RIP Neil Armstrong

        Your debug.log file is empty.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://990561]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (4)
As of 2017-06-27 06:27 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (600 votes). Check out past polls.