PerlMonks  

Re^7: Multithreaded Server Crashes under heavy load

by BrowserUk (Pope)
on Aug 30, 2012 at 09:13 UTC ( #990692=note )


in reply to Re^6: Multithreaded Server Crashes under heavy load
in thread Multithreaded Server Crashes under heavy load

It is all over after the first 3 minutes. It's not a "hang". It's just running really slowly because it cannot get new sockets for connections.

You are creating huge numbers of connections -- there are 6,897 open sockets when your program runs, presumably still timing out from previous runs -- but rather than being shut down cleanly, those connections are going into the TIME_WAIT state, and your server then has to wait for one of them to time out (900 seconds or some such) before it can establish a new connection.

MST  Elapsed Time  Working Set  Established  Reset  Processor Time  Thread Count  Active  Passive
18:45:38.609 0 0 0 0 6897 6 6062 505
18:45:48.609 5.9687 1 12566528 6897 5 6062 505
18:45:58.609 16.0937 15.96875 5 40304640 6897 8 6081 505
18:46:08.609 23.4375 25.96875 13 78077952 6897 60 6166 505
18:46:18.609 38.9062 35.96875 12 109629440 6898 15 6185 505
18:46:28.625 52.1060 45.984375 8 94683136 6898 11 6185 505
18:46:38.640 59.5943 56 2 21966848 6898 5 6185 505
18:46:48.656 15.6006 66.015625 11 58703872 6898 47 6259 505
18:46:58.656 27.5000 76.015625 41 283103232 6898 69 6309 506
18:47:08.671 17.0046 86.03125 41 475500544 6898 69 6309 506
18:47:18.671 14.8437 96.03125 41 536985600 6898 69 6309 506
18:47:28.671 0 106.03125 41 536985600 6898 69 6309 506
18:47:38.671 0 116.03125 41 536985600 6899 69 6309 506
18:47:48.671 0 126.03125 41 536985600 6899 119 6359 506
18:47:58.671 0 136.03125 41 536985600 6899 192 6432 506
18:48:08.671 0 146.03125 41 536985600 6899 192 6432 506
18:48:18.671 0 156.03125 41 536985600 6899 192 6432 506
18:48:28.671 0 166.03125 41 536985600 6899 193 6433 506
18:48:38.671 0 176.03125 41 536985600 6899 193 6433 506
18:48:48.671 0 186.03125 41 536985600 6899 193 6433 506
18:48:58.671 0 196.03125 41 536985600 6899 191 6433 508
18:49:08.671 0 206.03125 41 536985600 6899 192 6434 508
18:49:18.671 0 216.03125 41 536985600 6899 192 6434 508
18:49:28.671 0 226.03125 41 536985600 6899 241 6535 508
18:49:38.671 0 236.03125 41 536985600 6899 241 6558 508
18:49:48.671 0 246.03125 41 536985600 6899 240 6559 509

(If you are going to throw another set of data at us, how about you expend a little energy to make the csv data readable :)

There is either:

  • something wrong with the architecture of your server;

    (I haven't spotted it yet, but without running and tracing it can be hard to spot);

    The debug log -- had it not been empty -- might have helped.

  • or your clients are not closing their ends of the connections properly;

    Basically, you are in a "dead man's shoes" situation, where your server cannot establish (or probably even accept) a new connection, until one of the existing dying-but-still-to-finally-die connections times out.

Before you do another run, you need to clean up any existing connections. There is the netsh command that allows you to reset at various levels -- winsock; interface; ipv4; tcp etc. -- but perhaps the simplest is to just reboot the machine.

You need to work out what is causing the connections to 'linger'. You appear to be using shutdown correctly -- at the server end at least -- and closing the filehandles; but something is preventing them from being reused immediately, despite your ReuseAddr setting on the listener.
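One way to start looking is to tally sockets by TCP state before and after a test run, so you can see whether TIME_WAIT entries are piling up. The following is only a minimal sketch: count_tcp_states is a hypothetical helper, and it assumes a netstat on the PATH that accepts -an and prints the state as the last column of each TCP line.

```perl
use strict;
use warnings;

# Hypothetical helper: tally TCP sockets by state from netstat-style lines.
# Assumes each TCP line starts with "TCP"/"tcp" and ends with the state.
sub count_tcp_states {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        my @fields = split ' ', $line;
        # Skip headers, UDP lines and anything too short to carry a state.
        next unless @fields >= 4 && $fields[0] =~ /^tcp/i;
        $count{ uc $fields[-1] }++;
    }
    return \%count;
}

# Assumption: "netstat -an" is available and lists one socket per line.
my @output = `netstat -an`;
my $states = count_tcp_states(@output);
printf "%-14s %d\n", $_, $states->{$_} for sort keys %$states;
```

If the TIME_WAIT count keeps climbing across runs, that points at the way connections are being closed rather than at the load itself.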

Not a solution, but maybe it will give you some clues about where to start looking and how.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


Re^8: Multithreaded Server Crashes under heavy load
by Anonymous Monk on Aug 30, 2012 at 09:29 UTC

    I think  net stop "dhcp client" && net start "dhcp client" ought to do that quick :) if you're using dhcp that is

Re^8: Multithreaded Server Crashes under heavy load
by rmahin (Beadle) on Aug 31, 2012 at 21:16 UTC
    Hey, sorry about the CSV! I just opened it up in Libre Office as a comma-delimited file, and it looked good. BUT! Good news. I think you're definitely right that I'm not closing connections properly. I think this is caused by commands that need to open other file handles and do not handle errors correctly. For instance, I think the only command I included that has that behavior is the PUT command, which does open( PUTFILE, ">$outfile" ) or threads->exit;. The threads->exit is clearly getting called, leaving the socket open. I fixed all those and added more messages to see exactly when it was happening, and it definitely seems to be working better. I have one more part of the code to look at, which uses a subroutine that does not appear to be thread friendly, but I will update on progress once I've determined whether I solved it or not. Thanks for the direction!
      which does open( PUTFILE, ">$outfile" ) or threads->exit;. The threads->exit is clearly getting called, leaving the socket open.

      FWIW: I have written a crap load of threaded perl code and never had occasion to use threads->exit;. It is IMO redundant and dangerous.

      I would code that line as simply:

      open( PUTFILE, ">$outfile" ) or return;

      That way, all the normal perl cleanup will take place before the thread function returns and the thread terminates.
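As a slightly fuller illustration of the same idea, here is a minimal sketch of a PUT handler that never calls threads->exit, so the socket is shut down and closed on every path. The name handle_put, its arguments and the OK/ERR replies are hypothetical and not taken from the posted server; it is meant to be called from the thread's entry function, which then simply returns.

```perl
use strict;
use warnings;

# Hypothetical PUT handler. $client is the accepted socket, $outfile the
# target path, $data the payload. Returns 1 on success, 0 on failure.
sub handle_put {
    my ( $client, $outfile, $data ) = @_;

    my $ok = 0;
    if ( open( my $put, '>', $outfile ) ) {
        print {$put} $data;
        # close() reports buffered write errors, so check its result.
        $ok = close($put) ? 1 : 0;
    }
    else {
        warn "PUT: cannot open '$outfile': $!\n";
    }

    # Reached on both paths: tell the client, then shut down and close.
    # syswrite is unbuffered, so nothing is left pending at shutdown.
    my $reply = $ok ? "OK\n" : "ERR\n";
    syswrite( $client, $reply );
    shutdown( $client, 2 );
    close($client);

    return $ok;
}
```

The lexical filehandle also means the output file is closed automatically if the sub is left early, which a bareword handle such as PUTFILE would not guarantee.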

      Try it, it just might sort out a lot of your problems!


        Hey sorry for the late reply, but I think everything is working great now! I essentially did what you suggested so looks like that was what was killing me. Been running for a week and no problems. Thanks very much for the help.
