
Re^9: PANIC: underlying join failed threded tcp server

by BrowserUk (Pope)
on Oct 20, 2012 at 05:37 UTC

in reply to Re^8: PANIC: underlying join failed threded tcp server
in thread PANIC: underlying join failed threded tcp server

Okay. I think I have a handle on what is happening. The short explanation is that you are simply running the OS out of resources.

2000 concurrent threads, each starting a console session doing a dir -- 100 of which are recursive from root -- consume a prodigious amount of resources.

Your use of a VM may be a contributory factor; I cannot reproduce the error here. My system grinds to a near-complete halt for an extended period, but once the 1900 dirs of the current directory finish and the 1900 cmd.exe's, 1900 threads and 1900 tcp connections go away, my system returns to a responsive state. It is then just a case of waiting while the 100 dir /s c:\ finish recursing the 212,000 directories and 1.5 million files on my hard drive, and then for all that data to get wrapped up in your protocol, shipped back to the receiving processes, unwrapped and output to the terminal.

But it works. It eventually completes okay, which I find quite remarkable, and it makes me think ithreads -- on Windows at least -- are in remarkably good fettle.

I do not believe that the problem you are seeing is a Perl issue; but rather an OS issue where -- under extreme resource depletion -- it is dropping/forgetting kernel thread objects that have completed before perl gets the chance to wait for them. I don't believe that should happen under normal circumstances, but these are not normal.

Why "believe" this and "believe" that!

There is a possibility that the trace output we produced is lying to me. For simplicity, the trace I had you add to threads.xs is crude -- and flawed. Using printf from multiple threads in C is subject to the same problems of buffering and overlapping as print/printf are from multiple threads in Perl. It needs to be serialised. It is possible that the symptom I am seeing in the trace output you supplied -- i.e. a thread created that has become a non-thread by the time Perl tries to join it:

9408: ITCREATE: thread handle:2ca0 thread-id: 3620x ... thread handle:2ca0 thread-id: 0x GetLastError output: '6'

Is a symptom of overwritten buffered IO, rather than an OS "quirk". To counter that possibility, I've re-written the tracing code and wrapped it in a critsec to (attempt to) preclude that possibility. What that means is I am going to ask you to replace your threads.xs with this version:

I was going to post it above, but it is too big; perlmonks won't accept it. You'll need to /msg me an email ID so I can send it to you.

And re-build/install it. Then re-create your problem one more time. If the new install goes well, you should see:

    *** CritSec initialised ***
    RXD+ had been started on port 1600
    ...

when you start the server.

If you are successful in re-creating the failure, the trace output should be more reliable.
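The same serialisation idea applies to tracing done at the Perl level. As a minimal sketch (the `trace` helper, `$trace_lock` and `@trace_log` are names invented for illustration, not anything from threads.xs or the server code), a single shared lock around the output stops lines from different threads overlapping:

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical helper: one shared lock serialises all trace output.
my $trace_lock :shared;
my @trace_log  :shared;    # kept only so the example can be inspected

sub trace {
    my( $fmt, @args ) = @_;
    my $line = sprintf "%d: $fmt", threads->tid, @args;
    lock( $trace_lock );           # released when trace() returns
    push @trace_log, $line;
    print STDERR $line, "\n";
}

# Four threads, each emitting three trace lines.
my @workers = map {
    threads->create( sub { trace( 'step %d', $_ ) for 1 .. 3 } )
} 1 .. 4;
$_->join for @workers;

printf STDERR "%d trace lines recorded\n", scalar @trace_log;
```

The order of lines from different threads is still not deterministic; the lock only guarantees that each line is written whole.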

I also suggest the following 1-line change to the server which, whilst it won't cure the problem, should make it less likely to occur -- assuming I've diagnosed the problem correctly. The change is to severely reduce the stack size allocated to each thread:

    use warnings;
    use threads stack_size => 4096;
    use threads::shared;
    use IO::Handle;
    use IO::Socket::INET;
    use File::Find;
    use File::Path;
    use Digest::SHA;
    ...

Note: Anecdotal evidence suggests that this does not work under (some versions of) *nix, if that is one of your targets. (It might work with 64k rather than 4k, but that is a guess! I've never had any feedback to confirm or deny that.)
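One way to check what a given platform actually does with a stack size request is to ask a thread to report its own setting. This is only a sketch; the 64k figure is the guess from the note above, not a tested recommendation, and the threads documentation says a platform may round the value up or impose its own minimum:

```perl
use strict;
use warnings;
use threads;

# Request 64KB per thread (an untested guess, see the note above).
my $requested = 64 * 1024;
threads->set_stack_size( $requested );

# Have a new thread report the stack size it was actually given.
my $thr    = threads->create( sub { return threads->self->get_stack_size } );
my $actual = $thr->join;

printf "requested %d bytes, thread reports %d bytes\n", $requested, $actual;
```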

A long term fix

Finally, I think that the real fix for the problem -- assuming we can confirm my diagnosis -- is to limit the number of concurrent clients to some sane number. On Windows, with the stack_size fix above, a moderately specified VM -- say 8GB memory -- should handle 100 concurrent clients okay. You'll need to tweak that number for your target environment.

How I would implement that limiting is in the following code:

    ...
    unless( $client = $lsn->accept ) {
        tprint( "Could not connect to socket: " . $! );
        next;
    }
    # In scalar context, threads->list() returns a count of matching threads
    if( threads->list( threads::running ) >= 100 ) {
        $client->shutdown( 2 );
        $client->close;
        tprint( "Client $client rejected; too many concurrent clients." );
        next;
    }
    ...

You might want to defer the rejection until you've accepted and validated the transmitted command and return a rejection/retry notification at that point if there are still too many concurrent clients.
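A self-contained sketch of that limit, with no sockets involved. The helper names and the tiny limit of 3 are made up for the demonstration, and each "client" is just a thread that sleeps. Finished threads are joined before counting, so completed workers do not accumulate:

```perl
use strict;
use warnings;
use threads;

my $MAX_CLIENTS = 3;    # demo value; the post suggests ~100 on an 8GB Windows VM

# Join every thread that has finished, so completed workers do not pile up.
sub reap_finished {
    my $reaped = 0;
    for my $thr ( threads->list( threads::joinable ) ) {
        $thr->join;
        ++$reaped;
    }
    return $reaped;
}

# True if another client may be started right now.
sub have_capacity {
    reap_finished();
    return threads->list( threads::running ) < $MAX_CLIENTS;
}

my( $accepted, $rejected ) = ( 0, 0 );
for my $client ( 1 .. 6 ) {
    if( have_capacity() ) {
        threads->create( sub { sleep 2 } );    # stand-in for serving a client
        ++$accepted;
    }
    else {
        ++$rejected;                           # a real server would close the socket here
    }
}
$_->join for threads->list;
print "accepted $accepted, rejected $rejected\n";
```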

{Thwack!} Ball's in your court :)

Update: BTW, I also reduced the test driver to this:

    open my $fh, '<', 'testRXD.txt' or die $!;
    my $i = 0;
    while( my $line = <$fh> ) {
        chomp $line;
        system 1, $line;    # Windows-only: start the command without waiting for it
        # printf STDERR "task %d '$line' running\n", ++$i;
        # Throttle: wait while more than 100 perl.exe processes exist
        sleep 1 while `tasklist | find "perl.exe" | wc -l` > 100;
    }
    close $fh;

This both reduced system resource usage (by doing away with the threads waiting on clients) on my single machine tests and controls the number of concurrent clients.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong

Re^10: PANIC: underlying join failed threded tcp server
by rmahin (Scribe) on Oct 22, 2012 at 21:28 UTC

    Alrighty, reran the tests with the new threads.xs file you sent me. The first time with the default thread stack size, and the second time with the value of 4k you suggested. The second time did take considerably longer to die, but as you said it would only make it less likely to occur (which seems to be the case).

    I like your solution to limiting the number of connections. Had to make a slight tweak so threads would still be joined, but that seems to be working well..ish. The current VM I'm testing on has only 4GB of RAM, and doing my usual tests, the thing would still occasionally crash with the same message even after reducing it to 50 threads. Tried 30, and it seems to be going ok. I was just wondering if there was any logic behind your suggestion of 100 threads for 8GB of memory, or if that was just a rough estimation?

    Thanks again for all your help, I'll post again if I run into the same problem

      I was just wondering if there was any logic behind your suggestion of 100 threads for 8GB of memory, or if that was just a rough estimation?

      A simple guesstimation based upon my observation that, on my system, each client requesting a dir /s c:\ required around 50 to 60 MB in order to accumulate all the output, wrap it up and forward it to the client. 100 * 60MB ~= 6GB, leaving some headroom for other stuff. Also, remember that there is a fixed overhead for the OS, so 100 on 8GB might well translate to < 50 on 4GB.
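That arithmetic can be written down as a small helper. The function name and the 2GB reserved for the OS are assumptions made for illustration; only the ~60MB per client and the 8GB/4GB totals come from the discussion above:

```perl
use strict;
use warnings;

# All three inputs are rough guesses to be measured on the target system.
sub estimate_max_clients {
    my( $total_mb, $reserved_mb, $per_client_mb ) = @_;
    my $usable = $total_mb - $reserved_mb;
    return 0 if $usable <= 0;
    return int( $usable / $per_client_mb );
}

# 8GB and 4GB machines, 2GB assumed reserved for the OS, ~60MB per client.
printf "8GB: %d clients\n", estimate_max_clients( 8192, 2048, 60 );   # 102
printf "4GB: %d clients\n", estimate_max_clients( 4096, 2048, 60 );   # 34
```

Because the reserved amount is fixed, halving the RAM cuts the estimate to about a third rather than a half.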

      I think that the real resource problem with your server/protocol is the need to accumulate all the output at the server prior to returning it to the client, forced on you in part by your use of backticks to execute the command.

      If you used a piped-open and returned the output to the client line by line as you get it:

          # $resp = `$rxdArgs 2>>&1`;
          my $pid = open my $PIPE, '-|', qq[ $rxdArgs 2>>&1 ] or die $!;
          while( <$PIPE> ) {
              # Forward each line as soon as it arrives instead of accumulating it
              returnOutputToClient( $_ );
          }

      then your server memory usage would be cut to 1/10th of its current requirements, with (hopefully) pro-rata benefits to the number of concurrent clients you could handle. But I realise that would require a substantial re-working of both your server processing and the communications protocol.

      The upside of the change would be that your server's concurrent client limits would be independent of commands they are running (and volumes of output they produce), as you would only cache a single line at the server. It would also allow your clients to start seeing the output from their interactions in much closer to real time. And potentially even interrupt that output if they've seen enough.

      Also, transmitting the retrieved output line by line would have far less impact upon the network infrastructure than returning it in one huge chunk.
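A runnable sketch of the line-by-line approach. The child command (a second perl printing five lines) and the `return_output_to_client` callback are stand-ins invented for the example; a real server would write each line to the client socket instead. It uses the list form of piped open, which on Windows needs a reasonably recent perl:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real command: a child perl that prints 5 lines.
my @cmd = ( $^X, '-e', 'print "line $_\n" for 1 .. 5' );

# Hypothetical callback; a real server would write to the client socket here.
my $sent = 0;
sub return_output_to_client {
    my( $line ) = @_;
    ++$sent;
    print "-> $line";
}

# Only one line of the command's output is held in memory at a time.
open my $pipe, '-|', @cmd or die "Cannot run @cmd: $!";
while( my $line = <$pipe> ) {
    return_output_to_client( $line );
}
close $pipe or warn "Command exited with status $?";
print "forwarded $sent lines\n";
```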

      I also wonder if you have the possibility to try your tests on a real machine rather than a VM? I suspect that if you did, you would see far fewer of these kinds of "mysterious OS problems". That is based on my own observations of weirdnesses with code running in VMs.

      You might also consider upgrading the OS. WS-2003 predates most of the rise and rise of VMs, and I'm sure that the use of VMs has highlighted (and hopefully caused to be fixed) many dubious practices in the earlier kernels. A more recent Windows Server release might be more stable in that environment.

      In a similar vein, I found far fewer problems running VMs under Vista than I did under XP. And more modern processors with the various levels of VT-x/AMD-V extensions are less prone to such mysteries than older ones.

      Thanks again for all your help, I'll post again if I run into the same problem

      You're welcome and good luck. (And it is always nice to get feedback:)


        Alrighty well...not quite there yet...The program was "working", i.e. not dying once the number of concurrent connections was limited. However, in our environment, the number of failures that would cause, from clients attempting to connect and just getting a "server is busy" error, would be very problematic. So I implemented retrying to send the command client-side. Ideally, I would do this server-side using the Thread::Queue module, and have a statically sized thread pool removing clients from the queue. But the restriction to no third-party modules prevents this. Anyway, the retrying works well as far as I can tell. If the server is bogged down, it sends a message back to the client and closes the connection. The client then retries after a random amount of time between 1 and 10 seconds.
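The retry behaviour described above might look roughly like this. It is a sketch, not the actual rx client: `send_with_retry` and its arguments are invented names, the attempt is any code ref that returns true on success, and the sleep is injectable so the demo does not really wait:

```perl
use strict;
use warnings;

# $attempt: code ref returning true once the server accepts the command.
# $sleep:   optional code ref taking a delay in seconds (defaults to real sleep).
# Returns the number of the successful try, or 0 if every try failed.
sub send_with_retry {
    my( $attempt, $max_tries, $sleep ) = @_;
    $sleep ||= sub { sleep $_[0] };
    for my $try ( 1 .. $max_tries ) {
        return $try if $attempt->();
        last if $try == $max_tries;
        $sleep->( 1 + int rand 10 );    # random delay of 1..10 seconds
    }
    return 0;
}

# Demo: the "server" is busy twice, then accepts; delays are recorded, not slept.
my $calls = 0;
my @delays;
my $tries = send_with_retry(
    sub { ++$calls >= 3 },
    5,
    sub { push @delays, $_[0] },
);
print "succeeded on try $tries after delays: @delays\n";
```

The random delay spreads the retries out, so rejected clients do not all reconnect at the same moment.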

        But alas, the problem returned, and the script just dies. At this point I figured I would try just catching the error with an eval, and not dying, BUT! This just caused the program to completely hang once it tried to join the thread with id 0x.

        For this round, I did make some changes to the tests. For one, I added back our code that pipes the output of the program and sends it a line at a time. The client still accumulates it as a massive string, but as you said, that would be tough to change without substantially changing the code. (We had this originally; it was just one of the commands I omitted, since I could create the problem without it.) I also changed all the commands to be just "DIR" to ensure that the CPU being completely taxed was not the source of the problem.

        And to further test the hypothesis that it was the number of connections, I reduced the number to 5 and met the same result. So I do not believe that this was causing the problem unless you still think otherwise.

        I also tried this on an actual machine rather than a VM and got the same message. :(

        Currently, the best idea that I can come up with is to give up on the joining altogether and detach the threads. This is the implementation we used to have, before reading another example (I think it was yours as well) and figuring out what the hash of file descriptors is for. So if I wanted to detach the threads, I would have to ensure that the thread opened the socket from the file descriptor before the main thread accepted another connection, correct? Let me know what you think about that.

        Here are the logs showing the server dying/hanging. Both occur right after an ITJOIN: thread handle:4f0 thread-id: 0x message appears, as we've seen before. I also included the updated rx/rxd with the retrying and the connection limit. On the server, I did not omit any commands, just in case. The command using the pipe is 'EXECPRINT'.

        As always, your help is much appreciated. Sorry the problem is not yet resolved haha.

        Update: Just had the idea to copy the parts of Thread::Queue into my code and make a much simpler version supporting only dequeue/enqueue operations. So will give that a shot as well.

        Update2: I tried this and it seems to be working! Going to let some stuff run overnight, and I'll report the status in the morning. And the code, if anyone (i.e. BrowserUk) is interested. Hopefully all will be good...
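For anyone following along, a cut-down queue of that kind could be sketched as below. This is not the code from the thread (which was never posted here). It is a guess at the shape, using only threads and threads::shared, with a fixed pool of workers and undef as an assumed "stop" marker:

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my @queue :shared;

# Add items and wake any workers waiting for something to do.
sub enqueue {
    lock( @queue );
    push @queue, @_;
    cond_broadcast( @queue );
}

# Block until an item is available, then return it.
sub dequeue {
    lock( @queue );
    cond_wait( @queue ) until @queue;
    return shift @queue;
}

my $POOL_SIZE = 3;
my @results :shared;

# Fixed-size pool: each worker handles jobs until it dequeues undef.
my @pool = map {
    threads->create( sub {
        while( defined( my $job = dequeue() ) ) {
            my $out = 'worker ' . threads->tid . " handled job $job";
            lock( @results );
            push @results, $out;
        }
    } );
} 1 .. $POOL_SIZE;

enqueue( $_ )    for 1 .. 10;
enqueue( undef ) for 1 .. $POOL_SIZE;    # one stop marker per worker
$_->join for @pool;
print scalar( @results ), " jobs handled\n";
```

With a pool like this the number of threads stays constant however many clients connect, so a busy server makes clients wait in the queue instead of being rejected.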
