Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re^12: PANIC: underlying join failed threded tcp server

by rmahin (Beadle)
on Oct 23, 2012 at 22:14 UTC ( #1000530=note: print w/ replies, xml ) Need Help??


in reply to Re^11: PANIC: underlying join failed threded tcp server
in thread PANIC: underlying join failed threded tcp server

Alrighty well...not quite there yet...The program was "working", ie not dying with limiting the number of concurrent connections, however in our environment with how many failures that would cause from clients attempting to connect and just getting a "server is busy" error would just be very problematic. So I implemented retrying to send the command client-side. Ideally, I would do this server side using the Thread::Queue module, have a static sized thread pool removing clients from the queue. But, the restriction to no third party modules prevents this. Anyways, the retrying works well as far as I can tell. If the server is bogged down, it sends a message back to the client, and closes the connection. The client then retries after a random amount of time between 1-10 seconds.

But alas, the problem returned, and the script just dies. At this point I just figured I would try just catching the error with an eval, and not dying, BUT! This just caused the program to completely hang once it tried to join thread with id= 0x.

For this round, I did make some changes to the tests. For one, I added back our code that pipes the output of the program and sends it a line at a time. The client still accumulates it as a massive string, but as you said, without substantially change the code, would be tough to add. (We had this originally, it was just one of the commands I omitted since I was creating the problem without it). I also changed all the commands to be just "DIR" to ensure the cpu being completely taxed was not the source of the problem.

And to further test the hypothesis that it was the number of connections, I reduced the number to 5 and met the same result. So I do not believe that this was causing the problem unless you still think otherwise.

I also tried this on an actual machine rather than a vm and got the same message. :(

Currently, the best idea that I can come up with is to give up on the joining all together and detach the threads. This is the implementation we used to have, before reading another example (I think it was yours as well) and figuring out what the hash of file descriptors is for. So if I wanted to detach the threads, I would have to ensure that the thread opened the socket from the file descriptor before the main thread accepted another connection correct? Let me know what you think about that.

Here is the logs showing the server dying/hanging. Both occur right after a ITJOIN: thread handle:4f0 thread-id: 0x message appears as weve seen before. I also included the updated rx/rxd with the retrying, and connections limit. On the server, I did not omit any commands, just in case. The command using the pipe, is 'EXECPRINT'. https://dl.dropbox.com/u/19686501/perlServer.zip

As always, your help is much appreciated. Sorry the problem is not yet resolved haha.

Update: Just had the idea to copy the parts of Thread::Queue into my code and make a much simpler version supporting only dequeue/enqueue operations. So will give that a shot as well. Update2: I tried this and it seems to be working! Going to lets some stuff run overnight, and ill report the status in the morning. And the code if anyone (ie, BrowserUk) is interested. Hopefully all will be good...


Comment on Re^12: PANIC: underlying join failed threded tcp server
Download Code
Re^13: PANIC: underlying join failed threded tcp server
by BrowserUk (Pope) on Oct 24, 2012 at 08:14 UTC

    Would it be possible for you to revert the perl in your VM to 5.10.1?

    It seems that somewhere between 5.10.1 and 5.16.1, somebody has dicked around with the threading and totally screwed the pooch.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      HEY! Got the server working by putting some of Thread::Queue into my code, and now it seems to be running like a champ. If you'd like to see the code let me know, but I'm guessing you have a fair idea of how its working. Ran 200k commands in 2 hours without a hiccup.

      And that is really interesting that threads got messed up. I'll definitely try downgrading perl and testing it out but as it is no longer a time critical thing, and more a matter of interest, I probably wont get to it until next week.

      Again, I really cannot stress enough how thankful I am for all your help debugging. Could not have gotten this far without you.

        HEY! Got the server working by putting some of Thread::Queue into my code, and now it seems to be running like a champ. If you'd like to see the code let me know, but I'm guessing you have a fair idea of how its working. Ran 200k commands in 2 hours without a hiccup.

        Cool. (Yes. I would really like to see the code. You have my email id.)

        And that is really interesting that threads got messed up. I'll definitely try downgrading perl and testing it out but as it is no longer a time critical thing, and more a matter of interest, I probably wont get to it until next week.

        I took your latest code -- stripped out the defensive stuff out of rxd.pl:

        # if( threads::list( threads::running ) >= 100 ) { # sendResponse("Session rejected. Too many sessions currently r +unning.\n", 101, $client); #just decided 101 should be the retry retu +rn code, no reason really # # tprint( "Session request by " . $client->sockhost() . " rej +ected" ); #disabled because it was flodding the log # $client->shutdown(2); # $client->close; # }else

        Tweaked the client (rx,pl) to run many commands in loop and do so from multiple threads:

        our $T //= 1000; our $R //= 1000; my @threads; for my $t ( 0 .. $T ){ push @threads, async { for my $r ( 1 .. $R ) { print "Client:$t sending command: $r\n"; my $server; my $rc; my $response; while(1){ $server = IO::Socket::INET->new( Proto => "tcp", PeerAddr => $address, PeerPort => $port, ); unless (defined($server)){print "Unable to connect to +RXD; [$! / $^E\n";} $server->autoflush(1); sendCommand( $server ); ($response, $rc) = receiveResponse( $server ); if($rc == 101){ shutdown($server, 2); close $server; my $random = 1 + rand(10); # select(undef, undef, undef, $random); next; } else{last;} } debug("Closing connection"); shutdown($server, 2); close $server; }; }; } $_->join for @threads;

        With 5.16.1, this falls in a heap with 1 thread after just a few seconds. The same "invalid handle" stuff as you've been seeing.

        With 5.10.1, I ran the above with those numbers -- 1000 concurrent clients, each running 1000 commands (the simple dir /s) and it ran to completion (after 1,000,000 commands served :), without errors in a little over 1 1/2 hours. And that's with 3 cores running the server and 1 core running the clients. Peak memory usage for the server was around 1.5 GB.

        Again, I really cannot stress enough how thankful I am for all your help debugging. Could not have gotten this far without you.

        YW. I learn almost as much from each of these real-world uses as the people I'm helping do.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        RIP Neil Armstrong

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1000530]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (10)
As of 2014-07-25 07:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (169 votes), past polls