PerlMonks  

Re^7: PANIC: underlying join failed threded tcp server

by BrowserUk (Pope)
on Oct 19, 2012 at 03:01 UTC (#999853)


in reply to Re^6: PANIC: underlying join failed threded tcp server
in thread PANIC: underlying join failed threded tcp server

None of that info is particularly helpful to me at least but if you see something I don't, I'm all ears.

What I can see from that info is:

  1. At line 3426:
    RXD+ (982) > Command completed with a return code of 0
    RXD+ (0) > _handle() output: 51686792
    thread handle:2d00 thread-id: 4240x

    A thread with a Perl thread ID (tid) of 982 completes. Just prior to joining, Windows thread handle 2d00 has an OS thread ID of 4240, and the join completes without error.

  2. Later, at line 3775:
    thread handle:2d00 thread-id: 0x
    GetLastError() output: '6'
    Join failed with 'Bad file descriptor' : 'The handle is invalid' at rxd.pl line 1128.

    Just prior to this later join attempt, 'another thread' with the same OS thread handle 2d00 has no OS thread ID, which indicates that thread handle 2d00 is indeed an invalid handle, as the system reports.

What that indicates is that either:

  • The OS is reusing the same OS thread handle -- which whilst possible seems unlikely.
  • Or threads->list(threads::joinable) is returning the handle of an already-joined thread, which also seems unlikely but could happen if the (Perl) internal linked list somehow got corrupted.

The next thing I would try is adding a similar trace line at the end of S_ithread_create(), something like:

    S_ithread_create(
        ...
        printf( "ITCREATE: thread handle:%x thread-id: %dx\n",
                thread->handle, GetThreadId( thread->handle ) );
        MY_POOL.running_threads++;
        return (thread);
    }

And also in

    STATIC void
    S_ithread_free(pTHX_ ithread *thread)
    {
        ...
    #ifdef WIN32
        printf( "ITFREE: thread handle:%x thread-id: %dx\n",
                thread->handle, GetThreadId( thread->handle ) );
        if (handle) {
            CloseHandle(handle);
        }
    #endif
        ...
    }

The idea is to isolate whether -- when the error occurs -- the invalid handle belongs to a thread that has already been freed, in which case the bug is in threads::list(), or to a thread that has not yet been freed, in which case it would mean an OS error of some kind, perhaps a resource constraint.

I briefly looked at trying to run your server here and re-create the failure. Whilst the server runs and accepts connections from a telnet session, it won't accept input from it because (my) telnet sends character by character and the server is expecting entire commands wrapped in your (incredibly complicated) comms protocol.

There is no way I am going to be able to reverse engineer a client that can talk that protocol.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


Re^8: PANIC: underlying join failed threded tcp server
by rmahin (Beadle) on Oct 19, 2012 at 21:35 UTC
    Alrighty, took a bit longer to recreate this time for whatever reason. I made the changes you suggested, but in the line printf( "ITFREE: thread handle:%x thread-id: %dx\n", thread->handle, GetThreadId( thread->handle ) ); I changed thread->handle to just 'handle', since it looks like they had already freed that pointer at that point.
    I attached the log of this run below in the files called serverOutput2.txt, it exited with a different error this time,
    Join failed with 'Inappropriate I/O control operation' : 'The handle is invalid' at rxd.pl line 1128.
    The RXD server has been shutdown.
    Perl exited with active threads:
        151 running and unjoined
        4 finished and unjoined
        0 running and detached
    Which again appears to be related to the thread handle.
    Sorry about the complicated protocol. The people I work with did not want much back-and-forth communication between the client and server; they wanted to send a command once and receive a response, and still needed a way to transfer a 400MB file. So this is what we (I) came up with (you should have seen the earlier version). I've attached the client (rx.pl) as well as the test script I used to recreate this issue. (I also included the server, rxd.pl, with all commands other than EXEC stripped out to shorten the code, and the exact threads.xs used to compile the threads module.)
    Thanks again for the help. Files:
    https://dl.dropbox.com/u/19686501/perlmonk.zip

      Okay. I think I have a handle on what is happening. The short explanation is that you are simply running the OS out of resources.

      2000 concurrent threads, each starting a console session doing a dir -- 100 of which are recursive from root -- consume a prodigious amount of resources.

      Your use of a VM may be a contributory factor; I cannot reproduce the error here. My system grinds to a near complete halt for an extended period, but once the 1900 dirs of the current directory finish and 1900 cmd.exe's & 1900 threads & 1900 tcp connections go away, my system returns to a responsive state. It is then just a case of waiting while the 100 dir/s c:\ finish recursing the 212,000 directories and 1.5 million files on my hard drive, and then for all that data to get wrapped up in your protocol, shipped back to the receiving processes, unwrapped and output to the terminal.

      But it works. It eventually completes okay, which I find quite remarkable and makes me think ithreads -- on Windows at least -- is in remarkably good fettle.

      I do not believe that the problem you are seeing is a Perl issue; but rather an OS issue where -- under extreme resource depletion -- it is dropping/forgetting kernel thread objects that have completed before perl gets the chance to wait for them. I don't believe that should happen under normal circumstances, but these are not normal.

      Why "believe" this and "believe" that!

      There is a possibility that the trace output we produced is lying to me. For simplicity, the trace I had you add to threads.xs is crude -- and flawed. Using printf from multiple threads in C is subject to the same problems of buffering and overlapping as print/printf are from multiple threads in Perl. It needs to be serialised. It is possible the symptoms I am seeing in the trace output you supplied -- i.e. a thread created that has become a non-thread by the time Perl tries to join it:

          9408: ITCREATE: thread handle:2ca0 thread-id: 3620x
          ...
          thread handle:2ca0 thread-id: 0x
          GetLastError output: '6'

      Is a symptom of overwritten buffered IO, rather than an OS "quirk". To counter that possibility, I've re-written the tracing code and wrapped it in a critsec to (attempt to) preclude that possibility. What that means is I am going to ask you to replace your threads.xs with this version:

      I was going to post it above, but it is too big; perlmonks won't accept it. You'll need to /msg me an email ID so I can send it to you.

      And re-build/install it. Then re-create your problem one more time. If the new install goes well, you should see:

          *** CritSec initialised ***
          RXD+ had been started on port 1600
          ...

      when you start rxd.pl.

      If you are successful in re-creating the failure, the trace output should be more reliable.
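The same serialisation idea can be sketched at the Perl level. This is only an illustration of the principle, not the critsec code in the replacement threads.xs: the names `$trace_lock`, `trace` and `run_trace_demo` are made up for the example. Each trace record is printed and flushed while holding one shared lock, so records from different threads cannot overlap.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use IO::Handle;

# Hypothetical lock variable: whoever holds it owns the trace stream.
my $trace_lock :shared;

# Write one complete trace record; the lock is released at end of scope.
sub trace {
    my ($fh, $msg) = @_;
    lock($trace_lock);
    $fh->print(sprintf("[tid %d] %s\n", threads->tid, $msg));
    $fh->flush;    # push the record out before another thread can write
    return;
}

# Start $nthreads workers that each emit $per_thread records, then join them.
sub run_trace_demo {
    my ($fh, $nthreads, $per_thread) = @_;
    my @workers = map {
        my $n = $_;
        threads->create(sub {
            trace($fh, "worker $n event $_") for 1 .. $per_thread;
            return;
        });
    } 1 .. $nthreads;
    $_->join for @workers;
    return $nthreads * $per_thread;
}

run_trace_demo(\*STDERR, 4, 3);
```

The order of records still depends on scheduling; the lock only guarantees that each record arrives whole.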

      I also suggest the following 1-line change to rxd.pl which, whilst it won't cure the problem, should make it less likely to occur -- assuming I've diagnosed the problem correctly. The change is to severely reduce the stack size allocated to each thread:

          use warnings;
          use threads stack_size => 4096;
          use threads::shared;
          use IO::Handle;
          use IO::Socket::INET;
          use File::Find;
          use File::Path;
          use Digest::SHA;
          ...

      Note: Anecdotal evidence suggests that this does not work under (some versions of) *nix, if that is one of your targets. (It might work with 64k rather than 4k, but that is a guess! I've never had any feedback to confirm or deny that.)
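One way to see what stack size a platform actually grants is to ask a thread after setting the default. The sketch below uses the threads module's `set_stack_size` and `get_stack_size` methods; the 64 * 4096 figure is purely an illustrative assumption (it is not the 4096 suggested for rxd.pl), and `effective_stack_size` is a made-up helper name. The module may round a request up to a platform minimum, so the reported value can differ from the requested one.

```perl
use strict;
use warnings;
use threads;

# Illustrative request only; tune for the target platform.
my $requested = 64 * 4096;
threads->set_stack_size($requested);

# Hypothetical helper: start a thread and have it report its own stack size.
sub effective_stack_size {
    my $thr = threads->create(sub { return threads->self->get_stack_size });
    return $thr->join;
}

printf "requested %d bytes, thread reports %d bytes\n",
    $requested, effective_stack_size();
```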

      A long term fix

      Finally, I think that the real fix for the problem -- assuming we can confirm my diagnosis -- is to limit the number of concurrent clients to some sane number. On Windows, with the stack_size fix above, a moderately specified VM -- say 8GB memory -- should handle 100 concurrent clients okay. You'll need to tweak that number for your target environment.

      How I would implement that limiting is in the following code:

          ...
          unless ($client = $lsn->accept) {
              tprint ("Could not connect to socket: " . $!);
              next;
          }
          # In scalar context, threads->list( threads::running ) returns a count.
          if( threads->list( threads::running ) >= 100 ) {
              $client->shutdown( 2 );
              $client->close;
              tprint( "Client $client rejected; too many concurrent clients." );
              next;
          }
          ...

      You might want to defer the rejection until you've accepted and validated the transmitted command and return a rejection/retry notification at that point if there are still too many concurrent clients.
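A rejection path that skips the rest of the accept loop can also skip whatever joins finished threads. The sketch below is one guess at how the check and the joining might be combined, without sockets so it runs on its own; `reap_and_check`, `worker` and `demo` are hypothetical names, not code from rxd.pl.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical helper: join every finished thread, then report whether
# the number of still-running threads leaves room for one more client.
sub reap_and_check {
    my ($max_running) = @_;
    $_->join for threads->list(threads::joinable);
    my $running = threads->list(threads::running);   # scalar context: a count
    return $running < $max_running;
}

# Stand-in for a client handler: blocks until the main thread releases it.
my $release :shared = 0;
sub worker {
    lock($release);
    cond_wait($release) until $release;
    return;
}

# Fill the server to its limit, test the check, drain it, test again.
sub demo {
    my ($limit) = @_;
    { lock($release); $release = 0; }
    my @workers = map { threads->create(\&worker) } 1 .. $limit;
    my $when_full = reap_and_check($limit) ? 'accept' : 'reject';
    { lock($release); $release = 1; cond_broadcast($release); }
    # Poll until every worker has finished running.
    select(undef, undef, undef, 0.05) while threads->list(threads::running);
    my $when_idle = reap_and_check($limit) ? 'accept' : 'reject';
    return ($when_full, $when_idle);
}

print join(' then ', demo(3)), "\n";
```

In a real accept loop the helper would be called right after accept, with the limit tuned to the machine.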

      {Thwack!} Balls in your court :)

      Update: BTW, I also reduced testrdx.pl to this:

      Which both reduced system resource usage (by doing away with the threads waiting on clients) on my single-machine tests and controlled the number of concurrent clients.


        Alrighty, reran the tests with the new threads.xs file you sent me. The first time with the default thread stack size, and the second time with the value of 4k you suggested. The second time did take considerably longer to die, but as you said it would only make it less likely to occur (which seems to be the case). https://dl.dropbox.com/u/19686501/perlmonk/logs.zip

        I like your solution for limiting the number of connections. Had to make a slight tweak so threads would still be joined, but that seems to be working well..ish. The current VM I'm testing on has only 4GB of RAM, and in my usual tests it would still occasionally crash with the same message even with the limit reduced to 50 threads. Tried 30, and it seems to be going OK. I was just wondering whether there was any logic behind your suggestion of 100 threads for 8GB of memory, or if that was just a rough estimate?

        Thanks again for all your help. I'll post again if I run into the same problem.
