Alrighty well...not quite there yet...The program was "working", ie not dying with limiting the number of concurrent connections, however in our environment with how many failures that would cause from clients attempting to connect and just getting a "server is busy" error would just be very problematic. So I implemented retrying to send the command client-side. Ideally, I would do this server side using the Thread::Queue module, have a static sized thread pool removing clients from the queue. But, the restriction to no third party modules prevents this. Anyways, the retrying works well as far as I can tell. If the server is bogged down, it sends a message back to the client, and closes the connection. The client then retries after a random amount of time between 1-10 seconds.
But alas, the problem returned, and the script just dies. At this point I just figured I would try just catching the error with an eval, and not dying, BUT! This just caused the program to completely hang once it tried to join thread with id= 0x.
For this round, I did make some changes to the tests. For one, I added back our code that pipes the output of the program and sends it a line at a time. The client still accumulates it as a massive string, but as you said, without substantially change the code, would be tough to add. (We had this originally, it was just one of the commands I omitted since I was creating the problem without it). I also changed all the commands to be just "DIR" to ensure the cpu being completely taxed was not the source of the problem.
And to further test the hypothesis that it was the number of connections, I reduced the number to 5 and met the same result. So I do not believe that this was causing the problem unless you still think otherwise.
I also tried this on an actual machine rather than a vm and got the same message. :(
Currently, the best idea that I can come up with is to give up on the joining all together and detach the threads. This is the implementation we used to have, before reading another example (I think it was yours as well) and figuring out what the hash of file descriptors is for. So if I wanted to detach the threads, I would have to ensure that the thread opened the socket from the file descriptor before the main thread accepted another connection correct? Let me know what you think about that.
Here is the logs showing the server dying/hanging. Both occur right after a ITJOIN: thread handle:4f0 thread-id: 0x message appears as weve seen before. I also included the updated rx/rxd with the retrying, and connections limit. On the server, I did not omit any commands, just in case. The command using the pipe, is 'EXECPRINT'. https://dl.dropbox.com/u/19686501/perlServer.zip
As always, your help is much appreciated. Sorry the problem is not yet resolved haha.Update: Just had the idea to copy the parts of Thread::Queue into my code and make a much simpler version supporting only dequeue/enqueue operations. So will give that a shot as well. Update2: I tried this and it seems to be working! Going to lets some stuff run overnight, and ill report the status in the morning. And the code if anyone (ie, BrowserUk) is interested. Hopefully all will be good...
In reply to Re^12: PANIC: underlying join failed threded tcp server