|No such thing as a small change|
Long time reader, first time poster. We have just been completely stumped on this problem for a while now so thought the geniuses over here might be able to shed some light on the matter. So here is our problem. We have a server that needs to accept connections from possibly hundreds of clients, all of whom can issue commands with extremely long output. The program hangs when it gets too many connections at a time, or...for some other reason, that we cannot identify.
Ill give you an overview of how our program works, forgive me if some explanation is left unclear I am trying to break it down into code segments that will be useful. Essentially we have a server accepting connections from multiple clients which will be reading and writing from a SQLITE database (we may change to different database if it becomes a problem) and each client issues commands which the server then kicks off to a client thread, which is passed to a worker thread.
Here is the general flow of our code. Will post segments as needed, but my job does not particularly like us sharing lots of code, and we have written several modules.
We have our main server script. It starts by initializing 2 thread pool objects (which add/remove/reuse threads as needed) using the subroutine we specify. For instance:
$ref used to have more values before, so we have left it as is for now.
One threadpool is for client connections and processing commands, and one is for doing the real work so the client can issue more commands while it is running. As the server receives new connections, each client connection is put into a thread queue with 2 values: it's socket as a file descriptor, and its ip address. This is shown in the code below.
Each thread in our clientPool then is then dequeuing connections that are enqueued by the server, containing the socket file descriptor, and the ip address of the client for communication and reading input from the client. We then read from this file descriptor like this which works well:
and check a few more values the clients sends the server. After the checks are done, we are ready to actually process commands.
So this is the main flow of our program. The subroutine executeCommand() uses the modules we have written to interact with our program, and log the output. Each command will be accessing our database getting varrying amounts of information, and updating the table to show jobs in use. For some commands our database must be locked for a good couple minutes or two. We are using the 'begin immediate transaction' to allow other clients to still be able to read from the database. This could be a source of problem, but right now we are thinking it lies elswhere. There two thing we have thought of that COULD be the problem with our code, but all attempts on our own to fix have turned up with very little.
1. log4perl. We write our info messages to the screen, and debug messages to a file. We have LOTS of debug messages. The output of some commands that we run can be well over 50,000 lines of text. We think this could be an issue of multiple threads are attempting to print to the same place at the same time because if we print these messages to the screen, our program hangs extremely quickly, whereas if we keep it to the info messages, things run mostly smoothly. Another sign that this could be a problem, heres a scenario:
we kick off a script to open 80 connections at once, with a client already connected. opening the 80 connections will hang the program if debug messages print to the screen. No new clients can connect to the server. The client already connected however, can issue commands, and have debug messages printed to our log file up to the point of actually receiving the command, but then nothing. nothing is printed to the screen here, only the file. which just seems weird!
We do seem to be running into a memory problem, but we should be able to track that one down, and would difficult for anyone here to do so without complete access to our code. but if you have any ideas on that matter, we'd be happy to hear them.
2. When we execute out commands, we are just doing system calls, but need the output. We originally used backticks to execute our command, but we have now switched to using open() and piping the output to a filehandle. The subroutine also looks for expected output to return to the user and if not found returns the last N lines of the command. that code is seen here, and nearly all of the time when our program hangs, it is around this point.
I really appreciate any help/advice anyone is able to give, so thank you so much in advance for taking the time to look at it!
UPDATE: Thanks everyone for your ideas, in the process of trying them out now. Also forgot to mention that this server is running on windows, using activestate perl. In case that affects any of your suggestions.
UPDATE 8/29/2012: Sorry for the late response everyone. The program took some time to debug and to try various suggestions. We eventually found the problem was log4perl, and for some reason when printing to STDOUT, the program hung. The problem was fixed by printing to STDERR instead. Not sure exactly why this is the case, but it seems to be working now...Thanks again for all your posts!