|Keep It Simple, Stupid|
Thanks for this.
It is reassuring that it is more likely an issue of the configuration of host we're using (as we can ask for more resources of any kind), rather than the code (except my code doesn't handle running out of resources well). And of course, we'll try to use resource monitoring, to see if we can identify what process is going crazy and consuming all available resources of whatever kind.
We know that it can not be a scheduled task that is causing the problem because the periods during which data is presently being lost are random: no particular time of day or day of the week.
The process is a user submits data to my cgi script, which validates it all, and if the validation check passes, some of it is sent to one or more of the web services we use, and the results we receive back from these is validated and combined with the request data we originally received to be stored. Since I can check the data on the services we're using, and it is checking that data that tells me about the missing data, and when it happens, that made me aware of the problem. We therefore know that I am receiving the data, successfully validating it, and sending it to the services we're using (which means the web service in question properly processed it). Thus, the failure is either that the web service fails to get it's response back to me, or at the point at which we store the data.
Clearly I need to beef up the web logs, so that the web server actually stores the complete request that we receive, and in the case of the request we submit to the web services we're using, the complete request we submit and the complete response that we get, if there is one. What I normally see in these logs is the first few dozen bytes of data sent, or received, and a message that so many hundred additional bytes were sent/received. I do not yet know how to tell Apache to do that, so that is more documentation I need to study (unless you know and can tell me - I have not yet even figured out how to tell Apache to start new logs every day at midnight, keeping the logs with the current date embedded in the logs' file names - but I will figure that out sooner or later, unless someone just tells me how to do it).
The critical data determining success or failure from the perspective of the user is whether or not the web services we're using successfully processed the data we sent to them. We can not tell the user that the transaction failed if the web service in question succeeded in processing the transaction, regardless of whether or not there is a subsequent failure, either in them communicating the result back to us or in our attempt to store the data. This failure must be handled entirely on my server, and in a way that preserves all data we received from the user and any data received from the web service(s).