Re^4: waitpid returns -1 for still running child (Windows)

Replies are listed 'Best First'.
Re^5: waitpid returns -1 for still running child (Windows) by BrowserUk (Patriarch) on Jul 06, 2010 at 16:16 UTC
My vague notion was that if you were using a centralised logging mechanism that added time stamps upon reciept, there might be some buffering going on. As described, it is hard to conceive of any circumstance that could result in the child perl being able to write after the parent cmd had completed. Nor even of what might cause the parent cmd to end before the child perl completed. Hence, looking for some reason why the recorded time stamps might be incorrect. A very (very) remote possibility, if you have NTP synchronisation set up, is that the time was synchronised between the child writing the file and printing its log message; and the grandparent detecting the completion of the parent and writing its log file; and the system's clock had been running 8 seconds fast prior to syncing. Unlikely, but if the time frame fits with the scheduled sync time on the machine... Beyond that, if the problem occurs sufficiently frequently to make it worth your while fixing the problem, rather than adding some workaround like sleeping for 10 seconds if the file isn't immediately available, then I'd look to setting up the Performance Monitoring tool to track processes and IO and see if that sheds any light once the problem reoccurs. Be very selective in what you choose to monitor, those Performance logs can get very large, very quickly. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply]
Re^6: waitpid returns -1 for still running child (Windows) by rovf (Priest) on Jul 07, 2010 at 13:45 UTC
I had meanwhile a second case of this effect, and analyzing the timestamps here, shows the following pattern: TIME 1: Child is started witch `system(1,...)` TIME 2: waitpid returns -1 TIME 3: The child writes the first entry to its log TIME 4: The child writes the last entry to its log The difference between the timestamps 4 and 1 is only 2 seconds, much less than in the first case. It is interesting here that the child seemed to start running after it had ended. Now we know that -1 returned by waitpid does not really mean "process has ended", but "process does not exist". The fact that it indicates an ended process is just my conclusion, since I had started it before. Could it be that there is some time gap between `system(1,...)` returning to the caller, and the child process being entered in the Windows process table? If the master process queries waitpid during this time, it would indeed get "process does not exist" and wrongly conclude that the child already terminated. I don't have access to all the logs of the first case, where the time gap was 8 seconds - much longer than in the second case. However, in both cases, the tests run on a heavily loaded machine, so maybe this could be an explanation. -- Ronald Fischer <ynnor@mm.st>	[reply] [d/l] [select]
Re^7: waitpid returns -1 for still running child (Windows) by BrowserUk (Patriarch) on Jul 07, 2010 at 14:49 UTC
Could it be that there is some time gap between system(1,...) returning to the caller, and the child process being entered in the Windows process table? Um. I think not. For a definitive answer you would probably have to ask Dave, but as system pretty much translates into a simple call to CreateProcess(), it is hard to see how that would return to the caller before the created process existed within the system tables. However, it is conceivable that on a very heavily loaded system, that the process of creating a process, meant that the creating process overstayed its time slice, and so as soon as that calling process transitioned back from ring 0 to ring 3, it was immediately context switched, and did not get another time slice for 2 seconds. If sufficient other processes had become eligaible to run in the interim. But 8 seconds? That said. I have on occasion run a perl process to explore the limits of my system--for example, by requesting the allocation of a single 10 GB scalar (I only have 4GB of physical memory), and then got up and made myself a cup'o'char. On my return, if the system is still running like a dog, I've resorted to killing the process. Now, like Mr. Tipton grits, some things just take as long as they take. And unless you bought magic grits from the beanstalk salesman, you're just gonna have to wait till they're ready. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply] [d/l]
Re^6: waitpid returns -1 for still running child (Windows) by rovf (Priest) on Jul 07, 2010 at 09:23 UTC
A very (very) remote possibility, if you have NTP synchronisation set up, is that the time was synchronised between the child writing the file and printing its log message Nice idea! Unfortunately, it does not apply here, because the evidence is not only the timestamp, but also the plausibility check we are doing: After the master received waitpid -1, it checks the control file, and this file did not exist at this time, but existed afterwards. This, together with the log entries from the child, shows that indeed the child was still running at the time. -- Ronald Fischer <ynnor@mm.st>	[reply] [d/l]


"be consistent"
	PerlMonks