I had meanwhile a second case of this effect, and analyzing the timestamps here, shows the following pattern:
- TIME 1: Child is started witch system(1,...)
- TIME 2: waitpid returns -1
- TIME 3: The child writes the first entry to its log
- TIME 4: The child writes the last entry to its log
The difference between the timestamps 4 and 1 is only 2 seconds, much less than in the first case. It is interesting here that the child seemed to
start running
after it had ended. Now we know that -1 returned by waitpid does not really mean "process has ended", but "process does not exist". The fact that it indicates an ended process is just my conclusion, since I had started it before. Could it be that there is some time gap between
system(1,...) returning to the caller, and the child process being entered in the Windows process table? If the master process queries waitpid during this time, it would indeed get "process does not exist" and wrongly conclude that the child already terminated.
I don't have access to all the logs of the first case, where the time gap was 8 seconds - much longer than in the second case. However, in both cases, the tests run on a heavily loaded machine, so maybe this could be an explanation.
--
Ronald Fischer <ynnor@mm.st>