http://www.perlmonks.org?node_id=863820


in reply to Re: Postpone garbage collection during a child process run?
in thread Postpone garbage collection during a child process run?

. . . You're using threads

I use 'fork' to generate the children. I can't show the failing subroutine, since every subroutine has failed at one time or another. This is a 20,000+ line subsystem with 91 subroutines. Sometimes a process works for an hour and then loses the pointer, and sometimes it's almost immediate. From analysis of the logs, it usually happens in the 4th or 5th inner subroutine call. The application starts with 4 children per core, expands if xx% are working, and contracts if more than the minimum exist and work is below yy%. All children live for approximately 4 hours, or end early if their RSS exceeds 2 * the original RSS.
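For concreteness, the sizing and retirement rules described above could be sketched roughly as follows. This is not the real code: the `%cfg` keys, the 80%/20% values standing in for xx% and yy%, the one-child-at-a-time step, and the names `target_pool_size` and `should_retire` are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical settings; the real xx% / yy% values are not given in the post.
my %cfg = (
    per_core     => 4,          # children started per core
    grow_above   => 0.80,       # assumed stand-in for xx%
    shrink_below => 0.20,       # assumed stand-in for yy%
    max_age      => 4 * 3600,   # children live roughly 4 hours
    rss_factor   => 2,          # retire when RSS exceeds 2 * original RSS
);

# Return the number of children the pool should have next.
# Growing or shrinking by one child per call is an assumption.
sub target_pool_size {
    my ($current, $busy, $cores) = @_;
    my $min = $cfg{per_core} * $cores;
    return $min if $current < $min;
    my $load = $current ? $busy / $current : 0;
    return $current + 1 if $load > $cfg{grow_above};
    return $current - 1 if $current > $min && $load < $cfg{shrink_below};
    return $current;
}

# Return 1 if a child should exit, based on its age and memory growth.
sub should_retire {
    my ($age_secs, $rss_now, $rss_start) = @_;
    return 1 if $age_secs >= $cfg{max_age};
    return 1 if $rss_now > $cfg{rss_factor} * $rss_start;
    return 0;
}
```

With 2 cores the minimum is 8 children, so `target_pool_size(8, 8, 2)` asks for one more child and `target_pool_size(10, 1, 2)` asks for one fewer, while a mostly idle pool never drops below the minimum.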

Environment:

Testing: AIX/Unix
Production: SUSE Linux or AIX

Thank you

"Well done is better than well said." - Benjamin Franklin

Re^3: Postpone garbage collection during a child process run?
by sundialsvc4 (Abbot) on Oct 06, 2010 at 15:46 UTC

    Bugs like this one ... ahem ... suck large.   But, “bugs they are,” and your application as-written will never work reliably in production.

    Maddeningly, it will be “tantalizingly close.”   You’ll spend a lot of time trying to convince yourself that the existing design can be made to work.   But these flaws are structural.   “Patches and workarounds,” like disabling signals and whatever else you might now be doing, only (I am very sorry to pronounce...) prolong the inevitable.

    “Because I have been there too, my son... And a more unpleasant place to be, might never be found on this world.”

    Perhaps you can isolate the necessary disciplines into a few base classes.   Or, maybe you can just change the routine that hands off the work to the children.   (Ideally, the child would simply pull data off of one queue and put it back onto another.)   So, the code changes might not be as dramatic as you may fear.   But, until the changes are made, the code will continue to throw sporadic errors under production loads (and maybe nowhere else).

    You definitely want to implement the solution so that it is “nearly invisible,” being encapsulated into a class such that every other part of the application can “just ignore it... It Just Works™.”   And I’d say that you can do that fairly easily.   Once it is done, the app will be rock-steady under any production load.
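A minimal sketch of the "pull from one queue, put onto another" idea, assuming a single forked child and two plain pipes acting as the queues. The name `run_jobs` and the one-job-per-line format are made up for illustration; nothing here comes from the original application.

```perl
use strict;
use warnings;
use IO::Handle;
use POSIX ();

# Send @jobs to one forked child over a pipe and collect one result per job
# from a second pipe. $work is a code ref the child applies to each job.
sub run_jobs {
    my ($work, @jobs) = @_;
    pipe(my $job_r, my $job_w) or die "pipe: $!";
    pipe(my $res_r, my $res_w) or die "pipe: $!";

    my $pid = fork();
    die "fork: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: read jobs until EOF, write one result line per job.
        close $job_w;
        close $res_r;
        $res_w->autoflush(1);
        while (my $job = <$job_r>) {
            chomp $job;
            print {$res_w} $work->($job), "\n";
        }
        close $res_w;
        POSIX::_exit(0);    # skip END blocks inherited from the parent
    }

    # Parent: queue the jobs, signal end of work, then drain the results.
    close $job_r;
    close $res_w;
    print {$job_w} "$_\n" for @jobs;
    close $job_w;           # EOF tells the child there is no more work
    chomp(my @results = <$res_r>);
    close $res_r;
    waitpid($pid, 0);
    return @results;
}
```

Calling `run_jobs(sub { uc $_[0] }, qw(a b c))` hands three jobs to the child and returns the upper-cased results in order. The parent writes every job before it reads any result, so this only suits small batches that fit in the pipe buffers; a production version would need to interleave reads and writes (for example with `select`) or use a real queue.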

      Bugs like this one ... ahem ... suck large

      There is always the fallback to "caused by cosmic rays disrupting non-checksummed memory" :-)


      I'm not really a human, but I play one on earth.
      Old Perl Programmer Haiku ................... flash japh

        Yep.   And in the days of magnetic core...

        (In the early 1980s, I actually got to see such a computer once, running...)