You may have noticed PerlMonks becoming non-responsive from time to time. This is usually (of late) due to our 'A' web server taking a "time out" to indulge in some recreational "extreme swapping" for quite a few minutes (this appears to have started happening after upgraded the OS, Apache, etc.), early this year.

After several attempts, I finally captured some output from 'top' that shows much about the problem. Anyone care to offer interpretations / insights regarding this output from 'top' trying to dump the list of processes every 60 seconds when PerlMonk's 'A' web server "goes away"? Notice how it takes 5 1/2 minutes not "a bit over 60 seconds" for one update there. See the load average climb. See lots of 'httpd' processes appear, starving lots of older httpd processes for real RAM.

I'll look at this more as I find more time, but I'd be interested in well-considered theories about possible sources for such behavior.

I wish 'top' would show who the parent of each process was so I could tell which process is creating all of these extra processes, but 'top' isn't particularly flexible but is still the best tool I've found available on this system so far (no, I don't have root access and doubt seriously that would give it to me). Perhaps the next iteration of this logging should add periodic "ps" output to the logs to get that parent/child information, though I bet cron trying to start up "ps" would take so long when the problem is happening that it'd miss seeing the problem, based on past iterations. ;)

One difference between the 'A' and 'B' web servers is that the 'A' web server gets quite a lot of traffic from search engine spiders indexing PerlMonks via "http://someotherhostname/~monkads/?...". I disabled this for msnbot as it was doing twice as many hits as the next-busiest robot and was doing hits for a lot of bizarre URLs. I may soon disable it for all robots since the problem continues.

The 'top' output is in <spoiler> tags as just <readmore> tags would make it impossible to view the whole thread of discussion w/o the data "in the way". So "reveal" spoilers to see the output.

Update: Looking at the http access_log for around the time that the problem appears to start has not revealed any "smoking gun" evil URLs that somehow cause the receiving httpd to become a fork bomb, but that hay stack is rather large and the data recorded isn't ideal for finding such things. A more Everything-aware log of accesses is on my to-do list...

- tye