|go ahead... be a heretic|
Re^2: 'A' web server takes another "time out" (disk)by tye (Sage)
|on May 04, 2006 at 16:53 UTC||Need Help??|
One of my prior working theories was a disk "going bad" (having much experience with the fact that the manufacturors of commonly-used disk drives, drivers, and controllers only took away half of the point of "fault tolerance"1 resulting in drives "going bad" extremely silently, the only "evidence" being a particular pattern of slow-down).
But that was when I didn't see good evidence of lots of swapping going on. Of course, "lots" is a relative term so, anyone, please feel free to make some calculations of disk speed based on the amount of swapping reported above and let us know if, in order to explain the CPU idleness, we'd need to have an unusually slow disk in the mix as well.
There is no database disk on this system.
1 The major point of the fault tolerance movement was to prevent things from suddenly failing. The point was that you could spend more and more resources making things more and more reliable, probably reducing how frequently something just "falls down" but you'd still end up having things suddenly fail, likely at a very inconvient time and have to spend a lot of down time and running around in a panic trying to replace / repair what failed. A "better way" was seen: Don't have single points of failure so that when something fails, things can continue on and you can schedule to replace the failed part at a convenient time, perhaps without even requiring down time. And the key to this working is that someone must be notified that a failure happened! Unfortunately, so many common modern systems include features that are tolerant of faults but provide no means of notification and often even prevent you from ever being able to tell, no matter how hard you look, that a fault happened. Hard disks are a great example of this, in my experience.
It used to be that a hard disk going bad would start recording faults in your syslog and the frequency of these reports would rise, very slowly at first but following a geometric curve, and you'd replace the disk before it catastrophically died. Now most disks start to fail by slowing down from time-to-time, more and more dramatically, eventually nearly locking up while the disk retries reading the sector that is going bad but eventually fails, then the driver/controller retries which causes the disk to do a whole nother round of retries, then the operating system multiplies the number of retries yet again with its own retries... and eventually we just get lucky and the CRC "passes" and no hard evidence that anything at all went wrong remains.
I'd point you to a google search for the "S.M.A.R.T." acronym but google no longer treats searching for "s m a r t" differently from searching for "s-m-a-r-t" and so you'd just get a huge list of pages containing the word "smart". That system lets you query some internal counters kept nearly hidden inside the disk drive that likely includes a count of at least some types of retries. It is the only way I've been able to find any real evidence (usually still quite vague) that a disk is starting to fail. But note that most S.M.A.R.T. tools try to be "smart" and just figure out for you whether or not the disk is about to suddenly fail (making nearly the identical mistake mentioned above) and thus usually don't tell you a single thing until the disk is within minutes of failing (usually while you aren't using the computer, and often only after the failure has already become catastrophic). So you have to jump through hoops to look at the raw S.M.A.R.T. data and make guesses at what some of them mean... Which has a lot to do with why you've probably not heard of S.M.A.R.T. before (or only heard bad things about it).
And then there is the other extreme: parity checking of memory. When your memory is working just fine 99.999% of the time but a single bit error is noticed and reported to you by virtue of the fact that your entire computer system has suddenly become a frozen brick displaying the notification on the console. Being blissfully unaware of the rare single-bit error starts to look good when compared to having all of the in-progress work, most (probably all) of which would be unaffected by that one bit, being sent to evaporate for the sake of providing notification of a fault...
Yes, I understand that the plumbing of notifications is hard and that is why this plumbing of notifications is so often not done or is done so badly.