PerlMonks  

[OT] Reminder: SSDs die silently

by afoken (Chancellor)
on Apr 04, 2023 at 09:01 UTC ( [id://11151456]=perlmeditation )

Yesterday, one of the SSDs in my main computer suddenly died, from one second to the next. It simply disappeared from the system, leaving behind two very confused virtual machines that lost access to their virtual disks stored on that SSD. This way, I lost about one hour of work. That would have been annoying, but could have been fixed easily. Shut down, rip out the SSD, replace it with a fresh SSD or a hard disk, and restore the backup.

But: That SSD was added at the beginning of the Covid-19 pandemic, as a quick hack to have room for the VMs needed for working from home. It was never intended to work for more than a few weeks, and so I simply forgot to include that disk in the configuration of the backup software.

I tried for about an hour to read the dead SSD using two other computers, but it is dead. It identifies correctly, but reports junk when reading SMART data, and reads not a single bit of user data. I reassembled my computer, added a temporary HDD, ordered a replacement SSD, and started a 17-hour copy job to get the required VMs as huge ZIP files from work to home. It will take another hour or two to unpack and reconfigure the VMs for the new environment. And one or two hours to resync some work data from a cloud service.

This is totally my fault, having no backup for that disk was stupid, period.

So, take this as a warning if you are - like me - used to getting an audible warning from a failing disk. SSDs die silently and suddenly. You won't get those nasty metal workshop sounds you know from failing hard disks.

Check your backups, and check your backup configuration.

Updates:

Changed some wording.

https://www.backblaze.com/blog/ssd-edition-2022-drive-stats-review/ does not look very promising for using SMART monitoring. SSD SMART data is messy at best:

[L]et’s talk about SSD SMART stats. [...] we’ve been wrestling with SSD SMART stats for several months now, and one thing we have found is there is not much consistency on the attributes, or even the naming, SSD manufacturers use to record their various SMART data. For example, terms like wear leveling, endurance, lifetime used, life used, LBAs written, LBAs read, and so on are used inconsistently between manufacturers, often using different SMART attributes, and sometimes they are not recorded at all.

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Replies are listed 'Best First'.
Re: [OT] Reminder: SSDs die silently
by Fletch (Bishop) on Apr 04, 2023 at 13:28 UTC

    Condolences on your loss. That did remind me of the time I had a sparc 1 (? Pizza box I’m sure but maybe it was a 2) whose drive wouldn’t start spinning up cold (maybe the spindles had just dried their lube up). As long as it was running it worked fine though; if the box lost power it took a whack with a screwdriver handle to jar things into motion and it’d be fine again.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      if the box lost power it took a whack with a screwdriver handle to jar things into motion and it’d be fine again.

      That brings back memories :-)

      We had six Sparc 1 boxes with dual Quantum Prodrive 104S disks and the advice we had from Sun engineers if they didn't spin up was to lift the pizza box half an inch off the desk and drop it. That usually got things moving again. We had plenty of other Sun boxes over the years, IPCs, IPXs, 2's, 10s etc., but it was only those Quantum drives that ever failed to spin up.

      Cheers,

      JohnGG

Re: [OT] Reminder: SSDs die silently
by hippo (Archbishop) on Apr 04, 2023 at 10:25 UTC

    Sorry to hear about your SSD woes, afoken. Sounds like another one to add to the quick-hack-stealthily-becomes-mission-critical file.

    However, I did want to offer my thanks for this wisdom about SSD sudden death. My current main machine uses spinning platters (mostly RAID-1) but is planned to be replaced in the next few months, likely with one with SSD as that seems to be the new norm. On the basis of what you have written, I will be even more paranoid than usual about ensuring the frequency, quality and availability of the backups.


    🦛

Re: [OT] Reminder: SSDs die silently
by afoken (Chancellor) on Apr 13, 2023 at 20:38 UTC

    OK, that problem won't bite me again.

    I've replaced the broken SSD with an old hard disk, and included it in the backup configuration. I ordered two new SSDs, same size, same model, same brand.

    Yesterday, I spent a day juggling four SSDs, an HDD, the on-board RAID-capable SATA controller on my mainboard, and a few old SATA controllers. After a bit of fiddling, I now have set up the two new SSDs as a RAID-1 used as the data drive. The two old SSDs, which also happened to be a pair of same size, same model, same brand, are set up as another RAID-1 used as the system/boot drive. Data from the old data SSD and the HDD was copied to the data RAID before creating the system RAID. Then, I created a new system RAID from the old data SSD and the HDD, copied the working operating system from its SSD to the RAID, disconnected the system SSD, and booted from the system RAID. Finally, I ripped out the HDD, degrading my fresh system RAID, and replaced it with the old system SSD.

    Now, any single SSD may die, and I will still be able to use the PC. All I will have to do is to order a replacement SSD and swap it for the then-broken SSD. I know that this will work, because ripping out the HDD at run time was part of the way I migrated to the RAID setup.

    If two SSDs in the same RAID die within a short time, I still have a backup, but restoring that will take some time.

    I'm considering repeating that process on my machine at work. That machine is basically a copy of my main computer, using a slightly newer mainboard and a slightly faster CPU. And it has no backup at all.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      I'm considering repeating that process on my machine at work. That machine is basically a copy of my main computer, using a slightly newer mainboard and a slightly faster CPU. And it has no backup at all.

      Well, this becomes a kind of "Alexander's blog of dying SSDs".

      I installed the Urbackup client on the PC at work, so it will be backed up by our regular backup server. Let the machine run for a weekend to have both an image and a file backup. Backup problem solved.

      Originally, the PC at work had a relatively cheap, well used small SSD for the system and a major brand large SSD for the data; the latter was recently upgraded. I ordered two more major brand SSDs (one small, one large), started creating two RAIDs (using the onboard SATA fake RAID) and copying the data and system drives to the RAIDs. Things went slightly wrong right at the start. The first RAID for the data drive switched to degraded mode (i.e. one drive was set to failed) while still copying data to the RAID. Oh well, I used a slightly wonky setup with cables and SSDs dangling around, using old abused data cables, so it might have been my fault.

      But during the next days, the data RAID failed almost every day, and almost always on one of the new major brand SSDs. Rebooting made the disks show up again, and I could start RAID reconstruction. I changed cables, I changed the SATA ports, I even replaced the power supply because it made some load-related noise. No change. Almost every day, the RAID failed, with one of the new SSDs disappearing. I changed the RAID drivers and the RAID monitor/controlling software to the newest one available from the chipset manufacturer, and back to exactly the versions running at home. No change. I ordered a new large SSD, again the same major brand, to replace the SSD that seemed to be broken.

      I booted Linux from a USB stick and luckily, that Linux did not know anything about the fake RAID. It saw just four SSDs, and using smartmontools, I confirmed that all four SSDs were healthy and diagnosed themselves as healthy.
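A health check like the one described above could be scripted roughly as follows. This is only a sketch: it assumes a smartmontools version with JSON output (`smartctl -H -j`) and that the report carries a `smart_status.passed` field; the function names are made up for illustration. As this thread shows, a drive that diagnoses itself as healthy can still misbehave, so a passing result proves little.

```python
import json
import subprocess


def parse_health(smartctl_json):
    """Return True/False for the drive's self-assessment, or None if unknown.

    Assumes the JSON layout of `smartctl -H -j`, where the overall verdict
    is stored under smart_status.passed.
    """
    try:
        report = json.loads(smartctl_json)
    except json.JSONDecodeError:
        return None  # junk output, e.g. from a dead drive
    if not isinstance(report, dict):
        return None
    status = report.get("smart_status")
    if not isinstance(status, dict) or "passed" not in status:
        return None  # the drive did not report an overall verdict
    return bool(status["passed"])


def check_device(device):
    """Run smartctl on one device (needs root) and return its verdict."""
    # smartctl uses its exit status as a bit mask, so a non-zero exit code
    # does not necessarily mean the call failed; parse the output instead.
    result = subprocess.run(
        ["smartctl", "-H", "-j", device],
        capture_output=True,
        text=True,
    )
    return parse_health(result.stdout)
```

Calling `check_device("/dev/sda")` for each drive gives `True`, `False`, or `None` when the drive returns nothing usable.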

      I googled A LOT while testing around. Many, many people suggest using almost any other RAID solution, but not THIS onboard SATA fake RAID (AMD RAIDXpert). From what I read from generated logs, change logs from the drivers, and behaviour of my work PC, my guess is that the RAID driver very aggressively reads the disk identification and maybe also SMART data. If one disk fails to respond within a very short time, it is considered offline and the RAID switches to degraded mode. The RAID failed especially under high disk load (e.g. booting two VMs at the same time).

      So I decided to order another fake RAID controller, using a relatively cheap SATA controller, but from a manufacturer with a good reputation and a lot of RAID experience. It came with a set of new SATA cables. I did some more disk juggling and copying, and finally got my system and data volumes to the new controller. During setup, the new RAID controller complained about the old cheap SSD. It failed to execute the TRIM command needed to get rid of the onboard RAID metadata on the SSD. So I used the SSD I ordered last to replace that SSD instead of the one I suspected to be broken.

      The work PC has worked for two weeks without a single complaint. The RAID monitor software of the new controller logs some issues at boot up, but it does not complain about them. So it seems the SSDs may show some slightly unexpected, but completely tolerable behaviour at boot.

      So, I was left with the old, cheap SSD, containing my entire OS including my user profile. That won't go into the junk bin in that state. I grabbed another computer, pushed in another fake RAID controller from the same manufacturer, and used its extension ROM to try deleting the SSD again. It failed even in that machine, which has almost nothing in common with my work PC. So, I removed the fake RAID controller from the temporary PC, connected the SSD and the spare hard disk used while copying my data to the onboard SATA controller, fired up Linux, tried and failed once more to use the TRIM command, and finally deleted both using ddrescue, writing /dev/zero to each of the disks until the disk was full. The HDD wrote at about 130 MB/s, while the SSD had a hard time reaching even 30 MB/s. Both were successfully filled with zeros. The HDD goes back to the cold spare shelf, the SSD will be subject to a nice 4 kV burn-in test (suggested by our hardware expert) before going to the junk bin.

      So what happened? Why did the data RAID break when the system RAID had a malfunctioning SSD?

      My guess is that this is an issue of the mainboard fake RAID drivers. I guess that they use a timer to poll the identification and/or SMART data, plus async I/O. Once the timer is expired, a single(!) timeout timer is started, and results are read from all disks. The cheap SSD was the first drive, and took quite long to answer, but just not long enough for a timeout. The other three SSDs answered quicker, but with high I/O load, the SSDs were busy doing other stuff. And in that case, due to the slow first SSD, the timeout timer expired before the third (unlucky) SSD could answer. Sometimes, the fourth SSD was unlucky, rarely even the second SSD.
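The guess above can be illustrated with a toy model. This is not the real driver's logic, which is unknown; it is only a sketch that assumes the answers are collected one disk after another against a single shared deadline, and all names and numbers are invented. It compares that with a hypothetical design that gives every disk its own timeout.

```python
def first_disk_blamed_shared_timer(latencies_ms, timeout_ms):
    """Return the index of the first disk that misses a shared deadline.

    One timer covers the whole polling round, so a slow early disk eats
    the time budget of every disk that is read after it.
    Returns None if all disks answer before the timer expires.
    """
    elapsed = 0
    for index, latency in enumerate(latencies_ms):
        elapsed += latency
        if elapsed > timeout_ms:
            return index
    return None


def disks_blamed_per_disk_timer(latencies_ms, timeout_ms):
    """Return the indexes of disks that exceed their own, separate timeout."""
    return [i for i, latency in enumerate(latencies_ms) if latency > timeout_ms]


if __name__ == "__main__":
    timeout_ms = 100
    # Disk 0 is the slow cheap SSD; the 30 ms entry is a healthy SSD that
    # happens to be busy with other I/O during this polling round.
    for latencies in ([80, 5, 30, 5], [80, 5, 5, 30], [80, 30, 5, 5]):
        print(latencies,
              "shared timer blames disk:",
              first_disk_blamed_shared_timer(latencies, timeout_ms),
              "per-disk timers blame:",
              disks_blamed_per_disk_timer(latencies, timeout_ms))
```

In this model the slow first disk never times out itself, but whichever healthy disk happens to be busy is the one marked as failed, while separate timers would blame nobody.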

      Why is that not a known bug? I guess using two RAID-1s of two SSDs each is not a common use case with that mainboard. Having one of the SSDs slow down is probably even more rare. And having people with SSD problems reporting to the mainboard and/or chipset manufacturer is very unlikely.

      Will it be fixed? Unlikely. I did not bother to report the problem, the chipset is 12 years old, I guess no one will fix drivers for consumer hardware that old.

      Lessons learned:

      • Always have a working backup.
      • SSDs can die in very creative and unexpected ways.
      • Fake RAID controllers suck. Some more, some less.

      Alexander


      A note on RAID jargon:

      • A hardware ("real") RAID is some kind of intelligent hardware (SoC, Microcontroller, FPGA, even dedicated desktop CPUs) that connects to a bunch of disks and presents one disk per RAID volume to the host. In the most basic case, it's a blackbox with two SATA ports to two SATA disks, and a SATA port to the host. It is completely transparent to any operating system including the BIOS, each RAID volume looks like a simple disk. You can boot from any RAID volume, no matter how the RAID is set up. Hardware RAID controllers often use SCSI or SAS, and often have a battery buffer for handling unexpected power outages. They almost always have some RAM used for buffering and caching. Especially in servers, HW RAID controllers may be integrated into SCSI or SAS controllers.
      • A software RAID uses no special hardware, it is just a set of drivers doing RAID math at the OS level, sitting between the low-level disk drivers and the higher-level drivers, pretending to be simple disks for each RAID volume, and using the low-level disk drivers to communicate with the disks. Of course, this needs OS specific drivers, and is usually not bootable, or only from a simple RAID 1 mirror volume. Obviously, the I/O load for the OS is higher, and the host CPU needs to do all RAID math. Software RAID usually works even across different controllers.
      • A fake RAID is often sold as a cheap hardware RAID controller, but it is not. It is a clever combination of a BIOS extension ROM, OS drivers, and almost always a SATA (or IDE) controller with little or no modifications. The BIOS extension ROM provides boot support for all supported RAID levels. As soon as the operating system has loaded the drivers, it takes control from the extension ROM, and works exactly like a software RAID. Drivers are often artificially limited to support RAID arrays only on the controller with the extension ROM.
      • Onboard RAID, especially on consumer mainboards, is usually a fake RAID. Enabling RAID mode just does two things: It slightly changes the PCI ID of the onboard SATA controller, and enables an extension ROM. Changing the PCI ID is needed so that a different driver (the one with Software RAID) is loaded for the same hardware.
      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Time to really finish this story:

        the SSD will be subject to a nice 4 kV burn-in test

        That was spectacularly unspectacular. A few sparks from the 4 kV probe, but no burn marks, no fire, no exploding parts. Our 4 kV supply is just way too limited. It can deliver just a few mA. The next misbehaving SSD will just see plain mains voltage. 230 V with a slow-blow 16 A fuse.

        I decided to order another fake RAID controller, using a relatively cheap SATA controller, but from a manufacturer with a good reputation and a lot of RAID experience.

        That fake RAID controller is really a nice piece of hard- and software. But it is not completely free of problems. It still had trouble when running more than one VirtualBox VM at the same time in the factory default configuration, both on my work machine and on my home machine. So I finally called tech support. The manufacturer insists on phone calls, which is a little bit odd, but it took just one phone call to get rid of my problem. The supporter told me, no, that should not happen, not with my machines, and not with any other. I was using the newest firmware and drivers available, and so I was told to try disabling Native Command Queuing for all SSDs right in the controller's BIOS. The drivers will respect that setting. I also disabled sleep mode, just to be sure. Disabling NCQ costs a little bit of performance, but both machines now work fine. I don't care if disk performance goes down by a few percent, the SSDs are sufficiently fast even without NCQ. If the onboard SATA fake RAID had a way to disable NCQ, I would try to go back to the onboard RAID. It is there, it has power, it has a sufficient number of SATA ports, and it does not need a PCIe slot.

        A little detail: The RAID software does write a log file, to aid debugging. But that does not help if the log file is written to the RAID volume that has problems and needs to be debugged. The supporter proposed the obvious solution: Add a USB flash drive and have the RAID software log to that drive instead. I don't do that, my problem is solved.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        In response to your sig, here is my experience. Always use software RAID. Always. I have never had an issue with Linux software RAID that was not ultimately down to a failure of the drives themselves and thus easily rectifiable. OTOH, hardware RAID (real or fake) has caused no end of problems and given the choice I would never go back there again.


        🦛

Re: [OT] Reminder: SSDs die silently
by stevieb (Canon) on Apr 14, 2023 at 19:59 UTC

    I've had a somewhat related incident happen recently. I have a Plex system, with a 4TB SSD that has about 2.5 TB of library data.

    Every night, this data is rsync'd to a replica system at another one of my properties (with the --delete-after flag set).

    One day, Plex wasn't working. Checked, and the disk had gone away in much the same way yours did. However, the /media/STORAGE directory still existed, but was empty because it didn't have the SSD mounted, so my system blindly just erased everything on the remote backup replica.

    Thankfully, I also have an online storage that uses an rsync push for all of the data I back up (including the Plex libraries), which I don't use the --delete-after flag for, specifically because of these hiccups.

    Took a bit of time to restore nearly 3TB of data from an online source, but thankfully I was able to recover everything (helps to have a 1Gbps fibre connection).

    So just as a precautionary note, even if you have a good backup regimen, it pays to review it to ensure that it isn't set up in a way where, if a disk goes away, it can erase the backup.

    Update: Now my backup script checks to see if a file under the /media/STORAGE directory exists before proceeding. If not, it alerts me.
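A guard of that kind might look like the sketch below. It is not the actual backup script: the sentinel file name `.backup-sentinel`, the function names and the destination are assumptions for illustration. Only `rsync -a --delete-after` and the /media/STORAGE path come from the description above.

```python
import subprocess
import sys
from pathlib import Path


def source_looks_present(root, sentinel):
    """Return True if the sentinel file exists under the source directory.

    The sentinel lives on the SSD itself, so it is missing whenever the
    disk is not mounted and only the empty mount point is left.
    """
    return (Path(root) / sentinel).is_file()


def run_backup(source, dest, sentinel=".backup-sentinel", runner=subprocess.run):
    """Mirror source to dest, but refuse to run if the source looks empty.

    Returns True if rsync was started, False if the backup was skipped.
    """
    if not source_looks_present(source, sentinel):
        # Without this check, --delete-after would wipe the replica.
        print(f"ALERT: {sentinel} not found under {source}; skipping backup",
              file=sys.stderr)
        return False
    # The trailing slash makes rsync copy the directory's contents.
    src = str(source).rstrip("/") + "/"
    runner(["rsync", "-a", "--delete-after", src, str(dest)], check=True)
    return True
```

A nightly job would call something like `run_backup("/media/STORAGE", "replica:/media/STORAGE")`, where the destination is a placeholder, and send an alert whenever it returns `False`.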
