PerlMonks  

Re^3: Have you ever lost your work? (disaster recovery)

by afoken (Chancellor)
on Jan 09, 2024 at 22:53 UTC ([id://11156824])


in reply to Re^2: Have you ever lost your work? (disaster recovery)
in thread Have you ever lost your work?

If you haven't tested your backup/restore procedure, then it's a little like Schroedinger's cat: you don't know whether you have a backup until you actually do a successful restore.

Hey, it's war story time again! ;-)

In the last few days of my final year at university, a small student-managed server in my favorite lab lost a lot of data. I don't remember the exact details; I think an entire hard disk died. The server was an old tower PC, built around something like a Pentium II, with no redundancy at all: all consumer parts, no server parts, filled with old hard disks, and a big fan tied to the front of the case with old wires. I guess all of its parts had been picked out of the dumpster. It ran Linux, probably an early version of Debian, and it had a SCSI tape streamer. Actually, two streamers: one online, one "offline" in the spare parts bin.

Someone had set up a cron job that used tar to write a backup to tape. Great idea, that's what tar was designed for. One of the students must have swapped the tapes each morning. Larger disks were added, and one day the tape was full. Backup failed. Some "clever" guy must have found tar's -z option, which compresses the data using gzip, and added that option to the cron job. Backup worked again, and the tapes had some room to spare again. Nobody verified or tested the backup.

Then, data was lost. Restoring the backups failed. The tapes were worn out and had several read errors, and the streamers were dirty as hell. tar can handle tapes with errors. It uses fixed-size blocks, and if a block is not readable, it can at least find the next file on the tape and continue from there. That way, you won't get all of your data back, but probably a lot of it. Remember the -z option? The cron job wrote a gzip-compressed byte stream to the tapes. No more fixed blocks, and gzip absolutely does not like I/O errors while decompressing a stream: there are no block boundaries to resynchronize at, so it cannot just skip the damaged part. All of tar's tape-handling advantages were lost.
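The difference can be shown with a toy simulation using Python's standard tarfile and gzip modules. This is only a sketch, not a tape tool: the "bad spot" is simulated by inverting bytes in memory, and the file names and sizes are invented for the example. Opening the archive with ignore_zeros=True makes tarfile skip 512-byte blocks that don't parse as valid headers instead of stopping, which is roughly what tar does when it skips ahead to the next header. The uncompressed archive loses only the file whose header was hit; the gzip-compressed copy of the same archive is rejected as a whole once a few bytes in its middle are damaged.

```python
from __future__ import annotations

import gzip
import io
import tarfile
import zlib

BLOCK = 512  # tar works in 512-byte blocks


def build_tar(files: dict[str, bytes]) -> bytes:
    """Write an uncompressed ustar archive into memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def garble(data: bytes, offset: int, length: int) -> bytes:
    """Simulate a damaged spot on tape by inverting `length` bytes at `offset`."""
    bad = bytes(b ^ 0xFF for b in data[offset:offset + length])
    return data[:offset] + bad + data[offset + length:]


def salvage_tar(data: bytes) -> dict[str, bytes]:
    """Extract every regular file whose header is still readable.

    ignore_zeros=True makes tarfile skip zeroed or invalid 512-byte blocks
    instead of treating them as the end of the archive, so it picks up
    again at the next intact header.
    """
    recovered = {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:", ignore_zeros=True) as tf:
        for member in tf:
            if member.isfile():
                recovered[member.name] = tf.extractfile(member).read()
    return recovered


def salvage_tar_gz(data: bytes) -> dict[str, bytes] | None:
    """Same, but the archive is one gzip stream; None if it cannot be decompressed."""
    try:
        raw = gzip.decompress(data)
    except (OSError, EOFError, zlib.error):  # gzip.BadGzipFile is an OSError
        return None
    return salvage_tar(raw)


if __name__ == "__main__":
    files = {"a.txt": b"A" * 3000, "b.txt": b"B" * 3000, "c.txt": b"C" * 3000}
    plain = build_tar(files)

    # Damage the 512-byte header block of b.txt in the uncompressed archive.
    with tarfile.open(fileobj=io.BytesIO(plain), mode="r:") as tf:
        header = tf.getmember("b.txt").offset
    print(sorted(salvage_tar(garble(plain, header, BLOCK))))  # ['a.txt', 'c.txt']

    # Damage just 16 bytes in the middle of the gzip-compressed archive.
    packed = gzip.compress(plain, mtime=0)
    print(salvage_tar_gz(garble(packed, len(packed) // 2, 16)))  # None
```

The gzip case only shows that the stream as a whole is rejected; salvaging the part that decompresses before the damage, as was done by hand in the story, takes extra work.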

In the end, I had a lot of free time that day, so I could help recover data from the tape. We found another large, empty hard disk, and used something like dd if=/dev/tape conv=noerror of=/mnt/tmpdisk/backup.tar.gz to get a damaged, but readable, compressed tape archive. It could be decompressed, at least partially, and tar was then able to extract a lot of files. Swapping the streamers allowed us to read some more data from the current tape. The other tape could also be read partially, and a few more, but older, files were recovered from it. I left sorting out old and new, damaged and intact files, and copying them back to the replacement disk, to the admin, and told him to fix some things:

  • get rid of the -z flag to tar in the cron job, NOW
  • get new tapes, preferably longer tapes
  • discard the old, worn-out tapes
  • get a cleaning tape
  • clean up both streamers
  • verify the archive on tape after backup
  • preferably, get another junk PC, connect the second streamer to that PC, and use that PC to actually test data recovery
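The "verify" item can be as simple as reading the whole archive back and comparing it with the files it was made from. GNU tar offers --verify (-W) and --compare (-d) for this. The Python sketch below does a similar check with the standard tarfile and hashlib modules; it assumes an uncompressed archive in a regular file rather than on a tape device, and write_backup and verify_archive are made-up names for illustration.

```python
from __future__ import annotations

import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(stream, chunk_size: int = 1 << 16) -> str:
    """Hash a binary stream in chunks so large files need little memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def write_backup(source_root: Path, archive: Path) -> None:
    """Hypothetical backup step: plain tar, no compression (no -z)."""
    with tarfile.open(archive, mode="w") as tf:
        for path in sorted(source_root.rglob("*")):
            if path.is_file():
                tf.add(path, arcname=path.relative_to(source_root).as_posix())


def verify_archive(archive: Path, source_root: Path) -> list[str]:
    """Read the archive back in full and compare it with the source tree.

    Returns a list of problems; an empty list means every regular file under
    source_root is in the archive with identical content.
    """
    expected = {
        p.relative_to(source_root).as_posix(): p
        for p in source_root.rglob("*")
        if p.is_file()
    }
    problems: list[str] = []
    seen: set[str] = set()
    try:
        with tarfile.open(archive, mode="r:") as tf:
            for member in tf:
                if not member.isfile():
                    continue
                seen.add(member.name)
                source = expected.get(member.name)
                if source is None:
                    continue  # deleted from disk since the backup: not an archive error
                with tf.extractfile(member) as archived, source.open("rb") as on_disk:
                    if sha256_of(archived) != sha256_of(on_disk):
                        problems.append(f"content differs: {member.name}")
    except (tarfile.TarError, OSError) as exc:
        # Archive (or a source file) could not be read completely.
        return problems + [f"read error: {exc}"]
    problems.extend(f"missing from archive: {name}" for name in sorted(expected.keys() - seen))
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp, "data")
        (root / "sub").mkdir(parents=True)
        (root / "a.txt").write_bytes(b"alpha")
        (root / "sub" / "b.txt").write_bytes(b"beta")
        archive = Path(tmp, "backup.tar")
        write_backup(root, archive)
        print(verify_archive(archive, root))  # []
        (root / "a.txt").write_bytes(b"changed")
        print(verify_archive(archive, root))  # ['content differs: a.txt']
```

This only proves that the archive can be read back on the machine that wrote it. Restoring on a separate PC with the second streamer, as in the last item, tests the whole chain, including the hardware.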

In the end, a lot of data was recovered: some from the tapes, some from student PCs in the lab, some from old disks in the junk bin. But a lot was lost.
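The rescue copy made with dd if=/dev/tape conv=noerror boils down to reading fixed-size blocks and carrying on when a read fails. Here is a minimal Python sketch of that idea, assuming a seekable input such as a disk image; a real tape drive reads sequentially, so this is an analogy rather than a tape tool. Unlike the command in the story, it writes a zero-filled block in place of each unreadable one, which is what adding sync (conv=noerror,sync) does in GNU dd, so later data keeps its original offset. rescue_copy, FlakyDevice and the give-up threshold are hypothetical.

```python
from __future__ import annotations

import io


def rescue_copy(src, dst, block_size: int = 512, max_bad_in_a_row: int = 16) -> list[int]:
    """Copy a seekable source to dst block by block, carrying on after read errors.

    Each unreadable block is replaced by block_size zero bytes (like dd's
    conv=noerror,sync), so everything after it keeps its original offset.
    A short final block is copied as-is (dd's sync would pad it).
    Returns the indices of the blocks that could not be read.
    """
    bad_blocks: list[int] = []
    streak = 0  # consecutive failed reads
    index = 0
    while True:
        # Position explicitly; a failed read may leave the offset undefined.
        src.seek(index * block_size)
        try:
            chunk = src.read(block_size)
        except OSError:
            streak += 1
            if streak > max_bad_in_a_row:
                break  # assumption: this many failures in a row means give up
            chunk = bytes(block_size)  # zero-filled placeholder
            bad_blocks.append(index)
        else:
            streak = 0
            if not chunk:  # end of input
                break
        dst.write(chunk)
        index += 1
    return bad_blocks


class FlakyDevice(io.BytesIO):
    """Hypothetical stand-in for a worn tape: reads starting in listed blocks fail."""

    def __init__(self, data: bytes, bad: set[int], block_size: int = 512):
        super().__init__(data)
        self.bad = bad
        self.block_size = block_size

    def read(self, size: int | None = -1) -> bytes:
        if self.tell() // self.block_size in self.bad:
            raise OSError(5, "Input/output error")
        return super().read(size)


if __name__ == "__main__":
    image = io.BytesIO()
    device = FlakyDevice(b"A" * 512 + b"B" * 512 + b"C" * 100, bad={1})
    print(rescue_copy(device, image))  # [1]
    print(len(image.getvalue()))       # 1124
```

Keeping the offsets intact matters for an uncompressed tar archive, which can then resynchronize at the next intact 512-byte header.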

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^4: Have you ever lost your work? (disaster recovery)
by eyepopslikeamosquito (Archbishop) on Jan 09, 2024 at 23:46 UTC

      Yeah, "do as I say, don't do as I do." I've finished the SSD saga; there were still a few loose ends.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
