Curve fitting for rsync snapshot deletion

by mhearse (Chaplain)
on Nov 21, 2013 at 16:43 UTC

mhearse has asked for the wisdom of the Perl Monks concerning the following question:

I use rsync snapshots for backups on my workstation, every 15 minutes during the workday. Eventually the partition where I'm keeping my snapshots will fill up. Rather than just blindly deleting the oldest snapshots... I thought a better solution might be the following:

1. Convert the snapshot dir's datetime name to epoch
2. In the trim script, apply the epochs to a parabolic function: y = x^2
3. Delete snapshot dirs which don't conform to the parabolic function... until the partition is back under ~70% full

The majority of this script is simple enough... but I'm struggling with the parabolic function part. Can someone offer help or advice?
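For reference, here is a minimal sketch of step 1, the datetime-to-epoch conversion. The directory layout and name pattern (/backup/snapshots, "2013-11-21_16-43-00") are illustrative assumptions, not from the original post:

#!perl -w
use v5.16;
use Time::Piece;

# Hypothetical layout: snapshot dirs under /backup/snapshots, named
# like "2013-11-21_16-43-00". Adjust the path and pattern to taste.
my $snap_root = '/backup/snapshots';

for my $dir (glob "$snap_root/*") {
    my ($name) = $dir =~ m{([^/]+)\z} or next;
    # strptime parses the name (as UTC) into a Time::Piece object.
    my $t = eval { Time::Piece->strptime($name, '%Y-%m-%d_%H-%M-%S') }
        or next;                              # skip anything that isn't a snapshot
    printf "%s => %d\n", $name, $t->epoch;    # epoch to feed the curve test
}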

Replies are listed 'Best First'.
Re: Curve fitting for rsync snapshot deletion
by atcroft (Abbot) on Nov 21, 2013 at 18:05 UTC

    I was trying to visualize what you are intending. Are you taking '2013-11-21 17:50:42 +0000', converting it to epoch (1385056242), then attempting to find if it is a square number, and deleting if not? And you are doing the backups on 15m intervals, so you are looking for sqrt(15*$m + $d) == int(sqrt(15*$m + $d))? For the time given above, the previous and two following times that would meet the criteria are:

    Thu Nov 21 10:44:16 2013 (37216)
    Fri Nov 22 07:24:49 2013 (37217)
    Sat Nov 23 04:05:24 2013 (37218)
    The only ways I could see that working are either:
    1. you start at the oldest, deleting until you reach your threshold or a lower limit of versions to keep, or
    2. you look to see if a backup falls into some kind of interval that agrees with your idea (for instance, you keep the backup closest to each matching time).

    Personally, if you are not already using it, I would probably look at using the --link-dest=DIR option, where you give each backup run the directory of the previous one as the parameter for this option. If the files match, they are hard-linked, so (on *nix systems, at least) they only add a directory entry in the new directory.
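    As a minimal sketch, a --link-dest run might look like the following (the paths and the 'latest' symlink are assumptions for illustration, not from the thread):

    #!perl -w
    use v5.16;
    use POSIX qw(strftime);

    # Hypothetical layout: /backup/snapshots/<timestamp>, with 'latest'
    # pointing at the most recent completed snapshot.
    my $root = '/backup/snapshots';
    my $new  = "$root/" . strftime('%Y-%m-%d_%H-%M-%S', localtime);

    my @cmd = ('rsync', '-a', '--delete',
               "--link-dest=$root/latest",    # hard-link unchanged files
               '/home/me/', "$new/");
    system(@cmd) == 0 or die "rsync failed: $?";

    # Re-point 'latest' at the snapshot we just made.
    unlink "$root/latest";
    symlink $new, "$root/latest" or die "symlink: $!";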

    Hope that helps.
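    For illustration, a minimal sketch of that perfect-square test (it ignores the 15m grid and just scans forward for the next square epochs, starting from the example time above):

    #!perl -w
    use v5.16;

    # Smallest square epoch at or after a given time: an epoch "conforms"
    # to the criterion above iff int(sqrt($e))**2 == $e.
    sub next_square_epoch {
        my ($epoch) = @_;
        my $r = int sqrt $epoch;
        $r++ if $r * $r < $epoch;
        return $r * $r;
    }

    my $t = 1385056242;    # 2013-11-21 17:50:42 +0000, from above
    for (1 .. 3) {
        $t = next_square_epoch($t);
        printf "%s (%d)\n", scalar gmtime $t, sqrt $t;
        $t++;              # step past this hit to find the next one
    }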

      Thanks for your post. Yes, I am invoking rsync with --link-dest=DIR. I want to avoid deleting from the oldest... and instead keep a parabolic representation of time (snapshots).
Re: Curve fitting for rsync snapshot deletion
by Grimy (Pilgrim) on Nov 21, 2013 at 19:21 UTC

    Using rsync for backups is kinda outdated. You should probably use a versioning tool. If you have never used a VCS, learning one won’t be a waste of time—it’s an increasingly important skill for programmers.

    I think that git is the best fit for your particular use case: each commit is a full snapshot of your working directory (most other VCSes store commits as a chain of deltas, which makes it impossible to delete a particular commit). It also uses a compression scheme optimized for storing successive snapshots, making it more space-efficient than rsync.

    With that said, if you’re dead set on using rsync, the parabolic function you suggest can be implemented like this (I’m only demonstrating the algorithm here, not the rsync stuff):

    #!perl -w
    use v5.16;
    use List::MoreUtils qw(uniq);

    my @snapshots;
    my $capacity = 100.5;    # how many snapshots fit on disk
    my $ratio    = .70;      # fraction left after each purge

    # Indices to keep (0 = oldest): quadratically sparser the further
    # back in time they are. Fractional indices are truncated by Perl.
    my @keep_me = reverse uniq
        map { $capacity - int $_**2 / $ratio**2 / $capacity }
        1 .. $ratio * $capacity;

    for (1 .. 1000) {
        push @snapshots, $_;
        if (@snapshots > $capacity) {
            @snapshots = @snapshots[@keep_me];
        }
    }
    say "@snapshots";
    This example assumes that you do a total of 1000 snapshots, but only have enough disk space to store 100. It lists the snapshots that are kept: 760 803 825 834 (…) 998 999 1000. Rather than blindly keeping the 100 latest snapshots (901..1000), it keeps snapshots from much further back, getting increasingly sparse the further back in time you go. You could also try functions other than y=x^2; I’d suggest an exponential.
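    For instance, here is a sketch of the same harness with the parabola swapped for an exponential; the $growth factor and the 80-step cap are illustrative assumptions. Kept snapshots end up spaced roughly 1.1x further apart for each step back in time:

    #!perl -w
    use v5.16;
    use List::MoreUtils qw(uniq);

    my @snapshots;
    my $capacity = 100;
    my $growth   = 1.1;    # spacing multiplier per step back in time

    for (1 .. 1000) {
        push @snapshots, $_;
        if (@snapshots > $capacity) {
            my $last = $#snapshots;
            # Ages to keep: 0 (the newest) plus exponentially growing offsets.
            my @ages = (0, map { int $growth**$_ } 1 .. 80);
            # Turn ages into ascending indices, dropping any that fall
            # off the front of the array.
            my @keep_me = reverse uniq grep { $_ >= 0 }
                          map { $last - $_ } @ages;
            @snapshots = @snapshots[@keep_me];
        }
    }
    say "@snapshots";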

      Note that git really wants to use memory mapped files when committing/restoring files. This means git has problems with large files, at least on 32-bit systems. For example, I could not store video files with a size of 200MB or so in a git repository.

      Also, git cannot purge older backups or create "holes" in the history. You cannot age out old or intermediate backups. git wants to keep the full history.

      Other than that, git has at least the user interface part of storing and restoring things done.

        Thanks for your post. This is definitely applicable to me, as most of my machines are 32-bit clunkers... barring my sparc64 boxes, but those have only 256MB of memory!
      Thanks for your post. I agree about git. I guess I could check in new files... and revise existing ones as modifications are made. And best of all I could search them insanely fast via: git grep
Re: Curve fitting for rsync snapshot deletion
by atcroft (Abbot) on Nov 21, 2013 at 19:59 UTC

    Rereading your post, I will ask this (for completeness): have you looked over what is being backed up, to be sure you are not backing up unnecessary files (cache/temp files, installers/ISOs/other large files you have elsewhere, etc.)? Also, does your backup include things that change frequently (for example, database files or Outlook .ost/.pst files), where small updates cause the entire file to be archived multiple times?

    Just a thought.

      This is a good point... Basically I'm klutz-proofing my work machine, protecting against accidental deletions. So I'm blindly backing up everything and relying on the deduplicating power of rsync snapshots.
Re: Curve fitting for rsync snapshot deletion
by Voronich (Hermit) on Nov 21, 2013 at 17:57 UTC

    Color me stupid, but... what is x?

    I'm having a real hard time parsing this in my head.

      Hi Voronich,

      In the OP, given

      1. Convert the snapshot dir's datetime name to epoch
      2. In the trim script, apply the epochs to a parabolic function: y = x^2
      x is the epoch equivalent of the directory's timestamp, and y is the value on which the decision to keep the snapshot (or not) is made - as determined by applying the function to the epoch.

      Is that of any help?

      A user level that continues to overstate my experience :-))
Re: Curve fitting for rsync snapshot deletion
by DrHyde (Prior) on Nov 22, 2013 at 12:01 UTC
    Rather than a homebrew system, I recommend that you use rsnapshot.
      I have been using BackupPC for years. It is written in Perl, based on the rsync protocol, and has sophisticated scheduling, file-pooling, and expiry systems to minimize disk usage. It has a web UI to view the contents of backups. I back up eight Linux and Windows systems each night to a BackupPC volume on a 1TB hard disk, then periodically snapshot the volume, copy the snapshot to an eSATA hard disk (the copy takes about 90 minutes), and store that off site. It has worked well for me.
Re: Curve fitting for rsync snapshot deletion
by Preceptor (Deacon) on Nov 23, 2013 at 12:51 UTC

    If your objective is to have degrading resolution of snaps as time passes, might I suggest instead running 3 'sets' of snaps: a '15m' schedule, a 'daily' schedule, and a 'weekly' schedule.

    And then automate the deletions in each with e.g. 'find': when you're >95% full, work through each set, purging the '15m' backups more aggressively than the dailies/weeklies.
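    A minimal sketch of that purge loop, assuming a hypothetical /backup/{15m,daily,weekly}/<timestamp> layout (using plain Perl here instead of 'find'):

    #!perl -w
    use v5.16;

    my $root      = '/backup';
    my $threshold = 95;    # percent full

    # Percent-used for the backup filesystem, scraped from df -P output.
    sub pct_used {
        my ($pct) = `df -P $root` =~ /(\d+)%/;
        return $pct;
    }

    # Purge oldest-first, finest-grained tier first, keeping at least
    # one snapshot per tier.
    for my $tier (qw(15m daily weekly)) {
        my @snaps = sort glob "$root/$tier/*";   # timestamp names sort oldest-first
        while (@snaps > 1 && pct_used() > $threshold) {
            my $victim = shift @snaps;
            system('rm', '-rf', $victim) == 0
                or warn "rm $victim failed: $?";
        }
    }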
