http://www.perlmonks.org?node_id=1049965

Lady_Aleena has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone.

I wrote a script which backs up certain files and folders which get modified often. Some of the files or files within folders do not get modified as often as I back them up. I was wondering if checking the last modified time and ignoring those which have the same last modified time, would be faster than just overwriting.

Original script which overwrites

#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Mirror qw(mirror);
use Try::Tiny;

local $\ = "\n";
if (-e "J:/") {
  while (<DATA>) {
    my ($source,$destination) = split(/\|/,$_);
    chomp($source,$destination);
    print "Copying $source to $destination";
    # I guess at one time the copy failed and made the script die, so I added
    # Try::Tiny to keep the script from dying.
    try {
      if (-f $source) {
        copy($source,$destination);
      }
      elsif (-d $source) {
        mirror($source,$destination);
      }
      else {
        print "What did you do wrong?";
      }
    }
    catch {
      print "Couldn't copy $source to $destination";
    };
  }
}
else {
  print "Insert the thumb drive!";
}
__DATA__
C:/Documents and Settings/me/My Documents/home/checkbook2.xls|J:/My Documents/home/checkbook2.xls
C:/Documents and Settings/me/Local Settings/Application Data/Microsoft/Outlook/Outlook.pst|J:/application data/Outlook/Outlook.pst
C:/Documents and Settings/me/My Documents/gaming|J:/My Documents/gaming
C:/Documents and Settings/me/My Documents/fantasy|J:/My Documents/fantasy
C:/Documents and Settings/me/Application Data/Notepad++/stylers.xml|J:/application data/Notepad++/stylers.xml
C:/Documents and Settings/me/Application Data/HexChat|J:/application data/HexChat

Script which checks last modified time (untested)

#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Mirror qw(mirror recursive);
use Try::Tiny;

# return the last modified time
sub get_mod_time {
  my $file = shift;
  my @stats = stat($file);
  my $mod = $stats[9];
  return $mod;
}

# compare the two files' last modified times and return false or true
sub same_mod_time {
  my ($src_file,$dst_file) = @_;
  my $same_mod = get_mod_time($src_file) != get_mod_time($dst_file) ? 0 : 1;
  return $same_mod;
}

local $\ = "\n";
if (-e "J:/") {
  while (<DATA>) {
    my ($source,$destination) = split(/\|/,$_);
    chomp($source,$destination);
    print "Copying $source to $destination";
    # I guess at one time the copy failed and made the script die, so I added
    # Try::Tiny to keep the script from dying.
    try {
      if (-f $source) {
        copy($source,$destination) if same_mod_time($source,$destination) == 0;
      }
      elsif (-d $source) {
        # I had to change "mirror" to "recursive" to check for the same mod time.
        recursive { copy($_[0],$_[1]) if same_mod_time($_[0],$_[1]) == 0; } $source, $destination;
      }
      else {
        print "What did you do wrong?";
      }
    }
    catch {
      print "Couldn't copy $source to $destination";
    };
  }
}
else {
  print "Insert the thumb drive!";
}
__DATA__
C:/Documents and Settings/me/My Documents/home/checkbook2.xls|J:/My Documents/home/checkbook2.xls
C:/Documents and Settings/me/Local Settings/Application Data/Microsoft/Outlook/Outlook.pst|J:/application data/Outlook/Outlook.pst
C:/Documents and Settings/me/My Documents/gaming|J:/My Documents/gaming
C:/Documents and Settings/me/My Documents/fantasy|J:/My Documents/fantasy
C:/Documents and Settings/me/Application Data/Notepad++/stylers.xml|J:/application data/Notepad++/stylers.xml
C:/Documents and Settings/me/Application Data/HexChat|J:/application data/HexChat

So my question is, which is the better way to back up my files? If you have any other suggestions, I would like to know.

(I still need to figure out how to clean out files I deleted in the source which are still in my backups.)

Have a cookie and a very nice day!
Lady Aleena

Replies are listed 'Best First'.
Re: Will checking last modified date take more time than just overwriting?
by kcott (Archbishop) on Aug 19, 2013 at 05:56 UTC

    G'day Lady Aleena,

    The answer is going to depend on the size of your files and how many of them have changed; however, unless you have masses and masses of data to back up and only a very, very tiny amount hasn't changed, I would be extremely surprised if the time taken by calls to stat was in any way a limiting factor compared to the time taken to move data between drives (C: to J:). So, while I don't have details on your data, either in terms of size or modifications, I would generally expect an incremental backup to take less time than a full backup.

    You could speed up your script a bit by removing the get_mod_time() subroutine and rewriting same_mod_time() as:

    sub same_mod_time { (stat($_[0]))[9] == (stat($_[1]))[9] }

    However, the backup process is I/O bound and I doubt that would really have any noticeable effect.
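
    An editorial aside (this sketch is not from kcott's post): the one-line same_mod_time assumes both files already exist. If the destination hasn't been copied yet, stat returns an empty list and the comparison warns about undefined values. A slightly defensive variant, shown here under that assumption, treats a missing destination as "not the same" so the copy still happens:

```perl
use strict;
use warnings;

# Defensive variant of kcott's one-liner (my own sketch): a destination
# that does not exist yet cannot have the same mtime, so return false
# and let the caller copy it, instead of comparing against undef.
sub same_mod_time {
    my ($src, $dst) = @_;
    return 0 unless -e $dst;                    # new destination => must copy
    return( (stat $src)[9] == (stat $dst)[9] ); # mtime is stat field 9
}
```

    Used exactly like the original: `copy($source, $destination) unless same_mod_time($source, $destination);`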

    If you're interested, here's my test for that code:

    -- Ken

      This might be a dumb question, but does your same_mod_time return 0 or 1 without the return?

      Data details:

      • checkbook.xls changes 2 or 3 times a month
      • Outlook.pst changes daily
      • gaming has 21 files in 5 folders, and I haven't touched it for a while
      • fantasy has 2252 files in 520 folders (this folder is huge!)
      • stylers.xml hasn't changed in like forever
      • hexchat has 82 files in 6 folders

      Thanks for stopping by.

      Have a cookie and a very nice day!
      Lady Aleena
        "This might be a dumb question, but does your same_mod_time return 0 or 1 without the return?"

        It returns TRUE or FALSE, i.e. whatever (stat($_[0]))[9] == (stat($_[1]))[9] evaluates to. In string context, that would be "1" or ""; in numeric context, that would be 1 or 0.

        Modifying the test code I posted previously to demonstrate this:

        $ > xxx 2> yyy
        $ perl -Mstrict -Mwarnings -E '
            sub same_mod_time { (stat($_[0]))[9] == (stat($_[1]))[9] }
            say ">>>" . same_mod_time(qw{xxx yyy}) . "<<<";
        '
        >>>1<<<
        $ perl -Mstrict -Mwarnings -E '
            sub same_mod_time { (stat($_[0]))[9] == (stat($_[1]))[9] }
            say 0 + same_mod_time(qw{xxx yyy});
        '
        1
        $ > xxx
        $ perl -Mstrict -Mwarnings -E '
            sub same_mod_time { (stat($_[0]))[9] == (stat($_[1]))[9] }
            say ">>>" . same_mod_time(qw{xxx yyy}) . "<<<";
        '
        >>><<<
        $ perl -Mstrict -Mwarnings -E '
            sub same_mod_time { (stat($_[0]))[9] == (stat($_[1]))[9] }
            say 0 + same_mod_time(qw{xxx yyy});
        '
        0

        The presence or absence of the return keyword, in that subroutine, is immaterial. Here's what the doco says:

        "return EXPR
        ...
        Returns from a subroutine, eval, or do FILE with the value given in EXPR.
        ...
        In the absence of an explicit return, a subroutine, eval, or do FILE automatically returns the value of the last expression evaluated. ..."

        -- Ken

Re: Will checking last modified date take more time than just overwriting?
by stefbv (Curate) on Aug 19, 2013 at 05:56 UTC

    The best answer you can get is by adding benchmarking code to both versions.

    But I think that the time spent checking the modified time of the files should be more than made up for by the time gained from skipping some of them. Copying should be more expensive than checking the modified time.
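
    A minimal sketch of the kind of timing being suggested, using the core Time::HiRes module (the time_it name and the stand-in code refs are illustrative, not from stefbv's post):

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run a code ref, report and return its wall-clock time in seconds.
# Wrap each backup strategy in this and compare the two numbers.
sub time_it {
    my ($label, $code) = @_;
    my $t0 = [gettimeofday];
    $code->();
    my $elapsed = tv_interval($t0);
    printf "%s took %.3f seconds\n", $label, $elapsed;
    return $elapsed;
}
```

    For example: `time_it('full overwrite', \&full_backup);` and `time_it('mtime check', \&incremental_backup);` where the two subs are the bodies of the two scripts above.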

    Other options, yes: rsync.

    Thanks for the cookie, a very nice day to you! :-)

    Stefan

Re: Will checking last modified date take more time than just overwriting?
by sundialsvc4 (Abbot) on Aug 19, 2013 at 12:57 UTC

    Emphasizing stefbv&rsquo;s admonition to check out the rsync command, which is specifically designed to synchronize two directories and to do it as rapidly, and therefore as cleverly, as possible. Perhaps you can avoid your present strategy?

Re: Will checking last modified date take more time than just overwriting?
by Laurent_R (Canon) on Aug 19, 2013 at 20:24 UTC

    Hi Milady

    In general terms, I would think that checking the last modification date will almost always be better than simply overwriting files, since checking the last modification time is a very fast system call. That said, it really depends on the number of files, the file sizes, the modification frequency, and factors that are more complicated to take into account (for example: are the large files more likely to be modified than the small ones, or is it the opposite? etc.)

    Another point: if I understood your code correctly (I only read it very quickly), you are comparing each file's last modification time with the same attribute on the backed-up copy. You could instead keep a small database (a tied hash, whatever) so that you don't need to check the backup's last modification time at all. In fact, it can be even simpler: you only need to remember the last time you ran a backup. Any file modified more recently needs to be backed up; any older file does not.
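
    A minimal sketch of that last idea, assuming a hypothetical stamp file whose own mtime records the last successful run (the names last_backup.stamp, needs_backup, and record_backup are my inventions, not Laurent_R's):

```perl
use strict;
use warnings;

# Return the epoch time of the last backup run, taken from the stamp
# file's own mtime; 0 if there is no stamp yet, so everything is copied.
sub last_backup_time {
    my ($stamp) = @_;
    return -e $stamp ? (stat $stamp)[9] : 0;
}

# A source file needs copying only if it was modified after the last run.
sub needs_backup {
    my ($source, $last_run) = @_;
    return (stat $source)[9] > $last_run;
}

# Touch the stamp file after a successful run; its new mtime becomes
# the reference point for the next run.
sub record_backup {
    my ($stamp) = @_;
    open my $fh, '>', $stamp or die "Cannot write $stamp: $!";
    close $fh;
}
```

    The main loop would then read: `copy($source, $destination) if needs_backup($source, $last_run);` with `my $last_run = last_backup_time('last_backup.stamp');` computed once before the loop and `record_backup('last_backup.stamp');` called once after it.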