http://www.perlmonks.org?node_id=11128734

I write science-fiction for teens, and sometimes have to save documents into a non-text format. Not really an issue, but I like to make backups, and backups of my backups, because I've lost work before. I thought I had a slight problem, maybe a dozen or so files with five or six copies. No worries! I could save them as text, put them into version control, and then get rid of the old files. Life is good.

The first pass at the code said I had 21 files and 167 copies. Not good, but doable. Then I looked at the directory list and saw that a lot of directories that should have been searched were missing. Going back to the code, I realized it never recursed into a directory to see whether there were more directories to search. Fixed that. Proud of myself, I ran the code again:

With 161 unique files, there are 7621 copies.

Oh boy...maybe I have a problem...

The code below is a work in progress. I'm setting it up so that I can exclude the version-controlled originals. Future steps include removing duplicate files and empty directories. Feedback is welcome! Code Repo

#!/usr/bin/env perl

# name:    find_copies.pl
# version: 0.0.2
# date:    20210223
# desc:    Tool to help clean up old versions of files.

## TODO

use strict;
use warnings;
use File::Basename;
use Data::Dumper;
use Getopt::Long;

my %known_dirs;
my %known_files;
my @dir_queue;

# Used to see how many copies we have.
my $total_file_count  = 0;
my $actual_file_count = 0;

my $log = 0;    # set to 1 with the --log option.
my $exclude_list_file;
my %exclude_list;
my $exclude_dir_file;
my %exclude_dir;
my $seed_file;  # This is just a seed file, the results of 'locate <filename>'.

sub usage {
    print "Usage: find_copies.pl --file <path/seed_file> "
        . "[ --log | --exclude_files <file of files to not deal with> "
        . "| --exclude_dirs <file of dirs to not search> ]\n";
    exit;
}

# Pretty sure these could be made into one method.
sub build_exclude_dir {
    open my $exclude_dirs, '<', $exclude_dir_file
        or die "Can't open $exclude_dir_file: $!";
    for my $dir (<$exclude_dirs>) {
        chomp $dir;
        if ( -d $dir ) {
            $exclude_dir{$dir} = 1;
        }
    }
    close $exclude_dirs;
}

sub build_exclude_list {
    open my $exclude_files, '<', $exclude_list_file
        or die "Can't open $exclude_list_file: $!";
    for my $file (<$exclude_files>) {
        chomp $file;
        $exclude_list{$file} = 1;
    }
    close $exclude_files;
}

# Walk the directory queue, pushing newly found sub-directories onto the
# queue and recording each file by name and size.
sub build_file_list {
    foreach my $search_dir (@dir_queue) {
        opendir( my $dir, $search_dir ) or die "Can't open $search_dir: $!";
        foreach my $file ( readdir($dir) ) {
            next if $file =~ m/^\.\.?$/;
            if ( -d "$search_dir/$file" ) {
                next if ( defined( $exclude_dir{"$search_dir/$file"} ) );
                $known_dirs{"$search_dir/$file"} = 1;
                push( @dir_queue, "$search_dir/$file" );
            }
            else {
                next if ( defined( $exclude_list{$file} ) );
                $total_file_count++;
                my $size = -s "$search_dir/$file";
                $known_files{$file}{$size} = 1;
            }
        }
        closedir($dir);
    }
}

# Report unique vs. duplicated files, plus the exclude and search lists.
sub show_log {
    print Dumper(%known_files);
    $actual_file_count = scalar( keys(%known_files) );
    print "With $actual_file_count unique files, there are $total_file_count copies.\n";
    my @single_version_files;
    my @multiple_version_files;
    foreach my $file ( keys(%known_files) ) {
        my @values = keys( %{ $known_files{$file} } );
        if ( scalar(@values) > 1 ) {
            push @multiple_version_files, $file;
        }
        else {
            push @single_version_files, $file;
        }
    }
    @multiple_version_files = sort(@multiple_version_files);
    @single_version_files   = sort(@single_version_files);
    if ( scalar(@multiple_version_files) ) {
        print "Files with multiple versions:\n";
        foreach my $f (@multiple_version_files) {
            print "\t $f \n";
        }
    }
    if ( scalar(@single_version_files) ) {
        print "Files with a single version:\n";
        foreach my $f (@single_version_files) {
            print "\t $f \n";
        }
    }
    if ( keys(%exclude_dir) ) {
        print "excluded directories:\n";
        foreach my $dir ( keys(%exclude_dir) ) {
            print "\t $dir\n";
        }
    }
    if ( keys(%exclude_list) ) {
        print "excluded files:\n";
        foreach my $file ( keys(%exclude_list) ) {
            print "\t $file\n";
        }
    }
    if ( keys(%known_dirs) ) {
        print "directory search list:\n";
        foreach my $dir ( keys(%known_dirs) ) {
            print "\t $dir\n";
        }
    }
}

GetOptions(
    "--log"             => \$log,
    "--file=s"          => \$seed_file,
    "--exclude_files=s" => \$exclude_list_file,
    "--exclude_dirs=s"  => \$exclude_dir_file,
);

usage() unless ( defined($seed_file) );

open my $seed_data_file, '<', $seed_file or die "Can't open $seed_file: $!";

build_exclude_list() if $exclude_list_file;
build_exclude_dir()  if $exclude_dir_file;

# Build the list of directories to search.
for my $line (<$seed_data_file>) {
    chomp $line;
    my $dirname = dirname($line);
    $known_dirs{$dirname} = 1 unless defined( $exclude_dir{$dirname} );
    push( @dir_queue, $dirname );
}
close $seed_data_file;

build_file_list();
show_log() if $log;

Chronicler: The Domici War (domiciwar.net)

General Ne'er-do-well (github.com/LeamHall)

Re: Finding old copies of files
by hippo (Bishop) on Feb 24, 2021 at 11:22 UTC

    Good to hear that you are finding Perl to be useful in solving problems such as this. For the most part, your code looks in good shape too.

    The one obvious thing I notice when reading through this script is that none of your subroutines take any arguments and none of them return anything (not explicitly, at least, and they are called in void context anyway). It isn't critical in a script of this size, but since you have it working, it might be opportune to try passing some arguments here and there and see how it goes.

    For example you have these:

    build_exclude_list() if $exclude_list_file;
    build_exclude_dir()  if $exclude_dir_file;

    And each of those subs works on a filename declared outside its scope. If I were writing this, I would pass the filename as an argument and have the sub return immediately if the argument is undef, e.g.:

    sub build_exclude_list {
        my $file = shift;
        return unless defined $file;
        open my $exclude_files, '<', $file or die "Can't open $file: $!";
        # ...
    }

    # ...
    build_exclude_list($exclude_list_file);

    Then think about returning the list it builds rather than assigning to a global hash. Here is how you might return a reference to that hash.

    sub build_exclude_list {
        # ...
        return \%exclude_list;
    }

    my $exclude_list = build_exclude_list($exclude_list_file);

    This makes your subroutine independent of variables declared outside it, which in general terms is A Good Thing. It allows for code re-use: you could make the sub perfectly general, put it in a module, and use it from multiple scripts without duplicating the code.
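
    As a concrete illustration, here is a minimal sketch of that module idea. The FileLists package name and the exported build_exclude_list are hypothetical, not part of the script under review:

    # FileLists.pm -- hypothetical module, placed somewhere in @INC.
    package FileLists;

    use strict;
    use warnings;
    use Exporter 'import';
    our @EXPORT_OK = qw(build_exclude_list);

    # Read one entry per line from $file and return a hash ref keyed by entry.
    sub build_exclude_list {
        my ($file) = @_;
        return {} unless defined $file;
        open my $fh, '<', $file or die "Can't open $file: $!";
        my %exclude;
        while ( my $line = <$fh> ) {
            chomp $line;
            $exclude{$line} = 1;
        }
        close $fh;
        return \%exclude;
    }

    1;

    A script would then pull it in with use FileLists qw(build_exclude_list); and call my $exclude_list = build_exclude_list($exclude_list_file); with no globals involved.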

    Anyway, just something to consider.


    🦛

      hippo, thanks! I had some time this morning, and started improving the code based on your comment. While there's still a lot of work to do, I added a method:
      sub write_list_to_logfile {
          my ( $list, $logfile ) = @_;
          open my $file, '>', $logfile or die "Can't open $logfile: $!";
          foreach my $line ( @$list ) {
              print $file "$line\n";
          }
          close $file;
      }
      And then changed the logging method to use the new method:
      if ( keys(%known_dirs) ) {
          my @known_dirs = keys(%known_dirs);
          my $known_dirs_filename = "$dir/known_dirs.list";
          write_list_to_logfile( \@known_dirs, $known_dirs_filename );
      }
      The new method works, but I ran out of time before I could streamline it more. I'm thinking I can eliminate the two assignments, but need to test the idea.
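
      One possible streamlined form, as an untested sketch (it assumes $dir is still the log directory used in the snippet above):

      if ( keys(%known_dirs) ) {
          write_list_to_logfile( [ keys %known_dirs ], "$dir/known_dirs.list" );
      }

      Passing an anonymous array reference built from keys %known_dirs drops both temporaries while keeping the same call signature.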

      Chronicler: The Domici War (domiciwar.net)

      General Ne'er-do-well (github.com/LeamHall)

      I've been updating the code a little at a time, so that it takes references as parameters to subroutines. Just used it to delete 4,423 files; hopefully the right ones!

      Chronicler: The Domici War (domiciwar.net)

      General Ne'er-do-well (github.com/LeamHall)

Re: Finding old copies of files
by parv (Parson) on Feb 27, 2021 at 05:14 UTC

    Looks like you are counting a file as a duplicate if there is already a file with that name. Please correct me if I have that wrong.

    I personally would have compared MD5 (or SHA-256) checksums to remove exact duplicates first. Then I would have created a separate version control repository to collect the files with the same names, so they could be compared later.
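
    For the checksum pass, something along these lines would group exact duplicates (a minimal sketch, not code from the thread; @files stands in for whatever list of full paths the directory walk produces):

    use strict;
    use warnings;
    use Digest::SHA;

    my @files = @ARGV;    # full paths to check, e.g. passed on the command line
    my %by_digest;
    for my $path (@files) {
        my $sha = Digest::SHA->new(256);
        $sha->addfile($path);
        push @{ $by_digest{ $sha->hexdigest } }, $path;
    }

    # Any digest shared by more than one path is a set of exact duplicates.
    for my $digest ( keys %by_digest ) {
        my @dupes = @{ $by_digest{$digest} };
        print "Exact duplicates:\n\t", join( "\n\t", @dupes ), "\n" if @dupes > 1;
    }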

      Same name and size, in:

      $known_files{$file}{$size} = 1;

      An MD5 or SHA-256 sum would catch different files of the same size. However, they are computationally intense and in this use case overkill. I'm making one authoritative version and then revising after things are cleaned up. Thus any changed files are likely to get changed a few more times.

      Chronicler: The Domici War (domiciwar.net)

      General Ne'er-do-well (github.com/LeamHall)

        An MD5 or SHA-256 sum would catch different files of the same size. However, they are computationally intense and in this use case overkill

        Calculating the MD5 of a file should not be significantly slower than copying the file on a modern computer.

        You could even optimize by delaying the MD5 calculation until you find a second file with the same size and base name.
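
        One way to sketch that delay (illustrative only; %seen, md5_of, and check_file are not from the original script):

        use strict;
        use warnings;
        use Digest::MD5;
        use File::Basename;

        my %seen;    # "name|size" => first path seen with that combination

        sub md5_of {
            my ($path) = @_;
            open my $fh, '<:raw', $path or die "Can't open $path: $!";
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            return $digest;
        }

        sub check_file {
            my ($path) = @_;
            my $key = basename($path) . '|' . ( -s $path );
            if ( my $first = $seen{$key} ) {
                # Only now, on a second candidate, pay for the checksums.
                print "$path duplicates $first\n" if md5_of($path) eq md5_of($first);
            }
            else {
                $seen{$key} = $path;
            }
        }

        Files that never collide on name and size are never read a second time, so the checksum cost only applies to actual candidates.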

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        $known_files{$file}{$size} = 1;

        Change that to ...

        $known_files{$file}{$size}++;

        ... and you know how many (possible) duplicates you have found.
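
        Reporting from the counting hash could then be as simple as this sketch, reusing the script's own %known_files:

        for my $file ( sort keys %known_files ) {
            for my $size ( keys %{ $known_files{$file} } ) {
                my $count = $known_files{$file}{$size};
                print "$file ($size bytes): $count copies\n" if $count > 1;
            }
        }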

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Ah, right; I had missed that. Appreciate the correction.