PerlMonks  

Multi-directory Change Reporter

by samtregar (Abbot)
on Sep 17, 2003 at 21:54 UTC ( id://292266 )

I tackled a new (to me) problem today and I thought I'd share my solution. I've never dealt with this particular problem before and I don't remember reading about it. I'd be interested to hear if anyone has solved it in a different way or can think of a better solution.

Problem: I have 20 directories, each containing around 50 files. At one point in the past these directories all contained the same 50 files with the same contents. Since then changes have been made, files have been added, and files have been deleted. I need an overview of which files have changed and which are still the same. I don't have a copy of the original set of files; all I have is the current set of 20 directories.

My first thought was to use diff, but I quickly realized I'd be drowning in data. I don't need to look at the minutiae yet; I just need to get an idea of how much variation there is.

My next idea, which I implemented, was to generate a table like:

            dir1   dir2   dir3
    file1   A      A      A
    file2   A      B      C
    file3   A

In the above table all three directories have identical copies of file1, and all have different copies of file2. Only dir1 has a file3.

I decided that instead of generating an HTML table I'd produce a CSV and load that in OpenOffice Calc (and when that crashed on me, Excel). Then I can produce pretty output for the managers to explain fully why making a "simple" change across all these directories will take so long.

To do the actual comparisons I used Digest::MD5 to compute MD5 sums for each file. Then I produced the letter designations by keeping a hash that maps each distinct MD5 sum seen in a row to the next unused letter. Here's the code I used:

```perl
#!/usr/bin/perl -w
use Digest::MD5 qw(md5_hex);

# collect an MD5 sum for every .tmpl and .pl file in every cms* directory
my @dirs = sort glob("cms*");
my %files;
foreach my $dir (@dirs) {
    chdir($dir) or die $!;
    my @files = sort (glob("*.tmpl"), glob("*.pl"));
    foreach my $file (@files) {
        open(my $fh, '<', $file) or die $!;
        my $text = join('', <$fh>);
        $files{$file}{$dir} = md5_hex($text);
    }
    chdir('..') or die $!;
}

# emit one CSV row per file: identical MD5s share a letter,
# missing files get an empty cell
print ',', join(', ', @dirs), "\n";
foreach my $file (sort keys %files) {
    my %key;            # maps an MD5 sum to its letter for this row
    my $next = 'A';     # next unused letter
    my @row  = ($file);
    foreach my $dir (@dirs) {
        my $md5 = $files{$file}{$dir};
        if ($md5) {
            push(@row, exists $key{$md5} ? $key{$md5} : ($key{$md5} = $next++));
        } else {
            push(@row, '');
        }
    }
    print join(', ', @row), "\n";
}
```
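The letter-assignment step can be pulled out into a small function, which makes it easy to check in isolation. A minimal sketch (the name `letters_for_row` is mine, not from the original script):

```perl
use strict;
use warnings;

# Given the MD5 sums for one file across all directories (undef or ''
# where the file is missing), return the letter designations: the first
# distinct sum becomes 'A', the next 'B', and so on; equal sums share
# a letter, and missing files map to ''.
sub letters_for_row {
    my @md5s = @_;
    my (%key, @out);
    my $next = 'A';
    for my $md5 (@md5s) {
        if (defined $md5 && length $md5) {
            $key{$md5} = $next++ unless exists $key{$md5};
            push @out, $key{$md5};
        } else {
            push @out, '';
        }
    }
    return @out;
}
```

For example, `letters_for_row('x1', 'y2', 'x1', undef)` yields `('A', 'B', 'A', '')` — the first and third directories hold identical copies.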

Note that the code assumes a couple things peculiar to my problem - the target directories start with "cms" and the files I'm interested in end in ".pl" and ".tmpl".
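One more caveat: the script slurps each file into memory before hashing, which is fine for small templates but wasteful if a directory ever holds large files. Digest::MD5 can stream a filehandle instead via addfile(). A sketch of that variant (the helper name `md5_of_file` is my own):

```perl
use strict;
use warnings;
use Digest::MD5;

# Compute a file's MD5 sum without reading the whole file into memory.
# binmode matters here: MD5 is defined over the raw bytes on disk.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't open $path: $!";
    binmode $fh;
    return Digest::MD5->new->addfile($fh)->hexdigest;
}
```

Dropping this in place of the open/join/md5_hex lines above would produce the same sums for the same bytes.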

The end result provided me with a number of useful insights into the overall variety of the data. I can now apply diff to examine particular changes and use the formatted output to justify my estimates.

-sam

Replies are listed 'Best First'.
Re: Multi-directory Change Reporter
by chanio (Priest) on Sep 19, 2003 at 03:14 UTC
    The oldest files might be those that were from that time that all were the same!

    I would sort them by date and integrate their contents in that order (from the oldest to the newest). It wouldn't matter much where they come from...
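    The oldest-first ordering suggested here is easy to get from stat(). A minimal sketch, assuming the copies of one file live at paths like cms*/foo.tmpl (the filename foo.tmpl is a placeholder, not from the post):

```perl
use strict;
use warnings;

# Sort the copies of one file oldest-first by modification time;
# the oldest copy is the most likely to match the original.
# stat()[9] is the mtime field.
my @paths = glob("cms*/foo.tmpl");    # placeholder filename
my @oldest_first = sort { (stat $a)[9] <=> (stat $b)[9] } @paths;
print "$_\n" for @oldest_first;
```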

    Sorry if I didn't understand your exposition well.

      That's a good idea. These aren't actually files on the filesystem, but the source I get them from does track a modification date.

      Thanks!
      -sam

Node Type: perlmeditation [id://292266]
Approved by gmax