PerlMonks  

Multi-directory Change Reporter

by samtregar (Abbot)
on Sep 17, 2003 at 21:54 UTC ( id://292266 )

I tackled a new (to me) problem today and I thought I'd share my solution. I've never dealt with this particular problem before and I don't remember reading about it. I'd be interested to hear if anyone has solved it in a different way or can think of a better solution.

Problem: I have 20 directories, each containing around 50 files. At one point in the past these directories all contained the same 50 files with the same contents. Since then changes have been made, files have been added, and files have been deleted. I need an overview of which files have changed and which are still the same. I don't have a copy of the original set of files; all I have is the current set of 20 directories.

My first thought was to use diff, but I quickly realized I'd be drowning in data. I don't need to look at the minutiae yet; I just need to get an idea of how much variation there is.

My next idea, which I implemented, was to generate a table like:

            dir1   dir2   dir3
    file1   A      A      A
    file2   A      B      C
    file3   A

In the above table all three directories have identical copies of file1, and all have different copies of file2. Only dir1 has a file3.

I decided that instead of generating an HTML table I'd produce a CSV and load that in OpenOffice Calc (and when that crashed on me, Excel). Then I can produce pretty output for the managers to explain fully why making a "simple" change across all these directories will take so long.

To do the actual comparisons I used Digest::MD5 to compute MD5 sums for each file. Then I produced the letter designations by keeping a hash that maps each distinct MD5 sum seen in a row to the next unused letter. Here's the code I used:

```perl
#!/usr/bin/perl -w
use Digest::MD5 qw(md5_hex);

# collect an MD5 sum for every .tmpl and .pl file in every cms* directory
my @dirs = sort glob("cms*");
my %files;
foreach my $dir (@dirs) {
    chdir($dir) or die $!;
    my @files = sort (glob("*.tmpl"), glob("*.pl"));
    foreach my $file (@files) {
        open(my $fh, '<', $file) or die $!;
        my $text = join('', <$fh>);
        $files{$file}{$dir} = md5_hex($text);
    }
    chdir('..') or die $!;
}

# emit one CSV row per file: identical MD5s share a letter,
# missing files get an empty cell
print ',', join(', ', @dirs), "\n";
foreach my $file (sort keys %files) {
    my %key;            # maps an MD5 sum to its letter for this row
    my $next = 'A';     # next unused letter
    my @row  = ($file);
    foreach my $dir (@dirs) {
        my $md5 = $files{$file}{$dir};
        if ($md5) {
            push(@row, exists $key{$md5} ? $key{$md5} : ($key{$md5} = $next++));
        } else {
            push(@row, '');
        }
    }
    print join(', ', @row), "\n";
}
```
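The letter-assignment step can be pulled out into a small function, which makes it easy to check in isolation. A minimal sketch (the name `letters_for_row` is mine, not from the original script):

```perl
use strict;
use warnings;

# Given the MD5 sums for one file across all directories (undef or ''
# where the file is missing), return the letter designations: the first
# distinct sum becomes 'A', the next 'B', and so on; equal sums share
# a letter, and missing files map to ''.
sub letters_for_row {
    my @md5s = @_;
    my (%key, @out);
    my $next = 'A';
    for my $md5 (@md5s) {
        if (defined $md5 && length $md5) {
            $key{$md5} = $next++ unless exists $key{$md5};
            push @out, $key{$md5};
        } else {
            push @out, '';
        }
    }
    return @out;
}
```

For example, `letters_for_row('x1', 'y2', 'x1', undef)` yields `('A', 'B', 'A', '')` — the first and third directories hold identical copies.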

Note that the code assumes a couple things peculiar to my problem - the target directories start with "cms" and the files I'm interested in end in ".pl" and ".tmpl".
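One more caveat: the script slurps each file into memory before hashing, which is fine for small templates but wasteful if a directory ever holds large files. Digest::MD5 can stream a filehandle instead via addfile(). A sketch of that variant (the helper name `md5_of_file` is my own):

```perl
use strict;
use warnings;
use Digest::MD5;

# Compute a file's MD5 sum without reading the whole file into memory.
# binmode matters here: MD5 is defined over the raw bytes on disk.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't open $path: $!";
    binmode $fh;
    return Digest::MD5->new->addfile($fh)->hexdigest;
}
```

Dropping this in place of the open/join/md5_hex lines above would produce the same sums for the same bytes.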

The end result provided me with a number of useful insights into the overall variety of the data. I can now apply diff to examine particular changes and use the formatted output to justify my estimates.

-sam

Replies are listed 'Best First'.
Re: Multi-directory Change Reporter
by chanio (Priest) on Sep 19, 2003 at 03:14 UTC
    The oldest files might be those that were from that time that all were the same!

    I would sort them by date and integrate their contents in that order (from the oldest to the newest). It wouldn't matter much where they come from...
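    The oldest-first ordering suggested here is easy to get from stat(). A minimal sketch, assuming the copies of one file live at paths like cms*/foo.tmpl (the filename foo.tmpl is a placeholder, not from the post):

```perl
use strict;
use warnings;

# Sort the copies of one file oldest-first by modification time;
# the oldest copy is the most likely to match the original.
# stat()[9] is the mtime field.
my @paths = glob("cms*/foo.tmpl");    # placeholder filename
my @oldest_first = sort { (stat $a)[9] <=> (stat $b)[9] } @paths;
print "$_\n" for @oldest_first;
```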

    Sorry if I didn't understand your exposition well.

      That's a good idea. These aren't actually files on the filesystem, but the source I get them from does track a modification date.

      Thanks!
      -sam

Node Type: perlmeditation [id://292266]
Approved by gmax