PerlMonks  

Handling Hash Comparison

by omega_monk (Scribe)
on Jul 23, 2005 at 03:46 UTC ( [id://477415]=perlquestion )

omega_monk has asked for the wisdom of the Perl Monks concerning the following question:

Monks,
I am comparing two hashes of filename/MD5 digest pairs, and I am stuck. I use File::Find to build a recursive list of the files under the source and destination dirs given on the command line, and then I compute an MD5 digest for every file in those lists. That part works fine. Here is what I have right now; the error is in the foreach loop.
my %sdirdigest{
    "c:/temp/filename.txt"=>"4eb788842f1253903e16a1a0cfe46f2f",
    "c:/temp/level2/filename2.txt"=>"fb3e103dee9f40881fbe8fdccdc4d07a"
};
my %ddirdigest{
    "e:/temp/filename.txt"=>"4eb7adfaaf1253903e16a1a0cfe46f2f",
    "e:/temp/level1/filename2.txt"=>"1fbe8fdccdc4d07afb3e103dee9f4088"
};

foreach my $key (keys %sdirdigest) {
    print "Checking $key..." if $verbose == 1;
    if ((defined($ddirdigest{$key})) && ($sdirdigest{$key} eq $ddirdigest{$key})) {
        print "SYNCHRONIZED...\n" if $verbose == 1;
    } else {
        print "UPDATING...\n" if $verbose == 1;
        my $todo = 'cp "'.$key.'" "'.$key.'"';
        system($todo);
    }
}

I know that the issue is that I am using $key wrong; that usage dates from when I was not doing a recursive check, if that makes sense. I am fairly sure this is a design problem, but I am not sure what I need to change. I tried searching but came up empty-handed, because I am not really sure what to search for. I don't need the exact answer, just a pointer in the right direction. If you need more information, please let me know what would help.
Thanks.

update: Updated with the rest of my code.
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
use Cwd;
use File::Find;
use Digest::MD5;
use File::Basename;
use Getopt::Long;

my %fdirdigest;
my %sdirdigest;
my $fdircount='0';
my $sdircount='0';
my($fbn,$fdn,$fdg,$bn,$dn,$dg,@fdir_list,@sdir_list);
my $fdir='';
my $sdir='';
my $recursive='0';
my $verbose='0';

GetOptions(
    'verbose'=>\$verbose,
    'recursive'=>\$recursive,
    'src=s'=>\$fdir,
    'dst=s'=>\$sdir
) or die "Oops, check your command line parms. $^E";

if (!(-e $fdir && -e $sdir)) {
    print 'I can not proceed, since both dirs do not exist.'."\n";
    exit();
}

if ($recursive == 1) {
    find(\&push_to_fdir,$fdir);
    find(\&push_to_sdir,$sdir);
} else {
    @fdir_list = glob("$fdir*");
    @sdir_list = glob("$sdir*");
}

foreach my $filef (@fdir_list) {
    if (-d $filef) { next(); }
    my $fdg;
    $fbn=basename($filef);
    $fdn=dirname($filef);
    chdir($fdn);
    my $fmd5 = Digest::MD5->reset;
    open(FILE,$fbn) or die "Unable to open the file: $^E\n";
    binmode(FILE);
    while (<FILE>) {
        $fmd5->add($_);
    }
    close(FILE);
    $fdg = $fmd5->hexdigest();
    $fdirdigest{"$filef"}=$fdg;
    $fdircount++;
}

foreach my $files (@sdir_list) {
    if (-d $files) { next(); }
    my $dg;
    $bn=basename($files);
    $dn=dirname($files);
    chdir($dn);
    my $md5 = Digest::MD5->reset;
    open(FILE,$bn) or die "Unable to open the file: $^E\n";
    binmode(FILE);
    while (<FILE>) {
        $md5->add($_);
    }
    close(FILE);
    $dg = $md5->hexdigest();
    $sdirdigest{"$files"}=$dg;
    $sdircount++;
}

foreach my $key (keys %fdirdigest) {
    print "Checking $key..." if $verbose == 1;
    if ((defined($sdirdigest{$key})) && ($fdirdigest{$key} eq $sdirdigest{$key})) {
        print "SYNCHRONIZED...\n" if $verbose == 1;
    } else {
        print "UPDATING...\n" if $verbose == 1;
        my $todo = 'cp "'.$key.'" "'.$key.'"';
        system($todo);
    }
}

print "\nFiles Processed:\t$fdir\t$fdircount\n";
print " \t$sdir\t$sdircount\n";

sub push_to_fdir {
    my $dir = getcwd;
    my $fp = "$dir/$_";
    #print "$fp\n";
    push(@fdir_list, "$fp") unless $_ =~ /^\.$|^\.\.$|[T|t]humbs.db/ or $fp =~ /\/\//;
}

sub push_to_sdir {
    my $dir = getcwd;
    my $fp = "$dir/$_";
    #print "$fp\n";
    push(@sdir_list, "$fp") unless $_ =~ /^\.$|^\.\.$|[T|t]humbs.db/ or $fp =~ /\/\//;
}

Replies are listed 'Best First'.
Re: Handling Hash Comparison
by Errto (Vicar) on Jul 23, 2005 at 04:04 UTC

    I see a couple of things here. First of all, not sure if this is a typo, but the way you're creating the hashes isn't right. The normal way to assign to a hash is

    my %hash = ( key1 => value1, key2 => value2 ); ...
    Using parentheses instead of braces is important because the parentheses denote a list of keys and values (which is what you want), whereas the braces denote a single scalar value that happens to be a hash reference (which is not what you want). Update: this paragraph is not applicable to the real code in the OP.
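
    To make the list-versus-reference distinction concrete, here is a minimal sketch. The variable names are placeholders chosen for illustration, and the digests are the sample values from the question.

```perl
use strict;
use warnings;

# Parentheses build a list of key/value pairs, which initializes the hash.
my %digest_of = (
    'filename.txt'         => '4eb788842f1253903e16a1a0cfe46f2f',
    'level2/filename2.txt' => 'fb3e103dee9f40881fbe8fdccdc4d07a',
);

# Braces build an anonymous hash and yield one scalar: a reference to it.
my $digest_ref = {
    'filename.txt' => '4eb788842f1253903e16a1a0cfe46f2f',
};

# A hash is indexed directly; a reference is dereferenced with ->.
my $from_hash = $digest_of{'filename.txt'};
my $from_ref  = $digest_ref->{'filename.txt'};

print "same digest\n" if $from_hash eq $from_ref;
```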

    The other thing is that if your real data is like what you have here, you'll never get a match because the keys in %sdirdigest will never have exactly the same value as any key in %ddirdigest because the root paths are different. Update: more precisely, the keys that go into both of your hashes are absolute paths. What you really want are the relative paths to the files, based on the respective starting directories. The way to fix this is to use $File::Find::dir instead of getcwd in your subs. Also it looks like you're trying to copy a file onto itself. Is that what you meant?
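
    One possible way to get keys that are relative to each starting directory is sketched below. Note that $File::Find::dir and $File::Find::name still begin with the directory passed to find, so that prefix has to be removed; here File::Spec->abs2rel does it. The helper name relative_files is hypothetical and not part of the original script.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;

# Hypothetical helper: list the plain files under $top as paths relative
# to $top, so listings taken from two different roots are comparable.
sub relative_files {
    my ($top) = @_;
    my @files;
    find(
        {
            no_chdir => 1,    # stay in place; use the full name for tests
            wanted   => sub {
                return unless -f $File::Find::name;
                push @files, File::Spec->abs2rel( $File::Find::name, $top );
            },
        },
        $top
    );
    @files = sort @files;
    return @files;
}
```

    Calling relative_files on the source and on the destination then gives two lists whose entries can be used directly as hash keys.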

      No, I posted my code in the readmore tags in my original post. Initially I was only doing this in the dir that was specified on the command line, but later I decided to add the ability to recurse dirs. My original code concatenated the dir onto the $key so that it copied appropriately, but when I added recursion, I could no longer copy the files in the same way. The problem is that I have no idea how best to proceed.

      Take a look at the rest of my code? Thanks.

      Update: apologies if that came off inappropriately; I really just meant to ask for another look. Thanks for the pointer, I will take a look at $File::Find::dir. One question, though: does the logic make sense for getting the filename for the second dir? I am probably just missing something obvious. I appreciate the help. Thanks again.
        Ok. I posted an update above - basically you can use $File::Find::dir instead of getcwd so that the relative path instead of the absolute path becomes the key. But now I'm unclear - I understand that if two files in corresponding relative paths have matching MD5 sums you want to do something, but I don't understand what that is. Right now what would happen is that you copy the file onto itself.
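
        If the goal is to refresh the destination from the source, the comparison and the copy could be sketched roughly as follows. This assumes both hashes are keyed by relative path; files_to_update and sync_file are hypothetical names, and File::Copy stands in for the external cp command.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Copy qw(copy);
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: relative paths whose digest is missing from,
# or different in, the destination hash.
sub files_to_update {
    my ( $src_digest, $dst_digest ) = @_;
    my @need = grep {
        !exists $dst_digest->{$_}
            || $src_digest->{$_} ne $dst_digest->{$_}
    } sort keys %$src_digest;
    return @need;
}

# Hypothetical helper: copy one relative path from the source root to the
# destination root, creating the destination directory when needed.
sub sync_file {
    my ( $src_root, $dst_root, $rel ) = @_;
    my $from = File::Spec->catfile( $src_root, $rel );
    my $to   = File::Spec->catfile( $dst_root, $rel );
    make_path( dirname($to) );
    copy( $from, $to ) or die "Cannot copy $from to $to: $!";
    return $to;
}
```

        The point of the design is that the same relative key is joined to two different roots, so the source and target of the copy are never the same file.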
Re: Handling Hash Comparison
by kvale (Monsignor) on Jul 23, 2005 at 03:56 UTC
    It is not clear what you are trying to do, but let me make a guess. I expect that you want to check files for duplicates and are using MD5 as a signature to do so.

    If this is the task, it will go easier if the MD5 signatures are the keys of the hash and the file paths are the values. That way, when you come upon a new file, simply calculate its MD5 signature and see if that signature already exists as a hash key.

    Here is some code to illustrate that:

    use File::Find;
    use Digest::MD5;

    my %file_sig;    # MD5 digest => first file seen with it
    find( \&find_dup, $root_dir );

    sub find_dup {
        my $name      = $_;    # basename; find() has chdir'ed here
        my $full_name = $File::Find::name;
        return unless -f $name;

        open my $in, '<', $name or die "Cannot open $full_name: $!";
        binmode $in;
        my $digest = Digest::MD5->new->addfile($in)->hexdigest;
        close $in;

        if ( exists $file_sig{$digest} ) {
            print "$full_name is a duplicate of $file_sig{$digest}\n";
        }
        else {
            $file_sig{$digest} = $full_name;
        }
    }

    -Mark

      No, not exactly, I am synchronizing 2 dirs. I updated my post with the rest of the code in readmore tags...
Re: Handling Hash Comparison
by TilRMan (Friar) on Jul 23, 2005 at 07:07 UTC

    Take a look at rsync, which does what it seems you are trying to do. I don't know how well it works on Windows, though I expect it should work fine in Cygwin.

Re: Handling Hash Comparison
by polettix (Vicar) on Jul 23, 2005 at 13:45 UTC
    Update: I noticed that Errto already pointed out your mistake with absolute filenames as keys. Repetita juvant :)

    You're comparing files from different disks. So, you simply have that:

    "c:/temp/filename.txt" eq "e:/temp/filename.txt"
    is false, even if you expect it to be true. You should therefore remove the prefix that makes your paths different, and you'll probably be happy then. In your case, you can remove the disk specifier, for example:
    "/temp/filename.txt" eq "/temp/filename.txt"
    is true at last. Also, keep in mind that Windows filenames do not make distinctions between uppercase and lowercase (at least FAT32), so you'd better use lc on the filenames before using them as keys in your hashes.
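
    A minimal sketch of such key normalization follows. It generalizes the idea by stripping the whole starting directory instead of only the disk specifier, and it folds case with lc, which only makes sense on a case-insensitive file system. The helper name path_key is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an absolute path into a comparison key by
# removing the root it was found under and folding case.
sub path_key {
    my ( $path, $root ) = @_;
    $root =~ s{/+\z}{};                           # ignore trailing slashes
    ( my $key = $path ) =~ s{\A\Q$root\E/}{};     # drop "<root>/" if present
    return lc $key;
}

my $src_key = path_key( 'c:/temp/level2/Filename2.txt', 'c:/temp' );
my $dst_key = path_key( 'e:/temp/level2/filename2.txt', 'e:/temp/' );
print "keys match\n" if $src_key eq $dst_key;
```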

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
Re: Handling Hash Comparison
by BaldPenguin (Friar) on Jul 23, 2005 at 18:16 UTC
    I too would look at rsync. Most of the ports for Windows seem to require Cygwin, although a native port is in alpha stage on SourceForge. There is a module to help called File::Rsync; I haven't tried it, but it seems straightforward. Perhaps some other monks will have experience with it.
    You could also try Unison. It has ports for Windows already, which I have used for syncing backups across a network. And best of all, it's GPL.

    Don
    WHITEPAGES.COM | INC
    Everything I've learned in life can be summed up in a small perl script!
