PerlMonks  

MD5 checksums for Windows

by golux (Chaplain)
on Jan 13, 2015 at 16:58 UTC ( [id://1113107]=CUFP )

This program "sum.pl" (for Windows) generates checksums matching those produced by the "md5sum" program in Linux. I wrote it because I often need to verify whether two files on different computers are the same.

Run "sum.pl" without arguments for a syntax message. Both files and directories (i.e. "folders") are accepted as arguments. With the -R switch, subdirectories are searched recursively. The switches -s <key> and -r control how the output is sorted. The -d switch gives a final report of any duplicate checksums found.

Hope this might be of general use to others as well!

Update:   At Anonymous Monk's suggestion, I've added a "-c" switch which produces a checksum format compatible with "md5sum". It does this by skipping the filesize, and prefixing the path with '*' to signify that the checksum was done in binary mode.

#!/usr/bin/perl -w

###############
## Libraries ##
###############
use strict;
use warnings;
use Data::Dumper;
use Digest::MD5 qw{ md5 md5_hex };
use File::Basename;
use Getopt::Long;
use IO::File;

#############
## Globals ##
#############
$| = 1;
my $iam       = basename $0;
my $b_recurse = 0;
my $b_reverse = 0;
my $b_dups    = 0;
my $b_compat  = 0;
my $h_sums    = 0;
my $sortkey   = "";
my $syntax    = qq{
    syntax:  $iam [switches] <file> [file ...]

    Generates the MD5 checksum for one or more <files>, and displays
    the file size (in bytes), the checksum and the filename for each.

    If a given <file> refers to a directory the checksum is generated
    for all files within it (use the -R switch to recurse through its
    subdirs as well).

    Switches
      -c .......... Compatible output with "md5sum" in binary mode
      -R .......... Recurse subdirs when <file> is a directory
      -s <key> .... Sort files in subdirs by the given <key>, where
                    <key> is one of:  "name" (default), "size", "sum"
      -r .......... Reverse the order of the sort
      -d .......... Find and report files with duplicate sums
};

##################
## Command-line ##
##################
Getopt::Long::Configure("bundling");
my $go = GetOptions(
    "c"   => \$b_compat,
    "R"   => \$b_recurse,
    "r"   => \$b_reverse,
    "s=s" => \$sortkey,
    "d"   => \$b_dups,
);
$go or die $syntax;
(@ARGV > 0) or die $syntax;

##################
## Main program ##
##################
md5sum_file($_) for @ARGV;
$h_sums and show_duplicates($h_sums);

#################
## Subroutines ##
#################

# Die with the program name and the caller's line number
sub fatal {
    my ($err) = @_;
    my $lnum  = (caller)[2];
    my $text  = "${iam}[$lnum] FATAL: $err";
    die "$text\n";
}

# Dispatch on a command-line argument: plain file or directory
sub md5sum_file {
    my ($fname) = @_;
    (-f $fname) and return show_md5sum($fname);
    if (-d $fname) {
        my $dir = $fname;
        return md5sum_dir($dir);
    }
}

# Compute the hex MD5 of a file (binary mode); record it when -d is in effect
sub generate_md5sum {
    my ($fname) = @_;
    my $o_md5 = Digest::MD5->new;
    my $fh    = IO::File->new;
    open($fh, "<", $fname) or fatal("Failed to open '$fname' ($!)");
    binmode($fh);
    my $sum = $o_md5->addfile($fh)->hexdigest();
    close $fh;
    if ($b_dups) {
        $h_sums ||= { };
        my $a_files = $h_sums->{$sum} ||= [ ];
        push @$a_files, $fname;
    }
    return $sum;
}

# Print one output line; $a_sum is an optional precomputed [ size, sum ]
sub show_md5sum {
    my ($fname, $a_sum) = @_;
    $fname =~ s:\\:/:g;
    $a_sum ||= [ -s $fname, generate_md5sum($fname) ];
    my ($size, $sum) = @$a_sum;
    if ($b_compat) {
        printf "%s *%s\n", $sum, $fname;
    } else {
        printf " %10d %s %s\n", $size, $sum, $fname;
    }
}

# Checksum every plain file in a directory, then recurse if -R was given
sub md5sum_dir {
    my ($dir) = @_;
    print "\n";
    my $fh = IO::File->new;
    opendir($fh, $dir) or fatal("Can't read dir '$dir' ($!)");
    my @files = readdir($fh);
    closedir $fh;

    my $h_sorted = { };
    my $a_dirs   = [ ];
    foreach my $fname (@files) {
        next if ($fname eq '.' or $fname eq '..');
        my $path = "$dir/$fname";
        (-l $path) and next;
        if (-d $path) {
            $b_recurse and push @$a_dirs, $path;
            next;
        }
        (-f $path) or next;
        if (not $sortkey) {
            show_md5sum($path);
        } else {
            my $size = (-s $path);
            my $sum  = generate_md5sum($path);
            $h_sorted->{$path} = [ $size, $sum, lc $path ];
        }
    }
    $sortkey and show_sorted($h_sorted);
    md5sum_dir($_) for @$a_dirs;
}

# Print the collected entries ordered by size, sum or (default) name
sub show_sorted {
    my ($h) = @_;
    my @keys = keys %$h;
    if ($sortkey eq 'size') {
        @keys = sort { $h->{$a}->[0] <=> $h->{$b}->[0] } @keys;
    } elsif ($sortkey eq 'sum') {
        @keys = sort { $h->{$a}->[1] cmp $h->{$b}->[1] } @keys;
    } else {
        @keys = sort { $h->{$a}->[2] cmp $h->{$b}->[2] } @keys;
    }
    $b_reverse and @keys = reverse @keys;
    foreach my $path (@keys) {
        show_md5sum($path, $h->{$path});
    }
}

# Report every checksum that was seen for more than one file
sub show_duplicates {
    my ($h) = @_;
    my @dups   = grep { @{$h->{$_}} > 1 } keys %$h;
    my @sorted = sort { @{$h->{$a}} <=> @{$h->{$b}} } @dups;
    foreach my $dup (@sorted) {
        my $a_files = $h->{$dup};
        print "\n [Duplicate Sum '$dup']\n";
        for (my $i = 0; $i < @$a_files; $i++) {
            my $fname = $a_files->[$i];
            printf " %3d. %s\n", $i+1, $fname;
        }
    }
}
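
Since the stated goal is to check whether files on two machines are the same, the other half of the job is reading a list in the "-c" format back and comparing it with the local files. The following is only a minimal sketch of that idea, not part of sum.pl: the helpers file_md5_hex, parse_sum_line and verify_lines are hypothetical names, and the line format assumed is the one md5sum prints (32 hex digits, a space, then '*' or a space, then the path).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Hypothetical helper (not in sum.pl): hex MD5 of a file read in binary
# mode. Returns undef if the file cannot be opened.
sub file_md5_hex {
    my ($path) = @_;
    open(my $fh, '<', $path) or return undef;
    binmode($fh);
    my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $sum;
}

# Hypothetical helper: split one md5sum-style line into (sum, path).
# Returns an empty list when the line does not look like a checksum line.
sub parse_sum_line {
    my ($line) = @_;
    $line =~ /^([0-9a-fA-F]{32}) [ *](.+)$/ or return;
    return (lc $1, $2);
}

# Compare each listed checksum with the file on disk.
# Prints OK/FAILED per file and returns the number of failures.
sub verify_lines {
    my (@lines) = @_;
    my $bad = 0;
    foreach my $line (@lines) {
        chomp $line;
        my ($want, $path) = parse_sum_line($line) or next;
        my $got = file_md5_hex($path);
        my $ok  = defined $got && $got eq $want;
        printf "%s: %s\n", $path, $ok ? 'OK' : 'FAILED';
        $ok or $bad++;
    }
    return $bad;
}

# Demo on a temporary file: one correct line and one deliberately wrong one
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
binmode($tmp);
print $tmp "hello\r\n";
close $tmp;

my $good  = file_md5_hex($tmpname) . " *$tmpname";
my $wrong = ('0' x 32) . " *$tmpname";
my $mismatches = verify_lines($good, $wrong);
print "mismatches: $mismatches\n";
```

In real use the lines would come from a file produced on the other machine (for example by redirecting the "-c" output), and the paths in that file would have to be valid on the machine doing the check.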

say  substr+lc crypt(qw $i3 SI$),4,5

Replies are listed 'Best First'.
Re: MD5 checksums for Windows
by Anonymous Monk on Jan 13, 2015 at 21:02 UTC

    Thanks for sharing!

    Just an observation, which may not be relevant, since your script's output format doesn't exactly match that of md5sum (you include the file's size):

    md5sum offers a -b switch, which mainly adds the "b" to the fopen call, which "has no effect; the 'b' is ignored on all POSIX conforming systems, including Linux." It also adds an asterisk to the output line to indicate that the binary mode was used, like so:

    $ md5sum -b *
    fb8d98be1265dd88bac522e1b2182140 *foo.txt
    f83a0aa1f9ca0f7dd5994445ba7d9e80 *bar.txt
    d6a6bc0db10694a2d90e3a69648f3a03 *quz.txt

    But in my experience people usually ignore the -b switch because it has no effect on those systems. On Windows, of course it's a different story, since there, fopen cares about the "b" mode (among other things, "translations involving carriage-return and linefeed characters are suppressed").

    Now whether or not this is an issue at all depends on whether there's a Windows MD5 checksumming tool that defaulted to reading files in text mode. But since you're on Windows and you use binmode, if you felt like making your output more similar to that of md5sum, you could consider adding the asterisk to your output.
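
    To make the text-mode concern concrete, here is a small sketch of why it matters. It does not open a file in text mode; instead it simulates the CRLF-to-LF translation with a substitution (an assumption made so the example behaves the same on any platform) and shows that the digest of the translated bytes differs from the digest of the raw bytes.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Raw bytes as they sit on disk in a Windows-style text file
my $raw = "line one\r\nline two\r\n";

# Simulated text-mode read: CRLF pairs collapsed to LF.
# (Assumption: this stands in for what a text-mode read would hand back.)
(my $translated = $raw) =~ s/\r\n/\n/g;

my $sum_binary = md5_hex($raw);
my $sum_text   = md5_hex($translated);

print "binary: $sum_binary\n";
print "text:   $sum_text\n";
print "digests ", ($sum_binary eq $sum_text ? "match" : "differ"), "\n";
```

    A checksum computed over translated data would not match what md5sum reports on Linux for the same file, which is why the script's binmode call matters on Windows.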

      Thanks for the suggestion!

      I was unaware of the -b switch, and you're right that the output format doesn't match that of md5sum. Since it's a pretty easy fix (less than 10 lines), I'll add a "-c" (compatibility) switch to the program now.

      say  substr+lc crypt(qw $i3 SI$),4,5
        you rock! that's been on my todo list for ages
Re: MD5 checksums for Windows
by aussiecoder (Acolyte) on Oct 25, 2016 at 04:30 UTC

    When I run the script on my system, it runs for a while then stops with the following error ...

    C:\tmp\Shared\Downloads>perl sum.pl -R c:/data/db
         32768 aa8fe23d2d4c2495ff521b2efbefd30d c:/data/db/collection-0-3854299608590736422.wt
         90112 540e61cd865a1c8e72255d658c00840b c:/data/db/collection-0-6126630059692173099.wt
    ... c:/data/db/index-4105-3854299608590736422.wt
    sum.pl[99] FATAL: Failed to open 'c:/data/db/mongod.lock' (Permission denied)

    I tried putting an if(-r $fname) around the md5 calculation, but I still get the same error.

    Any ideas on how I can modify the script to skip past this file?
      I've modified generate_md5sum to look like this ...
      # Note: LOCK_EX, LOCK_NB and LOCK_UN must be imported, e.g. "use Fcntl qw(:flock);"
      sub generate_md5sum {
          my ($fname) = @_;
          my $sum   = "file-not-readable ";
          my $o_md5 = Digest::MD5->new;
          my $fh    = IO::File->new;
          if ( open( $fh, "+<", $fname ) ) {
              if ( flock( $fh, LOCK_EX | LOCK_NB ) ) {
                  binmode($fh);
                  $sum = $o_md5->addfile($fh)->hexdigest();
                  flock( $fh, LOCK_UN );
                  close $fh;
                  if ($b_dups) {
                      $h_sums ||= {};
                      my $a_files = $h_sums->{$sum} ||= [];
                      push @$a_files, $fname;
                  }
              }
          }
          return $sum;
      }
      Which allows the script to continue over locked or unreadable files. Is this a good solution to this problem?
        Hi aussiecoder,

        Just now saw your two responses.

        I'm guessing that because you're on Windows, you have some other process vying for the file. Windows is much more restrictive than Linux when it comes to separate processes trying to access the same file.

        If that's the case, then yes it is probably a fine solution, especially since you're using the LOCK_NB (non-blocking) flag, as long as you're okay with that file's checksum being skipped if the other process accesses it first. The only thing I would worry about is that possibly the other process might fail for the same reason, unless it, too, incorporates file-locking.

        Cheers, golux

        say  substr+lc crypt(qw $i3 SI$),4,5
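
A lighter-weight variation on the same idea, offered only as a sketch: the modified generate_md5sum above opens with "+<", which asks for read-write access, so a file that can be read but not written would also be reported as not readable. The hypothetical try_md5sum below opens read-only, warns and returns undef when the open fails, and wraps addfile in eval because Digest::MD5 documents that addfile croaks on a read error. Whether this is enough for a file such as mongod.lock has not been tested here; it assumes the failure shows up at open or read time.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Hypothetical replacement for the open/digest part of generate_md5sum.
# Returns the hex digest, or undef (after a warning) if the file cannot
# be opened or read.
sub try_md5sum {
    my ($fname) = @_;
    my $fh;
    unless (open($fh, '<', $fname)) {
        warn "Skipping '$fname' ($!)\n";
        return undef;
    }
    binmode($fh);
    # addfile croaks if reading fails, so trap that and treat it as a skip
    my $sum = eval { Digest::MD5->new->addfile($fh)->hexdigest };
    close $fh;
    defined $sum or warn "Skipping '$fname' (read error)\n";
    return $sum;
}

# Demo: one readable temporary file and one path that does not exist
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
binmode($tmp);
print $tmp "some data\n";
close $tmp;

my $sum_ok      = try_md5sum($tmpname);
my $sum_missing = try_md5sum("$tmpname.does-not-exist");

print "readable file: ", (defined $sum_ok      ? $sum_ok : "skipped"), "\n";
print "missing file:  ", (defined $sum_missing ? $sum_missing : "skipped"), "\n";
```

A caller would then print and record the checksum only when the return value is defined, so skipped files stay out of the duplicate report instead of all sharing one placeholder string.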

Node Type: CUFP [id://1113107]
Front-paged by Arunbear