in reply to
Re: Module for Approximate File Comparison

in thread Module for Approximate File Comparison

You mean something like:

`#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip);
my ( $f1, $f2 ) = @ARGV[ 0 .. 1 ];
my %comparison;
# Get file contents, and compute gzipped lengths.
for my $file ( $f1, $f2 ) {
get_content( $file, \%{ $comparison{$file} } );
computations( \%{ $comparison{$file} } );
}
# Try both orderings of files to determine which produces
# better results. The content is not needed after lengths
# have been computed.
$comparison{test1}{content} =
$comparison{$f1}{content} . $comparison{$f2}{content};
computations( \%{ $comparison{test1} } );
delete $comparison{test1}{content};
$comparison{test2}{content} =
$comparison{$f2}{content} . $comparison{$f1}{content};
computations( \%{ $comparison{test2} } );
delete $comparison{test2}{content};
# The original content is no longer needed.
foreach my $file ( $f1, $f2 ) {
delete $comparison{$file}{content};
}
printf qq{Comparison: %5.2f%% similarity\n},
compute_ratio( \%comparison, ( $f1, $f2 ) );
sub order_pair {
my ( $i, $j ) = @_;
my ( $min, $max ) = ( $i, $j );
( $min, $max ) = ( $max, $min ) if ( $max < $min );
return ( $min, $max );
}
sub compute_ratio {
my ( $hash, @fl ) = @_;
my ( $min, $min_result, $max, $ratio );
( $min_result, undef ) = order_pair(
$hash->{test1}{gzip_length},
$hash->{test2}{gzip_length}
);
( $min, $max ) = order_pair(
$hash->{ $fl[0] }{gzip_length},
$hash->{ $fl[1] }{gzip_length}
);
# Files are 100% if they match exactly.
# (a + b - a) / b = b / b = 1
# Files are 0% if they do not match exactly.
# (a + b - (a+b)) / b = 0 / b = 0
# Ratio computed as how close to the minimal size
$ratio = 100.0 * ( $min + $max - $min_result ) / $max;
return $ratio;
}
sub computations {
my ($hash) = @_;
# Compute gzipped length of content
$hash->{length} = length $hash->{content};
gzip \$hash->{content}, \$hash->{compressed};
# Gzipped content not needed
# after length has been computed.
$hash->{gzip_length} = length $hash->{compressed};
delete $hash->{compressed};
}
sub get_content {
my ( $fn, $hash ) = @_;
# Slurp in file content in bin mode
open( DF, $fn ) or die $!;
binmode DF;
$/ = undef;
$hash->{content} = <DF>;
close DF;
}
`

The code above uses IO::Compress::Gzip, and takes the two files as arguments on the command line. The ratio in this case is computed by saying that 100% would be if the content of the two files compressed to be the same size as one of them, 0% as compressing to the sum of the sizes of the two files.

(By the way, for the example files from my earlier post, this gave a result of 90.61%, which is a fair approximation.)

**Update:** 2007-09-24

Added comments to code; had reversed description of ratio.