Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re^2: Module for Approximate File Comparison

by atcroft (Monsignor)
on Sep 24, 2007 at 18:44 UTC ( #640772=note: print w/ replies, xml ) Need Help??


in reply to Re: Module for Approximate File Comparison
in thread Module for Approximate File Comparison

You mean something like:

#!/usr/bin/perl use strict; use warnings; use IO::Compress::Gzip qw(gzip); my ( $f1, $f2 ) = @ARGV[ 0 .. 1 ]; my %comparison; # Get file contents, and compute gzipped lengths. for my $file ( $f1, $f2 ) { get_content( $file, \%{ $comparison{$file} } ); computations( \%{ $comparison{$file} } ); } # Try both orderings of files to determine which produces # better results. The content is not needed after lengths # have been computed. $comparison{test1}{content} = $comparison{$f1}{content} . $comparison{$f2}{content}; computations( \%{ $comparison{test1} } ); delete $comparison{test1}{content}; $comparison{test2}{content} = $comparison{$f2}{content} . $comparison{$f1}{content}; computations( \%{ $comparison{test2} } ); delete $comparison{test2}{content}; # The original content is no longer needed. foreach my $file ( $f1, $f2 ) { delete $comparison{$file}{content}; } printf qq{Comparison: %5.2f%% similarity\n}, compute_ratio( \%comparison, ( $f1, $f2 ) ); sub order_pair { my ( $i, $j ) = @_; my ( $min, $max ) = ( $i, $j ); ( $min, $max ) = ( $max, $min ) if ( $max < $min ); return ( $min, $max ); } sub compute_ratio { my ( $hash, @fl ) = @_; my ( $min, $min_result, $max, $ratio ); ( $min_result, undef ) = order_pair( $hash->{test1}{gzip_length}, $hash->{test2}{gzip_length} ); ( $min, $max ) = order_pair( $hash->{ $fl[0] }{gzip_length}, $hash->{ $fl[1] }{gzip_length} ); # Files are 100% if they match exactly. # (a + b - a) / b = b / b = 1 # Files are 0% if they do not match exactly. # (a + b - (a+b)) / b = 0 / b = 0 # Ratio computed as how close to the minimal size $ratio = 100.0 * ( $min + $max - $min_result ) / $max; return $ratio; } sub computations { my ($hash) = @_; # Compute gzipped length of content $hash->{length} = length $hash->{content}; gzip \$hash->{content}, \$hash->{compressed}; # Gzipped content not needed # after length has been computed. $hash->{gzip_length} = length $hash->{compressed}; delete $hash->{compressed}; } sub get_content { my ( $fn, $hash ) = @_; # Slurp in file content in bin mode open( DF, $fn ) or die $!; binmode DF; $/ = undef; $hash->{content} = <DF>; close DF; }

The code above uses IO::Compress::Gzip, and takes the two files as arguments on the command line. The ratio in this case is computed by saying that 100% would be if the content of the two files compressed to be the same size as one of them, 0% as compressing to the sum of the sizes of the two files.

(By the way, for the example files from my earlier post, this gave a result of 90.61%, which is a fair approximation.)

Update: 2007-09-24
Added comments to code; had reversed description of ratio.


Comment on Re^2: Module for Approximate File Comparison
Download Code
Re^3: Module for Approximate File Comparison
by salva (Monsignor) on Sep 24, 2007 at 19:46 UTC
    If I recall correctly, zlib (and so, gzip) operate over a data window of 64Kb or less, so your method would not work if the files are bigger than that.

      True-any compression method that uses a data window in a similar manner would experience the same issue, albeit the size at which it would occur would be dependent upon the size of said window. The code above was merely an example.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://640772]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (5)
As of 2014-09-20 10:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (158 votes), past polls