PerlMonks  

You mean something like:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IO::Compress::Gzip qw(gzip $GzipError);

    die "Usage: $0 file1 file2\n" unless @ARGV == 2;
    my ( $f1, $f2 ) = @ARGV;
    my %comparison;

    # Get file contents, and compute gzipped lengths.
    for my $file ( $f1, $f2 ) {
        get_content( $file, \%{ $comparison{$file} } );
        computations( \%{ $comparison{$file} } );
    }

    # Try both orderings of files to determine which produces
    # better results. The content is not needed after lengths
    # have been computed.
    $comparison{test1}{content} =
        $comparison{$f1}{content} . $comparison{$f2}{content};
    computations( \%{ $comparison{test1} } );
    delete $comparison{test1}{content};

    $comparison{test2}{content} =
        $comparison{$f2}{content} . $comparison{$f1}{content};
    computations( \%{ $comparison{test2} } );
    delete $comparison{test2}{content};

    # The original content is no longer needed.
    foreach my $file ( $f1, $f2 ) {
        delete $comparison{$file}{content};
    }

    printf qq{Comparison: %5.2f%% similarity\n},
        compute_ratio( \%comparison, ( $f1, $f2 ) );

    # Return the two values in ascending order.
    sub order_pair {
        my ( $i, $j ) = @_;
        my ( $min, $max ) = ( $i, $j );
        ( $min, $max ) = ( $max, $min ) if ( $max < $min );
        return ( $min, $max );
    }

    sub compute_ratio {
        my ( $hash, @fl ) = @_;
        my ( $min, $min_result, $max, $ratio );

        # Smaller gzipped length of the two concatenation orderings.
        ( $min_result, undef ) = order_pair(
            $hash->{test1}{gzip_length},
            $hash->{test2}{gzip_length}
        );
        # Gzipped lengths of the individual files, smaller first.
        ( $min, $max ) = order_pair(
            $hash->{ $fl[0] }{gzip_length},
            $hash->{ $fl[1] }{gzip_length}
        );

        # With a = smaller and b = larger gzipped length:
        # Files are 100% if they match exactly.
        #   (a + b - a) / b = b / b = 1
        # Files are 0% if they do not match at all.
        #   (a + b - (a+b)) / b = 0 / b = 0
        # Ratio computed as how close to the minimal size
        $ratio = 100.0 * ( $min + $max - $min_result ) / $max;
        return $ratio;
    }

    sub computations {
        my ($hash) = @_;

        # Compute gzipped length of content
        $hash->{length} = length $hash->{content};
        gzip \$hash->{content}, \$hash->{compressed}
            or die "gzip failed: $GzipError";
        $hash->{gzip_length} = length $hash->{compressed};

        # Gzipped content not needed
        # after length has been computed.
        delete $hash->{compressed};
    }

    sub get_content {
        my ( $fn, $hash ) = @_;

        # Slurp in file content in bin mode
        open( my $fh, '<', $fn ) or die "Cannot open $fn: $!";
        binmode $fh;
        local $/ = undef;
        $hash->{content} = <$fh>;
        close $fh;
    }

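The code above uses IO::Compress::Gzip and takes the two files as arguments on the command line. It gzips each file on its own, then gzips the two files concatenated in both orders and keeps the smaller result. The ratio is scaled so that 100% means the concatenation compressed to the same size as the smaller file alone (roughly what happens when the files are identical), and 0% means it compressed to the sum of the two individually compressed sizes (nothing shared). The difference is divided by the larger of the two individual compressed sizes.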
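To make the arithmetic concrete, here is a minimal sketch of just the ratio step, separated from the file handling. The helper name similarity_pct and the lengths passed to it are made up for illustration and are not part of the script above; only the formula is taken from compute_ratio.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper (not in the original script): takes the gzipped
# lengths of file A, file B, and the two concatenations A.B and B.A.
sub similarity_pct {
    my ( $len_a, $len_b, $len_ab, $len_ba ) = @_;
    my $small = min( $len_a, $len_b );
    my $large = max( $len_a, $len_b );
    my $best  = min( $len_ab, $len_ba );    # better of the two orderings
    return 100.0 * ( $small + $large - $best ) / $large;
}

# Made-up lengths, for illustration only.
printf "%.2f%%\n", similarity_pct( 1000, 1000, 1000, 1010 );    # 100.00%
printf "%.2f%%\n", similarity_pct( 1000, 1200, 2200, 2200 );    # 0.00%
printf "%.2f%%\n", similarity_pct( 1000, 1200, 1300, 1350 );    # 75.00%
```

The first call models identical files, the second files with nothing in common, and the third a partial overlap. Real gzip output carries some fixed overhead, so actual results will rarely hit exactly 100% or 0%.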

(By the way, for the example files from my earlier post, this gave a result of 90.61%, which is a fair approximation.)

Update: 2007-09-24
Added comments to the code; corrected the description of the ratio, which I had reversed.


In reply to Re^2: Module for Approximate File Comparison by atcroft
in thread Module for Approximate File Comparison by neversaint
