Module for Approximate File Comparison

by neversaint (Deacon)
on Sep 24, 2007 at 06:31 UTC (#640673=perlquestion)
neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,

Is there any CPAN module similar to File::Compare, but one that compares files by approximate matching? That is, we want to check how similar the two files are (as a percentage, etc.).

---
neversaint and everlastingly indebted.......

Re: Module for Approximate File Comparison
by Anonymous Monk on Sep 24, 2007 at 07:57 UTC
Re: Module for Approximate File Comparison
by atcroft (Monsignor) on Sep 24, 2007 at 08:32 UTC

    The complex part would seem to be how to compute a meaningful "percentage".

    For instance, I have 2 files of 39 lines each, that differ due to formatting. Just looking at line comparisons (using eq), 27 of the 39 lines match (69.23%). Using Algorithm::Diff, though, there are 4 "hunks" of changes reported by the diff() function, but those 4 "hunks" consisted of 17 changes (8 additions, 9 subtractions).

    Thoughts?
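
    The simpler of the two measures above, a position-by-position comparison with eq (the 27-of-39 figure), might be sketched as follows. This is only a sketch: the names line_match_percent and read_lines are made up for illustration, and dividing by the length of the longer file is an assumption, not something stated in the post.

```perl
use strict;
use warnings;

# Read a text file into an array reference of chomped lines.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Couldn't open $path: $!\n";
    chomp(my @lines = <$fh>);
    close $fh;
    return \@lines;
}

# Percentage of line positions at which both files hold the same text.
# Assumption: when the files differ in length, the longer one sets the
# denominator, so surplus lines count as mismatches.
sub line_match_percent {
    my ($lines_a, $lines_b) = @_;
    my $longest = @$lines_a > @$lines_b ? @$lines_a : @$lines_b;
    return 100 if $longest == 0;    # two empty files are identical

    my $shortest = @$lines_a < @$lines_b ? @$lines_a : @$lines_b;
    my $matches  = grep { $lines_a->[$_] eq $lines_b->[$_] } 0 .. $shortest - 1;
    return 100 * $matches / $longest;
}

# Only runs when two file names are given on the command line.
if (@ARGV == 2) {
    my ($lines_a, $lines_b) = map { read_lines($_) } @ARGV;
    printf "%.2f%% of lines match\n", line_match_percent($lines_a, $lines_b);
}
```

    A single inserted line shifts every later line out of position, so this measure can report a low percentage for files that a diff would consider nearly identical. That is the gap between the two numbers in the post.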

Re: Module for Approximate File Comparison
by stark (Pilgrim) on Sep 24, 2007 at 09:13 UTC

    Maybe you can use String::Approx for this task. Depending on the types of files you want to compare it should be possible to build a small tool based on this module.

    If you are only interested in counting the lines that differ, you can use the standard diff utility.

Re: Module for Approximate File Comparison
by sago (Scribe) on Sep 24, 2007 at 10:14 UTC

    Try this program

    #!/usr/bin/perl
    #
    # `Diff' program in Perl
    # Copyright 1998 M-J. Dominus. (mjd-perl-diff@plover.com)
    #
    # This program is free software; you can redistribute it and/or modify it
    # under the same terms as Perl itself.
    #

    use Algorithm::Diff qw(diff);

    #bag("Usage: $0 oldfile newfile") unless @ARGV == 2;

    chdir("C:/Documents and Settings/San_raj/Desktop");
    $file1="C:/Documents and Settings/San_raj/Desktop/file1.txt";
    $file2="C:/Documents and Settings/San_raj/Desktop/file2.txt";

    #my ($file1, $file2) = @ARGV;

    # -f $file1 or bag("$file1: not a regular file");
    # -f $file2 or bag("$file2: not a regular file");

    -T $file1 or bag("$file1: binary");
    -T $file2 or bag("$file2: binary");

    open (F1, $file1) or bag("Couldn't open $file1: $!");
    open (F2, $file2) or bag("Couldn't open $file2: $!");
    chomp(@f1 = <F1>);
    close F1;
    chomp(@f2 = <F2>);
    close F2;

    $diffs = diff(\@f1, \@f2);
    exit 0 unless @$diffs;

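    # Print every changed line, one block per hunk.
    foreach $chunk (@$diffs) {
        foreach $line (@$chunk) {
            # Each entry is [sign, zero-based line number, text].
            my ($sign, $lineno, $text) = @$line;
            printf "%4d$sign %s\n", $lineno+1, $text;
        }
        print "--------\n";
    }
    exit 1;

    # Report an error and exit with status 2.
    sub bag {
        my $msg = shift;
        $msg .= "\n";
        warn $msg;
        exit 2;
    }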

      Why do you comment out the lines involving @ARGV, which allow MJD's script to be usable by anyone anywhere, and then add lines that make the script only work on two particular files found on a particular machine?

      Why did you not use <code> tags, as per Writeup Formatting Tips, so others could download it more easily?

      (I was also going to ask why didn't you just post a link to the site where the script was originally posted, but at the moment, my browser is telling me it can't find http://perl.plover.com -- weird.)

Re: Module for Approximate File Comparison
by blokhead (Monsignor) on Sep 24, 2007 at 13:19 UTC
    A pretty neat and simple method is one outlined by Zaxo in Re: similar texts !?. Basically, you measure how much it helps a compression algorithm to concatenate the two files together, compared to when you compress them independently.

    If you think of a compression algorithm as a mild approximation to Shannon entropy, then this approach is essentially computing the corresponding approximation of mutual information (normalized to the range 0.5 - 1), which is the intuitive concept you seem to be looking for.

    blokhead
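
    A minimal sketch of the idea, assuming gzip (via IO::Compress::Gzip) as the compressor: compare the size of the two texts compressed together with the sum of their sizes compressed separately. The helper names gzipped_size, compression_ratio and slurp are hypothetical, and this is not the code from the linked node.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Size in bytes of a string after gzip compression.
sub gzipped_size {
    my ($data) = @_;
    my $compressed;
    gzip \$data => \$compressed
        or die "gzip failed: $GzipError\n";
    return length $compressed;
}

# "Compressed together" divided by "compressed separately".
# Near 0.5 for identical inputs, near 1 for unrelated inputs.
sub compression_ratio {
    my ($text_a, $text_b) = @_;
    my $together   = gzipped_size($text_a . $text_b);
    my $separately = gzipped_size($text_a) + gzipped_size($text_b);
    return $together / $separately;
}

# Read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Couldn't open $path: $!\n";
    local $/;    # no record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Only runs when two file names are given on the command line.
if (@ARGV == 2) {
    my ($text_a, $text_b) = map { slurp($_) } @ARGV;
    printf "ratio: %.3f\n", compression_ratio($text_a, $text_b);
}
```

    Lower is more similar here. The fixed gzip header and trailer add a few bytes to every result, so very small inputs will not reach the ends of the range.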

      You mean something like:

      The code above uses IO::Compress::Gzip, and takes the two files as arguments on the command line. The ratio in this case is computed by saying that 100% would be if the content of the two files compressed to be the same size as one of them, 0% as compressing to the sum of the sizes of the two files.

      (By the way, for the example files from my earlier post, this gave a result of 90.61%, which is a fair approximation.)

      Update: 2007-09-24
      Added comments to code; had reversed description of ratio.
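
      The percentage described above can be written as a small function of three compressed sizes. This is a sketch rather than the original listing: similarity_percent is a hypothetical name, and taking "one of them" to mean the larger of the two compressed sizes is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Map three compressed sizes onto a 0..100 scale:
#   $size_both == $size_a + $size_b        ->   0% (nothing shared)
#   $size_both == max($size_a, $size_b)    -> 100% (assumed meaning of
#                                                  "the same size as one of them")
sub similarity_percent {
    my ($size_a, $size_b, $size_both) = @_;
    my $worst = $size_a + $size_b;
    my $best  = max($size_a, $size_b);
    return 100 if $worst == $best;    # degenerate case: one size is zero
    return 100 * ($worst - $size_both) / ($worst - $best);
}

# Example with made-up sizes: 1000 and 800 bytes alone, 1400 bytes together.
printf "%.2f%%\n", similarity_percent(1000, 800, 1400);
```

      Real compressors add per-stream overhead, so the result can fall slightly outside 0..100 and may need clamping.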

        If I recall correctly, zlib (and so gzip) operates over a sliding window of at most 32 KB, so your method would not work if the files are bigger than that.
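
        The window limit is easy to observe with a small experiment: compress a block of pseudo-random bytes once, then compress the same block repeated twice, and compare the sizes. This is a sketch under assumptions: the block sizes (8 KB and 128 KB) are chosen to sit well below and well above the window, and the names gzipped_size, noise and doubling_cost are made up for illustration.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

# Size in bytes of a string after gzip compression.
sub gzipped_size {
    my ($data) = @_;
    my $compressed;
    gzip \$data => \$compressed
        or die "gzip failed: $GzipError\n";
    return length $compressed;
}

# Pseudo-random bytes, which are effectively incompressible on their own.
sub noise {
    my ($length) = @_;
    return join '', map { chr(int(rand(256))) } 1 .. $length;
}

# Compressed size of "data twice" divided by compressed size of "data once".
# Close to 1 when the compressor can see the repeat, close to 2 when it cannot.
sub doubling_cost {
    my ($data) = @_;
    return gzipped_size($data . $data) / gzipped_size($data);
}

srand 42;    # fixed seed so repeated runs use the same data
printf "  8 KB block: doubling cost %.2f\n", doubling_cost(noise(8 * 1024));
printf "128 KB block: doubling cost %.2f\n", doubling_cost(noise(128 * 1024));
```

        For files larger than the window, a compressor with a larger block or dictionary size would be needed for the concatenation trick to notice the shared content.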
Re: Module for Approximate File Comparison
by CountZero (Bishop) on Sep 24, 2007 at 20:19 UTC
    how similar are the two files

    For whatever definition of "similar".

    Does changing the formatting or the character encoding of the file make it less similar? Adding or deleting the word "not" in a sentence makes the file only a little less "similar", but changes its meaning drastically.

    I think that without a very clear definition of "similarity" your percentage will be meaningless.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
