Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re: Module for Approximate File Comparison

by blokhead (Monsignor)
on Sep 24, 2007 at 13:19 UTC ( #640714=note: print w/ replies, xml ) Need Help??


in reply to Module for Approximate File Comparison

A pretty neat and simple method is one outlined by Zaxo in Re: similar texts !?. Basically, you measure how much it helps a compression algorithm to concatenate the two files together, compared to when you compress them independently.

If you think of a compression algorithm as a mild approximation to Shannon entropy, then this approach is essentially computing the corresponding approximation of mutual information (normalized to the range 0.5 - 1), which is the intuitive concept you seem to be looking for.

blokhead


Comment on Re: Module for Approximate File Comparison
Re^2: Module for Approximate File Comparison
by atcroft (Monsignor) on Sep 24, 2007 at 18:44 UTC

    You mean something like:

    The code above uses IO::Compress::Gzip, and takes the two files as arguments on the command line. The ratio in this case is computed by saying that 100% would be if the content of the two files compressed to be the same size as one of them, 0% as compressing to the sum of the sizes of the two files.

    (By the way, for the example files from my earlier post, this gave a result of 90.61%, which is a fair approximation.)

    Update: 2007-09-24
    Added comments to code; had reversed description of ratio.

      If I recall correctly, zlib (and so, gzip) operate over a data window of 64Kb or less, so your method would not work if the files are bigger than that.

        True-any compression method that uses a data window in a similar manner would experience the same issue, albeit the size at which it would occur would be dependent upon the size of said window. The code above was merely an example.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://640714]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (12)
As of 2014-07-23 22:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (154 votes), past polls