PerlMonks  

As noted, the speed of MD5ing files is dominated by the file IO. So the easiest way of speeding up the digest is to do less IO.

Here's a simple method of speeding up the process. Instead of doing a full MD5 of the file each time, use this subroutine, which combines the file's length with the 4KB at the start of every 1MB block in the file:

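    use Digest::MD5;

    # Digest the file's length plus the first 4KB of every 1MB block.
    # Returns the Digest::MD5 object; call ->hexdigest on it for the signature.
    sub quickMD5 {
        my $fh  = shift;
        my $md5 = Digest::MD5->new;
        $md5->add( -s $fh );
        my $pos = 0;
        until( eof $fh ) {
            seek $fh, $pos, 0;
            read( $fh, my $block, 4096 ) or last;
            $md5->add( $block );
            $pos += 1024**2;
        }
        return $md5;
    }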

This runs ~4 times faster than a full MD5. It will, however, statistically give false positives -- say that two files are identical when they are actually different -- in approximately 1% of cases.

So, when the quickMD5 says two files are the same, you must then run the real MD5 on both files. But since you only have to do that in 1% of cases, and you sped up the other 99% by 4 times, the overall effect is to speed up the process with no loss of accuracy.
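The check-then-confirm flow described above might look something like the following. This is a minimal sketch, not code from the original post: quick_sig, full_sig and files_identical are hypothetical names, and the helpers take file paths rather than an already-open filehandle.

```perl
use strict;
use warnings;
use Digest::MD5;

# Partial digest: file length plus the first 4KB of every 1MB block.
sub quick_sig {
    my $path = shift;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $md5 = Digest::MD5->new;
    $md5->add( -s $fh );
    for( my $pos = 0; ; $pos += 1024**2 ) {
        seek $fh, $pos, 0 or last;
        read( $fh, my $block, 4096 ) or last;   # 0 bytes read means past EOF
        $md5->add( $block );
    }
    close $fh;
    return $md5->hexdigest;
}

# Full MD5 of the whole file.
sub full_sig {
    my $path = shift;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $hex = Digest::MD5->new->addfile( $fh )->hexdigest;
    close $fh;
    return $hex;
}

# Cheap rejection first; pay for the full digest only when the
# partial digests agree.
sub files_identical {
    my( $path_a, $path_b ) = @_;
    return 0 if quick_sig( $path_a ) ne quick_sig( $path_b );
    return full_sig( $path_a ) eq full_sig( $path_b ) ? 1 : 0;
}
```

When comparing many files you would probably compute each signature once and cache it, rather than recomputing it for every pair as this sketch does.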

(Note: It should go without saying that this is no good at all for files less than 1MB, and you should probably use a proper MD5 for any file < 10MB. But MD5 doesn't take too long on small files anyway.)
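To see why small files are a poor fit, it helps to count how many bytes the partial digest actually reads. A rough sketch: sampled_bytes and use_full_md5 are hypothetical helpers, and the 10MB cutoff is simply the figure suggested in the note above.

```perl
use strict;
use warnings;

# Number of bytes the partial digest reads for a file of $size bytes:
# up to 4096 bytes at the start of every 1MB block.
sub sampled_bytes {
    my $size = shift;
    my( $step, $chunk ) = ( 1024**2, 4096 );
    my $total = 0;
    for( my $pos = 0; $pos < $size; $pos += $step ) {
        my $left = $size - $pos;
        $total += $left < $chunk ? $left : $chunk;
    }
    return $total;
}

# Assumed policy: fall back to a full MD5 below 10MB.
sub use_full_md5 {
    my $size = shift;
    return $size < 10 * 1024**2 ? 1 : 0;
}

for my $size ( 512_000, 10 * 1024**2, 536_870_913 ) {
    printf "%12d bytes: %9d sampled, full MD5 advised: %s\n",
        $size, sampled_bytes( $size ), use_full_md5( $size ) ? 'yes' : 'no';
}
```

A 500KB file has only its first 4096 bytes read, so any two files of that length that differ later on get the same partial digest.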

A simple test script:
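The listing itself is missing at this point. As a stand-in, here is a minimal sketch of a timing script that prints output in the same shape as the md5t runs. It is an assumption about what such a script could look like, not the original: timed_full_md5 is a hypothetical name, and swapping in the partial digest would give the quickMD5 variant.

```perl
use strict;
use warnings;
use Digest::MD5;
use Time::HiRes qw( time );

# Hypothetical helper: full MD5 of a file plus the elapsed wall-clock time.
sub timed_full_md5 {
    my $path = shift;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $start   = time();
    my $hex     = Digest::MD5->new->addfile( $fh )->hexdigest;
    my $elapsed = time() - $start;
    close $fh;
    return ( $hex, $elapsed );
}

# Only runs when a filename is supplied on the command line.
if( @ARGV ) {
    my $file = shift @ARGV;
    printf "Processing %s : %s bytes\n", $file, -s $file;
    my( $hex, $elapsed ) = timed_full_md5( $file );
    printf "Full MD5 took %.6f seconds\n", $elapsed;
    print  "Full MD5: $hex\n";
}
```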

And a couple of test runs comparing the speed of both algorithms. The apparently eclectic ordering of the runs is to ensure that subsequent runs against the same file don't benefit from the "hot cache" effect:

    C:\test>md5t 500MB.csv
    Processing 500MB.csv : 536870913 bytes
    Full MD5 took 6.350180 seconds
    Full MD5: 3c81ccb7d2d7febc96c92b4d7dd4c797

    C:\test>quickMD5 25GB.csv
    Processing 25GB.csv : 26843545600 bytes
    Partial MD5 took 88.150608 seconds
    Partial MD5: 408ee1dabed25f1fbe4b25511c8b8287

    C:\test>quickMD5 500MB.csv
    Processing 500MB.csv : 536870913 bytes
    Partial MD5 took 2.081898 seconds
    Partial MD5: 894c2792caeaac64072d7189d5724ecc

    C:\test>md5t 25GB.csv
    Processing 25GB.csv : 26843545600 bytes
    Full MD5 took 302.419120 seconds
    Full MD5: 24ce5b913f2f49876f0f24031b9b5d9b
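The speedup on these particular runs can be worked out directly from the timings. A small sketch, with speedup as a hypothetical helper name:

```perl
use strict;
use warnings;

# Ratio of full-digest time to partial-digest time.
sub speedup {
    my( $full_secs, $partial_secs ) = @_;
    return $full_secs / $partial_secs;
}

# Timings copied from the runs above.
printf "500MB.csv: %.2fx\n", speedup(   6.350180,  2.081898 );
printf "25GB.csv:  %.2fx\n", speedup( 302.419120, 88.150608 );
```

On these two runs the ratio works out to about 3.05x and 3.43x, so the ~4x figure is best read as approximate; the gain will depend on the disk and the file.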

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


In reply to Re: [OT]:Faster signature algorithm than md5? (4x faster) by BrowserUk
in thread [OT]:Faster signature algorithm than md5? by swampyankee
