http://www.perlmonks.org?node_id=951861

faber has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys,

I'm wondering if anyone here has thought about a high-speed checksum for video fingerprinting. If possible, I'd like to avoid reading an entire video file to determine its checksum; instead, I would like to read only segments of the data.

My first thought was to run crc32 against selected segments of each file (say, 1 megabyte out of every 2 megabytes of data) or something like that.

I understand that without checksumming the entire file it's very hard to guarantee uniqueness; however, I'm more concerned with speed.

Any thoughts?
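
The segment-sampling idea in the question (CRC-32 over 1 MB out of every 2 MB) could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `strided_crc32` and its `$take` / `$stride` parameters are hypothetical, and it uses `crc32` from the core Compress::Zlib module rather than any module named in the thread.

```perl
use strict;
use warnings;
use Compress::Zlib;    # core module; exports crc32()

# Hypothetical helper: running CRC-32 over the first $take bytes
# of every $stride bytes of the file.
sub strided_crc32 {
    my ( $file, $take, $stride ) = @_;
    $take   //= 1024 * 1024;        # read 1 MB ...
    $stride //= 2 * 1024 * 1024;    # ... out of every 2 MB
    die "take must be between 1 and stride\n"
        if $take < 1 || $take > $stride;

    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    my $size = -s $fh;

    my $crc = 0;                    # 0 is zlib's initial CRC-32 value
    for ( my $pos = 0; $pos < $size; $pos += $stride ) {
        seek $fh, $pos, 0 or die "seek failed: $!";
        my $got = read $fh, my $buf, $take;
        die "read failed: $!" unless defined $got;
        last if $got == 0;
        $crc = crc32( $buf, $crc );    # continue the CRC across segments
    }
    close $fh;
    return $crc;
}
```

Note that this still reads half of the file, so it only halves the I/O; it does not make the cost independent of file size.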

---

Alright guys, the first generation of File::Fingerprint::Huge is up on CPAN. I'll update it with some further refinements as I move forward. Thanks for all of your help!

Re: high speed checksum for video finger printing?
by BrowserUk (Patriarch) on Feb 04, 2012 at 22:01 UTC

    Use the filesize to seed a random number generator, and then read 100 random 4- or 8-byte chunks from the file, stick'em together and checksum them.

    The odds of duplicates are billions to 1.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Ah yes, this is a great idea and could be very useful for the right kinds of data-management cases. I think I'm going to do this, and will likely call it File::Fingerprint::Huge, as suggested, if no one has anything similar to this already.
        if no one has anything similar to this already.

        Nothing I've seen, so go for it.

        My suggestion would be to use Math::Random::MT as the PRNG. It is portable and reproducible cross-platform.

        Then something like:

        use Math::Random::MT qw[ rand srand ];
        use Digest::CRC qw[ crc64 ];

        sub fingerPrintFile {
            my $file = shift;
            my $filesize = -s( $file );
            ## Same size => same seed => same sample positions.
            srand $filesize;
            open my $fh, '<', $file or die $!;
            binmode $fh;
            ## assuming CRC-64
            my $chunks = int( $filesize / 8 ) - 1;
            ## Added sort per RichardK's suggestion below.
            my @posns = sort { $a <=> $b } map 8 * int( rand $chunks ), 1 .. 100;
            my $rawSample = join '', map {
                seek $fh, $_, 0;
                read( $fh, my $chunk, 8 );
                $chunk
            } @posns;
            close $fh;
            return crc64( $rawSample );
        }
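
        For readers who want to try the sampling idea without installing Math::Random::MT or Digest::CRC, a variant using only core modules might look like the sketch below. It is not the code from the post above: it swaps in MD5-derived offsets for the seeded Mersenne Twister draws and an MD5 digest for the CRC-64, and the name `sampled_fingerprint` and its parameters are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

# Hypothetical helper, not from the thread: hash $samples chunks of
# $chunk_len bytes taken at positions that depend only on the file size.
sub sampled_fingerprint {
    my ( $file, $samples, $chunk_len ) = @_;
    $samples   //= 100;
    $chunk_len //= 8;

    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    my $size = -s $fh;

    # Number of aligned chunk slots; at least 1 so tiny files still work.
    my $slots = int( $size / $chunk_len ) || 1;

    my @posns;
    for my $i ( 1 .. $samples ) {
        # 48 bits of MD5("size:i") stand in for a seeded PRNG draw
        # (assumes a Perl built with 64-bit integers).
        my ( $hi, $lo ) = unpack 'nN', md5("${size}:${i}");
        push @posns, $chunk_len * ( ( $hi * 4294967296 + $lo ) % $slots );
    }

    my $raw = '';
    for my $pos ( sort { $a <=> $b } @posns ) {    # ascending seeks
        seek $fh, $pos, 0 or die "seek failed: $!";
        my $got = read $fh, my $chunk, $chunk_len;
        die "read failed: $!" unless defined $got;
        $raw .= $chunk;
    }
    close $fh;

    # Make the file size part of the fingerprint.
    return md5_hex("${size}:${raw}");
}
```

        As with the original, two files of identical size that differ only outside the sampled chunks will get the same fingerprint, so a match is best treated as "probably the same file" rather than proof.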


Re: high speed checksum for video finger printing?
by InfiniteSilence (Curate) on Feb 04, 2012 at 23:14 UTC

    I was going to recommend File::Fingerprint, but I realized that your files are likely to be HUGE, so this will not work efficiently. However, I would take BrowserUk's recommendation, build a new module called File::Fingerprint::Enormous or perhaps File::Fingerprint::BigVideo, package it, and post it back up to CPAN.

    Celebrate Intellectual Diversity