Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Most of the complication is in place to reduce file reading to a bare minimum. Say you have two 1 Gbyte files. The size is exactly the same, but the files are very different. I wouldn't want to read and digest both files to understand they are different, when it's enough to read a few bytes in the same position. My program deals rather well with these cases. It starts by reading a small chunk from all files of the same size and uses that chunk as key to partition the group of files. If any subset contains more than one file, then read another chunk starting from another (preferably far) position and iterate.

It's more or less like the naif "real life" way of comparing things. If you have two books with a blank cover, to check if they are different you first compare the size. If it's the same, you open the same page from both and check if they differ. Only if the books are the same you need to keep on reading until the end.

Moreover, by using byte by byte comparison instead of hashing, you don't even risk false positives. As small as the risk may be, it will most surely happen for your presentation due tomorrow.

Package Finder::Looper takes care of the iteration. Each call to $looper->next returns a new pair ( start, length ) within a given range, so that consecutive calls sample from different parts of the file. That's the "interlaced" part (which I should maybe have called "interleaved", but hey! this side of the world it's not the best time for choosing names in foreign languages).

Having said this, the program probably needs some tweaking to better exploit filesystem/buffering/head-positioning optimizations.


Antonio Bellezza

The stupider the astronaut, the easier it is to win the trip to Vega - A. Tucket

In reply to Re: •Re: Interlaced duplicate file finder by abell
in thread Interlaced duplicate file finder by abell

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others surveying the Monastery: (5)
    As of 2018-04-23 06:08 GMT
    Find Nodes?
      Voting Booth?