PerlMonks  

Re: Verifying data in large number of textfiles

by SciDude (Friar)
on Aug 18, 2004 at 01:40 UTC


in reply to Verifying data in large number of textfiles

Using a consistent algorithm may provide you with a consistent set of identical "rips" from your webpage. Just for a moment, let's assume that unlikely scenario holds.

You need to combine a loop over all files in the directory with your comparison and sorting options.

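The first line alone may not be the best basis for comparison, since two files can share a first line and still differ further down. I would suggest comparing a Digest::MD5 checksum of each whole file instead, using the following untested code - mostly ripped from the docs: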

use strict;
use warnings;
use Digest::MD5;

my %seen;
my $dirname = "/path/to/files";

# Parse over files in directory
opendir(my $dh, $dirname) or die "can't open $dirname: $!";

# Take a careful look at each file in $dirname
while (defined(my $name = readdir($dh))) {
    my $file = "$dirname/$name";
    next unless -f $file;    # skip ".", "..", and subdirectories

    open(my $fh, '<', $file) or die "Can't open '$file': $!";
    binmode($fh);

    # make a $hash of each file
    my $hash = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);

    # store a copy of this $hash and compare it with all others seen
    unless ($seen{$hash}++) {
        # this is a unique file
        # do something with it here - perhaps move it to a /unique location
    }
}
closedir($dh);
...code is untested
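
If you would rather see which files are duplicates of each other than just act on the first copy, the same loop can collect file names per digest. The following is only a minimal sketch under my own assumptions: the subroutine name group_by_digest and the command-line usage are hypothetical, not something from the Digest::MD5 docs or the original question.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper (an assumption for illustration): returns a hash ref
# mapping each MD5 hex digest to the list of files that have that content.
sub group_by_digest {
    my ($dirname) = @_;
    my %files_for;

    opendir(my $dh, $dirname) or die "can't open $dirname: $!";
    for my $name (sort readdir($dh)) {
        my $file = "$dirname/$name";
        next unless -f $file;    # skip ".", "..", and subdirectories

        open(my $fh, '<', $file) or die "Can't open '$file': $!";
        binmode($fh);
        my $hash = Digest::MD5->new->addfile($fh)->hexdigest;
        close($fh);

        # remember every file that produced this digest
        push @{ $files_for{$hash} }, $file;
    }
    closedir($dh);

    return \%files_for;
}

# Usage: pass the directory as the first argument and print every
# group of files that share identical content.
if (@ARGV) {
    my $groups = group_by_digest($ARGV[0]);
    for my $hash (sort keys %$groups) {
        my @files = @{ $groups->{$hash} };
        print "$hash: @files\n" if @files > 1;
    }
}
```

Keeping the whole list per digest costs a little more memory than a plain counter, but it lets you decide afterwards which copy to keep.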

SciDude
The first dog barks... all other dogs bark at the first dog.

