ram has asked for the wisdom of the Perl Monks concerning the following question:
Folks,
I've got a problem. I want to compare two files, check whether one is a duplicate of the other, and if so remove it. I used File::Compare and it works great, but only to an extent: whitespace isn't ignored, it's treated as part of the content, so files that differ only in whitespace come out as different.
I tried diff with the options to ignore whitespace, and that works, but when I run it against a ton of text files (and I'll be running my script against these files every day) it is very, very slow. Is there any other way of comparing two text files?
Thanks
Ram
Re: Comparing two files
by IraTarball (Monk) on May 22, 2001 at 06:20 UTC
use File::Compare qw(compare_text);

# Line comparator for compare_text: returns false (0) when the two
# lines are equal once runs of whitespace are collapsed to one space.
sub one_space {
    my $line1 = shift;
    my $line2 = shift;
    $line1 =~ s/\s+/ /g;
    $line2 =~ s/\s+/ /g;
    return $line1 ne $line2;
}

if (compare_text('spaces.txt', 'more_spaces.txt', \&one_space) == 0) {
    print "they're equal\n";
} else {
    print "they're different\n";
}
Re: Comparing two files
by Tuna (Friar) on May 22, 2001 at 06:28 UTC
#!/usr/bin/perl -w
use strict;
use Array::Compare;

# WhiteSpace => 0 ignores whitespace differences; Case => 0 ignores case.
my $comp = Array::Compare->new(Sep => '|', WhiteSpace => 0, Case => 0);

my $file1 = "/etc/modules.conf";
my $file2 = "/etc/modules.conf2";

# Note: "or", not "||" -- "open FILE1, $file1 || die ..." would bind
# the || to $file1 and never trigger the die.
open FILE1, $file1 or die "Can't open file 1: $!\n";
my @lines1 = <FILE1>;
open FILE2, $file2 or die "Can't open file 2: $!\n";
my @lines2 = <FILE2>;

if ($comp->compare(\@lines1, \@lines2)) {
    print "Arrays are the same\n";
} else {
    print "Arrays are different\n";
}
Re (tilly) 1: Comparing two files
by tilly (Archbishop) on May 22, 2001 at 07:18 UTC
If you have enough memory, you could just have a hash that maps the normalized contents of each file to the file's name; that turns the whole comparison into a hash lookup. But if you have tons of files, you probably don't have that much memory.
You can still use the same strategy with MD5 hashes, though. And indeed here is some (partially tested) sample code for this problem:
#!/usr/bin/perl -w
use strict;
use Digest::MD5 qw(md5);

my %file_hash;
foreach my $file (@ARGV) {
    my $key = md5(normalize_text(slurp_file($file)));
    push @{$file_hash{$key}}, $file;
}

foreach my $files (values %file_hash) {
    if (@$files < 2) {
        next;
    }
    else {
        # $files is an anonymous array of files, which
        # are *probably* all duplicates of each other.
        # Put appropriate logic here.  Were it not for
        # memory limits, *this* would be the whole
        # script!
        my %file_of;
        foreach my $file (@$files) {
            my $text = normalize_text(slurp_file($file));
            if (exists $file_of{$text}) {
                print "$file_of{$text} and $file are dups\n";
                unlink($file) or die "Cannot delete $file: $!";
            }
            else {
                $file_of{$text} = $file;
            }
        }
    }
}

# Takes text, normalizes whitespace and returns it.
sub normalize_text {
    my $text = shift;
    $text =~ s/\s+/ /g;
    $text =~ s/^ //;
    $text =~ s/ \z//;
    return $text;
}

# Takes a file name, returns the contents in a string.
sub slurp_file {
    local @ARGV = shift;
    local $/;              # slurp mode: read the whole file at once
    return <>;
}
Re: Comparing two files
by Beatnik (Parson) on May 22, 2001 at 12:18 UTC
#!/usr/bin/perl
use strict;
use Algorithm::Diff qw(diff LCS);

my @seq1 = ("A" .. "N");
my @seq2 = ("F" .. "Z");

# diff() returns an empty list when the sequences are identical,
# so a non-empty result means they are not equal.
my @diff = diff( \@seq1, \@seq2 );
if (@diff) {
    # Not equal
}
# LCS( \@seq1, \@seq2 ) similarly gives the longest common subsequence.
Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.
Thanks, guys, for pitching in. I tried using Array::Compare.
Here's my test: I took 15 folders containing the same 285 files and ran File::Compare, diff, and Array::Compare against them. File::Compare took 14 seconds, diff took 10 minutes 30 seconds, and Array::Compare took close to 8 minutes.
I need something that's as fast as File::Compare but one that can take care of the whitespace and case issues.
Is there any C function, or the like, available that can be called from my Perl code?
Ram
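One option along those lines: Digest::MD5 ships with Perl and its digest core is compiled C (via XS), so normalizing each file's text once and comparing 16-byte MD5 digests keeps nearly all of the per-byte work in C. A minimal sketch of that idea (the two file names are hypothetical):

```perl
#!/usr/bin/perl -w
use strict;
use Digest::MD5 qw(md5);

# Read a whole file into one string.
sub slurp {
    my $name = shift;
    open my $fh, $name or die "Can't open $name: $!";
    local $/;              # slurp mode
    return <$fh>;
}

# Collapse whitespace runs, trim the ends, fold case, then hash
# the result with the C-backed MD5 implementation.
sub normalized_md5 {
    my $text = shift;
    $text =~ s/\s+/ /g;
    $text =~ s/^ //;
    $text =~ s/ $//;
    return md5(lc $text);
}

my ($a_file, $b_file) = ('a.txt', 'b.txt');   # hypothetical names
if (normalized_md5(slurp($a_file)) eq normalized_md5(slurp($b_file))) {
    print "duplicates (ignoring whitespace and case)\n";
}
```

Equal digests almost certainly mean duplicates; if deletion is on the line, a byte-for-byte recheck of the normalized text (as in tilly's script above) guards against the vanishingly rare collision.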
I suggest you check elsewhere in this very thread for something that will likely be very fast while still being very easy to write. (I've really grown to like sorting replies by reputation; see your User Settings.)
Update: Though, if you want to ignore blank lines as well, that won't work. So here:
# Assumes FILE1 and FILE2 are already open.
my( $lineA, $lineB );
while( 1 ) {
    # Grab the next non-blank line from each file (undef at EOF).
    $lineA = do { while(<FILE1>) { last if /\S/ } $_ };
    $lineB = do { while(<FILE2>) { last if /\S/ } $_ };
    last if ! defined $lineA || ! defined $lineB;
    for( $lineA, $lineB ) {
        s/\s+/ /g;     # collapse whitespace runs
        s/^ //;        # trim leading space
        s/ $//;        # trim trailing space
        $_ = lc $_;    # ignore case
    }
    last if $lineA ne $lineB;
}
if( defined($lineA) || defined($lineB) ) {
warn "The files are different!\n";
}
-
tye
(but my friends call me "Tye")
The reason Algorithm::Diff is slow is that it has a totally different purpose (though it can be used to do what you need). Algorithm::Diff is an implementation of the algorithm behind UNIX's diff, which shows you the differences between files; diff's output is what you feed to patch when bugs are found and fixed.
Anyway, more information is available in the Algorithm::Diff POD, Dominus has a page on it, and of course you can check the diff manpages.
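For reference, diff() returns its result as a list of hunks, each a reference to an array of changes of the form [ sign, position, element ], which is the patch-oriented structure that makes it overkill for a plain equality test. A small sketch:

```perl
#!/usr/bin/perl
use strict;
use Algorithm::Diff qw(diff);

my @old = qw(a b c d);
my @new = qw(a c d e);

# Each hunk groups adjacent changes; each change is
# [ '+' or '-', position, element ] ('-' positions index @old,
# '+' positions index @new).
foreach my $hunk (diff( \@old, \@new )) {
    foreach my $change (@$hunk) {
        my ($sign, $pos, $elem) = @$change;
        print "$sign $pos $elem\n";
    }
}
```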
Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.