Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

Comparing two files

by ram (Initiate)
on May 22, 2001 at 04:27 UTC ( [id://82142]=perlquestion: print w/replies, xml ) Need Help??

ram has asked for the wisdom of the Perl Monks concerning the following question:

Folks, I've got a problem.I want to compare two files and check if that's a duplicate.If so remove it.I used File::Compare and it works great,but to an extent. Whitespaces aren't ignored but considered as a character.If there are whitespaces, those are treated as characters and they come out as different files. I tried using diff with the suitable options to ignore the whitespace and it works.But when i try it against a ton of text files(i will be running my script against these tons of files every day) it is very very slow.IS there any other way of comparing the two text files? Thanks Ram

Replies are listed 'Best First'.
Re: Comparing two files
by IraTarball (Monk) on May 22, 2001 at 06:20 UTC
    Did you try writing your own line comparison routine and pasing it into File::Compare::compare_text?

    Check out perldoc File::Compare.

    use File::Compare qw (compare_text); sub one_space { $line1 = shift; $line2 = shift; $line1 =~ s/\s+/ /g; $line2 =~ s/\s+/ /g; return $line1 ne $line2; } if (compare_text ('spaces.txt', 'more_spaces.txt', \&one_space ) == 0) { print "they're equal\n" } else { print "they're different\n" }
Re: Comparing two files
by Tuna (Friar) on May 22, 2001 at 06:28 UTC
    Use our very own davorg's Array::Compare
    #!/usr/bin/perl -w use strict; use Array::Compare; my $comp = Array::Compare->new(Sep => '|', WhiteSpace => 0, Case => 0) +; my $file1 = "/etc/modules.conf"; my $file2 = "/etc/modules.conf2"; open FILE1, $file1 || die "Can't open file 1:$!\n"; my @lines1=<FILE1 +>; open FILE2, $file2 || die "Can't open file 2:$!\n"; my @lines2=<FILE2 +>; if ($comp->compare(\@lines1, \@lines2)) { print "Arrays are the same\n"; } else { print "Arrays are different\n"; }

      Thanks for the plug :)

      Worth pointing out a couple of things:

      1. This depends on reading your files into memory. If your files are large, this might not be such a good idea.
      2. You should only change the 'Sep' character to something that won't appear in your data. In general the default value (^G) is fine.

      "Perl makes the fun jobs fun
      and the boring jobs bearable" - me

Re (tilly) 1: Comparing two files
by tilly (Archbishop) on May 22, 2001 at 07:18 UTC
    If you have enough memory, you could just have a hash that goes from the normalized contents of the files to the name of the file. That moves the logic to a hash lookup. But if you have tons of files, well you probably don't have that much memory.

    But you can still use the same strategy using md5 hashes. And indeed here is some (partially tested) sample code for this problem:

    #! /usr/bin/perl -w use strict; use Digest::MD5 qw(md5); my %file_hash; foreach my $file (@ARGV) { my $key = md5(normalize_text(slurp_file($file))); push @{$file_hash{$key}}, $file; } foreach my $files (values %file_hash) { if (@$files < 2) { next; } else { # $files is an anonymous array of files, which # are *probably* all duplicates of each other. # Put appropriate logic here. Were it not for # memory limits, *this* would be the whole # script! my %file_of; foreach my $file (@$files) { my $text = normalize_text(slurp_file($file)); if (exists $file_of{ $text }) { print "$file_of{$text} and $file are dups\n"; unlink($file) or die "Cannot delete $file: $!"; } else { $file_of{$text} = $file; } } } } # Takes text, normalizes whitespace and returns it. sub normalize_text { my $text = shift; $text =~ s/\s+/ /g; $text =~ s/^ //; $text =~ s/ \z//; $text; } # Takes a file, returns the contents in a string sub slurp_file { local @ARGV = shift; local $/; <>; }
Re: Comparing two files
by Beatnik (Parson) on May 22, 2001 at 12:18 UTC
    Ofcourse you could use Algorithm::Diff...
    #!/usr/bin/perl use Algorithm::Diff qw(diff LCS); use strict; my @seq1 = ("A".."N"); my @seq2 = ("F".."Z"); my @diff = (); my @lcs = (); if (@diff = diff( \@seq1, \@seq2 )) { } # Not equal # Same for (@lcs = diff( \@seq1, \@seq2 ))
    ... Quidquid perl dictum sit, altum viditur.
      Thanks guys for pitching in.I tried using the Array::compare. Here's my test.I took 15 folders containing the same 285 files.I ran File::compare,diff and array::compare.It took me 14 secs for the first,10min 30 sec for the second and close to 8 min for the third. I need something that's as fast as file::compare but tone that can take care of the whitespace and case issues. Is there any c function or the likes that are available that can be called from my perl code? Ram

        I suggest you check elsewhere in this very thread to see something that will likely be very fast while still being very easy to write. [ I've really grown to like sorting replies by reputation (see your User Settings) ]

        Update: Though, if you want to ignore blank lines, then that won't work. So here:

        my( $lineA, $lineB ); while( 1 ) { $lineA= do { while(<FILE1>){ last if /\S/ }; $_ }; $lineB= do { while(<FILE2>){ last if /\S/ }; $_ }; last if ! defined $lineA || ! defined $lineB; for( $lineA, $lineB ) { s/\s+/ /g; s/^ //; s/ $//; $_= lc $_; } last if $lineA ne $lineB; } if( defined($lineA) || defined($lineB) ) { warn "The files are different!\n"; }

                - tye (but my friends call me "Tye")
        The reason why Algorithm::Diff is slow, is because is has a totally different purpose (but can be used to do what you need). Algorithm::Diff is an implementation of UNIX's diff, which basically shows you the difference between files. Diff is used with patch when bugs are found/fixed.

        Anyway, more information is available on the Algorithm::Diff POD, Dominus has a page on it, and ofcourse, you can check the diff manpages

        ... Quidquid perl dictum sit, altum viditur.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://82142]
Approved by root
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (3)
As of 2024-06-14 00:11 GMT
Find Nodes?
    Voting Booth?

    No recent polls found

    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.