With 700 lines you can easily read one file into a hash and
then compare each row in the other file against that hash.
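The hash approach for small files can be sketched like this (the function name and its "return lines unique to each file" interface are my own invention, not something from the original post):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines unique to each file as two array
# references. Order of the first list is not preserved, since it comes from
# hash keys. Suitable for small files only, as file 1 is held in memory.
sub compare_files_by_hash
{
    my($filename1, $filename2) = @_;

    open(my $fh1, '<', $filename1) or die "Could not open $filename1: $!";
    my %in1;
    while(my $line = <$fh1>)
    {
        chomp $line;
        $in1{$line} = 1;        # remember every line of file 1
    }
    close $fh1;

    open(my $fh2, '<', $filename2) or die "Could not open $filename2: $!";
    my(%seen2, @in2only);
    while(my $line = <$fh2>)
    {
        chomp $line;
        $seen2{$line} = 1;
        push @in2only, $line unless exists $in1{$line};
    }
    close $fh2;

    # Anything in file 1 that file 2 never mentioned is unique to file 1.
    my @in1only = grep { !exists $seen2{$_} } keys %in1;

    return(\@in1only, \@in2only);
}
```

Note that unlike the merge-based version below, this does not require the files to be sorted, but it keeps all of file 1 in memory.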
With larger files (1 MB and up), you can save a lot of memory by noticing that the files appear to be sorted alphabetically. In that case most of the lines will be present in both files, so storing only the differing rows will not consume an insane amount of memory :-)
Here is a mergesort-like way to do it:
=head1 compare_sorted_files_by_line($filename1, $filename2)
Finds lines that are present in only one of the files, whose names are
given as arguments. This function assumes that the lines in the files are
in alphabetical order.
Returns the unique rows of each file as two array references. The first
points to an array containing the rows present in $filename1 only, and the
second likewise for $filename2.
Returns an empty list if either of the files could not be opened for reading.
=cut
sub compare_sorted_files_by_line( $$ )
{
    my($filename1, $filename2) = @_;
    my(@in1only, @in2only);     # The mismatching rows are collected in these

    my($fh1, $fh2);
    unless(open($fh1, '<', $filename1))
        { warn "$0: Could not open $filename1: $!\n"; return (); }
    unless(open($fh2, '<', $filename2))
        { warn "$0: Could not open $filename2: $!\n"; close $fh1; return (); }

    my $line1 = <$fh1>;
    my $line2 = <$fh2>;
    while(defined($line1) and defined($line2))
    {
        my $compare = $line1 cmp $line2;
        if($compare == 0)       # present in both files; skip it
        {
            $line1 = <$fh1>;
            $line2 = <$fh2>;
        }
        elsif($compare > 0)     # $line2 sorts first, so it is unique to file 2
        {
            push(@in2only, $line2);
            $line2 = <$fh2>;
        }
        else                    # $line1 sorts first, so it is unique to file 1
        {
            push(@in1only, $line1);
            $line1 = <$fh1>;
        }
    }

    # Whatever remains in one file after the other ran out is unique to it.
    if(defined($line1))
    {
        push(@in1only, $line1);
        push(@in1only, $_) while(<$fh1>);
    }
    if(defined($line2))
    {
        push(@in2only, $line2);
        push(@in2only, $_) while(<$fh2>);
    }
    close $fh1;
    close $fh2;

    # We happen to like strings without newlines.
    chomp(@in1only);
    chomp(@in2only);
    return(\@in1only, \@in2only);
}
-Bass