comparing 2 files

garyboyd has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I am trying to compare two files with similar data. For each line in File 1, I want to check whether it is contained in File 2. If it isn't, I want to print the line from File 1 with an additional column containing 0; if it is, I want to print the matching line from File 2.

File 1:

133-1452_chromosomal_replication_initiation_protein_

1457-2557_DNA_polymerase_III_subunit_beta_

2579-3670_recombination_protein_F_

3687-6104_DNA_gyrase_subunit_B_

c8268-7159_aspartate-semialdehyde_dehydrogenase_

c8692-8471_prophage_p2_ogr_protein_

File 2:

c8268-7159_aspartate-semialdehyde_dehydrogenase_ 33

c8692-8471_prophage_p2_ogr_protein_ 574

1457-2557_DNA_polymerase_III_subunit_beta_ 123

Output file:

133-1452_chromosomal_replication_initiation_protein_ 0

1457-2557_DNA_polymerase_III_subunit_beta_ 123

2579-3670_recombination_protein_F_ 0

3687-6104_DNA_gyrase_subunit_B_ 0

c8268-7159_aspartate-semialdehyde_dehydrogenase_ 33

c8692-8471_prophage_p2_ogr_protein_ 574

The files are not sorted in any way, so the lines are not consecutive in either file. I have tried adapting a number of bits of code that I have found on the web, but none of these are working properly. So far I have this:

#!/usr/bin/perl

use strict;
use warnings;

open (OUT,">outputfile.txt");
open my $fh1, '<', 'file1.txt';
open my $fh2, '<', 'file2.txt';

while(
  defined( my $line1 = <$fh1> )
  and
  defined( my $line2 = <$fh2> )
){
  chomp $line1;
  chomp $line2;
  
  my $string = $line2;
  $string =~ m{^*\t};
 print $string."\n";
  if( $line1 eq $string ){
    print OUT $line2."\n";
  }else{
    print OUT $line1."\n";
  }
}

close $fh1;
close $fh2;
close OUT;

But this just prints out a list of lines from the second file. Any help would be appreciated!


Re: comparing 2 files by ikegami (Patriarch) on Apr 06, 2011 at 16:49 UTC
`diff -u file1 file2`

If you want to customise the output, you can build your own diff tool around Algorithm::Diff.

If order isn't important and each item appears no more than once per file,

my %seen;
++$seen{$_} while <$fh_in1>;
--$seen{$_} while <$fh_in2>;
for (keys(%seen)) {
    print($fh_out1 $_) if $seen{$_} > 0;
    print($fh_out2 $_) if $seen{$_} < 0;
}

If order isn't important and some items might appear more than once in a file,

my %seen1; ++$seen1{$_} while <$fh_in1>;
my %seen2; ++$seen2{$_} while <$fh_in2>;
print $fh_out1 grep !$seen2{$_}, keys %seen1;
print $fh_out2 grep !$seen1{$_}, keys %seen2;

Update: Added tons.
Re: comparing 2 files by Eliya (Vicar) on Apr 06, 2011 at 17:18 UTC
Create a lookup table from file 2 with the keys being the part of the line excluding the last column. Then go through file 1 and check if the entry is in the lookup table:

open my $fh1, '<', 'file1.txt' or die $!;
open my $fh2, '<', 'file2.txt' or die $!;
open my $outfh, '>', "outputfile.txt" or die $!;

my %lines;
while (<$fh2>) {
    chomp;
    my ($key, $n) = /(.*)\s(\d+)$/;
    $lines{$key} = $n;
}

while (<$fh1>) {
    chomp;
    if (exists $lines{$_}) {
        print $outfh "$_ $lines{$_}\n";
    } else {
        print $outfh "$_ 0\n";
    }
}

(This assumes your lookup keys are unique. In case the same key can occur multiple times in file 2, you'd have to think about which entry you then want to have in the output file...)
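If keys can repeat in file 2, one way to make that choice explicit is sketched below. This is only an illustration: the helper name `build_lookup`, its `$policy` argument ('first', 'last' or 'sum') and the sample lines are all made up here, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build the lookup table from
# chomped "key value" lines. $policy decides what happens when a key
# repeats: 'first' keeps the first value, 'last' keeps the last one,
# 'sum' adds the values together. Any other policy behaves like 'first'.
sub build_lookup {
    my ($lines, $policy) = @_;
    my %lookup;
    for my $line (@$lines) {
        # Same pattern as in the reply above; skip lines that don't match.
        my ($key, $n) = $line =~ /(.*)\s(\d+)$/ or next;
        if (!exists $lookup{$key} || $policy eq 'last') {
            $lookup{$key} = $n;
        }
        elsif ($policy eq 'sum') {
            $lookup{$key} += $n;
        }
        # With 'first', a repeated key leaves the stored value untouched.
    }
    return \%lookup;
}

# Made-up sample lines with one repeated key.
my $lookup = build_lookup(
    [ 'geneA_ 33', 'geneB_ 574', 'geneA_ 10' ],
    'sum',
);
print "$_ $lookup->{$_}\n" for sort keys %$lookup;
```

The loop over file 1 stays as it is in the reply; only the construction of the table changes.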
Re^2: comparing 2 files by garyboyd (Acolyte) on Apr 07, 2011 at 13:07 UTC
Thank you to all of the monks for their suggestions. The code written by Eliya works very well.
Re: comparing 2 files by sundialsvc4 (Abbot) on Apr 06, 2011 at 17:24 UTC
The best way to handle such things, in my experience, is always to sort the two files (or copies thereof) using a disk-based sort, then simply compare the two files sequentially. It’s a practical strategy that dates back to before digital computers, and it’s “startlingly effective,” no matter what kind of data volume you might be dealing with. Yes indeed, there are CPAN modules.
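A minimal sketch of the sort-then-compare idea, applied to the data from the question. It rests on several assumptions: an in-memory `sort` stands in for the disk-based sort, both lists are ordered with Perl's default string comparison, and the helper name `merge_sorted` is invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread). Both inputs must already be
# sorted by key using the same string ordering:
#   $names  - array ref of file 1 lines
#   $scored - array ref of [key, value] pairs parsed from file 2
sub merge_sorted {
    my ($names, $scored) = @_;
    my @out;
    my $j = 0;    # current position in the file 2 list
    for my $name (@$names) {
        # Advance past file 2 keys that sort before this file 1 line.
        $j++ while $j < @$scored && $scored->[$j][0] lt $name;
        if ($j < @$scored && $scored->[$j][0] eq $name) {
            push @out, "$name $scored->[$j][1]";
        }
        else {
            push @out, "$name 0";
        }
    }
    return @out;
}

# Sample data from the question, sorted in memory for the demonstration.
my @file1 = sort qw(
    133-1452_chromosomal_replication_initiation_protein_
    1457-2557_DNA_polymerase_III_subunit_beta_
    2579-3670_recombination_protein_F_
    3687-6104_DNA_gyrase_subunit_B_
    c8268-7159_aspartate-semialdehyde_dehydrogenase_
    c8692-8471_prophage_p2_ogr_protein_
);
my @file2 = sort { $a->[0] cmp $b->[0] } (
    [ 'c8268-7159_aspartate-semialdehyde_dehydrogenase_', 33 ],
    [ 'c8692-8471_prophage_p2_ogr_protein_',              574 ],
    [ '1457-2557_DNA_polymerase_III_subunit_beta_',       123 ],
);

print "$_\n" for merge_sorted(\@file1, \@file2);
```

Each list is walked only once, so neither file has to be held in a hash. The output comes out in sorted key order rather than in file 1's original order.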

