PerlMonks  

comparing 2 files

by garyboyd (Acolyte)
on Apr 06, 2011 at 16:43 UTC

garyboyd has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl monks, I am trying to compare 2 files with similar data. For each line in File 1, I want to check whether it is contained in File 2. If it isn't, I want to print that line from File 1 (with an additional column containing 0); if it is, I want to print the matching line from File 2.

eg

File 1:

133-1452_chromosomal_replication_initiation_protein_

1457-2557_DNA_polymerase_III_subunit_beta_

2579-3670_recombination_protein_F_

3687-6104_DNA_gyrase_subunit_B_

c8268-7159_aspartate-semialdehyde_dehydrogenase_

c8692-8471_prophage_p2_ogr_protein_

File 2:

c8268-7159_aspartate-semialdehyde_dehydrogenase_ 33

c8692-8471_prophage_p2_ogr_protein_ 574

1457-2557_DNA_polymerase_III_subunit_beta_ 123

Output file:

133-1452_chromosomal_replication_initiation_protein_ 0

1457-2557_DNA_polymerase_III_subunit_beta_ 123

2579-3670_recombination_protein_F_ 0

3687-6104_DNA_gyrase_subunit_B_ 0

c8268-7159_aspartate-semialdehyde_dehydrogenase_ 33

c8692-8471_prophage_p2_ogr_protein_ 574

The files are not sorted in any way, so the lines are not consecutive in either file. I have tried adapting a number of bits of code that I have found on the web, but none of these are working properly. So far I have this:

#!/usr/bin/perl
use strict;
use warnings;

open (OUT,">outputfile.txt");
open my $fh1, '<', 'file1.txt';
open my $fh2, '<', 'file2.txt';

while( defined( my $line1 = <$fh1> ) and defined( my $line2 = <$fh2> ) ){
    chomp $line1;
    chomp $line2;
    my $string = $line2;
    $string =~ m{^*\t};
    print $string."\n";
    if( $line1 eq $string ){
        print OUT $line2."\n";
    }else{
        print OUT $line1."\n";
    }
}

close $fh1;
close $fh2;
close OUT;

But this just prints out a list of lines from the second file. Any help would be appreciated!

Replies are listed 'Best First'.
Re: comparing 2 files
by ikegami (Patriarch) on Apr 06, 2011 at 16:49 UTC
    diff -u file1 file2

    If you want to customise the output, you can build your own diff tool around Algorithm::Diff.

    If order isn't important and each item appears no more than once per file,

    my %seen;
    ++$seen{$_} while <$fh_in1>;
    --$seen{$_} while <$fh_in2>;

    for (keys(%seen)) {
        print($fh_out1 $_) if $seen{$_} > 0;
        print($fh_out2 $_) if $seen{$_} < 0;
    }

    If order isn't important and some items might appear more than once in a file,

    my %seen1; ++$seen1{$_} while <$fh_in1>;
    my %seen2; ++$seen2{$_} while <$fh_in2>;

    print $fh_out1 grep !$seen2{$_}, keys %seen1;
    print $fh_out2 grep !$seen1{$_}, keys %seen2;

    Update: Added tons.
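
For illustration, here is a minimal, self-contained sketch of the second variant above (items may repeat within a file). It uses in-memory arrays in place of the filehandles so it can be run as-is; the sub name `only_in_each` is hypothetical and not part of the reply.

```perl
use strict;
use warnings;

# Hypothetical helper: given two array refs of lines, return the lines
# that occur only in the first and only in the second. Results are
# sorted because hash key order is not defined.
sub only_in_each {
    my ($lines1, $lines2) = @_;
    my (%seen1, %seen2);
    $seen1{$_}++ for @$lines1;
    $seen2{$_}++ for @$lines2;
    my @only1 = sort grep { !$seen2{$_} } keys %seen1;
    my @only2 = sort grep { !$seen1{$_} } keys %seen2;
    return (\@only1, \@only2);
}

my ($only1, $only2) = only_in_each([qw(a b b c)], [qw(b c d)]);
print "only in first:  @$only1\n";   # prints "only in first:  a"
print "only in second: @$only2\n";   # prints "only in second: d"
```

Note that this compares whole lines, so it reports differences between the files rather than producing the merged output asked for in the question.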

Re: comparing 2 files
by Eliya (Vicar) on Apr 06, 2011 at 17:18 UTC

    Create a lookup table from file 2 with the keys being the part of the line excluding the last column. Then go through file 1 and check if the entry is in the lookup table:

    open my $fh1, '<', 'file1.txt' or die $!;
    open my $fh2, '<', 'file2.txt' or die $!;
    open my $outfh, '>', "outputfile.txt" or die $!;

    my %lines;

    while (<$fh2>) {
        chomp;
        my ($key, $n) = /(.*)\s(\d+)$/;
        $lines{$key} = $n;
    }

    while (<$fh1>) {
        chomp;
        if (exists $lines{$_}) {
            print $outfh "$_ $lines{$_}\n";
        } else {
            print $outfh "$_ 0\n";
        }
    }

    (this assumes your lookup keys are unique — in case the same key can occur multiple times in file 2, you'd have to think about which entry you then want to have in the output file...)

      Thank you to all of the monks for their suggestions. The code written by Eliya works very well.
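
The caveat about repeated keys in file 2 can be made concrete. The following is only a sketch of the lookup-table approach under one possible policy: when a key occurs more than once in file 2, the first value seen is kept. That policy, the sub name `merge_counts`, and the use of in-memory arrays instead of files are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the output lines in file 1 order.
# Assumption: if a key repeats in file 2, the first value wins.
sub merge_counts {
    my ($file1, $file2) = @_;
    my %count_for;
    for my $line (@$file2) {
        # Split off the trailing numeric column; skip lines without one.
        my ($key, $n) = $line =~ /^(.*\S)\s+(\d+)\s*$/ or next;
        $count_for{$key} = $n unless exists $count_for{$key};
    }
    my @out;
    for my $key (@$file1) {
        my $n = exists $count_for{$key} ? $count_for{$key} : 0;
        push @out, "$key $n";
    }
    return @out;
}

my @file1 = (
    '133-1452_chromosomal_replication_initiation_protein_',
    '1457-2557_DNA_polymerase_III_subunit_beta_',
    'c8268-7159_aspartate-semialdehyde_dehydrogenase_',
);
my @file2 = (
    'c8268-7159_aspartate-semialdehyde_dehydrogenase_ 33',
    '1457-2557_DNA_polymerase_III_subunit_beta_ 123',
);
print "$_\n" for merge_counts(\@file1, \@file2);
```

Keeping the last value instead would only require dropping the `unless exists` guard; summing the values would be another reasonable choice, depending on what the numbers mean.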

Re: comparing 2 files
by sundialsvc4 (Abbot) on Apr 06, 2011 at 17:24 UTC

    The best way to handle such things, in my experience, is always to sort the two files (or copies thereof) using a disk-based sort, then simply compare the two files sequentially.   It’s a practical strategy that dates back to before digital computers, and it’s “startlingly effective,” no matter what kind of data volume you might be dealing with.   Yes indeed, there are CPAN modules.
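
A rough sketch of the sort-then-compare idea follows. It assumes the data fits in memory and uses Perl's built-in `sort` as a stand-in for a disk-based sort; it also assumes each item appears at most once per list. The sub name `compare_sorted` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: sort both lists, then walk them in step.
# Returns three array refs: only in first, only in second, in both.
sub compare_sorted {
    my ($first, $second) = @_;
    my @x = sort @$first;    # in-memory sort stands in for a disk-based one
    my @y = sort @$second;
    my (@only_x, @only_y, @both);
    my ($i, $j) = (0, 0);
    while ($i < @x && $j < @y) {
        my $cmp = $x[$i] cmp $y[$j];
        if    ($cmp < 0) { push @only_x, $x[$i++] }
        elsif ($cmp > 0) { push @only_y, $y[$j++] }
        else             { push @both, $x[$i]; $i++; $j++ }
    }
    push @only_x, @x[$i .. $#x];   # leftovers once one list is exhausted
    push @only_y, @y[$j .. $#y];
    return (\@only_x, \@only_y, \@both);
}

my ($only_x, $only_y, $both) = compare_sorted([qw(d a c)], [qw(c b)]);
print "only in first:  @$only_x\n";   # prints "only in first:  a d"
print "only in second: @$only_y\n";   # prints "only in second: b"
print "in both:        @$both\n";     # prints "in both:        c"
```

Because each list is read once from front to back after sorting, the comparison itself never needs more than the current item from each side, which is what makes the approach suitable for files too large to hold in a hash.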
