PerlMonks  

Simple comparison of 2 files

by Q.and (Novice)
on Jul 27, 2016 at 18:47 UTC ( [id://1168667] : perlquestion )

Q.and has asked for the wisdom of the Perl Monks concerning the following question:

I know this is an elementary problem, so if this is a repeated question, please kindly flag, but I have yet to find something that works in the way I'm thinking about the problem. If the code seems overcomplicated for the task at hand, it's in part because the real script and files are more complex and because I'm new, so any suggested solutions that stay as close as possible to the code given here will be most appreciated.

Say I have two files, FILE1 contains:
A  1_1
A  1_2
B  1_3
C  1_4

and FILE2 is:
A  2_1
B  2_2

I would like to compare both files and have it print:

A from FILE1 with number 1_1 and A from FILE2 with number 2_1 match
A from FILE1 with number 1_1 and B from FILE2 with number 2_2 DO NOT match
A from FILE1 with number 1_2 and A from FILE2 with number 2_1 match
A from FILE1 with number 1_2 and B from FILE2 with number 2_2 DO NOT match
B from FILE1 with number 1_3 and A from FILE2 with number 2_1 DO NOT match
B from FILE1 with number 1_3 and B from FILE2 with number 2_2 match
C from FILE1 with number 1_4 and A from FILE2 with number 2_1 DO NOT match
C from FILE1 with number 1_4 and B from FILE2 with number 2_2 DO NOT match

The way I've been constructing the code so far, however, obviously does not loop over the files in the way that I'm aiming for. That code is below, along with example output.

#!perl

open (FILE1, $ARGV[0]);
open (FILE2, $ARGV[1]);

while ($_ = <FILE1>) {
    chomp;
    @FILE1 = split;
    ($FILE1letter, $FILE1number) = @FILE1;
    @FILE2 = split(' ',<FILE2>);
    ($FILE2letter, $FILE2number) = @FILE2;
    # print "$FILE1letter from FILE1 with number $FILE1number and $FILE2letter from FILE2 with $FILE2number match\n"; #prints the same as below
    if ($FILE1letter == $FILE2letter) {
        print "$FILE1letter from FILE1 with number $FILE1number and $FILE2letter from FILE2 with number $FILE2number match\n";
    }
    else {
        print "$FILE1letter from FILE1 with number $FILE1number and $FILE2letter from FILE2 with number $FILE2number DO NOT match\n";
    }
}

Output from above code:
A from FILE1 with number 1_1 and A from FILE2 with number 2_1 match
B from FILE1 with number 1_2 and C from FILE2 with number 2_2 match
C from FILE1 with number 1_3 and from FILE2 with number match

I would appreciate any direction, brief explanation of what about my current code is not feasible for producing my desired output, and/or suggestions for better constructing the script. Thanks in advance.

Clarification: The goal of the real script is not just to print matches and mismatches, but to do many things within each loop. However, I realized that my if statement was not evaluating correctly, as it was printing "match" even when there was no match, and I think the larger problem is the looping structure of the entire code. In the real script, I would like it to evaluate lines from the two files ONLY when $FILE1letter is equal to $FILE2letter, but I have simplified it to just a difference-detection problem above in order to help myself learn how to solve smaller problems within larger ones.

Replies are listed 'Best First'.
Re: Simple comparison of 2 files
by AnomalousMonk (Archbishop) on Jul 27, 2016 at 23:00 UTC

    Here's an approach that combines a nested while-loop/for-loop with validation of input. As has been mentioned before, reading the "small" file to an array is practical for files of several million to a few score million lines, depending on how much system RAM you have available. (Of course, you may want to die rather than warn if you see an invalid input line.)

    File iter_2_files_1.pl:

    use warnings;
    use strict;
    use autodie;

    use Data::Dump qw(pp);

    # data extraction and validation regexes.
    my $rx_L = qr{ [[:upper:]] }xms;
    my $rx_N = qr{ \d+ _ \d+ }xms;
    my $rx_line = qr{ \A ($rx_L) \s+ ($rx_N) \s* \z }xms;

    # in-memory test files (for convenience only).
    my $f_large = qq{A 1_1\nA 1_2\nB 1_3\nLaRgE\nC 1_4};
    my $f_small = qq{A 2_1\nSmAlL\nB 2_2};

    # read, validate and process small file, hold in array.
    open my $fh_small, '<', \$f_small;
    my @small_file_line_fields =
        map {
            my $valid = my ($letter, $number) = $_ =~ $rx_line;
            warn qq{bad small file line '$_'} unless $valid;
            $valid ? [ $letter, $number ] : ();
        }
        <$fh_small>
        ;
    close $fh_small;
    print 'small file: ', pp(\@small_file_line_fields), qq{\n\n};

    # process large file line-by-line.
    open my $fh_large, '<', \$f_large;
    LARGE:
    while (my $line_large = <$fh_large>) {
        my $valid = my ($large_L, $large_N) = $line_large =~ $rx_line;
        warn qq{bad large file line '$line_large'} and next LARGE unless $valid;

        # iterate over all lines of small file for each line of large file.
        SMALL:
        for my $ar_fields (@small_file_line_fields) {
            my ($small_L, $small_N) = @$ar_fields;
            printf qq{%s from %s with number %s and %s from %s with number %s },
                $large_L, 'FILE1', $large_N, $small_L, 'FILE2', $small_N;
            print 'DO NOT ' if $large_L ne $small_L;
            print qq{match \n};
        } # end for SMALL loop
    } # end while LARGE loop
    close $fh_large;

    Output:
    c:\@Work\Perl\monks\Q.and>perl iter_2_files_1.pl
    bad small file line 'SmAlL
    ' at iter_2_files_1.pl line 107, <$_[...]> line 3.
    small file: [["A", "2_1"], ["B", "2_2"]]

    A from FILE1 with number 1_1 and A from FILE2 with number 2_1 match
    A from FILE1 with number 1_1 and B from FILE2 with number 2_2 DO NOT match
    A from FILE1 with number 1_2 and A from FILE2 with number 2_1 match
    A from FILE1 with number 1_2 and B from FILE2 with number 2_2 DO NOT match
    B from FILE1 with number 1_3 and A from FILE2 with number 2_1 DO NOT match
    B from FILE1 with number 1_3 and B from FILE2 with number 2_2 match
    bad large file line 'LaRgE
    ' at iter_2_files_1.pl line 121, <$_[...]> line 4.
    C from FILE1 with number 1_4 and A from FILE2 with number 2_1 DO NOT match
    C from FILE1 with number 1_4 and B from FILE2 with number 2_2 DO NOT match


    Give a man a fish:  <%-{-{-{-<

Re: Simple comparison of 2 files
by haukex (Archbishop) on Jul 28, 2016 at 10:38 UTC

    Hi Q.and,

    Here's another alternative implementation using Tie::File. This core module lets you access the records (lines) of a file through a normal Perl @array. This is mostly transparent: internally it manages caching and writing to the file. It's relatively efficient and lets you process large files without worrying about those things. In this case it lets you write your loops fairly simply as two nested for loops. If you wanted to keep one (or both) of the files cached in memory, you can increase the memory option of Tie::File.
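
    As a rough sketch of the memory option mentioned above: the helper name count_letter, the temporary file, and the 20_000_000-byte cache size are all made up for illustration, and the right cache size depends on your files and your RAM.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use File::Temp qw(tempfile);
use Tie::File;

# count_letter is a hypothetical helper for this sketch: it ties the file
# read-only with the given cache limit (in bytes) and counts the records
# whose first field is the given letter.
sub count_letter {
    my ($file, $letter, $cache_bytes) = @_;
    tie my @lines, 'Tie::File', $file,
        mode   => O_RDONLY,
        memory => $cache_bytes
        or die "Cannot tie $file: $!";
    my $count = grep { /^\Q$letter\E\s/ } @lines;
    untie @lines;
    return $count;
}

# Throwaway sample file so the sketch is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "A 1_1\nA 1_2\nB 1_3\nC 1_4\n";
close $tmp_fh or die "close: $!";

print "lines starting with A: ", count_letter($tmp_name, 'A', 20_000_000), "\n";
```

    The memory value is only an upper bound on the read cache, so raising it should matter mainly when you loop over the same file many times, as the inner loop of a nested comparison does.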

    Your posting actually contains some Unicode characters (U+2003 EM SPACE), so I'm going to guess that your source file contains those too, and I've added handling for that to the following script (if you've got a modern version of Perl the regexps will handle Unicode fairly well too). If your input files are instead plain ASCII you can remove the UTF-8 handling from the script if you like. (Update: My code assumes the files are encoded in UTF-8; there are of course other Unicode encodings possible.)

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use open qw/:std :utf8/;  # STDIN/OUT/ERR in utf8
    use Tie::File;

    # normally "tie my @file1, 'Tie::File', '/tmp/file1' or die ...",
    # but we need utf8, note the following opens the files read-only
    open my $fh1, '<:utf8', '/tmp/file1' or die $!;
    tie my @file1, 'Tie::File', $fh1 or die $!;
    open my $fh2, '<:utf8', '/tmp/file2' or die $!;
    tie my @file2, 'Tie::File', $fh2 or die $!;

    for (@file1) {
        my ($l1, $n1) = /^(\w+)\s+(\S+)\s*$/
            or die "Bad file1 line: $_";
        for (@file2) {
            my ($l2, $n2) = /^(\w+)\s+(\S+)\s*$/
                or die "Bad file2 line: $_";
            print "$l1 from FILE1 with number $n1 ",
                "and $l2 from FILE2 with number $n2 ";
            if ($l1 eq $l2)
                { print "match\n" }
            else
                { print "DO NOT match\n" }
        }
    }

    untie @file1;
    close $fh1;
    untie @file2;
    close $fh2;

    This approach makes sense if you really need to operate on every line of file1 combined with every line of file2. However, later on in your post you say "In the real script, I would like it to evaluate lines from the two files ONLY when $FILE1letter is equal to $FILE2letter", which leads me to think that maybe there is a different way of approaching the problem that could be more efficient: maybe what you're trying to do is like a JOIN? There are many different ways to approach that problem in Perl, for example using hashes, or even using a database approach (e.g. put your data in a database, even an in-memory one like DBD::SQLite; or DBD::CSV... although I'm not sure the latter one would be more efficient on large files). If you could tell us more about the problem you're trying to solve, and give sample input/code/output that is more representative of that problem, then perhaps we can suggest a better solution.
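
    To make the hash idea a bit more concrete, here is a minimal sketch of a JOIN-like lookup. It assumes the data looks like the sample input in the question (a letter, whitespace, a number); join_on_letter and the two sample arrays are invented names for illustration, not anything from the real script.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two input files.
my @file1_lines = ("A 1_1", "A 1_2", "B 1_3", "C 1_4");
my @file2_lines = ("A 2_1", "B 2_2");

# join_on_letter is a made-up helper for this sketch. It returns
# [letter, number1, number2] for every pair of lines whose letters are
# equal, without visiting the pairs that do not match.
sub join_on_letter {
    my ($lines1, $lines2) = @_;

    # Index the second file: letter => list of numbers seen with it.
    my %numbers_by_letter;
    for my $line (@$lines2) {
        my ($letter, $number) = split ' ', $line;
        next unless defined $number;          # skip blank or short lines
        push @{ $numbers_by_letter{$letter} }, $number;
    }

    my @pairs;
    for my $line (@$lines1) {
        my ($letter, $number) = split ' ', $line;
        next unless defined $number;
        # Only letters that also occur in the second file produce pairs.
        for my $other (@{ $numbers_by_letter{$letter} || [] }) {
            push @pairs, [ $letter, $number, $other ];
        }
    }
    return @pairs;
}

for my $pair (join_on_letter(\@file1_lines, \@file2_lines)) {
    my ($letter, $n1, $n2) = @$pair;
    print "$letter: FILE1 number $n1 pairs with FILE2 number $n2\n";
}
```

    With the sample data this visits three pairs instead of eight, because each file is read only once and the lines that cannot match are never compared.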

    Hope this helps,
    -- Hauke D

Re: Simple comparison of 2 files
by Marshall (Canon) on Jul 27, 2016 at 19:05 UTC
    You've got some code with issues. It's great that you have provided some example input and output. I'm having trouble understanding the algorithm between "match" and "no match". Can you explain further with some text? Then I can comment further on the code.

    B from FILE1 with number 1_3 and B from FILE2 with number 2_2 match
    What is it that "matches"?

      Of course. The real input files have many more variables, but I've tried to simplify both the code and input files without losing the main point of the script. By match, I just mean does $FILE1letter equal $FILE2letter, as in A==A. By no match, I mean $FILE1letter = A and $FILE2letter=B, so A!=B. The goal of the real script is really not to just print matches and mismatches, but to do many things within each conditional, but I realized that my if statement was not working correctly, as it was printing even when there was no match and I think the larger problem is the looping structure of the entire code. Thanks for the help.
        Ok, that makes it clear. Your issue is with the line:
        if ($FILE1letter == $FILE2letter) {
        Use the eq operator to compare strings. == compares numerically.
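
        A small sketch of why the == test always came out true (num_equal and str_equal are made-up helper names, just for this illustration): in numeric context a non-numeric string such as 'A' or 'B' is treated as 0, so == ends up comparing 0 with 0.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
sub num_equal {
    my ($x, $y) = @_;
    no warnings 'numeric';    # silence the "isn't numeric" warning for the demo
    return $x == $y ? 1 : 0;  # non-numeric strings are both treated as 0
}

sub str_equal {
    my ($x, $y) = @_;
    return $x eq $y ? 1 : 0;  # compares the strings themselves
}

print "A == B: ", num_equal('A', 'B'), "\n";   # prints 1 (true)
print "A eq B: ", str_equal('A', 'B'), "\n";   # prints 0 (false)
print "A eq A: ", str_equal('A', 'A'), "\n";   # prints 1 (true)
```

        With warnings enabled and without the "no warnings" line, Perl reports that the argument isn't numeric, which is one reason use warnings; would have flagged the problem.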
        Ok, try this below. You need an array to hold file2 so that you can loop over it for each entry in file1. Add use strict; and use warnings; they are a huge help. I would only use all CAPS for bareword file handles, so I changed the variable names accordingly.

        #!/usr/bin/perl
        use warnings;
        use strict;
        use Data::Dumper;

        my $file1 = "A 1_1\nA 1_2\nB 1_3\nC 1_4\n";
        my $file2 = "A 2_1\nB 2_2\n";

        open FILE1, "<", \$file1 or die "unable file1 $!";
        open FILE2, "<", \$file2 or die "unable file2 $!";

        my @file2;          #need to compare each line in file1 against all
        while (<FILE2>)     #lines in file2, so create an array of file 2
        {
            chomp;
            push @file2, $_;
        }

        while (<FILE1>)
        {
            chomp;
            my ($file1letter, $file1number) = split (' ', $_);
            foreach my $line2 (@file2)
            {
                my ($file2letter, $file2number) = split (' ', $line2);
                if ($file1letter eq $file2letter)
                {
                    print "$file1letter from FILE1 with number $file1number and $file2letter from FILE2 with number $file2number match\n";
                }
                else
                {
                    print "$file1letter from FILE1 with number $file1number and $file2letter from FILE2 with number $file2number DO NOT match\n";
                }
            }
        }
        __END__
        A from FILE1 with number 1_1 and A from FILE2 with number 2_1 match
        A from FILE1 with number 1_1 and B from FILE2 with number 2_2 DO NOT match
        A from FILE1 with number 1_2 and A from FILE2 with number 2_1 match
        A from FILE1 with number 1_2 and B from FILE2 with number 2_2 DO NOT match
        B from FILE1 with number 1_3 and A from FILE2 with number 2_1 DO NOT match
        B from FILE1 with number 1_3 and B from FILE2 with number 2_2 match
        C from FILE1 with number 1_4 and A from FILE2 with number 2_1 DO NOT match
        C from FILE1 with number 1_4 and B from FILE2 with number 2_2 DO NOT match

        update: I didn't see the code from pryrt before I hit the send button. The code above reads FILE2 into a memory array. Opening and reading a file is an "expensive" operation both CPU and time wise. Unless FILE2 is humongous, reading it once from the disk is the best approach.
Re: Simple comparison of 2 files
by neilwatson (Priest) on Jul 27, 2016 at 19:02 UTC

    Why not use the diff command instead? Another option is Test::Differences.

    Neil Watson
    watson-wilson.ca

      This is a super simplified version of a script I'm writing, also with super simplified input files. The real goal of the script is not to diff, but I realized that this is a core issue within the script: the if statement is not operating correctly, as it prints "match" even when the variables $FILE1letter and $FILE2letter are not the same, and I think the problem lies within the larger looping structure of the entire code. Hopefully trying to simplify the problems at the core of the script hasn't overcomplicated it further.

      It's not a diff, it's a cross-product.

Re: Simple comparison of 2 files
by Anonymous Monk on Jul 27, 2016 at 20:29 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1168667
    use strict;
    use warnings;

    my $file1 = <<END;
    A 1_1
    A 1_2
    B 1_3
    C 1_4
    END

    my $file2 = <<END;
    A 2_1
    B 2_2
    END

    "$file1\0\n$file2" =~ /
      ^(\S+)\ +(\S+)\n
      .*
      \0\n
      .*?
      ^(\S+)\ +(\S+)\n
      (?{ print "$1 from FILE1 with number $2 and $3 from FILE2 with number $4 ",
        $1 eq $3 ? "match" : "DO NOT match", "\n" })
      (*FAIL)
      /msx;

    prints your desired output