PerlMonks  

Column by Column Comparison

by documents9900 (Initiate)
on Mar 31, 2013 at 12:52 UTC (#1026350=perlquestion)
documents9900 has asked for the wisdom of the Perl Monks concerning the following question:

I have to write a script for file comparison. The comparison has to be done row by row and then column by column. The script will run for 350 sets of files, i.e. 700 files in total. Around 50 of the files contain more than 1 million records. The key in every file is the first field.

I started with this approach: read the first file and search for each key in the second file. Once the key is found, do a column-by-column comparison and, if the values differ, log the output to a new file, which can later be used to populate a CSV file. The code I started with is as follows:

    open(INFILE1,"File1.txt");
    my @file1=<INFILE1>;
    close INFILE1;
    open(INFILE1,"File2.txt");
    my @file2=<INFILE1>;
    while (<INFILE1>) {
        my @elements = split /\t/, $_;
        my $rowid = @elements[0];
        my @filtered = grep /$rowid/, @file1;
        if ($#filtered ==0) {
            --I will write this in new file..this one's easy;
        }
        else {
            my $numelements=@elements;
            my $count=1;
            while ($count <= $numelements) {
                if (@filtered[$count] != @elements[$count]) {
                    my $str="File 1 Value-".@filtered[$count]." File2 Value-".@elements[$count];
                    print $str;
                }
                $count=$count+1;
            }
        }
    }

This isn't working: I am not able to read the array values in @filtered, which should contain the row from the second file after searching for the row id. The comparison (@filtered[$count] != @elements[$count]) is not working. Is it OK? I also tried using -ne. The data in the files can be strings, numbers or dates, but I am assuming that since it is in a text file, my script can treat everything as text for comparison. Can you please help me identify the issue?

Also, I read that a hash will be faster for comparing this huge set of data. Can someone help with a hash approach? Can this be run in a loop for the 700-odd files using a hash/array?

Re: Column by Column Comparison
by poj (Curate) on Mar 31, 2013 at 13:40 UTC
    Using a hash approach

        use strict;
        my %data=();

        scan (0,'File1.txt');
        scan (1,'File2.txt');
        compare('log.txt');

        # input
        sub scan {
          my ($ix, $infile) = @_;
          open IN, '<', $infile or die "Could not open $infile : $!";
          my $count=0;
          while (<IN>){
            chomp;
            my ($key,$line) = split "\t",$_,2;
            $data{$key}[$ix] = $line;
            ++$count;
          }
          close IN;
          print "$count lines read from $infile\n";
        }

        # output
        sub compare {
          my $logfile = shift;
          open LOG,'>',$logfile or die "Could not open $logfile : $!";
          my $count = 0;
          # compare lines of data using key
          for my $key (sort keys %data){
            if ($data{$key}[0] ne $data{$key}[1]){
              my @f1 = split "\t",$data{$key}[0];
              my @f2 = split "\t",$data{$key}[1];
              for my $c (1..@f1){
                if ($f1[$c-1] ne $f2[$c-1]){
                  print LOG "Row $key Column $c File 1 Value- $f1[$c-1] File 2 Value- $f2[$c-1]\n";
                  ++$count;
                }
              }
            }
          }
          close LOG;
          print "$count lines written to $logfile\n";
        }

    poj
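
A variant of the hash idea, sketched here under a few assumptions, also answers the "700 files in a loop" part of the question: only the first file of each pair is held in a hash, the second file is streamed line by line, and keys present in only one file are reported separately. The names `load_rows`, `compare_pair` and `@pairs` are hypothetical, keys are assumed to be unique within a file, and the demo writes two tiny throwaway files so that it can run on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Read a tab-separated file into a hash: key (first field) => rest of the line.
sub load_rows {
    my ($file) = @_;
    open my $fh, '<', $file or die "Could not open $file: $!";
    my %rows;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;              # skip empty lines
        my ($key, $rest) = split /\t/, $line, 2;
        $rows{$key} = defined $rest ? $rest : '';
    }
    close $fh;
    return \%rows;
}

# Compare one pair of files and return a list of report lines.
# Assumes each key occurs at most once per file.
sub compare_pair {
    my ($file1, $file2) = @_;
    my $rows1 = load_rows($file1);
    my @report;
    open my $fh, '<', $file2 or die "Could not open $file2: $!";
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        my ($key, $rest) = split /\t/, $line, 2;
        $rest = '' unless defined $rest;
        # delete returns the stored row (or undef) and marks the key as seen
        my $other = delete $rows1->{$key};
        if (!defined $other) {
            push @report, "Row $key only in second file";
        }
        elsif ($other ne $rest) {
            push @report, "Row $key differs";
        }
    }
    close $fh;
    # whatever is left in the hash never appeared in the second file
    push @report, "Row $_ only in first file" for sort keys %$rows1;
    return @report;
}

# Demo data; real file names would go into @pairs instead.
my $dir = tempdir(CLEANUP => 1);
my %content = (
    "$dir/a1.txt" => "1\tx\ty\n2\tp\tq\n3\tm\tn\n",
    "$dir/a2.txt" => "1\tx\ty\n2\tp\tZ\n4\tm\tn\n",
);
for my $file (keys %content) {
    open my $out, '>', $file or die "Could not write $file: $!";
    print {$out} $content{$file};
    close $out;
}

my @pairs = ( [ "$dir/a1.txt", "$dir/a2.txt" ] );
for my $pair (@pairs) {
    print "$_\n" for compare_pair(@$pair);
}
```

Holding only one file of each pair in memory roughly halves the footprint compared with storing both, which may matter for the files with more than a million records. Rows reported as differing can then be split on tabs and compared field by field.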
Re: Column by Column Comparison
by moritz (Cardinal) on Mar 31, 2013 at 13:53 UTC

    You should really begin your programs with

        use strict;
        use warnings;

    It catches several potential errors in your script:

        Scalar value @elements[0] better written as $elements[0] at foo.pl line 13.
        Scalar value @filtered[$count] better written as $filtered[$count] at foo.pl line 21.
        Scalar value @elements[$count] better written as $elements[$count] at foo.pl line 21.
        Scalar value @filtered[$count] better written as $filtered[$count] at foo.pl line 22.
        Scalar value @elements[$count] better written as $elements[$count] at foo.pl line 22.
        Bareword "easy" not allowed while "strict subs" in use at foo.pl line 15.
        foo.pl had compilation errors.

    (The last one is just an artifact of your not using a comment where you should have.)

    Second, you should indent your code consistently, indenting the contents of each block by a fixed amount more than the code outside the block.

    And third, it would be much easier to help you if you provided some example input data so that we can run your code, and then told us what you get and what you expected instead.
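
To make the warnings above concrete, here is a minimal sketch of a column comparison that compiles cleanly under strict and warnings. It uses `$array[$i]` for single elements, splits both rows into fields before comparing (grep returns whole lines, not columns), and compares with `ne`, since `!=` compares numerically. The helper name `diff_columns` is hypothetical and not from the original post, and missing columns are treated as empty strings, which is an assumption.

```perl
use strict;
use warnings;

# Compare two tab-separated rows column by column.
# Returns a list of [column_number, value1, value2], one per mismatch.
sub diff_columns {
    my ($row1, $row2) = @_;
    chomp($row1, $row2);
    # A limit of -1 keeps trailing empty fields, so column counts stay comparable.
    my @f1 = split /\t/, $row1, -1;
    my @f2 = split /\t/, $row2, -1;
    my $last = $#f1 > $#f2 ? $#f1 : $#f2;
    my @diffs;
    for my $i (0 .. $last) {
        # Assumption: a column missing from one row counts as an empty string.
        my $v1 = defined $f1[$i] ? $f1[$i] : '';
        my $v2 = defined $f2[$i] ? $f2[$i] : '';
        # String comparison, because the fields may be text, numbers or dates.
        push @diffs, [ $i + 1, $v1, $v2 ] if $v1 ne $v2;
    }
    return @diffs;
}

for my $d (diff_columns("k1\ta\tb\n", "k1\ta\tc\n")) {
    printf "Column %d File 1 Value-%s File 2 Value-%s\n", @$d;
}
```

Because the comparison is a plain string test, values such as 1.0 and 1 or differently formatted dates are reported as different; normalising them first would be a separate step.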

Re: Column by Column Comparison
by Anonymous Monk on Mar 31, 2013 at 22:46 UTC
Re: Column by Column Comparison
by hdb (Parson) on Apr 01, 2013 at 08:43 UTC
