Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Compare and group unmatched records from 2 CSV files together

by KIASohc (Novice)
on Jun 24, 2014 at 09:07 UTC ( #1091021=perlquestion: print w/replies, xml ) Need Help??
KIASohc has asked for the wisdom of the Perl Monks concerning the following question:


Good day to All!

Hope all is well with everyone. I just would like to ask for some help with a few items with my code. I am trying compare 2 CSV files and print/display their differences. So far, it is working up to that point.

My problem now are the following:

I would like to group the records in such a way that

- It groups the unmatched records together

- Display the row number corresponding to these records row position from the file

- A comma is used as a separator

Unfortunately, I am unable to do them. Below is a sample data and a sample of what is currently happening and how I want it to appear.

Here is a sample of my data:



FILE 1 Content:

NAME,SURNAME,AGE,GENDER

PAUL,SMITH,28,MALE

RICK,SULLIVAN,31,MALE

SILVIA,CONOR,24,FEMALE

VANI,LOVE,26,FEMALE



FILE 2 Content:

NAME,SURNAME,AGE,GENDER

PAUL,SMITH,28,MALE

RICK,SULLIVAN,30,MALE < -- this row contains a mismatch

SILVIA,CONOR,24,MALE < -- this row contains a mismatch

VANI,LOVE,26,FEMALE



These 2 files are expected to have the same contents



CURRENT OUTPUT:

-------------------------------------------------------------

Row 1 RICKSULLIVAN31MALE

Row 2 SILVIACONOR24FEMALE

Row 3 RICKSULLIVAN30MALE

Row 4 SILVIACONOR24MALE

-------------------------------------------------------------



Note: I am aware that using print $count++ causes the row number just to increment. I just want to see how it would look like on my output.



EXPECTED OUTPUT:

-------------------------------------------------------------

Row 3 RICK,SULLIVAN,31,MALE

Row 3 RICK,SULLIVAN,30,MALE



Row 4 SILVIA,CONOR,24,FEMALE

Row 4 SILVIA,CONOR,24,MALE

-------------------------------------------------------------

Thanks in advance!

Here is the code:

#! /usr/bin/perl use strict; use warnings; my @arr1; my @arr2; my $a; #COMPARE 2 FILES open(FIL,'FILE1.CSV') or die "$!"; #open(FIL,$file) or die "$!"; while (<FIL>) { $a=$_; $a =~ s/[\t;, ]*//g; push @arr1, $a if ($a ne ''); }; close(FIL); open(FIL,'FILE2.csv') or die "$!"; while (<FIL>) { $a=$_; $a =~ s/[\t;, ]*//g; push @arr2, $a if ($a ne ''); }; close(FIL); my %arr1hash; my %arr2hash; my @diffarr; foreach(@arr1) {$arr1hash{$_} = 1; } foreach(@arr2) {$arr2hash{$_} = 1; } foreach $a(@arr1) { if (not defined($arr2hash{$a})) { push @diffarr, $a; } } foreach $a(@arr2) { if (not defined($arr1hash{$a})) { push @diffarr, $a; } } my $dir = '/perl/test/run'; opendir(DIR, $dir) or die $!; while (my $file = readdir(DIR)) { next unless (-f "$dir/$file"); next unless ($file =~ m/\.csv$/); print "\n"; printf "Source: $file "; print "\n"; my $count=1; foreach $a(@diffarr) { print "Item "; print $count++ ; print " - "; print ($a); } } print "\n";

Replies are listed 'Best First'.
Re: Compare and group unmatched records from 2 CSV files together
by Tux (Abbot) on Jun 24, 2014 at 09:45 UTC

    Would csvdiff meet your needs? Also able to spit out the diff in HTML

    $ csvdiff --help usage: csvdiff [--no-color] [--html] file.csv file.csv provides colorized diff on sorted CSV files assuming first line is header and first field is the key $

    Enjoy, Have FUN! H.Merijn
Re: Compare and group unmatched records from 2 CSV files together
by AppleFritter (Vicar) on Jun 24, 2014 at 09:52 UTC

    The reason you're not getting commas in your output is that you're stripping them from the lines you read from the files. Putting them back later on is impossible, for obvious reasons (no information on where they should be put is retained); you'll need different data structures to retain the, well, structure of your data.

    That said, I'd suggest avoiding rolling your own CSV handling, and instead resorting to CPAN. Looking for useful modules there, I just found Tie::Array::CSV, and was able to use it thusly:

    #!/usr/bin/perl use feature qw(say); use strict; use warnings; use Tie::Array::CSV; tie my @file1, 'Tie::Array::CSV', 'FILE1.CSV'; tie my @file2, 'Tie::Array::CSV', 'FILE2.csv'; foreach my $row (0..$#file1) { my @row1 = @{$file1[$row]}; my @row2 = @{$file2[$row]}; foreach my $col (0..$#row1) { if($row1[$col] ne $row2[$col]) { say "Row " . ($row + 1) . " - " . join ",", @row1; say "Row " . ($row + 1) . " - " . join ",", @row2; } } }

    This produces the following output:

    $ perl test.pl Row 3 - RICK,SULLIVAN,31,MALE Row 3 - RICK,SULLIVAN,30,MALE Row 4 - SILVIA,CONOR,24,FEMALE Row 4 - SILVIA,CONOR,24,MALE $

    BTW, Tie::CSV::Array seems to work best when there's no empty lines in your CSV files, so make sure they don't have 'em.

      If you use a decent module to *parse* CSV, why not also use a decent module to *produce* CSV?

      If Tie::Array::CSV, which uses mod::Text::CSV for parsing, has problems with blank lines, consider Spreadsheet::Read, which uses Text::CSV_XS. Both background parsers should have no problem with empty lines.


      Enjoy, Have FUN! H.Merijn

      Thanks AppleFritter! I tried using the sample code but I am getting the error "CSV parse error: EIF - CR char inside unquoted, not part of EOL..."

      I tried looking up the error and found that this could be a data problem. I have recreated the same sample I gave and made sure that there are no extra characters/spaces but still getting the same result. Any idea why I am getting this?

        This smells like a binary \r is inside an unquoted field and the binary option is not passed to the parser. What versions of Text::CSV (and if installed Text::CSV_XS) and Tie::Array::CSV are you using?

        It might help to also show us a hextdump of the data that fails.


        Enjoy, Have FUN! H.Merijn
        Or this smells like a file prepared under Windows and used under Linux or Unix. If this is the case, just remove the CR (carriage returns, or \r) characters from your file before processing. One possible command to strip a file from these noisy Windows files under your shell:
        perl -pi.bak -e 's/\r//g;' my_file.txt
        This removes the Windows CR characters from the file and saves the original fila as my_file.txt.bak (just in case something goes wrong).

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1091021]
Approved by Corion
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (6)
As of 2017-11-25 01:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    In order to be able to say "I know Perl", you must have:













    Results (355 votes). Check out past polls.

    Notices?