PerlMonks  

comparing csv files in perl

by ray15 (Initiate)
on Sep 26, 2014 at 09:35 UTC

ray15 has asked for the wisdom of the Perl Monks concerning the following question:

I have many CSV files. Each has the columns fragment, id, and index, and I want to compare only the fragment field across all the files. The output should have the columns fragment, id_file1, file1 (1 if present, otherwise 0), id_file2, file2 (1 if present, otherwise 0), and so on. I wrote the code below using a hash, but it only works for files with a single column.

file1

    fragment  id  index
    accb      10  A
    bbc       11  B
    ccd       12  C

file2

    fragment  id  index
    ccd       15  D
    llk       11  B
    kks       12  C

Desired output:

    fragment  id_file  file 1  id_file2  file 2
    accb      10       1                 0
    bbc       11       1       14        1
    ccd       12       1       15        1
    llk                0       11        1
    kks                0       12        1
My code:

    use strict;
    use warnings;
    use feature qw(say);
    use autodie;
    use Text::CSV_XS;

    use constant {
        FILE_1 => "1.csv",
        FILE_2 => "2.csv",
    };

    my %hash;

    #
    # Load the Hash with value from File #1
    #
    open my $file1_fh, "<", FILE_1;
    while ( my $value = <$file1_fh> ) {
        chomp $value;
        $hash{$value}++;
    }
    close $file1_fh;

    #
    # Add File #2 to the Hash
    #
    open my $file2_fh, "<", FILE_2;
    while ( my $value = <$file2_fh> ) {
        chomp $value;
        $hash{$value} += 10;    # if the key already exists, the value will now be 11
                                # if it did not exist, the value will be 10
    }
    close $file2_fh;

    open my $file3_fh, "<", FILE_3;
    while ( my $value = <$file3_fh> ) {
        chomp $value;
        $hash{$value} += 100;
    }
    close $file3_fh;

    for my $k ( sort keys %hash ) {
        if ($hash{$k} == 1) {
            # only in file 1
            say "$k\t1\t0";
        }
        elsif ($hash{$k} == 10) {
            # only in file 2
            say "$k\t0\t1";
        }
        else {
            # in both file 1 and file 2
            say "$k\t1\t1";
        }
    }

    open (OUT, ">final.csv") or die "Cannot open OUT for writing \n";
    $, = " \n";
    print OUT "fragment\tid_file\tfile1\tid_file2\tfile2\n\n";
    print OUT (sort keys %hash);
    close OUT;

Replies are listed 'Best First'.
Re: comparing csv files in perl
by Athanasius (Archbishop) on Sep 27, 2014 at 03:24 UTC

    Hello ray15, and welcome to the Monastery!

    Here’s a solution using Text::CSV_XS:

    File “1.csv”

        fragment,id,index
        accb,10,A
        bbc,11,B
        ccd,12,C

    File “2.csv”

        fragment,id,index
        bbc,14,E
        ccd,15,D
        llk,11,B
        kks,12,C

    Script in file “main.pl”

        #!perl

        use strict;
        use warnings;
        use List::MoreUtils 'uniq';
        use Text::CSV_XS;

        my %files = (file1 => '1.csv', file2 => '2.csv');
        my %hashes;
        my $csv = Text::CSV_XS->new( { binary => 1 } );

        for my $file (keys %files)
        {
            open(my $in, '<', $files{$file})
                or die "Cannot open file '$files{$file}' for reading: $!";
            <$in>;                               # Discard column headings

            while (my $row = $csv->getline($in))
            {
                my $key = shift @$row;
                $hashes{$file}{$key} = [ @$row ];
            }

            close $in or die "Cannot close file '$files{$file}': $!";
        }

        separator_line();
        print join("\t", qw(frag id1 file1 id2 file2)), "\n";
        separator_line();

        my @keys;
        push @keys, keys %$_ for values %hashes;
        @keys = uniq @keys;

        for my $fragment (sort @keys)
        {
            my $f1 = exists $hashes{file1}{$fragment} ? 1 : 0;
            my $f2 = exists $hashes{file2}{$fragment} ? 1 : 0;

            printf "%s\t%s\t%s\t%s\t%s\n",
                $fragment,
                $f1 ? $hashes{file1}{$fragment}->[0] : '', $f1,
                $f2 ? $hashes{file2}{$fragment}->[0] : '', $f2,
        }

        separator_line();

        sub separator_line
        {
            print '-' x 37, "\n";
        }

    Output:

        13:06 >perl main.pl
        -------------------------------------
        frag    id1     file1   id2     file2
        -------------------------------------
        accb    10      1               0
        bbc     11      1       14      1
        ccd     12      1       15      1
        kks             0       12      1
        llk             0       11      1
        -------------------------------------

        13:07 >

    Note: I do not try to access $hashes{file1}{$fragment}->[0] until I have confirmed that $hashes{file1}{$fragment} already exists in the hash. This is to avoid autovivification, which is a great Perl feature but is not wanted in this case. (See e.g. Uri Guttman’s tutorial for the gory details.)

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,
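
The autovivification point in the note above can be seen in isolation. The following is only a minimal sketch: the sub names `unguarded_id` and `guarded_id` are hypothetical, and the sample hash merely mirrors the `%hashes` layout used in the script.

```perl
use strict;
use warnings;

# Hypothetical sample data in the same shape as %hashes in the script above.
my %hashes = ( file1 => { accb => [ 10, 'A' ] } );

# Dereferencing a missing entry as an array creates that entry
# (autovivification), even though we only wanted to read from it.
sub unguarded_id {
    my ($h, $file, $fragment) = @_;
    return $h->{$file}{$fragment}->[0];
}

# Testing with exists first leaves the hash untouched for missing fragments.
sub guarded_id {
    my ($h, $file, $fragment) = @_;
    return exists $h->{$file}{$fragment} ? $h->{$file}{$fragment}->[0] : '';
}

guarded_id(\%hashes, 'file1', 'llk');
print "keys after guarded lookup:   ", scalar( keys %{ $hashes{file1} } ), "\n";

unguarded_id(\%hashes, 'file1', 'llk');
print "keys after unguarded lookup: ", scalar( keys %{ $hashes{file1} } ), "\n";
```

The second count is larger than the first because the unguarded lookup has silently added an `llk` entry. In the report script that phantom entry would make a later `exists` test claim the fragment is present in the file.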

Re: comparing csv files in perl
by blindluke (Hermit) on Sep 26, 2014 at 18:07 UTC

    Your question is barely readable, your code does not compile (you still have file3 all over the code), and you are declaring the use of a Text::CSV module without making use of it anywhere in the code. This has homework written all over it.

    Still, for someone with a similar problem, a solution could help. Let's define the problem: we have a few files, organized in columns. The first column is the ID, the second is an item name, and the rest are not interesting from our point of view. We make two assumptions about the input: an ID is never 0, and each item name appears on at most one line per file.

    The input:

        ### test1.csv ###
        1 aaa ignored_field
        3 ccc ignored_field
        4 ddd ignored_field

        ### test2.csv ###
        11 aaa ignored_field
        22 bbb ignored_field
        44 ddd ignored_field

        ### test3.csv ###
        333 ccc ignored_field
        555 eee ignored_field
        666 fff ignored_field
        777 ggg ignored_field

    What we want to accomplish is to produce a report with all items, clearly stating the ID under which an item is stored in a file. If a file does not contain this item, instead of the ID, a 0 will be shown.

        #!/usr/bin/perl

        use v5.14;

        my %data;

        sub cnt_fields {
            my @fields = split "\t", $_[0];
            return scalar @fields;
        }

        sub gather_file {
            state $count = 0;
            return $count unless @_;

            my ($filename, $d) = @_;

            open my $fh, "<", $filename;
            while (<$fh>) {
                chomp;
                my @field = split;
                my $offset = $count - cnt_fields($d->{$field[1]});
                $d->{$field[1]} .= "0\t" x $offset;
                $d->{$field[1]} .= "$field[0]\t";
            }
            close $fh;

            for my $key (keys %$d) {
                if (cnt_fields($d->{$key}) <= $count) {
                    $d->{$key} .= "0\t";
                }
            }
            $count++;
        }

        for (1..3) {
            gather_file("test$_.csv", \%data);
        }

        my $header = "Key\t";
        for (1..gather_file()) {
            $header .= "f$_\t";
        }
        say $header;

        for my $key (sort keys %data) {
            say "$key\t$data{$key}";
        }

    The output:

        $ ./report.pl
        Key     f1      f2      f3
        aaa     1       11      0
        bbb     0       22      0
        ccc     3       0       333
        ddd     4       44      0
        eee     0       0       555
        fff     0       0       666
        ggg     0       0       777

    regards,
    Luke Jefferson

Re: comparing csv files in perl
by GotToBTru (Prior) on Sep 26, 2014 at 18:25 UTC

    You define 5 columns in your output, but your samples have either 3 or 4 columns. The file_3 code is pointless (actually, worse than pointless, because it will cause a compilation error). You need both the fragment and the id from each file, but you store only the first as the hash key.

    Perhaps you need to replace the scalar value that indicates which file with a more sophisticated data structure that stores all the information you need. Then you can generate your output from the hash values.

    1 Peter 4:10
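
One possible shape for the "more sophisticated data structure" suggested above is a hash of hashes keyed by fragment and then by file label, holding the id. This is only a sketch under stated assumptions: the names `add_file` and `report_rows` are hypothetical, the lines are passed in memory instead of being read from disk, and the plain `split` on commas assumes no quoted fields (real CSV input is better handled by Text::CSV_XS, which the question already loads).

```perl
use strict;
use warnings;

# Record each "fragment,id,index" line of one file under a file label.
# Assumption: simple comma-separated lines with no quoted fields.
sub add_file {
    my ($seen, $label, @lines) = @_;
    for my $line (@lines) {
        chomp $line;
        my ($fragment, $id) = split /,/, $line;
        # Skip blank lines and the header row.
        next if !defined $fragment || $fragment eq 'fragment';
        $seen->{$fragment}{$label} = $id;
    }
    return;
}

# Build one tab-separated row per fragment: for every file label,
# the id (or an empty string) followed by a 1/0 presence flag.
sub report_rows {
    my ($seen, @labels) = @_;
    my @rows;
    for my $fragment (sort keys %$seen) {
        my @cells = ($fragment);
        for my $label (@labels) {
            my $present = exists $seen->{$fragment}{$label};
            push @cells, ($present ? $seen->{$fragment}{$label} : ''), ($present ? 1 : 0);
        }
        push @rows, join "\t", @cells;
    }
    return @rows;
}

# In-memory copies of file1 and file2 as posted in the question.
my %seen;
add_file(\%seen, 'file1', "fragment,id,index\n", "accb,10,A\n", "bbc,11,B\n", "ccd,12,C\n");
add_file(\%seen, 'file2', "fragment,id,index\n", "ccd,15,D\n", "llk,11,B\n", "kks,12,C\n");
print "$_\n" for report_rows(\%seen, 'file1', 'file2');
```

Because the presence flag is derived from the stored ids, adding a third file only means one more `add_file` call and one more label passed to `report_rows`, with no magic numbers such as 1, 10, and 100 to decode.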
