Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Merging Two CSV files Based on Two Columns

by TheCanadian (Initiate)
on Sep 07, 2012 at 21:47 UTC ( [id://992384]=perlquestion: print w/replies, xml ) Need Help??

TheCanadian has asked for the wisdom of the Perl Monks concerning the following question:

Hi: I am new to PERL and basically am self taught through web articles and examples. My question is how do I merge two comma separated CSV files based on two key columns being equal. I have tried 4 methods I borrowed from the web. These examples are below but first I will give samples of the two CSV files. My preferred method is #1 for it's simplicity.

Sample File #1 (final2.csv):

username, date_modified 2010-nhw-es-ea-285, 2012-10-10 aab.external.audit.-.bvr.verif.externes, 2012-02-22 aashee.aziz, 2012-01-27 abbas.rangwala, 2012-01-27 abbie.gardner, 2012-09-10 ab.border-frontiere, 2012-08-11 abdool.yasseen, 2012-01-31 abdullah.aboukarr, 2012-08-10 abdullahi.sheikh, 2012-02-28 (There is 544 lines)

Sample File #2 (HCDir2.csv):

fullname, branch aaron.northway, CHIEF FINANCIAL OFFICER BRAN +CH aashee.aziz, HEALTH PRODUCTS AND FOOD BRA +NCH abbey.klugerman, HEALTHY ENVIRONMENTS AND CON +SUMER SAFETY BRANCH abby.hoffman, STRATEGIC POLICY BRANCH abderrahim.boussanni, CHIEF FINANCIAL OFFICER BRAN +CH abdiaziz.nur, HEALTHY ENVIRONMENTS AND CON +SUMER SAFETY BRANCH abdool.yassin, HEALTH PRODUCTS AND FOOD BRA +NCH abdoulie.manjang, N/A abdullah.hassen, HEALTH PRODUCTS AND FOOD BRA +NCH abdullah.patel, REGIONS AND PROGRAMS BRANCH (There is 12489 lines)

Intended output file (For sample data):

username, branch, date_modified aashee.aziz, HEALTH PRODUCTS AND FOOD BRANCH, 2012-01-27 abdool.yasseen, HEALTH PRODUCTS AND FOOD BRANCH, 2012-01-31
In reality all records in final.csv should match with a name in HCDir.csv but not all HCDir.csv data will be found in final.csv

Method #1 #!F:\Perl_program\perl\bin\perl -w #I am using Portable Strawberry PE +RL on Windows XP. use strict; use warnings; use DBI; my $datafolder = "F:/Perl_program"; my $dbh = DBI->connect( "dbi:CSV:f_dir=$datafolder;f_ext=.csv;csv_eol=\n;" ); $dbh->{'csv_tables'}->{"HCDir2.csv"} = {'col_names' => ["fullname" +, "branch"]}; $dbh->{'csv_tables'}->{"final2.csv"} = {'col_names' => ["username" +, "date_modified"]}; $dbh->do(" CREATE TABLE completed.csv AS SELECT final.username,HCDir.branch,final.date_modified FROM HCDir INNER JOIN final ON final.username = HCDir.fullname ");
Method #2 #!F:\Perl_program\perl\bin\perl -w #I am using Portable Strawberry PE +RL on Windows XP. use strict; use warnings; use DBI; open FINAL, "<final2.csv" or die(); open HCDIR, "<HCDir2.csv" or die(); open OUTFILE, ">completed.csv" or die(); our %hash1; our %hash2; while (<FINAL>) { our ($username, $date_modified) = split; $hash1{$username} = $date_modified; } while (<HCDIR>) { our ($fullname, $branch) = split; $hash2{$fullname} = $branch; } if (our $username = our $fullname){ print OUTFILE "$username $hash2{$fullname} $hash1{$username}\n"; } close FINAL; close HCDIR; close OUTFILE;
Method #3 #!F:\Perl_program\perl\bin\perl -w #I am using Portable Strawberry PE +RL on Windows XP. use strict; use warnings; open (HCDIR, "<HCDir2.csv") or die "$!"; open (FINAL, "<final2.csv") or die "$!"; open(NBMD, ">nbmd.csv") or die "$!"; our $line1; our $line2; our @lines1 = (<FINAL>); our @lines2 = (<HCDIR>); foreach $line2(@lines2){ our @linearr2 = split (',', $line2); our $fullname = $linearr2[0]; our $branch = $linearr2[1]; } foreach $line1(@line1){ our @linearr1 = split (',', $line1); our $username = $linearr1[0]; our $date_modified = $linearr1[1]; } while (<FINAL>){ if ($username eq $fullname){ our $nbmd = join (',', "$username,$branch$,date_modified") +; chomp $nbmd; print NBMD ("$nbmd\n"); } } close (FINAL); close (HCDIR); close (NBMD);
Method #4 #!F:\Perl_program\perl\bin\perl -w #I am using Portable Strawberry PE +RL on Windows XP. use strict; use warnings; our $file1 = "F:/Perl_program/final2.csv"; our $file2 = "F:/Perl_program/HCDir2.csv"; our $fileOut = "F:/Perl_program/completed.csv"; our %file1Map = (); our %file2Map = (); open (FILE1, "<final.csv") or die("Could not open $file1 file."); foreach our $line1 () { (our $username, our $date_modified) = split(',',$line1,2); our $File1Map{"$username"} = "$date_modified"; } close(FILE1); open (INPUT2, "<modhcdir2") or die("Could not open $file2 file."); foreach $line2() { (our $fullname,our $branch) = split(',',$line2); our $File2Map{"$fullname"}="$branch"; } close(FILE2); open (OUTFILE, ">completed.csv") or die("Could not open $fileout file. +"); foreach $fileout(){ (our $username2,our $dumour)=split('\n',$fileout); our $existingValue = $File1Map{"$username"}; our $newValue = $File2Map{"$fullname"}; if (! defined $existingValue){ $existingValue = ","; } if (! defined $newValue){ $newValue = "\n"; } print OUTFILE "" . $username2 . "," . $existingValue . "," . $newValue +; } close (OUTFILE);

I intend to create a new CSV file (completed.csv) containing two columns from final.csv and one from HCDir.csv (final.username,HCDir.branch,final.date_modified). The match or key is column "username" from final.csv and "fullname" from HCDir.csv.

The issues from the methods displayed above is as follows. Method #1: No results are produced just an empty file. Method #2: Produces error "Use of uninitialized value in concatenation (.) or string". Method #3: Produces "Variable "@line1" is not imported" and Global symbol "@line1" requires explicit package name errors. Method #4: Produces "syntax error at ..." and "Global symbol "$fileout" requires explicit ..." errors.

Would someone be so kind and give me a little assistance to solving this simple issue for something quite important to me. Thank you in advance. JGW 20120907-1720 :-}

Replies are listed 'Best First'.
Re: Merging Two CSV files Based on Two Columns
by ig (Vicar) on Sep 08, 2012 at 03:03 UTC

    I might use the Text::CSV module to parse and compose CSV, perhapse something like the following:

    use strict; use warnings; use Data::Dumper; use IO::String; use Text::CSV; my $file1 = IO::String->new(<<EOF) or die "$!"; username, date_modified 2010-nhw-es-ea-285, 2012-10-10 aab.external.audit.-.bvr.verif.externes, 2012-02-22 aashee.aziz, 2012-01-27 abbas.rangwala, 2012-01-27 abbie.gardner, 2012-09-10 ab.border-frontiere, 2012-08-11 abdool.yasseen, 2012-01-31 abdullah.aboukarr, 2012-08-10 abdullahi.sheikh, 2012-02-28 EOF my $file2 = IO::String->new(<<EOF) or die "$!"; fullname, branch aaron.northway, CHIEF FINANCIAL OFFICER BRANCH aashee.aziz, HEALTH PRODUCTS AND FOOD BRANCH abbey.klugerman, HEALTHY ENVIRONMENTS AND CONSUMER SAFETY BRANCH abby.hoffman, STRATEGIC POLICY BRANCH abderrahim.boussanni, CHIEF FINANCIAL OFFICER BRANCH abdiaziz.nur, HEALTHY ENVIRONMENTS AND CONSUMER SAFETY BRANCH abdool.yassin, HEALTH PRODUCTS AND FOOD BRANCH abdoulie.manjang, N/A abdullah.hassen, HEALTH PRODUCTS AND FOOD BRANCH abdullah.patel, REGIONS AND PROGRAMS BRANCH asdf "asdf" asdf,, EOF my $csv1 = Text::CSV->new({ binary => 1, allow_whitespace => 1, sep_char => ',', }); my %date_modified_by_username; $csv1->column_names( $csv1->getline($file1) ); # get header line while(my $href = $csv1->getline_hr($file1)) { $date_modified_by_username{$href->{username}} = $href->{date_modif +ied}; } my $csv2 = Text::CSV->new({ binary => 1, allow_whitespace => 1, sep_char => ',', eol => "\n", }); $csv2->print(\*STDOUT, [ qw{username branch date_modified} ]); $csv2->column_names( $csv2->getline($file2) ); # get header line while(my $href = $csv2->getline_hr($file2)) { my $date_modified = $date_modified_by_username{$href->{fullname}} || 'unknown'; $csv2->print(\*STDOUT, [ $href->{fullname}, $href->{branch}, $date_modified ]); }

    With your sample data this produces:

    username,branch,date_modified aaron.northway,"CHIEF FINANCIAL OFFICER BRANCH",unknown aashee.aziz,"HEALTH PRODUCTS AND FOOD BRANCH",2012-01-27 abbey.klugerman,"HEALTHY ENVIRONMENTS AND CONSUMER SAFETY BRANCH",unkn +own abby.hoffman,"STRATEGIC POLICY BRANCH",unknown abderrahim.boussanni,"CHIEF FINANCIAL OFFICER BRANCH",unknown abdiaziz.nur,"HEALTHY ENVIRONMENTS AND CONSUMER SAFETY BRANCH",unknown abdool.yassin,"HEALTH PRODUCTS AND FOOD BRANCH",unknown abdoulie.manjang,N/A,unknown abdullah.hassen,"HEALTH PRODUCTS AND FOOD BRANCH",unknown abdullah.patel,"REGIONS AND PROGRAMS BRANCH",unknown

      Hi: ig

      Thank you for your post. I am testing various ways of getting the job done. Your method does interest me however, due to my lack of knowledge, I have been having difficulty reading though the code with full understanding. I feel I need this to enable me to effectively replace variables and files for the script to work on.

      TheCanadian

        You don't need to worry about IO::String. That's just a way to get a file handle that reads from a string. You can use a file handle from open to read a real file.

        I you have any particular questions about the code, feel free to ask.

Re: Merging Two CSV files Based on Two Columns
by kcott (Archbishop) on Sep 08, 2012 at 03:37 UTC

    G'day TheCanadian,

    Welcome to the monastery.

    It's good to see you've attempted several solutions yourself before seeking help.

    Here's a rundown of issues you have with these different methods (some issues occur in more than one method):

    Method #1
    • my $datafolder = "F:/Perl_program";

      I suspect that should have a backslash. In single quotes: 'F:\Perl_program'; in double quotes: "F:\\Perl_program".

    • my $dbh = DBI->connect( ... );

      You should check to see if you've actually made the connection. From DBD::CSV:

      DBI->connect ( ... ) or die "Cannot connect: $DBI::errstr";

      This probably would have pointed you to the $datafolder issue.

      Also take a look at DBI; in particular, $DBI::err, $DBI::errstr, PrintError and RaiseError (all listed in the table of contents).

    Method #2
    • open FINAL, "<final2.csv" or die();

      You've used open inconsistently with various degrees of error checking. Also, the 3-argument form is preferable. The first two examples from the open documentation:

      open(my $fh, "<", "input.txt") or die "cannot open < input.txt: $!"; open(my $fh, ">", "output.txt") or die "cannot open > output.txt: $!";

      In case of error, both of these will tell you: what you're trying to open; how you're trying to open it; and, why it didn't work.

    • our %hash1;

      You've used our in this and all subsequent methods. my would have been a better choice. I've mostly retained the our in the code snippets below (to avoid correcting dozens of lines of code) but I do recommend you change them to my.

    • while (<FINAL>) { our ($username, $date_modified) = split; $hash1{$username} = $date_modified; }

      The two variables (that you've declared within the scope of the while loop) go out of scope once you exit the while loop. This will be the cause of some of your "uninitialized value" messages. Compare these two pieces of code:

      $ perl -Mstrict -Mwarnings -le 'for (0..1) { our $x = $_ } print $x' Variable "$x" is not imported at -e line 1. Global symbol "$x" requires explicit package name at -e line 1. Execution of -e aborted due to compilation errors. $ perl -Mstrict -Mwarnings -le 'our $x; for (0..1) { $x = $_ } print $ +x' 1

      So, if you want to use data you've captured inside a block, declare the variables holding that data outside the block.

    • if (our $username = our $fullname){

      I suspect you were grasping at straws somewhat with this syntax :-)

      Firstly, to compare two strings, the operator is eq (= is an assignment operator) - see perlop. If you declared the variables outside the while loop (as indicated above), this would be the appropriate code to use here:

      if ($username eq $fullname) {
    Method #3
    • our $line1; ... our @lines1 = (<FINAL>); ... foreach $line1(@line1){

      You declared @lines1 (with an 's') then used @line1 (without an 's'). As you only use $line1 in that single foreach loop, you can localise it to that loop with my. Were you aware that foreach and for are synonymous - see perlsyn - Foreach Loops. Putting all that together, those three lines can be written as these two lines:

      our @lines1 = (<FINAL>); ... for my $line1 (@line1) {
    • foreach $line2(@lines2){ didn't have the missing 's' problem but otherwise the same comments apply here.

    Method #4
    • (our $username, our $date_modified) = split(',',$line1,2);

      You have three lines like this. The syntax for declaring a list of variables with our is:

      our ($username, $date_modified) = ...
    • This method also has many of the same problems addressed above.

    -- Ken

      I suspect that should have a backslash.

      Nope

      Hi: Ken

      This was actually the first reply to my question I viewed. I learned quite a bit about proper syntax from your suggestions and tried some new things. However, my script still did not produce the product I was looking for. My errors on two of the methods dropped to zero but what was produced was an empty CSV file. No worries, I am continuing to modify with the help from others in this forum.

      TheCanadian

Re: Merging Two CSV files Based on Two Columns
by Tux (Canon) on Sep 08, 2012 at 06:59 UTC

    Please consider adding some code tags in your post, so your data looks more like data.

    Note that your first example uses DBD::CSV, which uses Text::CSV_XS underneath to parse the CSV data. Example 2 uses <FINAL> plus invalid split. Examples 3 uses the same reading mechanism as example 2, but with a split that is still unsafe and example 4 uses two endless loops.

    If you want to do the "work" yourself and not rely on DBD::CSV, switch to using one of the two proved CSV parsers: Text::CSV (as ig already suggested) or Text::CSV_XS (the fast one).

    In what you describe, getline_hr might be very helpful, maybe even getline_hr_all.

    use text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1, allow_whit +espace => 1 }); open my $fh, "<", "final2.csv" or die "final2.csv: $!"; $csv->column_names ($csv->getline ($fh)); # Read the header line while (my $r = $csv->getline_hr ($fh)) { $File1Map{$r->{username}} = $r->{date_modified}; } : : $csv->eol ("\n"); # Now use it for output # inside the for/while loop $csv->print ($fh, [ $usename, $oldvalue, $newvalue ]);

    Enjoy, Have FUN! H.Merijn

      Hi: Tux

      Your post got a lot of my attention as it is short with the promise that Text::CSV_XS is the fast one. I tried implementing your code that you presented, however, there is a part that is missing regarding the output. I tried filling in the blanks but my code still did not work correctly.

      Can Text::CSV be substituted by Text_CSV_XS?

      TheCanadian

        Text::CSV_XS is written (mostly) in XS code: that is the C API to perl and can thus be optimized at a very low level. Text::CSV is a wrapper over Text::CSV_XS and Text::CSV_PP, the Pure-Perl implementation that does exactly the same, but up to a 100 time slower. The documentation should be identical. The wrapper will automatically choose Text::CSV_XS when it is installed and available.

        If you are now using Text::CSV and you install Text::CSV_XS the code doesn't have to change at all but will immediately gain speed.

        Where are the "blanks"? I think that by now the documentation is quite elaborate, and comes with clear examples.


        Enjoy, Have FUN! H.Merijn
Re: Merging Two CSV files Based on Two Columns
by Kenosis (Priest) on Sep 08, 2012 at 01:44 UTC

    Welcome to PerlMonks, TheCanadian!

    What may help is if you would post some csv lines from both files--within code tags--including the headers. If those lines contain information that's not suitable for public consumption, redact that information.

      Hi: Kenosis

      I am new to this, I did not know I could format the CSV contents this way. Thanks for the tip.

      TheCanadian

Re: Merging Two CSV files Based on Two Columns
by philiprbrenan (Monk) on Sep 07, 2012 at 22:10 UTC

    Assuming your data is valid, please consider:

    use feature ":5.14"; use warnings FATAL => qw(all); use strict; use Data::Dump qw(dump pp); my @f1 = split /\n/, <<'END'; 2010-nhw-es-ea-285, 2012-10-10 aab.external.audit.-.bvr.verif.externes, 2012-02-22 aashee.aziz, 2012-01-27 abbas.rangwala, 2012-01-27 abbie.gardner, 2012-09-10 ab.border-frontiere, 2012-08-11 abdool.yasseen, 2012-01-31 abdullah.aboukarr, 2012-08-10 abdullahi.sheikh, 2012-02-28 END my @f2 = split /\n/, <<'END'; aaron.northway, CHIEF FINANCIAL OFFICER BRANCH aashee.aziz, HEALTH PRODUCTS AND FOOD BRANCH abbey.klugerman, HEALTHY ENVIRONMENTS AND CONSUMER SAFETY BRANCH abby.hoffman, STRATEGIC POLICY BRANCH abderrahim.boussanni, CHIEF FINANCIAL OFFICER BRANCH abdiaziz.nur, HEALTHY ENVIRONMENTS AND CONSUMER SAFETY BRANCH abdool.yassin, HEALTH PRODUCTS AND FOOD BRANCH abdoulie.manjang, N/A abdullah.hassen, HEALTH PRODUCTS AND FOOD BRANCH abdullah.patel, REGIONS AND PROGRAMS BRANCH END my (%f1, %f2); for(@f1) {my ($username, $date) = split /,\s*/; $f1{$username} = $date; } for(@f2) {my ($fullname, $branch) = split /,\s*/; $f2{$fullname} = $branch; } my %names = map {($_, 1)} (keys %f1, keys %f2); say "$_, ", ($f1{$_}//'undefined'), ", ", ($f2{$_}//'undefined') for s +ort keys %names;

    Produces:

    2010-nhw-es-ea-285, 2012-10-10, undefined aab.external.audit.-.bvr.verif.externes, 2012-02-22, undefined aaron.northway, undefined, CHIEF FINANCIAL OFFICER BRANCH aashee.aziz, 2012-01-27, HEALTH PRODUCTS AND FOOD BRANCH ab.border-frontiere, 2012-08-11, undefined abbas.rangwala, 2012-01-27, undefined abbey.klugerman, undefined, HEALTHY ENVIRONMENTS AND CONSUMER SAFETY B +RANCH abbie.gardner, 2012-09-10, undefined abby.hoffman, undefined, STRATEGIC POLICY BRANCH abderrahim.boussanni, undefined, CHIEF FINANCIAL OFFICER BRANCH abdiaziz.nur, undefined, HEALTHY ENVIRONMENTS AND CONSUMER SAFETY BRAN +CH abdool.yasseen, 2012-01-31, undefined abdool.yassin, undefined, HEALTH PRODUCTS AND FOOD BRANCH abdoulie.manjang, undefined, N/A abdullah.aboukarr, 2012-08-10, undefined abdullah.hassen, undefined, HEALTH PRODUCTS AND FOOD BRANCH abdullah.patel, undefined, REGIONS AND PROGRAMS BRANCH abdullahi.sheikh, 2012-02-28, undefined

      Thank you very much for your response. Since you had an entry that showed 5.14 version of PERL, I upgraded to the latest version from 5.12 to 5.16. Thank you for that. I am working on testing a number of responses from this forum. All in all I have not been successful. On a positive note I just found another way that my script does not work. (aka Thomas Edison)

      The results you showed is a true merge however, what I am looking for is to show only the entries that that match in both CSV files.

      Thanks again philiprbrenan.

      TheCanadian

Re: Merging Two CSV files Based on Two Columns
by clueless newbie (Curate) on Sep 08, 2012 at 13:58 UTC

    With regards to method 1: Your SQL is at fault for the completed file being empty. It should read (Note the "2")

    CREATE TABLE completed AS SELECT final2.username,HCDir2.branch,final2.date_modified FROM HCDir2 INNER JOIN final2 ON final2.username = HCDir2.fullname

      Hi: clueless newbie

      Thank you for your reply. This method interested me because I wanted to learn SQL as well; not to mention the slick short code to produce the product. I still cannot get this code to work even after the changes you suggested. Right now I am going to focus on straight PERL or PERL with the addition of Text::CSV or Text::CSV_XS.

      I want to thank you as well for having a keen eye for anomalies in file names and variables. Keep up the good work. Sometimes a simple slip in names like this code causes the code to fail with little message as to the reason for the fail.

      TheCanadian

Re: Merging Two CSV files Based on Two Columns
by aaron_baugher (Curate) on Sep 15, 2012 at 22:29 UTC

    If you have access to Unix-type utilities:

    join -t ',' file2 file1 >output

    One note: you need to sort the files before passing them to join, if they aren't already.

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://992384]
Approved by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (5)
As of 2024-03-19 09:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found