http://www.perlmonks.org?node_id=482271

DrAxeman has asked for the wisdom of the Perl Monks concerning the following question:

After I process data in my csv files, I need to combine/join them into a single file. Should I use my current scripts to output results into new csv files, and then write a script to combine/join them? Or should I use the %arrays to do a SQL JOIN on them all in one script? If I can/should use the second method, can I get a simple code example of how to use SQL to access the data? Thanks

Edited by Arunbear: Changed title from 'Which is easier?', as per Monastery guidelines


Re: Process and combine two CSV files into one
by Joost (Canon) on Aug 09, 2005 at 16:02 UTC
      Sorry. That was a horrible way to ask for help on something. Now that I'm trying, I find it's harder to ask properly than I thought.

      My CSV files look like:
      IP,Domain,ServerName
      10.160.0.8,0,
      10.160.0.9,0,
      10.160.0.10,EURORSCG,ERWWDC1
      10.160.0.11,EURORSCG,ERWWSQLWEB
      10.160.0.12,EURORSCG,ERWWSEARCH
      10.160.0.13,EURORSCG,ERWWNAB
      10.160.0.14,EURORSCG,ERWWCOMMUNITIES
      and
      IP,DaysUptime,OS,RAM,OSSP,InstallDate,CPUSpeed,CPUCount,CPUType
      10.160.0.10,19,Microsoft_Windows_2000_Server_Domain_Controller,1024,4,12_2_2003,995,1,Intel_Pentium_III
      10.160.0.11,266,Microsoft_Windows_2000_Server,2048,3,11_21_2003,3000,4,Intel_Xeon
      10.160.0.12,21,Microsoft_Windows_2000_Server,1024,3,11_24_2003,1000,1,Intel_Pentium_III
      10.160.0.13,25,Microsoft_Windows_2000_Server,1024,3,11_20_2003,1000,1,Intel_Pentium_III
      10.160.0.14,3,Microsoft_Windows_2000_Server,1024,4,8_28_2002,1000,1,Intel_Pentium_III
      10.160.0.15,25,Microsoft_Windows_2000_Server,1024,4,5_23_2003,995,2,Intel_Pentium_III

      The files get read in as hashes, %hosts and %sysinfo. I'd like to combine both into %combined, using the IP column as the key. I'd like it to look like:
      IP,Domain,ServerName,DaysUptime,OS,RAM,OSSP,InstallDate,CPUSpeed,CPUCount,CPUType
      10.160.0.8,0,
      10.160.0.9,0,
      10.160.0.10,EURORSCG,ERWWDC1,19,Microsoft_Windows_2000_Server_Domain_Controller,1024,4,12_2_2003,995,1,Intel_Pentium_III
      10.160.0.11,EURORSCG,ERWWSQLWEB,266,Microsoft_Windows_2000_Server,2048,3,11_21_2003,3000,4,Intel_Xeon

      The final step of my process is causing me the most headache. I have all but the last table combined. The issues are:

      The data in the last table looks like:

      Server,Statistic,Average
      "ERWWCOMMUNITIES","MemoryPagessec",16.436370
      "ERWWCOMMUNITIES","NetworkInterfaceCompaqEthernetFastEthernetAdapterModuleBytesTotalsec",157380.618090
      "ERWWCOMMUNITIES","PhysicalDisk0CPCTDiskTime",33960995.990210
      "ERWWCOMMUNITIES","PhysicalDisk1DPCTDiskTime",3158277.602200
      "ERWWCOMMUNITIES","PhysicalDiskTotalPCTDiskTime",2541682.536690
      "ERWWCOMMUNITIES","PhysicalDiskTotalAvgDiskQueueLength",508335953.665560
      "ERWWCOMMUNITIES","ProcessorTotalPCTProcessorTime",1.838210

      The first column contains the value I will use to join to the other table. The second column determines which column the value in column 3 goes into.

      How do I perform a join that will collect this data? My other table looks like:
      IP,ServerName,Domain,DaysUptime,OS,RAM,OSSP,InstallDate,CPUSpeed,CPUCount,CPUType,PartitionSize,PartitionFree
      10.160.0.1,,
      10.160.0.2,,
      10.160.0.3,,
      10.160.0.4,,
      10.160.0.5,,
      10.160.0.6,,
      10.160.0.7,,
      10.160.0.8,,
      10.160.0.9,,
      10.160.0.10,ERWWDC1,EURORSCG,19,Microsoft_Windows_2000_Server_Domain_Controller,1024,4,12_2_2003,995,1,Intel_Pentium_III,8400,4600
      10.160.0.11,ERWWSQLWEB,EURORSCG,266,Microsoft_Windows_2000_Server,2048,3,11_21_2003,3000,4,Intel_Xeon,205000,113700
      10.160.0.12,ERWWSEARCH,EURORSCG,21,Microsoft_Windows_2000_Server,1024,3,11_24_2003,1000,1,Intel_Pentium_III,67700,61000

      I need all the data from the second table (a left join, right?) and the relevant data from the first table.

      Ideally, the columns for the final table should look something like:
      IP,ServerName,Domain,DaysUptime,OS,RAM,OSSP,InstallDate,CPUSpeed,CPUCount,CPUType,PartitionSize,PartitionFree,MemoryPagessec,NetworkBytessec,PhysicalDiskTotalPCTDiskTime,PhysicalDiskTotalAvgDiskQueueLength,ProcessorTotalPCTProcessorTime
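      The pivot-then-join step described above can be sketched in plain Perl. This is a minimal sketch on a stand-in subset of the data (the row literals and the @stat_cols list are assumptions for illustration, not the full files): fold each Statistic/Average pair into a column keyed on the server name, then append those columns to every host row, leaving blanks where a host has no stats (the "left join" behaviour you want).

      ```perl
      use strict;
      use warnings;

      # Hypothetical sample rows standing in for the two files above.
      my @host_lines = (
          "10.160.0.10,ERWWDC1,EURORSCG,19",
          "10.160.0.14,ERWWCOMMUNITIES,EURORSCG,3",
      );
      my @stat_lines = (
          qq{"ERWWCOMMUNITIES","MemoryPagessec",16.436370},
          qq{"ERWWCOMMUNITIES","ProcessorTotalPCTProcessorTime",1.838210},
      );

      # Pivot: $stats{server}{statistic} = average
      my %stats;
      for my $line (@stat_lines) {
          my ($server, $stat, $avg) = $line =~ /^"([^"]*)","([^"]*)",(.*)$/;
          $stats{$server}{$stat} = $avg;
      }

      # "Left join": every host row survives; stat columns filled when present.
      my @stat_cols = ('MemoryPagessec', 'ProcessorTotalPCTProcessorTime');
      my @combined;
      for my $line (@host_lines) {
          my ($ip, $name, @rest) = split /,/, $line;
          my @extra = map { defined $stats{$name}{$_} ? $stats{$name}{$_} : '' } @stat_cols;
          push @combined, join ',', $ip, $name, @rest, @extra;
      }
      print "$_\n" for @combined;
      ```

      The same shape works for the full column list; only @stat_cols and the input rows change.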
Re: Process and combine two CSV files into one
by jcoxen (Deacon) on Aug 09, 2005 at 16:25 UTC
    I try to take the KISS approach. If you don't need the SQL functionality, why not just open 2 output files - the individual .csv file and a "Master" .csv file - and print your data to both?

    # Setup the Master list
    open(MASTERFILE, ">>$dir/MasterList.csv")
        or die "Can't open MasterList.csv: $!";

    # Set up a loop to "Do Stuff" tm
    my @src_files = <$src_dir/*.csv>;
    foreach my $src_file (@src_files) {
        open(INFILE, $src_file)
            or die "Can't open $src_file for input: $!\n";
        while (<INFILE>) {
            chomp;
            #
            # Do Stuff Here!
            #
            open(OUTFILE, ">>$dest_file")
                or die "Can't open $dest_file for output: $!\n";
            print OUTFILE "Your records go here\n";
            print MASTERFILE "Your records go here\n";
            close OUTFILE;
        }
        close INFILE;
    }
    close MASTERFILE;

    If you don't need the individual .csv files just leave out the appropriate open and print statements.

    HTH,

    Jack

      I tried something similar to this, but got unexpected results. (My skills with SQL are even worse than my skills with Perl.)

      # Connect to the database (the directory containing our csv file(s))
      my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_eol=\n;");

      # Associate our csv disk files with table names
      $dbh->{'csv_tables'}->{'info'}  = { 'file' => "psinfooutputfile.csv" };
      $dbh->{'csv_tables'}->{'hosts'} = { 'file' => "netbiosoutputfile.csv" };
      ######
      my @row = $dbh->selectall_arrayref("SELECT * FROM hosts, info WHERE hosts.IP = info.IP");
      print Dumper @row;

      Produced

      $VAR1 = [
                [
                  '10.160.0.10',
                  '10.160.0.10',
                  '10.160.0.10',
                  '10.160.0.10',
                  '10.160.0.10',
                  '10.160.0.10',
                  '10.160.0.10',
                  '10.160.0.10',
                  '10.160.0.10',
                  '10.160.0.10',
                  '10.160.0.10',
                  '10.160.0.10'
                ],

      I am assuming that there is an issue with my SQL.

      I figured it out. I need to name each Column in my statement, not just an *.

      My problem is that it isn't joining data from one of the tables.
      # Connect to the database (the directory containing our csv file(s))
      my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_eol=\n;");

      # Associate our csv disk files with table names
      $dbh->{'csv_tables'}->{'info'}  = { 'file' => "psinfooutputfile.csv" };
      $dbh->{'csv_tables'}->{'hosts'} = { 'file' => "netbiosoutputfile.csv" };
      ######
      my @row = $dbh->selectall_arrayref("SELECT IP, Domain, ServerName, DaysUptime, OS, RAM, OSSP, InstallDate, CPUSpeed, CPUCount, CPUType FROM hosts, info WHERE hosts.IP = info.IP");
      print Dumper @row;

      Output:
      $VAR1 = [
                [
                  '10.160.0.10',
                  '0',
                  '',
                  '19',
                  'Microsoft_Windows_2000_Server_Domain_Controller',
                  '1024',
                  '4',
                  '12_2_2003',
                  '995',
                  '1',
                  'Intel_Pentium_III'
                ],
              ]

      UPDATE:
      I modified the top line of the file whose data wasn't getting joined, and found that the data included there was the ONLY data being used. I changed the query to an INNER JOIN and it looks like it's working.

Re: Process and combine two CSV files into one
by jZed (Prior) on Aug 09, 2005 at 23:29 UTC
    I've pointed you to DBD::CSV's "CREATE TABLE AS SELECT" syntax in several other threads but you don't seem to have seen it, so here goes again: You can create a new CSV file from one or more old CSV files in a single step with this kind of syntax:
    $dbh->do("
        CREATE TABLE new_csv_file AS
        SELECT (columns)
        FROM old_csv_file1
        JOIN old_csv_file2
    ");
    The JOIN may be natural, inner, outer, left, right, or full. You can use table and column aliases, supply a WHERE clause, etc. This creates the new CSV file from the data result set you specify in the SELECT clause. You don't need to put the results into an array or hash or any other structure, just go from CSV file to CSV file using SQL.

    update : Make sure you have the latest version of SQL::Statement (DBD::CSV's SQL engine); the syntax is supported in SQL::Statement version 1.14 and later.

      You're right. I was just kinda skipping over this because it looked like more work. I'll just go and whimper in the corner.
      I tried it and got an error:
      SQL ERROR: Can't find column definitions!

      So I tried adding column headings as shown in the DBD::CSV page. Same error. Google doesn't turn up anything for that error either.
      # Associate our csv disk files with table names
      $dbh->{'csv_tables'}->{'info'}     = { 'file' => "psinfooutputfile.csv" };
      $dbh->{'csv_tables'}->{'disks'}    = { 'file' => "disk_results.csv" };
      $dbh->{'csv_tables'}->{'hosts'}    = { 'file' => "netbiosoutputfile.csv" };
      $dbh->{'csv_tables'}->{'template'} = { 'file' => "template.csv" };
      ######
      $dbh->do("DROP TABLE IF EXISTS template");
      $dbh->do("CREATE TABLE template
                (IP CHAR(64), ServerName CHAR(64), Domain CHAR(64), DaysUptime CHAR(64),
                 OS CHAR(64), RAM CHAR(64), OSSP CHAR(64), InstallDate CHAR(64),
                 CPUSpeed CHAR(64), CPUCount CHAR(64), CPUType CHAR(64))
                AS SELECT IP, ServerName, Domain, DaysUptime, OS, RAM, OSSP, InstallDate,
                          CPUSpeed, CPUCount, CPUType
                FROM hosts LEFT JOIN info ON hosts.IP = info.IP
      ");

      Please note, the CHAR(64) was just a dummy value used for testing ONLY.
      Do I have a syntax error?
        That's not the syntax supported by SQL::Statement (see SQL::Statement::Syntax). The columns for creating the table are specified in the SELECT part and created in the new table automatically. So change the create statement to:
        $dbh->do("
            CREATE TABLE template AS
            SELECT IP, ServerName, Domain, DaysUptime, OS, RAM, OSSP,
                   InstallDate, CPUSpeed, CPUCount, CPUType
            FROM hosts LEFT JOIN info ON hosts.IP = info.IP
        ");
Re: Process and combine two CSV files into one
by anonymized user 468275 (Curate) on Aug 09, 2005 at 17:02 UTC
    Depending on what easier means and what you actually want, the following trivial assignment will feed the entries from %a into %c and from %b into %c, replacing any duplicates from %a with the entry having the same key in %b. This seemed to fit your example data.
    %c = ( %a, %b);
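    A toy illustration of that assignment (the sample keys and values here are made up): the hashes flatten into one key/value list, so on duplicate keys the entry from %b, listed second, wins.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample data; duplicate key '10.160.0.10' appears in both.
    my %a = ( '10.160.0.8'  => 'from_a', '10.160.0.10' => 'old' );
    my %b = ( '10.160.0.10' => 'new',    '10.160.0.11' => 'from_b' );

    # List-flattening merge: later keys overwrite earlier ones.
    my %c = ( %a, %b );

    print "$_ => $c{$_}\n" for sort keys %c;
    ```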
    Update: or from the shell (unsophisticated sorting though, else it lengthens a bit):
    cat file1 file2 | perl -e '
        while (<>) {
            /^((\d+\.){3}\d+)(.*)$/ and $c{$1} = "$1$3\n";
        }
        print values %c;
    ' | sort > file3

    One world, one people

      That seems easy enough! I'll keep this one in mind.
      I've got the two files combined in an array. If I use Dumper I can see the contents of the array and they are correct. My problem now is how to print them out in a CSV format. This is what I tried:

      my @row = $dbh->selectall_arrayref("SELECT IP, Domain, ServerName, DaysUptime, OS, RAM, OSSP, InstallDate, CPUSpeed, CPUCount, CPUType FROM hosts INNER JOIN info ON hosts.IP = info.IP");
      #print Dumper @row;
      foreach my $data ( @row ) {
          print("$data");
      }
      How can I get each row of the array to print out on a separate line?
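      One likely culprit, sketched with stand-in data (the literal rows below are invented for illustration): selectall_arrayref returns a single reference to an array of row array-refs, so printing needs two levels of dereferencing before each row can be joined with commas.

      ```perl
      use strict;
      use warnings;

      # Stand-in for what $dbh->selectall_arrayref(...) returns:
      # one reference to an array of row array-refs.
      my $rows = [
          [ '10.160.0.10', 'EURORSCG', 'ERWWDC1',    19  ],
          [ '10.160.0.11', 'EURORSCG', 'ERWWSQLWEB', 266 ],
      ];

      my @lines;
      for my $row (@$rows) {    # outer deref: the list of rows
          # inner deref: the fields; NULLs become empty CSV cells
          push @lines, join ',', map { defined $_ ? $_ : '' } @$row;
      }
      print "$_\n" for @lines;
      ```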

        hmmm, if you are just wondering how to print to a csv file, you can use IO::File and Text::CSV_XS. If you check the documentation for Text::CSV_XS and IO::File, what happens is that you read your csv file one line at a time and each line is put into an array. Anyway, you can do something like

        use strict;
        use Text::CSV_XS;
        use IO::File;

        my ($input1, $output) = @ARGV;

        my $csv = Text::CSV_XS->new({ sep_char => "\t" });
        my %hashed;

        open my $in, '<', $input1 or die "Read Failed: $!\n";
        while (<$in>) {
            $csv->parse($_) or do { warn "Bad data: $_\n"; next };
            my @row = $csv->fields();
            $hashed{ $row[0] } = \@row;
        }
        close $in;

        my $fh = IO::File->new("> $output");
        foreach my $key ( keys %hashed ) {
            if ( $hashed{$key} ) {
                $csv->combine( @{ $hashed{$key} } );
                print "$key has data $hashed{$key}->[0]\n";
                $fh->print( $csv->string, "\n" );
            }
        }
        $fh->close;

        This is just a basic read-in of a tab-separated csv file. All of the input and csv manipulation is covered in the Text::CSV_XS documentation. But to output the file in csv format, the easiest way I found was IO::File.

      I'm just not getting this join right. If I do a
      my @row = $dbh->selectall_arrayref("SELECT IP, ServerName, Domain, DaysUptime, OS, RAM, OSSP, InstallDate, CPUSpeed, CPUCount, CPUType FROM hosts LEFT JOIN info ON hosts.IP = info.IP");
      My 2 tables get joined. The only issue is that I get a warning:
      Execution ERROR: Ambiguous column name 'IP' called from /usr/lib/perl5/vendor_perl/5.8.6/i586-linux-thread-multi/DBI.pm at 1557.

      But if I try and change a column name, every field returned contains the "IP" value, regardless of what the true value should be.

      Also, when I try to join a 3rd table it takes a long time. The 2 tables are done in less than 15 seconds. When I add the 3rd table it's just over 5 minutes.
      my @row = $dbh->selectall_arrayref("SELECT IP, ServerName, Domain, DaysUptime, OS, RAM, OSSP, InstallDate, CPUSpeed, CPUCount, CPUType, PartitionFree FROM hosts, disks LEFT JOIN info ON hosts.IP = info.IP AND hosts.IP = disks.IP");
      In the end, I want ALL data from hosts, and the relevant data from the other tables.
      Once I have these 3 tables combined, I'd like to print them out in a CSV type format. How do I do that?
        The ambiguity error is because the IP just after the SELECT needs to be qualified by its table name (or by an alias, had you used aliases) - just put the table name followed by a dot in front of that IP.

        Changing the column name (all other things being equal) simply forced it to be interpreted as a literal - you want to undo that change.

        It is normal for three tables to take much longer than two to join. The performance strategy for joining tables 1..n where n>2 is as follows:

        - Join the first two tables placing the required columns in a temporary table.

        - Then join the third table with the temporary table, putting the results in a second temporary table and drop the first temporary table.

        - Continue this iterative process of joining a results temporary table with the next real table, until the last real table is joined with the last temporary table; at that point the final results can be obtained directly instead of being stored in a temporary table. This way no more than two tables are physically joined at once, while any number of tables have been logically joined.

        - If this query is intended to be re-used, it should be placed in a stored procedure rather than inside perl code, to avoid unnecessary communication overhead during execution, especially now that it has been split up. For this reason, in terms of performance as well as other considerations, any re-used process of more than one SQL statement ideally belongs in a stored procedure.
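        In SQL terms, the staged approach above might look like the following sketch (table, column, and temporary-table names here are placeholders, reusing the CREATE TABLE ... AS SELECT syntax shown earlier in the thread; the column lists are illustrative, not the full set):

        ```sql
        -- Stage 1: join the first two tables into a temporary table.
        CREATE TABLE tmp1 AS
          SELECT hosts.IP, hosts.ServerName, info.DaysUptime
          FROM hosts LEFT JOIN info ON hosts.IP = info.IP;

        -- Stage 2: join the next real table against the staged result,
        -- then drop the intermediate table.
        CREATE TABLE tmp2 AS
          SELECT tmp1.IP, tmp1.ServerName, tmp1.DaysUptime, disks.PartitionFree
          FROM tmp1 LEFT JOIN disks ON tmp1.IP = disks.IP;
        DROP TABLE tmp1;

        -- Final stage: the last join can be SELECTed directly
        -- instead of being stored in another temporary table.
        ```

        Each statement joins exactly two tables, which is what keeps the cost linear in the number of tables rather than exploding on a single n-way join.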

        Hope this helps!

        -S

        One world, one people