slavailn has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys,

I face the problem of joining several tab-delimited files into one table. It is certainly possible to do with a lot of clicking in Excel, but when the number of files is close to a hundred that becomes a real pain. Here are the details:

Our files: file_1, file_2, ..., file_n contain only two columns as follows:

A 12

B 152

C 123

D 456

The first column contains "categories" and stays the same in every file; the second contains some numeric values. I would like to create a table where the first column is the "categories" and the remaining columns are the second columns from every file, with headers corresponding to the file names. Is there an easy way to do something like that in Perl? I have only beginner skills, and whatever I was trying to hack together doesn't work.

Alternatively, I could use the paste command from the shell, but it doesn't work the way I want.

Thanks a lot in advance!

Re: simple table join script
by roboticus (Chancellor) on Jun 01, 2012 at 03:13 UTC

    slavailn:

    Yes, that's pretty easy to do in Perl. You'll want to get Spreadsheet::ParseExcel or Spreadsheet::XLSX, which will help you disassemble the current spreadsheets. The first page of the documentation will help you dig the data out of your spreadsheet files. Then you can print the data table out as a .CSV file, which you can load into a spreadsheet fairly easily, or you can go the extra mile and get Spreadsheet::WriteExcel to generate a new spreadsheet with the information you want.
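
    In case the parsing part isn't obvious from the docs yet, here is a minimal sketch of where $worksheet, $row_min and $row_max in the code below come from (an illustration only, assuming each input is an .xls workbook with the two columns on its first worksheet):

    use strict;
    use warnings;
    use Spreadsheet::ParseExcel;

    my $parser   = Spreadsheet::ParseExcel->new;
    my $workbook = $parser->parse($ARGV[0]) or die $parser->error;
    my ($worksheet) = $workbook->worksheets;         # first worksheet only

    my ($row_min, $row_max) = $worksheet->row_range;
    for my $row ($row_min .. $row_max) {
        my $cat_cell = $worksheet->get_cell($row, 0);
        my $val_cell = $worksheet->get_cell($row, 1);
        next unless $cat_cell && $val_cell;          # get_cell returns undef for empty cells
        printf "%s\t%s\n", $cat_cell->value, $val_cell->value;
    }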

    The tricky bit will be when some of your spreadsheets are missing a row or have an extra row and you want to get the data lined up. So I'd suggest using a hash of arrays, where the first column of the array would be the first file, the second column the second file, and so on.

    A rough sketch of what I'd do is like this:

    my @Files;
    my %Table;

    while (my $FName = shift) {
        push @Files, $FName;

        # open spreadsheet & get worksheet, as shown in the Spreadsheet::ParseExcel docs

        for my $row ($row_min .. $row_max) {
            my $category = $worksheet->get_cell($row, 0)->value;  # first column has category
            my $value    = $worksheet->get_cell($row, 1)->value;  # second column has value
            $Table{$category}[$#Files] = $value;  # Store into this file's slot in the table
        }
    }

    When you're done, the resulting data might look something like:

    @Files = ('file1', 'file2');
    %Table = (
        #        file1  file2
        'A' => [  12,    34  ],
        'B' => [ 152,   567  ],
    );
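
    A possible follow-up (a sketch added here for completeness, assuming @Files and %Table end up populated exactly as above; the CATEGORY header label is just a placeholder) is to print that structure back out as the tab-separated table the OP asked for:

    # Print a header row of file names, then one row per category,
    # leaving the cell empty where a file had no row for that category.
    print join("\t", 'CATEGORY', @Files), "\n";
    for my $category (sort keys %Table) {
        my @row = map { defined $Table{$category}[$_] ? $Table{$category}[$_] : '' } 0 .. $#Files;
        print join("\t", $category, @row), "\n";
    }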

    Let us know if you need any further hints.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: simple table join script
by NetWallah (Canon) on Jun 01, 2012 at 03:17 UTC
    This one works with a list of simple files:
    use strict;
    use warnings;

    my ( %file, $idx, %cat, $x );
    $idx = 1;                            # Start at 1 to avoid using the //= operator
    while (<>) {                         # read all files in @ARGV
        chomp;
        my ($c, $val) = split;           # Separate the category from the value
        next unless $c;                  # Skip blank lines
        $file{$ARGV} ||= $idx++;         # Increment $idx for each new file
        $cat{$c}[ $file{$ARGV} ] = $val; # Save in $cat{Category}[FileIdx]
    }
    print "\t", map({ "$_\t" } sort keys %file), "\n";    # List of files
    my @FileIndexes = map { $file{$_} } sort keys %file;
    for my $k (sort keys %cat) {
        print "$k\t", map({ (defined($x = $cat{$k}[$_]) ? $x : "") . "\t" } @FileIndexes), "\n";
    }
    Run and results:
    $ perl test-tab1.pl test-tabd*
            test-tabdata.txt  test-tabdata2.txt  test-tabdata2.txt~
    A       12                33                 9
    B       152               221                333
    C       123               423                444
    D       456               43                 555
    E       99
    Update: Noticed you were "new to perl", so I added comments and simplified a bit.
    Update 2: Fixed a subtle bug where the first value was not being reported, as noted by sauoq below (FWIW, I found it independently).

                 I hope life isn't a big joke, because I don't get it.
                       -SNL

      That doesn't work correctly.

      • It changes the order of the rows.
      • It drops a value from the first row.
      • (Minor) It splits on spaces, not tabs.

      You are jumping through a lot of contortions, seemingly to avoid explicitly opening files. That results in a lot of inefficiency; all that sorting could get expensive if the data sets are big. Using hashes for data that are already ordered and whose order you want to retain is a bad design decision.

      And even with your comments, it's not particularly easy to read (for me, and I'm not new to Perl).

      The following might not win any style points, but it is straightforward and relatively efficient. And working.

      my $header;

      # Handle the first file separately so we can keep the "categories."
      my $file = shift;
      $header = "CATS \t$file";    # CATS is category column header.

      open my $fh, '<', $file or die "$file: $!";
      my @lines = <$fh>;
      close $fh;
      chomp(@lines);

      for my $file (@ARGV) {
          # Add filename to header.
          $header .= "\t$file";

          open my $fh, '<', $file or die "$file: $!";

          # Iterate through file
          my $i = 0;
          while (my $line = <$fh>) {
              chomp $line;

              # Append tab and 2nd column to appropriate line.
              $lines[$i++] .= "\t" . (split /\t/, $line)[1];
          }
          close $fh;
      }

      # print our header and each of our new lines.
      print $header, "\n";
      print "$_\n" for @lines;
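
      A quick usage sketch (the script and file names here are placeholders, not from the thread): save the code above as, say, join_cols.pl and pass all the two-column files on the command line; the first file named supplies the category column:

      $ perl join_cols.pl file_1 file_2 file_3 > joined_table.tsv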

      -sauoq
      "My two cents aren't worth a dime.";
        Addressing your points:
        • It changes the order of the rows: Yes - it SORTS them. There is no change if they are already sorted.
        • It drops a value from the first row: Agreed - I found and fixed this bug this AM (the original post was late at night).
        • (Minor) It splits on spaces, not tabs: Correction - it splits on WHITESPACE, which includes tabs (see the short illustration below).
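
        A tiny illustration of that last point (not part of the original exchange; the sample strings are made up):

        use strict;
        use warnings;

        # Default split (split ' ') splits on any run of whitespace, tabs included:
        my ($c, $val) = split ' ', "A\t12";
        print "$c $val\n";               # prints "A 12"

        # An explicit /\t/ pattern splits on single tab characters only,
        # so empty fields between tabs are preserved:
        my @fields = split /\t/, "A\t\t12";
        print scalar(@fields), "\n";     # prints 3: ('A', '', '12')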

        I agree - it is not the easiest code to read, but I think it is more data-tolerant than yours, which complains under "use warnings" if the data contains blank lines (which are present in the OP).

        Regarding efficiency - sorting a million pre-sorted records takes less than a second on modern computers, so I don't see an issue.

        I do appreciate your critique and enjoy the discussion, but at times I have a low threshold for responding to nits, so I apologize in advance if I appear to be unresponsive to subsequent posts.

                     I hope life isn't a big joke, because I don't get it.
                           -SNL

Re: simple table join script
by Anonymous Monk on Jun 01, 2012 at 02:37 UTC

    Thanks a lot in advance!

    Use the search, find your answer, you'll be surprised how easy it is :)