http://www.perlmonks.org?node_id=1012948

linseyr has asked for the wisdom of the Perl Monks concerning the following question:

I used UNIX to join two files on the first column, but now I want to do this on Windows, where the UNIX tools aren't available. I wrote a Perl script to do it, but it takes forever. I read that using hashes should be faster. Could somebody help me with this? The two files have different numbers of columns and rows, and whenever column 1 matches, the lines from the two files should be merged. The files are tab-separated. What I did in Unix was just sort the files and join them. My code in Perl looks like this:
my $file1 = $ARGV[0];
my $file2 = $ARGV[1];

open(my $first_file, '<', $file1) or die $!;
my @FILE1 = <$first_file>;
close($first_file);

open(my $sec_file, '<', $file2) or die $!;
my @FILE2 = <$sec_file>;
close($sec_file);

# strip newlines so they do not end up in the middle of an output line
chomp(@FILE1, @FILE2);

my @RESULTS;
# nested loop: every line of file 2 is split again for every line of file 1
for my $line (@FILE1) {
    my ($ID, @values) = split("\t", $line);
    for my $sec_line (@FILE2) {
        my ($ID2, @values2) = split("\t", $sec_line);
        if ($ID eq $ID2) {
            push(@RESULTS, "$ID @values @values2");
        }
    }
}

open(my $results_fh, '>', 'results.txt') or die $!;
foreach (@RESULTS) {
    print $results_fh "$_\n";
}
close($results_fh);
Could somebody help me do this in a faster way? Thanks!

Replies are listed 'Best First'.
Re: Join files using perl
by AnomalousMonk (Archbishop) on Jan 11, 2013 at 20:03 UTC

    In addition to Cygwin, there are also the standalone GNU utilities for Win32 (including sort and join):

    Here are some ports of common GNU utilities to native Win32. In this context, native means the executables do only depend on the Microsoft C-runtime (msvcrt.dll) and not an emulation layer like that provided by Cygwin tools.
Re: Join files using perl
by Cristoforo (Curate) on Jan 11, 2013 at 20:17 UTC
    Using this data (file 1 and file 2):
    ID1 50
    ID2 60
    ID3 100

    ID1 20
    ID2 100
    ID3 10
    I got the results:
    C:\Old_Data\perlp>perl t9.pl o44.txt o55.txt
    ID1 50 20
    ID2 60 100
    ID3 100 10

    C:\Old_Data\perlp>
    The code is:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %data;

    while (<>) { # reads 2 files from @ARGV - filenames are on the command line
        my ($id, $val) = split;
        push @{ $data{$id} }, $val;
    }

    for my $id (sort keys %data) {
        print join("\t", $id, @{ $data{$id} }), "\n";
    }
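The script above keeps one value per line and prints every ID it sees, including IDs that occur in only one file. Since the original question mentions several columns and Unix `join` behaviour (only matching IDs are merged), here is a minimal sketch of that variant. It assumes tab-separated input and that both files fit in memory; the sub names `parse_line` and `join_on_first_column` are hypothetical, chosen for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a tab-separated line into its ID and the
# remaining columns, after removing a trailing LF or CRLF.
sub parse_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    my ($id, @cols) = split /\t/, $line, -1;
    return ($id, @cols);
}

# Inner join of two lists of lines on the first column.
# Returns one tab-separated line per matching pair, in file-1 order.
sub join_on_first_column {
    my ($lines1, $lines2) = @_;

    # Index the second input once: ID => list of column lists.
    my %by_id;
    for my $line (@$lines2) {
        my ($id, @cols) = parse_line($line);
        next unless defined $id && length $id;
        push @{ $by_id{$id} }, \@cols;
    }

    # One pass over the first input; each match is a hash lookup,
    # not a scan over all of file 2.
    my @joined;
    for my $line (@$lines1) {
        my ($id, @cols) = parse_line($line);
        next unless defined $id && exists $by_id{$id};
        push @joined, join("\t", $id, @cols, @$_) for @{ $by_id{$id} };
    }
    return @joined;
}

# Usage: perl join.pl file1 file2 > results.txt
if (@ARGV == 2) {
    my @inputs;
    for my $path (@ARGV) {
        open my $fh, '<', $path or die "Cannot open $path: $!";
        push @inputs, [<$fh>];
        close $fh;
    }
    print "$_\n" for join_on_first_column(@inputs);
}
```

Unlike sort-then-join, this needs no sorted input. The cost is that the second file is held in a hash, so memory use grows with its size.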
Re: Join files using perl
by johngg (Canon) on Jan 11, 2013 at 18:50 UTC

    Not answering your Perl question but have a look at Cygwin which provides a Unix environment on a Windows PC.

    Cheers,

    JohnGG

Re: Join files using perl
by blue_cowdawg (Monsignor) on Jan 11, 2013 at 19:23 UTC
        Could somebody help me do this on a faster way? Thanks!

    Faster? <shrug!> dunno, but pull up a chair. Here are the two input files..

    $ cat file1.txt
    1 2 3 4 5
    $ cat file2.txt
    5 4 3 2 1
    and here's some code:
    #!/usr/bin/perl -w
    use strict;
    use Tie::File;

    my ($file1,$file2,$fileout) = @ARGV;

    tie my @ry1,"Tie::File",$file1 or die "$file1:$!";
    tie my @ry2,"Tie::File",$file2 or die "$file2:$!";
    tie my @out,"Tie::File",$fileout or die "$fileout:$!";

    @out=(@ry1,@ry2);

    untie @out;
    untie @ry2;
    untie @ry1;
    which gives you this as an output:
    $ cat out.txt
    1 2 3 4 5 5 4 3 2 1


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
      Thanks, but this wasn't what I wanted. My files look like:
      FILE1:
      ID1 50
      ID2 60
      ID3 100
      FILE2:
      ID1 20
      ID2 100
      ID3 10
      OUTPUT:
      ID1 50 20
      ID2 60 100
      ID3 100 10

        my mistake. try this:

        #!/usr/bin/perl -w
        use strict;
        use Tie::File;

        my ($file1,$file2,$fileout) = @ARGV;

        tie my @ry1,"Tie::File",$file1 or die "$file1:$!";
        tie my @ry2,"Tie::File",$file2 or die "$file2:$!";
        tie my @out,"Tie::File",$fileout or die "$fileout:$!";

        my %een=();
        my %dat=();

        map{ $dat{$_}=[]}
            grep !$een{$_}++,
            map { (split(/[\s\t\n]+/,$_))[0] } (@ry1,@ry2);

        foreach my $line((@ry1,@ry2)){
            my ($key,@vals)=split(/[\s\t\n]+/,$line);
            push @{$dat{$key}},@vals;
        }

        @out=map { join("\t",($_,@{$dat{$_}})) } keys %dat;

        untie @out;
        untie @ry2;
        untie @ry1;
        Using your input files this was tested and gave output that you are looking for...


        Peter L. Berghold -- Unix Professional
        Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
        Consider this:
        #!/usr/bin/perl -w
        use strict;

        my $FILE1 = <<END;
        ID1 50
        ID2 60
        ID3 100
        END

        my $FILE2 = <<END;
        ID1 20
        ID2 100
        ID3 10
        END

        my %ids;
        foreach my $file (\$FILE1, \$FILE2)  #just put path name of
                                             #FILE1 and FILE2 here.
                                             #This ref is special because of
                                             #putting the file contents within
                                             #the code.
                                             #FILE1 and 2 are "hereis" docs.
        {
            open (FILE, "<", $file) or die "unable to open $file for read $!";
            while (<FILE>)
            {
                chomp;   # delete trailing \n
                # here I split on one or more space characters,
                # A tab char doesn't show up well on this forum's text
                my ($id, $value) = split (/\s+/, $_);
                push @{$ids{$id}}, $value;
            }
        }

        #Each key of the hash of %ids contains a reference to
        #an array of id's. This is called a HoA - Hash of Array

        foreach my $id (sort keys %ids)
        {
            print "$id @{$ids{$id}}\n";
        }

        #This code will run very fast because each line is
        #only read one time - Input/Output (I/O) is very
        #"expensive"
        __END__
        OUTPUT:
        ID1 50 20
        ID2 60 100
        ID3 100 10
        Oh, and both files contain more columns; that was my main problem. How do I assign an array as the value in a hash?
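On that last question: a hash value is always a scalar, so the usual approach is to store a reference to an array. A minimal sketch follows; the sample lines and the name `%rows` are made up for illustration.

```perl
use strict;
use warnings;

my %rows;

# Store a copy of the columns as an anonymous array under the ID.
my ($id, @values) = split /\t/, "ID1\t50\t0.3\tfoo";
$rows{$id} = [@values];

# Append more columns to the same ID later (for example from file 2).
# If the key were new, push would create the array automatically.
my ($id2, @values2) = split /\t/, "ID1\t20\tbar";
push @{ $rows{$id2} }, @values2;

# Dereference with @{ ... } to get the list back.
for my $key (sort keys %rows) {
    print join("\t", $key, @{ $rows{$key} }), "\n";
}
```

A single column can then be read with `$rows{ID1}[0]`, and the number of stored columns with `scalar @{ $rows{ID1} }`.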
Re: Join files using perl
by choroba (Cardinal) on Jan 11, 2013 at 23:15 UTC
    Years ago, I wrote this, and I still use it:
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Join files using perl
by Anonymous Monk on Jan 12, 2013 at 03:01 UTC