Join files using perl

linseyr has asked for the wisdom of the Perl Monks concerning the following question:

I used UNIX to join two files on their first column, but now I want to do the same on Windows, where the UNIX tools are not available. I wrote a Perl script to do it, but it takes forever. I read that hashes should make this faster. Could somebody help me with this? The two files have different numbers of columns and rows, and wherever column 1 matches, the lines from the two files should be merged. The files are tab-separated. In Unix I just sorted the files and ran join. My Perl code looks like this:


my $file1 = $ARGV[0];
my $file2 = $ARGV[1];

open(first_file,'<', $file1) or die $!;
my @FILE1 = <first_file>;
close(first_file);

open(sec_file,'<', $file2) or die $!;
my @FILE2 = <sec_file>;
close(sec_file);

my @RESULTS;

for my $line(@FILE1){
        chomp $line;    # drop the newline so it does not end up mid-record
        my($ID, @values) = split("\t", $line);

        for my $sec_line(@FILE2){
                chomp $sec_line;
                my($ID2, @values2) = split("\t", $sec_line);

                if($ID eq $ID2){
                        push (@RESULTS, join("\t", $ID, @values, @values2));
                }
        }
}

open(RESULTS,'>','results.txt') or die $!;
foreach(@RESULTS){    # variable names are case-sensitive
   print RESULTS "$_\n";
}
close(RESULTS);

Could somebody help me do this in a faster way? Thanks!



Re: Join files using perl
by AnomalousMonk (Archbishop) on Jan 11, 2013 at 20:03 UTC

In addition to Cygwin, there are also the standalone GNU utilities for Win32 (including sort and join):

Here are some ports of common GNU utilities to native Win32. In this context, native means the executables do only depend on the Microsoft C-runtime (msvcrt.dll) and not an emulation layer like that provided by Cygwin tools.


Re: Join files using perl
by Cristoforo (Curate) on Jan 11, 2013 at 20:17 UTC

ID1     50
ID2     60
ID3     100

ID1     20
ID2     100
ID3      10

C:\Old_Data\perlp>perl t9.pl o44.txt o55.txt
ID1     50      20
ID2     60      100
ID3     100     10

C:\Old_Data\perlp>

#!/usr/bin/perl
use strict;
use warnings;

my %data;
while (<>) { # reads 2 files from @ARGV - filenames are on the command line
    my ($id, $val) = split;
    push @{ $data{$id} }, $val;
}

for my $id (sort keys %data) {
    print join("\t", $id, @{ $data{$id} }), "\n";    
}
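
The same hash idea can be stretched to files with several columns each. What follows is only a sketch under assumptions, not code from this thread: the helper name join_on_first_column is made up, both files are assumed to be tab-separated with the ID in column 1, and, like Unix join, only IDs present in both files are printed. The second file is indexed in a hash once, so each line of the first file costs one lookup instead of a scan of the whole second file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): inner-join two lists of
# tab-separated lines on their first column.
sub join_on_first_column {
    my ($lines1, $lines2) = @_;

    # Index the second file: ID => list of rows, each row an array
    # reference holding the remaining columns of one line.
    my %rows_for;
    for my $line (@$lines2) {
        (my $copy = $line) =~ s/\r?\n\z//;    # strip LF or CRLF
        my ($id, @values) = split /\t/, $copy, -1;
        next unless defined $id && length $id;
        push @{ $rows_for{$id} }, \@values;
    }

    # Walk the first file once and look each ID up in the hash.
    my @joined;
    for my $line (@$lines1) {
        (my $copy = $line) =~ s/\r?\n\z//;
        my ($id, @values) = split /\t/, $copy, -1;
        next unless defined $id && exists $rows_for{$id};
        push @joined, join("\t", $id, @values, @$_) for @{ $rows_for{$id} };
    }
    return \@joined;
}

# Command-line use: perl join.pl file1.txt file2.txt > results.txt
if (@ARGV == 2) {
    my @contents;
    for my $name (@ARGV) {
        open my $fh, '<', $name or die "Cannot open $name: $!";
        push @contents, [<$fh>];
        close $fh;
    }
    print "$_\n" for @{ join_on_first_column(@contents) };
}
```

Output follows the order of the first file, so no sorting is needed beforehand. If an ID appears several times in the second file, one output line is produced per match.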


Re: Join files using perl
by johngg (Canon) on Jan 11, 2013 at 18:50 UTC

Not answering your Perl question but have a look at Cygwin which provides a Unix environment on a Windows PC.

Cheers,

JohnGG


Re: Join files using perl
by blue_cowdawg (Monsignor) on Jan 11, 2013 at 19:23 UTC

Could somebody help me do this in a faster way? Thanks!

Faster? <shrug!> dunno, but pull up a chair. Here are the two input files..

$ cat file1.txt 
1
2
3
4
5

$ cat file2.txt 
5
4
3
2
1

#!/usr/bin/perl -w 
use strict;
use Tie::File;

my ($file1,$file2,$fileout)  = @ARGV;

tie my @ry1,"Tie::File",$file1 or die "$file1:$!";
tie my @ry2,"Tie::File",$file2 or die "$file2:$!";
tie my @out,"Tie::File",$fileout or die "$fileout:$!";

@out=(@ry1,@ry2);

untie @out;
untie @ry2;
untie @ry1;

$ cat out.txt 
1
2
3
4
5
5
4
3
2
1

Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg


Re^2: Join files using perl

by linseyr (Acolyte) on Jan 11, 2013 at 19:42 UTC

FILE1:
ID1     50
ID2     60
ID3     100

FILE2:
ID1     20
ID2     100
ID3      10

OUTPUT:
ID1     50     20
ID2     60     100
ID3     100    10


Re^3: Join files using perl

by blue_cowdawg (Monsignor) on Jan 11, 2013 at 20:58 UTC

my mistake. try this:

#!/usr/bin/perl -w 
use strict;
use Tie::File;

my ($file1,$file2,$fileout)  = @ARGV;

tie my @ry1,"Tie::File",$file1 or die "$file1:$!";
tie my @ry2,"Tie::File",$file2 or die "$file2:$!";
tie my @out,"Tie::File",$fileout or die "$fileout:$!";
my %een=();

my %dat=();
map{ $dat{$_}=[]} grep !$een{$_}++,map { (split(/[\s\t\n]+/,$_))[0] } (@ry1,@ry2);
foreach my $line((@ry1,@ry2)){
        my ($key,@vals)=split(/[\s\t\n]+/,$line);
        push @{$dat{$key}},@vals;
}

         
@out=map { join("\t",($_,@{$dat{$_}})) } keys %dat;

untie @out;
untie @ry2;
untie @ry1;



Re^3: Join files using perl

by Marshall (Canon) on Jan 12, 2013 at 22:28 UTC

#!/usr/bin/perl -w
use strict;

my $FILE1 = <<END;
ID1     50
ID2     60
ID3     100
END

my $FILE2 = <<END;
ID1     20
ID2     100
ID3      10
END

my %ids;

foreach my $file (\$FILE1, \$FILE2) #put the path names of your two
                                    #files here instead. Opening a
                                    #reference to a scalar reads from
                                    #the string in memory, so the two
                                    #here-documents above stand in
                                    #for real files.
{
   open (FILE, "<", $file) or die "unable to open $file for read $!";
   while (<FILE>)
   {
      chomp; # delete trailing \n
      
      # here I split on one or more space characters,
      # A tab char doesn't show up well on this forum's text
      
      my ($id, $value) = split (/\s+/, $_);
      push @{$ids{$id}}, $value;
   }
}

#Each key of %ids maps to a reference to an array holding
#the values seen for that ID. This is called a HoA - Hash of Arrays

foreach my $id (sort keys %ids)
{
   print "$id @{$ids{$id}}\n";
}

#This code will run very fast because each line is
#only read one time - Input/Output (I/O) is very 
#"expensive" 

__END__

OUTPUT:
ID1 50 20
ID2 60 100
ID3 100 10


Re^3: Join files using perl

by linseyr (Acolyte) on Jan 11, 2013 at 19:46 UTC

Oh, and both files contain more columns; that was my main problem. How do I assign an array as the value in a hash?
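
For the array-as-hash-value part, a minimal sketch may help. A hash value must be a scalar, so what gets stored is a reference to an array. The names %columns_for, @values and the sample data are invented for illustration only.

```perl
use strict;
use warnings;

my %columns_for;

# Store a whole array as the value: the square brackets build an
# anonymous array (a copy of @values) and return a reference to it.
my @values = (50, 'abc', 7);
$columns_for{ID1} = [@values];

# Append more columns later; @{ ... } dereferences the stored reference.
push @{ $columns_for{ID1} }, 20, 'xyz';

# A key that does not exist yet gets its array created on first push
# (autovivification), so no explicit initialisation is needed.
push @{ $columns_for{ID2} }, 60;

# Read the whole array back, or just one element of it.
my @all   = @{ $columns_for{ID1} };
my $third = $columns_for{ID1}[2];

# Print one tab-separated line per ID.
print join("\t", $_, @{ $columns_for{$_} }), "\n" for sort keys %columns_for;
```

Splitting a line with my ($id, @rest) = split /\t/, $line; and then doing push @{ $columns_for{$id} }, @rest; is the same pattern applied to one line of a file.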


Re^4: Join files using perl

by linseyr (Acolyte) on Jan 11, 2013 at 19:51 UTC

Re^5: Join files using perl

by ww (Archbishop) on Jan 11, 2013 at 20:19 UTC

Re: Join files using perl
by choroba (Cardinal) on Jan 11, 2013 at 23:15 UTC