PerlMonks  

Join files using perl

by linseyr (Acolyte)
on Jan 11, 2013 at 18:27 UTC (#1012948)
linseyr has asked for the wisdom of the Perl Monks concerning the following question:

I used UNIX to join two files on their first column, but now I want to do the same on Windows, where the UNIX tools aren't available. I wrote a Perl script to do it, but it takes forever. I read that hashes should make this faster. Could somebody help me with this? The two files have different numbers of columns and rows, and lines should be merged when column 1 is equal. The files are tab-separated. What I did in UNIX was just sort the files and join them. My code in Perl looks like this:

    my $file1 = $ARGV[0];
    my $file2 = $ARGV[1];

    open(first_file, '<', $file1) or die $!;
    chomp(my @FILE1 = <first_file>);
    close(first_file);

    open(sec_file, '<', $file2) or die $!;
    chomp(my @FILE2 = <sec_file>);
    close(sec_file);

    my @RESULTS;

    for my $line (@FILE1) {
        my ($ID, @values) = split("\t", $line);
        for my $sec_line (@FILE2) {
            my ($ID2, @values2) = split("\t", $sec_line);
            if ($ID eq $ID2) {
                push(@RESULTS, "$ID @values @values2");
            }
        }
    }

    open(RESULTS, '>', 'results.txt') or die $!;
    foreach (@RESULTS) {
        print RESULTS "$_\n";
    }
    close(RESULTS);

Could somebody help me do this in a faster way? Thanks!

Re: Join files using perl
by johngg (Abbot) on Jan 11, 2013 at 18:50 UTC

    Not answering your Perl question but have a look at Cygwin which provides a Unix environment on a Windows PC.

    Cheers,

    JohnGG

Re: Join files using perl
by blue_cowdawg (Monsignor) on Jan 11, 2013 at 19:23 UTC
        Could somebody help me do this in a faster way? Thanks!

    Faster? <shrug!> dunno, but pull up a chair. Here are the two input files..

        $ cat file1.txt
        1 2 3 4 5
        $ cat file2.txt
        5 4 3 2 1
    and here's some code:
        #!/usr/bin/perl -w
        use strict;
        use Tie::File;

        my ($file1, $file2, $fileout) = @ARGV;

        tie my @ry1, "Tie::File", $file1   or die "$file1:$!";
        tie my @ry2, "Tie::File", $file2   or die "$file2:$!";
        tie my @out, "Tie::File", $fileout or die "$fileout:$!";

        @out = (@ry1, @ry2);

        untie @out;
        untie @ry2;
        untie @ry1;
    which gives you this as an output:
        $ cat out.txt
        1 2 3 4 5
        5 4 3 2 1


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
      Thanks, but this wasn't what I wanted. My files look like this:

          ID1 50
          ID2 60
          ID3 100

          FILE2:
          ID1 20
          ID2 100
          ID3 10

          OUTPUT:
          ID1 50 20
          ID2 60 100
          ID3 100 10

        Oh, and both files contain more columns; that was my main problem. How do I assign an array as the value in a hash?

        My mistake. Try this:

            #!/usr/bin/perl -w
            use strict;
            use Tie::File;

            my ($file1, $file2, $fileout) = @ARGV;

            tie my @ry1, "Tie::File", $file1   or die "$file1:$!";
            tie my @ry2, "Tie::File", $file2   or die "$file2:$!";
            tie my @out, "Tie::File", $fileout or die "$fileout:$!";

            my %een = ();
            my %dat = ();

            # Create an empty array for every distinct key (first column).
            map { $dat{$_} = [] }
                grep !$een{$_}++,
                map { (split(/[\s\t\n]+/, $_))[0] } (@ry1, @ry2);

            # Append the remaining columns of every line to its key's array.
            foreach my $line ((@ry1, @ry2)) {
                my ($key, @vals) = split(/[\s\t\n]+/, $line);
                push @{$dat{$key}}, @vals;
            }

            @out = map { join("\t", ($_, @{$dat{$_}})) } keys %dat;

            untie @out;
            untie @ry2;
            untie @ry1;

        I tested this with your input files and it gave the output you are looking for.


        Consider this:

            #!/usr/bin/perl -w
            use strict;

            my $FILE1 = <<END;
            ID1 50
            ID2 60
            ID3 100
            END

            my $FILE2 = <<END;
            ID1 20
            ID2 100
            ID3 10
            END

            my %ids;

            foreach my $file (\$FILE1, \$FILE2)  #just put path name of
                                                 #FILE1 and FILE2 here.
                                                 #This ref is special because of
                                                 #putting the file contents within
                                                 #the code.
                                                 #FILE1 and 2 are "hereis" docs.
            {
                open (FILE, "<", $file) or die "unable to open $file for read $!";
                while (<FILE>)
                {
                    chomp;   # delete trailing \n
                    # here I split on one or more space characters,
                    # A tab char doesn't show up well on this forum's text
                    my ($id, $value) = split (/\s+/, $_);
                    push @{$ids{$id}}, $value;
                }
            }

            #Each key of the hash %ids contains a reference to
            #an array of values. This is called a HoA - Hash of Array
            foreach my $id (sort keys %ids)
            {
                print "$id @{$ids{$id}}\n";
            }

            #This code will run very fast because each line is
            #only read one time - Input/Output (I/O) is very
            #"expensive"
            __END__
            OUTPUT:
            ID1 50 20
            ID2 60 100
            ID3 100 10
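
For the follow-up question about storing an array as a hash value, here is a minimal sketch (not taken from any reply in this thread). The variable names and sample values are made up for illustration. The point is only that a hash value can hold a reference to an array, so each ID can carry any number of columns.

```perl
use strict;
use warnings;

my %ids;

# Assign a whole array at once: the hash value is a reference to it.
my @cols = (50, 'x', 'y');           # illustrative values, not thread data
$ids{ID1} = [@cols];                 # [ ... ] copies @cols into a new anonymous array

# Or append to the array behind a key. Perl creates the array on first
# use (autovivification), so no separate initialisation is needed.
push @{ $ids{ID2} }, 60, 'z';
push @{ $ids{ID1} }, 20;

# Read the values back by dereferencing with @{ ... }.
for my $id (sort keys %ids) {
    print join("\t", $id, @{ $ids{$id} }), "\n";
}
```

Because every value is an array reference, lines with different numbers of columns can share one hash without special handling.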
Re: Join files using perl
by AnomalousMonk (Abbot) on Jan 11, 2013 at 20:03 UTC

    In addition to Cygwin, there are also the standalone GNU utilities for Win32 (including sort and join):

    Here are some ports of common GNU utilities to native Win32. In this context, native means the executables do only depend on the Microsoft C-runtime (msvcrt.dll) and not an emulation layer like that provided by Cygwin tools.
Re: Join files using perl
by Cristoforo (Deacon) on Jan 11, 2013 at 20:17 UTC
    Using this data (file 1 and file 2):

        ID1 50
        ID2 60
        ID3 100

        ID1 20
        ID2 100
        ID3 10

    I got the results:

        C:\Old_Data\perlp>perl t9.pl o44.txt o55.txt
        ID1 50 20
        ID2 60 100
        ID3 100 10

        C:\Old_Data\perlp>

    The code is:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %data;

        while (<>) { # reads 2 files from @ARGV - filenames are on the command line
            my ($id, $val) = split;
            push @{ $data{$id} }, $val;
        }

        for my $id (sort keys %data) {
            print join("\t", $id, @{ $data{$id} }), "\n";
        }
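
The replies above collect one value per line and keep every ID found in either file. As a hedged sketch of the join the original question describes (tab-separated files, any number of columns, rows merged only when column 1 matches in both files), something like the following might serve. The helper name `join_on_first_column` and the sample rows are hypothetical, not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): inner join of two lists of
# tab-separated lines on their first column.
sub join_on_first_column {
    my ($left, $right) = @_;

    # Index the second file once: ID => list of rows, where each row is
    # an array reference holding the remaining columns.
    my %right_rows;
    for my $line (@$right) {
        my $copy = $line;
        chomp $copy;
        my ($id, @cols) = split /\t/, $copy, -1;
        next unless defined $id && length $id;    # skip blank lines
        push @{ $right_rows{$id} }, \@cols;
    }

    # Walk the first file once and look each ID up in the hash.
    my @joined;
    for my $line (@$left) {
        my $copy = $line;
        chomp $copy;
        my ($id, @cols) = split /\t/, $copy, -1;
        next unless defined $id && exists $right_rows{$id};
        for my $match (@{ $right_rows{$id} }) {
            push @joined, join("\t", $id, @cols, @$match);
        }
    }
    return @joined;
}

# Sample rows modelled on the data in this thread, with an extra column
# and IDs that appear in only one of the two files.
my @file1 = ("ID1\t50\ta\n", "ID2\t60\tb\n", "ID4\t70\tc\n");
my @file2 = ("ID2\t100\n", "ID1\t20\n", "ID3\t10\n");

print "$_\n" for join_on_first_column(\@file1, \@file2);
```

Because the second file is indexed once and the first file is scanned once, the work grows with the combined size of the files rather than with their product, as it does with the nested loops in the question. Neither file needs to be sorted first. To run it on real files, read each file into an array of lines and pass references to both arrays.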
Re: Join files using perl
by choroba (Abbot) on Jan 11, 2013 at 23:15 UTC
    Years ago, I wrote this, and I still use it:
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Join files using perl
by Anonymous Monk on Jan 12, 2013 at 03:01 UTC
