http://www.perlmonks.org?node_id=11131885

Mauri1313 has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I have a little problem that I can´t solve. I have two data frames (DF): the first (Main DF) has 8 columns and 45918 rows, the second (Complementary DF) has 4 columns and 97969 rows. If you want to see the complete DF, here is the link to my GitHub: https://github.com/MauriAndresMU1313/Example_Merge_Dataframes Here an example that how look like the DF. Main DF:

ID1 ID2 dN dS Omega Label_ID1 Label_ID2 + Group AVP78042 AVP78031 0.0059 0.1188 0.0500 SARSr-bat-CoV SARSr-bat-CoV + Intra ATO98108 AVP78031 0.1373 1.4673 0.0936 SARSr-bat-CoV SARSr-bat-CoV + Intra ATO98108 AVP78042 0.1371 1.4224 0.0964 SARSr-bat-CoV SARSr-bat-CoV + Intra

Complementary DF:

Distance ID_1 ID_2 Value DISTANCE AVP78042.1 ATO98108.1 0.29731 DISTANCE AVP78042.1 ATO98120.1 0.29281 DISTANCE AVP78042.1 ATO98132.1 0.33095

In this case, my main goal is to merge the complementary´s columns (ID_1, ID_2 and Value) to the main DF. The ID_1 and ID_2 in both cases are the same, however, have a different order. To do that, I think in to do a match between ID_1 and 2 between both data frames and when the match is true to add the column Value (Complementary DF) to the Main DF. I think in an "if" condition to do that. The problem here maybe is the different amount of rows and the order of the labels (ID_1 and ID_2) between each DF. Previously, I use a Perl script to merge DF:

use strict; use warnings; use feature qw{ say }; sub load { my ($file, $table, $phase) = @_; open my $in, '<', $file or die "$file: $!"; while (<$in>) { chomp; my @columns = split /\t/; my $id = join '_', @columns[0, 1]; die "Duplicate $id." if 'first' eq $phase && exists $table->{$id}; push @{ $table->{$id} }, $columns[2]; say join "\t", @columns[0, 1], @{ $table->{$id} } if 'print' eq $phase; } } my %table; my $phase = 'first'; while (my $file = shift @ARGV) { load($file, \%table, $phase); $phase = 1 == @ARGV ? 'print' : ''; }

I think to complete this script to work in my new goal, I need an "if" condition to generate a match between IDs. Here is an example of a possible output:

ID1 ID2 dN dS Omega Value Label +_ID1 Label_ID2 Group AVP78042 AVP78031 0.0059 0.1188 0.0500 0.29731 SARSr +-bat-CoV SARSr-bat-CoV Intra ATO98108 AVP78031 0.1373 1.4673 0.0936 - SARSr +-bat-CoV SARSr-bat-CoV Intra

I put the "-" in the column Value to show an example of "not match" between ID_1 and ID_2. I tried different ways, however, I can´t solve the problem yet. Any suggestion or comment to do that?