Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Is it possible to merge data frames with different amounts of rows?

by Mauri1313 (Initiate)
on Apr 29, 2021 at 17:00 UTC ( #11131885=perlquestion: print w/replies, xml ) Need Help??

Mauri1313 has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I have a little problem that I can´t solve. I have two data frames (DF): the first (Main DF) has 8 columns and 45918 rows, the second (Complementary DF) has 4 columns and 97969 rows. If you want to see the complete DF, here is the link to my GitHub: https://github.com/MauriAndresMU1313/Example_Merge_Dataframes Here an example that how look like the DF. Main DF:

ID1 ID2 dN dS Omega Label_ID1 Label_ID2 + Group AVP78042 AVP78031 0.0059 0.1188 0.0500 SARSr-bat-CoV SARSr-bat-CoV + Intra ATO98108 AVP78031 0.1373 1.4673 0.0936 SARSr-bat-CoV SARSr-bat-CoV + Intra ATO98108 AVP78042 0.1371 1.4224 0.0964 SARSr-bat-CoV SARSr-bat-CoV + Intra

Complementary DF:

Distance ID_1 ID_2 Value DISTANCE AVP78042.1 ATO98108.1 0.29731 DISTANCE AVP78042.1 ATO98120.1 0.29281 DISTANCE AVP78042.1 ATO98132.1 0.33095

In this case, my main goal is to merge the complementary´s columns (ID_1, ID_2 and Value) to the main DF. The ID_1 and ID_2 in both cases are the same, however, have a different order. To do that, I think in to do a match between ID_1 and 2 between both data frames and when the match is true to add the column Value (Complementary DF) to the Main DF. I think in an "if" condition to do that. The problem here maybe is the different amount of rows and the order of the labels (ID_1 and ID_2) between each DF. Previously, I use a Perl script to merge DF:

use strict; use warnings; use feature qw{ say }; sub load { my ($file, $table, $phase) = @_; open my $in, '<', $file or die "$file: $!"; while (<$in>) { chomp; my @columns = split /\t/; my $id = join '_', @columns[0, 1]; die "Duplicate $id." if 'first' eq $phase && exists $table->{$id}; push @{ $table->{$id} }, $columns[2]; say join "\t", @columns[0, 1], @{ $table->{$id} } if 'print' eq $phase; } } my %table; my $phase = 'first'; while (my $file = shift @ARGV) { load($file, \%table, $phase); $phase = 1 == @ARGV ? 'print' : ''; }

I think to complete this script to work in my new goal, I need an "if" condition to generate a match between IDs. Here is an example of a possible output:

ID1 ID2 dN dS Omega Value Label +_ID1 Label_ID2 Group AVP78042 AVP78031 0.0059 0.1188 0.0500 0.29731 SARSr +-bat-CoV SARSr-bat-CoV Intra ATO98108 AVP78031 0.1373 1.4673 0.0936 - SARSr +-bat-CoV SARSr-bat-CoV Intra

I put the "-" in the column Value to show an example of "not match" between ID_1 and ID_2. I tried different ways, however, I can´t solve the problem yet. Any suggestion or comment to do that?

Replies are listed 'Best First'.
Re: Is it possible to merge data frames with different amounts of rows?
by Fletch (Chancellor) on Apr 29, 2021 at 18:33 UTC

    (Stepping back a step: you might see if you couldn't use the standard *NIX join command as this kind of thing is possibly right up its alley. But if you need to do more involved processing later, you probably want to stick with perl . . .)

    If your "second" file is small enough you could probably just read over and parse it first and build a hash (%value_for_id) to map from your "id" to the corresponding value (presuming that's a 1-to-1 mapping; more on that later). Then you'd just read through the "main" file and append/insert the value column by retrieving it from that hash. If it's really really big then you could use a database backed hash (DBM_File, GDBM_File) and let it be stored to a file rather than having to keep everything in RAM.

    If your mapping isn't 1-to-1 (so for id "AVP78042,AVP78031" there were (say) three values (0.29731, 0.8675309, 0.2112)) you need to figure out what that means. One possible solution would be to output three copies of the line (one with each value). In that case you'd do something similar but build a Hash Of Arrayrefs (similar to what you're doing in your sample) rather than just a flat hash. While you're processing the main file you'd iterate over the matching values with something like:

    ## presume @row has the current (main) row and we're adding value as l +ast column ## replace with (e.g.) splice or whatever as appropriate for my $value ( @{ $values_for_id{ $id } // [ "-" ] } ) { say join( qq{\t}, @row, $value ); }

    Also you may want to look at Text::CSV_XS for parsing your files.

    Edit: Derp I completely missed where you gave the numbers of lines; I thought you were talking much bigger files. the reading the first file into a hash approach should be more than fine and shouldn't stress the available memory on any modern box.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Is it possible to merge data frames with different amounts of rows?
by haj (Curate) on Apr 29, 2021 at 18:46 UTC

    I agree with Fletch's suggestion that a good structure for merging your data is a "hash of hashes", so let me just point out some gotchas in your input data which you might want to clean up:

    • The columns are named ID1 and ID2 in one input file but ID_1 and ID_2 in the other.
    • The second input file has an extra column Distance so just counting the elements from which the identifier is built won't work.
    • These may be a PerlMonks copypaste artifact: Both input files start with an empty line, and the separator isn't a tab. Perhaps it is more robust to split /\s+/ instead?
Re: Is it possible to merge data frames with different amounts of rows?
by kcott (Bishop) on Apr 30, 2021 at 01:13 UTC

    G'day Mauri1313,

    Welcome to the Monastery.

    In addition to the issues pointed out by haj, I see:

    • No matches between ID1 and ID_1 (unless I add .1 to ID1 values or remove .1 from ID_1 values).
    • I followed the github link you provided in an attempt to resolve some issues; however, it just informed me that the data was too large to display.
    • I don't see ID_2 being used anywhere in the processing.
    • I don't know what "The ID_1 and ID_2 in both cases are the same, however, have a different order." means (even allowing for the ID1/ID_1 and ID2/ID_2 discrepancies).

    [I appreciate this is your first post here. For future reference, please read "How do I post a question effectively?": a clearer post results in less guesswork from us and a quicker and better answer for you.]

    In order to produce the output you show from the two inputs you presented, I might write something like this:

    #!/usr/bin/env perl use strict; use warnings; use autodie; my $main_df_file = 'pm_11131885_main_df'; my $comp_df_file = 'pm_11131885_comp_df'; my %val_for; { open my $comp_df_fh, '<', $comp_df_file; while (<$comp_df_fh>) { next if $. == 1; chomp; my (undef, $id, undef, $val) = split /\s+/; my $key = (split /\./, $id)[0]; $val_for{$key} = $val unless exists $val_for{$key}; } } { my @out_headers = qw{ID1 ID2 dN dS Omega Value Label_ID1 Label_ID2 + Group}; my $out_fmt = "%-8s\t%-8s\t%-6s\t%-6s\t%-6s\t%-7s\t%-13s\t%-13s\t% +-5s\n"; my @pre_val_indices = 0 .. 4; my @post_val_indices = 5 .. 7; my $no_match_val = ' -'; my %seen; printf $out_fmt, @out_headers; open my $main_df_fh, '<', $main_df_file; while (<$main_df_fh>) { next if $. == 1; chomp; next unless length; my @elements = split /\s+/; my $key = $elements[0]; next if $seen{$key}++; my $val = exists($val_for{$key}) ? $val_for{$key} : $no_match_ +val; printf $out_fmt, @elements[@pre_val_indices], $val, @elements[@post_val_ind +ices]; } }

    Output:

    ID1 ID2 dN dS Omega Value Label_ +ID1 Label_ID2 Group AVP78042 AVP78031 0.0059 0.1188 0.0500 0.29731 SARSr- +bat-CoV SARSr-bat-CoV Intra ATO98108 AVP78031 0.1373 1.4673 0.0936 - SARSr- +bat-CoV SARSr-bat-CoV Intra

    That might be exactly what you're looking for; if not, it may provide some ideas about how to tackle the problem.

    — Ken

Re: Is it possible to merge data frames with different amounts of rows?
by perlfan (Vicar) on May 06, 2021 at 06:54 UTC
    You may find Hash::Merge interesting. Finding gems like this is why I watch CPAN recent like a hawk.
      I watch CPAN recent like a hawk

      How do you do this? What's your recommendation for those who would like to do the same thing? tia.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11131885]
Approved by Corion
Front-paged by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (4)
As of 2021-05-11 05:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Perl 7 will be out ...





    Results (113 votes). Check out past polls.

    Notices?