http://www.perlmonks.org?node_id=863812


in reply to Joining separate data files to make one.

Here's another way to do it:

#!perl
use 5.10.0;
use strict;
use warnings;
my %merged = ();
my $index = 0;
map {
    open my $file, '<', $_ or die $!;
    map {
        $merged{$_->[0]} //= [ qw{null} x 3 ];
        $merged{$_->[0]}[$index] = $_->[1];
    }
    map { [ m{ \A ( \S+ \s \S+ \s \S+ \s \S+ ) \s ( \S+ ) \z }msx ] }
    map { chomp; $_ }
    (<$file>);
    close $file;
    ++$index;
} qw{gravity magnetics bathymetry};
say join(' ', $_, @{$merged{$_}}) for sort keys %merged;

I put the script in a file called geo_file_join.pl and made some short test files:

ken@Miranda ~/c/_/tmp $ cat gravity
2010-10-01 00:00:03 lat1 long1 grav1
2010-10-02 00:00:05 lat2 long2 grav2
2010-10-03 00:00:07 lat3 long3 grav3
ken@Miranda ~/c/_/tmp $ cat magnetics
2010-10-02 00:00:05 lat2 long2 mag1
2010-10-03 00:00:07 lat3 long3 mag2
2010-10-04 00:00:09 lat4 long4 mag3
ken@Miranda ~/c/_/tmp $ cat bathymetry
2010-10-03 00:00:07 lat3 long3 bath1
2010-10-04 00:00:09 lat4 long4 bath2
2010-10-05 00:00:01 lat3 long3 bath3

Here's the output:

ken@Miranda ~/c/_/tmp $ geo_file_join.pl
2010-10-01 00:00:03 lat1 long1 grav1 null null
2010-10-02 00:00:05 lat2 long2 grav2 mag1 null
2010-10-03 00:00:07 lat3 long3 grav3 mag2 bath1
2010-10-04 00:00:09 lat4 long4 null mag3 bath2
2010-10-05 00:00:01 lat3 long3 null null bath3

Assuming your latitudes and longitudes are in some sortable format, this will sort by the first 4 fields (i.e. date, time, latitude and longitude).

Re^2: Joining separate data files to make one.
by msexton (Initiate) on Oct 07, 2010 at 09:40 UTC

    Hi,

    Thanks for this it worked well.

    The only problem is, I can't for the life of me figure it out. The multiple calls to map have me perplexed. I spent most of the day reading up about map, and am still a bit confused.

    I know I didn't give an example of my data files, but you were almost spot on. If I can ask a favour, how would the code vary, if the gravity and magnetics files had a 6th field, whilst the bathymetry remained at five?

    One of the other replies I received was a bit easier to understand, but did not handle the situation where a later file (eg bathymetry) ends before (in time) an earlier file (eg magnetics). It did not add nulls to the hash. It worked well when files started later than previous files.

      The multiple calls to map have me perplexed.

      Sure. map is abused here to work as foreach and while.

      map {
          open my $file, '<', $_ or die $!;
          map {
              $merged{$_->[0]} //= [ qw{null} x 3 ];
              $merged{$_->[0]}[$index] = $_->[1];
          }
          map { [ m{ \A ( \S+ \s \S+ \s \S+ \s \S+ ) \s ( \S+ ) \z }msx ] }
          map { chomp; $_ }
          (<$file>);
          close $file;
          ++$index;
      } qw{gravity magnetics bathymetry};

      The outer map is really a foreach:

      foreach my $filename (qw{gravity magnetics bathymetry}) {
          open my $file, '<', $filename or die $!;
          map {
              $merged{$_->[0]} //= [ qw{null} x 3 ];
              $merged{$_->[0]}[$index] = $_->[1];
          }
          map { [ m{ \A ( \S+ \s \S+ \s \S+ \s \S+ ) \s ( \S+ ) \z }msx ] }
          map { chomp; $_ }
          (<$file>);
          close $file;
          ++$index;
      }

      The last map inside the foreach loop simply iterates over all lines of the file and strips trailing newlines. Then it passes each line to the middle map, which extracts some parts of the line, and returns an array reference with the matches. The first map is again abused as a foreach.

      Using while (<$file>) would make that more readable:

      foreach my $filename (qw{gravity magnetics bathymetry}) {
          open my $file, '<', $filename or die $!;
          while (<$file>) {
              chomp;
              # $a[0] is the key (first four fields), $a[1] the value (fifth field)
              my @a = m{ \A ( \S+ \s \S+ \s \S+ \s \S+ ) \s ( \S+ ) \z }msx;
              $merged{$a[0]} //= [ qw{null} x 3 ];
              $merged{$a[0]}[$index] = $a[1];
          }
          close $file;
          ++$index;
      }
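
A chain of map calls is evaluated from right to left, which is easier to see when each stage gets its own variable. The following is only a sketch: the @lines array is a hypothetical stand-in for the lines read from <$file>, and the intermediate names @chomped and @pairs do not appear in the original script.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical sample lines standing in for (<$file>).
my @lines = (
    "2010-10-01 00:00:03 lat1 long1 grav1\n",
    "2010-10-02 00:00:05 lat2 long2 grav2\n",
);

# Stage 1 (the rightmost map): strip the trailing newline from a copy of each line.
my @chomped = map { chomp(my $line = $_); $line } @lines;

# Stage 2 (the middle map): turn each line into [ key, value ],
# where key is the first four fields and value is the fifth.
my @pairs = map {
    [ m{ \A ( \S+ \s \S+ \s \S+ \s \S+ ) \s ( \S+ ) \z }msx ]
} @chomped;

# Stage 3 (the leftmost map in the original): run only for its side effect
# on the hash, so it is written here as a plain loop.
my %merged;
for my $pair (@pairs) {
    $merged{ $pair->[0] } = $pair->[1];
}

say "$_ => $merged{$_}" for sort keys %merged;
```

Writing the stages out this way costs two temporary arrays but makes it clear which map feeds which.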

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      Thanks for this it worked well.

      You're welcome. I enjoyed writing it.

      The only problem is, I can't for the life of me figure it out. The multiple calls to map have me perplexed. I spent most of the day reading up about map, and am still a bit confused.

      Alexander has provided a breakdown of what's going on here. Feel free to ask if anything needs further explanation.

      I know I didn't give an example of my data files, but you were almost spot on.

      Your original question was pretty clear. I felt I had a reasonable understanding of what you were after.

      If I can ask a favour, how would the code vary, if the gravity and magnetics files had a 6th field, whilst the bathymetry remained at five?

      I'm happy to answer that with a little more information.

      • Is the 6th field to be added to the final data as an additional field?
      • Are the 5th and 6th fields to be combined and then added?
      • Is the 6th field just extraneous data to be discarded?
      • Something else?

      Finally, on the timing issue, I staggered the date-time fields through the test data to take that into consideration. Within each test file the times are ordered though. If your live data is not necessarily in chronological order, you might want to jumble up the lines in one or more files. I think it should still work but I didn't specifically test for that scenario.
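
One way to check the ordering question without touching the real files is to run the same merge logic over in-memory lines, once in order and once reversed. This is a minimal sketch, not the script itself: merge_lines is a hypothetical helper, the sample lines are taken from the test files, and skipping non-matching lines is an added assumption.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: merge several "files" (array refs of lines) keyed on
# the first four fields, filling gaps with 'null' as the script does.
sub merge_lines {
    my @sources = @_;
    my %merged;
    my $index = 0;
    for my $lines (@sources) {
        for my $line (@$lines) {
            my ($key, $value) =
                $line =~ m{ \A ( \S+ \s \S+ \s \S+ \s \S+ ) \s ( \S+ ) \z }msx
                or next;    # assumption: silently skip lines that do not match
            $merged{$key} //= [ ('null') x @sources ];
            $merged{$key}[$index] = $value;
        }
        ++$index;
    }
    # The output order comes from sorting the keys, not from the input order.
    return join "\n", map { join ' ', $_, @{ $merged{$_} } } sort keys %merged;
}

my @gravity = (
    '2010-10-01 00:00:03 lat1 long1 grav1',
    '2010-10-02 00:00:05 lat2 long2 grav2',
);
my @magnetics = (
    '2010-10-02 00:00:05 lat2 long2 mag1',
    '2010-10-03 00:00:07 lat3 long3 mag2',
);

my $in_order = merge_lines(\@gravity, \@magnetics);
my $jumbled  = merge_lines([ reverse @gravity ], [ reverse @magnetics ]);

say $in_order;
say $in_order eq $jumbled ? 'same result' : 'different result';
```

Because every record lands in the hash under its own key and the keys are sorted only at output time, the order of lines within a file should not affect the result, provided each key occurs at most once per file.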

      Regards,

      Ken

        Hi Ken and Alexander,

        I didn't get a chance to examine your responses today, as I spent most of the day implementing the suggestions from one of the other respondents. The biggest problem I had was extracting the elements out of the hash to write them to the final output file, but I eventually got it to work correctly.

        I will examine your suggestions on Monday.

        In the meantime, I have appended a couple of fields to the datasets you sent me to show you basically what I have.

        The hash should contain:

        Date, Time, all 5 remaining fields from gravity, all 4 remaining fields from magnetics, and three remaining fields from bathymetry.

        Whilst the navigation should be the same in all three files, by putting them into the hash I can check that they are. If they are not essentially the same, then I know that a problem exists.

        No further processing of the fields is done (i.e. addition, etc.). They are just read from the hash and written in a specific format (MGD77) to an output file. With the work I did today, I think I can manage that. Where there are multiple navigations (most times), I have a hierarchy and select what I believe is the best for output (usually bathymetry).

        Here is the updated file structure you sent me (I can't see how to make the attachment you did):

        $ cat gravity
        2010-10-01 00:00:03 lat1 long1 grav1 g_anom1 eotvos1
        2010-10-02 00:00:05 lat2 long2 grav2 g_anom2 eotvos2
        2010-10-03 00:00:07 lat3 long3 grav3 g_anom3 eotvos3

        $ cat magnetics
        2010-10-02 00:00:05 lat2 long2 mag1 m_anom1
        2010-10-03 00:00:07 lat3 long3 mag2 m_anom2
        2010-10-04 00:00:09 lat4 long4 mag3 m_anom3

        $ cat bathymetry
        2010-10-03 00:00:07 lat3 long3 bath1
        2010-10-04 00:00:09 lat4 long4 bath2
        2010-10-05 00:00:01 lat3 long3 bath3

        Thanks once again

        Mike
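
For the layout described above (key on date and time, keep every remaining field from each file), the merge might be adapted roughly as follows. This is a sketch under stated assumptions rather than a tested solution: the field counts in @layout come from the sample data, the helpers empty_row, merge_sources and format_rows are hypothetical names, the %data hash stands in for reading the three files, and lines with an unexpected number of fields are simply skipped.

```perl
use 5.010;
use strict;
use warnings;

# Fields remaining after date and time in each file (from the sample data).
my @layout = ( [ gravity => 5 ], [ magnetics => 4 ], [ bathymetry => 3 ] );

# Hypothetical helper: a row with every slot of every source set to 'null'.
sub empty_row {
    my %row;
    $row{ $_->[0] } = [ ('null') x $_->[1] ] for @layout;
    return \%row;
}

# Hypothetical helper: $data maps a source name to an array ref of lines.
sub merge_sources {
    my ($data) = @_;
    my %merged;
    for my $spec (@layout) {
        my ($name, $count) = @$spec;
        for my $line (@{ $data->{$name} // [] }) {
            my ($date, $time, @fields) = split ' ', $line;
            next unless @fields == $count;    # assumption: skip malformed lines
            my $row = $merged{"$date $time"} //= empty_row();
            $row->{$name} = \@fields;         # lat/long kept per source
        }
    }
    return \%merged;
}

# One output line per key, sources in @layout order.
sub format_rows {
    my ($merged) = @_;
    return map {
        my $key = $_;
        join ' ', $key, map { @{ $merged->{$key}{ $_->[0] } } } @layout;
    } sort keys %$merged;
}

# Stand-in for the files: a few lines from the sample data.
my %data = (
    gravity => [
        '2010-10-01 00:00:03 lat1 long1 grav1 g_anom1 eotvos1',
        '2010-10-02 00:00:05 lat2 long2 grav2 g_anom2 eotvos2',
    ],
    magnetics  => ['2010-10-02 00:00:05 lat2 long2 mag1 m_anom1'],
    bathymetry => ['2010-10-03 00:00:07 lat3 long3 bath1'],
);

my @rows = format_rows( merge_sources( \%data ) );
say for @rows;
```

Keeping the latitude and longitude separately for each source, instead of folding them into the key, leaves all three navigations available so they can be compared and the preferred one chosen at output time.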