Joining separate data files to make one.

by msexton (Initiate)
on Oct 06, 2010 at 10:58 UTC
msexton has asked for the wisdom of the Perl Monks concerning the following question:

I would like to join data files that contain different attributes and produce one multicolumn file.

The individual files contain:

Date Time Latitude Longitude Attribute

The attributes are gravity (file 1), magnetics (file 2) and bathymetry (file 3).

The final file would be:

Date Time Latitude Longitude Attribute1 Attribute2 Attribute3

I can join them if the individual files start at the same time and have the same length, but I have files that start and end at different times. If an attribute doesn't exist, I intend to enter a null value.

I am wondering if there is a construct that I should be looking at to perform this. I am currently importing each file into an array and then using the most confusing set of ifs and elsifs to cover each possibility.

I have toyed with cutting the input files into segments that are the same size and doing each in turn (i.e. gravity and magnetics only, gravity and bathymetry only, bathymetry only). I thought I'd ask if there is another way.
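For concreteness (the values here are made up), the inputs and the desired output might look like:

gravity:
    2010/10/06 10:58:00 -33.51 151.20 9.8121
    2010/10/06 10:58:10 -33.51 151.21 9.8123

magnetics (starts one record later):
    2010/10/06 10:58:10 -33.51 151.21 57012.4

merged (bathymetry absent throughout, so its column is all nulls):
    2010/10/06 10:58:00 -33.51 151.20 9.8121 null null
    2010/10/06 10:58:10 -33.51 151.21 9.8123 57012.4 null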

Any suggestions greatly appreciated.

Thanks

Re: Joining separate data files to make one.
by Anonymous Monk on Oct 06, 2010 at 11:25 UTC

      Thanks for this

      I will investigate it after two of the others, which seem to be more amenable to my level of knowledge.

Re: Joining separate data files to make one.
by BrowserUk (Pope) on Oct 06, 2010 at 11:54 UTC

    From your meagre description of the files, I assume that each file's data is keyed by date & time?

    If so, then rather than loading all the data into arrays, accumulate it in a hash:

    my %data;

    open FILE, '<', 'gravity' or die;
    while( <FILE> ) {
        my @fields = split ' ', $_;
        $data{ "@fields[ 0, 1 ]" } = join "\t", @fields;
    }
    close FILE;

    open FILE, '<', 'magnetics' or die;
    while( <FILE> ) {
        my @fields = split ' ', $_;
        ## Pad the hash if we didn't see this date/time in the gravity file
        $data[ "@fields[ 0, 1 ]" } //= join "\t", @fields[ 0, 1 ], ('n/a') x 3;
        $data{ "@fields[ 0, 1 ]" } .= "\t" . join "\t", @fields[ 2 .. $#fields ];
    }
    close FILE;

    open FILE, '<', 'bathymetry' or die;
    while( <FILE> ) {
        my @fields = split ' ', $_;
        ## Pad the hash if we didn't see this date/time before
        ## (How many fields added by the magnetics?)
        $data[ "@fields[ 0, 1 ]" } //= join "\t", @fields[ 0, 1 ], ('n/a') x ???;
        $data{ "@fields[ 0, 1 ]" } .= "\t" . join "\t", @fields[ 2 .. $#fields ];
    }
    close FILE;

    for my $key ( sort keys %data ) {
        print $data{ $key }, "\n";
    }

    Depending upon your date & time formats, you might need a more sophisticated sort.
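    For instance, if the keys were in "DD/MM/YYYY HH:MM:SS" form (a guess; adjust the pattern to whatever your files actually use), core Time::Piece could map each key to an epoch for the sort:

    use Time::Piece;

    ## Hypothetical key layout, e.g. "06/10/2010 10:58:00".
    my %epoch = map {
        $_ => Time::Piece->strptime( $_, '%d/%m/%Y %H:%M:%S' )->epoch
    } keys %data;

    print $data{ $_ }, "\n" for sort { $epoch{ $a } <=> $epoch{ $b } } keys %data;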


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi,

      Thanks for this, I learned a lot.

      Actually it took me nearly all day to figure it out. There are two typos ( [ should be { ). A reply after yours constructed some sample data files and I used them as input to your script. The only drawback with your script is that if one of the files ends before a later file, the "n/a" is not appended to the hash for the file before. Your script handles the situation where files start after the one before.

      I at least "understand" your script, but I am having trouble with the other one. From your script I can see how to add varying fields from each of the files (i.e. 4 from gravity and magnetics, 3 from bathymetry). With the other script, I can't see how to vary the number of fields to read.

        There are two typos ( [ should be { ).

        Sorry. It was typed directly into the edit box and so was never tested. I apologise for that. I wanted to describe a viable alternative approach to the problem--and I find describing with code far more efficient and clear than using words. I was aware that it wasn't a complete working solution as posted.

        The only drawback with your script is that if one of the files ends before a later file, the "n/a" is not appended to the hash for the file before.

        I would handle that in the output loop. If, when you come to write a record, it is "too short", pad it with the appropriate number of 'n/a's. Of course, as coded, with concatenated strings, determining how much to add is a pain.

        You could split on "\t" to get the field count, then pad and rejoin, but that would be a bit silly. Better to build up the records as (a hash of) arrays, pushing the fields as you go, and then just join them at the end, after padding if necessary.

        Something like:

        my %data;

        open FILE, '<', 'gravity' or die;
        while( <FILE> ) {
            my @fields = split ' ', $_;
            $data{ "@fields[ 0, 1 ]" } = \@fields;
        }
        close FILE;

        open FILE, '<', 'magnetics' or die;
        while( <FILE> ) {
            my @fields = split ' ', $_;
            ## Pad the hash if we didn't see this date/time in the gravity file
            $data{ "@fields[ 0, 1 ]" } //= [ @fields[ 0, 1 ], ('n/a') x 3 ];
            push @{ $data{ "@fields[ 0, 1 ]" } }, @fields[ 2 .. $#fields ];
        }
        close FILE;

        open FILE, '<', 'bathymetry' or die;
        while( <FILE> ) {
            my @fields = split ' ', $_;
            ## Pad the hash if we've never seen it before
            ## (??? == No of fields added by the magnetics)
            $data{ "@fields[ 0, 1 ]" } //= [ @fields[ 0, 1 ], ('n/a') x ( 3 + ??? ) ];
            ## We saw it in gravity, but not magnetics.
            push @{ $data{ "@fields[ 0, 1 ]" } }, ('n/a') x ???
                if @{ $data{ "@fields[ 0, 1 ]" } } < 3 + ???;
            push @{ $data{ "@fields[ 0, 1 ]" } }, @fields[ 2 .. $#fields ];
        }
        close FILE;

        for my $key ( sort keys %data ) {
            my $nFields = @{ $data{ $key } };
            ## Pad: ??? == total number of fields
            push @{ $data{ $key } }, ('n/a') x ( ??? - $nFields );
            print join( "\t", @{ $data{ $key } } ), "\n";
        }
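        For what it's worth, assuming every file has the five fields described in the original post (so each file contributes three fields beyond the date/time key), the ??? placeholders would resolve to:

        ## In the bathymetry loop: magnetics added 3 fields.
        $data{ "@fields[ 0, 1 ]" } //= [ @fields[ 0, 1 ], ('n/a') x ( 3 + 3 ) ];
        push @{ $data{ "@fields[ 0, 1 ]" } }, ('n/a') x 3
            if @{ $data{ "@fields[ 0, 1 ]" } } < 3 + 3;

        ## In the output loop: a full record is 2 + 3 + 3 + 3 = 11 fields.
        push @{ $data{ $key } }, ('n/a') x ( 11 - $nFields );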

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Joining separate data files to make one.
by sundialsvc4 (Monsignor) on Oct 06, 2010 at 13:24 UTC

    “The COBOL way” of doing this would be to SORT the individual files by the same key, then write logic that MERGEs the sorted files together. This is essentially what you are doing here.
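    For two of the files, that merge might look like this rough sketch (untested; it assumes both inputs are already sorted and that the date+time key compares chronologically as a string):

    use strict;
    use warnings;

    ## The date+time key of a line (its first two whitespace-separated fields).
    sub key_of { join ' ', ( split ' ', $_[0] )[ 0, 1 ] }

    open my $g, '<', 'gravity'   or die $!;
    open my $m, '<', 'magnetics' or die $!;

    my $gl = <$g>;
    my $ml = <$m>;

    while ( defined $gl or defined $ml ) {
        my $cmp = !defined $ml ? -1                ## magnetics exhausted
                : !defined $gl ?  1                ## gravity exhausted
                : key_of( $gl ) cmp key_of( $ml );

        if ( $cmp < 0 ) {                          ## gravity only
            print join( "\t", split( ' ', $gl ), 'n/a' ), "\n";
            $gl = <$g>;
        }
        elsif ( $cmp > 0 ) {                       ## magnetics only
            my @m = split ' ', $ml;
            print join( "\t", @m[ 0 .. 3 ], 'n/a', $m[4] ), "\n";
            $ml = <$m>;
        }
        else {                                     ## keys match: both attributes
            my @m = split ' ', $ml;
            print join( "\t", split( ' ', $gl ), $m[4] ), "\n";
            $gl = <$g>;
            $ml = <$m>;
        }
    }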

    Another way to do it would be to use DBD::SQLite. SQLite is a public domain database engine which works with single files. (There is no “server.”) So, instead of storing your data (initially, perhaps) as “a flat file,” you store it as a database... this “database” being, in fact, “just a file.” SQLite is extremely robust and fast, and it is very heavily used in smart phones and so forth. I find myself using it rather constantly in situations where I would normally have used “a flat file.” The overhead of doing so is essentially nil.
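    As a sketch of that approach (untested; the table and column names are invented, and it assumes a given date/time carries the same latitude/longitude in every file):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                            { RaiseError => 1 } );

    ## One table per input file (names made up for this sketch).
    my %file = ( grav => 'gravity', mag => 'magnetics', bath => 'bathymetry' );
    for my $t ( keys %file ) {
        $dbh->do( "CREATE TABLE $t ( date TEXT, time TEXT,
                                     lat REAL, lon REAL, val REAL )" );
        my $ins = $dbh->prepare( "INSERT INTO $t VALUES ( ?, ?, ?, ?, ? )" );
        open my $fh, '<', $file{ $t } or die "$file{ $t }: $!";
        while ( <$fh> ) {
            $ins->execute( split ' ' );
        }
        close $fh;
    }

    ## SQLite has no FULL OUTER JOIN, so gather every key once with a
    ## UNION, then LEFT JOIN each attribute back onto it.
    my $sth = $dbh->prepare( <<'SQL' );
        SELECT k.date, k.time, k.lat, k.lon, g.val, m.val, b.val
        FROM ( SELECT date, time, lat, lon FROM grav
               UNION SELECT date, time, lat, lon FROM mag
               UNION SELECT date, time, lat, lon FROM bath ) AS k
        LEFT JOIN grav AS g ON g.date = k.date AND g.time = k.time
        LEFT JOIN mag  AS m ON m.date = k.date AND m.time = k.time
        LEFT JOIN bath AS b ON b.date = k.date AND b.time = k.time
        ORDER BY k.date, k.time
    SQL
    $sth->execute;
    while ( my @row = $sth->fetchrow_array ) {
        print join( "\t", map { $_ // 'null' } @row ), "\n";
    }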

Re: Joining separate data files to make one.
by kcott (Abbot) on Oct 06, 2010 at 15:19 UTC

    Here's another way to do it:

    #!perl

    use 5.10.0;
    use strict;
    use warnings;

    my %merged = ();
    my $index  = 0;

    map {
        open my $file, '<', $_ or die $!;
        map {
            $merged{$_->[0]} //= [ qw{null} x 3 ];
            $merged{$_->[0]}[$index] = $_->[1];
        }
        map { [ m{ \A ( \S+ \s \S+ \s \S+ \s \S+ ) \s ( \S+ ) \z }msx ] }
        map { chomp; $_ } (<$file>);
        close $file;
        ++$index;
    } qw{gravity magnetics bathymetry};

    say join(' ', $_, @{$merged{$_}}) for sort keys %merged;

    Assuming your latitudes and longitudes are in some sortable format, this will sort by the first 4 fields (i.e. date, time, latitude and longitude).

      Hi,

      Thanks for this, it worked well.

      The only problem is, I can't for the life of me figure it out. The multiple calls to map have me perplexed. I spent most of the day reading up about map, and am still a bit confused.

      I know I didn't give an example of my data files, but you were almost spot on. If I can ask a favour, how would the code vary, if the gravity and magnetics files had a 6th field, whilst the bathymetry remained at five?

      One of the other replies I received was a bit easier to understand, but did not handle the situation where a later file (e.g. bathymetry) ends before (in time) an earlier file (e.g. magnetics). It did not add nulls to the hash. It worked well when files started later than previous files.

        The multiple calls to map have me perplexed.

        Sure. map is abused here to work as foreach and while.

        map {
            open my $file, '<', $_ or die $!;
            map {
                $merged{$_->[0]} //= [ qw{null} x 3 ];
                $merged{$_->[0]}[$index] = $_->[1];
            }
            map { [ m{ \A ( \S+ \s \S+ \s \S+ \s \S+ ) \s ( \S+ ) \z }msx ] }
            map { chomp; $_ } (<$file>);
            close $file;
            ++$index;
        } qw{gravity magnetics bathymetry};

        The outer map is really a foreach:

        foreach my $filename (qw{gravity magnetics bathymetry}) {
            open my $file, '<', $filename or die $!;
            map {
                $merged{$_->[0]} //= [ qw{null} x 3 ];
                $merged{$_->[0]}[$index] = $_->[1];
            }
            map { [ m{ \A ( \S+ \s \S+ \s \S+ \s \S+ ) \s ( \S+ ) \z }msx ] }
            map { chomp; $_ } (<$file>);
            close $file;
            ++$index;
        }

        The last map inside the foreach loop simply iterates over all lines of the file and strips trailing newlines. Then it passes each line to the middle map, which extracts some parts of the line, and returns an array reference with the matches. The first map is again abused as a foreach.

        Using while (<$file>) would make that more readable:

        foreach my $filename (qw{gravity magnetics bathymetry}) {
            open my $file, '<', $filename or die $!;
            while (<$file>) {
                chomp;
                my @a = m{ \A ( \S+ \s \S+ \s \S+ \s \S+ ) \s ( \S+ ) \z }msx;
                $merged{$a[0]} //= [ qw{null} x 3 ];
                $merged{$a[0]}[$index] = $a[1];
            }
            close $file;
            ++$index;
        }

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        Thanks for this, it worked well.

        You're welcome. I enjoyed writing it.

        The only problem is, I can't for the life of me figure it out. The multiple calls to map have me perplexed. I spent most of the day reading up about map, and am still a bit confused.

        Alexander has provided a breakdown of what's going on here. Feel free to ask if anything needs further explanation.

        I know I didn't give an example of my data files, but you were almost spot on.

        Your original question was pretty clear. I felt I had a reasonable understanding of what you were after.

        If I can ask a favour, how would the code vary, if the gravity and magnetics files had a 6th field, whilst the bathymetry remained at five?

        I'm happy to answer that with a little more information.

        • Is the 6th field to be added to the final data as an additional field?
        • Are the 5th and 6th fields to be combined and then added?
        • Is the 6th field just extraneous data to be discarded?
        • Something else?

        Finally, on the timing issue, I staggered the date-time fields through the test data to take that into consideration. Within each test file the times are ordered though. If your live data is not necessarily in chronological order, you might want to jumble up the lines in one or more files. I think it should still work but I didn't specifically test for that scenario.

        Regards,

        Ken
