http://www.perlmonks.org?node_id=1012426


in reply to Need to efficiently merge 2 FIX protocol log files

Okay, let's assume you have enough memory to hold the entire contents of both files in RAM; the simplest approach is probably to build a hash keyed by timestamp and sort on those keys.
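A minimal sketch of that hash approach (assuming, as the merge code further down does, that the timestamp is the last SOH-delimited field of each line; the filenames are placeholders, and duplicate timestamps are handled by storing an array of lines per key):

use strict;
use warnings;

my %lines_by_ts;
for my $file ('fileA', 'fileB') {
    open my $fh, '<', $file or die "$file: $!";
    while (my $line = <$fh>) {
        my $ts = (split /\x01/, $line)[-1];   # last SOH-delimited field
        push @{ $lines_by_ts{$ts} }, $line;   # keep duplicate timestamps
    }
}
print @{ $lines_by_ts{$_} } for sort keys %lines_by_ts;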

However, that approach doesn't take into account the fact that your original files are probably already sorted, so a better algorithm is to compare the leading row of each file and write out the one with the older timestamp, repeating until you run out of data. Obviously such code has to account for identical timestamps, one file being shorter than the other .... yada yada yada.

This is roughly how to do it - it's not completely debugged Perl, as I'm short on time:

use strict;
use warnings;

# compare two timestamp strings; FIX timestamps (e.g. 20120101-12:00:00.000)
# are fixed-width and zero-padded, so a plain string comparison orders them
# chronologically - adjust this if your log's timestamp format differs
sub compare_ts {
    my ($x, $y) = @_;
    return $x cmp $y;
}

# open the files...
open my $fa, '<', 'fileA' or die "fileA: $!";
open my $fb, '<', 'fileB' or die "fileB: $!";

# slurp....
my @aFIX = <$fa>;
my @bFIX = <$fb>;

# prime each compare value
my $rowA = shift @aFIX;
my $rowB = shift @bFIX;

# keep comparing till one or the other runs out of data
while (defined $rowA and defined $rowB) {
    # get timestamps from rows; could use a regex such as m/10=\d\d\d\x01(\w+)/,
    # but splitting on SOH and taking the last element probably does the job
    my $tsA = (split /\x01/, $rowA)[-1];
    my $tsB = (split /\x01/, $rowB)[-1];
    if (compare_ts($tsA, $tsB) < 0) {
        print $rowA;
        $rowA = shift @aFIX;
    }
    else {
        print $rowB;
        $rowB = shift @bFIX;
    }
}

# we've run out of data in fileA or fileB, so dump the rest
if (defined $rowA) {
    print $rowA;
    print @aFIX;
}
if (defined $rowB) {
    print $rowB;
    print @bFIX;
}
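If slurping both files is not an option after all, the same merge works reading one line at a time, since each file only ever needs its current leading row in memory. A sketch under the same assumptions (placeholder filenames, timestamp as the last SOH-delimited field):

use strict;
use warnings;

open my $fa, '<', 'fileA' or die "fileA: $!";
open my $fb, '<', 'fileB' or die "fileB: $!";

my $rowA = <$fa>;
my $rowB = <$fb>;

while (defined $rowA and defined $rowB) {
    my $tsA = (split /\x01/, $rowA)[-1];
    my $tsB = (split /\x01/, $rowB)[-1];
    if ($tsA le $tsB) {      # 'le': on a tie, fileA's line goes first
        print $rowA;
        $rowA = <$fa>;
    }
    else {
        print $rowB;
        $rowB = <$fb>;
    }
}

# drain whichever file still has data
print $rowA if defined $rowA;
print <$fa>;
print $rowB if defined $rowB;
print <$fb>;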
A Monk aims to give answers to those who have none, and to learn from those who know more.