Need to efficiently merge 2 FIX protocol log files

softwareCEO has asked for the wisdom of the Perl Monks concerning the following question:

I run a software firm in financial services, and we use the FIX protocol. I have a straightforward coding problem. Can you provide the best sample code for my problem? FIX is a tag-value pair protocol with one FIX message per line. Every tag-value pair is separated by ASCII SOH. Every FIX message ends with 10=xxx<SOH>, where xxx is 3 digits. After this comes the timestamp. I want to merge these two files into a third file, ordered by the timestamp on each line. The files could be fairly large (10,000,000 lines, 1 message/line, 100 bytes/message). The speed of the merge is very important. Here are two simple sample input files.

file 1:

8=FIX.4.29=005935=A49=X56=Y52=20121008-12:01:2734=198=0108=3010=12510/08/12 08:01:27.489799

8=FIX.4.29=004735=049=X56=Y52=20121008-12:01:5134=210=07810/08/12 08:01:51.489969

file 2:

8=FIX.4.29=6335=A34=149=B52=20121008-12:01:27.49056=A98=0108=3010=22710/08/12 08:01:27.489930

8=FIX.4.29=5135=034=249=B52=20121008-12:01:57.49056=A10=18610/08/12 08:01:57.490432


Replies are listed 'Best First'.
Re: Need to efficiently merge 2 FIX protocol log files by NetWallah (Canon) on Jan 09, 2013 at 05:09 UTC
A quick search shows that the Finance::FIX module is a (very) rudimentary parser for the FIX format. Hopefully, it can get you started. Unfortunately, the module has no parsing for the timestamp, which seems to be outside of the FIX protocol message.

Since your posting was not formatted properly (see Writeup Formatting Tips), it is nearly impossible to see the distinguishing characteristics of your message timestamps, which we need in order to propose parsing methods.

"By three methods we may learn wisdom: First, by reflection, which is noblest; Second, by imitation, which is easiest; and third by experience, which is the bitterest." -Confucius
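For the parsing question above, here is a minimal sketch, assuming the posted samples are representative: each line ends with 10=ddd<SOH> followed by a timestamp of the form mm/dd/yy hh:mm:ss.ffffff. The mm/dd/yy order is only inferred from tag 52 (20121008) in the samples, and `ts_key` is a hypothetical helper name, not part of Finance::FIX.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one log line into a string that sorts
# chronologically under plain string comparison.
# ASSUMPTIONS: the trailer is <SOH>10=ddd<SOH>, the timestamp is
# mm/dd/yy hh:mm:ss.ffffff, and the fraction always has the same width.
sub ts_key {
    my ($line) = @_;
    my ($mm, $dd, $yy, $time) =
        $line =~ m{\x0110=\d{3}\x01(\d\d)/(\d\d)/(\d\d) (\d\d:\d\d:\d\d\.\d+)\s*$}
        or return;    # undef when the line does not look like a record
    return "$yy$mm$dd $time";
}
```

Reordering the date to yymmdd lets `lt`, `le` and `cmp` do the comparison without any date arithmetic, which should matter when the function is called tens of millions of times.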
Re: Need to efficiently merge 2 FIX protocol log files by Riales (Hermit) on Jan 08, 2013 at 23:28 UTC
Aren't CEOs supposed to pay employees to do this sort of thing?
Re: Need to efficiently merge 2 FIX protocol log files by CountZero (Bishop) on Jan 09, 2013 at 07:30 UTC
Are the FIX files already sorted on timestamp? What is the format of the date part of the timestamp: dd/mm/yy, mm/dd/yy, yy/mm/dd, or something else?

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
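Whether the inputs are already sorted can be checked in a single pass. This is a rough sketch: `first_unsorted_line` and `$mdy_key` are hypothetical names, and the key extractor assumes an mm/dd/yy hh:mm:ss.ffffff timestamp at the end of each line, which the poster has not confirmed.

```perl
use strict;
use warnings;

# Return the line number of the first out-of-order record, or 0 if the
# handle is sorted. $key_of is a caller-supplied code ref that maps a
# line to a sortable string (or undef if the line has no timestamp).
sub first_unsorted_line {
    my ($fh, $key_of) = @_;
    my $prev;
    my $n = 0;
    while (my $line = <$fh>) {
        $n++;
        my $key = $key_of->($line);
        next unless defined $key;    # skip lines without a timestamp
        return $n if defined $prev && $key lt $prev;
        $prev = $key;
    }
    return 0;
}

# Example key extractor. ASSUMPTION: mm/dd/yy hh:mm:ss.ffffff at end of line.
my $mdy_key = sub {
    $_[0] =~ m{(\d\d)/(\d\d)/(\d\d) (\S+)\s*$} ? "$3$1$2 $4" : undef;
};
```

If the date turns out to be dd/mm/yy instead, only the order of the captures in `$mdy_key` needs to change.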
Re: Need to efficiently merge 2 FIX protocol log files by space_monk (Chaplain) on Jan 09, 2013 at 10:53 UTC
Okay, let's assume you have enough memory to hold the entire contents of both files in RAM; the simplest approach is probably to create a hash with the timestamp as a key, and sort on that key.

However, this doesn't take into account the fact that your original files are probably already sorted, so your algorithm should probably be to compare the leading row of each file, and then write the one with the older timestamp value, until you run out of data. Obviously such code has to take into account the possibility of identical timestamps, one file being shorter than the other .... yada yada yada.

This is roughly how to do it. It is not completely debugged Perl, as I'm short on time.

    use strict;
    use warnings;

    # Compare two timestamp strings; returns -1, 0 or 1.
    # ASSUMPTION: timestamps look like mm/dd/yy hh:mm:ss.ffffff.
    # Change the reordering below if the date format is different.
    sub compare_ts {
        my ($x, $y) = @_;
        for ($x, $y) {
            # mm/dd/yy ... -> yymmdd ... so that string order is time order
            s{^(\d\d)/(\d\d)/(\d\d) }{$3$1$2 };
        }
        return $x cmp $y;
    }

    my @els;
    my ($tsA, $tsB);

    # open the files...
    open(my $fa, '<', 'fileA') or die "Cannot open fileA: $!";
    open(my $fb, '<', 'fileB') or die "Cannot open fileB: $!";

    # slurp....
    my @aFIX = <$fa>;
    my @bFIX = <$fb>;

    # prime each compare value
    my $rowA = shift @aFIX;
    my $rowB = shift @bFIX;

    # keep comparing till one or the other runs out of data
    while (defined $rowA and defined $rowB) {
        # get timestamps from rows; could use a regex such as
        # m/10=\d\d\d\x01(.+)/ but splitting on SOH and using the
        # last element probably does the job
        @els = split(/\x01/, $rowA);
        $tsA = pop @els;
        @els = split(/\x01/, $rowB);
        $tsB = pop @els;

        if (compare_ts($tsA, $tsB) < 0) {
            print $rowA;
            $rowA = shift @aFIX;
        }
        else {
            print $rowB;
            $rowB = shift @bFIX;
        }
    }

    # we've run out of data in fileA or fileB, so can dump the rest
    if (defined $rowA) {
        print $rowA;
        print @aFIX;
    }
    if (defined $rowB) {
        print $rowB;
        print @bFIX;
    }

A Monk aims to give answers to those who have none, and to learn from those who know more.
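At the sizes quoted in the question (10,000,000 lines of about 100 bytes each, so roughly 1 GB per file), slurping both inputs may not be practical. The same compare-the-leading-rows idea can be done as a streaming merge that holds only one line per input. This is a sketch under stated assumptions: both files are already sorted, the timestamp is mm/dd/yy hh:mm:ss.ffffff at the end of the line, and `ts_key` and `merge_sorted` are hypothetical names.

```perl
use strict;
use warnings;

# Sortable key for a line. ASSUMPTION: mm/dd/yy hh:mm:ss.ffffff at the end.
# Lines without a recognisable timestamp get an empty key and sort first.
sub ts_key {
    my ($line) = @_;
    return $line =~ m{(\d\d)/(\d\d)/(\d\d) (\S+)\s*$} ? "$3$1$2 $4" : '';
}

# Merge two already-sorted input handles into $out, one line at a time.
sub merge_sorted {
    my ($in_a, $in_b, $out) = @_;
    my $row_a = <$in_a>;
    my $row_b = <$in_b>;
    my ($key_a, $key_b);
    $key_a = ts_key($row_a) if defined $row_a;
    $key_b = ts_key($row_b) if defined $row_b;

    while (defined $row_a && defined $row_b) {
        if ($key_a le $key_b) {    # on identical timestamps, file A goes first
            print {$out} $row_a;
            $row_a = <$in_a>;
            $key_a = ts_key($row_a) if defined $row_a;
        }
        else {
            print {$out} $row_b;
            $row_b = <$in_b>;
            $key_b = ts_key($row_b) if defined $row_b;
        }
    }

    # One input is exhausted; copy whatever remains of the other.
    if (defined $row_a) {
        print {$out} $row_a;
        while (my $rest = <$in_a>) { print {$out} $rest }
    }
    if (defined $row_b) {
        print {$out} $row_b;
        while (my $rest = <$in_b>) { print {$out} $rest }
    }
    return;
}
```

To use it, open the two logs for reading and the target for writing with three-argument `open`, then call `merge_sorted($fa, $fb, $fout)`. Each key is computed once per line rather than once per comparison, which keeps the inner loop to one regex match and one string compare per output line.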
