http://www.perlmonks.org?node_id=1025758

jb60606 has asked for the wisdom of the Perl Monks concerning the following question:

I have two files containing thousands to potentially millions of comma-separated records. Each line is a stock market symbol followed by its various "tick" data for the moment in time indicated by the timestamp in column #4. The lines in FILE2 should match the lines in FILE1, but there will be instances where they won't, and I would like to put together a script that will determine the following:

1.) Using the "SEQUENCE NUMBER" (the typically 7-digit number found in column #2) as the key, which lines in FILE1 are not found in FILE2.

2.) If the SEQ# is found in FILE2, continue on to compare each remaining element of the shared record (TYPE in FILE1 to TYPE in FILE2, BID to BID, SIZE to SIZE, etc.)

From what I gather, I will need to create at least one HASH to perform this action. Using code examples found on the web, I know how to manually create a very basic HASH. What I don't know how to do is:

- Import a file into a HASH, using one element as the key and the remaining elements as individual values assigned to that key. At best, I think I've only been able to import each element of each line as its own key.

- The verbiage and format needed to articulate comparisons between elements. This is what confuses me the most.

LEGEND (only the first 10 elements of each line concern me): SYMBOL,SEQUENCE#,TYPE (Quote or Trade or Custom),TIMESTAMP,TYPE,STATUS,BID,BID-SIZE,ASK,ASK-SIZE

FILE1:

ESM3,2285319,Q,13:58:50.744000,Q,WIDE,1549.250000,656,1549.500000,522,0.000000,0.000000,0.000000,105,67,N,CME,CME
ESM3,2285247,T,13:58:49.986000,SELL,1549.250000,2,0,1738560,,U
ESM3,2285320,Q,13:58:50.749000,Q,WIDE,1549.250000,656,1549.500000,524,0.000000,0.000000,0.000000,105,68,N,CME,CME
ESM3,2285321,Q,13:58:50.750000,Q,WIDE,1549.250000,655,1549.500000,524,0.000000,0.000000,0.000000,104,68,N,CME,CME
ESM3,2285325,Q,13:58:50.801000,Q,WIDE,1549.250000,655,1549.500000,522,0.000000,0.000000,0.000000,104,67,N,CME,CME
ESM3,2285326,Q,13:58:50.802000,Q,WIDE,1549.250000,656,1549.500000,522,0.000000,0.000000,0.000000,105,67,N,CME,CME
ESM3,2285328,Q,13:58:50.831000,Q,WIDE,1549.250000,667,1549.500000,522,0.000000,0.000000,0.000000,106,67,N,CME,CME
ESM3,2285329,Q,13:58:50.832000,Q,WIDE,1549.250000,1504,1549.500000,522,0.000000,0.000000,0.000000,107,67,N,CME,CME
ESM3,2285330,Q,13:58:50.833000,Q,WIDE,1549.250000,1505,1549.500000,522,0.000000,0.000000,0.000000,108,67,N,CME,CME
ESM3,2285331,Q,13:58:50.833000,Q,WIDE,1549.250000,1506,1549.500000,522,0.000000,0.000000,0.000000,109,67,N,CME,CME
ESM3,2285332,Q,13:58:50.833000,Q,WIDE,1549.250000,1506,1549.500000,520,0.000000,0.000000,0.000000,109,66,N,CME,CME
ESM3,2285333,Q,13:58:50.833000,Q,WIDE,1549.250000,1506,1549.500000,519,0.000000,0.000000,0.000000,109,65,N,CME,CME
ESM3,2285334,Q,13:58:50.833000,Q,WIDE,1549.250000,1507,1549.500000,519,0.000000,0.000000,0.000000,110,65,N,CME,CME

FILE2:

ESM3,2341309,Q,14:13:42.044000,Q,WIDE,1550.000000,555,1550.250000,834,0.000000,0.000000,0.000000,140,76,N,CME,CME
ESM3,2341311,Q,14:13:42.445000,Q,WIDE,1550.000000,554,1550.250000,834,0.000000,0.000000,0.000000,139,76,N,CME,CME
ESM3,2341312,Q,14:13:42.445000,Q,WIDE,1550.000000,554,1550.250000,833,0.000000,0.000000,0.000000,139,75,N,CME,CME
ESM3,2341313,Q,14:13:42.544000,Q,WIDE,1550.000000,550,1550.250000,833,0.000000,0.000000,0.000000,138,75,N,CME,CME
ESM3,2341314,Q,14:13:42.544000,Q,WIDE,1550.000000,551,1550.250000,833,0.000000,0.000000,0.000000,139,75,N,CME,CME
ESM3,2341315,Q,14:13:42.544000,Q,WIDE,1550.000000,551,1550.250000,834,0.000000,0.000000,0.000000,139,76,N,CME,CME
ESM3,2341316,Q,14:13:42.666000,Q,WIDE,1550.000000,552,1550.250000,834,0.000000,0.000000,0.000000,140,76,N,CME,CME
ESM3,2341317,Q,14:13:42.809000,Q,WIDE,1550.000000,552,1550.250000,837,0.000000,0.000000,0.000000,140,77,N,CME,CME
ESM3,2341319,T,14:13:42.851000,SELL,1550.000000,5,0,1786787,,U
ESM3,2341319,Q,14:13:42.851000,Q,WIDE,1550.000000,547,1550.250000,837,0.000000,0.000000,0.000000,140,77,N,CME,CME
ESM3,2341320,Q,14:13:42.864000,Q,WIDE,1550.000000,542,1550.250000,837,0.000000,0.00000

I'm not exactly new to Perl, though I've only used it for very basic data manipulation and searches (where shell scripting would probably have been completely adequate, but I have almost no experience with shell scripting). Comparing multiple values in two different files has been absolutely puzzling to me.

Below is the closest I could get to importing anything into the HASH, using just one file as an example. The problem is, I have no idea what's being used for the key, and I have no idea how to initiate a comparison between this and a second file.

#!/usr/bin/perl
#use warnings;
#use strict;

my $inFile = "CME.ESM3.MKD11.out";

open(FH1, '<', $inFile) or die("Can't open input file \"$inFile\": $!\n");

my %hash;
while ($line = <FH1>) {
    chomp $line;                      # chomp the line we read, not $_
    my @fields = split /,/, $line;    # capture the split into an array
    $hash{symbol}    = $fields[0];
    $hash{seqNum}    = $fields[1];
    $hash{type}      = $fields[2];
    $hash{timestamp} = $fields[3];
    $hash{status}    = $fields[5];
    $hash{bid}       = $fields[6];
    $hash{bidVol}    = $fields[7];
    $hash{ask}       = $fields[8];
    $hash{askVol}    = $fields[9];
    for $key (keys %hash) {
        print "$key\=$hash{$key}\t";
    }
    print "\n";
}
Any help would be greatly appreciated. Thanks

Re: Help creating HASH for file comparison
by toolic (Bishop) on Mar 27, 2013 at 17:29 UTC
    You could use a hash-of-hashes data structure (see perldsc):
    use warnings;
    use strict;

    my %data;
    while (<DATA>) {
        chomp;
        my @tokens = split /,/;
        my %hash;
        $hash{symbol}    = $tokens[0];
        $hash{timeStamp} = $tokens[3];
        $hash{status}    = $tokens[5];
        $hash{bid}       = $tokens[6];
        $hash{bidVol}    = $tokens[7];
        $hash{ask}       = $tokens[8];
        $hash{askVol}    = $tokens[9];
        $data{$tokens[1]} = \%hash;   # SEQUENCE# is the top-level key
    }

    use Data::Dumper;
    $Data::Dumper::Sortkeys = 1;
    print Dumper(\%data);

    # FILE1:
    __DATA__
    ESM3,2285319,Q,13:58:50.744000,Q,WIDE,1549.250000,656,1549.500000,522,0.000000,0.000000,0.000000,105,67,N,CME,CME
    ESM3,2285247,T,13:58:49.986000,SELL,1549.250000,2,0,1738560,,U
    ESM3,2285320,Q,13:58:50.749000,Q,WIDE,1549.250000,656,1549.500000,524,0.000000,0.000000,0.000000,105,68,N,CME,CME
    ESM3,2285321,Q,13:58:50.750000,Q,WIDE,1549.250000,655,1549.500000,524,0.000000,0.000000,0.000000,104,68,N,CME,CME
    ESM3,2285325,Q,13:58:50.801000,Q,WIDE,1549.250000,655,1549.500000,522,0.000000,0.000000,0.000000,104,67,N,CME,CME
    ESM3,2285326,Q,13:58:50.802000,Q,WIDE,1549.250000,656,1549.500000,522,0.000000,0.000000,0.000000,105,67,N,CME,CME
    ESM3,2285328,Q,13:58:50.831000,Q,WIDE,1549.250000,667,1549.500000,522,0.000000,0.000000,0.000000,106,67,N,CME,CME
    ESM3,2285329,Q,13:58:50.832000,Q,WIDE,1549.250000,1504,1549.500000,522,0.000000,0.000000,0.000000,107,67,N,CME,CME
    ESM3,2285330,Q,13:58:50.833000,Q,WIDE,1549.250000,1505,1549.500000,522,0.000000,0.000000,0.000000,108,67,N,CME,CME
    ESM3,2285331,Q,13:58:50.833000,Q,WIDE,1549.250000,1506,1549.500000,522,0.000000,0.000000,0.000000,109,67,N,CME,CME
    ESM3,2285332,Q,13:58:50.833000,Q,WIDE,1549.250000,1506,1549.500000,520,0.000000,0.000000,0.000000,109,66,N,CME,CME
    ESM3,2285333,Q,13:58:50.833000,Q,WIDE,1549.250000,1506,1549.500000,519,0.000000,0.000000,0.000000,109,65,N,CME,CME
    ESM3,2285334,Q,13:58:50.833000,Q,WIDE,1549.250000,1507,1549.500000,519,0.000000,0.000000,0.000000,110,65,N,CME,CME

    Here's a snippet of the output:

    $VAR1 = {
              '2285247' => {
                             'ask' => '1738560',
                             'askVol' => '',
                             'bid' => '2',
                             'bidVol' => '0',
                             'status' => '1549.250000',
                             'symbol' => 'ESM3',
                             'timeStamp' => '13:58:49.986000'
                           },
              '2285319' => {
                             'ask' => '1549.500000',
                             'askVol' => '522',
                             'bid' => '1549.250000',
                             'bidVol' => '656',
                             'status' => 'WIDE',
                             'symbol' => 'ESM3',
                             'timeStamp' => '13:58:50.744000'
                           },
      I think that this brings me very close to my target. However, when I add the second file (to compare the first against), how do I compare the keys and elements of the two different hashes? Also, do I have any options in terms of how the output is formatted?
Re: Help creating HASH for file comparison
by space_monk (Chaplain) on Mar 27, 2013 at 18:17 UTC
    Maybe this extract from The Perl Cookbook will be of use for when you do the hash comparison... :-)
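
    For example, here is a minimal sketch of that kind of comparison. It assumes both files have already been loaded into hash-of-hashes %data1 and %data2 keyed by SEQUENCE# (as in toolic's reply above); the variable names are illustrative, not from your data:

    use strict;
    use warnings;

    # Assumed to have been populated from FILE1 and FILE2 beforehand.
    my ( %data1, %data2 );

    for my $seq ( sort keys %data1 ) {
        # Requirement 1: SEQ#s in FILE1 that are missing from FILE2.
        if ( !exists $data2{$seq} ) {
            print "SEQ $seq is in FILE1 but not in FILE2\n";
            next;
        }
        # Requirement 2: shared SEQ# -- compare field by field.
        for my $field ( sort keys %{ $data1{$seq} } ) {
            my $v1 = $data1{$seq}{$field};
            my $v2 = $data2{$seq}{$field};
            print "SEQ $seq: $field differs ($v1 vs. $v2)\n"
                if $v1 ne $v2;
        }
    }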
    A Monk aims to give answers to those who have none, and to learn from those who know more.
      Thanks guys.

      I think the problem is that my code is not establishing a single key, and is instead making each element of each record a key (i.e., symbol is a key, seq# is a key, status is a key, etc.). Unless I'm misunderstanding something, it would be impossible to compare one element of HASH #1 to the corresponding element in HASH #2 the way I have it coded thus far.

      Thanks
Re: Help creating HASH for file comparison
by hdb (Monsignor) on Mar 28, 2013 at 07:42 UTC

    To me this sounds more like a database job, especially if your files grow very large. Fortunately, Perl can do this as well. Have a look at module DBD::CSV.
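
    For example, a minimal sketch of that approach. The file name, table name, and column names below are placeholders; the files are assumed to have no header row; and since the quote records carry more trailing fields than the trade records, real data would likely need to be split out or padded first:

    use strict;
    use warnings;
    use DBI;

    # Treat CSV files in the current directory as database tables.
    my $dbh = DBI->connect( 'dbi:CSV:', undef, undef, { f_dir => '.' } )
        or die $DBI::errstr;

    # The data has no header line, so name the columns explicitly.
    # (Table and file names here are placeholders.)
    $dbh->{csv_tables}{ticks1} = {
        f_file    => 'FILE1.csv',
        col_names => [ qw( symbol seq type timestamp qtype status
                           bid bidsize ask asksize ) ],
    };

    my $sth = $dbh->prepare( 'SELECT seq, bid, ask FROM ticks1' );
    $sth->execute;
    while ( my ( $seq, $bid, $ask ) = $sth->fetchrow_array ) {
        print "$seq: bid=$bid ask=$ask\n";
    }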

Re: Help creating HASH for file comparison
by davido (Cardinal) on Mar 28, 2013 at 17:20 UTC

    It seems like you're using two criteria where one is sufficient. The SEQUENCE_NUMBER may just be a distraction; all you really want to know is if the first ten fields are identical or not, and if not, do some further processing.

    Also, you mention that there may be a million lines per file that you're comparing. Is it a million? Or could it be millions? Are you concerned about swamping physical RAM? Or is that not an issue?

    The following solution takes an MD5 digest of the first ten elements of each line of the first file and stores it, along with the line number and the actual line, in a hash tied to MLDBM. The MD5 is the hash key; the line number and actual line are held in an array ref.

    Then one pass is taken through the second file. Again, for each line in the 2nd file, an MD5 hash is generated for the first ten elements. If that MD5 hash is found to be a key in our tied hash, we know (with a high degree of confidence) that we have a collision. You might perform additional processing, but all I did was to print both lines and both line numbers.

    I altered your input data set by copying one line from file two into file one, just to test. Here's the code I used:

    use strict;
    use warnings;
    use Digest::MD5 qw( md5_hex );
    use MLDBM;
    use Fcntl;

    tie my %o, 'MLDBM', 'tempdb', O_CREAT|O_RDWR, 0640 or die $!;

    # First file we populate the hash.
    process_file( 'infile1.txt', \%o, sub { $_[0]->{$_[1]} = [ $., $_[2] ]; } );

    # Second file we check for collisions.
    process_file( 'infile2.txt', \%o, sub {
        my( $tied, $hash, $line ) = @_;
        print "\nCollision: infile1.txt line $tied->{$hash}->[0]:\n",
              "\t($tied->{$hash}->[1])\n",
              "-- collides with: infile2.txt line $.:\n",
              "\t($line)\n\n"
            if exists $tied->{$hash};
    } );

    END { untie %o; unlink glob 'tempdb.*'; } # RAII.

    sub process_file {
        my( $filename, $tied, $code ) = @_;
        open my $infh, '<', $filename or die $!;
        while( my $line = <$infh> ) {
            my ( $wanted ) = $line =~ m/((?:[^,]*,){10})/;
            next unless length $wanted;
            chomp $line;
            my $hash = md5_hex( $wanted );
            $code->($tied, $hash, $line);
        }
        close $infh;
    }

    A pure hash (as opposed to one tied to a database) would be more time-efficient, but would not scale well if your file sizes climb to millions of lines. This approach will scale fairly well.

    A small optimization might be to go back to using the SEQUENCE_NUMBER as a hash key, and then if a partial collision is found, hash the first ten elements of that line from file 1 and file 2 to detect a full collision. But what happens if a single sequence number occurs more than once in a given file? (The first approach would be tolerant of that, whereas the second wouldn't.) Either approach assumes that there are no "full" collisions within a single file.
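
    A minimal sketch of that variant, for comparison; the file names are the same assumed ones as above, sequence numbers are assumed unique within each file, and the ten-field regex is reused from the code above:

    use strict;
    use warnings;
    use Digest::MD5 qw( md5_hex );

    my %seen;    # SEQ# => MD5 of first ten fields, from file 1

    open my $fh1, '<', 'infile1.txt' or die $!;
    while ( my $line = <$fh1> ) {
        my ( $wanted ) = $line =~ m/((?:[^,]*,){10})/ or next;
        my $seq = ( split /,/, $line )[1];
        $seen{$seq} = md5_hex( $wanted );
    }
    close $fh1;

    open my $fh2, '<', 'infile2.txt' or die $!;
    while ( my $line = <$fh2> ) {
        my ( $wanted ) = $line =~ m/((?:[^,]*,){10})/ or next;
        my $seq = ( split /,/, $line )[1];
        next unless exists $seen{$seq};    # partial collision on SEQ#?
        print "SEQ $seq: first ten fields differ between files\n"
            if $seen{$seq} ne md5_hex( $wanted );
    }
    close $fh2;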


    Dave

Re: Help creating HASH for file comparison
by samwyse (Scribe) on Mar 27, 2013 at 20:21 UTC

    If you're on a Unix system, don't use Perl. Sort each file on the second column, then use the 'join' command twice to find (a) lines that are common to both files, and (b) lines in one file that aren't in the other. The first group can then be further processed as desired.

    # sort each file numerically on its second comma-separated field
    sort -t , -k2,2n f1 >s1
    sort -t , -k2,2n f2 >s2
    # print lines whose second field is common to both files
    join -t , -1 2 -2 2 s1 s2
    # print lines in file 1 that are NOT in file 2
    join -t , -1 2 -2 2 -v 1 s1 s2

      Don't use Perl?!? I think we have an outbreak of heresy in the lower orders! :-)

      A Monk aims to give answers to those who have none, and to learn from those who know more.
      Unfortunately, finding common lines is just the start. Once they're found, I'll need to compare subsequent fields and perform further processing beyond that. Perl is the only choice.