Compare two large files and extract a matching row...

by suneel.reddy (Novice)
on Apr 17, 2012 at 11:30 UTC
suneel.reddy has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I'm posting my problem again, this time with my piece of code. I have a requirement to compare two large files based on a condition and create a new file from the result. The two files have different layouts (delimited with |); I need to compare the 2nd field of the first file with the 5th field of the second file and pull a record out of the second file when the IDs match.

Initially I did this by loading both files into two arrays, looping over them and filtering. But my files are very large (about 1 GB each, with around 10 million records), and my run went out of memory. I then tried Tie::File, which saved memory but not time; the process is still running with no sign of ending :) Can someone help me solve this? It's urgent and I'm new to Perl.

Record from File1 :

I|1400000042597061|ACTV|602282|2011-08-29||602178|JUSTIN||MAGRUDER||||||602282|100001|||||Gold||600990|||||||WUSA00029582|529381||||||||||||

Record from File2 :

I|1400000042589325|2011-08-29|ACTV|1400000042597061|600002|||1556 3RD AVE|||NEW YORK|NY|10128|3100|US|||||||

In the above example, the 2nd field of F1 matches the 5th field of F2, so this record from F2 will be loaded into an array.

And here is my piece of code:

while (my $pline = <PRFILE>) {
    $parrecord = $pline;
    my @parfields = split('\|', $parrecord);
    chomp(@parfields);
    # Child file is already loaded into @carray
    foreach (@carray) {
        $chrecord = $_;
        @chfields = split('\|', $chrecord);
        chomp(@chfields);
        if ($chfields[4] eq $parfields[1]) {
            # Some logic
            # last;
        }
    }
}

Re: Compare two large files and extract a matching row...
by marto (Chancellor) on Apr 17, 2012 at 11:37 UTC
Re: Compare two large files and extract a matching row...
by mzedeler (Pilgrim) on Apr 17, 2012 at 11:37 UTC

    Here is an approach that should work:

    1. Read file 1, pulling field 2 out and storing it in a hash for fast lookup.

    while (<FILE1>) {
        chomp;
        my @fields = split /\|/;
        $lookup{$fields[1]} = 1;
    }

    2. Read file 2 line by line and put the line in your output file if field 5 is in the %lookup hash.
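
    A minimal sketch of that second pass might look like this (assuming a FILE2 handle for the second file and an OUTFILE handle for the output, neither of which is shown in the post):

    while (<FILE2>) {
        chomp;
        my @fields = split /\|/;
        # emit the file 2 line whenever its 5th field was stored from file 1
        print OUTFILE "$_\n" if exists $lookup{$fields[4]};
    }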

    With this approach you never need more than one line from each file in memory at any time, plus the IDs from file 1 in the hash. This should be both fast and memory efficient.

Re: Compare two large files and extract a matching row...
by mrguy123 (Hermit) on Apr 17, 2012 at 12:57 UTC
    The problem here is that storing too much data in an array will give you a memory problem very fast (especially for large files)

    Placing the data in a hash is a great solution (as mentioned above) but for large files that might still be a problem
    What might help you out here is that Perl is amazingly fast at reading files; I've seen it go through more than 40 GB of data in minutes. This means that if your hash is too big, you can split File1 into several smaller hashes and make a pass over File2 for each one, and it still won't take very long.
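
    A rough sketch of that multi-pass idea, for illustration only (the file names, handles, and chunk size are made up; tune the chunk size to your memory):

    open(my $f1,  '<', 'file1.txt')   or die $!;   # placeholder names
    open(my $out, '>', 'matches.txt') or die $!;
    my $chunk = 500_000;   # keys per pass; adjust to available memory

    until (eof $f1) {
        # load the next chunk of IDs (field 2 of File1) into a small hash
        my %lookup;
        while (keys(%lookup) < $chunk) {
            defined(my $line = <$f1>) or last;
            chomp $line;
            my @f = split /\|/, $line;
            $lookup{ $f[1] } = 1;
        }

        # one full pass over File2 for this chunk
        open(my $f2, '<', 'file2.txt') or die $!;
        while (my $line = <$f2>) {
            chomp $line;
            my @f = split /\|/, $line;
            print $out "$line\n" if exists $lookup{ $f[4] };
        }
        close $f2;
    }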

    Hope this helps
    Mr Guy
Re: Compare two large files and extract a matching row...
by traceyfreitas (Sexton) on Apr 17, 2012 at 20:39 UTC

    A couple of things I notice from your code:

    1. You store every full record of your child file in @carray. Since you only need the second field, you could have each element of @carray hold just that field.

    2. For every line (M) in the parent you search the entire child list (N), which is N*M comparisons. And because @carray is not deduplicated, you are potentially doing extra searches.

    3. You call chomp() more often than you need to. Call chomp once when you read a line from <PRFILE>, before splitting. Do the same when you load your field into @carray and you won't have to call chomp inside your foreach (@carray) loop.

    while (my $pline = <PRFILE>) {
        chomp $pline;
        $parrecord = $pline;
        my @parfields = split(/\|/, $parrecord);
        ...
    }

    I see a couple of approaches you could take:

    ARRAY approach

    1. Create a non-redundant array (@carray) of field #2 from your child file (File1)

    - either add each field #2 as a key to a hash, extract the unique keys with keys(), and then undef the hash or let it go out of scope to free the memory back to perl (a short sketch follows the loop below),

    or

    - push each field #2 onto @carray directly, then sort and remove the duplicates.

    2. Search <PRFILE> for each $_ in (@carray);

    CHILD: foreach (@carray) {
        # search <PRFILE> line-by-line for a match of $_ to field #5
        # if each field #5 in <PRFILE> is guaranteed unique, you can go to
        # the next CHILD element once a match is found
    }
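
    A minimal sketch of step 1 above, using the throwaway-hash route to build the non-redundant @carray (CHFILE is the child-file handle used later in this reply; field #2 is assumed to be the match field):

    my @carray;
    {
        my %seen;                             # temporary, scoped so perl can free it
        while (my $line = <CHFILE>) {
            chomp $line;
            my @fields = split(/\|/, $line);
            $seen{ $fields[1] } = ();         # field #2 as a key; value unused
        }
        @carray = keys %seen;                 # non-redundant list of ids
    }                                         # %seen goes out of scope here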

    HASH approach

    In my own experience, I've noticed that the number of keys in a hash table can grow to about 280,000 keys before it starts slowing down considerably. If, in your 10M or so records, you have < 300,000 unique fields, then this approach should be fine. If not, then you'll either have to go with the ARRAY approach above, or break up your key into a prefix and suffix and store it that way. I'll give an example below.

    1. Create a hash of all the elements:

    my %lookupHash = ();
    while (my $line = <CHFILE>) {
        chomp $line;
        my @fields = split(/\|/, $line);
        $lookupHash{ $fields[1] } = ();      # $fields[1] = 2nd field
        #$lookupHash{ $fields[1] } = $line;  # need entire line?
    }

    2. Search <PRFILE>, line by line, to see whether its 5th field is present in %lookupHash:

    while (my $line = <PRFILE>) {
        chomp $line;
        my @fields = split(/\|/, $line);
        print $OUTFILE $line."\n" if (exists $lookupHash{ $fields[4] });
    }
    If your performance with the above approaches really is like molasses, continue reading...

    ****** It gets more complicated below here ******

    Now in the worst-case scenario, if your 1GB file of 10M or so records truly has 10M *unique* records, then this hash lookup will be slow as balls. At the cost of extra space, though hopefully not in excess of your system RAM, you could break up your hash key into a prefix and a suffix to effectively create subsets of your data (as a hash-of-hash), so you don't have to look through the entire 10M records every time you do an exists(). For example, you could break "1400000042597061" down like one of these:

    field               pre     suf                  $HoH{$pre}->{$suf} = ()
    ----------------    -----   ------------------   ------------------------------
    1400000042597061 = "1"    + "400000042597061"    $HoH{1}->{400000042597061} = ();
                      = "14"   + "00000042597061"    $HoH{14}->{00000042597061} = ();
    ...

    This way, you could do an exists() on a much smaller subset of data. Unfortunately, 10M records can differ in at most about 8 digits (10,000,000 is 8 digits long) while your field here has 16 digits, so the leading digits will mostly be the same; I would therefore store the suffix as the primary key, and the prefix behind the suffix. In addition, you would do well to keep the length of the suffix down to 5 digits or less (at most 99,999 keys) rather than 6 digits (up to 999,999 keys, which could be too slow). Example:

    field               prefix           suffix    $HoH{$suf}->{$pre} = ();
    ----------------    -------------    ------    ---------------------------------
    1400000042597061 = "14000000425"  + "97061"    $HoH{97061}->{14000000425} = ();

    Assuming $myField holds the field of interest, like "1400000042597061", you can break it into these prefix/suffix segments with substr():

    my $prefix = substr($myField, 0, 11);
    my $suffix = substr($myField, 11);

    Now load it up in your 2D hash (hash-of-hash):

    $hash{$suffix}->{$prefix} = ();

    So this is how you would generate your lookup hash with <CHFILE>:

    my %lookupHash = ();
    LINE: while (my $line = <CHFILE>) {
        chomp $line;
        my @fields = split(/\|/, $line);
        my $prefix = substr($fields[1], 0, 11);  # grab pos 0 through 10
        my $suffix = substr($fields[1], 11);     # grab pos 11 to the end
        $lookupHash{$suffix}->{$prefix} = ();
        #$lookupHash{$suffix}->{$prefix} = $line;  # need entire line?
    }

    %lookupHash will be the lookup you use to search your parent file. Note that I didn't put in *any* error checking to see whether your field conforms to what you're expecting. If you need to keep track of the entire $line itself, use the alternate (commented-out) %lookupHash assignment above.

    Here's how you would search your parent file:

    LINE: while (my $line = <PRFILE>) {
        chomp $line;
        my @fields = split(/\|/, $line);

        # do suffix first; skip the parent line if the suffix is absent
        my $suffix = substr($fields[4], 11);
        next LINE if (!exists $lookupHash{$suffix});

        # reach here if the suffix exists; check the prefix
        my $prefix = substr($fields[4], 0, 11);
        next LINE if (!exists $lookupHash{$suffix}->{$prefix});

        # Output the parent line
        print $OUTFILE $line."\n";
        # ... or output the child line if you needed that one
        #print $OUTFILE $lookupHash{$suffix}->{$prefix}."\n";
    }

    This should speed up your program and reduce memory without additional data profiling. Good luck!

      Hi traceyfreitas,

      Thanks a lot for explaining it in detail. Let me try those approaches... and one more thing: there will be no duplicates in either of my files.

      Thanks again, I heartily appreciate you explaining it in such minute detail.
