PerlMonks  

Compare two Large FlatFiles

by suneel.reddy (Novice)
on Apr 13, 2012 at 14:43 UTC (#964940)
suneel.reddy has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone, I need to compare two large files based on a condition and create a new file from the matches. The two files have different layouts: I have to compare the 2nd field of the first file with the 5th field of the second, and pull out the record from the second file if the IDs match. Initially I did this by loading the files into two arrays and looping over them to filter. But my files are huge, about 1GB each with 10 million records, and my script ran out of memory. Then I tried Tie::File, which saved memory but did not help performance. My process is still running with no sign of ending :) Can someone help me solve this? It's urgent and I'm new to Perl.

Re: Compare two Large FlatFiles
by marto (Chancellor) on Apr 13, 2012 at 14:50 UTC

    Welcome! Please post your code and some sample input data (pay attention to Writeup Formatting Tips please). If you don't show this, how can anyone help improve it?

Re: Compare two Large FlatFiles
by kennethk (Monsignor) on Apr 13, 2012 at 17:02 UTC

    This is in a fairly common family of issues. I'd suggest you review some of the answers to previous posts in this vein (for example Pattern matching across two files, Need something better than grep -f!). I'd also say nested loops are a horrible solution here, and you should swap to a hash or, if the files are too large for that, a database. We might be able to offer some more concrete advice if you show us exactly what you are doing, e.g. post code.

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: Compare two Large FlatFiles
by mikeraz (Friar) on Apr 13, 2012 at 18:15 UTC

    From the description you gave, this pseudo code may match the problem (yeah, I'm being hesitant)

    if ( file1-field2 eq file2-field5 ) { give me file2-field_to_be_named || first_round_draft_pick }
    OK, kidding about the draft pick.

    If this is close to the truth, consider these fragments to work with:

    my %file1;
    my %file2;

    # Remember every ID seen in field 2 of file1
    while (<file1>) {
        my $field_from_1 = (split /<DELIMITER>/)[1];
        $file1{$field_from_1}++;
    }

    # Map each ID in field 5 of file2 to the record we want
    while (<file2>) {
        my ($field_from_2, $record_we_want)
            = (split /<DELIMITER>/)[4, index_we_cant_guess_from_description];
        $file2{$field_from_2} = $record_we_want;
    }

    foreach my $key ( keys %file2 ) {
        if ( exists $file1{$key} ) {
            # do whatever it is they want with $file2{$key}
        }
    }

    Those snips require a reasonably small subset of the memory needed to do what you described. Perhaps small enough to do the entire task in memory. What they don't do is:
    • Preserve the order of the records, but we don't know if that's important from your description
    • Preserve duplicate ID records, but we don't ...
    If this is not helpful I'll join the first two responders in requesting you provide more information. A few lines of sanitized data file and a sample of the code you've developed so far will do wonders for our helpfulness.

    Update: destressed some tortured English


    Be Appropriate && Follow Your Curiosity
      Hi guys, below is my piece of code. Please help me tune it; this is the only logic I can think of with the knowledge I have of Perl :)

      while(my $pline = <PRFILE>) {
          $parrecord = $pline;
          my @parfields=split('\|',$parrecord);
          chomp(@parfields);
          # Child file is already loaded into @carray
          foreach (@carray) {
              $chrecord = $_;
              @chfields=split('\|',$chrecord);
              chomp(@chfields);
              if(@chfields[4] eq @parfields[1]) {
                  # Some logic #
                  last ;
              }
          }
      }

      OOOH God...!!!! I was shocked to see the alignment of my code after posting this, but I have no option :)
        Okay here is my problem.....

        record from file1 :

        I|1400000042597061|ACTV|602282|2011-08-29||602178|JUSTIN||MAGRUDER||||||602282|100001|||||Gold||600990|||||||WUSA00029582|529381||||||||||||

        record from file2 :

        I|1400000042589325|2011-08-29|ACTV|1400000042597061|600002|||1556 3RD AVE|||NEW YORK|NY|10128|3100|US|||||||

        Here the second field from F1 is equal to the 5th field of F2

        And my script...

        while(my $pline = <PRFILE>) {
            $parrecord = $pline;
            my @parfields=split('\|',$parrecord);
            chomp(@parfields);
            # Child file is already loaded into @carray
            foreach (@carray) {
                $chrecord = $_;
                @chfields=split('\|',$chrecord);
                chomp(@chfields);
                if(@chfields[4] eq @parfields[1]) {
                    # Push $chrecord into some array #
                    last ;
                }
            }
        }
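
For reference, here is a minimal sketch of the hash approach suggested in the replies above, applied to the pipe-delimited layout of the sample records. It is not code from the thread: the sub name filter_children and its arguments are made up for illustration, and it assumes the distinct IDs from the first file fit in memory as hash keys. It keeps only the 2nd-field IDs of the first file, then streams the second file once, so each record costs one hash lookup instead of a scan of the whole array. Matching records come out in their original order, duplicates included.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): print to $out_fh every record of
# the child file whose 5th field equals the 2nd field of some parent record.
# Assumption: the set of parent IDs fits in memory as hash keys.
sub filter_children {
    my ($parent_path, $child_path, $out_fh) = @_;

    # Pass 1: remember only the IDs from the parent file, not whole records.
    my %parent_id;
    open my $parent_fh, '<', $parent_path
        or die "Cannot open $parent_path: $!";
    while (my $line = <$parent_fh>) {
        chomp $line;
        my $id = (split /\|/, $line)[1];        # 2nd field
        $parent_id{$id} = 1 if defined $id && length $id;
    }
    close $parent_fh;

    # Pass 2: stream the child file, one hash lookup per record.
    my $matched = 0;
    open my $child_fh, '<', $child_path
        or die "Cannot open $child_path: $!";
    while (my $line = <$child_fh>) {
        chomp $line;
        my $id = (split /\|/, $line, 6)[4];     # 5th field; stop splitting after it
        next unless defined $id && exists $parent_id{$id};
        print {$out_fh} "$line\n";
        $matched++;
    }
    close $child_fh;

    return $matched;                            # number of records written
}
```

It could be called as filter_children('parent.txt', 'child.txt', \*STDOUT), where both file names are placeholders. If even the ID hash turns out to be too large for memory, sorting the files first or loading them into a database, as suggested elsewhere in the thread, are the fallbacks.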

Re: Compare two Large FlatFiles
by roboticus (Canon) on Apr 13, 2012 at 19:37 UTC

    suneel.reddy:

    I helped someone with a similar problem. I suggested that they sort the files on the key of interest, and then process the file sequentially.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.
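
The sort-then-scan idea roboticus describes could look roughly like the sketch below. It is an illustration only, not the code from the node he refers to: the sub names next_record and merge_filter are invented here, the pipe delimiter and field positions come from the sample records earlier in the thread, and both inputs are assumed to be already sorted on their key in plain byte order (for example with the system sort utility under LC_ALL=C, using -t'|' and -k2,2 or -k5,5). Because the two files are walked in step, only one record from each is held in memory at a time.

```perl
use strict;
use warnings;

# Hypothetical helper: read the next record from $fh and return (key, line),
# or an empty list at end of file.
sub next_record {
    my ($fh, $key_index) = @_;
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    my $key = (split /\|/, $line)[$key_index];
    return (defined $key ? $key : '', $line);
}

# Assumes both inputs are sorted by key in string (byte) order.
# Prints each child record whose 5th field occurs as a 2nd field in the parent.
sub merge_filter {
    my ($parent_fh, $child_fh, $out_fh) = @_;
    my ($pkey)         = next_record($parent_fh, 1);   # 2nd field
    my ($ckey, $cline) = next_record($child_fh, 4);    # 5th field
    my $matched = 0;

    while (defined $pkey && defined $ckey) {
        my $order = $pkey cmp $ckey;
        if ($order < 0) {
            # Parent key is behind: move the parent forward.
            ($pkey) = next_record($parent_fh, 1);
        }
        elsif ($order > 0) {
            # No parent for this child key: skip the child record.
            ($ckey, $cline) = next_record($child_fh, 4);
        }
        else {
            print {$out_fh} "$cline\n";
            $matched++;
            # Advance only the child so repeated child keys all match.
            ($ckey, $cline) = next_record($child_fh, 4);
        }
    }
    return $matched;
}
```

The comparison uses cmp rather than <=> so that it agrees with a byte-order sort; if the files were sorted numerically instead, the comparison would have to change to match.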

Re: Compare two Large FlatFiles
by traceyfreitas (Sexton) on Apr 13, 2012 at 21:46 UTC
    I believe your classmate received a solution already. Check here for the answer by aaron_baugher whose solution doesn't load everything into memory.
