
quicker way to merge files?

by nessundorma (Initiate)
on May 19, 2010 at 02:02 UTC ( #840600=perlquestion )
nessundorma has asked for the wisdom of the Perl Monks concerning the following question:

Hey, I have the script below, which takes two files, loops over their lines, and, when two particular fields match, writes combined lines to an output file. It works fine, except that I am now looping over very large files, and the loop takes days instead of minutes. Is there a way to make this more efficient?

use File::Basename;
open (DATA, "file1.out") or die "$!";
while (my $line = <DATA>){
    if ($line =~ /(\S+)(\s+)run(\d+)_sub(\d+)_event(\d+)(\s+)(\S+)(\s+)(\S+)(\s+)(\S+)(\s+)(\S+)(\s+)(\S+)(\s+)(\S+)(\s+)(\S+)(\s*)/){
        open (REFILE, ">>output.out");
        my $var1 = $5;
        my $var2 = $9;
        my $var3 = $11;
        my $var4 = $13;
        my $var5 = $15;
        my $var6 = $17;
        my $var7 = $19;
        my $var8 = $7;
        my $var9 = $1;
        open (DATA2, "file2.out") or die "Cannot open file2";
        while (my $line2 = <DATA2>){
            if($line2 =~ /(\S+)(\s+)(\S+)(\s+)run(\d+)_sub(\d+)_event(\d+)(\s*)/){
                my $var10 = $3;
                my $var11 = $7;
                if ($var3 == $var10){
                    if ($var4 == $var11){
                        print REFILE "$var1 $var2 $var11\n";
} } } }
} } close(REFILE); close(DATA);

Original content restored by GrandFather

Re: quicker way to merge files?
by GrandFather (Sage) on May 19, 2010 at 05:28 UTC

    There are a number of major problems with your code. The first is that you can open REFILE many times, but you only explicitly close it once - that probably indicates a fundamental logic error.

    Any code which rereads a file for each line of another file is bound to be slow. Don't do that! Instead, read all of the smaller file into memory once before you start processing the second file (if you want to look things up, put them in a hash), then use the cached data from the smaller file.
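    In outline, the cached-lookup idea looks something like this (the field layout and key format here are invented for illustration, not the OP's actual columns):

```perl
use strict;
use warnings;

# Toy stand-in for the smaller file.
my $small_file = <<'EOF';
alpha 3 4
beta  5 6
EOF

# Read the smaller file once and index it by its lookup key.
my %lookup;
open my $fh, '<', \$small_file or die $!;
while (<$fh>) {
    my ($name, $k1, $k2) = split;
    $lookup{"$k1;$k2"} = $name;
}
close $fh;

# Later, while processing the big file, each probe is a cheap hash lookup.
print "found $lookup{'3;4'}\n" if exists $lookup{'3;4'};
```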

    Numbered variables almost always indicate that you should be using an array. In this case you could:

    my @vars = (undef, $5, $9, $11, $13, $15, $17, $19, $7, $1);

    although given that you don't access most of the values in the sample code, you may be better off using named variables (with sensible names) for just the fields you do need. Better still, don't capture the fields you're not interested in, and thus simplify your regex!

    Ignoring the file management issue for the moment (I don't know how big the files are so it's hard to tell what a sensible solution is), the code can be cleaned up to:

    use strict;
    use warnings;

    my $file2 = <<FILE2;
    00001 003 run1_sub1_event4
    FILE2

    while (my $line = <DATA>) {
        next if $line !~ /\S+\s+run\d+_sub\d+_event(\d+)\s+(.*)/;

        my ($event, $tail) = ($1, $2);
        my @params = split /\s+/, $tail;

        open my $DATA2, '<', \$file2 or die "Cannot open file2";
        while (my $line2 = <$DATA2>) {
            next if $line2 !~ /\S+\s+(\S+)\s+run\d+_sub\d+_event(\d+)/;
            print "$event $params[1] $2\n"
                if $params[2] == $1 && $params[3] == $2;
        }
        close $DATA2;
    }

    __DATA__
    00001 run1_sub1_event1 1 2 3 4 5 6 7


    which prints:

    1 2 4
    True laziness is hard work
Re: quicker way to merge files?
by Marshall (Abbot) on May 19, 2010 at 07:59 UTC
    GrandFather's advice is spot on. The file operations are very expensive in terms of performance. It looks like, for every line in DATA, REFILE is re-opened and the DATA2 file is re-read and re-parsed.

    Just moving the open of REFILE to the top of the code will save one very expensive file system operation for every line in DATA. If one of the files is small enough to fit into memory, something as simple as @lines = <DATA2>; will produce significant CPU savings, because reading through an array in memory is MUCH faster than continually re-reading the file off the disk.

    There is some speculation involved in this next suggestion, as I have no idea of the size of the files, but the CPU savings will be enormous if this works out. It appears that, for each line in DATA, you are checking whether the DATA2 file contains a line whose two key parameters match the line currently under inspection in DATA; if so, an output line is generated.

    BTW, I quite frankly found this blizzard of $var9,$var13 type stuff to be very confusing. Better variable names would help immensely!

    Anyway, if you read DATA2 first and create a %data2 hash with keys like $data2{"$var10;$var11"} = 1;, then as you read DATA you check for the existence of $data2{"$var3;$var4"} and, if present, print $var1 $var2 $var4. I think that would work. The size of %data2 could get huge; hundreds of thousands of keys aren't out of the question.
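    A sketch of that two-pass approach, with toy data standing in for the two files (the column layout and regexes here are illustrative, not the OP's exact format):

```perl
use strict;
use warnings;

# Toy stand-ins for file2.out and file1.out.
my $file2 = <<'F2';
x 003 run1_sub1_event4
x 005 run1_sub1_event9
F2
my $file1 = <<'F1';
a run1_sub1_event4 11 003 4
b run1_sub1_event9 22 005 8
c run1_sub1_event2 33 003 4
F1

# Pass 1: index file2 once, keyed on the pair of fields that must match.
my %data2;
open my $fh2, '<', \$file2 or die $!;
while (<$fh2>) {
    $data2{"$1;$2"} = 1 if /^\S+\s+(\S+)\s+run\d+_sub\d+_event(\d+)/;
}
close $fh2;

# Pass 2: a single read of file1 with one hash lookup per line.
open my $fh1, '<', \$file1 or die $!;
while (<$fh1>) {
    next unless /^(\S+)\s+run\d+_sub\d+_event(\d+)\s+(\S+)\s+(\S+)\s+(\S+)/;
    print "$2 $3 $5\n" if $data2{"$4;$5"};   # prints "4 11 4" and "2 33 4"
}
close $fh1;
```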

    Pitfalls: %data2 may simply be too big to fit into memory. If so, things get more complex if you want this to run really fast - but it's still possible. Also, some of what you have as \S+ in the regex are really numbers, and there can be a "mismatch" when dealing with leading zeroes: in Perl everything is a string until it is used in a numeric context. One trick to delete leading zeroes is to add 0 to the number: $var += 0;. Now when you use $var as part of a hash key, it won't have any leading zeroes. That's important if one file had "033" and the other "00033".
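    The leading-zero trick in action:

```perl
use strict;
use warnings;

my $a = "00033";
my $b = "033";
print $a eq $b ? "eq\n" : "ne\n";   # ne - as strings they differ

# Adding 0 forces numeric context, discarding the leading zeroes.
$a += 0;
$b += 0;
print $a eq $b ? "eq\n" : "ne\n";   # eq - both are now "33"
```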

    In the best case, read each line of data once, parse it once. Building even what might seem to be "huge" hash tables is not nearly as expensive as re-reading a file over and over again. If you are dealing with files of just some few hundred MB, execution time in the seconds is not an unreasonable expectation.

Re: quicker way to merge files?
by Krambambuli (Curate) on May 19, 2010 at 08:16 UTC
    You made no mention of it, so I'd guess that the files are not sorted.

    If the files are big, it may be worth first sorting them on the fields you're interested in - even if that takes quite some time too - and then stepping sequentially _once_ through the sorted files to pick out the matches.
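    The sequential step through two sorted streams looks like this (toy keys, with the sorting and file parsing omitted):

```perl
use strict;
use warnings;

# Two already-extracted key lists, standing in for the sort keys of the
# two files (illustrative data only).
my @keys1 = sort { $a <=> $b } (3, 1, 7, 5);
my @keys2 = sort { $a <=> $b } (5, 2, 3, 9);

# One pass over both sorted lists finds every common key.
my ($i, $j) = (0, 0);
while ($i < @keys1 && $j < @keys2) {
    if    ($keys1[$i] < $keys2[$j]) { $i++ }
    elsif ($keys1[$i] > $keys2[$j]) { $j++ }
    else  { print "merge at $keys1[$i]\n"; $i++; $j++ }   # prints 3, then 5
}
```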

Re: quicker way to merge files?
by ig (Vicar) on May 19, 2010 at 08:39 UTC

    You might consider putting your data into a database. Some can handle very large data sets, and table joins are one of the things they are designed to do efficiently.

      So, to get a single file with merged contents, you're advising:

      1. Create a database.
      2. Create two tables.
      3. Load the input files into the tables.
      4. Join the tables into a third.
      5. Dump the third table back to a new file.

      And you anticipate this will be quicker than just merging the files?

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Depending on the situation, yes, it might well be... (Especially if there is no 'smaller file' and loading them into memory will not work.)

        Although you can probably drop the third table, and just write the data out (from a query) instead.

        I hadn't contemplated what you suggest. While it seems unlikely to be optimal for a one-time effort, it appears that it would be quite easy to do, and "days" might be ample time to get something of the sort done. It would almost certainly be faster than the current approach, and it avoids some non-trivial - and time-consuming - programming that might otherwise be required.

        More compelling is that this appears not to be a one-time requirement: "I am starting to loop over very large files" suggests this is a repeating and ongoing exercise. It might be better to change the processes that produce the input data to write it directly to a database as it is produced, avoiding the intermediate files. And there is no mention of what is done with the merged file; it might be better to revise the processes that use the output file to access such a database directly.

        Along the lines of not re-inventing the wheel, I suggest consideration be given to taking advantage of a well known tool (RDBMS) that appears to be quite relevant to the problem at hand.

        There is not nearly enough information in the post to know what might be best, which is why I only suggested considering it. Your points are also worthy of careful consideration in the broader context of the requirements, though they are not the only way to use a database in this situation - whatever it is.
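        For what it's worth, the database route might be sketched with DBI and DBD::SQLite along these lines (the table layout and column names are invented, and a real script would parse and bulk-load the actual input files):

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database; use a filename for data that must persist.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE f1 (name TEXT, event INT, k1 INT, k2 INT)');
$dbh->do('CREATE TABLE f2 (k1 INT, event INT)');

# Load the parsed lines of each input file (parsing omitted here).
my $ins1 = $dbh->prepare('INSERT INTO f1 VALUES (?, ?, ?, ?)');
$ins1->execute('a', 4, 3, 4);
my $ins2 = $dbh->prepare('INSERT INTO f2 VALUES (?, ?)');
$ins2->execute(3, 4);

# The merge itself is a join; write each row straight to the output.
my $rows = $dbh->selectall_arrayref(
    'SELECT f1.event, f1.k1, f2.event
       FROM f1 JOIN f2 ON f1.k1 = f2.k1 AND f1.k2 = f2.event');
print "@$_\n" for @$rows;

$dbh->disconnect;
```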

Re: quicker way to merge files?
by doug (Pilgrim) on May 19, 2010 at 17:03 UTC

    Others have pointed out how to do the I/O and regexp more efficiently. I'm going to stick with the structure:

    DATA is a horrible name to use as a file handle, because perl provides a semi-magic one with the same name. When I glanced at the code and saw that, I immediately looked for a __DATA__ marker.

    I don't like big "if" blocks with no else. I think something like  next unless ( $line =~ m/..../ ); is easier to read.

    Also, the formatting could be better. Four closing braces on the same line make me think of the worst abuses in LISP.

    - doug

      The formatting at least may not be representative of the OP's original code. He chose to use br tags rather than wrapping the code in code tags, and as we all know, that doesn't provide good control over formatting.

      True laziness is hard work
