in reply to Performance Trap - Opening/Closing Files Inside a Loop

If you have the memory, something like this will probably be faster. You save the if/else on every line, and you do only the minimum work inside the loop (no splicing and joining when we don't really need it). Even a saving of a few microseconds × millions of lines is substantial. Many small calls to print are significantly slower than a single large one, as the OS can buffer and write more efficiently.

    #!/usr/bin/perl
    my (@field, %fh);
    while ( <DATA> ) {
        @field = split /,/;
        $fh{$field[2]} .= "$field[0],$field[1],$field[3]";
    }
    for my $file ( keys %fh ) {
        open F, ">$file" or die $!;
        print F $fh{$file};
        close F;
    }
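The claim that one large print beats many small ones is easy to check with the core Benchmark module. The sketch below uses made-up data and throwaway files; absolute numbers will vary with OS buffering and disk hardware.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempdir);

# Hypothetical data and file paths, written to a throwaway directory.
my $dir   = tempdir(CLEANUP => 1);
my @lines = ("some,csv,data,here\n") x 10_000;

cmpthese(50, {
    many_prints => sub {
        open my $fh, '>', "$dir/many.txt" or die $!;
        print $fh $_ for @lines;    # one print call per line
        close $fh;
    },
    one_print => sub {
        open my $fh, '>', "$dir/one.txt" or die $!;
        print $fh join('', @lines); # a single large print call
        close $fh;
    },
});
```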



Re^2: Performance Trap - Opening/Closing Files Inside a Loop
by tmoertel (Chaplain) on Dec 10, 2004 at 07:04 UTC
    tachyon, your code is likely to be faster not so much because it shaves away Perl cycles but because it will greatly reduce disk seeks, which are probably dominating L~R's run time. (See my other post in this thread for more on this.)

    L~R: Assuming that you have the RAM, can you compare tachyon's code's run time to the other implementations? My guess is that tachyon's code will fare well. (If you don't have the RAM, just tweak the code so that it will process, say, 100_000 or so lines per pass and clear out %fh between passes. Also, you'll need to open files in append mode.)
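    The tweak described above might look something like the following sketch. The chunk size, sample input, and /tmp file names are illustrative assumptions, and a scalar filehandle stands in for the real input stream.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $CHUNK = 100_000;    # lines to buffer per pass (assumed value)
my (%fh, $count);

sub flush_buffers {
    my $buf = shift;
    for my $file (keys %$buf) {
        # append mode, since earlier passes may already have written here
        open my $out, '>>', $file or die "can't append to $file: $!";
        print $out $buf->{$file};
        close $out;
    }
    %$buf = ();    # clear out %fh between passes
}

# A scalar filehandle stands in for the real <DATA> input here.
my $input = <<'CSV';
1,2,/tmp/out_a.txt,foo
3,4,/tmp/out_b.txt,bar
5,6,/tmp/out_a.txt,baz
CSV
open my $in, '<', \$input or die $!;
unlink '/tmp/out_a.txt', '/tmp/out_b.txt';    # start clean for this demo

while (<$in>) {
    my @field = split /,/;
    $fh{$field[2]} .= "$field[0],$field[1],$field[3]";
    flush_buffers(\%fh) if ++$count % $CHUNK == 0;
}
flush_buffers(\%fh);    # final pass for whatever is still buffered
```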


      I agree that reducing the number of seeks is vital. Given an average 3 ms seek time you can only do about 333 seeks per second, which is of course glacial. Ignoring buffering, the original code effectively needed 2 seeks (or more) per line; the improved version required at least 1 seek per line. In the example I presented, the number of seeks required is a function of the number of files we need to create, not the number of lines in the input file. This is a significant improvement provided that the number of unique files is smaller than the number of input lines.



      I had thought about this myself after posting. The reason I didn't give it much initial thought is that the Java developer made it clear I was not welcome in the sandbox. My guess is that some sort of limited buffer would be best, since that's still a whole lot of lines to keep in memory.

      Cheers - L~R

        If that is the case - combine the two methods.
        1. Buffer the strings up to say 10k or more.
        2. Once they hit that size - look for an open, or cache a new filehandle.
        3. Print out the buffered string for the file and clear the buffer.
        4. Finish off by flushing remaining buffers.
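        The four steps above might be sketched like this. The 10k limit, the helper names, the /tmp file names, and the sample input are all illustrative assumptions, not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $LIMIT = 10 * 1024;    # assumed per-file buffer threshold
my (%buf, %handle);

# Step 2: look for an already-open filehandle, or cache a new one.
sub handle_for {
    my $file = shift;
    $handle{$file} //= do {
        open my $fh, '>', $file or die "can't open $file: $!";
        $fh;
    };
}

# Step 3: print out the buffered string for the file and clear it.
sub flush_file {
    my $file = shift;
    print { handle_for($file) } $buf{$file};
    $buf{$file} = '';
}

# A scalar filehandle stands in for the real input stream.
my $input = <<'CSV';
1,2,/tmp/combined_a.txt,foo
3,4,/tmp/combined_b.txt,bar
5,6,/tmp/combined_a.txt,baz
CSV
open my $in, '<', \$input or die $!;

while (<$in>) {
    my @field = split /,/;
    my $file  = $field[2];
    $buf{$file} .= "$field[0],$field[1],$field[3]";     # Step 1: buffer
    flush_file($file) if length($buf{$file}) >= $LIMIT;
}

# Step 4: finish off by flushing the remaining buffers.
flush_file($_) for grep { length $buf{$_} } keys %buf;
close $_ for values %handle;
```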
        my @a=qw(random brilliant braindead); print $a[rand(@a)];
Re^2: Performance Trap - Opening/Closing Files Inside a Loop
by Animator (Hermit) on Dec 10, 2004 at 11:57 UTC
    for my $file (keys %fh) will (or should) be slower than while (my ($file, $data) = each %fh)
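    A quick, hypothetical micro-benchmark of the claim: each() avoids building the full key list and hands back the value directly, so it should edge out keys() plus a hash lookup. Relative numbers will vary with perl version and hash size.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# A made-up hash roughly resembling buffered file data.
my %h = map { $_ => "line data " . $_ } 1 .. 10_000;

cmpthese(200, {
    keys_loop => sub {
        my $n = 0;
        $n += length $h{$_} for keys %h;    # key list, then lookup
        return $n;
    },
    each_loop => sub {
        my $n = 0;
        while (my ($k, $v) = each %h) {     # key and value in one step
            $n += length $v;
        }
        return $n;
    },
});
```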

      Doesn't placing a variable declaration in the conditional statement for a conditional loop occasionally lead to strange errors?

      print substr("Just another Perl hacker", 0, -2);
      - apotheon
      CopyWrite Chad Perrin

        It shouldn't (or at least not as far as I know).
        Variable declaration with a statement modifier can have strange effects, usually a lexical not getting cleared as expected. Because my has both run-time and compile-time effects, a false test only blocks the run-time effect of resetting the variable. This can be used for some interesting obfu but should be avoided in serious code IMO.
        # if EXPR is false, $foo will still have the last set value
        # for this scope's (but not other scopes') $foo
        my $foo if EXPR;

        # same here: if EXPR is false the first time it's tested on this pass
        my $bar while EXPR;
        It's rather common to stick a declaration inside the test of a (BLOCK) loop, often seen with input loops, or looping over both keys and values of a hash at the same time.
        while (my $foo = <$fh>) {
            # do something
        }

        while (my ($key, $value) = each %hash) {
            # do other things
        }

        You're thinking of my() in statements with modifiers. The problem is that my() then only gets called sometimes, so the variable may retain its previous value from earlier iterations, behavior that is declared off-limits and buggy but never fixed, because making it consistent would cost too much overhead.


        Ok:

        while ( my ( $k, $v ) = each %h ) { ... }
        while ( my $line = <> ) { ... }
        for ( my $ix = 0; ...; ... ) { ... }

        Not ok

        my $... = ... if ...;
        ... or my $... = ...;

      While you are correct that accessing key/value pairs with each is a little faster, this is unlikely to influence the runtime in any measurable way, as the output bottleneck lies with the OS and disk IO.