http://www.perlmonks.org?node_id=295969


in reply to Efficient processing of large directory

Use while instead.

while( my $file = <directory/*> ) {
    # do stuff
}

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Re: Efficient processing of large directory
by Elliott (Pilgrim) on Oct 02, 2003 at 16:47 UTC
    Thanks for the tip - but why? (Most of all I want it to work - but I also want to understand)
      It has to do with how foreach (and for, which is an exact synonym) works. foreach constructs the entire list first, then iterates through it. This can be very memory-intensive, which slows processing (due to cache misses and virtual memory issues). A nearly exact rewrite of foreach in terms of while would look something like:
      foreach my $n (<*.*>) {
          # do stuff
      }

      ----

      my @list = <*.*>;
      my $i = 0;
      while ($i <= $#list) {
          my $n = $list[$i];
          # do stuff
      } continue {
          $i++;
      }

      ------
      We are the carpenters and bricklayers of the Information Age.

      The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

        Er, why on earth do you tell him to use while and then show a while loop that does exactly what the for loop had been doing previously? Your method still has to build the 17,000-element list and iterate over it; it just uses a more explicit form. A rewrite which actually gets around this would be simply:
        while(my $x = <*.*>) { do_stuff($x); }
        This will only read a single file at a time and has no need to create huge lists.
Re: Re: Efficient processing of large directory
by Elliott (Pilgrim) on Oct 03, 2003 at 15:37 UTC
    I've tried it now with while ... and it timed out :-(

    Looks like I'd better try subdirectories too.
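
    In case it helps, a hedged sketch of what splitting into subdirectories could look like: bucket each file under a subdirectory named after its first character, so no single directory holds all 17,000 entries. The 'data' path and bucket scheme are purely illustrative, not from the thread.

    use File::Copy qw(move);
    use File::Path;          # for mkpath

    # Read the names first so the directory isn't being rearranged mid-readdir.
    opendir my $dh, 'data' or die "Can't opendir data: $!";
    my @names = readdir $dh;
    closedir $dh;

    for my $name (@names) {
        next unless -f "data/$name";                      # skips '.', '..' and any subdirectories
        my $bucket = 'data/' . lc substr($name, 0, 1);    # e.g. data/r for report42.txt
        mkpath($bucket) unless -d $bucket;                # create the bucket on first use
        move("data/$name", "$bucket/$name") or warn "Couldn't move $name: $!";
    }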

      You should use readdir instead. Also, if this is running from a CGI (I guess that's what the timeout refers to), make sure to give the client a few bytes of data every now and then so it doesn't give up waiting.
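
      A minimal, untested sketch of both suggestions; the directory path and process_file() are placeholders, not from the thread:

      opendir my $dh, '/path/to/dir' or die "Can't opendir: $!";
      $| = 1;                                    # unbuffer output so the trickle actually reaches the client
      my $count = 0;
      while ( defined( my $entry = readdir $dh ) ) {
          next if $entry eq '.' or $entry eq '..';
          process_file("/path/to/dir/$entry");   # placeholder for whatever is done per file
          print '.' unless ++$count % 500;       # a few bytes now and then so the client keeps waiting
      }
      closedir $dh;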

      Makeshifts last the longest.

        Now that I know readdir exists (thanks Aristotle!), I was able to RTFM and put it into practice. Those functions that do not require me to open the files have improved stunningly. So much so that the client rang me at home to thank me and backed it up with an email full of exclamation marks.

        Now I have to further improve the processes that have to read all the files. But I guess that's another thread!

        readdir is less efficient than glob if only a subset of the directory contents is sought, since readdir hands back every entry and the filtering then has to be done in your own code.
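
        To illustrate the trade-off being claimed (hypothetical 'data' directory and '*.log' pattern, not from the thread): glob does the pattern matching itself, while readdir hands back every entry and leaves the filtering to your code.

        my @logs      = glob 'data/*.log';                    # only the matching names, prefixed with 'data/'

        opendir my $dh, 'data' or die "Can't opendir data: $!";
        my @also_logs = grep { /\.log\z/ } readdir $dh;       # every entry, filtered afterwards (bare names)
        closedir $dh;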

