http://www.perlmonks.org?node_id=171278

licking9Volts has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. After much searching of previous nodes, and a hard read of perlman:perlop (which only served to confuse me), I must now ask you for help. I have a program that reads a few lines from each of over 4000 files. This is the code I'm using to grab all of the file names:
...
while ($file = <*.las>) {
    open(FILE, "$file") || warn "Warning: can't open $file, skipping...\n";
    while (<FILE>) {
        chomp;
        if (/\b($info)\s*\./) { print $_ }
        ...
    }
}
...
When I run the program, there is a long pause (around 5 mins) before it actually starts printing out any results. What is it doing for those 5 mins? Is there something I can change to make it start printing results immediately or does it have to look at all the files before it can do anything? Thanks.

Replies are listed 'Best First'.
Re: File glob question
by Abigail-II (Bishop) on Jun 03, 2002 at 16:54 UTC
    It doesn't have to look at each file, but it *does* look at each *filename* of the directory. The glob is first going to create the entire list of matches - which it can only do by checking all filenames.

    If you want quicker output, you have to write your own opendir/readdir wrapper.

    Abigail

Re: File glob question
by VSarkiss (Monsignor) on Jun 03, 2002 at 17:08 UTC

    The answer from Abigail-II above is correct. In case you're wondering what an "opendir/readdir wrapper" is, it's a loop like this:

    chdir $the_directory or die "Couldn't chdir to $the_directory: $!";
    opendir(D, ".") or die "Couldn't open . ($the_directory): $!";
    while ($file = readdir D) {
        next unless $file =~ /\.las$/;
        # other logic later...
    }
    closedir D;
    Another common idiom is to use grep to pull out the entries you want first, although this will consume memory proportional to the number of files and the lengths of their names.
    chdir $the_directory or die "Couldn't chdir to $the_directory: $!";
    opendir(D, ".") or die "Couldn't open . ($the_directory): $!";
    my @files = grep { /\.las$/ } readdir D;
    closedir D;
    foreach my $file (@files) {
        # whatever...
    }
    Note, both of those code samples are untested.

    HTH

Re: File glob question
by Zaxo (Archbishop) on Jun 03, 2002 at 17:21 UTC
Re: File glob question
by rinceWind (Monsignor) on Jun 03, 2002 at 17:55 UTC
    You haven't said which platform your program is running on. There's a good chance that it's not the machine running your code that is bottlenecked, but the file server or the network.

    On Win32, there are well known performance problems with the layers of the operating system involved in wildcard directory lookup (=globbing).

    I've also seen the same scenario with NFS, but nowhere near as bad - in this case, the Unix box doing the file serving had some severe hardware problems of its own.

    A way round this is to use FTP instead of a direct mapping, and Net::FTP to access the directories and files.
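    A rough, untested sketch of that Net::FTP approach (the host, login details and remote directory below are placeholders, and it assumes the server allows plain FTP access to those files):

    use Net::FTP;

    # Placeholders: adjust the host, credentials and remote directory.
    my $ftp = Net::FTP->new('fileserver.example.com')
        or die "Can't connect: $@";
    $ftp->login('user', 'password') or die "Login failed: ", $ftp->message;
    $ftp->cwd('/path/to/las/files') or die "Can't cwd: ", $ftp->message;

    for my $file (grep { /\.las$/ } $ftp->ls) {
        # Fetch each file locally, then scan the local copy as before.
        $ftp->get($file) or warn "Couldn't fetch $file: ", $ftp->message;
    }

    $ftp->quit;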

    Update: Some CB correspondence with licking9Volts has established that the files are being served from a Unix box using Samba. IIRC, Windows has to have an image of the directory in memory before it can glob. If the directory is huge, then Windows thrashes in memory.

      rinceWind, I am running this on Win2k actually. The program sits in the same directory as the data files though. Would network problems still affect it? Also, Abigail, would it be any faster if I gathered all of the file names into a separate hash or array? BTW, the reason I'm concerned with speed is that I would like to eventually use part of this code on an internal website as a kind of file archive search. If it really came down to it, I could just create a separate index file with the data I need, then update it weekly and search on that, but how much fun is that? Thanks.

      Update: I crossed out the line above because I realized that with either scenario, it's still going to glob the file names into something and therefore it still has to read ALL of the file names. Thanks Abigail.
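      A rough, untested sketch of that index-file idea (the index filename, the $info pattern and the tab-delimited format are just placeholders); since it would only run weekly, the slow glob wouldn't matter:

      #!/usr/bin/perl
      use strict;

      # Placeholder pattern; the real one comes from the original script.
      my $info = 'WELL';

      # Write one "filename<TAB>matching line" record per hit, so the
      # website only has to search this one small index file.
      open(INDEX, ">las_index.txt") or die "Can't write las_index.txt: $!\n";
      while (my $file = <*.las>) {
          unless (open(FILE, $file)) {
              warn "Warning: can't open $file, skipping...\n";
              next;
          }
          while (<FILE>) {
              chomp;
              print INDEX "$file\t$_\n" if /\b($info)\s*\./;
          }
          close(FILE);
      }
      close(INDEX);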

        Actually, if your CGI's being hit with any kind of traffic, you'll have to do something like that. At the very least, you'll need something from the Cache::Cache clique or one of the other similar modules.

        Makeshifts last the longest.
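        For example, the CGI might cache its search results with Cache::FileCache from that distribution; an untested sketch, where the namespace, the expiry time, $query and run_search() are all made-up placeholders for the real CGI parameter and search routine:

        use Cache::FileCache;

        # Hypothetical names: 'las_search', $query and run_search() stand
        # in for the real CGI parameter and the slow file-scanning code.
        my $cache = Cache::FileCache->new({
            namespace          => 'las_search',
            default_expires_in => '10 minutes',
        });

        my $query   = 'GAMMA';                    # e.g. from a CGI parameter
        my $results = $cache->get($query);
        unless (defined $results) {
            $results = [ run_search($query) ];    # the slow file scan
            $cache->set($query, $results);
        }
        print "$_\n" for @$results;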

Re: File glob question
by tmiklas (Hermit) on Jun 03, 2002 at 21:37 UTC
    Maybe I'm wrong, but I suppose those 5 minutes are spent just building the list of files in the directory. The speed of that operation depends mostly on your filesystem (for example, Linux ext2 is much slower than ReiserFS, even in its beta/experimental versions).
    While you can't do much to speed the listing itself up, you can work around it... First, I suggest making a list of those 4000 files and writing it somewhere. Then you can try something like this:
    #!/usr/bin/perl
    use strict;

    open (FH, "<files.list") || die "Can't find files.list: $!\n";
    my @files = <FH>;
    close (FH);

    while (my $file = shift(@files)) {
        chomp ($file);
        open (FILE, "$file") || warn "Warning: can't open $file, skipping...\n";
        while (<FILE>) {
            chomp;
            if (/\b($info)\s*\./) { print $_ }
            ...
        }
    }
    and so on... Anyway, I'm sure that other monks right here would find a much better solution than mine :-)
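    For what it's worth, the files.list above could itself be regenerated by a small scheduled job, something like this untested sketch (run from the data directory):

    #!/usr/bin/perl
    use strict;

    # Dump the *.las filenames into files.list once (e.g. nightly),
    # so the main script never has to glob the directory itself.
    opendir(DIR, '.') or die "Can't read directory: $!\n";
    open(LIST, ">files.list") or die "Can't write files.list: $!\n";
    print LIST "$_\n" for grep { /\.las$/ } readdir DIR;
    close(LIST);
    closedir(DIR);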

    Greetz, Tom.