PerlMonks  

Faster Method for Gathering Data

by APA_Perl (Novice)
on Jul 31, 2003 at 11:40 UTC (#279534=perlquestion)

APA_Perl has asked for the wisdom of the Perl Monks concerning the following question:

The following code gathers the path names of all SGML files in a directory tree. It just runs VERY VERY slow. I am working with 33,000 files and it took this code almost 5 hours to gather everything into the array.

Is there a faster method anyone could suggest?

Be kind, I am new and this is my first post to the Monks.

Thanks!

    use File::Find;
    use Time::localtime;

    $now = ctime();
    my @dirs = @ARGV or die "No valid directory argument(s)";
    find(
        sub {
            m/\.sgml$/
                and push @files, "$File::Find::name"
                and print "$File::Find::name\n";
        },
        @dirs,
    );
    $fileCount = @files;
    print "\nThere are $fileCount files here.\n";
    $then = ctime();
    print $now;
    print "\n";
    print $then;

Replies are listed 'Best First'.
Re: Faster Method for Gathering Data
by Abigail-II (Bishop) on Jul 31, 2003 at 12:07 UTC
    And how fast is the equivalent find command? Do something like:
    $ time find dir1 dir2 dir3 -name '*.sgml' > /dev/null

    If that also takes hours, the problem isn't at your Perl program.

    Abigail

      If the requester isn't on Unix, would wrapping the appropriate system("") call with some code to store the start and finish time be useful?

      Maybe the Benchmark module?

      I don't know, just wanted to see if such a strategy might be worthwhile.
Re: Faster Method for Gathering Data
by BrowserUk (Patriarch) on Jul 31, 2003 at 13:23 UTC

    How long does the same search take from the command line? Time both:

    dir /s \\remotemachine\....\*.sgml
    attrib /s \\remotemachine\....\*.sgml

    If either of these is substantially faster than File::Find, then there may be ways of speeding things up. If not, then it would seem that you have a very slow link somewhere between you and the network drive.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Re: Faster Method for Gathering Data
by cianoz (Friar) on Jul 31, 2003 at 12:14 UTC
    as far as i can tell there's nothing wrong with your code
    (well, except for a missing "use strict"...)
    (i tested it by searching for /\.c$/ in /usr/src/linux and it took just under 1 second for more than 4000 files on an average machine..)
    try comparing it with the unix find command (if you are on unix)
    also, if you are using a slow terminal, it could help to remove the print statement or redirect its output to a file.
      Sorry should have been more specific. I am on a Windows system, checking the files across a Win2000 server network drive.

      I guess that impacts it.

      The print command is in there to show that it is actually working and not frozen. I need the array for later use to open the files and do some reporting based on the elements in the SGML.

      Thanks TONS for verifying that at least it might not be me.
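
      One way to keep the "it is not frozen" feedback without printing every path is to report progress only every so often. The following is a minimal sketch, not the original poster's code: collect_sgml() and its $every parameter are names made up for illustration, and the interval of 1000 is an arbitrary choice.

```perl
use strict;
use warnings;
use File::Find;

# Collect matching paths, reporting progress only every $every files.
# collect_sgml() and $every are illustrative names, not part of File::Find.
sub collect_sgml {
    my ($every, @dirs) = @_;
    my @files;
    find(
        sub {
            return unless /\.sgml$/;
            push @files, $File::Find::name;
            # Progress goes to STDERR so STDOUT stays clean for redirection.
            print STDERR scalar(@files), " files found so far\n"
                unless @files % $every;
        },
        @dirs,
    );
    return @files;
}

if (@ARGV) {
    my @files = collect_sgml(1000, @ARGV);
    print "There are ", scalar(@files), " files here.\n";
}
```

      This will not make a slow network link faster; it only removes the cost of one terminal write per file while still showing that the scan is alive.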

        It might be useful to consider if you can deal with the files as they are found in the filesystem. Often, programmers don't consider the option of handling things as they come through, instead feeling that they have to work through a sorted list. The way you can tell is if you don't care what order your datasources come in and if you don't need them again once you've gotten what you need.

        This sounds like a situation where a stream-style approach could work. Why not do something like the following:

        use IO::File;

        # "or" binds more loosely than "||", so the die can actually fire
        open FINDER, "find . -type f -print |" or die "Couldn't issue find command: $!\n";
        my %SGML_Reporting_Stuff;
        while (<FINDER>) {
            chomp;    # strip the trailing newline, or the open below fails
            my $fh = IO::File->new($_) or die "Cannot open '$_' for reading: $!\n";
            # Do stuff to populate %SGML_Reporting_Stuff
            $fh->close;
        }
        close FINDER;
        # Use %SGML_Reporting_Stuff here.

        I used a Unix command, but you could replace it with the appropriate Windows command and it should work. This isn't necessarily going to give you a huge boost in speed, but it will reduce your memory requirements, which often translates into a 5%-15% speed improvement. In your case, where you're taking 5+ hours, that can be as much as 45 minutes, or more.
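
        The same stream idea can be kept inside File::Find, which avoids depending on an external command on Windows. This is only a sketch under assumptions: scan_tree() is a hypothetical helper, and the line count it stores is a placeholder for whatever the real SGML reporting needs.

```perl
use strict;
use warnings;
use File::Find;

# scan_tree() is a hypothetical helper: it visits each .sgml file once,
# extracts what it needs, and never builds a list of all paths.
sub scan_tree {
    my @dirs = @_;
    my %report;
    find(
        sub {
            # Cheap name test first, then the file test.
            return unless /\.sgml$/ && -f $_;
            # Inside the callback, $_ is the name relative to the current
            # directory, because File::Find chdir()s by default.
            open my $fh, '<', $_ or do {
                warn "Cannot open '$File::Find::name': $!\n";
                return;
            };
            my $lines = 0;
            while (my $line = <$fh>) { $lines++ }
            close $fh;
            # Placeholder statistic; real reporting would parse the SGML here.
            $report{$File::Find::name} = $lines;
        },
        @dirs,
    );
    return %report;
}

if (@ARGV) {
    my %report = scan_tree(@ARGV);
    print "$_: $report{$_} lines\n" for sort keys %report;
}
```

        Each file is opened while the directory walk is still in progress, so only the summary hash stays in memory rather than all 33,000 paths.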

        Now, of course, if you need to read file A before reading files B and C, this won't work as well. You could still do something similar by keeping a second hash which says "I can't process these filenames until I have processed that filename". Once you hit "that filename", you process the ones that you had to hold off on. If you were to go this route, I would create a process_file() subroutine to do your actual processing.
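
        The "second hash" idea might look roughly like this. It is a sketch only: handle(), %waiting, %done and the file names are invented for illustration, and how you learn that one file depends on another is left open.

```perl
use strict;
use warnings;

my %done;      # file name => 1 once it has been processed
my %waiting;   # prerequisite name => [ files held back until it is done ]
my @order;     # processing order, recorded for illustration only

# process_file() stands in for the real per-file work.
sub process_file {
    my ($file) = @_;
    push @order, $file;
    $done{$file} = 1;
    # Release every file that was waiting on this one.
    my $held = delete($waiting{$file}) || [];
    process_file($_) for @$held;
}

# handle() is called for each name as it arrives; $needs is the optional
# prerequisite that must be processed first.
sub handle {
    my ($file, $needs) = @_;
    if (defined $needs && !$done{$needs}) {
        push @{ $waiting{$needs} }, $file;
    }
    else {
        process_file($file);
    }
}

handle('B.sgml', 'A.sgml');   # held back
handle('C.sgml', 'A.sgml');   # held back
handle('A.sgml');             # processed, then B and C are released
print "@order\n";             # prints "A.sgml B.sgml C.sgml"
```

        Anything still left in %waiting at the end of the walk is a file whose prerequisite never showed up, which is worth reporting.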

        ------
        We are the carpenters and bricklayers of the Information Age.

        The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

        Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

        I did a bit more digging, and thought this might help...
        You could use the following code (straight from the Benchmark docs) to reassure yourself that the networked access is the bottleneck.
        use Benchmark;
        $t0 = new Benchmark;
        # ... your code here ...
        # system("dir", "/s", "path_to_root_sgml_dir\\*.sgml");
        $t1 = new Benchmark;
        $td = timediff($t1, $t0);
        print "the code took:", timestr($td), "\n";
        Oh, and welcome to the monastery!
Re: Faster Method for Gathering Data
by crouchingpenguin (Priest) on Jul 31, 2003 at 13:18 UTC
Re: Faster Method for Gathering Data
by dga (Hermit) on Jul 31, 2003 at 17:27 UTC

    Another possibility, which may not apply in your situation, is to run a straight recursive directory listing into a text file and then write a perl script to parse that.

    The fastest, of course, would be to run the listing on the remote machine and then transfer the listing file to the local machine.

    Second fastest might be to do the listing over the network, save the output locally, and run a parsing script on that. Of course, if an over-the-network directory listing takes 5 hours to complete, you don't save a lot of time.

    use strict;
    my @files;
    while (<>) {
        chomp;    # drop the newline so the path is usable later
        push(@files, $_) if /\.sgml$/;
    }
Re: Faster Method for Gathering Data
by Cine (Friar) on Jul 31, 2003 at 12:52 UTC
    Your problem is most likely not really related to perl. It is a filesystem thing, where lookups are made in linear time with regard to the number of files in a directory.
    You should look into the htree option of ext{2,3}. Google will help you there ;)

    T I M T O W T D I
      Howdy!

      Your problem is most likely not really related to perl, it is a filesystem thing, where lookups are made in linear time with regards to the number of files in a directory

      I've run into what I think is similar behavior. I have some CDs that have something like 11,000 files on them, all in a single directory. On a Windows or MacOS 9 box, I saw excruciatingly slow access times for files down in the list. The first few hundred were plenty zippy, but the farther I got into the list, the slower the access.

      Doing the same access on a Solaris box or MacOSX yielded pleasantly surprising results. File lookups were more like constant time instead of proportional to how far into the list the name was.

      I suspect that the problem is exacerbated by using a "slow" medium, like CD-ROM or network volumes.

      yours,
      Michael

        Network, yes. CD-ROM, no. The difference is that the metadata on the CD-ROM can be cached, whereas the network drive has to recheck it.

        T I M T O W T D I
