It might be worth considering whether you can deal with the files as they are found in the filesystem. Programmers often overlook the option of handling items as they stream through, assuming instead that they have to work through a sorted list. The test is simple: streaming works if you don't care what order your data sources arrive in, and if you don't need them again once you've extracted what you need.
This sounds like a situation where a stream-style approach could work well. Why not do something like the following:
use strict;
use warnings;
use IO::File;

# Note: 'or die' here, not '|| die' -- with '||' the die binds to the
# command string (which is always true), so the check never fires.
open my $finder, '-|', 'find . -type f -print'
    or die "Couldn't issue find command: $!\n";
my %SGML_Reporting_Stuff;
while ( my $filename = <$finder> )
{
    chomp $filename;    # strip the newline, or the open below will fail
    my $fh = IO::File->new( $filename, 'r' )
        or die "Cannot open '$filename' for reading: $!\n";
    # Do stuff to populate %SGML_Reporting_Stuff
    $fh->close;
}
close $finder;
# Use %SGML_Reporting_Stuff here.
I used a Unix command, but you could replace it with the appropriate Windows command (dir /b /s, for example) and it should work the same way. This won't necessarily give you a huge boost in speed, but it will reduce your memory requirements, which often translates into a 5%-15% speed improvement. In your case, where a run takes 5+ hours, that could be as much as 45 minutes.
Now, of course, if you need to read file A before reading files B and C, this won't work as well. You could still do something similar by keeping a second hash which says "I can't process these filenames until I have processed that filename". Once you hit "that filename", you process the ones you had to hold off on. If you go this route, I would create a process_file() subroutine to do the actual processing.
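A minimal sketch of that "hold off" idea, with the caveat that all the names here (%prereq_of, %deferred, handle_file, the .sgml filenames) are hypothetical placeholders, not anything from your actual code:

```perl
use strict;
use warnings;

# %prereq_of says which file must be processed first (placeholder data);
# %deferred maps a prerequisite to the files waiting on it.
my %prereq_of = ( 'B.sgml' => 'A.sgml', 'C.sgml' => 'A.sgml' );
my ( %deferred, %done );

sub process_file {
    my ($file) = @_;
    # ... the real per-file work would go here ...
    $done{$file} = 1;
    # Release anything that was waiting on this file.
    if ( my $waiters = delete $deferred{$file} ) {
        process_file($_) for @$waiters;
    }
}

# Call this for each filename as the stream produces it.
sub handle_file {
    my ($file) = @_;
    my $need = $prereq_of{$file};
    if ( defined $need && !$done{$need} ) {
        push @{ $deferred{$need} }, $file;    # hold off for now
    }
    else {
        process_file($file);
    }
}
```

The nice part is that the main loop stays a pure stream: each filename is either processed immediately or parked, and parked files get flushed the moment their prerequisite shows up.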
------
We are the carpenters and bricklayers of the Information Age.
The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
I did a bit more digging, and thought this might help...
You could use the following code (adapted from the Benchmark docs) to reassure yourself that the networked access is the bottleneck.
use Benchmark;

my $t0 = Benchmark->new;
# ... your code here ...
# system("dir", "/s", "path_to_root_sgml_dir\\*.sgml");
my $t1 = Benchmark->new;
my $td = timediff($t1, $t0);
print "the code took: ", timestr($td), "\n";
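If you want to go a step further, Benchmark's timethese() can run two approaches side by side, which makes the comparison direct. A sketch only: the paths and labels below are placeholders, not your real directories.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Compare listing a local copy against the network share.
# Both paths are hypothetical -- substitute your own.
my $results = timethese( 10, {
    local_read => sub { my @f = glob('/tmp/sgml_copy/*.sgml') },
    net_read   => sub { my @f = glob('//server/share/sgml/*.sgml') },
} );
```

timethese() prints a timing line per label and returns a hashref of Benchmark results, one entry per label, if you want to inspect them programmatically.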
Oh, and welcome to the monastery!