PerlMonks  

Re: sizeDateValidator.pl is horribly slow

by graff (Chancellor)
on Nov 06, 2011 at 03:06 UTC ( [id://936224]=note )


in reply to sizeDateValidator.pl is horribly slow

In the first try, you call stat several times on each file, and each of those calls is a separate system call that goes back to the file system for information you already had. Call stat once per file, and keep the list it returns for all your later actions.
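To make the "call stat once" idea concrete, here is a minimal sketch. The helper name size_and_mtime and the throwaway temp file are my own, only there so the snippet runs anywhere; they are not from your script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the original script): one stat call per
# file, returning only the two fields the report needs.
sub size_and_mtime {
    my ($path) = @_;
    my @st = stat $path;          # single system call
    return unless @st;            # empty list means stat failed
    return ( $st[7], $st[9] );    # size in bytes, mtime in epoch seconds
}

# Demo on a throwaway file so the sketch does not depend on your data.
my ( $fh, $tmp ) = tempfile( UNLINK => 1 );
binmode $fh;
print {$fh} "hello\n";
close $fh;

my ( $size, $mtime ) = size_and_mtime($tmp);
print "$tmp: $size bytes, modified ", scalar localtime($mtime), "\n";
```

Everything else you want to know about the file (mode, owner, and so on) is sitting in the same list, so there is no reason to ask again.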

As for how long it should take to scan 20,000 files, what sort of time span are you expecting, and what sort of evidence (what sorts of processes) leads you to expect that?
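If you want a number to compare against, you could time the stat calls by themselves. This is only a rough sketch: time_stats is a name I made up, and hitting the same file 20,000 times will be flattered by OS caching, so feed it your real list for a meaningful figure.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: stat every name once and return elapsed seconds.
sub time_stats {
    my @names = @_;
    my $start = Time::HiRes::time();
    for my $name (@names) {
        my @st = stat $name;    # result unused; only the call is timed
    }
    return Time::HiRes::time() - $start;
}

# Stand-in list: the script's own name repeated. Replace with real names.
my @names = ($0) x 20_000;
printf "%d stat calls took %.3f seconds\n", scalar @names, time_stats(@names);
```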

There are some other trivial oddities in your first script -- I expect they don't affect the timing much (if at all), but they detract from the overall coherence of the code. Oh, and consistent indenting is useful...

Here's how I would do it:

    use POSIX;

    # Get argv handling out of the way first...
    if ( @ARGV != 3 or ! -f $ARGV[0] ) {
        die "Usage: perl $0 FileListToValidate OutFile StatusFile\n";
    }

    # Next take care of all the i/o file handling...
    if ( -e $ARGV[2] ) {
        die "$ARGV[2] already exists -- I will not overwrite it\n";
    }
    open( STAT, '>', $ARGV[2] )
        or die "Can't write status info to $ARGV[2]: $!\n";
    if ( ! open( OUT, '>', $ARGV[1] ) ) {
        print STAT "error: can't write output to $ARGV[1]: $!\n";
        exit;
    }
    if ( ! open( IN, '<', $ARGV[0] ) ) {
        print STAT "error: can't open $ARGV[0] for input: $!\n";
        exit;
    }

    # Now get to work...
    my @inpList = <IN>;
    chomp @inpList;
    for ( @inpList ) {      # let $_ hold the file name
        tr/"//d;            # get rid of double-quotes
        my @stats = stat;   # do this just once (works on $_ by default)
        if ( ! @stats ) {   # empty list means stat failed
            print OUT join( '|', $_, ( 'notfound' ) x 2 ), "\n";
        }
        else {
            # $stats[7] is the size in bytes, $stats[9] is the mtime
            print OUT join( '|', $_, $stats[7],
                POSIX::strftime( "%m/%d/%Y %I:%M %p", localtime( $stats[9] ) ) ), "\n";
        }
    }
    print STAT "success\n";

That eliminates a lot of useless variable creations and value assignments, but I think reducing the multiple stat calls per file to just one will be the thing that has a noticeable effect.

Personally, I'd go with just two command line args -- printing error messages (and even a "success" message) to stderr should suffice, so you just need the input list and the name to use for the output list (and you eliminate two possible causes of failure).
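For what it's worth, the two-argument variant might look roughly like this. It is a sketch, not a drop-in replacement: the sub name run and its 0/1 return convention are my own choices, and warn is used simply because it writes to stderr.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical wrapper: returns 0 on success, 1 on failure, and reports
# problems with warn (which writes to STDERR) instead of a status file.
sub run {
    my ( $list, $out ) = @_;
    unless ( defined $list and defined $out and -f $list ) {
        warn "Usage: perl $0 FileListToValidate OutFile\n";
        return 1;
    }
    open( my $in_fh, '<', $list )
        or do { warn "Can't read $list: $!\n"; return 1 };
    open( my $out_fh, '>', $out )
        or do { warn "Can't write $out: $!\n"; return 1 };

    while ( my $name = <$in_fh> ) {
        chomp $name;
        $name =~ tr/"//d;                 # get rid of double-quotes
        my @stats  = stat $name;          # once per file
        my @fields = @stats
            ? ( $stats[7],
                POSIX::strftime( "%m/%d/%Y %I:%M %p", localtime( $stats[9] ) ) )
            : ( 'notfound' ) x 2;
        print {$out_fh} join( '|', $name, @fields ), "\n";
    }
    close $out_fh or do { warn "Error closing $out: $!\n"; return 1 };
    warn "success\n";
    return 0;
}

# Only act when arguments were supplied, so the sketch is safe to load.
exit run(@ARGV) if @ARGV;
```

With the status file gone, the only files that can fail to open are the input list and the output list.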

As for the second try, processing the output of some other command is bound to take longer (and can cause more trouble). Don't do that when a perl internal function can do the same thing.
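As an illustration of staying inside Perl, here is a sketch that collects file sizes for a directory with opendir, readdir and stat, and no child process. I'm assuming a directory-listing kind of task purely for the example; sizes_in_dir is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: map each plain file in $dir to its size using only
# built-ins, with no external command to launch or parse.
sub sizes_in_dir {
    my ($dir) = @_;
    opendir( my $dh, $dir ) or die "Can't read directory $dir: $!\n";
    my %size;
    for my $name ( readdir $dh ) {
        next unless -f "$dir/$name";       # this file test does the stat
        $size{$name} = ( stat _ )[7];      # "_" reuses that stat result
    }
    closedir $dh;
    return \%size;
}

my $sizes = sizes_in_dir('.');
printf "%-30s %d\n", $_, $sizes->{$_} for sort keys %$sizes;
```

The special `_` filehandle hands back the result of the most recent stat or file test, so the size lookup costs no extra system call.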
