Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Efficient deletion of files / shell interaction

by madd (Acolyte)
on Jul 19, 2009 at 16:02 UTC ( [id://781455]=perlquestion: print w/replies, xml ) Need Help??

madd has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a directory with several hundred sets of files (an input, data and output file). The files all begin with a given prefix and have a 4 digit sequential number. i.e.
Kick0000.inp Kick0000.data Kick0000.log
are the input, data and output files respectively. I wish to delete the data files of jobs that did not terminate normally (i.e. jobs where the output does not end in "Normal termination"). The following script, using backticks works, but I'm not sure I've done it in the best way. Mainly, I'm concerned about using backticks twice each iteration. I'd appreciate any advice on whether the backticks really are a problem, and if so, how I could better write this script. Thanks in advance, Madd
#!/usr/bin/perl -s $prefix="Kick"; $Restart="Restart.data"; open (RESTART,"$Restart") or die "Unable to open $Restart for reading\ +n"; $AlreadyDone = <RESTART>; $jobs_run= sprintf '%04d',($AlreadyDone); for (my $j=0;$j<=$jobs_run;$j++) { $job_no=sprintf '%04d',($j); $job_title="$prefix$job_no"; $job_end=`tail -n 1 $job_title.log`; if($job_end !~ /Normal/) { `rm $job_title.data`; } }

Replies are listed 'Best First'.
Re: Efficient deletion of files / shell interaction
by Corion (Patriarch) on Jul 19, 2009 at 16:06 UTC

    Instead of rm, do you maybe want the unlink function? Note that this function won't ask/warn you if you're deleting files that don't have the appropriate attributes set.

    Also, you might or might not want to read the files yourself instead of using tail, by using seek and the diamond operator.

      Thanks, unlink works fine. I'll benchmark both approaches for my own edification later. However I'm not sure how to make the best use of seek and readline given that the log files are of varying lengths (~100kB - 10MB) and I'm always after just the last line. It seems a bit too much work to open, read and close several hundred files.

        In addition to Corion's advice there is File::ReadBackwards. I don't know if it would be faster than seek but it would probably be easier to code up since it's line instead of block-size oriented.

        seek allows you to seek to (just before) the end of the file and to read the last (say) 1024 bytes of the file, and then look there for what you're searching for. Which is about what tail does, too.

        The trick about using seek and read is that in order to seek to a position from the end of a file, you have to specify a negative number for the offset amount. For example, if the log files for successful runs always have the phrase "Normal termination\n" as the very last thing in each file, that's just 19 bytes you need to read from the end -- but let's pad that a bit, just to be safe:
        #!/usr/bin/perl use strict; my $prefix="Kick"; my $Restart="Restart.data"; open( RESTART, $Restart ) or die "Unable to read $Restart: $!\n"; my $AlreadyDone = <RESTART>; my ( $jobs_run ) = ( $AlreadyDone =~ /(\d+)/ ); for ( my $j=0; $j<=$jobs_run; $j++) { my $job_title = sprintf( "%s%04d", $prefix, $j ); if ( open( my $fh, "<", "$job_title.log" )) { seek( $fh, -24, 2 ); read( $fh, my $job_end, 24 ); unlink "$job_title.data" unless ($job_end =~ /Normal/); } else { warn "Unable to read $job_title.log: $!\n"; } }

        Some miscellaneous notes:

        • I didn't see any clear rationale for using the "-s" option on the shebang line. If you have a reason for that in your "production" script, it's fine, but it seemed unnecessary here.
        • Your method of getting a numeric value from the "Restart.data" file was strange. I think a regex match for the numeric is better/safer.
        • When you decide to report an error message regarding a failed open() call, including "$!" in the message can be very helpful.
        • Overall, I think the overhead of opening and closing data files in perl will be less (given that you are only reading the last couple dozen bytes from each file), than the OS overhead of creating and tearing down 100's of subshells for running "tail", just as the use of perl's unlink function is fairly certain to be more efficient than a series of subshells that invoke the unix "rm" command.
        • Notice how easy it was to add use strict

        (updated to add a missing close-paren -- forgot to do "perl -cw" before hitting "create")

Re: Efficient deletion of files / shell interaction
by kyle (Abbot) on Jul 19, 2009 at 17:55 UTC

    I might write this, if I didn't care if anyone can read it.

    my @job_nums = 0 .. $jobs_run; my $logs = join q{ }, map { sprintf "$prefix%04d.log" } @job_nums; my $tail_rx = qr{ ==> (.+?) <== \n ( [^\n]* ) \n }xms; my @tail_pairs = `tail -n 1 $logs` =~ $tail_rx; unlink map { s/\.log$/.data/; $_ } map { $tail_pairs[$_*2] } grep { $tail_pairs[1+$_*2] !~ /Normal/ } @job_nums;

    Note that I haven't tried this, and given that it calls unlink, one would do well to test it with a copy of real data before trying it with real data.

    This has the advantages that it shells out to run tail only once on the list of every log file and that it calls unlink only once with the list of the files to delete. On the other hand, the heavy use of map and grep means it does more looping than is absolutely necessary.

Re: Efficient deletion of files / shell interaction
by JavaFan (Canon) on Jul 19, 2009 at 18:58 UTC
    tail -1 ... list of file ... |\ perl -nE 'given($_) {when (/==> (.*) <==/) {$file = $1} when (/^\s*$/} + {;} when {/Normal/} default {unlink $file or die}'
    This assumes you don't have files ending in '==> foo <=='.
Re: Efficient deletion of files / shell interaction
by psini (Deacon) on Jul 19, 2009 at 16:09 UTC

    unlink?

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://781455]
Approved by kyle
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (4)
As of 2024-04-19 15:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found