RE (tilly) 2: List non-matching files

by tilly (Archbishop)
on Aug 20, 2000 at 00:17 UTC [id://28670]


in reply to RE: List non-matching files
in thread List non-matching files

In general, pure Perl solutions tend to be faster than find for all of the reasons that Perl usually beats shell scripting: you don't have to keep launching new processes. In this case it comes down to launching one rm and passing it a lot of filenames versus launching an rm per file. Guess which I think is faster?
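
To make that concrete, here is a rough sketch of the two approaches (the *.bak pattern is invented for illustration):

    #!/usr/bin/perl
    # Sketch only: delete every *.bak file in the current directory.
    opendir my $dh, "." or die "opendir: $!";
    my @victims = grep { -f $_ and /\.bak$/ } readdir $dh;
    closedir $dh;

    # One process total: Perl itself calls unlink(2) on each name.
    unlink @victims;

    # One process per file: every iteration forks and execs /bin/rm.
    # system("rm", "-f", $_) for @victims;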

However, find has one huge advantage: it is one of the few ways to get around the shell's limit on command-line length when you have to handle a very large number of files. The nom script given doesn't do that.
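
Perl can get around the same limit, for what it's worth. A sketch along these lines (File::Find is in the core; the *.html pattern is just this thread's example) never builds a command line at all:

    #!/usr/bin/perl
    # Walk the whole tree without ever building a shell command line,
    # so the argument-length limit never comes into play.
    use File::Find;
    find(sub { unlink $_ if -f $_ and not /\.html$/ }, ".");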

A second advantage is that while find has a more complex API, it is also more flexible... :-)

I agree, honest.
by gryng (Hermit) on Aug 20, 2000 at 22:34 UTC
    find . -maxdepth 1 -type f -not -name '*.html' -print | xargs rm -f

    The above only launches three processes (well, a few more if xargs decides there are too many files for one command line), and since the job is I/O bound, I doubt a Perl-based solution would be significantly faster (personally, my wager is that it would be slower).
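
    For the curious, the batching xargs does can be imitated in a few lines of Perl; the 128K ceiling below is an arbitrary stand-in for the real system limit:

    #!/usr/bin/perl
    # Rough imitation of xargs: run rm once per batch of names whose
    # combined length stays under an arbitrary 128K ceiling.
    my @files = grep { -f $_ and not /\.html$/ } glob("*");
    my @batch;
    my $len = 0;
    for my $name (@files) {
        if (@batch and $len + length($name) + 1 > 131072) {
            system("rm", "-f", @batch) == 0 or warn "rm failed: $?";
            @batch = ();
            $len   = 0;
        }
        push @batch, $name;
        $len += length($name) + 1;
    }
    system("rm", "-f", @batch) if @batch;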

    However, I agree that shell scripting is generally slower than Perl, because of process creation. But I don't think this case counts.

    Ciao,
    Gryn :)

      And it freaks badly if you have any filenames with whitespace in them, especially a newline. This same thing in Perl works just fine in one process:
      #!/usr/bin/perl
      opendir DOT, ".";
      unlink grep { -f and not /\.html$/ } readdir DOT;
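
      A quick throwaway test of the newline claim (the filename is invented):

      #!/usr/bin/perl
      # Create a file whose name contains a newline, then remove it with
      # readdir/unlink; no shell ever parses the name, so nothing freaks.
      my $evil = "bad\nname.txt";
      open my $fh, ">", $evil or die "create: $!";
      close $fh;
      opendir DOT, "." or die "opendir: $!";
      unlink grep { /\n/ } readdir DOT;
      print((-e $evil) ? "still there\n" : "gone\n");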

      -- Randal L. Schwartz, Perl hacker

        Very true :), kudos.
      In case you care:

      find . -maxdepth 1 -type f -not -name '*.html' -exec rm '{}' \;

      Does the same job with only one process.

      For me, performance doesn't matter that much. I usually deal with 100 to 2000 files at a time, and the running time isn't much different.

      One should measure the total time from the split second your brain decides what it wants to do until you see the next command prompt :)

      This is why it's useful to have simple building blocks (with short names :) that do the job.

        Sorry, not true. From a manpage for find:
        -exec command ;
            Execute command; true if 0 status is returned. All following arguments to find are taken to be arguments to the command until an argument consisting of `;' is encountered. The string `{}' is replaced by the current file name being processed everywhere it occurs in the arguments to the command, not just in arguments where it is alone, as in some versions of find. Both of these constructions might need to be escaped (with a `\') or quoted to protect them from expansion by the shell. The command is executed in the starting directory.
        Every time find reaches the -exec it launches a new process. Your version actually launches a separate instance of /bin/rm per file processed! (Good thing *nix optimizes process creation!)

        But for one-off jobs, you are right. How long it takes you to remember how to do it probably matters more than any details about how much work it is for the computer. (For mass deletes I usually write a short Perl script rather than look at find just because I know Perl very well. YMMV.)
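
        One possible shape for such a throwaway script (the *.html pattern is just this thread's example); a per-file loop also reports exactly which names could not be removed, which a single long unlink list would not:

        #!/usr/bin/perl
        # Throwaway mass delete: keep *.html, remove every other plain
        # file, and complain about any name that will not go away.
        opendir my $dh, "." or die "opendir: $!";
        for my $name (readdir $dh) {
            next unless -f $name;
            next if $name =~ /\.html$/;
            unlink $name or warn "could not unlink $name: $!\n";
        }
        closedir $dh;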

        But these performance considerations matter a lot for jobs that will be run repeatedly...
