Large scale search and replace with perl -i

by elbie (Curate)
on Apr 14, 2003 at 18:35 UTC ( [id://250369] )

elbie has asked for the wisdom of the Perl Monks concerning the following question:

I have a large number of files that I need to do a search and replace on. At this point, I'm resigned to using the following:

find . -name '*.html' -type f -exec perl -pi -e 's/foo/bar/g;' \{\} \;

Which has two problems that I can think of:

  1. Every file acted upon spawns a new process through find's -exec option. I was thinking instead to try:

    perl -pi -e 's/foo/bar/g;' `find . -name '*.html' -type f`

    But I've had problems with that in the past when find returns a very large list.

  2. Files that do not contain the match get operated on anyway. There's a lot of extra overhead here as well, and timestamps get changed to boot. Again, I could try:

    perl -pi -e 's/foo/bar/g;' `find . -name '*.html' -type f -exec grep -l foo \{\} \;`

    But I still have all the same problems as with the first item above.

Is there another way to use perl -i on a directory recursively so that only files matching a certain criteria are updated?

elbie

Replies are listed 'Best First'.
Re: Large scale search and replace with perl -i
by jasonk (Parson) on Apr 14, 2003 at 18:52 UTC

    You can combine the two approaches using xargs:

    find . -name '*.html' -type f -print0 | \
        xargs -0 -n 50 perl -pi -e 's/foo/bar/g'

    This will use find to list all the files you want, and xargs to pass them to your perl script. By specifying the -n 50 option to xargs, each invocation of perl will be passed a maximum of 50 filenames to process (if you still get too many arguments because your paths are really long, lower the number). I haven't benchmarked it to make sure, but I suspect that under most circumstances the overhead of using grep first to find the files that contain the thing you want to replace will actually be less efficient than just running the replacement on every file you find.
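
    A quick way to see the batching behaviour of -n, with a hypothetical toy list standing in for the real file names:

        $ printf '%s\n' a.html b.html c.html d.html e.html | xargs -n 2 echo
        a.html b.html
        c.html d.html
        e.html

    Each echo above stands in for one perl invocation; with -n 50, each perl process receives at most 50 filenames.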


    We're not surrounded, we're in a target-rich environment!
      If your xargs is any good, you don't have to use the -n option. xargs will know the limits of your OS, and create argument lists that neither have too many arguments, nor let the flattened argument list exceed your OS's limit.
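
      For the curious, on a POSIX system you can ask what that limit is (the figure below is only an example; xargs also keeps some headroom for the environment):

          $ getconf ARG_MAX
          2097152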

      Abigail

•Re: Large scale search and replace with perl -i
by merlyn (Sage) on Apr 14, 2003 at 19:22 UTC
    Semi-untested:
    use File::Find;
    @ARGV = ();
    find sub {
        push @ARGV, $File::Find::name if -f and /\.html$/;
    }, ".";
    {
        local $^I = ".bak";
        local $/;
        while (<>) {
            if (s/foo/bar/g) {    # changes?
                print;            # print the new one
            } else {              # no changes? back it out!
                close ARGVOUT;    # for windows, not needed on Unix
                rename "$ARGV$^I", $ARGV
                    or warn "Cannot rename for $ARGV$^I: $!";
            }
        }
    }

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      One small caveat with this neat technique (that I just got bitten by) is that if $^I is set to a wildcard (e.g. *.bak or orig_*) so that the filename of the backup is edited rather than simply appended to, the rename will fail.

      I'll hazard a guess as to your response to this:

      <merlyn>Don't do that then. {grin}</merlyn>

      but I thought it was worth a mention here :)
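
      For illustration, a minimal sketch of why the rename fails, assuming the wildcard behaviour of $^I documented in perlrun (the * is replaced by the current filename):

          # plain suffix: backup name is the original plus the suffix
          local $^I = '.bak';     # foo.html is backed up as foo.html.bak
                                  # "$ARGV$^I" eq 'foo.html.bak' -- rename works

          # wildcard: the * is replaced by the original filename
          local $^I = 'orig_*';   # foo.html is backed up as orig_foo.html
                                  # but "$ARGV$^I" eq 'foo.htmlorig_*' -- rename fails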


      Examine what is said, not who speaks.
      1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
      2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
      3) Any sufficiently advanced technology is indistinguishable from magic.
      Arthur C. Clarke.
        But, that star there doesn't do anything. That would create files named "foo*.bak" and "bar*.bak" from "foo" and "bar". And thus, the rename would undo it just fine.

        Unless you're talking about some local hack to your Perl to make it interpret $^I differently.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

        update Oh my goodness. A new feature was snuck into Perl in 5.6, and documented in perlrun but not perlvar.

        My apologies. Wow, I'll have to write a column about it now to remember it. {grin} And I don't recall it in the perldelta from 5.5 to 5.6, or perhaps I considered it un-noteworthy. Yeah, just checked, not in perldelta. No wonder I hadn't noticed it.

        update 2 On further research, 5.4 didn't have the feature, but 5.5 did. And yet it wasn't in 5.5's perldelta. That's why I missed it. I don't always diff the entire manpage set. {sigh} I rely on perldelta.

        update 3 See "Put your inplace-edit backup files into a subdir".
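
        For reference, perlrun shows that form too; a minimal sketch (assuming the bak directory already exists -- Perl won't create it):

            $^I = 'bak/*';              # foo.html is backed up as bak/foo.html
            @ARGV = glob '*.html';
            while (<>) { s/foo/bar/g; print }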

Re: Large scale search and replace with perl -i
by antifun (Sexton) on Apr 14, 2003 at 19:23 UTC

    First question: how many is a "large number"? If it's on the order of 10^4 or less, you will probably spend more time fiddling with a script than it would take to do with a more "brute-force" approach. (Given a reasonably fast computer, yadda yadda yadda.)

    As for the more theoretical question, you would certainly want to use the second approach (with find -exec grep -l foo) to reduce your working file set as much as possible.

    Then your next issue is avoiding the overhead of running multiple perls. The -i switch relies on the magic of <>, which reads the files named in @ARGV if there are command-line arguments, and STDIN if there are not (paraphrasing slightly). However, what you need to do in this case is use both kinds of magic, so your perl will have to be a little more creative. It's harder to do the shuffle that -i does than to read from STDIN manually, so here's one way to try it:

    find . -name "*.html" -type f -exec grep -l foo {} \; | perl -pi -e 'B +EGIN{ @ARGV = <STDIN>; chomp @ARGV }; while (<>) { s/foo/bar/g; } co +ntinue { print }'

    Notice that you can fiddle with @ARGV before the <> magic takes place. The internals of the script are basically what the -p option does.
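
    For reference, perlrun describes the -p switch as wrapping the program in roughly this loop, which is what the one-liner above spells out by hand:

        LINE:
        while (<>) {
            # your program text goes here
        } continue {
            print or die "-p destination: $!\n";
        }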


    ---
    "I hate it when I think myself into a corner."
    Matt Mitchell
      you would certainly want to use the second approach (with find -exec grep -l foo) to reduce your working file set as much as possible.

      You would certainly not, because you will have to open all the files anyway - even if just to check. The difference is that grepping for matches first makes you spawn one process per file, as well as open the matching files a second time (in Perl) to actually process them. That's a (large) net loss.

      Taking that out, and using the -print0 option to avoid some nasty surprises (but not all, unfortunately, due to the darn magic open), leaves us with the following. Note I have removed the continue {} block as it isn't necessary and just costs time. I'm also setting the record separator so that the diamond operator reads fixed-size blocks (64 kbytes in this example) rather than scanning for some end-of-line character (but see the update below).

      find . -name "*.html" -type f -print0 | \ perl -i -p0e \ 'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; $/ = "\n" }; \ while (<>) { s/foo/bar/g; print }'

      That should be about as efficient as it gets.

      If you have a lot of nonmatching files, you might save work by hooking a grep in there - but not with find's -exec. That's what xargs was invented for.

      find . -name "*.html" -type f -print0 | \ xargs -r0 grep -l0 | \ perl -i -p0e \ 'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; $/ = "\n" }; \ while (<>) { s/foo/bar/g; print }'
      Update: changed $/ = \65536 to $/ = "\n", as per runrig's observation.

      Makeshifts last the longest.

        find . -name "*.html" -type f -print0 | perl -i -p0e \ 'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; $/ = \65536 }; \ while (<>) { s/foo/bar/g; print }'
        You don't want to do that. If 'foo' spans across one of those read blocks, then you'll miss the substitution.
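
        A tiny demonstration of the miss, using a 4-byte block size so it is easy to see (toy input, not the real data):

            $ printf 'xxxfoo' | perl -pe 'BEGIN { $/ = \4 } s/foo/bar/g'
            xxxfoo

        The reads come back as "xxxf" and "oo", so /foo/ never matches either chunk and nothing is replaced.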
Re: Large scale search and replace with perl -i
by BrowserUk (Patriarch) on Apr 15, 2003 at 02:05 UTC

    Given that most html files are usually (hopefully) < 1 MB in size, it would make sense to use Aristotle's technique of changing $/, but set it to null and slurp the whole file each time.

    find . -name "*.html" -type f -print0 | \ perl -i -p0e \ 'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; $/ = '' }; \ while (<>) { s/foo/bar/g; print }'

    If the number of files produced by find is too many for your command line to handle, couldn't you produce a list of directories from find and pass that into perl and then let perl glob those? Something like (NB:completely untested code)

    find . -type d -print0 | \
        perl -i -0e \
            'BEGIN {
                 @ARGV = <STDIN>;
                 chomp @ARGV;
                 @ARGV = map { glob "$_/*.html" } @ARGV;
                 $/ = undef;
             }
             while (<>) { s/foo/bar/g; print }'

    Combining that with Merlyn's trick of backing out the -i effect if nothing is found should save more time.
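
    A rough, untested sketch of that combination, in the spirit of the thread (it inherits glob's whitespace caveat, and the file names here are only examples):

        find . -type d -print0 | \
            perl -0 -e '
                BEGIN {
                    @ARGV = <STDIN>;                        # NUL-separated dirs from find
                    chomp @ARGV;
                    @ARGV = map { glob "$_/*.html" } @ARGV; # expand to the HTML files
                }
                $^I = ".bak";             # in-place edit, keeping a backup
                $/  = undef;              # slurp: one read per file
                while (<>) {
                    if (s/foo/bar/g) {    # changes? keep the new copy
                        print;
                    }
                    else {                # no changes? back out the edit
                        close ARGVOUT;    # for Windows, not needed on Unix
                        rename "$ARGV$^I", $ARGV
                            or warn "Cannot rename for $ARGV$^I: $!";
                    }
                }'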


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
Re: Large scale search and replace with perl -i
by neilwatson (Priest) on Apr 14, 2003 at 18:55 UTC
    Try this:

    find / -name "*.html" -exec perl -pi -e 's/find/replace/gi' {} \;

    Update: Oops, reread your question. Hmm, not sure about ignoring certain files. However, does filtering your find file list through grep really gain you any speed? You are having grep go through all your files and then having perl go through whatever files grep returns.

    Neil Watson
    watson-wilson.ca

      If there are many files that will not have a match, this might actually be faster, because you will save on IO writes. The perl -i will always write to a new (temporary) file, even if it turns out the content is the same - after all, Perl can't know there isn't a match. So, without the grep you will do more IO writes, and your OS will churn through its buffer cache twice as fast.

      It's hard to say whether a grep is worthwhile. Without knowing more about the content of the files, I won't dismiss it.

      Abigail

Re: Large scale search and replace with perl -i
by Improv (Pilgrim) on Apr 14, 2003 at 19:23 UTC
    One thing you might consider, given that you're willing to put the time into asking on Perlmonks, is using find2perl -- it should be a lot more efficient than actually using find.
      Is it? I'd like to see some benchmark. It's certainly not my impression, and I don't find it logical, find being a C program written to do exactly one task.

      Abigail

        The reason it should be, apart from it being suggested to be so in the find2perl manpage (hehe), is that process creation is a fairly expensive operation, and it usually is the case that all the spawnings of perl (or anything else that perl can easily duplicate in functionality) will slow down the entire operation enough that a single-process all-perl implementation will outpace it by a good margin. Of course, your mileage may vary.
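
        For reference, find2perl ships with Perl and accepts most of find's syntax, printing an equivalent File::Find-based script on stdout (the output file name below is only an example):

            $ find2perl . -name '*.html' -type f -print > fixhtml.pl
            $ perl fixhtml.pl

        The generated wanted() routine can then be edited to do the substitution directly instead of just printing each path.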
Re: Large scale search and replace with perl -i
by Jenda (Abbot) on Apr 15, 2003 at 11:51 UTC

    Just for reference: if you are using Windows and have G.pm installed, you can do it like this:

    perl -MG=R -pi.bak -e "s/foo/bar/g" *.html
    The =R tells G.pm to do the parameter globbing recursively.

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

