Re: Large scale search and replace with perl -i
by jasonk (Parson) on Apr 14, 2003 at 18:52 UTC
find . -name '*.html' -type f -print0 | \
xargs -0 -n 50 perl -pi -e 's/foo/bar/g'
This will use find to list all the files you want, and xargs to pass them to your perl one-liner. By specifying the -n 50 option to xargs, each invocation of perl will be passed a maximum of 50 filenames to process (if you still get too many arguments because your paths are really long, lower the number). I haven't benchmarked it to make sure, but I suspect that under most circumstances running grep first to find the files that contain the thing you want to replace will actually be less efficient than just running the replacement on every file you find.
We're not surrounded, we're in a target-rich environment!
If your xargs is any good, you don't have to use the -n option. xargs knows the limits of your OS and builds argument lists that neither contain too many arguments nor let the flattened argument list exceed your OS's limit.
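This is easy to observe with GNU xargs (a throwaway sketch; the million-token input exists only to overflow any single command line):

```shell
# ~6.9 MB of arguments cannot fit in one command line, so xargs, with
# no -n at all, splits them across several echo invocations on its own.
seq 1 1000000 | xargs echo | wc -l
```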
Abigail
•Re: Large scale search and replace with perl -i
by merlyn (Sage) on Apr 14, 2003 at 19:22 UTC
use File::Find;
@ARGV = ();
find sub {
    push @ARGV, $File::Find::name if -f and /\.html$/;
}, ".";
{
    local $^I = ".bak";
    local $/;
    while (<>) {
        if (s/foo/bar/g) {    # changes?
            print;            # print the new one
        } else {              # no changes? back it out!
            close ARGVOUT;    # for Windows, not needed on Unix
            rename "$ARGV$^I", $ARGV or warn "Cannot rename for $ARGV$^I: $!";
        }
    }
}
-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
One small caveat with this neat technique (which I just got bitten by): if $^I is set to a wildcard (e.g. *.bak or orig_*), so that the filename of the backup is edited rather than a suffix simply appended, the rename will fail.
I'll hazard a guess as to your response to this:
<merlyn>Don't do that then. {grin}</merlyn>
but I thought it was worth a mention here :)
Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.
But, that star there doesn't do anything. That would create files named "foo*.bak" and "bar*.bak" from "foo" and "bar". And thus, the rename would undo it just fine.
Unless you're talking about some local hack to your Perl to make it interpret $^I differently.
-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
update
Oh my goodness. A new feature was snuck into Perl in 5.6, and documented in perlrun but not perlvar.
My apologies. Wow, I'll have to write a column about it now to remember it. {grin} And I don't recall it in the perldelta from 5.5 to 5.6, or perhaps I considered it un-noteworthy. Yeah, just checked, not in perldelta. No wonder I hadn't noticed it.
update 2
On further research, 5.4 didn't have the feature, but 5.5 did. And yet it wasn't in 5.5's perldelta. That's why I missed it. I don't always diff the entire manpage set. {sigh} I rely on perldelta.
update 3
See "Put your inplace-edit backup files into a subdir".
Re: Large scale search and replace with perl -i
by antifun (Sexton) on Apr 14, 2003 at 19:23 UTC
First question: how many is a "large number"? If it's on the order of 10^4 or less, you will probably spend more time fiddling with a script than it would take to do with a more "brute-force" approach. (Given a reasonably fast computer, yadda yadda yadda.)
As for the more theoretical question, you would certainly want to use the second approach (with find -exec grep -l foo) to reduce your working file set as much as possible.
Then your next issue is avoiding the overhead of running multiple perls. The -i switch relies on the magic of <>, which reads the files named in @ARGV if there are command-line arguments, and STDIN if there are not (paraphrasing slightly). However, in this case you need both kinds of magic, so your perl will have to be a little more creative. It's harder to replicate the shuffle that -i does than to read from STDIN manually, so here's one way to try it:
find . -name "*.html" -type f -exec grep -l foo {} \; | \
perl -pi -e 'BEGIN{ @ARGV = <STDIN>; chomp @ARGV }; while (<>) { s/foo/bar/g; } continue { print }'
Notice that you can fiddle with @ARGV before the <> magic takes place. The internals of the script are basically what the -p option does.
---
"I hate it when I think myself into a corner."
Matt Mitchell
you would certainly want to use the second approach (with find -exec grep -l foo) to reduce your working file set as much as possible.
You would certainly not, because you will have to open all the files anyway - even if just to check. The difference is that grepping for matches first spawns one process per file and requires opening the matching files a second time (in Perl) to actually process them. You have a (large) net loss that way.
Taking that out, and using the -print0 option to avoid some nasty surprises (but not all, unfortunately, due to the darn magic open) leaves us with the following. Note I have removed the continue {} block as it isn't necessary and just costs time. I'm also setting the record separator such that the diamond operator reads fixed size blocks (64kbytes in this example), rather than scanning for some end of line character.
find . -name "*.html" -type f -print0 | \
perl -i -0e \
  'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; $/ = "\n" }
   while (<>) { s/foo/bar/g; print }'
That should be about as efficient as it gets.
If you have a lot of nonmatching files, you might save work by hooking a grep in there - but not with find's -exec. That's what xargs was invented for.
find . -name "*.html" -type f -print0 | \
xargs -r0 grep -lZ foo | \
perl -i -0e \
  'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; $/ = "\n" }
   while (<>) { s/foo/bar/g; print }'
Update: s!= \65536!= "\n"! as per runrig's observation.
Makeshifts last the longest.
find . -name "*.html" -type f -print0 | \
perl -i -0e \
  'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; $/ = \65536 }
   while (<>) { s/foo/bar/g; print }'
You don't want to do that. If 'foo' spans one of those read-block boundaries, you'll miss the substitution.
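The missed match is easy to reproduce with a deliberately tiny block size (a sketch; \4 stands in for \65536 so the boundary is easy to hit):

```shell
d=$(mktemp -d)
printf 'xxfoo\n' > "$d/f.txt"
# With $/ = \4 the reads are "xxfo" then "o\n": 'foo' straddles the
# block boundary, never appears whole in $_, and is never replaced.
perl -pi -e 'BEGIN{ $/ = \4 } s/foo/bar/g' "$d/f.txt"
cat "$d/f.txt"    # still xxfoo -- the substitution was missed
```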
Re: Large scale search and replace with perl -i
by BrowserUk (Patriarch) on Apr 15, 2003 at 02:05 UTC
Given that most html files are usually (hopefully) < 1 MB in size, it would make sense to use Aristotle's technique of changing $/, but set it to undef and slurp each file whole.
find . -name "*.html" -type f -print0 | \
perl -i -0e \
  'BEGIN{ @ARGV = <STDIN>; chomp @ARGV; undef $/ }
   while (<>) { s/foo/bar/g; print }'
If the number of files produced by find is too many for your command line to handle, couldn't you produce a list of directories from find and pass that into perl and then let perl glob those? Something like (NB: completely untested code)
find . -type d -print0 | \
perl -i -0e \
  'BEGIN{ @ARGV = <STDIN>;
          chomp @ARGV;
          @ARGV = map { glob "$_/*.html" } @ARGV;
          undef $/ }
   while (<>) { s/foo/bar/g; print }'
Combining that with Merlyn's trick of backing out the -i effect if nothing is found should save more time.
Examine what is said, not who speaks.
Re: Large scale search and replace with perl -i
by neilwatson (Priest) on Apr 14, 2003 at 18:55 UTC
Try this:
find / -name "*.html" -exec perl -pi -e 's/find/replace/gi' {} \;
Update Oops, reread your question. Hmmm, not sure about ignoring certain files. However, does filtering your find file list through grep really gain you any speed? You are having grep go through all your files and then having perl go through whatever files grep returns.
Neil Watson
watson-wilson.ca
Re: Large scale search and replace with perl -i
by Improv (Pilgrim) on Apr 14, 2003 at 19:23 UTC
One thing you might consider, given that you're willing to put the time into asking on Perlmonks, is using find2perl -- it should be a lot more efficient than actually using find.
Is it? I'd like to see some benchmarks. It's certainly not my impression, and I don't find it logical, find being a C program written to do exactly one task.
Abigail
|
The reason it should be, apart from it being suggested in the find2perl manpage (hehe), is that process creation is a fairly expensive operation, and it is usually the case that all the spawnings of perl (or anything else whose functionality perl can easily duplicate) slow down the entire operation enough that a single-process all-perl implementation will outpace it by a good margin. Of course, your mileage may vary.
Re: Large scale search and replace with perl -i
by Jenda (Abbot) on Apr 15, 2003 at 11:51 UTC
perl -MG=R -pi.bak -e "s/foo/bar/g" *.html
The =R tells G.pm to do the parameter globbing recursively.
Jenda
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
-- Rick Osborne
Edit by castaway: Closed small tag in signature