### Multiple substitutions in large files

by mdi (Acolyte) on May 09, 2005 at 13:41 UTC
mdi has asked for the wisdom of the Perl Monks concerning the following question:

I need to do multiple substitutions in several large (1-10MB) files. I've been using this:

use strict;
use warnings;
use Tie::File;

foreach my $x (@ARGV) {
    tie my @f, 'Tie::File', $x or die "Could not tie $x:$!\n";
    for (@f) {
        s/^\|/\\N\|/;
        s/\|\s*$/\|\\N/;
        s/\|\s*\|/\|\\N\|/g;
        s/\|\.\s*\|/\|\\N\|/g;
        s/\|\s+/\|/g;
        s/\s+\|/\|/g;
        s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
        s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
    }
}

...but this is taking entirely too long, and using up too much CPU. How can I do this more efficiently?

Replies are listed 'Best First'.

Re: Multiple substitutions in large files
by Joost (Canon) on May 09, 2005 at 13:48 UTC

#!/usr/bin/perl -pi
s/^\|/\\N\|/;
s/\|\s*$/\|\\N/;
s/\|\s*\|/\|\\N\|/g;
s/\|\.\s*\|/\|\\N\|/g;
s/\|\s+/\|/g;
s/\s+\|/\|/g;
s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
s/(\d{5})-(?:\d{1,4}|\s+)/$1/;

This should be much more efficient, because it writes each file out once as a new file instead of trying to move all the bytes in the file around for each substitution.

See perlrun's info on the -i and -p switches.
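For readers who want to see what the two switches arrange, here is a minimal sketch of the same loop written out by hand. The helper names fix_line and edit_in_place are hypothetical and not part of any reply. One assumption worth stating: under -p each line still ends in its newline, whereas Tie::File hands over records with the separator removed, so the sketch chomps before substituting; otherwise a pattern such as \|\s*$ would also swallow the line ending. The -l switch described in perlrun does the same chomp-and-restore under -p.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the substitutions from the question, applied to one
# line whose trailing newline has already been removed.
sub fix_line {
    my ($line) = @_;
    for ($line) {
        s/^\|/\\N\|/;
        s/\|\s*$/\|\\N/;
        s/\|\s*\|/\|\\N\|/g;
        s/\|\.\s*\|/\|\\N\|/g;
        s/\|\s+/\|/g;
        s/\s+\|/\|/g;
        s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
        s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
    }
    return $line;
}

# Roughly what "perl -pi" sets up: read each file once, line by line,
# and print the edited lines to a replacement file of the same name.
sub edit_in_place {
    my @files = @_;
    local $^I   = '';       # '' means edit in place without a backup, like -i
    local @ARGV = @files;   # <> reads the files listed in @ARGV
    while (my $line = <>) {
        chomp $line;                   # keep \s*$ away from the newline
        print fix_line($line), "\n";   # print goes to the replacement file
    }
    return;
}

edit_in_place(@ARGV) if @ARGV;
```

Run it as the script name followed by the files to rewrite. Setting $^I to '.bak' instead of '' would keep a backup copy, as -pi.bak does.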

Re: Multiple substitutions in large files
by dragonchild (Archbishop) on May 09, 2005 at 13:46 UTC
#!/usr/bin/perl -p

s/^\|/\\N\|/;
s/\|\s*$/\|\\N/;
s/\|\s*\|/\|\\N\|/g;
s/\|\.\s*\|/\|\\N\|/g;
s/\|\s+/\|/g;
s/\s+\|/\|/g;
s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
s/(\d{5})-(?:\d{1,4}|\s+)/$1/;

Execute as so:

my_scriptydoo.pl file1 > file2

Update: ikegami is absolutely correct. I should be doing a redirect. The next 1st level response provides the -pi version.

• In general, if you think something isn't in Perl, try it out, because it usually is. :-)
• "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"

Shouldn't that be -pi (or -pi.bak if a backup is desired)? With just -p, the usage would be

my_scriptydoo.pl file1 > file1.new

Re: Multiple substitutions in large files
by ikegami (Pope) on May 09, 2005 at 14:58 UTC

a|b||d becomes a|b|\N|d
|b|c|d becomes \N|b|c|d
a|b|c| becomes a|b|c|\N

and similarly,

a|b|.|d becomes a|b|\N|d

but

.|b|c|d does not become \N|b|c|d
a|b|c|. does not become a|b|c|\N

Is that a bug?

If the above is a bug, the following regexps are probably faster:

s/\s*\|\s*/\|/g;
s/^\.?(?=\|)/\\N/;
s/(?<=\|)\.?(?=\||$)/\\N/g;
s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
s/(?<=\d{5})-(?:\d{1,4}|\s+)//;
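A quick way to see how this first set treats the six sample lines is to wrap it in a function and feed it each one. This is only a sketch: the name null_fields_any_dot is made up for illustration, and the input is assumed to be a line without its trailing newline.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the five substitutions above.
# Expects one line with the newline already removed.
sub null_fields_any_dot {
    my ($line) = @_;
    for ($line) {
        s/\s*\|\s*/\|/g;                    # trim whitespace around every pipe
        s/^\.?(?=\|)/\\N/;                  # empty or "." first field
        s/(?<=\|)\.?(?=\||$)/\\N/g;         # empty or "." later fields
        s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;   # drop fractional seconds
        s/(?<=\d{5})-(?:\d{1,4}|\s+)//;     # drop the suffix after 5 digits
    }
    return $line;
}

for my $in ('a|b||d', '|b|c|d', 'a|b|c|', 'a|b|.|d', '.|b|c|d', 'a|b|c|.') {
    printf "%-8s becomes %s\n", $in, null_fields_any_dot($in);
}
```

With this set, the leading-dot and trailing-dot lines also end up with \N, which is the behaviour wanted if the asymmetry in the original is indeed a bug.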

If the above is not a bug, the following regexps are probably faster:

s/\s*\|\s*/\|/g;
s/^(?=\|)/\\N/;
s/(?<=\|)(?=\||$)/\\N/g;
s/(?<=\|)\.(?=\|)/\\N/g;
s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
s/(?<=\d{5})-(?:\d{1,4}|\s+)//;
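The second set can be checked the same way. The sketch below uses a hypothetical wrapper, null_fields_dot_between, and made-up sample lines. It assumes the end-of-line alternative in the third substitution is the anchor $ rather than a literal dollar sign, and that each line has had its newline removed.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the six substitutions above.
sub null_fields_dot_between {
    my ($line) = @_;
    for ($line) {
        s/\s*\|\s*/\|/g;                    # trim whitespace around every pipe
        s/^(?=\|)/\\N/;                     # empty first field
        s/(?<=\|)(?=\||$)/\\N/g;            # empty later fields ($ = end of line)
        s/(?<=\|)\.(?=\|)/\\N/g;            # "." only when between two pipes
        s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;   # drop fractional seconds
        s/(?<=\d{5})-(?:\d{1,4}|\s+)//;     # drop the suffix after 5 digits
    }
    return $line;
}

# Input and the result the question's original substitutions give for it.
my @cases = (
    [ 'a|b||d',                  'a|b|\N|d' ],
    [ 'a|b|c|',                  'a|b|c|\N' ],
    [ 'a|b|.|d',                 'a|b|\N|d' ],
    [ '.|b|c|d',                 '.|b|c|d'  ],   # leading dot left alone
    [ 'a|b|c|.',                 'a|b|c|.'  ],   # trailing dot left alone
    [ 'a | b |  | d',            'a|b|\N|d' ],
    [ '10:15:30.123|12345-6789', '10:15:30|12345' ],
);

for my $case (@cases) {
    my ($in, $want) = @$case;
    my $got = null_fields_dot_between($in);
    printf "%-24s -> %-16s %s\n", $in, $got,
        $got eq $want ? 'ok' : "expected $want";
}
```

One difference is worth knowing about. Because the lookarounds consume no text, a run of empty fields such as a|||d comes out as a|\N|\N|d here, while the original s/\|\s*\|/.../g consumes both pipes on each match and so fills only every other empty field (a|\N||d).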

I reduced the number of regexps by combining a few, I shortened the regexps by removing the spaces first (not last), and I used zero-width positive lookaheads and lookbehinds to minimize the text being captured and substituted.
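The last point, asserting text with a lookbehind instead of capturing it and putting it back, can be compared directly. The sketch below is not a measurement from this thread: the function names and the sample line are made up, and whether the lookbehind form is actually faster depends on the data and the perl version, so it is worth timing on real input with the core Benchmark module.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Capture the timestamp and write it back via $1.
sub strip_with_capture {
    my ($line) = @_;
    $line =~ s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
    return $line;
}

# Assert the timestamp with a lookbehind and delete only the fraction.
sub strip_with_lookbehind {
    my ($line) = @_;
    $line =~ s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
    return $line;
}

# Made-up sample line with two timestamps.
my $sample = 'a|2005-05-09 13:41:07.123456|b|14:58:00.5|c';

print strip_with_capture($sample),    "\n";
print strip_with_lookbehind($sample), "\n";

# Optional timing run: pass "bench" as the first argument.
if (@ARGV && $ARGV[0] eq 'bench') {
    cmpthese(-1, {
        capture    => sub { strip_with_capture($sample) },
        lookbehind => sub { strip_with_lookbehind($sample) },
    });
}
```

Both functions should print the same line, with the fractional seconds removed from each timestamp.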

Use this in conjunction with the -p or -pi suggestion for better results.


