Multiple substitutions in large files

by mdi (Acolyte)
on May 09, 2005 at 13:41 UTC (#455179)
mdi has asked for the wisdom of the Perl Monks concerning the following question:

I need to do multiple substitutions in several large (1-10MB) files. I've been using this:
    use strict;
    use warnings;
    use Tie::File;

    foreach my $x (@ARGV) {
        tie my @f, 'Tie::File', $x or die "Could not tie $x: $!\n";
        for (@f) {
            s/^\|/\\N\|/;
            s/\|\s*$/\|\\N/;
            s/\|\s*\|/\|\\N\|/g;
            s/\|\.\s*\|/\|\\N\|/g;
            s/\|\s+/\|/g;
            s/\s+\|/\|/g;
            s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
            s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
        }
    }
...but this is taking entirely too long, and using up too much CPU. How can I do this more efficiently?

Replies are listed 'Best First'.
Re: Multiple substitutions in large files
by Joost (Canon) on May 09, 2005 at 13:48 UTC
Re: Multiple substitutions in large files
by dragonchild (Archbishop) on May 09, 2005 at 13:46 UTC
    #!/usr/bin/perl -p
    s/^\|/\\N\|/;
    s/\|\s*$/\|\\N/;
    s/\|\s*\|/\|\\N\|/g;
    s/\|\.\s*\|/\|\\N\|/g;
    s/\|\s+/\|/g;
    s/\s+\|/\|/g;
    s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
    s/(\d{5})-(?:\d{1,4}|\s+)/$1/;

    Execute as so:

    my_scriptydoo.pl file1 > file2

    Update: ikegami is absolutely correct. I should be doing a redirect. The next 1st level response provides the -pi version.


    • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
    • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
      Shouldn't that be -pi (or -pi.bak if a backup is desired)? With just -p, the usage would be my_scriptydoo.pl file1 > file1.new
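
For readers who want the -pi.bak behaviour inside a script rather than on the shebang line, here is a minimal sketch. It assumes the documented `$^I` variable together with the `<>` operator, which is what -i sets up. The helper name `edit_in_place` and the throwaway temp file are made up for illustration, and only three of the eight substitutions are shown. Each line is chomped first because, when a line still carries its newline, `\s*` before `$` would swallow that newline and join the line to the next one.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: rewrite $path in place, keeping the original as "$path.bak".
sub edit_in_place {
    my ($path) = @_;
    # Setting $^I switches on in-place editing for <>, like -i.bak on the command line.
    local ($^I, @ARGV) = ('.bak', $path);
    while (my $line = <>) {
        chomp $line;                    # stop \s* from eating the newline
        $line =~ s/^\|/\\N|/;           # leading empty field
        $line =~ s/\|\s*$/|\\N/;        # trailing empty field
        $line =~ s/\|\s*\|/|\\N|/g;     # empty field in the middle
        print "$line\n";                # written to the rewritten file, not STDOUT
    }
    return;
}

# Demo on a throwaway file.
my ($fh, $path) = tempfile();
print {$fh} "|b||d|\n";
close $fh or die "close: $!";
edit_in_place($path);

open my $in, '<', $path or die "open: $!";
print while <$in>;
close $in;
unlink $path, "$path.bak";
```

Like -p, this reads one line at a time, so memory use stays flat regardless of file size.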
Re: Multiple substitutions in large files
by ikegami (Pope) on May 09, 2005 at 14:58 UTC

    a|b||d becomes a|b|\N|d
    |b|c|d becomes \N|b|c|d
    a|b|c| becomes a|b|c|\N
    and similarly,
    a|b|.|d becomes a|b|\N|d
    but
    .|b|c|d does not become \N|b|c|d
    a|b|c|. does not become a|b|c|\N
    Is that a bug?
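
The six cases above can be checked with a small harness. This is only a sketch: the sub name `original_subs` is invented here, and the records are assumed to arrive without a trailing newline (as they do under Tie::File).

```perl
use strict;
use warnings;

# The eight substitutions from the original question, applied to one record.
sub original_subs {
    my ($line) = @_;
    for ($line) {                       # alias $_ to $line for the s/// calls
        s/^\|/\\N\|/;
        s/\|\s*$/\|\\N/;
        s/\|\s*\|/\|\\N\|/g;
        s/\|\.\s*\|/\|\\N\|/g;
        s/\|\s+/\|/g;
        s/\s+\|/\|/g;
        s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
        s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
    }
    return $line;
}

# Print each sample record next to its converted form.
for my $in ('a|b||d', '|b|c|d', 'a|b|c|', 'a|b|.|d', '.|b|c|d', 'a|b|c|.') {
    printf "%-8s => %s\n", $in, original_subs($in);
}
```

The last two records come back unchanged, which is the asymmetry being asked about.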

    If the above is a bug, the following regexps are probably faster:

    s/\s*\|\s*/\|/g;
    s/^\.?(?=\|)/\\N/;
    s/(?<=\|)\.?(?=\||$)/\\N/g;
    s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
    s/(?<=\d{5})-(?:\d{1,4}|\s+)//;

    If the above is not a bug, the following regexps are probably faster:

    s/\s*\|\s*/\|/g;
    s/^(?=\|)/\\N/;
    s/(?<=\|)(?=\||$)/\\N/g;
    s/(?<=\|)\.(?=\|)/\\N/g;
    s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
    s/(?<=\d{5})-(?:\d{1,4}|\s+)//;

    I reduced the number of regexps by combining a few, shortened them by removing the spaces first (not last), and used zero-width positive lookaheads and lookbehinds to minimize the text being captured and substituted.

    Use this in conjunction with the -p or -pi suggestion for better results.
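
As a rough illustration of the second set of regexps (the "not a bug" variant) in a line-at-a-time loop, here is a sketch. The sub name `lookaround_subs` and the in-memory sample data are assumptions made for the demo; in real use the loop would read from the file or from `<>`. Lines are chomped before the substitutions because `\s*\|\s*` would otherwise consume the newline after a trailing pipe. One difference worth checking against your data: since the lookarounds do not consume the pipes, a run such as `a|||d` gets both empty fields filled, whereas the original `s/\|\s*\|/.../g` fills only the first.

```perl
use strict;
use warnings;

# Lookaround-based substitutions, applied to one chomped record.
sub lookaround_subs {
    my ($line) = @_;
    for ($line) {                               # alias $_ to $line
        s/\s*\|\s*/|/g;                         # trim whitespace around every pipe first
        s/^(?=\|)/\\N/;                         # empty first field
        s/(?<=\|)(?=\||$)/\\N/g;                # empty middle or last field
        s/(?<=\|)\.(?=\|)/\\N/g;                # lone "." between two pipes
        s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;       # drop fractional seconds
        s/(?<=\d{5})-(?:\d{1,4}|\s+)//;         # drop the suffix after a 5-digit code
    }
    return $line;
}

# Demo: read sample records from an in-memory filehandle.
my $sample = join "\n", 'a | b ||d', '|b|.|12:34:56.789', '12345-6789|c|', '';
open my $in, '<', \$sample or die "open: $!";
while (my $line = <$in>) {
    chomp $line;
    print lookaround_subs($line), "\n";
}
close $in;
```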
