PerlMonks  

Multiple substitutions in large files

by mdi (Acolyte)
on May 09, 2005 at 13:41 UTC (#455179=perlquestion)
mdi has asked for the wisdom of the Perl Monks concerning the following question:

I need to do multiple substitutions in several large (1-10MB) files. I've been using this:
    use strict;
    use warnings;
    use Tie::File;

    foreach my $x (@ARGV) {
        tie my @f, 'Tie::File', $x or die "Could not tie $x: $!\n";
        for (@f) {
            s/^\|/\\N\|/;
            s/\|\s*$/\|\\N/;
            s/\|\s*\|/\|\\N\|/g;
            s/\|\.\s*\|/\|\\N\|/g;
            s/\|\s+/\|/g;
            s/\s+\|/\|/g;
            s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
            s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
        }
    }
...but this is taking entirely too long, and using up too much CPU. How can I do this more efficiently?

Re: Multiple substitutions in large files
by dragonchild (Archbishop) on May 09, 2005 at 13:46 UTC
    #!/usr/bin/perl -p
    s/^\|/\\N\|/;
    s/\|\s*$/\|\\N/;
    s/\|\s*\|/\|\\N\|/g;
    s/\|\.\s*\|/\|\\N\|/g;
    s/\|\s+/\|/g;
    s/\s+\|/\|/g;
    s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
    s/(\d{5})-(?:\d{1,4}|\s+)/$1/;

    Run it like this:

    my_scriptydoo.pl file1 > file2

    Update: ikegami is absolutely correct. I should be doing a redirect. The next 1st level response provides the -pi version.
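For readers unfamiliar with the -p switch, here is a rough sketch of the loop it wraps around the script body: read one line, run the substitutions on $_, print the result. The helper name filter_lines and the in-memory handles are assumptions made for the demo, not part of the post. The sketch also chomps each line, because with plain -p the trailing newline is still in $_ and a pattern such as \|\s*$ can consume it; adding the -l switch does the same chomp-and-restore automatically.

```perl
use strict;
use warnings;

# Hypothetical helper: approximately what "perl -lp" does around a script.
# $edit is a code ref that modifies $_ in place.
sub filter_lines {
    my ($in, $out, $edit) = @_;
    while (defined(my $line = <$in>)) {
        chomp $line;              # keep the newline away from \s* patterns
        $edit->() for $line;      # "for" aliases $_ to $line
        print {$out} "$line\n";   # put the newline back on output
    }
    return;
}

# Demo on in-memory handles, using one substitution from the thread.
my $input = "|10:15:30.123| x \n";
open my $in,  '<', \$input     or die "Cannot open input: $!";
open my $out, '>', \my $output or die "Cannot open output: $!";
filter_lines($in, $out, sub { s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g });
close $out or die "Cannot close output: $!";
print $output;
```

Only one line is held in memory at a time, which is the main difference from tying the whole file with Tie::File.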


    • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
    • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
      Shouldn't that be -pi (or -pi.bak if a backup is desired)? With just -p, the usage would be my_scriptydoo.pl file1 > file1.new
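As a sketch of what -pi.bak amounts to inside a script: setting $^I (the variable behind -i) and @ARGV makes the <> loop rewrite each file in place and keep a .bak copy. The helper names edit_in_place and slurp, and the temporary sample file, are assumptions for the demo only.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: edit files in place, like "perl -pi.bak".
sub edit_in_place {
    my ($edit, @files) = @_;
    return unless @files;                    # otherwise <> would read STDIN
    local ($^I, @ARGV) = ('.bak', @files);   # $^I is the variable behind -i
    while (<>) {
        $edit->();                           # modifies $_
        print;                               # written to the file being edited
    }
    return;
}

# Hypothetical helper: read a whole file into a string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot read $path: $!";
    local $/;
    return scalar <$fh>;
}

# Demo on a throwaway file, using one substitution from the thread.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/sample.txt";
open my $fh, '>', $file or die "Cannot write $file: $!";
print {$fh} "12345-6789|x\n";
close $fh or die "Cannot close $file: $!";

edit_in_place(sub { s/(\d{5})-(?:\d{1,4}|\s+)/$1/ }, $file);

print "edited: ", slurp($file);
print "backup: ", slurp("$file.bak");
```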
Re: Multiple substitutions in large files
by Joost (Canon) on May 09, 2005 at 13:48 UTC
Re: Multiple substitutions in large files
by ikegami (Pope) on May 09, 2005 at 14:58 UTC

    a|b||d becomes a|b|\N|d
    |b|c|d becomes \N|b|c|d
    a|b|c| becomes a|b|c|\N
    and similarly,
    a|b|.|d becomes a|b|\N|d
    but
    .|b|c|d does not become \N|b|c|d
    a|b|c|. does not become a|b|c|\N
    Is that a bug?
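The cases above can be checked with a small script. This is only a sketch: original_subs is a hypothetical wrapper around the eight substitutions from the question, applied to lines that have already had their newline removed (as Tie::File does).

```perl
use strict;
use warnings;

# Hypothetical wrapper: the substitutions from the question, unchanged.
sub original_subs {
    my ($line) = @_;
    for ($line) {
        s/^\|/\\N\|/;
        s/\|\s*$/\|\\N/;
        s/\|\s*\|/\|\\N\|/g;
        s/\|\.\s*\|/\|\\N\|/g;
        s/\|\s+/\|/g;
        s/\s+\|/\|/g;
        s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
        s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
    }
    return $line;
}

# Print each sample row next to its result.
for my $row ('a|b||d', '|b|c|d', 'a|b|c|', 'a|b|.|d', '.|b|c|d', 'a|b|c|.') {
    printf "%-8s => %s\n", $row, original_subs($row);
}
```

The last two rows come back unchanged, because the dot rule requires a pipe on both sides of the dot.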

    If the above is a bug, the following regexps are probably faster:

    s/\s*\|\s*/\|/g;
    s/^\.?(?=\|)/\\N/;
    s/(?<=\|)\.?(?=\||$)/\\N/g;
    s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
    s/(?<=\d{5})-(?:\d{1,4}|\s+)//;

    If the above is not a bug, the following regexps are probably faster:

    s/\s*\|\s*/\|/g;
    s/^(?=\|)/\\N/;
    s/(?<=\|)(?=\||$)/\\N/g;
    s/(?<=\|)\.(?=\|)/\\N/g;
    s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
    s/(?<=\d{5})-(?:\d{1,4}|\s+)//;

    I reduced the number of regexps by combining a few, shortened them by stripping the spaces first (not last), and used zero-width positive lookaheads and lookbehinds to minimize the text being captured and substituted.

    Use this in conjunction with the -p or -pi suggestion for better results.
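To illustrate two of the ideas in this reply, the sketch below puts the capture-and-reinsert form from the question next to the lookbehind form, and the two-pass space trimming next to the one-pass version. The sub names and sample rows are made up for the demo. It only shows that the results agree on these inputs; it does not measure speed.

```perl
use strict;
use warnings;

# Capture-and-reinsert form from the question.
sub strip_fraction_capture {
    my ($s) = @_;
    $s =~ s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
    return $s;
}

# Lookbehind form from the reply: only the fraction is matched and removed.
sub strip_fraction_lookbehind {
    my ($s) = @_;
    $s =~ s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
    return $s;
}

# Two passes over the line, as in the question.
sub trim_two_pass {
    my ($s) = @_;
    $s =~ s/\|\s+/\|/g;
    $s =~ s/\s+\|/\|/g;
    return $s;
}

# One pass, as in the reply.
sub trim_one_pass {
    my ($s) = @_;
    $s =~ s/\s*\|\s*/\|/g;
    return $s;
}

my $row = 'a|10:15:30.123|23:59:59.000001|z';
print strip_fraction_capture($row),    "\n";
print strip_fraction_lookbehind($row), "\n";

my $padded = 'a | b|  c';
print trim_two_pass($padded), "\n";
print trim_one_pass($padded), "\n";
```

Whether the lookaround forms are actually faster on a given data set is worth measuring, for example with cmpthese from the core Benchmark module.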

Node Type: perlquestion [id://455179]
Approved by Fletch