Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Massive regexp search and replace

by albert.llorens (Initiate)
on Feb 10, 2005 at 12:39 UTC ( #429692=perlquestion: print w/ replies, xml ) Need Help??
albert.llorens has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need to do massive regexp search and replace. For a given text file, I need to go through each line of this file and apply each of the (different) replacement expressions listed in a Perl script (or a separate data file). The whole thing is not too complex to implement. The problem is my replacement expressions may amount to thousands, in which case all several implementations I have tried are extremely slow.

Here is a sample implementation of mine:
open(IN, "<samplein.txt"); my @INPUT = <IN>; close (IN); open (OUT,">sampleout.txt"); foreach $INline (@INPUT) { foreach my $Rpatt (@Patts) { (my $Source, my $Target) = split(/\t/, $Rpatt); if ($Source && $Target) { $Source = $Source; $Target = "\"$Target\""; $INline =~ s/$Source/$Target/gee); } } chomp $INline; print OUT $INline; } close (OUT);

This implementation does the job, but when the list of replacement patterns (@Patts) is big (several thousands), the replacement takes ages.

Can anyone help me find an efficient implementation.

Comment on Massive regexp search and replace
Download Code
Re: Massive regexp search and replace
by Hena (Friar) on Feb 10, 2005 at 13:15 UTC
    I assume that source in the patterns are unique. This assumption comes from the fact that it they are not, then you end up doing only the first. If that assumption is correct, then I suggest you parse the patterns as hash instead of list, this would remove someamount of splits. Like this:
    # assume REGEX is the pattern filehandle # asseme INPUT is the your input filehandle my %regex=(); while (<REGEX>) { chomp; my ($key,$value) = split (\t,$_); $value = "\"$value\""; $regex{$key}=$value; } while (<INPUT>) { s/$key/$regex{$key}/gee foreach my $key (keys %regex); }
    This could also allow testing if there is an regex you want to use 'exists()' (depending on input, eg change only certain column within csv file or something). But since I don't know if input is suitable for this, i can't know if exists could be used. If it could, you might be able to drop the second foreach loop completely.
      Thanx Hena. I will try what you suggest and see if it reduces processing time sufficiently.

      As for your assumtions, a sample replacement patterns list (REGEX) could be:
      \b([a-z])([a-z]*)ung\b \u$1\l$2ung Treecontrol Tree Control [Tt]abreiter Reiterelement [Tt]ile Teilbild
      And a sample input text (INPUT) for the replacements could be:
      Die Segnung ist gestern erfolgt. Die segnung ist gestern erfolgt. Die Rechnung wird geschickt. Die rechnung wird geschickt. Die Treecontrol. Die Tabreiter. Die tabreiter. Die Tile. Die tile.
      I wonder if this changes anything in what you suggest...
        Well, all direct text translations might be handled faster... but unless there is a lot of them compared to others then it probably won't help (might actually be slower). The actual help would be better to be tested as this is pure speculation :).

        Basicly make to hashes instead of one. Something like this.
        while (<REGEX>) { chomp; my ($key,$value) = split (\t,$_); $value = "\"$value\""; if ($key=~s/^\w+$/) { $simple{$key}=$value; } else { $regex{$key}=$value; } } while (<INPUT>) { s/$key/$regex{$key}/gee foreach my $key (keys %regex); foreach (split (/\s+/,$_)) { if (exists($simple{$_})) { push (@line,$simple{$_}); } else { push (@line,$_); } } print OUT "@line\n"; }
        Note that in the given examples, you might write out the '[Tt]ile' pattern to Tile and tile rows. Which would move it from slower pattern group to faster. But as I said, I'm not sure how much this would help.
        Expanding on Hena's idea I wonder if it would be even more efficient to use Tie::File to go through, writing replacements as you go (untested):
        use Tie::File; my $inputfile = "samplein.txt"; &replacer($inputfile); sub replacer { tie my @currentfile, 'Tie::File', $inputfile or die "$!"; my $inputline; foreach $inputline ( $currentfile[0] .. $#currentfile ) { foreach my $key (keys %regex) { $inputline =~ s/$key/$regex{$key}/gee; } } untie @currentfile; } ## Totally untested

        Seems like the write operation would be faster with Tie::File
Re: Massive regexp search and replace
by holli (Monsignor) on Feb 10, 2005 at 14:57 UTC
    Using the following technique, you can encapsulate your regexes in anonymous subroutines, that can easily be called with the string to change as first argument. They return the changed string. Like this:
    #list of regex-strings my @regex = ( 's/(a+)/\U$1/g', 's/([bz]+)/XX/g', ); #is now a list of subroutines @regex = map { eval "\$sub = sub { \$_=\$_[0]; $_; \$_ }" } @regex;
    This list can easily be used like this:
    my @text = ( "aaaabbzz", "bbbyyy", ); for my $t ( @text ) { print "org $t\n"; for my $re ( @regex ) { $t = &$re($t); } print "new $t\n"; }
    Encapsulating the regexes in subroutines should be faster than recompiling the same regex again and again. Note, that I did no benchmarks.


    holli, /regexed monk/

      No doubt - advantage for compiling the regular expressions only once. But I'd take it just a tiny bit further - instead of all the copying around of the line:

      #list of regex-strings my @regex = ( 's/(a+)/\U$1/g', 's/([bz]+)/XX/g', ); #is now a list of subroutines @regex = map { eval "sub { $_ }" } @regex;
      Notes: got rid of the copying of the line in and out, we'll just work on the global $_; also got rid of the extraneous assignment to the global $sub variable. Now you use it like:
      my @text = ( "aaaabbzz", "bbbyyy", ); for ( @text ) { print "org $_\n"; for my $re ( @regex ) { &$re(); # or even just &$re } print "new $_\n"; }
      The advantage here is when you have many regex's (which the OP said they would) - less copying of data around. It's just a tiny bit more dangerous since so many functions modify $_, though.

Re: Massive regexp search and replace
by Anonymous Monk on Feb 10, 2005 at 15:00 UTC
    One problem you have is that you're recompiling the regexes over and over again. Better to cache them. I'd do something like:
    # Preprocessing phase. my @subs = map {my($r, $s) = split /\t/; [qr/$r/, $s]} @Patts; # Real stuff. open(my $in, "<", "...") or die "..."; open(my $out, ">", "...") or die "..."; while (<$in>) { s/$_[0]/$_[1]/g for @subs; # No need for /ee, is there? print $out $_; } close $out or die "...";
Re: Massive regexp search and replace
by grinder (Bishop) on Feb 10, 2005 at 15:08 UTC
    Can anyone help me find an efficient implementation.

    If the cost of a subroutine call is cheaper than scanning the list (which I suspect is the case), then you can assemble all the target patterns into one, perform a single match, and then dispatch to the sub that gives you what you want to substitute:

    #! /usr/local/bin/perl -w use strict; use Regexp::Assemble; my $ra; my %dispatch = ( 'food' => sub { 'pizza' }, 'water' => sub { 'beer' }, 'like' => sub { 'enjoy' }, '(\d+)F' => sub { int(($ra->mvar(1) - 32 ) * 5/9 ) . 'C' }, ); $ra = Regexp::Assemble->new( track => 1 )->add( keys %dispatch ); while( <DATA> ) { while( $ra->match($_) ) { my $m = $ra->matched; s/$m/&{$dispatch{$m}}/e; } print; } __DATA__ I'd like a glass of water with my food, it's 92F in here!

    ... produces...

    I'd enjoy a glass of beer with my pizza, it's 33C in here!

    Generating the dispatch table from a data file is left as an exercise to the reader (but a pretty fun one, I might say).

    - another intruder with the mooring in the heart of the Perl

      Thanks grinder.

      I had already used module Regexp::Assemble for another quite similar script, and I had thought of using it with track=>1 for my replacement script. But I could not because my new script hanged whenever I created a Regexp::Assemble object with track=>1.

      The exact line in my code where the script hanged:
      my $Rpatt = Regexp::Assemble->new( chomp=>1, track=>1 )->add( keys(%PA +TTS) );
      Where %PATTS is a hash with elements of the form 'pattern->replacement'. Any idea why it hanged?
        Any idea why it hanged?

        Without seeing %PATTS, no idea. Maybe you have uncovered a bug! If you're interested in resolving the problem, I now have a mailing list for the module. If you care to join up we can maybe get to the bottom of the problem.

        - another intruder with the mooring in the heart of the Perl

Re: Massive regexp search and replace
by rdfield (Priest) on Feb 10, 2005 at 15:51 UTC
    Maybe I'm missing something completely obvious, but have you tried the "o" modifier in your regexes? eg
    $INline =~ s/$Source/$Target/egoe

    rdfield

      No, but I have tried precompiling $Source with qr// in my patterns list before I do the actual replacement. Isn't that more or less equivalent to using /o?
      You´re missing that the /o-modifier is just a "promise" to perl, that the variable that contains the regex won´t change within the iteration/loop. In this cause using /o is just wrong, because only the first regex would be applied over and over. Regardless of the change of $Source or $Target.

      qr//-ing the regex is something different. When you qr// a string and assign it to a variable you store the compiled regex in it. This is easily proven by ref()-ing that variable, it will return "Regexp". Of course you can store such a Regexp in whatever data-structure you want, not just plain scalars.

      But that is of little use in this case, because there is no qs//-statment (qr// is just for matching, not for replacing).


      holli, /regexed monk/
Re: Massive regexp search and replace
by TedPride (Priest) on Feb 10, 2005 at 16:26 UTC
    The massive processing time is caused by having to run each regex on each line. The best way to cut processing time is to run the regexes on the entire file instead, or at least as large chunks of it as you can manage. Read x number of bytes and then everything to the next line break; perform all regex calls; output to a new file.

    This is a case where buying a block or two of additional memory may be the simplest - and most cost-effective - solution.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://429692]
Approved by grinder
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (16)
As of 2014-07-14 14:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (262 votes), past polls