PerlMonks  

Re: Massive regexp search and replace

by Hena (Friar)
on Feb 10, 2005 at 13:15 UTC


in reply to Massive regexp search and replace

I assume that the source patterns are unique. I assume this because, if they are not, only the first matching pattern ends up being applied. If that assumption is correct, I suggest you parse the patterns into a hash instead of a list, which would remove some amount of splitting. Like this:

# assume REGEX is the pattern filehandle
# assume INPUT is your input filehandle
my %regex = ();
while (<REGEX>) {
    chomp;
    my ($key, $value) = split(/\t/, $_);
    # quote the replacement so the second /e pass interpolates $1, $2, ...
    $value = "\"$value\"";
    $regex{$key} = $value;
}
while (<INPUT>) {
    foreach my $key (keys %regex) {
        s/$key/$regex{$key}/gee;
    }
}
This would also let you use 'exists()' to test whether there is a regex for a given piece of input (depending on the input, e.g. changing only a certain column within a CSV file or something). But since I don't know whether the input is suitable for this, I can't tell whether exists could be used. If it could, you might be able to drop the second foreach loop completely.
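
A minimal sketch of the exists() idea, assuming the keys are literal strings rather than patterns and that only one column of a tab-separated line should change. The table contents, the column index and the name replace_column are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical replacement table: literal keys only, no regex syntax.
my %replace = (
    Treecontrol => 'Tree Control',
    Tile        => 'Teilbild',
);

# Replace the field in column $col if the table has an entry for it.
sub replace_column {
    my ($line, $col) = @_;
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    $fields[$col] = $replace{ $fields[$col] }
        if defined $fields[$col] && exists $replace{ $fields[$col] };
    return join "\t", @fields;
}

print replace_column("1\tTreecontrol\tfoo", 1), "\n";
```

Each line then costs one split and one hash lookup, no matter how many rows the table has. Whether that fits depends entirely on the input having such a column.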


Re^2: Massive regexp search and replace
by albert.llorens (Initiate) on Feb 10, 2005 at 13:31 UTC
    Thanks, Hena. I will try what you suggest and see if it reduces processing time sufficiently.

    As for your assumptions, a sample replacement patterns list (REGEX) could be:
    \b([a-z])([a-z]*)ung\b	\u$1\l$2ung
    Treecontrol	Tree Control
    [Tt]abreiter	Reiterelement
    [Tt]ile	Teilbild
    And a sample input text (INPUT) for the replacements could be:
    Die Segnung ist gestern erfolgt.
    Die segnung ist gestern erfolgt.
    Die Rechnung wird geschickt.
    Die rechnung wird geschickt.
    Die Treecontrol.
    Die Tabreiter.
    Die tabreiter.
    Die Tile.
    Die tile.
    I wonder if this changes anything in what you suggest...
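
For reference, a small sketch of how the first sample row behaves with the quoted-value /ee substitution from the parent post. The row and the input line are hard-coded here rather than read from REGEX and INPUT, and apply_row is just a name chosen for the example:

```perl
use strict;
use warnings;

# First sample row, hard-coded instead of being read from the pattern file.
my $pattern     = '\b([a-z])([a-z]*)ung\b';
my $replacement = '\u$1\l$2ung';

# Wrap the replacement in double quotes so that the second /e pass
# evaluates it as a Perl string and interpolates $1 and $2.
my $quoted = "\"$replacement\"";

sub apply_row {
    my ($line) = @_;
    $line =~ s/$pattern/$quoted/gee;
    return $line;
}

print apply_row("Die segnung ist gestern erfolgt."), "\n";
```

The first /e turns $quoted into the string "\u$1\l$2ung" (quotes included), and the second /e evaluates that string as Perl code, which is where \u, \l and the capture groups take effect.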
      Well, all direct text translations might be handled faster... but unless there are a lot of them compared to the others, it probably won't help (it might actually be slower). The actual benefit would need to be tested, as this is pure speculation :).

      Basically, make two hashes instead of one. Something like this:
      while (<REGEX>) {
          chomp;
          my ($key, $value) = split(/\t/, $_);
          if ($key =~ /^\w+$/) {
              # plain word: store the literal replacement
              $simple{$key} = $value;
          } else {
              # real pattern: quote the replacement for /ee
              $regex{$key} = "\"$value\"";
          }
      }
      while (<INPUT>) {
          chomp;
          foreach my $key (keys %regex) {
              s/$key/$regex{$key}/gee;
          }
          my @line;
          foreach (split(/\s+/, $_)) {
              # exact word lookup: a word with attached punctuation ("Tile.") will not match
              push(@line, exists($simple{$_}) ? $simple{$_} : $_);
          }
          print OUT "@line\n";
      }
      Note that in the given examples you could write out the '[Tt]ile' pattern as separate 'Tile' and 'tile' rows, which would move it from the slower pattern group to the faster one. But as I said, I'm not sure how much this would help.
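
To illustrate that last point, here is a rough sketch, assuming '[Tt]ile' and '[Tt]abreiter' have been written out as literal rows. Instead of splitting on whitespace it joins the literal keys into one alternation. That is a substitution of technique, not the code above, and it leaves punctuation such as a trailing full stop in place. The names %simple and replace_simple are invented for the example:

```perl
use strict;
use warnings;

# Literal rows only: '[Tt]ile' written out as 'Tile' and 'tile', and so on.
my %simple = (
    Treecontrol => 'Tree Control',
    Tabreiter   => 'Reiterelement',
    tabreiter   => 'Reiterelement',
    Tile        => 'Teilbild',
    tile        => 'Teilbild',
);

# One alternation of all literal keys, longest first, metacharacters escaped.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %simple;
my $re = qr/\b($alt)\b/;

# Single pass over the line; each hit is a plain hash lookup.
sub replace_simple {
    my ($line) = @_;
    $line =~ s/$re/$simple{$1}/g;
    return $line;
}

print replace_simple("Die Treecontrol. Die tabreiter. Die Tile."), "\n";
```

How this compares in speed with looping over the patterns one by one would still have to be measured on the real data.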
      Expanding on Hena's idea, I wonder if it would be even more efficient to use Tie::File to go through the file, writing replacements as you go (untested):
      use Tie::File;
      my $inputfile = "samplein.txt";
      replacer($inputfile);
      sub replacer {
          my ($file) = @_;
          # %regex is assumed to be filled as in Hena's code
          tie my @currentfile, 'Tie::File', $file or die "$!";
          foreach my $i (0 .. $#currentfile) {
              my $line = $currentfile[$i];
              foreach my $key (keys %regex) {
                  $line =~ s/$key/$regex{$key}/gee;
              }
              $currentfile[$i] = $line;   # writes the changed line back to the file
          }
          untie @currentfile;
      } ## Totally untested

      Seems like the write operation would be faster with Tie::File
