
Improve processing time for string substitutions

by valavanp (Curate)
on Apr 16, 2007 at 13:34 UTC (#610350=perlquestion)
valavanp has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I want to convert entities like &ge, &le into hexadecimal values &#x2265, &#x2264. The contents of the file are in $string. I am storing %entitylist in a separate file, which holds 15,000 entities and their hexadecimal values, e.g. '&ge' => '&#x2265'. With the code below it takes 20 seconds to process each file, as it has to read the entitylist file and replace each entity with its hexadecimal value.

    while (my ($key, $value) = each(%entitylist)) {
        #print $key."\n".$value."\n";
        $string =~ s/$key/$value/g;
    }

Is there any other solution to minimise the processing time? Thanks, monks, for your suggestions.

Re: Improve processing time for string substitutions
by Corion (Pope) on Apr 16, 2007 at 13:54 UTC

    Like I already told you in the CB:

    my $re = join "|", map { quotemeta $_ } keys %entitylist;
    $re = qr/(?:$re)/;
    # ...
    $string =~ s/($re)/$entitylist{ $1 }/ge;

    You might want to take a look at perlre for the /e switch.

      Is there a reason for the non-capturing parens in "$re = qr/(?:$re)/;"?

      What good does the /e switch do in the regex "s/($re)/$entitylist{ $1 }/ge;"?


        No (qr// already adds non-capturing parens) and none (interpolation is done after each match).
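        To illustrate the second point, here is a minimal sketch (the two-entry table and the sample strings are made up for the example). Without /e the replacement is interpolated like a double-quoted string for each match, so the hash lookup works as-is; /e only becomes necessary when the replacement is an expression, such as a sprintf call.

```perl
use strict;
use warnings;

# Hypothetical two-entry table, for illustration only.
my %entitylist = ( '&ge' => '&#x2265', '&le' => '&#x2264' );
my $re = join '|', map { quotemeta $_ } keys %entitylist;

# No /e: the replacement is interpolated like a double-quoted string,
# once per match, so the hash lookup on $1 works as-is.
(my $plain = 'a &ge b &le c') =~ s/($re)/$entitylist{$1}/g;

# /e: the replacement is evaluated as Perl code, which is what a
# function call such as sprintf needs.
(my $computed = "a \x{2265} b") =~ s/([^\x00-\x7F])/sprintf('&#x%04X', ord $1)/ge;

print "$plain\n";     # a &#x2265 b &#x2264 c
print "$computed\n";  # a &#x2265 b
```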

        The /e switch is an error by me - it is a leftover from when I thought about doing the hex-conversion manually with a sprintf call in the right hand side.

        I always build my regular expressions with noncapturing parentheses when I plan on later assembling them - this prevents embarrassing bug hunts later when I change how the target RE is built, possibly by repeating one ("atomic") building block - leaving out the parentheses causes hard-to-track misbehaviour with input on the seam of the two blocks.
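        A small sketch of the kind of seam problem described here, assuming the building block is kept as a plain string rather than a qr// object (the names $block, $loose and $tight are invented for the example):

```perl
use strict;
use warnings;

# Hypothetical building block, kept as a plain string on purpose.
my $block = 'cat|dog';

# Without grouping, the alternation spills over the seam:
# ^cat|dog$ means "starts with cat" OR "ends with dog".
my $loose = qr/^$block$/;

# With (?:...), both anchors apply to the whole block.
my $tight = qr/^(?:$block)$/;

for my $word (qw(cat dog catalog)) {
    printf "%-8s loose=%d tight=%d\n", $word,
        ($word =~ $loose ? 1 : 0), ($word =~ $tight ? 1 : 0);
}
```

        Only the grouped version rejects "catalog"; the ungrouped one accepts it because the first alternative is just "^cat".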

Re: Improve processing time for string substitutions
by kyle (Abbot) on Apr 16, 2007 at 13:47 UTC

    Off the top of my head:

    my $entity = join '|', keys %entitylist;
    $string =~ s/($entity)/$entitylist{$1}/g;

    This way you only scan the whole string once instead of once for each entity.

    Update: I've gotten a private message that this needs a /e switch to work. I submit that this is not the case.

    my %entitylist = ( a => 1, b => 2);
    my $string = 'abc';
    my $entity = join '|', keys %entitylist;
    $string =~ s/($entity)/$entitylist{$1}/g;
    print $string, "\n";
    __END__
    12c

    (I actually tested this before my original post, but I only pasted in the relevant portion.)

Re: Improve processing time for string substitutions
by liverpole (Monsignor) on Apr 16, 2007 at 13:47 UTC
    Hi valavanp,

    Do the contents of the file have to be in one $string?

    If you can read the lines of the file into an array instead, it should take less processing time, as the regex won't have to scan (and modify) one huge, single line each time.
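    A rough sketch of the line-by-line idea, combined with a precompiled alternation. The table and the input are placeholders, and an in-memory filehandle stands in for the real file; whether this is actually faster on the poster's data would need measuring.

```perl
use strict;
use warnings;

# Hypothetical table and input; the real ones come from the poster's files.
my %entitylist = ( '&ge' => '&#x2265', '&le' => '&#x2264' );
my $input = "x &ge y\nz &le w\n";

# Compile the alternation once, outside the loop.
my $alt = join '|', map { quotemeta $_ } keys %entitylist;
my $re  = qr/($alt)/;

# An in-memory filehandle stands in for the real input file.
open my $in, '<', \$input or die "Cannot open input: $!";
my $output = '';
while (my $line = <$in>) {
    $line =~ s/$re/$entitylist{$1}/g;   # one pass per line
    $output .= $line;
}
close $in;

print $output;
```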

Re: Improve processing time for string substitutions
by Krambambuli (Deacon) on Apr 16, 2007 at 14:35 UTC
    I'm having difficulties in understanding the exact form of the entities - is it just my browser/setup or is something wrong with the formatting ?

    I'd be anxious to know what Benchmark would show about the various speedup suggestions.

    If the entity content can be isolated by successively splitting on '>' and then on '<', then replacing the content from the hash and re-joining the modified parts might even beat the regexp-based approach.
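    For anyone who wants to measure it, a comparison with the core Benchmark module could look roughly like this. The 200-entry table, the sample text and the sub names are invented for the sketch, and it only covers the per-entity loop versus a single alternation, not the splitting idea.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical fixed-width keys, so no key is a prefix of another.
my %entitylist = map {
    ( sprintf('&e%03d', $_) => sprintf('&#x%04X', 0x2200 + $_) )
} 1 .. 200;
my $text = join ' ', map { sprintf('word &e%03d', $_) } 1 .. 200;

# One pass over the text per entity, as in the original question.
sub per_entity {
    my $s = $text;
    for my $key (keys %entitylist) {
        $s =~ s/\Q$key\E/$entitylist{$key}/g;
    }
    return $s;
}

# One pass over the text in total, using a single alternation.
my $alt = join '|', map { quotemeta $_ } keys %entitylist;
my $re  = qr/($alt)/;
sub single_pass {
    my $s = $text;
    $s =~ s/$re/$entitylist{$1}/g;
    return $s;
}

# Make sure both variants agree before timing them.
die "results differ" unless per_entity() eq single_pass();

timethese(200, {
    per_entity  => \&per_entity,
    single_pass => \&single_pass,
});
```

    The numbers will depend heavily on the size of the table and the text, so it is worth running this against real data before drawing conclusions.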


      It's unescaped HTML entities:

      I want to convert the entities like &ge, &le, into hexa values &#x2265, &#x2264.

      But I'll also point out that the actual entity should end in a semicolon, which prevents issues such as '&or' matching the start of '&ordf;'.
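      If the keys really have no trailing semicolon, one possible workaround for the alternation-based replies is to put the longest keys first, since Perl tries alternatives from left to right. A minimal sketch with a made-up two-entry table:

```perl
use strict;
use warnings;

# Hypothetical table without trailing semicolons, as in the question.
my %entitylist = ( '&or' => '&#x2228', '&ordf' => '&#xAA' );

# Longest keys first, so '&ordf' is tried before its prefix '&or'.
my $alt = join '|',
    map  { quotemeta $_ }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %entitylist;

(my $string = 'p &or q, 1&ordf') =~ s/($alt)/$entitylist{$1}/g;
print "$string\n";   # p &#x2228 q, 1&#xAA
```

      With the keys in hash order instead, '&or' could win at the '&ordf' position and leave a stray 'df' behind.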

        Well then, it's something like below that might fit.

        The results are interesting:
        Benchmark: timing 1000 iterations of regexish, with_splitting...
              regexish: 59 wallclock secs (58.16 usr +  0.02 sys = 58.18 CPU) @ 17.19/s (n=1000)
        with_splitting:  0 wallclock secs ( 0.02 usr +  0.00 sys =  0.02 CPU) @ 50000.00/s (n=1000)
                    (warning: too few iterations for a reliable count)
