
Re: Very slow regex substitution on Unicode string

by petdance (Parson)
on May 26, 2011 at 04:23 UTC (#906747)

in reply to Very slow regex substitution on Unicode string

In addition to what tchrist said, it's worth noting that non-greedy (minimal) matching such as .*? can be much slower than other choices because of all the backtracking it has to do.

I don't know what your data looks like, but if you can tighten the regex based on what you know about the data, that will help. For instance, in the HTML comments you're capturing HINT followed by almost anything. Does it need to be that liberal? Or are you really only looking for HINT followed by some non-whitespace? If you can change the HINT.*? part of your regex to HINT\S*, it should run much faster.
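To make the suggestion concrete, here is a minimal sketch of the two forms side by side. The sample line is made up, since the real data was not posted, and it assumes the hint is a single token with no whitespace in it.

```perl
use strict;
use warnings;

# Hypothetical sample line; the real data was not shown in the thread.
my $line = 'text <!-- HINT_42 --> more text';

# Non-greedy form: the match is extended one character at a time,
# checking for " -->" after each step.
(my $lazy = $line) =~ s/<!-- HINT.*? -->//;

# Restricted form: \S* takes the run of non-whitespace first,
# then " -->" has to follow.
(my $tight = $line) =~ s/<!-- HINT\S* -->//;

print "lazy:  [$lazy]\n";
print "tight: [$tight]\n";
```

The two are not interchangeable: the \S* form only matches when nothing but non-whitespace sits between HINT and the closing " -->", so it is only an option if the data really looks like that.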

Of course, that's just a guess on my part, but the more you can do to help the regex matcher limit its range of work, the better.


Re^2: Very slow regex substitution on Unicode string
by pbijnens (Novice) on May 26, 2011 at 06:58 UTC

    There is not so much backtracking going on I think.

    I need to delete the whole XML comment when it starts with space+HINT. I think the fastest way is:

    s/<!-- HINT.*? -->//
    The ".*?" should do minimal matching, i.e. stop as soon as the string so far is followed by the literal space-dash-dash-greater-than. This should mean it does no backtracking at all, I believe. I do not think I can make it faster, or am I wrong?
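A small sketch of what minimal matching means for this substitution, compared with the greedy form. The input string is hypothetical and only serves to show where each match ends; it says nothing about speed.

```perl
use strict;
use warnings;

# Hypothetical input with two comments; only the first starts with " HINT".
my $xml = '<a/><!-- HINT remove me --><b/><!-- keep me --><c/>';

# Minimal: stops at the first " -->" after "<!-- HINT".
(my $minimal = $xml) =~ s/<!-- HINT.*? -->//;

# Greedy: runs on to the last " -->" in the string.
(my $greedy = $xml) =~ s/<!-- HINT.* -->//;

print "minimal: $minimal\n";
print "greedy:  $greedy\n";
```

With the minimal form the second comment survives; the greedy form would swallow everything up to the last comment close, which is why .*? is the right choice for correctness here, whatever it costs in time.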

    Also the removal of the initial BOM should not lead to much backtracking. The regex engine should look at most at one character ahead in this case (I think).

    And note that the whole regex was anchored at the beginning of the string too; again leading to much less work.

    I'll just wait for CentOS 6.x to come out, which has perl5.10, and until then just split up the regex. Good enough for me.
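A rough sketch of what splitting the work into separate substitutions might look like. The original combined regex is not quoted in this reply, so the BOM pattern and the sample string below are assumptions, and the string is assumed to be already decoded to characters.

```perl
use strict;
use warnings;

# Hypothetical decoded input: a BOM (U+FEFF), then a hint comment, then content.
my $text = "\x{FEFF}<!-- HINT remove me --><root/>";

# Step 1: drop a leading BOM, anchored at the very start of the string.
$text =~ s/\A\x{FEFF}//;

# Step 2: drop the first comment that starts with " HINT".
$text =~ s/<!-- HINT.*? -->//;

print "$text\n";
```

Each statement does one small job, so either one can be timed or changed on its own.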

      You should get minimal backtracking. At the start of each word in the comment it will backtrack over the space between words.

      You should look into building your own up-to-date perl. I've found it to be remarkably easy on both Solaris & Linux boxes, and you get to use the latest & greatest perl for your own work. Just leave the system version where it is, and put your own on the #! line of your programs or put it earlier in your personal PATH.

