PerlMonks  

Re: Very slow regex substitution on Unicode string

by petdance (Parson)
on May 26, 2011 at 04:23 UTC (#906747)


in reply to Very slow regex substitution on Unicode string

In addition to what tchrist said, it's worth noting that parsimonious (non-greedy) matching can be much slower than other choices because of all the backtracking it has to do.

I don't know what your data looks like, but if you can put limits on the regex based on what you know about the data, that will help. For instance, in the HTML comments you're capturing HINT followed by almost anything. Does it need to be that liberal? Or are you really only looking for HINT followed by some non-whitespace? If you can change the HINT.*? part of your regex to HINT\S* you should get much faster times.

Of course, that's just a guess on my part, but the more you can do to help the regex matcher limit its range of work, the better.

xoxo,
Andy
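
To check whether the tighter pattern actually helps on a given data set, the two variants can be timed side by side. The following is only a sketch: the sample document, the closing " -->" marker in both patterns, and the sub names strip_lazy and strip_nonspace are assumptions made for illustration, not taken from the original poster's code. Note that the \S* variant is not equivalent to the lazy one: it only matches comments whose HINT body contains no whitespace.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical input: many short HINT comments whose bodies contain no whitespace.
my $doc = join '', map { "<p>text $_</p><!-- HINTid$_ -->" } 1 .. 2_000;

# Variant from the thread: lazy "match anything" up to the closing marker.
sub strip_lazy {
    my ($text) = @_;
    $text =~ s/<!-- HINT.*? -->//g;
    return $text;
}

# Tighter variant: only valid if the HINT body never contains whitespace.
sub strip_nonspace {
    my ($text) = @_;
    $text =~ s/<!-- HINT\S* -->//g;
    return $text;
}

# Sanity check: on this sample both variants must produce the same result.
die "variants disagree" unless strip_lazy($doc) eq strip_nonspace($doc);

# Run each variant for about one CPU second and print a comparison table.
cmpthese(-1, {
    lazy     => sub { strip_lazy($doc) },
    nonspace => sub { strip_nonspace($doc) },
});
```

The numbers depend heavily on the perl version and on the shape of the real data, so it is worth running this against a representative sample rather than trusting the synthetic one.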


Re^2: Very slow regex substitution on Unicode string
by pbijnens (Novice) on May 26, 2011 at 06:58 UTC

    I don't think there is that much backtracking going on.

    I need to delete the whole XML comment when it starts with a space followed by HINT. I think the fastest way is:

    s/<!-- HINT.*? -->//
    The ".*?" should do minimal matching, i.e. stop as soon as the string so far is followed by the literal space-dash-dash-greater-than. This should make it do no backtracking at all, I believe. I do not think I can make it faster, or am I wrong?
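
A small sketch of what minimal matching means for this substitution, contrasted with the greedy form. The sample string is made up for illustration; the copy-then-substitute idiom is used instead of the /r flag so that it also runs on older perls.

```perl
use strict;
use warnings;

# Hypothetical input with two comments; only the first is a HINT comment.
my $xml = '<a/><!-- HINT remove me --><b/><!-- keep me --><c/>';

# Lazy quantifier: stops at the first " -->" after "<!-- HINT".
(my $lazy = $xml) =~ s/<!-- HINT.*? -->//;

# Greedy quantifier for contrast: runs to the last " -->" in the string.
(my $greedy = $xml) =~ s/<!-- HINT.* -->//;

print "lazy:   $lazy\n";    # <a/><b/><!-- keep me --><c/>
print "greedy: $greedy\n";  # <a/><c/>
```

Without the /s modifier the dot does not match a newline, so neither form would remove a HINT comment that spans several lines.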

    Also, the removal of the initial BOM should not lead to much backtracking. The regex engine should look at most one character ahead in this case (I think).

    s/^\x{feff}*//
    And note that the whole regex was anchored at the beginning of the string too; again leading to much less work.
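
As an illustration of the BOM substitution on its own, here is a minimal sketch. The sample string is hypothetical, and it assumes the text has already been decoded into Perl characters; on raw UTF-8 bytes the BOM would be the three bytes EF BB BF instead.

```perl
use strict;
use warnings;

# Hypothetical decoded string: two leading byte order marks, one in the middle.
my $text = "\x{feff}\x{feff}<doc>caf\x{e9}\x{feff}</doc>";

# Anchored at the start, so only the leading U+FEFF characters are removed.
(my $clean = $text) =~ s/^\x{feff}*//;

printf "before: %d characters, after: %d characters\n",
    length $text, length $clean;
```

The U+FEFF inside the document is left alone, because the pattern can only match at the beginning of the string.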

    I'll just wait for CentOS 6.x to come out, which has perl5.10, and until then just split up the regex. Good enough for me.

      You should get minimal backtracking. At each space between words in the comment, the engine will try the " -->" tail, match the space, fail on the next character, and back up over that space.

      You should look into building your own up-to-date perl. I've found it to be remarkably easy on both Solaris & Linux boxes, and you get to use the latest & greatest perl for your own work. Just leave the system version where it is, and put your own on the #! line of your programs or put it earlier in your personal PATH.

      TJD
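
A quick way to confirm which perl a script ends up running under is to print the interpreter path and version. This is only a sketch: the path on the #! line is hypothetical and has to be replaced with wherever the private build was installed.

```perl
#!/opt/myperl/bin/perl
# The path above is hypothetical: point it at your own perl build.
use strict;
use warnings;

# $^X is the path of the running interpreter, $] its numeric version.
my $report = sprintf "running %s (version %s)", $^X, $];
print "$report\n";
```

If the script is started as "perl script.pl" instead of being executed directly, the perl found first in PATH is used, which is where putting the private build earlier in PATH comes in.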
