Multi-line Regex Performance
by pboin (Deacon) on Nov 01, 2005 at 15:35 UTC
pboin has asked for the wisdom of the Perl Monks concerning the following question:
I've got a rather large (327M) file of multiple line records to process. I'm picking up each record, pulling the key out, and then looking into a hash to see if it's a 'keeper' or not.
Records look like this:
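From the regex description below, each record presumably starts with a `##` header line carrying the key, followed by free-form body lines until the next `##` (this layout is an assumption):

```
##KEY1
line one of the record
line two of the record
##KEY2
...
```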
I've got a regex that technically works, but the performance is very poor (over 1 sec. per record). I wondered about slurping the whole file in, and since I'm not swapping at all, that 327M should fit in memory without a problem, right? My best hunch is that the regex is smarter than I am, and it is doing some heavy-duty backtracking that I'm not understanding.
I think the regex should start capturing at the double-hashes, then continue matching anything non-greedily (including newlines) until a positive lookahead finds either (a) more double-hashes, denoting a new record, or (b) end-of-string.
My code looks like this:
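A minimal sketch consistent with the description above (the `##` delimiter, key layout, and `%keepers` hash are assumptions, not the original code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical keeper hash: keys we want to retain.
my %keepers = ( KEY1 => 1 );

my $data = do { local $/; <DATA> };    # slurp everything

my @kept;
# Non-greedy .*? under /s (dot matches newline), stopped by a
# lookahead for the next '^##' or end-of-string (\z).  This is
# the construct that backtracks so badly on a large input.
while ( $data =~ /^(\#\#(\S+).*?)(?=^\#\#|\z)/smg ) {
    push @kept, $1 if exists $keepers{$2};
}
print @kept;

__DATA__
##KEY1
line one
line two
##KEY2
other data
```

On a small input this works fine; the trouble only shows up at scale, when `.*?` has to be retried against millions of characters.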
This is definitely a non-linear problem...
Prompted by some insightful responses, I decided to minimize my dependence on regex in this case. I decided to buffer the file manually, and got the 80k line test time down to just over 1.2 seconds.
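A sketch of that line-buffered rewrite (again, the record layout and `%keepers` are assumptions): read line by line, start a fresh buffer at each `##` header, and keep a finished record only if its key is in the hash. No multi-line regex, so no catastrophic backtracking.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical keeper hash: keys we want to retain.
my %keepers = ( KEY1 => 1 );

my @kept;
my ( $buf, $key ) = ( '', '' );
while ( my $line = <DATA> ) {
    if ( $line =~ /^\#\#(\S+)/ ) {     # header line: a new record begins
        push @kept, $buf if length $key && exists $keepers{$key};
        ( $buf, $key ) = ( '', $1 );
    }
    $buf .= $line;
}
push @kept, $buf if length $key && exists $keepers{$key};   # flush the last record
print @kept;

__DATA__
##KEY1
line one
##KEY2
other data
```

Each input character is now examined once, so the runtime is linear in file size instead of blowing up per record.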
The performance problem with the regex is directly addressed and explained quite well by TheDamian in his must-read book Perl Best Practices. I almost want to keep the secret, because I feel so strongly that every Perl programmer should have this book. But... the extensive tracking and back-tracking from .* seems to be the problem. Read about it in 'Unconstrained Repetitions' on p. 250.
Thanks for taking a look...