Re^2: Arbitrary number of captures in a regular expression
by throop (Chaplain)
on Sep 24, 2007 at 04:44 UTC
For those for whom '\G' is deep into 'executable line noise' country:
The \G anchor forces the next match to start where the last match left off. Use \G analogously to ^ at the beginning of a string: ^ matches only at the beginning of the string, while \G matches only at the beginning of what is left after a previous //g match has chewed off the front of the string.
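A minimal sketch of that behaviour, using a made-up sample string (the string and variable names are my own, not from Sidhekin's code). The /c modifier is there only so that a failed match leaves pos() alone:

```perl
use strict;
use warnings;

# Hypothetical sample string, chosen only for illustration.
my $str = "aaabbbccc";

# First //g match: \G starts at offset 0 and the run of 'a' is consumed.
my $ok_a        = ($str =~ /\Ga+/gc) ? 1 : 0;
my $pos_after_a = pos($str);              # 3

# \G pins the next attempt to offset 3, where a 'b' sits, so c+ fails
# even though there are 'c' characters later in the string.
my $ok_c           = ($str =~ /\Gc+/gc) ? 1 : 0;
my $pos_after_fail = pos($str);           # still 3, thanks to /c

# b+ does match at offset 3 and moves the position on to 6.
my $ok_b        = ($str =~ /\Gb+/gc) ? 1 : 0;
my $pos_after_b = pos($str);              # 6

print "$ok_a $pos_after_a $ok_c $pos_after_fail $ok_b $pos_after_b\n";
```

Without the \G, the c+ pattern would happily skip over the 'b's and match further along, just as a pattern without ^ can match anywhere in the string.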
perlfaq6 has more detail. (The internal hyperlink at perldoc.perl.org is broken; apparently the backslash discombobulated the escapeHTML routines. But this link will get you there.) The other piece of the puzzle is the '(?='. This handy expression, the 'zero-width positive lookahead', along with its evil twin '(?!' (the negative lookahead), is explained in more detail at perlretut.
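Here is a small sketch of the two lookaheads, again with an invented sample string. It records the offset at which each 'foo' matches ($-[0] is the start of the last successful match) and shows that the lookahead itself consumes nothing:

```perl
use strict;
use warnings;

# Hypothetical sample text, for illustration only.
my $text = "foobar foobaz";

my (@followed_by_bar, @not_followed_by_bar);

# (?=bar): match 'foo' only where 'bar' comes next.
push @followed_by_bar, $-[0] while $text =~ /foo(?=bar)/g;

# (?!bar): match 'foo' only where 'bar' does NOT come next.
push @not_followed_by_bar, $-[0] while $text =~ /foo(?!bar)/g;

# Zero width: 'bar' is checked but is not part of what gets matched.
my ($matched) = "foobar" =~ /(foo(?=bar))/;

print "@followed_by_bar | @not_followed_by_bar | $matched\n";
# prints "0 | 7 | foo"
```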
You may also want to review Non-capturing-groupings.
Let's take Sidhekin's piece of work apart, and not be quite so terse. As perlretut says:
Long regexps like this may impress your friends, but can be hard to decipher. In complex situations like this, the //x modifier for a match is invaluable. It allows one to put nearly arbitrary whitespace and comments into a regexp without affecting their meaning. Using it, we can rewrite our 'extended' regexp in the more pleasing form
So using the x modifier, the heart of Sidhekin's code becomes
Notice that whitespace is not significant when using the //x modifier, so where Sidhekin used a single blank space, I had to use a '\s'.
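To see the difference in isolation, here is a sketch using just the 'm \d+' fragment against a made-up input (this is not Sidhekin's full regexp, only an illustration of the whitespace rule):

```perl
use strict;
use warnings;

# Hypothetical input: an 'm', one space, then digits.
my $sample = "m 42";

# Without /x the blank in the pattern is a literal space, so this matches.
my $plain = ($sample =~ /m \d+/) ? 1 : 0;          # 1

# With /x the blank is ignored: the pattern is really 'm\d+', which
# cannot match because a space sits between the 'm' and the digits.
my $x_ignored = ($sample =~ /m \d+/x) ? 1 : 0;     # 0

# With /x the whitespace has to be spelled out, here as \s.
my $x_explicit = ($sample =~ /m \s \d+/x) ? 1 : 0; # 1

print "$plain $x_ignored $x_explicit\n";
```

Note that \s is a little broader than a literal blank, since it also matches tabs and newlines.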
This is a straightforward way for a programmer to do a greedy capture in the middle of the string. Realize, though, that it is not the most straightforward way for the computer. For each 'm \d+' expression in the string, the computer
This is a trivial amount of extra work on a single line. But if you are attempting to do something similar by, say, matching across line breaks and pattern searching a set of 120-page MS-Word documents, you may notice some performance problems.
Update: added detail