FMTYEWTK about split //
by ysth (Canon)
on Jan 21, 2004 at 00:49 UTC
split //, $string is the usual idiom for splitting a string into a list of its characters.
Why it works as it does is actually quite complex.
First of all, match (m//) and substitution (s///) have a special case for an empty regex: they will apply the last successful regex instead.
So an empty match (//) against $str2 that runs after matches against /ab/ and /ef/ may apply the regex /ab/ or /ef/ (or some earlier regex, if neither of those succeeded) to $str2. (Note that "previous" here means execution order, not linear order in the source.) If no previous regex succeeded, an empty regex is actually used.
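A minimal sketch of the empty-pattern behaviour, assuming everything runs in one scope with no intervening matches. The variable names and sample strings are illustrative, not taken from the original post.

```perl
use strict;
use warnings;

# Illustrative strings (assumptions, not from the original post).
my $str1 = "xxefxx";
my $str2 = "abcdef";

# /ab/ fails on $str1, so /ef/ is tried and succeeds; /ef/ is now the
# last successful regex.
my $seeded = ($str1 =~ /ab/ || $str1 =~ /ef/) ? 1 : 0;

# The empty pattern reuses /ef/ instead of matching the empty string.
my $reused = ($str2 =~ //) ? $& : undef;

print "seeded=$seeded reused=<$reused>\n";
```

Had the empty pattern really been empty, it would have matched the zero-length string at the start of $str2 and $reused would be "".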
This is a fairly volatile feature, since any intervening code that uses a regex will change the results (e.g. a regex in a tie method implicitly invoked, or in the main code of a require'd file). It's also a barrier to integration of the defined-or patch (see defined or: // and //= and Re: Perl 5.8.3) to 5.8.x, since a // where perl may be expecting either an operator or a term could mean defined-or or could mean ($_ =~ //). Without the feature, the latter would be overwhelmingly less likely to occur in real code.
People more often use this "feature" by accident than on purpose, with code like m/$myregex/ where $myregex is empty (since the "is it an empty regex" test occurs after interpolation). One solution is to use m/(?#)$myregex/ if you anticipate that $myregex may be empty.
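A small sketch of the accidental case and the (?#) guard, assuming an earlier successful match is still in effect. The strings and the /seed/ pattern are hypothetical.

```perl
use strict;
use warnings;

my $text = "foobar";

# Some earlier successful match; /seed/ is now the last successful regex.
my $earlier = ("seed" =~ /seed/) ? 1 : 0;

# Hypothetical variable that unexpectedly holds an empty pattern.
my $myregex = '';

# Unguarded: the pattern is empty after interpolation, so /seed/ is
# applied to "foobar" and the match fails.
my $unguarded = ($text =~ m/$myregex/) ? 1 : 0;

# Guarded: the (?#) comment keeps the pattern non-empty, so it behaves
# as a genuine empty regex and matches at the start of any string.
my $guarded = ($text =~ m/(?#)$myregex/) ? 1 : 0;

print "unguarded=$unguarded guarded=$guarded\n";
```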
But all that is beside the point, because special treatment of // (documented with respect to s/// and m// in perlop) is not a feature of perl's regexes but a feature of the match and substitution operators, and doesn't apply to split at all.
So what does happen when you say split //, $str?
Well, in general terms, split returns X pieces of a string that result from applying a regex X-1 times and removing the parts that matched, so split /b/, "abc" produces the list ("a","c"). (Throughout, I will ignore the effects of placing capturing parentheses in the regex.)
Similarly, split //, "ac" matches the empty string between the letters and returns ("a","c").
The analytic of mind will note that there are also empty strings before and after the "ac". Spreading it out, the regex will match at each //: "// a // c //", making 3 divisions in the string, so you might expect split to return a list of the four pieces produced ("","a","c",""), but instead a little dwimmery comes into play here.
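The two basic cases can be checked with a short script. The show helper is only an illustrative way of making empty fields visible.

```perl
use strict;
use warnings;

# Illustrative helper: wrap each field in <> so empty fields show up.
sub show { return join ",", map { "<$_>" } @_ }

my @by_b    = split /b/, "abc";
my @by_null = split //,  "ac";

print "split /b/, 'abc' : ", show(@by_b),    "\n";   # <a>,<c>
print "split //,  'ac'  : ", show(@by_null), "\n";   # <a>,<c>
```

Neither the leading nor the trailing empty string appears in the second result.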
Dealing first with the empty string at the end, split has a third parameter for limiting the number of substrings to produce (which normally defaults to 0, and where <= 0 means unlimited), so split /b/, "abcba", 2 returns ("a","cba"). As a special case, if the limit is 0, trailing empty fields are not returned. However, if the limit is less than zero or large enough to include empty trailing fields, they will be returned: split /b/, "ab", 2 for example does return "a" and an empty trailing field, while split /b/, "ab" returns only an "a".
The same provision applies to the empty string following the zero-width match at the end of the string. split //, "a" returns only the "a", while split //, "a", 2 returns ("a","").
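A sketch of how the limit affects trailing empty fields, using the examples from the text plus a negative limit. The show helper is illustrative.

```perl
use strict;
use warnings;

# Illustrative helper: wrap each field in <> so empty fields show up.
sub show { return join ",", map { "<$_>" } @_ }

my @limited   = split /b/, "abcba", 2;   # stop after two fields
my @default   = split /b/, "ab";         # limit 0: trailing empty field dropped
my @with_lim  = split /b/, "ab", 2;      # limit large enough: trailing field kept
my @negative  = split /b/, "ab", -1;     # negative limit: trailing field kept
my @chars     = split //, "a";           # same rule for the zero-width match
my @chars_lim = split //, "a", 2;

print show(@limited),   "\n";   # <a>,<cba>
print show(@default),   "\n";   # <a>
print show(@with_lim),  "\n";   # <a>,<>
print show(@negative),  "\n";   # <a>,<>
print show(@chars),     "\n";   # <a>
print show(@chars_lim), "\n";   # <a>,<>
```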
(I said "normally defaults to 0" because in one case, this doesn't apply: if the split is the only thing on the right of an assignment to a list of scalars, the limit will default to one more than the number of scalars. This is intended as an optimization, but can have odd consequences. For instance, my ($a,$b,$c) = split //, "a" will result in the split having a default limit of 4, overriding the usual suppression of the empty trailing field: split will return ("a",""), leaving $b blank and $c undefined.)
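A sketch contrasting assignment to a list of scalars with assignment to an array, assuming a perl where the implicit limit is applied as described. The variable names are illustrative ($a and $b are avoided because sort uses them).

```perl
use strict;
use warnings;

# Assigning to three scalars: the implicit limit becomes 4, so the
# empty trailing field survives.
my ($first, $second, $third) = split //, "a";

# Assigning to an array: the limit stays 0 and the trailing field is dropped.
my @all = split //, "a";

printf "first=<%s> second is %s, third is %s, array has %d element(s)\n",
    $first,
    (defined $second ? "defined (<$second>)" : "undef"),
    (defined $third  ? "defined (<$third>)"  : "undef"),
    scalar @all;
```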
But there is also an empty string before the zero-width match at the beginning of the string. The above methodology doesn't apply to that. If you say split /a/, "ab" it will break "ab" into two strings: ("","b"), whether or not limit is specified (unless you limit it to one return, which basically will always ignore the pattern and return the whole original string).
Similarly, split //, "b" doesn't base returning or not returning the leading "" on limit. Instead, a different rule applies. That rule is that zero-width matches at the beginning of the string won't pare off the preceding empty string; instead, it is discarded. So while split /a/, "ab" does produce ("","b"), split //, "b" only produces ("b").
This rule applies not only to the empty regex //, but to any regex that produces a zero-width match, e.g. /^/m. (While on the topic of /^/, that is special-cased for split to mean the equivalent of /^/m, as it would otherwise be pretty useless.) So split /(?=b)/, "b" returns ("b"), not ("","b").
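The leading-field rules can be compared side by side. This is a sketch with an illustrative show helper; the "x\ny\n" string is an assumed example for the /^/ special case.

```perl
use strict;
use warnings;

# Illustrative helper: wrap each field in <> so empty fields show up.
sub show { return join ",", map { "<$_>" } @_ }

my @lead_kept  = split /a/, "ab";        # non-zero-width match at start: "" kept
my @lead_lim1  = split /a/, "ab", 1;     # limit 1: whole string returned
my @null_start = split //, "b";          # zero-width match at start: "" discarded
my @lookahead  = split /(?=b)/, "b";     # same for any zero-width match
my @lines      = split /^/, "x\ny\n";    # /^/ is treated as /^/m

print show(@lead_kept),  "\n";   # <>,<b>
print show(@lead_lim1),  "\n";   # <ab>
print show(@null_start), "\n";   # <b>
print show(@lookahead),  "\n";   # <b>
print scalar(@lines), " lines\n";
```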
One last consideration, which also plays a part with s/// and m//: if you match a zero-width string, why doesn't the next attempt at a match also do so in the same place? For instance, $_ = "a"; print "at pos:",pos," matched <$&>" while /(?=a)/g looks as though it should loop forever, since after the first match, the position is still at 0 and there is an "a" following. Applying this logic to split //, you can see that the // would match over and over without advancing in the string. To prevent this, any match that advances through the string is only allowed to zero-width match once at any given position. If a subsequent match would have come up with a zero width at the same position, that match is not allowed. This rule applies whether perl is in a match loop within a single operation (s///, split, or list-context m//g), in a loop in perl code (e.g. the while /(?=a)/g loop above), or even in two independent m//g matches.
For example: $_ = "3"; /(?=\w)/g && /\d??/g && print $&; does print "3", even though the ?? requests a 0 digit match be preferred over 1 digit, because a 0-length match isn't allowed at that position.
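A sketch of the once-per-position rule, using lexical variables instead of $_. The safety counter in the loop is an added precaution and is not expected to trigger.

```perl
use strict;
use warnings;

# The lookahead matches with zero width at position 0 exactly once;
# the next attempt cannot repeat it there, so the loop ends.
my $str = "a";
my @positions;
while ($str =~ /(?=a)/g) {
    push @positions, pos($str);
    last if @positions > 5;      # safety net, should never fire
}

# Two independent //g matches on the same string: the second cannot
# match zero characters at position 0, so \d?? is forced to take "3".
my $digit  = "3";
my $forced = ($digit =~ /(?=\w)/g && $digit =~ /\d??/g) ? $& : undef;

# On a fresh string the non-greedy \d?? is free to match nothing.
my $fresh = "3";
my $lazy  = ($fresh =~ /\d??/g) ? $& : undef;

print "positions: @positions\n";
print "forced=<$forced> lazy=<$lazy>\n";
```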
(Update: s/FMTEYEWTK/FMTYEWTK/; googlefight shows the latter winning by 10:1)
Update: this isn't really a tutorial, or at least it's an inside out one. (That is, it's taking a single line of code and explaining how lots of different things affect (or don't affect) it, rather than setting out to explain those different things generically). If time allows, I may rewrite it as one. There's lots of good stuff to talk about with split.