
FMTYEWTK about split //

by ysth (Canon)
on Jan 21, 2004 at 00:49 UTC (#322751=perlmeditation)

split //, $string is the usual idiom for splitting a string into a list of its characters.

Why it works as it does is actually quite complex.

First of all, match (m//) and substitution (s///) have a special case for an empty regex: they will apply the last successful regex instead, so:

    if (condition) {
        $str =~ /ab/;
    } else {
        $str =~ /ef/;
    }
    $str2 =~ //;
may apply the regex /ab/ or /ef/ (or some previous regex, if that match didn't succeed) to $str2. (Note that "previous" here means execution order, not linear order in the source.) If no previous regex has succeeded, a genuinely empty regex is used.
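A minimal sketch of that behavior, for anyone who wants to see it happen. The variable names and sample strings are made up for illustration, and everything runs at file scope so that no other match sneaks in between.

```perl
use strict;
use warnings;

my $first  = 'abcdef';
my $second = 'xyz';
my $third  = 'cab';

# A successful match: /ab/ becomes the "last successful pattern".
$first =~ /ab/ or die "unexpected: /ab/ should match";

# The empty pattern does not mean "match nothing" here; it re-applies /ab/.
my $second_matched = ($second =~ //) ? 1 : 0;   # 'xyz' contains no "ab"
my $third_matched  = ($third  =~ //) ? 1 : 0;   # 'cab' does

print "second: $second_matched, third: $third_matched\n";
# prints "second: 0, third: 1"
```

A truly empty regex would have matched 'xyz' as well, so the 0 for the second string is the giveaway that /ab/ was reused.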

This is a fairly volatile feature, since any intervening code that uses a regex will change the results (e.g. a regex in a tie method implicitly invoked, or in the main code of a require'd file). It's also a barrier to integration of the defined-or patch (see the nodes "defined or: // and //=" and "Re: Perl 5.8.3") to 5.8.x, since a // where perl may be expecting either an operator or a term could mean defined-or or could mean ($_ =~ //). Without the feature, the latter would be overwhelmingly less likely to occur in real code.

People more often use this "feature" by accident than on purpose, with code like m/$myregex/ where $myregex is empty (since the "is it an empty regex" test occurs after interpolation). One solution is to use m/(?#)$myregex/ if you anticipate that $myregex may be empty.
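A small sketch of the accident and the (?#) guard. The name $myregex comes from the text above; the sample strings are assumptions chosen so the two cases give different answers.

```perl
use strict;
use warnings;

my $myregex = '';    # a pattern that happened to end up empty

'abc' =~ /b/ or die "unexpected: /b/ should match";

# After interpolation the pattern is empty, so /b/ is silently reused
# and fails against 'xyz'.
my $accidental = ('xyz' =~ /$myregex/) ? 1 : 0;

# The empty comment (?#) keeps the pattern text non-empty, so this is a
# genuine "match the empty string" and succeeds anywhere.
my $guarded = ('xyz' =~ /(?#)$myregex/) ? 1 : 0;

print "accidental: $accidental, guarded: $guarded\n";
# prints "accidental: 0, guarded: 1"
```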

But all that is beside the point, because special treatment of // (documented with respect to s/// and m// in perlop) is not a feature of perl's regexes but a feature of the match and substitution operators, and doesn't apply to split at all.

So what does happen when you say split //, $str?

Well, in general terms, split returns X pieces of a string that result from applying a regex X-1 times and removing the parts that matched, so split /b/, "abc" produces the list ("a","c"). (Throughout, I will ignore the effects of placing capturing parentheses in the regex.)

Similarly, split //, "ac" matches the empty string between the letters and returns ("a","c").

The analytic of mind will note that there are also empty strings before and after the "ac". Spreading it out, the regex will match at each //: "// a // c //", making 3 divisions in the string, so you might expect split to return a list of the four pieces produced ("","a","c",""), but instead a little dwimmery comes into play here.
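To make the piece counts concrete, here is a short sketch using the two strings from the text; the array names are just for illustration.

```perl
use strict;
use warnings;

my @by_b     = split /b/, 'abc';   # the "b" is removed, leaving two pieces
my @by_empty = split //,  'ac';    # two pieces, not the four one might expect

print scalar(@by_b),     " pieces: @by_b\n";       # prints "2 pieces: a c"
print scalar(@by_empty), " pieces: @by_empty\n";   # prints "2 pieces: a c"
```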

Dealing first with the empty string at the end, split has a third parameter for limiting the number of substrings to produce (which normally defaults to 0, and where <= 0 means unlimited), so split /b/, "abcba", 2 returns ("a","cba"). As a special case, if the limit is 0, trailing empty fields are not returned. However, if the limit is less than zero or large enough to include empty trailing fields, they will be returned: split /b/, "ab", 2 for example does return "a" and an empty trailing field, while split /b/, "ab" returns only an "a".
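The limit cases above can be sketched side by side. The show() helper is a hypothetical convenience that quotes each field so empty strings are visible.

```perl
use strict;
use warnings;

# Quote every field so that empty strings show up in the output.
sub show { return join ', ', map { "'$_'" } @_ }

my @limited   = split /b/, 'abcba', 2;   # stop after two fields
my @default   = split /b/, 'ab';         # limit 0: trailing empty field dropped
my @limit_two = split /b/, 'ab', 2;      # room for the trailing empty field
my @negative  = split /b/, 'ab', -1;     # negative limit: keep everything

print 'limit 2 on abcba: ', show(@limited),   "\n";   # 'a', 'cba'
print 'no limit on ab:   ', show(@default),   "\n";   # 'a'
print 'limit 2 on ab:    ', show(@limit_two), "\n";   # 'a', ''
print 'limit -1 on ab:   ', show(@negative),  "\n";   # 'a', ''
```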

The same provision applies to the empty string following the zero-width match at the end of the string. split //, "a" returns only the "a", while split //, "a", 2 returns ("a","").
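The same comparison for the empty pattern, as a quick sketch. Again show() is only an illustrative helper, and the 'ac' line with a negative limit is an extra case added here to show the trailing field on a longer string.

```perl
use strict;
use warnings;

# Quote every field so that empty strings show up in the output.
sub show { return join ', ', map { "'$_'" } @_ }

my @default  = split //, 'a';         # trailing empty field suppressed
my @limit2   = split //, 'a', 2;      # limit leaves room for it
my @negative = split //, 'ac', -1;    # negative limit keeps it too

print 'no limit: ', show(@default),  "\n";   # 'a'
print 'limit 2:  ', show(@limit2),   "\n";   # 'a', ''
print 'limit -1: ', show(@negative), "\n";   # 'a', 'c', ''
```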

(I said "normally defaults to 0" because in one case, this doesn't apply: if the split is the only thing on the right of an assignment to a list of scalars, the limit will default to one more than the number of scalars. This is intended as an optimization, but can have odd consequences. For instance, my ($a,$b,$c) = split //, "a" will result in the split having a default limit of 4, overriding the usual suppression of the empty trailing field: split will return ("a",""), leaving $b blank and $c undefined.)
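A way to check the implicit limit on whatever perl is at hand. This sketch deliberately reports what it finds rather than assuming a result; the variable names ($x, $y, $z and friends) are illustrative, chosen to avoid the sort variables $a and $b.

```perl
use strict;
use warnings;

# Describe a scalar as undef, an empty string, or its value.
sub describe {
    my ($value) = @_;
    return 'undef' unless defined $value;
    return $value eq '' ? 'empty string' : "'$value'";
}

# split is the only thing on the right of a list-of-scalars assignment,
# so the implicit limit described above should apply here.
my ($x, $y, $z) = split //, 'a';

# Going through an array first: no implicit limit on the split itself.
my @fields = split //, 'a';
my ($p, $q, $r) = @fields;

print "direct:    x=", describe($x), " y=", describe($y), " z=", describe($z), "\n";
print "via array: p=", describe($p), " q=", describe($q), " r=", describe($r), "\n";
```

If the description in the text holds, $y shows up as an empty string in the direct assignment while $q is undef in the array version.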

But there is also an empty string before the zero-width match at the beginning of the string. The above methodology doesn't apply to that. If you say split /a/, "ab" it will break "ab" into two strings: ("","b"), whether or not limit is specified (unless you limit it to one return, which basically will always ignore the pattern and return the whole original string).
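A sketch of the leading-field behavior with a pattern that matches at least one character. The show() helper is hypothetical and only there to make the empty string visible.

```perl
use strict;
use warnings;

# Quote every field so that empty strings show up in the output.
sub show { return join ', ', map { "'$_'" } @_ }

my @default = split /a/, 'ab';       # leading empty field is kept
my @limit5  = split /a/, 'ab', 5;    # a generous limit changes nothing
my @limit1  = split /a/, 'ab', 1;    # one field: the whole string, unsplit

print 'no limit: ', show(@default), "\n";   # '', 'b'
print 'limit 5:  ', show(@limit5),  "\n";   # '', 'b'
print 'limit 1:  ', show(@limit1),  "\n";   # 'ab'
```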

Similarly, split //, "b" doesn't base returning or not returning the leading "" on limit. Instead, a different rule applies. That rule is that zero-width matches at the beginning of the string won't pare off the preceding empty string; instead, it is discarded. So while split /a/, "ab" does produce ("","b"), split //, "b" only produces ("b").

This rule applies not only to the empty regex //, but to any regex that produces a zero-width match, e.g. /^/m. (While on the topic of /^/, that is special-cased for split to mean the equivalent of /^/m, as it would otherwise be pretty useless.) So split /(?=b)/, "b" returns ("b"), not ("","b").
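The zero-width cases can be put next to the one-character case from earlier. This is only a sketch; show() is an illustrative helper, and the two-line string fed to /^/ is an assumption added to show the special case at work.

```perl
use strict;
use warnings;

# Quote every field; newlines are made visible as \n.
sub show {
    return join ', ', map { (my $f = $_) =~ s/\n/\\n/g; "'$f'" } @_;
}

my @one_char  = split /a/,     'ab';         # real match at the start: '' kept
my @empty_re  = split //,      'b';          # zero-width match at the start
my @lookahead = split /(?=b)/, 'b';          # also zero-width at the start
my @lines     = split /^/,     "x\ny\n";     # /^/ treated as /^/m by split

print 'split /a/:     ', show(@one_char),  "\n";   # '', 'b'
print 'split //:      ', show(@empty_re),  "\n";   # 'b'
print 'split /(?=b)/: ', show(@lookahead), "\n";   # 'b'
print 'split /^/:     ', show(@lines),     "\n";   # 'x\n', 'y\n'
```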

One last consideration, which also plays a part with s/// and m//: if you match a zero-width string, why doesn't the next attempt at a match also do so in the same place? For instance, $_ = "a"; print "at pos:",pos," matched <$&>" while /(?=a)/g looks as though it should loop forever, since after the first match, the position is still at 0 and there is an "a" following. Applying this logic to split //, you can see that the // would match over and over without advancing in the string. To prevent this, any match that advances through the string is only allowed to zero-width match once at any given position. If a subsequent match would have come up with a zero width at the same position, that match is not allowed. This rule applies whether perl is in a match loop within a single operation (s///, split, or list-context m//g), in a loop in perl code (e.g. the above print ... while /(?=a)/g), or even in two independent m//g matches.
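A sketch that records where a zero-width /g match succeeds, to show that the loop does terminate. The match_positions() helper and its safety cap are assumptions added for illustration.

```perl
use strict;
use warnings;

# Collect the positions at which a /g match succeeds. The cap is only a
# safety net in case the loop failed to terminate; it should never be hit.
sub match_positions {
    my ($str, $re) = @_;
    my @positions;
    while ($str =~ /$re/g) {
        push @positions, pos($str);
        last if @positions > 100;
    }
    return @positions;
}

my @single = match_positions('a',   qr/(?=a)/);
my @triple = match_positions('aaa', qr/(?=a)/);

print "on 'a':   @single\n";   # prints "on 'a':   0"
print "on 'aaa': @triple\n";   # prints "on 'aaa': 0 1 2"
```

Each position yields exactly one zero-width match, after which the engine has to move on or give up.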

For example: $_ = "3"; /(?=\w)/g && /\d??/g && print $&; does print "3", even though the ?? requests that a 0-digit match be preferred over a 1-digit one, because a 0-length match isn't allowed at that position.
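The same example with a control case next to it, as a sketch. The names $fresh and $primed are made up; the only difference between the two strings is whether a zero-width /g match has already happened at position 0.

```perl
use strict;
use warnings;

# Control: no earlier match on this string, so \d?? is free to take its
# preferred zero-length match.
my $fresh = '3';
$fresh =~ /\d??/g or die "unexpected: should match";
my $fresh_match = $&;

# Same pattern, but a zero-width match has already been made at position 0.
my $primed = '3';
$primed =~ /(?=\w)/g or die "unexpected: lookahead should match";
$primed =~ /\d??/g   or die "unexpected: should match";
my $primed_match = $&;

print "fresh:  <$fresh_match>\n";    # prints "fresh:  <>"
print "primed: <$primed_match>\n";   # prints "primed: <3>"
```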

(Update: s/FMTEYEWTK/FMTYEWTK/; googlefight shows the latter winning by 10:1)

Update: this isn't really a tutorial, or at least it's an inside out one. (That is, it's taking a single line of code and explaining how lots of different things affect (or don't affect) it, rather than setting out to explain those different things generically). If time allows, I may rewrite it as one. There's lots of good stuff to talk about with split.

Update: added links to defined-or stuff; thanks graff, Trimbach, hardburn

Replies are listed 'Best First'.
Re: FMTYEWTK about split //
by graff (Chancellor) on Jan 21, 2004 at 03:52 UTC
    It's also a barrier to integration of the defined-or patch to 5.8.x, since a // where perl may be expecting either an operator or a term could mean defined-or or could mean ($_ =~ //). Without the feature, the latter would be overwhelmingly less likely to occur in real code.

    Whoa... I tried really hard, but I really didn't get this at all. It's obvious you understand what you're talking about, but I think a large number of readers here (certainly most of those who would go to the Tutorials wing where this is likely to end up) won't have a clue what "the defined-or patch to 5.8" refers to, let alone what sort of distinction you're trying to make here. If this is really an important point, provide some more detail, and perhaps some code snippet(s) with comments or contrasting outputs to clarify the point. If it's not that important, then take it out, because it isn't helping.

    The rest provides some useful detail (i.e. things that folks would want to know when using split // to best effect), but there is also a bit of useless detail (i.e. pedantry), which I would not commend in a "tutorial" piece.

    I'd suggest you give it a day or two, then re-read it and consider how you would write it differently...

        Yes, I know what the term "defined-or" refers to, but an average perl user knowing the arcanery involved in "the defined-or patch to 5.8" is sort of like a plumber knowing the particular alloy properties that distinguish the steel in his old hammer from that of his new one. Sure, a few plumbers may know something about this...

      Alternatively, thanks to the wonderful invention of Hypertext, one can simply link to a node that explains about it, and not have to clutter our writings explaining tangentially related points.

      I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
      -- Schemer

      : () { :|:& };:

      Note: All code is untested, unless otherwise stated

        I can see someone following that link and saying to themselves, "what is Ponie"? :)
Re: FMTYEWTK about split //
by dragonchild (Archbishop) on Jan 21, 2004 at 13:03 UTC
    It's also a barrier to integration of the defined-or patch to 5.8.x, since a // where perl may be expecting either an operator or a term could mean defined-or or could mean ($_ =~ //). Without the feature, the latter would be overwhelmingly less likely to occur in real code.

    I'm not the world's best lexer, but I cannot imagine a situation where // could be misinterpreted. Your example, if I remember right, implies that I could write $_ =~ +;. =~ is the operator that requires a term on the RHS.

    Unless, as is often the case, I'm missing something ...

    We are the carpenters and bricklayers of the Information Age.

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: FMTYEWTK about split //
by ambrus (Abbot) on Jan 21, 2004 at 13:21 UTC

    I always used ()=$string=~/(.)/gs; to split a string into characters. Also, it's sometimes useful to iterate through the characters of a string, for which you can use the scalar-context variation: while($string=~/(.)/gs){DO SOMETHING WITH ($1)}; -- you can't do this with split, can you?

    Update: added s flags, strange no one noticed that

      adding to Roger's reply,

      print $_, $/ for @{[ split // => $bar ]};

      is strict and warnings friendly, as well, and twisted in its (useless) indirection.

      ~Particle *accelerates*

      you can't do this with split, can you?

      Sure can.
      for (split //, $string) { print "$_\n"; }
        I thought that for (split //, $string) { would cost a lot of memory if $string is long, but now I'm not sure.
Re: FMTYEWTK about split //
by xenchu (Friar) on Jan 21, 2004 at 01:02 UTC

    Thanks, ysth. Your meditation is certainly more than I knew about split.


    The Needs of the World and my Talents run parallel to infinity.

Node Type: perlmeditation [id://322751]
Approved by xenchu
Front-paged by broquaint