|Problems? Is your data what you think it is?|
Understanding Split and Joinby davido (Cardinal)
|on Dec 28, 2006 at 07:05 UTC||Need Help??|
split and join
Regular expressions are used to match delimiters with the split function, to break up strings into a list of substrings. The join function is in some ways the inverse of split. It takes a list of strings and joins them together again, optionally, with a delimiter. We'll discuss split first, and then move on to join.
A simple example...
Let's first consider a simple use of split: split a string on whitespace.
@simpsons now contains "Bart", "", "Lisa", "Maggie", "Marge", and "Homer".
There is an empty element in the list that split placed in @simpsons. That is because \s matched exactly one whitespace character. But in our string, $line, there were two spaces between Bart and Lisa. Split, using single whitespaces as delimiters, created an empty string at the point where two whitespaces were found next to each other. That also includes preceding whitespace. In fact, empty delimiters found anywhere in the string will result in empty strings being returned as part of the list of strings.
We can specify a more flexible delimiter that eliminates the creation of an empty string in the list.
@simpsons now contains "Bart", "Lisa", "Maggie", "Marge", and "Homer", because the delimiter match is seen as one or more whitespaces, multiple whitespaces next to each other are consumed as one delimiter.
Where do delimiters go?
"What does split do with the delimiters?" Usually it discards them, returning only what is found to either side of the delimiters (including empty strings if two delimiters are next to each other, as seen in our first example). Let's examine that point in the following example:
The delimiter is something visible: 'humility'. And after this code executes, @japh contains four strings, "Just ", "another ", "Perl ", and "hacker.". 'humility' bit the bit-bucket, and was tossed aside.
If you want to keep the delimiters you can. Here's an example of how. Hint, you use capturing parenthesis.
@list now contains "alpha","-", "bravo","-", "charlie", and so on. The parenthesis caused the delimiters to be captured into the list passed to @list right alongside the stuff between the delimiters.
The null delimiter
What happens if the delimiter is indicated to be a null string (a string of zero characters)? Let's find out.
Now @letters contains a list of four letters, "M", "o", "n", and "k". If split is given a null string as a delimiter, it splits on each null position in the string, or in other words, every character boundary. The effect is that the split returns a list broken into individual characters of $string.
Split's return value
Earlier I mentioned that split returns a list. That list, of course, can be stored in an array, and often is. But another use of split is to store its return values in a list of scalars. Take the following code:
What happened to the person's sex? It's just discarded because we're only accepting four of the five fields into our list of scalars. And how does split know what string to split up? When split isn't explicitly given a string to split up, it assumes you want to split the contents of $_. That's handy, because foreach aliases $_ to each element (one at a time) of @mydata.
Words about Context
Put to its normal use, split is used in list context. It may also be used in scalar context, though its use in scalar context is deprecated. In scalar context, split returns the number of fields found, and splits into the @_ array. It's easy to see why that might not be desirable, and thus, why using split in scalar context is frowned upon.
The limit argument
Split can optionally take a third argument. If you specify a third argument to split, as in @list = split ( /\s+/, $string, 3 ); split returns no more than the number of fields you specify in the third argument. So if you combine that with our previous example.....
Now, $everything_else contains Bart's phone number, his age, and his sex, delimited by ":", because we told split to stop early. If you specify a negative limit value, split understands that as being the same as an arbitrarily large limit.
Unspecified split pattern
As mentioned before, limit is an optional parameter. If you leave limit off, you may also, optionally, choose to not specify the split string. Leaving out the split string causes split to attempt to split the string contained in $_. And if you leave off the split string (and limit), you may also choose to not specify a delimiter pattern.
If you leave off the pattern, split assumes you want to split on /\s+/. Not specifying a pattern also causes split to skip leading whitespace. It then splits on any whitespace field (of one or more whitespaces), and skips past any trailing whitespace. One special case is when you specify the string literal, " " (a quoted space), which does the same thing as specifying no delimiter at all (no argument).
The star quantifier (zero or more)
Finally, consider what happens if we specify a split delimiter of /\s*/. The quantifier "*" means zero or more of the item it is quantifying. So this split can split on nothing (character boundaries), any amount of whitespace. And remember, delimiters get thrown away. See this in action:
@letters now contains "H", "e", "l", "l", "o", "w", "o", "r", "l", "d", and "!".
Using split versus Regular Expressions
There are cases where it is equally easy to use a regexp in list context to split a string as it is to use the split function. Consider the following examples:
In the first example you're defining what to throw away. In the second, you're defining what to keep. But you're getting the same results. That is a case where it's equally easy to use either syntax.
But what if you need to be more specific as to what you keep, and perhaps are a little less concerned with what comes between what you're keeping? That's a situation where a regexp is probably a better choice. See the following example:
That type of a match would be difficult to accomplish with split. Try not to fall into the pitfall of using one where the other would be handier. In general, if you know what you want to keep, use a regexp. If you know what you want to get rid of, use split. That's an oversimplification, but start there and if you start tearing your hair out over the code, consider taking another approach. There is always more than one way to do it.
That's enough for split, let's take a look at join.
join: Putting it back together
If you're exhausted by the many ways to use split, you can rest assured that join isn't nearly so complicated. We can over-simplify by saying that join, does the inverse of split. If we said that, we would be mostly accurate. But there are no pattern matches going on. Join takes a string that specifies the delimiter to be concatenated between each item in the list supplied by subsequent parameter(s). Where split accommodates delimiters through a regular expression, allowing for different delimiters as long as they match the regexp, join makes no attempt to allow for differing delimiters. You specify the same delimiter for each item in the list being joined, or you specify no delimiter at all. Those are your choices. Easy.
To join a list of scalars together into one colon delimited string, do this:
Whew, that was easy. You can also join lists contained in arrays:
Use join to concatenate
It turns out that join is the most efficient way to concatenate many strings together at once; better than the '.' operator.
How do you do that? Like this:
As any good Perlish function should, join will accept an actual list, not just an array holding a list. So you can say this:
Which puts humility between each word in the list.
By specifying a null delimiter (nothing between the quotes), you're telling join to join up the elements in @array, one after another, with nothing between them. Easy.
Hopefully you've still got some energy left. If you do, dive back into the Tutorial.
Credits and updates