Proposed rewrite of split and join tutorial page by root.
Portions based on split and join, "Getting Started with Perl" tutorial by root. Please compare to the original in considering merit.

split and join

The split function uses a regular expression to match delimiters, breaking a string up into a list of substrings. The join function is in some ways the inverse of split: it takes a list of strings and joins them together again, optionally with a delimiter between them. We'll discuss split first, and then move on to join.

A simple example...

Let's first consider a simple use of split: split a string on whitespace.

    $line = "Bart  Lisa Maggie Marge Homer";
    @simpsons = split ( /\s/, $line ); 
        # Splits line and uses single whitespaces 
        # as the delimiter.

@simpsons now contains "Bart", "", "Lisa", "Maggie", "Marge", and "Homer".

There is an empty element in the list that split placed in @simpsons. That is because \s matches exactly one whitespace character, but in our string, $line, there were two spaces between Bart and Lisa. Split, using single whitespaces as delimiters, created an empty string at the point where two whitespaces were found next to each other. Leading whitespace does the same thing, producing an empty first element. In fact, adjacent delimiters found anywhere in the string will result in empty strings being returned as part of the list, with one exception: empty fields at the very end of the list are dropped by default.

We can specify a more flexible delimiter that eliminates the creation of an empty string in the list.

    @simpsons = split ( /\s+/, $line ); 
    #Now splits on one-or-more whitespaces.

@simpsons now contains "Bart", "Lisa", "Maggie", "Marge", and "Homer". Because the delimiter now matches one or more whitespaces, multiple whitespaces next to each other are consumed as a single delimiter.
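To see the difference side by side, here is a small sketch you can run. The array names (@single, @multi, @leading) are only mine, chosen for illustration. It also shows that leading whitespace still yields an empty first element, even with /\s+/.

```perl
use strict;
use warnings;

my $line = "Bart  Lisa Maggie Marge Homer";

# One whitespace character per delimiter: the double space yields an empty field.
my @single = split /\s/, $line;

# One or more whitespace characters per delimiter: no empty field.
my @multi = split /\s+/, $line;

# Leading whitespace still produces an empty first field with /\s+/.
my @leading = split /\s+/, "  Bart Lisa";

print scalar(@single),  " fields: ", join( '|', @single ),  "\n";
print scalar(@multi),   " fields: ", join( '|', @multi ),   "\n";
print scalar(@leading), " fields: ", join( '|', @leading ), "\n";
```

Printing the fields with a visible separator such as '|' makes the empty strings easy to spot.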

Where do delimiters go?

"What does split do with the delimiters?" Usually it discards them, returning only what is found to either side of the delimiters (including empty strings if two delimiters are next to each other, as seen in our first example). Let's examine that point in the following example:

    $string = "Just humilityanother humilityPerl humilityhacker.";
    @japh = split ( /humility/, $string );

The delimiter is something visible: 'humility'. And after this code executes, @japh contains four strings, "Just ", "another ", "Perl ", and "hacker.". 'humility' bit the bit-bucket, and was tossed aside.

Preserving delimiters

If you want to keep the delimiters, you can. Here's an example of how. Hint: use capturing parentheses.

    $string = "alpha-bravo-charlie-delta-echo-foxtrot";
    @list = split ( /(-)/, $string );

@list now contains "alpha", "-", "bravo", "-", "charlie", and so on. The parentheses caused the delimiters to be captured into the list passed to @list, right alongside the stuff between the delimiters.
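Keeping the delimiters is most useful when they are not all the same. The following is a minimal sketch under my own assumptions (the $expression string and the @tokens name are made up for illustration), where the captured delimiters are arithmetic operators:

```perl
use strict;
use warnings;

my $expression = "3+4-5";

# The capturing parentheses make split return each operator it splits on too.
my @tokens = split /([-+])/, $expression;

print join( ' ', @tokens ), "\n";
```

Without the parentheses you would get only the numbers back and lose track of which operator sat between them.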

The null delimiter

What happens if the delimiter is indicated to be a null string (a string of zero characters)? Let's find out.

    $string = "Monk";
    @letters = split ( //, $string );

Now @letters contains a list of four letters, "M", "o", "n", and "k". If split is given a null string as a delimiter, it splits on each null position in the string, or in other words, every character boundary. The effect is that the split returns a list broken into individual characters of $string.

Split's return value

Earlier I mentioned that split returns a list. That list, of course, can be stored in an array, and often is. But another use of split is to store its return values in a list of scalars. Take the following code:

    @mydata = ( "Simpson:Homer:1-800-000-0000:40:M",
                "Simpson:Marge:1-800-111-1111:38:F",
                "Simpson:Bart:1-800-222-2222:11:M",
                "Simpson:Lisa:1-800-333-3333:9:F",
                "Simpson:Maggie:1-800-444-4444:2:F" );
    foreach ( @mydata ) {
        ( $last, $first, $phone, $age ) = split ( /:/ ); 
        print "You may call $age year old $first $last at $phone.\n";
    }

What happened to the person's sex? It's just discarded because we're only accepting four of the five fields into our list of scalars. And how does split know what string to split up? When split isn't explicitly given a string to split up, it assumes you want to split the contents of $_. That's handy, because foreach aliases $_ to each element (one at a time) of @mydata.
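If you would rather not rely on $_ and would like to keep all five fields, something like the following sketch should work. The lexical names ($record, $sex, @lines) are my own choices for illustration, and only two of the records are repeated to keep it short.

```perl
use strict;
use warnings;

my @mydata = ( "Simpson:Homer:1-800-000-0000:40:M",
               "Simpson:Lisa:1-800-333-3333:9:F" );

my @lines;
foreach my $record (@mydata) {
    # Name the string explicitly instead of relying on $_.
    my ( $last, $first, $phone, $age, $sex ) = split /:/, $record;
    push @lines, "$first $last ($sex, age $age): $phone";
}

print "$_\n" for @lines;
```

Naming the string costs a few keystrokes but makes it obvious what is being split when the loop body grows.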

Words about Context

Put to its normal use, split is used in list context. It may also be used in scalar context, where it returns the number of fields found. In older versions of Perl, split in scalar context also split into the @_ array, and that use was deprecated; it's easy to see why clobbering @_ might not be desirable. Perl 5.12 and later no longer touch @_, but the habit of avoiding split in scalar context dates from that behavior.
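If all you want is the number of fields, here is a hedged sketch of two ways to get it. The first works on any Perl; the second assumes Perl 5.12 or later, where split in scalar context simply returns the count. The variable names are mine.

```perl
use strict;
use warnings;

my $record = "Simpson:Bart:1-800-222-2222:11:M";

# Portable: split in list context, then count the array.
my @fields = split /:/, $record;
my $count  = @fields;

# Scalar context (assuming Perl 5.12 or later, where @_ is left alone).
my $count_scalar = split /:/, $record;

print "$count fields, $count_scalar fields\n";
```

The array version is the safer choice if your code may run on an old Perl.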

The limit argument

Split can optionally take a third argument. If you specify a third argument to split, as in @list = split ( /\s+/, $string, 3 );, split returns no more than the number of fields you specify in the third argument. So if you combine that with our previous example...

    ( $last, $first, $everything_else) = split ( /:/, $_, 3 );

Now, $everything_else contains Bart's phone number, his age, and his sex, delimited by ":", because we told split to stop early. If you specify a negative limit value, split understands that as being the same as an arbitrarily large limit.
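Here is a small runnable sketch of both behaviors. The "a:b::" string and the @default and @all names are made up for illustration. Note one side effect of a negative limit: it also keeps trailing empty fields, which split otherwise drops.

```perl
use strict;
use warnings;

my $record = "Simpson:Bart:1-800-222-2222:11:M";

# Limit of 3: the third field keeps the rest of the string, colons and all.
my ( $last, $first, $everything_else ) = split /:/, $record, 3;
print "$everything_else\n";

# Trailing empty fields are dropped by default...
my @default = split /:/, "a:b::";

# ...but a negative limit keeps them.
my @all = split /:/, "a:b::", -1;

print scalar(@default), " versus ", scalar(@all), "\n";
```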

Unspecified split pattern

As mentioned before, limit is an optional parameter. If you leave limit off, you may also, optionally, choose to not specify the split string. Leaving out the split string causes split to attempt to split the string contained in $_. And if you leave off the split string (and limit), you may also choose to not specify a delimiter pattern.

If you leave off the pattern, split assumes you want to split on /\s+/. Not specifying a pattern also causes split to skip leading whitespace. It then splits on any whitespace field (of one or more whitespaces), and skips past any trailing whitespace. One special case is when you specify the string literal, " " (a quoted space), which does the same thing as specifying no delimiter at all (no argument).
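The difference between the quoted space and an explicit /\s+/ only shows up when the string starts with whitespace. This sketch (with a made-up $padded string and array names of my own) compares the two, and then calls split with no arguments at all:

```perl
use strict;
use warnings;

my $padded = "  Bart Lisa  ";

# A quoted single space is the special case: leading whitespace is skipped.
my @quoted_space = split ' ', $padded;

# An explicit /\s+/ keeps the empty leading field.
my @regex_style = split /\s+/, $padded;

# With no arguments at all, split works on $_ as if given ' '.
my @implicit;
for ($padded) {
    @implicit = split;
}

print scalar(@quoted_space), " ", scalar(@regex_style), " ", scalar(@implicit), "\n";
```

In every case the trailing whitespace adds nothing, because trailing empty fields are dropped.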

The star quantifier (zero or more)

Finally, consider what happens if we specify a split delimiter of /\s*/. The quantifier "*" means zero or more of the item it is quantifying. So this split can split on nothing (character boundaries) or on any amount of whitespace. And remember, delimiters get thrown away. See this in action:

    $string = "Hello world!";
    @letters = split ( /\s*/, $string );

@letters now contains "H", "e", "l", "l", "o", "w", "o", "r", "l", "d", and "!".
Notice that the whitespace is gone. You just split $string, character by character (because null matches boundaries), and on whitespace (which gets discarded because it's a delimiter).

Using split versus Regular Expressions

There are cases where it is equally easy to use a regexp in list context to split a string as it is to use the split function. Consider the following examples:

    my @list = split /\s+/, $string;
    my @list = $string =~ /(\S+)/g;

In the first example you're defining what to throw away. In the second, you're defining what to keep. But you're getting the same results (as long as $string has no leading whitespace, which would give split an empty first element). That is a case where it's equally easy to use either syntax.

But what if you need to be more specific as to what you keep, and perhaps are a little less concerned with what comes between what you're keeping? That's a situation where a regexp is probably a better choice. See the following example:

    my @bignumbers = $string =~ /(\d{4,})/g;

That type of a match would be difficult to accomplish with split. Try not to fall into the pitfall of using one where the other would be handier. In general, if you know what you want to keep, use a regexp. If you know what you want to get rid of, use split. That's an oversimplification, but start there and if you start tearing your hair out over the code, consider taking another approach. There is always more than one way to do it.
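To make the comparison concrete, here is a sketch that pulls the "big numbers" out of a string both ways. The sample string and the @via_split name are invented for illustration; the split version needs an extra grep to finish the job, which is the hair-tearing the paragraph above warns about.

```perl
use strict;
use warnings;

my $string = "order 12 shipped 10482 units, invoice 77003, ref 9";

# Describe what to keep: runs of four or more digits.
my @bignumbers = $string =~ /(\d{4,})/g;

# Describe what to throw away: everything that is not a digit, then filter.
my @via_split = grep { length($_) >= 4 } split /\D+/, $string;

print "@bignumbers\n";
print "@via_split\n";
```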

That's enough for split; let's take a look at join.

join: Putting it back together

If you're exhausted by the many ways to use split, you can rest assured that join isn't nearly so complicated. We can over-simplify by saying that join does the inverse of split. If we said that, we would be mostly accurate. But there are no pattern matches going on. Join takes a string that specifies the delimiter to be concatenated between each item in the list supplied by the subsequent parameter(s). Where split accommodates delimiters through a regular expression, allowing for different delimiters as long as they match the regexp, join makes no attempt to allow for differing delimiters. You specify the same delimiter for each item in the list being joined, or you specify no delimiter at all. Those are your choices. Easy.
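As a hedged sketch of that "mostly inverse" relationship, here a record is split on a pattern that allows two different delimiters and then joined back with a single one. The sample record and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $record = "Simpson:Homer;1-800-000-0000:40;M";

# split accepts either ':' or ';' because its delimiter is a pattern.
my @fields = split /[:;]/, $record;

# join only knows one delimiter string, so the original mix is not restored.
my $rejoined = join ':', @fields;

print "$rejoined\n";
```

The fields survive the round trip, but the information about which delimiter sat where does not.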

To join a list of scalars together into one colon delimited string, do this:

    $string = join ( ':', $last, $first, $phone, $age, $sex );

Whew, that was easy. You can also join lists contained in arrays:

    $string = join ( ':', @array );

Use join to concatenate

It turns out that join is a convenient, and often efficient, way to concatenate many strings together at once; it is usually tidier than chaining the '.' operator.

How do you do that? Like this:

    $string = join ( '', @array );

As any good Perlish function should, join will accept an actual list, not just an array holding a list. So you can say this:

    $string = join ( '*', "My", "Name", "Is", "Dave" );

Or even...

    $string = join ( 'humility', ( qw/My name is Dave/ ) );

Which puts humility between each word in the list.

Going back to the join ( '', @array ) example: by specifying a null delimiter (nothing between the quotes), you're telling join to join up the elements in @array, one after another, with nothing between them. Easy.
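For comparison, this sketch builds the same string twice, once with a null-delimiter join and once with the '.' operator. The contents of @array and the $joined and $dotted names are my own, for illustration only.

```perl
use strict;
use warnings;

my @array = qw( J A P H );

# Joining with an empty delimiter glues the elements together.
my $joined = join '', @array;

# The same result built up with the '.' operator, one piece at a time.
my $dotted = '';
$dotted .= $_ for @array;

print "$joined $dotted\n";
```

Both produce the same string; join just says it in one line.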

Hopefully you've still got some energy left. If you do, dive back into the Tutorial.

Credits and updates

Update: To avoid confusing the continuity of this node with inline update callouts, I'll enumerate updates and list credits at the end.

Portions of this node are adaptations of the original tutorial provided by root.
Thanks go to gmax for providing comments that led to clarification on what ' ' does (in place of //).
diotalevi caught an important spelling error, helped to improve the description of the correlation between split, join, and Regular Expressions, and provided additional information on the behavior of "//" null-string delimiters.
At the suggestion of several people, including Abigail-II, I smoothed out my assertion that join is the inverse of split. It isn't an exact inverse; it just has inverse behaviors for a particular subset of what split can do.
Per Not_a_Number's suggestion, I added the caveat that you may only leave out the delimiter expression if you also leave out all other parameters (thus letting split implicitly try to split the contents of $_).
Also at Not_a_Number's suggestion I moved the "Preserving Delimiters" section to immediately following the "Where do Delimiters go?" section, to improve the logical flow of the document.
Implemented merlin's suggestion of discussing the notion of not getting too hung up on one tool (split) when another tool (regexps) might be the simpler approach (and vice versa).

In reply to Tutorial suggestion: split and join by davido
