PerlMonks  

Tutorial suggestion: split and join

by davido (Archbishop)
on Aug 28, 2003 at 22:23 UTC ( #287544=perlmeditation )

Proposed rewrite of split and join tutorial page by root.
Portions based on split and join, "Getting Started with Perl" tutorial by root. Please compare to the original in considering merit.

split and join

Regular expressions are used to match delimiters with the split function, to break up strings into a list of substrings. The join function is in some ways the inverse of split. It takes a list of strings and joins them together again, optionally, with a delimiter. We'll discuss split first, and then move on to join.

A simple example...

Let's first consider a simple use of split: split a string on whitespace.

$line = "Bart  Lisa Maggie Marge Homer";
@simpsons = split ( /\s/, $line );   # Splits $line, using single whitespace
                                     # characters as the delimiter.

@simpsons now contains "Bart", "", "Lisa", "Maggie", "Marge", and "Homer".

There is an empty element in the list that split placed in @simpsons. That is because \s matches exactly one whitespace character, but in our string, $line, there were two spaces between Bart and Lisa. Split, using single whitespace characters as delimiters, created an empty string at the point where two whitespace characters were found next to each other. Leading whitespace has the same effect: it produces an empty field at the start of the list. In fact, adjacent delimiters found anywhere in the string will result in empty strings being returned as part of the list, except at the very end, where split drops trailing empty fields by default.

We can specify a more flexible delimiter that eliminates the creation of an empty string in the list.

@simpsons = split ( /\s+/, $line ); #Now splits on one-or-more whitespaces.

@simpsons now contains "Bart", "Lisa", "Maggie", "Marge", and "Homer". Because the delimiter now matches one or more whitespace characters, a run of adjacent whitespace is consumed as a single delimiter.
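The difference between the two patterns is easy to check side by side. This is a minimal sketch; the variable names @single and @multi are just for illustration, and the string has two spaces between Bart and Lisa, as the text describes.

```perl
use strict;
use warnings;

# Two spaces between Bart and Lisa.
my $line = "Bart  Lisa Maggie Marge Homer";

my @single = split /\s/,  $line;   # each whitespace character is a delimiter
my @multi  = split /\s+/, $line;   # a run of whitespace is one delimiter

# Join with '|' so that empty fields are visible in the output.
print scalar(@single), " fields: ", join( '|', @single ), "\n";
print scalar(@multi),  " fields: ", join( '|', @multi ),  "\n";
```

The first line of output shows an empty field between Bart and Lisa (two adjacent '|' characters); the second does not.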

Where do delimiters go?

"What does split do with the delimiters?" Usually it discards them, returning only what is found to either side of the delimiters (including empty strings if two delimiters are next to each other, as seen in our first example). Let's examine that point in the following example:

$string = "Just humilityanother humilityPerl humilityhacker.";
@japh = split ( /humility/, $string );

The delimiter is something visible: 'humility'. And after this code executes, @japh contains four strings, "Just ", "another ", "Perl ", and "hacker.". 'humility' bit the bit-bucket, and was tossed aside.

Preserving delimiters

If you want to keep the delimiters, you can. Here's an example of how. Hint: you use capturing parentheses.

$string = "alpha-bravo-charlie-delta-echo-foxtrot";
@list = split ( /(-)/, $string );

@list now contains "alpha", "-", "bravo", "-", "charlie", and so on. The parentheses caused the delimiters to be captured into the list passed to @list, right alongside the stuff between the delimiters.
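Keeping the delimiters is most useful when the pattern can match more than one thing, because the captured pieces record which delimiter was actually found. The sketch below uses a made-up input string and a character class of two delimiters; it also shows that joining the pieces with an empty string rebuilds the original.

```perl
use strict;
use warnings;

my $string = "alpha-bravo,charlie-delta";

# The capturing parentheses make split return the delimiters too.
my @list = split /([-,])/, $string;

# @list is: alpha  -  bravo  ,  charlie  -  delta
print join( ' ', @list ), "\n";

# Nothing was thrown away, so the pieces reassemble into the original.
my $rebuilt = join '', @list;
print "round trip ok\n" if $rebuilt eq $string;
```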

The null delimiter

What happens if the delimiter is indicated to be a null string (a string of zero characters)? Let's find out.

$string = "Monk";
@letters = split ( //, $string );

Now @letters contains a list of four letters, "M", "o", "n", and "k". If split is given a null string as a delimiter, it splits on each null position in the string, or in other words, every character boundary. The effect is that the split returns a list broken into individual characters of $string.

Split's return value

Earlier I mentioned that split returns a list. That list, of course, can be stored in an array, and often is. But another use of split is to store its return values in a list of scalars. Take the following code:

@mydata = ( "Simpson:Homer:1-800-000-0000:40:M",
            "Simpson:Marge:1-800-111-1111:38:F",
            "Simpson:Bart:1-800-222-2222:11:M",
            "Simpson:Lisa:1-800-333-3333:9:F",
            "Simpson:Maggie:1-800-444-4444:2:F" );
foreach ( @mydata ) {
    ( $last, $first, $phone, $age ) = split ( /:/ );
    print "You may call $age year old $first $last at $phone.\n";
}

What happened to the person's sex? It's just discarded because we're only accepting four of the five fields into our list of scalars. And how does split know what string to split up? When split isn't explicitly given a string to split up, it assumes you want to split the contents of $_. That's handy, because foreach aliases $_ to each element (one at a time) of @mydata.
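For comparison, here is a sketch of the same idea written with strict, lexical variables, an explicit string to split, and a fifth variable so that no field is discarded. The @lines array and the output format are inventions for this example, not part of the original tutorial.

```perl
use strict;
use warnings;

my @mydata = (
    "Simpson:Homer:1-800-000-0000:40:M",
    "Simpson:Lisa:1-800-333-3333:9:F",
);

my @lines;
for my $record (@mydata) {
    # Five variables for five fields, so the last field is kept as well.
    my ( $last, $first, $phone, $age, $sex ) = split /:/, $record;
    push @lines, "$first $last ($sex, $age): $phone";
}

print "$_\n" for @lines;
```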

Words about Context

Put to its normal use, split is used in list context. It may also be used in scalar context, where it returns the number of fields found. In older Perls (prior to 5.11), split in scalar context also overwrote the @_ array with the fields. It's easy to see why that might not be desirable, and thus why using split in scalar context was deprecated and frowned upon. If you target old Perls, beware.
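If all you need is the number of fields, one way to sidestep the question entirely is to split into an array and then use the array in scalar context. The sketch below shows both forms; the direct scalar-context call assumes a reasonably modern Perl, where it simply returns the count.

```perl
use strict;
use warnings;

my $record = "Simpson:Homer:1-800-000-0000:40:M";

# Safe everywhere: split into an array, then count the array.
my @fields = split /:/, $record;
my $count  = @fields;

# Scalar context directly (assumes a modern Perl that leaves @_ alone).
my $direct = split /:/, $record;

print "array count: $count, scalar-context count: $direct\n";
```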

The limit argument

Split can optionally take a third argument. If you specify a third argument to split, as in @list = split ( /\s+/, $string, 3 ); split returns no more than the number of fields you specify in the third argument. So if you combine that with our previous example:

( $last, $first, $everything_else) = split ( /:/, $_, 3 );

Now, $everything_else contains the person's phone number, age, and sex, still delimited by ":", because we told split to stop after three fields. If you specify a negative limit value, split understands that as being the same as an arbitrarily large limit, which also means that trailing empty fields are kept rather than dropped.
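The effect of the limit is easiest to see on concrete data. This sketch uses one of the records from the earlier example plus a made-up comma-separated string ($padded) that ends in empty fields, to show what a negative limit changes.

```perl
use strict;
use warnings;

my $record = "Simpson:Bart:1-800-222-2222:11:M";

# Limit of 3: the third field holds the rest of the string, unsplit.
my ( $last, $first, $everything_else ) = split /:/, $record, 3;
print "$first $last, rest: $everything_else\n";

my $padded = "a,b,c,,,";
my @default = split /,/, $padded;        # trailing empty fields dropped
my @keep    = split /,/, $padded, -1;    # negative limit keeps them

print scalar(@default), " fields by default, ",
      scalar(@keep), " with a negative limit\n";
```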

Unspecified split pattern

As mentioned before, limit is an optional parameter. If you leave limit off, you may also, optionally, choose to not specify the split string. Leaving out the split string causes split to attempt to split the string contained in $_. And if you leave off the split string (and limit), you may also choose to not specify a delimiter pattern.

If you leave off the pattern, split assumes you want to split on /\s+/. Not specifying a pattern also causes split to skip leading whitespace. It then splits on any whitespace field (of one or more whitespaces), and skips past any trailing whitespace. One special case is when you specify the string literal, " " (a quoted space), which does the same thing as specifying no delimiter at all (no argument).
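The difference between the quoted-space special case and an explicit /\s+/ only shows up when the string has leading whitespace. Here is a small sketch with a made-up string that starts and ends with spaces; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "  Bart  Lisa Maggie  ";

my @awk_style = split ' ',   $line;   # leading whitespace is skipped
my @regex     = split /\s+/, $line;   # leading whitespace yields an empty field

# A bare split works on $_ and behaves like split ' ', $_.
my @implicit;
for ($line) { @implicit = split; }

print scalar(@awk_style), " fields with ' '\n";
print scalar(@regex),     " fields with /\\s+/\n";
print scalar(@implicit),  " fields with a bare split\n";
```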

The star quantifier (zero or more)

Finally, consider what happens if we specify a split delimiter of /\s*/. The quantifier "*" means zero or more of the item it is quantifying. So this split can split on nothing (character boundaries) or on any amount of whitespace. And remember, delimiters get thrown away. See this in action:

$string = "Hello world!";
@letters = split ( /\s*/, $string );

@letters now contains "H", "e", "l", "l", "o", "w", "o", "r", "l", "d", and "!".
Notice that the whitespace is gone. You just split $string, character by character (because null matches boundaries), and on whitespace (which gets discarded because it's a delimiter).
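Putting the null pattern and the star-quantified pattern next to each other makes the contrast clearer: the first keeps every character, including whitespace, while the second discards the whitespace it matches. A minimal sketch:

```perl
use strict;
use warnings;

my $string = "Hello world!";

my @every_char = split //,    $string;   # the space survives as its own element
my @no_space   = split /\s*/, $string;   # the space is a delimiter and is dropped

print scalar(@every_char), " elements with //\n";
print scalar(@no_space),   " elements with /\\s*/\n";
print join( '', @no_space ), "\n";
```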

Using split versus Regular Expressions

There are cases where it is equally easy to use a regexp in list context to split a string as it is to use the split function. Consider the following examples:

my @list = split /\s+/, $string;
my @list = $string =~ /(\S+)/g;

In the first example you're defining what to throw away. In the second, you're defining what to keep. But you're getting the same results. That is a case where it's equally easy to use either syntax.

But what if you need to be more specific as to what you keep, and perhaps are a little less concerned with what comes between what you're keeping? That's a situation where a regexp is probably a better choice. See the following example:

my @bignumbers = $string =~ /(\d{4,})/g;

That type of match would be difficult to accomplish with split. Try not to fall into the trap of using one where the other would be handier. In general, if you know what you want to keep, use a regexp. If you know what you want to get rid of, use split. That's an oversimplification, but start there, and if you start tearing your hair out over the code, consider taking another approach. There is always more than one way to do it.
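Both approaches can be tried on sample data. The strings below are invented for this sketch: the first shows split and a list-context match agreeing, the second shows a "keep" pattern picking out runs of four or more digits.

```perl
use strict;
use warnings;

my $words = "one two  three";

my @by_split = split /\s+/, $words;      # say what to throw away
my @by_match = $words =~ /(\S+)/g;       # say what to keep

print "same result\n" if "@by_split" eq "@by_match";

my $string = "ids: 12 34567 890 1234 x99999";

# Keep only runs of four or more digits, wherever they appear.
my @bignumbers = $string =~ /(\d{4,})/g;
print join( ', ', @bignumbers ), "\n";
```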


That's enough for split; let's take a look at join.

join: Putting it back together

If you're exhausted by the many ways to use split, you can rest assured that join isn't nearly so complicated. We can over-simplify by saying that join does the inverse of split. If we said that, we would be mostly accurate. But there are no pattern matches going on. Join takes a string that specifies the delimiter to be concatenated between each item in the list supplied by subsequent parameter(s). Where split accommodates delimiters through a regular expression, allowing for different delimiters as long as they match the regexp, join makes no attempt to allow for differing delimiters. You specify the same delimiter for each item in the list being joined, or you specify no delimiter at all. Those are your choices. Easy.

To join a list of scalars together into one colon delimited string, do this:

$string = join ( ':', $last, $first, $phone, $age, $sex );

Whew, that was easy. You can also join lists contained in arrays:

$string = join ( ':', @array );
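To see where "mostly accurate" holds and where it breaks down, compare a round trip on a fixed delimiter with one on a pattern that matches variable-width whitespace. The strings and variable names here are illustrative only.

```perl
use strict;
use warnings;

# A fixed, single-character delimiter round-trips cleanly.
my $record = "Simpson:Lisa:1-800-333-3333:9:F";
my @fields = split /:/, $record;
my $again  = join ':', @fields;
print "record round trip ok\n" if $again eq $record;

# A pattern like /\s+/ discards how much whitespace there was,
# so join cannot reproduce the original string.
my $loose = "Bart  Lisa   Maggie";
my $tight = join ' ', split /\s+/, $loose;
print "'$loose' became '$tight'\n";
```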

Use join to concatenate

It turns out that join is also a convenient, and usually efficient, way to concatenate many strings together at once; it is often tidier than a long chain of '.' operators.

How do you do that? Like this:

$string = join ( '', @array );

By specifying a null delimiter (nothing between the quotes), you're telling join to join up the elements in @array, one after another, with nothing between them. Easy.

As any good Perlish function should, join will accept an actual list, not just an array holding a list. So you can say this:

$string = join ( '*', "My", "Name", "Is", "Dave" );

Or even...

$string = join ( 'humility', ( qw/My name is Dave/ ) );

Which puts humility between each word in the list.

Hopefully you've still got some energy left. If you do, dive back into the Tutorial.


Credits and updates

Update: To avoid confusing the continuity of this node with inline update callouts, I'll enumerate updates and list credits at the end.

  • Portions of this node are adaptations of the original tutorial provided by root.
  • Thanks go to gmax for providing comments that led to clarification on what ' ' does (in place of //).
  • diotalevi caught an important spelling error, helped to improve the description of the correlation between split, join, and Regular Expressions, and provided additional information on the behavior of "//" null-string delimiters.
  • At the suggestion of several people, including Abigail-II, I smoothed out my assertion that join is the inverse of split. It isn't an exact inverse, it just has inverse behaviors for a particular subset of what split can do.
  • Per Not_a_Number's suggestion, I added the caveat that you may only leave out the delimiter expression if you also leave out all other parameters (thus letting split implicitly try to split the contents of $_).
  • Also at Not_a_Number's suggestion I moved the "Preserving Delimiters" section to immediately following the "Where do Delimiters go?" section, to improve the logical flow of the document.
  • Implemented merlyn's suggestion of discussing the notion of not getting too hung up on one tool (split) when another tool (regexps) might be the simpler approach (and vice versa).

Re: Tutorial suggestion: split and join
by gmax (Abbot) on Aug 28, 2003 at 22:54 UTC

    It is not only the intermediate elements. The effect of a delimiter is also felt on the empty elements at the beginning and the end of the source string.

    Consider the following examples

    #!/usr/bin/perl -w
    use strict;

    my $line = "  Bart  Lisa Maggie Marge Homer  ";
    # notice the leading and trailing spaces
    my @simpsons;
    for ( " ", '\s', '\s+' ) {
        print "delimiter /$_/\n";
        @simpsons = split ( /$_/, $line );
        print map {"<$_>"} @simpsons;
        print $/;
    }
    print "delimiter ' '\n";
    @simpsons = split ( ' ', $line );
    print map {"<$_>"} @simpsons;
    print $/;
    __END__
    delimiter / /
    <><><Bart><><Lisa><Maggie><Marge><Homer>
    delimiter /\s/
    <><><Bart><><Lisa><Maggie><Marge><Homer>
    delimiter /\s+/
    <><Bart><Lisa><Maggie><Marge><Homer>
    delimiter ' '
    <Bart><Lisa><Maggie><Marge><Homer>

    The best choice if you want to split a string by spaces and you don't want the empty elements is to use a simple quoted space (not a regex) as a delimiter, as the last example shows.

    From perldoc -f split

    As a special case, specifying a PATTERN of space ("' '") will split on white space just as "split" with no arguments does. Thus, "split(' ')" can be used to emulate awk's default behavior, whereas "split(/ /)" will give you as many null initial fields as there are leading spaces. A "split" on "/\s+/" is like a "split(' ')" except that any leading whitespace produces a null first field. A "split" with no arguments really does a "split(' ', $_)" internally.

    Update: If you want to document the above behavior, you can use B::Deparse.

    perl -MO=Deparse -e '$_=" a b c ";print map {"<$_>"} split'
    $_ = ' a b c ';
    print map({"<$_>";} split(" ", $_, 0));

    However, this will work in Perl 5.8.0 but not in 5.6.1. (in 5.6.1 the output of the one-liner is correct, but the deparsed code is not). Apparently, there was a bug that was recently fixed. Thanks to diotalevi for his useful analysis in this matter.

     _  _ _  _  
    (_|| | |(_|><
     _|   
    
      Thanks for the comment. I've made a few updates to the original node to see if I can do a better job of emphasizing your point. It's my goal to further revise it based on comments here.

      My whole reason for going through re-writing the split and join section of the tutorial was just that I felt the existing section didn't provide enough depth and examples to give a true feel of the power, flexibility, and usefulness of split and join. Rather than re-invent the tutorial, I just hoped to tighten the spokes of the existing wheel. It is ok if this just spurs a discussion that might lead to an improvement here and there. Split/Join seemed like a good place to start.

      If you (or anyone else) have suggestions on how it could be further improved I'd like to hear. Maybe in the end someone will read it and come away with a better feel for how to use split because of our efforts.

      In keeping with Perlmonks tradition, I'll list the updates made to the node, and who helped make them possible. Rather than clutter up the cohesiveness of the node, the update and credit list will appear at the bottom, documenting what input led to assorted improvements.

      Dave

      "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

Re: Tutorial suggestion: split and join
by diotalevi (Canon) on Aug 29, 2003 at 02:13 UTC

    Regular expressions are used with the split function to break up strings into a list of substrings. The converse to split is the join function, which takes a list of strings and joins them together again.
    No. The converse to a split is a regular expression (with matching). The idea here is that with matching you specify what you want to match and you get that as a return value (a la @simpsons = "Bart Lisa Maggie Marge Homer" =~ /(\w+)/g). With split you specify all the things you don't want and you get everything else. Which one you use depends on what is more natural to specify. I'm glossing over the fact that plain regular expressions are actually even more powerful than that but that's how they compare with split anyway (which itself uses a regex as its first parameter).

    If split is given a null string as a delimeter, it splits on each character in the string, and since the delimeter is nothing at all, nothing at all is thrown away.
    No. The empty regular expression matches at every position possible which includes between characters. This is how anchoring works. As an example, ^ normally matches the position before the first character. This is only possible if the "spaces" before, after and between characters are also places to match. So when matching "ab" with // it can match before the "a", the "a", between the "a" and the "b" and after the "b". When you actually run that the only parts that get returned in the split is the "a" and the "b".

    The word "delimit" isn't spelled with an 'e' in it.

      Thank you for the information and clarification. I've re-worked the appropriate portions of the node and provided credit at the end.

      Dave

      "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

        The 'delimet' (sic) spelling error still persists in a heading.

        Cheers!

        Update: I see you fixed the heading but the error also still exists in '...same thing as specifying no delimeter at all' and in the credits section.

      The converse to split is the join function, which takes a list of strings and joins them together again.

      No. The converse to a split is a regular expression (with matching). The idea here is that with matching you specify what you want to match and you get that as a return value (a la @simpsons = "Bart Lisa Maggie Marge Homer" =~ /(\w+)/g). With split you specify all the things you don't want and you get everything else. Which one you use depends on what is more natural to specify.

      I don't understand what meaning of converse you use here. English is not my first language - but here is what I found for 'converse' in the Merriam-Webster online dictionary:
      something reversed in order, relation, or action: as a : a theorem formed by interchanging the hypothesis and conclusion of a given theorem b : a proposition obtained by interchange of the subject and predicate of a given proposition <"no P is S " is the converse of "no S is P ">
      I believe the OP used 'converse' as a synonym for 'inverse' in the mathematical sense, i.e. he wanted to say that  split o join = identity. Which is quite close to being correct.

      Or perhaps you meant join in the first sentence? This possibility occurred to me just after posting this comment.

Re: Tutorial suggestion: split and join
by Abigail-II (Bishop) on Aug 29, 2003 at 08:52 UTC
    The inverse of split is the join function

    No, it's not. If that would be true, I could use join to create the same string I split with split. But this is obviously not always possible. Think splitting on a regex that matches more than one string (for instance /\s+/), or the throwing away of trailing empty fields.

    I missed the explanation of the meaning of a negative third argument.

    Abigail

Re: Tutorial suggestion: split and join
by Not_a_Number (Parson) on Aug 29, 2003 at 13:34 UTC

    1) This statement:

    If you leave off the pattern, split assumes you want to split on /\s+/.

    needs considerable qualification. You can't just 'leave off the pattern' except in one particular circumstance. Try this:

    my $str = " foo bar baz ";
    my @spl = split $str;
    print "@spl";

    It just doesn't work (at least in 5.6.1). You can only leave off the 'pattern' if you also leave off the 'target', ie if you are splitting on an implicit $_, eg:

    while ( <DATA> ) {
        split;
        #do something interesting;
    }

    2) This is plain wrong:

    One special case is when you specify the string literal, " " (a quoted space), which does the same thing as specifying no delimiter at all (no argument).

    There is nothing special about " " (see gmax's example above). You are confusing it with ' '.

    3) One other minor point, I would suggest bringing the paragraph Preserving delimiters up to just below Where do delimiters go?, which IMHO would be a more logical ordering.

    hth

    dave

    Update: See Abigail-II's post below

      There is nothing special about " " (see gmax's example above). You are confusing it with ' '.

      I don't understand how gmax example shows it. Could you provide us with a single string on which splitting with " " produces a different result than splitting with ' '?

      It would be very, very strange if split produces different results, because the difference between " " and ' ' has disappeared long before split is called. After compile time, the difference between " " and ' ' is gone.

      Abigail

        Sorry

        I always thought that the 'magic' of ' ' was limited to single quotes. And a total misreading of this line:

        for ( " ", '\s', '\s+' ) {

        in gmax's code seemed to bear out my fallacy.

        Remark withdrawn, with apologies.

        :( dave

•Re: Tutorial suggestion: split and join
by merlyn (Sage) on Aug 29, 2003 at 14:05 UTC
    Consider adding at least a paragraph or two on the "other" inverse of split - matching /(...)/g in a list context:
    my @words = split /\s+/, $string;
    my @words = $string =~ /(\S+)/g;
    For these two, it's equally easy to say what you want to throw away vs what you want to keep. But sometimes, it's easier to say what you want to keep:
    my @bignums = $string =~ /(\d{4,})/g;
    That'd be hard to do as a split. And sometimes, it's easier to say what you want to throw away:
    my @funny_delimited = split /(?:,\s+|###|!delim!)/, $string;
    So, both of them are useful to know. Far too often, I see one being used where the other would be quite nice. Keep them both nearby in your toolbox.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: Tutorial suggestion: split and join
by davido (Archbishop) on Aug 30, 2003 at 08:13 UTC
    I just wanted to follow up to mention that I've now made a number of changes and revisions to the root node, Tutorial suggestion: split and join. I have attempted to implement the corrections, suggestions, and pointers that have come my way regarding the document.

    The changes and credits are enumerated at the end of the document.

    Please do let me know what you think.

    Dave

    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein
