Other form of greediness in Perl
by naikonta (Curate) on Oct 01, 2007 at 16:26 UTC
When we talk about 'greediness' in Perl, we almost always think of regexes, maybe 97% of the time. At least, that's what I found when I searched for 'greedy' or 'greediness' in the Search box and in Super Search. But there are other functions, constructs, and areas where 'greediness' exists, both to occasionally surprise those of us who aren't aware of it and to provide expected conveniences.
The list construct is always greedy. It eats up everything up to the last element, whether you want it to or not. This can be a problem when you need to provide a list inside another list.
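A minimal sketch of the problem, assuming a hypothetical %chart hash and @options array (only the name => 'vote' pair is taken from the discussion; the other keys and values are made up for illustration):

```perl
use strict;
use warnings;

my @options = ('yes', 'no', 'abstain');   # hypothetical data

# Intended: four key/value pairs.
# Without parens, join swallows everything up to the end of the outer list.
my %chart = (
    name   => 'vote',
    type   => 'bar',
    labels => join ', ', @options,
    width  => 400,
);

print scalar(keys %chart), " keys\n";   # 3 keys, not 4
print "labels = $chart{labels}\n";      # labels = yes, no, abstain, width, 400
```

The width key never makes it into the hash because both 'width' and 400 end up as arguments to join.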
In my early days of Perl programming I was bitten by this. A few days ago I wrote code similar to the above, but there the element with the joined list as its value was the last element, and I wrote it that way on purpose. It just reminded me of the mistake I had made back then. That was the moment when it crossed my mind that "greediness is not all about regex", and I decided to write something about it. This is not, by any means, an attempt to provide a solid reference or a complete tutorial about "other forms of greediness in Perl".
Many Perl built-in functions take a list as their arguments. Sometimes I like to use shortcuts like $var = $value, next if $true (something PBP seems to discourage) instead of a full if block containing the two statements.
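To make the two forms concrete, here is a small sketch; the variable names and the even/odd condition are assumptions chosen only for illustration:

```perl
use strict;
use warnings;

my ($last_even, @odd_short, @odd_long);

# Shortcut form: the assignment and next share one statement modifier.
for my $n (1 .. 5) {
    $last_even = $n, next if $n % 2 == 0;
    push @odd_short, $n;
}

# Longer form with an explicit block.
for my $n (1 .. 5) {
    if ($n % 2 == 0) {
        $last_even = $n;
        next;
    }
    push @odd_long, $n;
}

print "@odd_short\n";   # 1 3 5
print "@odd_long\n";    # 1 3 5
```

Both loops behave the same here because assignment binds tighter than the comma, so the comma cleanly separates the two expressions.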
This is a problem, for example, when we use warn. I admit that this is not a particularly useful example: the condition is always true because of the if 1. But I think it's enough to make the point.
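A sketch of the kind of loop under discussion. The $SIG{__WARN__} handler and the $iterations counter are additions of mine, used only so the effect can be observed from inside the script:

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };   # capture instead of printing

my $iterations = 0;
while (1) {
    $iterations++;
    warn 123, last if 1;   # last is parsed as part of warn's LIST
}

print "iterations: $iterations\n";                  # iterations: 1
print "warnings seen: ", scalar(@warnings), "\n";   # warnings seen: 0
```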
One might expect that it prints 123 at ... line ... to STDERR and exits the while loop. Instead, it exits without showing anything. Why? According to the docs, the warn function takes a LIST. So it swallows everything that follows, including last, because perl sees only one statement. As soon as last is evaluated as part of the argument list, it is executed and exits the loop right away, before warn has any chance to run. B::Deparse helps make this clear.
As you can see, last is considered one of the list elements for warn. The parentheses around the list are added by B::Deparse because of the -p switch. The if part is also removed because the condition is a constant that is always true.
Parentheses to the rescue
Just as regexes have non-greedy quantifiers (the ? mark), the list construct also offers non-greedy behavior through parentheses. Of course, this is nothing surprising. Parens have been there from the start. They are used for, among other things, explicitly declaring precedence and limiting the boundary of list elements. The latter is exactly what we need to solve our problems above.
The first problem with join is easy. By using parentheses around the arguments to join, the problem is solved right away.
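A sketch of the fix, again assuming a hypothetical %chart hash and @options array (only name => 'vote' comes from the discussion):

```perl
use strict;
use warnings;

my @options = ('yes', 'no', 'abstain');   # hypothetical data

# The closing paren tells join where its argument list ends.
my %chart = (
    name   => 'vote',
    type   => 'bar',
    labels => join(', ', @options),
    width  => 400,
);

print scalar(keys %chart), " keys\n";   # 4 keys
print "labels = $chart{labels}\n";      # labels = yes, no, abstain
print "width = $chart{width}\n";        # width = 400
```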
The closing paren effectively marks the end of what join is supposed to grab. IOW, the closing paren limits the boundary of the list elements join is expected to take. The first comma after the name => 'vote' pair acts normally as a pair separator. However, after the join function, the comma acts as an argument separator. Without the parens the list is greedy, so join could eat the last pair as its last two arguments. The => operator is really just a comma in disguise (with the power to force-stringify its left operand). After the closing paren that terminates the join arguments, the list greediness is tamed, and the next comma becomes a pair separator again. The %chart contains four pairs of elements as expected.
The same thing goes for the warn example. When we use the parens, it's clear that we have two separate execution units. We already got a hint from B::Deparse in the previous section. We only need to decide where to put the closing paren.
It now prints 123 at ... line ... to STDERR and then exits the loop. In this case, the comma operator after the closing paren acts as a statement separator. It's exactly the same as if it were written as two separate statements, warn(123); last;.
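A sketch of the warn loop with the parens in place. As before, the $SIG{__WARN__} handler and the counter are my own additions so that the result can be inspected; without the handler the message would simply go to STDERR:

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };   # capture instead of printing

my $iterations = 0;
while (1) {
    $iterations++;
    warn(123), last if 1;   # warn gets only 123; last is a separate expression
}

print "iterations: $iterations\n";                  # iterations: 1
print "warnings seen: ", scalar(@warnings), "\n";   # warnings seen: 1
```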
Unary operators are not greedy
In contrast, functions in the named unary operator category are never greedy because they never take more than one argument. Take chr for example.
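A small sketch of that behavior. The @args array stands in for the command line arguments and join stands in for print, both assumptions made so the result can be checked inside the script:

```perl
use strict;
use warnings;

my @args = (66, 67);   # stand-in for command line arguments

# chr takes exactly one argument, so the comma ends its argument list
# and the second value goes to the enclosing list operator (join here).
my $output = join '', chr $args[0], $args[1];

print "$output\n";   # B67
```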
The code consists of two functions, print and chr, and 'a list' of arguments taken from @ARGV, the command line arguments. The print operator, as we know, takes a list as its argument (after a filehandle, explicit or not), so it will spit out whatever we feed to it. On the other hand, chr takes only one argument (or it will use $_ if none is supplied). Perl knows for sure that chr will only use the first argument ($ARGV[0]) and passes the rest to print. So what happens is chr takes 66 and returns the character that number represents (B). The print operator takes the return value and prints it together with 67, so we see B67. Now, let's see it through B::Deparse.
Looking at how the parens are placed, we can see that the comma operator terminates rather than separates the arguments for chr. As for print, the comma does separate the arguments. The same thing applies to functions that take no arguments at all, such as time.
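A tiny sketch for the time case, with a made-up second element:

```perl
use strict;
use warnings;

# time takes no arguments, so the comma simply separates two list elements.
my @pair = (time, 'after');

print scalar(@pair), " elements\n";   # 2 elements
print "second: $pair[1]\n";           # second: after
```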
Greedy and non-greedy behaviors
Interestingly, Perl also has functions that provide both greedy and non-greedy behavior, depending on how many arguments we pass to them. First, let's take split as an example. This function breaks up a string on a specified pattern (or a predefined pattern if one is omitted) into parts.
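A sketch of a greedy split, assuming a pipe-delimited record stored in a hypothetical $record variable:

```perl
use strict;
use warnings;

my $record = 'book|perl|programming';

# No limit given: split returns every field it can find.
my @fields = split /\|/, $record;

print scalar(@fields), " fields: @fields\n";   # 3 fields: book perl programming
```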
The split is greedy in this code because it will fill up the array @fields with as many elements as it can find. In this case, it contains book, perl, and programming. However, if we use the limit argument, then we get the non-greedy behavior. We can tell it the maximum number of elements we want.
No matter how many elements it could split, @fields would always contain at most two elements. From the code above, @fields contains book and perl|programming.
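A sketch of the same split with a limit of 2, using the same hypothetical $record variable:

```perl
use strict;
use warnings;

my $record = 'book|perl|programming';

# The third argument caps the number of fields; the remainder stays unsplit.
my @fields = split /\|/, $record, 2;

print "$fields[0]\n";   # book
print "$fields[1]\n";   # perl|programming
```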
The second example is the substr function. It returns part of a string based on a start position and the number of characters to return. If we don't specify how many we want, it's greedy. On the bright side, we don't have to worry about how many characters (for some definition of "character") there are; we just get them all.
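A sketch of the greedy form, assuming $line holds the phrase below:

```perl
use strict;
use warnings;

my $line = 'Practical extraction and report language';

# No length given: substr returns everything from offset 10 to the end.
my $rest = substr $line, 10;

print "$rest\n";   # extraction and report language
```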
The substr returns the part of $line from position 10 (counting from 0) onward, extraction and report language in this case. If we want only a certain number of characters, we get the non-greedy behavior and can explicitly ask for the specific part of the string to return.
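A sketch of the non-greedy form, with the same assumed $line and a length of 10 picked for illustration:

```perl
use strict;
use warnings;

my $line = 'Practical extraction and report language';

# Offset 10, length 10: only the requested slice comes back.
my $word = substr $line, 10, 10;

print "$word\n";   # extraction
```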
Subroutine: Passing params and returning values
Somehow I feel that passing parameters to subroutines and returning values from subroutines are classic topics when it comes to Perl's features. They also happen to be strongly related to this greediness matter. Parameters passed to a subroutine arrive in the array @_, one of Perl's special globals. We can then assign variables from this array. However, an array assigned from @_ is greedy. Nothing is left for the next variable in the assignment.
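A sketch of the greedy assignment. The names upstream(), @src, @dest, @source, and @destination follow the discussion; the data and the returned counts are my own additions for checking the result:

```perl
use strict;
use warnings;

sub upstream {
    # @source takes everything in @_; nothing is left for @destination.
    my (@source, @destination) = @_;
    return (scalar @source, scalar @destination);
}

my @src  = (1, 2, 3);   # hypothetical data
my @dest = (4, 5);

my ($n_source, $n_destination) = upstream(@src, @dest);

print "source: $n_source, destination: $n_destination\n";   # source: 5, destination: 0
```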
From the code above, the array @destination in the subroutine upstream() is always empty; it does not hold the content of @dest as one might expect. OTOH, the array @source has all the combined elements of the original @src and @dest from the caller. It's clear: @source is greedy. No matter how many storage units are passed in, they all end up in a single array once that array is assigned from @_. Yes, prototypes can help if they are really wanted, though they can be circumvented as well. Nevertheless, a discussion of subroutine prototypes is out of the scope of this article.
But there's another elegant solution to overcome this greediness problem, and it's also a good time to introduce the benefit of references. To make the subroutine upstream() treat the two arrays as distinct storage units, they must be passed in as references. Likewise, to access the arrays, they must be dereferenced first.
This time, each array can be passed in and accessed individually. Two storage units will be accepted as two storage units as well. No greedy behavior.
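A sketch of the reference-based version, with the same assumed data and returned counts as a check:

```perl
use strict;
use warnings;

sub upstream {
    # Two scalars (array references) come in, so nothing gets merged.
    my ($source, $destination) = @_;
    return (scalar @$source, scalar @$destination);
}

my @src  = (1, 2, 3);   # hypothetical data
my @dest = (4, 5);

my ($n_source, $n_destination) = upstream(\@src, \@dest);

print "source: $n_source, destination: $n_destination\n";   # source: 3, destination: 2
```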
Just as when passing in parameters, greediness also happens when returning values from a subroutine to the caller. Perl uses the same flat list model for passing parameters and returning values, as stated in perlsub:
The Perl model for function call and return values is simple: all functions are passed as parameters one single flat list of scalars, and all functions likewise return to their caller one single flat list of scalars. Any arrays or hashes in these call and return lists will collapse, losing their identities....
The subroutine get_group() in the following code is supposed to return a list of groups a user is assigned to. There's no way we can distinguish the types of the groups returned, even if we know them inside the subroutine.
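A sketch of such a subroutine. The group names come from the discussion, but which groups are special and which are normal, as well as the user name, are assumptions of mine:

```perl
use strict;
use warnings;

sub get_group {
    my ($user) = @_;
    # Hypothetical split between the two kinds of groups.
    my @special = ('cdrom', 'games');
    my @normal  = ('westteam', 'marketing', 'committee');
    return (@special, @normal);   # flattened into one list on return
}

my (@special, @normal) = get_group('someuser');

print scalar(@special), " special, ", scalar(@normal), " normal\n";   # 5 special, 0 normal
```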
All the elements of both arrays are returned as one flat list, and the first array that captures the return values is greedy. Indeed, the array @special will have cdrom, games, westteam, marketing, and committee while the array @normal is just an empty one. To capture them as two different units, we again use references to return the groups, and dereference them to access the elements.
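A sketch of the reference-returning version, under the same assumptions about the split between special and normal groups:

```perl
use strict;
use warnings;

sub get_group {
    my ($user) = @_;
    # Hypothetical split between the two kinds of groups.
    my @special = ('cdrom', 'games');
    my @normal  = ('westteam', 'marketing', 'committee');
    return (\@special, \@normal);   # two scalars, each an array reference
}

my ($special, $normal) = get_group('someuser');

print "special: @$special\n";   # special: cdrom games
print "normal: @$normal\n";     # normal: westteam marketing committee
```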
Aggregate, list, and scalar arguments in the function syntax descriptions
The perlfunc document contains a general explanation of Perl functions as well as a syntax description for each function. A function can generally be called in many ways (with a varying number of arguments) and is context-sensitive (void, scalar, or list). Functions can take aggregate, list, or scalar arguments.
Aggregate variables can hold more than one scalar, so the function explicitly takes an ARRAY or a HASH as its argument. If what we have is a reference to an array or a hash, then it must be dereferenced first, appropriately. In the following examples, %$rgb, %ENV, @names, and @$another_members are each treated as a single unit, individually.
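A sketch of what such calls might look like. The variable names follow the discussion, while the choice of keys and push and all the data are assumptions of mine:

```perl
use strict;
use warnings;

my $rgb             = { red => 'ff0000', green => '00ff00', blue => '0000ff' };
my @names           = ('alice', 'bob');
my @new_members     = ('carol');
my $another_members = ['dave'];

my @colors    = sort keys %$rgb;   # HASH argument, dereferenced first
my $env_count = keys %ENV;         # HASH argument, counted in scalar context

push @names, @new_members;         # first example: ARRAY, then LIST
push @$another_members, @names;    # second example: dereferenced ARRAY, then LIST

print "@colors\n";             # blue green red
print "@names\n";              # alice bob carol
print "@$another_members\n";   # dave alice bob carol
```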
If we emulate push with a subroutine of our own, the same separation holds.
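One way to sketch such an emulation is with a (\@@) prototype. The name my_push and the use of a prototype are my own assumptions here; @target_array and @new_elements from the discussion appear as $target_array (a reference) and @new_elements:

```perl
use strict;
use warnings;

# Hypothetical emulation of push: the \@ in the prototype makes perl pass
# the first array by reference, and the trailing @ slurps the rest as a LIST.
sub my_push (\@@) {
    my ($target_array, @new_elements) = @_;
    push @$target_array, @new_elements;
    return scalar @$target_array;
}

my @names       = ('alice', 'bob');
my @new_members = ('carol', 'dave');
my @more        = ('frank');

my_push(@names, @new_members);    # @names stays a single unit
my_push(@names, 'erin', @more);   # everything after @names lands in the LIST

print "@names\n";   # alice bob carol dave erin frank
```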
There's no greediness here. It's guaranteed that @target_array is the same as @names and @new_elements is the same as @new_members in the first example. Likewise, @target_array is the same as @$another_members and @new_elements is the same as @names in the second one. The list argument (denoted as LIST) is, however, greedy, as we discussed earlier in this article. Everything after the first array will be treated as @new_elements, no matter how many arrays or scalars there are.
A function can take a number of scalars too. For example, the splice syntax is splice ARRAY,OFFSET,LENGTH,LIST. The argument ARRAY must be a real array or a dereferenced array reference. OFFSET and LENGTH are two scalar arguments. They will be treated individually. So the first three arguments are not treated greedily. But LIST, the last argument, is once again greedy.
Well, I actually didn't mean to draw any conclusions, but I need a closing to summarize what we have seen so far :-)
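A short sketch of splice with all four kinds of arguments, using made-up data:

```perl
use strict;
use warnings;

my @letters = ('a' .. 'e');

# ARRAY, OFFSET and LENGTH are taken one at a time;
# everything after LENGTH is the greedy replacement LIST.
my @removed = splice @letters, 1, 2, 'X', 'Y', 'Z';

print "@removed\n";   # b c
print "@letters\n";   # a X Y Z d e
```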
Happy Perl programming, folks!
Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!