Meditations on Programming Theory
meditation #3: identification, part 1: variables

Okay, this is where things get complicated.

The concept of identification is only the tip of a tightly-interconnected iceberg. I'm going to start by blasting my way through a bunch of vocabulary so you have some idea of where the major pieces are, then I'll go back and fill in the details.

(and apologies for slipping schedule by 4 days.. I'll try to be back on track with the next one)

# Vocabulary:

• The first piece of the identification iceberg is the concept of encapsulation.

Encapsulation means putting something in a container. When we encapsulate something, we create a boundary that separates 'things on the inside' from 'things on the outside'. Then we treat that container, and everything in it, as a single, coherent unit. For want of a better term, we call that unit an entity.

• Encapsulation gives rise to the concept of abstraction.

We treat entities differently depending on whether we're looking at them from the inside or from the outside. From the inside, an entity can have any number of pieces. From the outside, an entity is a single, irreducible thing.

The word 'abstraction' means 'taking something away', and in this case, moving from the inside of an entity to its outside takes away information about the entity's internal structure. We abstract an entity's structure by putting those pieces into a container. The boundary between an entity's inside and outside, which defines the limits of what we can see, is called an abstraction barrier.

• Abstraction barriers give rise to a concept called scope.

We can only see the pieces that make up an entity from inside the entity's abstraction barrier. For all intents and purposes, neither the pieces, nor the rules that apply to them, exist outside that barrier. The term scope refers to the region inside the abstraction barrier, where those pieces and rules do exist.

• An identifier is a string that represents an entity, as seen from outside the abstraction barrier.

Encapsulation and abstraction are powerful concepts, but we can't actually do much with an entity until we give it a name. The rule that associates a specific name with a specific entity is called a binding, and bindings give us the concept of identification.

Identification gives us two further concepts: specification and object permanence. Specification means we have the power to select a specific entity any time we want it, and object permanence means we can safely assume that we'll get the same entity every time. It's hard to overstate the importance of those two concepts. If we lose either one, we lose the ability to solve any but the most trivial of problems. Most bugs, in fact, boil down to a failure in one of those two areas. An entity that observes those two principles strictly is said to be referentially transparent, and I'll discuss that idea in more detail later.

• A value is a string that represents an entity, as seen from inside the abstraction barrier.

Identification is all well and good, but computation still boils down to replacing one string with another. A name is a string, but it's a string we apply from the outside. It tells us that an entity exists, but nothing more. We still need a string that represents whatever's inside the abstraction barrier. We call that string a value.

A value summarizes an entity's internal structure. Summation is a form of abstraction, which means that a summary can pass through an entity's abstraction barrier. The process of finding a string that serves as an abstract summary for whatever lives inside an entity is called evaluation.

• The association between an entity's name and its value is called reference.

At this point we have three items to consider:

1. an entity
2. the name that identifies it
3. and the value that summarizes it

We know that the relationship between the entity and its name is called identification, and that the relationship between the entity and its value is called summation. That still leaves the relationship between the name and the value, though. We call that relationship reference. A name identifies its entity, but we say that it refers to the entity's value. The act of replacing a name with its associated value is called dereferencing.
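To make those three relationships concrete, here's a minimal sketch in Perl (the variable names are invented for illustration): a binding associates a name with an entity, and using the name in an expression dereferences it.

```perl
use strict;
use warnings;

# A binding associates a name with an entity; dereferencing replaces
# the name with the entity's value.
my $greeting = "hello";   # bind the name 'greeting' to an entity

# Using the name in an expression dereferences it: the value "hello"
# is substituted for the name before the concatenation happens.
my $message = $greeting . ", world";

print "$message\n";
```

The substitution happens behind the scenes, which is exactly the point: from outside, all we ever manipulate is the name.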

Dereferencing is a type of substitution, but it's different from the kind we saw back in [id://nnnnn|MOPT-02]. We can't write a simple "replace this string with that string" rule to represent it. Dereferencing is the form of substitution that crosses an abstraction barrier, and it's much more powerful than the "replace this string with that string" kind.

# Discussion:

Now that we have all those terms out on the table, let's put them to use. Entities tend to fall into one of two general categories: variables and functions. The distinction is an arbitrary one, though. Technically, we could say that variables are just a special kind of function, or that functions are just a special kind of variable. Either version works, but it's easier to describe certain ideas if we think of variables and functions as different kinds of entities.

## variables:

I should start by saying that this kind of variable has very little to do with the '$x' structures that Perl programmers call 'variables'. When programmers think about variables, they tend to think about addressable blocks of mutable storage. Officially, those are called lvalues, because they can appear on the left-hand side of an assignment expression. For the purposes of this discussion, lvalues are not variables. Lvalues can do things that variables can't (i.e., violate the principle of referential transparency), which gives rise to unsolvable problems.

<digression>
Functional programming languages try to eliminate those problems by getting rid of lvalues entirely. The results are subject to debate. Functional languages do encourage good programming habits by 'Huffman coding' them appropriately (making good things easy to do, and bad things hard to do), but they're hardly as immune to the problems lvalues pose as some of their advocates would like to believe. It's also worth noting that the principle of referential transparency raises its own unsolvable problems when applied to concurrent or distributed processing.

As Larry Wall once said, "Perl is a mess because the problem space is a mess," and it's easy enough to use Perl in a referentially transparent way.
</digression>

• For the sake of this discussion, a variable is an entity that can have some value, but isn't restricted to any specific value.

Variables represent the concept of picking a single value out of a set of possible values. The set of all values a variable can possibly have is called the variable's type, and I'll talk about types in a minute.

We use the term raw variable to describe the situation where we know a variable's type, but not its value. A raw variable is an abstraction of its type, having all the characteristics common to every item of that type, but none of the characteristics that make one item different from another. If you happen to be into set theory, a raw variable is an abstraction which represents the intersection of all characteristics belonging to every member of its type.

Raw variables are useful, because they let us define 'generic' operations that apply to every member of a type. We'll get a lot of mileage out of raw variables when we start talking about functions.
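As a quick sketch of what a 'generic' operation looks like in practice (the function name is invented for illustration), the sub below relies only on a property shared by every member of its type, never on any particular value:

```perl
use strict;
use warnings;

# double() is defined over a 'raw' numeric variable: it depends only
# on a characteristic common to every number (numbers can be
# multiplied), not on any particular number.
sub double {
    my ($n) = @_;
    return $n * 2;
}

print double(21), "\n";    # any integer works...
print double(1.5), "\n";   # ...and so does any float
```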

### variables and type:

As I mentioned before, a type is the set of all values a variable can possibly have. If we take that idea all the way back to the level of human assumptions, a type tells us everything we can safely assume about a variable. A type is an information space, and we can represent that space with an appropriately defined language.

Now before anyone tries to make this point for me: Perl does happen to use a very relaxed type system. The interpreter does a lot of work behind the scenes to make sure a given value supports as many reasonable-to-a-human assumptions as possible. That doesn't make Perl a 'weak and sloppy' language, as advocates of strictly-typed languages sometimes imply, but it doesn't mean Perl programmers can ignore type systems, either.

Perl is a language for big kids, the way unix is an operating system for grownups. Perl acknowledges the fact that defining a type system is just as hard as writing the program itself, and that "one size fits all" type systems tend to fit everyone equally badly. Perl gives programmers both the freedom and the responsibility to define their own type systems wherever those systems happen to be necessary.

From a theoretical standpoint, the most interesting part of that whole rant was the statement, "defining a type system is just as hard as writing the program itself." That's true, but it takes a bit of work to show how and why.

As I mentioned above, a type is an information space, and we can represent that space with a language. We use formal systems to generate languages, so there must be some kind of connection between types and formal systems.

That connection is called a decision problem.

A decision problem is a question you can answer with a simple 'yes' or 'no'. The question "is this value defined for this type?" is a decision problem. Since we represent sets with languages, and generate languages using formal systems, deciding whether a value is defined for a given type boils down to the decision problem, "is this string a theorem?"

As we saw in MOPT-02, the process of deriving theorems is called computation.

Now, as it turns out, it's much harder to prove that a string is a theorem than it is to derive a theorem from an axiom. In fact, it's half-way impossible. All we can do is generate theorems, and compare them to the one we want to prove. Eventually (by which I mean "some time between now and forever"), we'll generate every theorem that can be computed, and if one of them matches our candidate, we can stop with an answer of 'yes'. We don't have any way to stop with an answer of 'no', though.

Theorem proving belongs to a set of problems that are called partially decidable, or recursively enumerable. We'll run into those terms again when we start looking at the kinds of problems we can solve with various tools. For now, it's enough to know that the only way to decide whether a value is defined for some type is to build a set of theorems up from their axioms.
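Here's a toy sketch of that build-up-from-axioms process (the formal system and the function name are invented for illustration). The system's axiom is "a", its only production rule appends "b", so its theorems are a, ab, abb, and so on. We can halt with 'yes' as soon as a generated theorem matches; without the artificial cutoff, we could never halt with 'no':

```perl
use strict;
use warnings;

# Deciding "is this string a theorem?" by generation: derive theorems
# one at a time and compare each to the candidate.
sub is_theorem {
    my ($candidate, $limit) = @_;
    my $theorem = "a";                       # start from the axiom
    for (1 .. $limit) {
        return 1 if $theorem eq $candidate;  # a match lets us stop with 'yes'
        $theorem .= "b";                     # derive the next theorem
    }
    return 0;  # only possible because we imposed a cutoff ($limit)
}

print is_theorem("abb", 10) ? "theorem\n" : "not proved\n";
```

That `$limit` parameter is exactly what makes the sketch a practical approximation rather than a real decision procedure: the genuine problem is only partially decidable.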

We generally do that using raw variables. We start with a bunch of axioms of the form:

If we apply operation X to a variable of type Y, we get a variable of type Z.

which usually gets collapsed to the form:

X(Y) : Z

and then we start combining those rules to establish the limits of what's possible. We have to know what the operations are, though, and we have to break them down in sufficient detail to prove that X(Y) really does generate Z. By the time we've done that, we've built a microscopically-detailed specification for the actual program.

Type systems are actually quite useful as design tools. They give us a way to keep track of our assumptions without getting bogged down in the details of implementation. They also give us line-by-line suggestions about testing and error handling procedures.
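As a toy illustration of chaining rules in the X(Y) : Z form (the rule table, type names, and function name here are all invented for the sketch), we can encode the axioms as a lookup table and chase them to find what type a sequence of operations yields:

```perl
use strict;
use warnings;

# Axioms of the form X(Y) : Z, encoded as a lookup table.
my %rules = (
    'length(String)' => 'Int',
    'sqrt(Int)'      => 'Float',
);

# Apply one rule: given an operation and an input type, return the
# result type. An undef result means "we can't prove it".
sub result_type {
    my ($op, $type) = @_;
    return $rules{"$op($type)"};
}

# Chain two rules: length : String -> Int, then sqrt : Int -> Float.
my $t = result_type('sqrt', result_type('length', 'String'));
print "$t\n";
```

Every entry in a table like that is an assumption we're committing to, which is one way to see why writing the type system out in full detail approaches the work of writing the program itself.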

### variables and scope:

So.. we know that types can be cool, and that a variable can adopt any value appropriate to its type. The trick of working with values is keeping track of when and where they're defined.

Both the entity that acts as a variable, and the name that acts as that variable's identifier, can be defined within the scope of some other entity (called a parent) (1). If they are, they only have meaning within that parent's abstraction barrier, which means that they're subject to scope limitations.

(1) - A variable's value, of course, will always be defined in the same scope as its name. We have to replace the one with the other, after all.

An identifier that lives in a different scope than the entity it's bound to is called a free variable. Free variables aren't bound to any entity at all by the terms of their definition, so it's up to the compiler and runtime environment to work out a binding. The rules the compiler and runtime environment use to establish that binding are called scope rules.

Perl has four different sets of scope rules:

• Global scope rules
• Local scope rules
• Lexical scope rules
• and Package scope rules

which cover most of the options available in any language.

Under global scope rules, every occurrence of the same name is bound to the same entity. That entity is defined in the outermost parent, regardless of where the variable is first defined. Global scoping is also called static scoping, because the binding between the identifier and the entity never changes. The compiler can create the entity at compile time, and bind every instance of the identifier it sees to that entity as it works its way through the code.

The following code demonstrates global binding:

```perl
$GLOBAL  = "static";
$LOCAL   = "dynamic";
$LEXICAL = "lexical";

sub show_evaluation_context {
    printf ( "evaluation context: %s \n",
             join (',', ($GLOBAL, $LOCAL, $LEXICAL)) );
}

sub global_scope {
    printf ( "global scope: %s \n",
             join (',', ($GLOBAL, $LOCAL, $LEXICAL)) );
    show_evaluation_context();
}

global_scope();
```

It produces the following output:

```
global scope: static,dynamic,lexical
evaluation context: static,dynamic,lexical
```

which is pretty much what we'd expect.

Perl's local scope rules create a new entity every time the local() keyword is executed. Local scoping is also known as dynamic scoping, because a single identifier can be bound to several different entities during the course of a program's lifetime.

The value of a dynamically-scoped variable depends on what's called the variable's evaluation context. A variable inherits its evaluation context from the entity that uses it, not from the entity that defines it. A dynamically scoped variable uses the most recent binding it can find for that identifier. Technically, all of Perl's global variables are dynamically scoped, but only the local() keyword can create a new entity and establish a new binding.

If we use the code above with a function that creates a dynamic (i.e., local) variable:

```perl
sub local_scope {
    local ($LOCAL) = "-------";
    printf ( "local scope: %s \n",
             join (',', ($GLOBAL, $LOCAL, $LEXICAL)) );
    show_evaluation_context();
}

local_scope();
print "\n";
print "but outside local_scope(), we still have:\n";
show_evaluation_context();
```

we get this output:

```
local scope: static,-------,lexical
evaluation context: static,-------,lexical

but outside local_scope(), we still have:
evaluation context: static,dynamic,lexical
```

which shows two different values for $LOCAL when we evaluate it in two different evaluation contexts.

Lexical scope rules create entities that are only visible in the scope where they're defined. Lexically-scoped entities replace globally (or locally) scoped entities that are defined for the parent, but are not visible to any entity created in, or called from, the same scope.

A function that sets a lexical variable:

```perl
sub lexical_scope {
    my ($LEXICAL) = "-------";
    printf ( "lexical scope: %s \n",
             join (',', ($GLOBAL, $LOCAL, $LEXICAL)) );
    show_evaluation_context();
}

sub local_override {
    local ($LEXICAL) = 'DYNAMICALLY RE-BOUND VALUE';
    lexical_scope();
}

local_override();
print "\n";
lexical_scope();
```

gives us this:

```
lexical scope: static,dynamic,-------
evaluation context: static,dynamic,DYNAMICALLY RE-BOUND VALUE

lexical scope: static,dynamic,-------
evaluation context: static,dynamic,lexical
```

which shows that lexicals override both local and global bindings in the scope where they're defined. Unlike global or local bindings, though, lexical bindings don't do anything to the evaluation context outside the scope where they're defined.

Officially, all the variables in the samples above are lvalues, but we haven't made them do anything a pure variable couldn't do. None of the code above assigns a new value to an entity that was defined somewhere else. All it does is create new entities with different values, and bind them to the same names:

• All three variables in show_evaluation_context() are free variables, and so are the three in global_scope().
• $GLOBAL and $LEXICAL are free in local_scope(), while $LOCAL is dynamically bound to the entity defined in that scope.
• $GLOBAL and $LOCAL are free in lexical_scope(), while $LEXICAL is lexically bound to the entity defined in that scope.

As soon as the new entities went out of scope, the free variables returned to their previous bindings. Re-binding free variables to new entities lets us create code that acts like the entity values are changing, without ever changing the values of the entities themselves.

Finally, package scope rules partition the global scope into a set of mutually-exclusive regions. The entities defined in one package are, by default, invisible to entities defined in another package. All packages inherit the ability to refer to each other from the global scope, though. Therefore, a name defined in one package can refer to an entity defined in another by using a fully-qualified identifier:

```perl
package Foo;
$VAR1 = "VAR1 in package 'Foo'.";

package main;
$VAR1 = "VAR1 in package 'main'.";
$VAR2 = "VAR2 in package 'main'.";

print $Foo::VAR1, "\n";
print $VAR1, "\n";
print $VAR2, "\n";
```

produces the output:

```
VAR1 in package 'Foo'.
VAR1 in package 'main'.
VAR2 in package 'main'.
```

### variables and referential transparency:

The principle of referential transparency says that the same entity should always produce the same value, no matter where or when it ends up being evaluated. All the code samples shown above observe that principle. The bindings between free variables and entities change, but the entities themselves keep the same value for the life of the program.

The very idea of referential transparency seems crazy to most programmers who are used to thinking of variables as lvalues. The way they see it, the whole point of a variable is to store intermediate values while the program works its way through a calculation. Even so, for most problems, lvalues are just a convenience. In theoretical terms, we can get the same effect by re-binding a free variable to a series of new entities.

In practical terms, lvalues can be one heck of a convenience, and when used with a smattering of discipline, can make code more readable while still observing the general spirit of referential transparency.

For example.. the following snippet of code:

```perl
my $total = 0;
for $i (1..10) {
    $total += $i;
}
print $total, "\n";
```

isn't referentially transparent, because we assign a new value to $total each time we pass through the loop. The referentially transparent way to do the same job would be:

```perl
sub sum_of {
    my ($item, @list) = @_;
    return ((defined $item) ? $item + sum_of (@list) : 0);
}

print sum_of (1..10), "\n";
```

which is mathematically correct, but harder to read.

Functional programming advocates will say that the latter version is just as easy to read once you've learned the idioms of the functional style, but I'm sorry, it just ain't so. Recursive structures are almost always harder to understand than iterative ones, because they make you keep track of several layers of bindings. As the recursion gets deeper and more entities participate in the evaluation context, it gets easier and easier to lose track of something. Recursive structures have a long and well-documented history of leading famous logicians into embarrassing bloopers.

That's not to say lvalues are automatically better, though. It only takes a few undisciplined assignments to turn a seemingly innocent piece of code into a vicious and evil mess. And functional programming types are correct when they say that the problems bad lvalues create are far worse than the problems bad recursion can create.

The problems with lvalues all share a common theme, though: timing.

When you use referentially transparent variables, the order of evaluation doesn't matter. Every entity will return the same value, so as long as you have all your bindings worked out, everything's fine.

Lvalues don't work that way, though. The order of evaluation is very important. You have to keep your evaluations in synch with your mutations (changes to the entity's value), or the whole thing will go blooey. You don't want to rearrange the sequence:

1. make sure X is greater than Y
2. subtract Y from X
3. make sure X is greater than Z
4. subtract Z from X

as:

1. make sure X is greater than Y
2. make sure X is greater than Z
3. subtract Y from X
4. subtract Z from X

if you want to guarantee that X will always be greater than zero.
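The two orderings above can be sketched in code (the function names are invented for illustration). In the safe version, each check runs against the current value of X; in the unsafe version, both checks run against the original value, so the stacked subtractions can drive X below zero:

```perl
use strict;
use warnings;

# Ordering 1: check, mutate, re-check, mutate.
sub safe_order {
    my ($x, $y, $z) = @_;
    $x -= $y if $x > $y;   # check, then mutate
    $x -= $z if $x > $z;   # re-check against the *mutated* value
    return $x;
}

# Ordering 2: both checks first, then both mutations.
sub unsafe_order {
    my ($x, $y, $z) = @_;
    my $ok_y = $x > $y;    # both checks see the original value...
    my $ok_z = $x > $z;
    $x -= $y if $ok_y;     # ...but the mutations accumulate
    $x -= $z if $ok_z;
    return $x;
}

print safe_order(10, 6, 5), "\n";     # stays positive
print unsafe_order(10, 6, 5), "\n";   # goes negative
```

With X=10, Y=6, Z=5, the safe ordering leaves X at 4; the unsafe ordering leaves it at -1.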

The more widely you scatter evaluations and mutations of an lvalue, the more chances you have for something to fall out of synch. If something does fall out of synch, it will usually do so at runtime, in response to a set of operating conditions you may or may not be able to reproduce. The result is a bug which will be very hard to isolate and fix.

By the same reasoning, though, the more tightly you focus your work with an lvalue, the better your chances of staying in synch.

As a general rule of thumb, it's okay to manipulate an lvalue as long as:

• you only do it in the same scope where the lvalue was defined,
• and you never refer to a previous value after a mutation is complete.

If you do that properly, you should be able to encapsulate that chunk of code, and use it as a referentially transparent function. The little summation loop that uses $total, above, obeys both of those rules. I'd personally use that version over the recursive version every time.
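For instance, the summation loop can be wrapped up this way (sum_of_list is a name invented for this sketch). The lvalue $total is created, mutated, and discarded inside one scope, so from the outside the function behaves as a referentially transparent entity: same arguments, same value, every time.

```perl
use strict;
use warnings;

sub sum_of_list {
    my (@list) = @_;
    my $total = 0;            # the lvalue lives only in this scope
    $total += $_ for @list;   # mutations never escape the barrier
    return $total;            # only the final value crosses it
}

print sum_of_list(1 .. 10), "\n";   # prints 55
```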

If you can't encapsulate a chunk of code and create a referentially transparent function, that's a warning sign that should tell you to look for design problems.

# Surrender:

Okay, that's enough. This post only covers about a third of the topics in my original outline, it took twice as long to write as I'd planned, and it's about twice as long as I expected it to be. There's no way I'm going to try dumping any more material on you this week, when we could all be wassailing ourselves into oblivion instead.

Spend some time getting used to the vocabulary, and think about types, scoping, and referential transparency. I'll be back next week to talk about functions.

Re: MOPT-03 - identification, part 1
by djantzen (Priest) on Dec 27, 2002 at 04:49 UTC

Poor ascii art follows, but I found it helpful to map this out a little bit (pls /msg me if the formatting appears munged):

```
               /---- scope ------------------------------
  "outside" the entity/    |  abstraction  |   "inside" the entity/
  encapsulation            |    barrier    |   encapsulation
                           |               |
  identifier/  binding =>  |               |
  name        reference => | ============> |  => value/summary
                           \               \
                            \---- scope ----
```

Scope is created by the abstraction barrier insofar as variables and functions defined within the entity itself are inaccessible without, i.e., a lexical variable within a subroutine is accessible only within that subroutine, not to the caller or to other subroutines called from within that unit of encapsulation.

An identifier or name is bound insofar as it merely points to an entity, whereas a name refers if it points across the abstraction barrier to a specific value or summary of the entity.

To dereference is to pull a value or summary across the abstraction barrier to access it directly (or step across the barrier to it, which I guess is an equivalent metaphor), so my $array_ref = [qw/foo bar/] puts the array itself safely behind the barrier, making $array_ref[0] a violation of encapsulation. Instead, we must say $array_ref->[0] to cross the boundary. Actually, now that I think about this, talking about Perl references muddies the water, doesn't it? Rather, we just mean the use of any variable which results in a taking-hold of the value referenced. So, a better example might just be ordinary string interpolation.

I suppose that means then that a closure is a way of exposing a scoped variable without bringing it across the abstraction barrier.

I'm still confused about what a summary/value is though. Could you provide a concrete example? Also, does object permanence mean "the same entity" in terms of memory location or content? From the discussion, I'm thinking it's the latter. Or maybe that's just the difference between the entity and its summary?

In any case, great write up; IMHO the best so far!

NICE summary.. bravo!

• I suppose that means then that a closure is a way of exposing a scoped variable without bringing it across the abstraction barrier.

Sort of.. A closure contains a variable whose binding was set in a specific evaluation context. The variable retains that binding, even after it leaves the context where the binding was defined. A closure makes an entity accessible outside the scope where it was defined, so I'd say it exposes the entity, not the variable per se.
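A minimal sketch of that idea (make_counter is a name invented for illustration): $count is bound in make_counter()'s evaluation context, and the returned sub keeps that binding alive after make_counter()'s scope has ended, so the entity escapes the scope where it was defined.

```perl
use strict;
use warnings;

sub make_counter {
    my $count = 0;               # bound in this evaluation context
    return sub { return ++$count };   # the entity outlives the scope
}

my $counter = make_counter();
print $counter->(), "\n";   # 1
print $counter->(), "\n";   # 2
```

Each call to make_counter() creates a fresh entity with its own binding, which is why two counters never interfere with each other.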

• I'm still confused about what a summary/value is though. Could you provide a concrete example?

Sure.. the string "hello, world.\n" is a value. We can put it into a script:

"hello, world.\n";

but we can't do anything with it. It just sits there. Even if we use it by passing it to a function:

print "hello, world.\n";

it's still inaccessible to the rest of the script. We can't do anything else with "the value we just printed". If we want to use the same value in more than one place, we need to put that value inside an entity, then give that entity a name:

```perl
sub string {
    return ("hello, world.\n");
}

print string();
```

The name 'string' is an identifier, the function it's bound to is an entity, and the string "hello, world.\n" is a value.

• Also, does object permanence mean "the same entity" in terms of memory location or content?

'Memory location' is closest, but that carries all sorts of baggage specific to a given runtime environment.. it can't be a hardware location, because virtual memory systems shuffle things all over the place; it can't be a logical address, because garbage collectors tend to move things around; yadda yadda yadda.

Object permanence means that we can assume the name "string" will stay bound to the function that returns "hello, world.\n" until we explicitly re-bind it. As a counter-example, imagine a buggy compiler that sometimes confuses identifiers of the same length. The script:

```perl
sub f1 { return ("hello, world.\n"); }
sub pi { return (3.14159); }

for (1..10) {
    print f1();
}
```

could print any combination of "hello, world.\n"s and "3.14159"s. That would be a violation of object permanence, and would make the compiler useless as anything more than a random number generator.

Object permanence is one of those concepts that's so obvious, and so easy to take for granted, that thinking about it takes work. It's like trying to see air: the whole concept seems meaningless.. unless you're an astronomer, or live in L.A.

Okay, now in applying this further to Perl specifically, is a list (in distinction from an array or a hash) an entity? It seems to me it is not, but rather a value like your string "hello world.\n" that must be placed within an entity in order to be usable.

Also, where do anonymous entities, that is, anonymous subroutines and datastructures, fit in here? Are these entities that are not bound to a name, but rather have only references? Or are such things bound still in the theoretical sense, with the difference between named and anonymous thingies meaningful only at the language level?

Re: MOPT-03 - identification, part 1
by rir (Vicar) on Dec 29, 2002 at 00:36 UTC
Much is built upon strings.

What is a string?
