|Perl: the Markov chain saw|
MOPT-03 - identification, part 1by mstone (Deacon)
|on Dec 27, 2002 at 02:08 UTC||Need Help??|
Meditations on Programming Theory
Okay, this is where things get complicated.
The concept of identification is only the tip of a tightly-interconnected iceberg. I'm going to start by blasting my way through a bunch of vocabulary so you have some idea of where the major pieces are, then I'll go back and fill in the details.
(and apologies for slipping schedule by 4 days.. I'll try to be back on track with the next one)
Encapsulation means putting something in a container. When we encapsulate something, we create a boundary that separates 'things on the inside' from 'things on the outside'. Then we treat that container, and everything in it, as a single, coherent unit. For want of a better term, we call that unit an entity.
We treat entities differently depending on whether we're looking at them from the inside or from the outside. From the inside, an entity can have any number of pieces. From the outside, an entity is a single, irreducible thing.
The word 'abstraction' means 'taking something away', and in this case, moving from the inside of an entity to its outside takes away information about the entity's internal structure. We abstract an entity's structure by putting those pieces into a container. The boundary between an entity's inside and outside, which defines the limits of what we can see, is called an abstraction barrier.
We can only see the pieces that make up an entity from inside the entity's abstraction barrier. For all intents and purposes, neither the pieces, nor the rules that apply to them, exist outside that barrier. The term scope refers to the region inside the abstraction barrier, where those pieces and rules do exist.
Encapsulation and abstraction are powerful concepts, but we can't actually do much with an entity until we give it a name. The rule that associates a specific name with a specific entity is called a binding, and bindings give us the concept of identification.
Identification gives us two further concepts: specification and object permanence. Specification means we have the power to select a specific entity any time we want it, and object permanence means we can safely assume that we'll get the same entity every time. It's hard to overstate the importance of those two concepts. If we lose either one, we lose the ability to solve any but the most trivial of problems. Most bugs, in fact, boil down to a failure in of one of those two areas. An entity that observes those two principles strictly is said to be referentially transparent, and I'll discuss that idea in more detail later.
Identification is all well and good, but computation still boils down to replacing one string with another. A name is a string, but it's a string we apply from the outside. It tells us that an entity exists, but nothing more. We still need a string that represents whatever's inside the abstraction barrier. We call that string a value.
A value summarizes an entity's internal structure. Summation is a form of abstraction, which means that a summary can pass through an entity's abstraction barrier. The process of finding a string that serves as an abstract summary for whatever lives inside an entity is called evaluation.
At this point we have three items to consider:
We know that the relationship between the entity and its name is called identification, and that the relationship between the entity and its value is called summation. That still leaves the relationship between the name and the value, though. We call that relationship reference. A name identifies its entity, but we say that it refers to the entity's value. The act of replacing a name with its associated value is called dereferencing.
Dereferencing is a type of substitution, but it's different from the kind we saw back in [id://nnnnn|MOPT-02.] We can't write a simple "replace this string with that string" rule to represent it. Dereferencing is the form of substitution that crosses an abstraction barrier, and is much more powerful than the "replace this string with that string" kind.
Now that we have all those terms out on the table, entities tend to fall into one of two general categories: variables and functions. The distinction is an arbitrary one, though. Technically we could say that variables are just a special kind of function, or that functions are just a special kind of variable. Either version works, but it's easier to describe certain ideas if we think of variables and functions as different kinds of entities.
I should start by saying that this kind of variable has very little to do with the '$x' structures that Perl programmers call 'variables'. When programmers think about variables, they tend to think about addressable blocks of mutable storage. Officially, those are called lvalues because they can appear on the left-hand side of an assignment expression. For the purposes of this discussion, lvalues are not variables. Lvalues can do things that variables can't (i.e: violate the principle of referential transparency), which gives rise to unsolvable problems.
As Larry Wall once said, "Perl is a mess because the problem
space is a mess," and it's easy enough to use Perl in a
referentially transparent way.
Variables represent the concept of picking a single value out of a set of possible values. The set of all values a variable can possibly have is called the variable's type, and I'll talk about types in a minute.
We use the term raw variable to describe the situation where we know a variable's type, but not its value. A raw variable is an abstraction of its type, having all the characteristics common to every item of that type, but none of the characteristics that make one item different from another. If you happen to be into set theory, a raw variable is an abstraction which represents the intersection of all characteristics belonging to every member of its type.
Raw variables are useful, because they let us define 'generic' operations that appply to every member of a type. We'll get a lot of mileage out of raw variables when start talking about functions.
variables and type:
As I mentioned before, a type is the set of all values a variable can possibly have. If we take that idea all the way back to the level of human assumptions, a type tells us everything we can safely assume about a variable. A type is an information space, and we can represent that space with an appropriately defined language.
Now before anyone tries to make this point for me: Perl does happen to use a very relaxed type system. The interpreter does a lot of work behind the scenes to make sure a given value supports as many reasonable-to-a-human assumptions as possible. That doesn't make Perl a 'weak and sloppy' language, as advocates of strictly-typed languages sometimes imply, but it doesn't mean Perl programmers can ignore type systems, either.
Perl is a language for big kids, the way unix is an operating system for grownups. Perl acknowledges the fact that defining a type system is just as hard as writing the program itself, and that "one size fits all" type systems tend to fit everyone equally badly. Perl gives programmers both the freedom and the responsibility to define their own type systems wherever those systems happen to be necessary.
From a theoretical standpoint, the most interesting part of that whole rant was the statement, "defining a type system is just as hard as writing the program itself." That's true, but it takes a bit of work to show how and why.
As I mentioned above, a type is an information space, and we can represent that space with a language. We use formal systems to generate languages, so there must be some kind of connection between types and formal systems.
That connection is called a decision problem.
A decision problem is a question you can answer with a simple 'yes' or 'no'. The question "is this value defined for this type?" is a decision problem. Since we represent sets with languages, and generate languages using formal systems, deciding whether a value is defined for a given type boils down to the decision problem, "is this string a theorem?"
As we saw in MOPT-02, the process of deriving theorems is called computation.
Now, as it turns out, it's much harder to prove that a string is a theorem than it is to derive a theorem from an axiom. In fact, it's half-way impossible. All we can do is generate theorems, and compare them to the one we want to prove. Eventually (by which I mean "some time between now and forever"), we'll generate every theorem that can be computed, and if one of them matches our candidate, we can stop with an answer of 'yes'. We don't have any way to stop with an answer of 'no', though.
Theorem proving belongs to a set of problems that are called partially decidable, or recursively enumerable. We'll run into those terms again when we start looking at the kinds of problems we can solve with various tools. For now, it's enough to know that the only way to decide whether a value is defined for some type is to build a set of theorems up from their axioms.
We generally do that using raw variables. We start with a bunch of axioms of the form:
which usually gets collapsed to the form:
and then we start combining those rules to establish the limits of what's possible. We have to know what the operations are, though, and we have to break them down in sufficient detail to prove that X(Y) really does generate Z. By the time we've done that, we've built a microscopically-detailed specification for the actual program.
Type systems are actually quite useful as design tools. They give us a way to keep track of our assumptions without getting bogged down in the details of implementation. They also give us line-by-line suggestions about testing and error handling procedures.
variables and scope:
So.. we know that types can be cool, and that a variable can adopt any value appropriate to its type. The trick of working with values is keeping track of when and where they're defined.
Both the entity that acts as a variable, and the name that acts as that variable's identifier, can be defined within the scope of some other entity (called a parent) (1). If they are, they only have meaning within that parent's abstraction barrier, which means that they're subject to scope limitations.
(1) - A variable's value, of course, will always be defined in the same scope as its name. We have to replace the one with the other, after all.
An identifier that lives in a different scope than the entity it's bound to is called a free variable. Free variables aren't bound to any entity at all by the terms of their definition, so it's up to the compiler and runtime environment to work out a binding. The rules the compiler and runtime environment use to establish that binding are called scope rules.
Perl has four different sets of scope rules:
which cover most of the options available in any language.
Under global scope rules, every occurrence of the same name is bound to the same entity. That entity is defined in the outermost parent, regardless of where the variable is first defined. Global scoping is also called static scoping, because the binding between the identifier and the entity never changes. The compiler can create the entity at compile time, and bind every instance of the identifier it sees to that entity as it works its way through the code.
The following code, which demonstrates global binding:
produces the following ouput:
which is pretty much what we'd expect.
Perl's local scope rules create a new entity every time the local() keyword is executed. Local scoping is also known as dynamic scoping, because a single identifier can be bound to several different entities during the course of a program's lifetime.
The value of a dynamically-scoped variable depends on what's called the variable's evaluation context. A variable inherits its evaluation context from the entity that uses it, not from the entity that defines it. A dynamically scoped variable uses the most recent binding it can find for that identifier. Technically, all of Perl's global variables are dynamically scoped, but only the local() keyword can create a new entity and establish a new binding.
If we use the code above with a function that creates a dynamic (i.e.: local) variable:
we get this output:
which shows two different values for $LOCAL when we call it from two different evaluation contexts.
Lexical scope rules create entities that are only visible in the scope where they're defined. Lexically-scoped entities replace globally (or locally) scoped entities that are defined for the parent, but are not visible to any entity created in, of called from, the same scope.
A function that sets a lexical variable:
gives us this:
which shows that lexicals override both local and global bindings in the scope where they're defined. Unlike global or local bindings, though, lexical bindings don't do anything to the evaluation context outside the scope where they're defined.
Officially, all the variables in the samples above are lvalues, but we haven't made them do anything a pure variable couldn't do. None of the code above assigns a new value to an entity that was defined somewhere else. All it does is create new entities with different values, and bind them to the same names:
As soon as the new entities went out of scope, the free variables returned to their previous bindings. Re-binding free variables to new entities lets us create code that acts like the entity values are changing, without ever changing the values of the entities themselves.
Finally, package scope rules partition the global scope into a set of mutually-exclusive regions. The entities defined in one package are, by default, invisible to entities defined in another package. All packages inherit the ability to refer to each other from the global scope, though. Therefore, a name defined in one package can refer to an entity defined in another by using a fully-qualified identifier:
produces the output:
variables and referential transparency:
The principle of referential transparency says that the same entity should always produce the same value, no matter where or when it ends up being evaluated. All the code samples shown above observe that principle. The bindings between free variables and entities change, but the entities themselves keep the same value for the life of the program.
The very idea of referential transparency seems crazy to most programmers who are used to thinking of variables as lvalues. The way they see it, the whole point of a variable is to store intermediate values while the program works its way through a calculation. Even so, for most problems, lvalues are just a convenience. In theoretical terms, we can get the same effect by re-binding a free variable to a series of new entities.
In practical terms, lvalues can be one heck of a convenience, and when used with a smattering of discipline, can make code more readable while still observing the general spirit of referential transparency.
For example.. the following snippet of code:
isn't referentially transparent, because we assign a new value to $total each time we pass through the loop. The referentially transparent way to do the same job would be:
which is mathematically correct, but harder to read.
Functional programming advocates will say that the latter version is just as easy to read once you've learned the idioms of the functional style, but I'm sorry, it just ain't so. Recursive structures are almost always harder to understand than iterative ones because they make you keep track of several layers of bindings. As the recursion gets deeper and more entities participate in the evaluation context, it gets easier and easier to lose track of something. Recursive structures have a long and well-documented history of famous logicians making embarrassing bloopers.
That's not to say lvalues are automatically better, though. It only takes a few undisciplined assignments to turn a seemingly innocent piece of code into a vicious and evil mess. And functional programming types are correct when they say that the problems bad lvalues create are far worse than the problems bad recursion can create.
The problems with lvalues all share a common theme, though: timing.
When you use referentially transparent variables, the order of evaluation doesn't matter. Every entity will return the same value, so as long as you have all your bindings worked out, everything's fine.
Lvalues don't work that way, though. The order of evaluation is very important. You have to keep your evaluations in synch with your mutations (changes to the entity's value), or the whole thing will go blooey. You don't want to want to rearrange the sequence:
if you want to guarantee that X will always be greater than zero.
The more widely you scatter evaluations and mutations of an lvalue, the more chances you have for something to fall out of synch. If something does fall out of synch, it will usually do so at runtime, in response to a set of operating conditions you may or may not be able to reproduce. The result is a bug which will be very hard to isolate and fix.
By the same reasoning, though, the more tightly you focus your work with an lvalue, the better your chances of staying in synch.
As a general rule of thumb, it's okay to manipulate an lvalue as long as:
If you do that properly, you should be able to encapsulate that chunk of code, and use it as a referentially transparent function. The little summation loop that uses $total, above, obeys both of those rules. I'd personally use that version over the recursive version every time.
If you can't encapsulate a chunk of code and create a referentially transparent function, that's a warning sign that should tell you to look for design problems.
Okay, that's enough. This post only covers about a third of the topics in my original outline, it took twice as long to write as I'd planned, and it's about twice as long as I expected it to be. There's no way I'm going to try dumping any more material on you this week, when we could all be wassailing ourselves into oblivion instead.
Spend some time getting used to the vocabulary, and think about types, scoping, and referential transparency. I'll be back next week to talk about functions.