I’ve certainly struggled with all this myself, trying to come up with a clean and consistent way to talk about these things. I applaud the effort, because of how confused people get over all these things. I have a sneaking suspicion that it’s our own fault that folks get confused, although I can’t pin my nebulous feeling down any better than I have just now stated.

One difficulty you’re having is that you are sometimes talking about abstract strings but at others about the properties of how physical memory is laid out. That is probably too much to ask for all in one go. Even if that is your real goal here, I would exercise some caution in the order of presentation.

If you (initially, and perhaps always) limit the discussion to abstract strings alone, then I do believe that a consistent set of terms can be derived, mostly along the lines that you initially pursue.


A Perl string is an ordered sequence (like a list or an array) of zero or more individual scalar values. These scalar values are sometimes called code points when the number is what is emphasized, but more often called characters, albeit somewhat misleadingly. The word ‘character’ has glyphic connotations, or even typewriter-keystroke connotations. It’s certainly a massively common shorthand, and perhaps even a reasonably serviceable one, but it is not without its pitfalls.
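To make the "ordered sequence of numbers" view concrete, here is a minimal sketch that takes a string apart into its code points and puts it back together. The sample string is my own choice for illustration; only the built-ins `split`, `ord`, `chr`, and `length` are used.

```perl
use strict;
use warnings;

# A string of three elements: two below 256 and one well above.
my $string = "a\x{E9}\x{263A}";

# Split into individual elements and take each one's number.
my @code_points = map { ord } split //, $string;

printf "length: %d\n", length $string;
printf "U+%04X\n", $_ for @code_points;

# chr is the inverse of ord, so the numbers rebuild the same string.
my $rebuilt = join "", map { chr } @code_points;
print "round trip ok\n" if $rebuilt eq $string;
```

Nothing here asks how the string is stored; `length` counts elements, not storage units.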

A code point that fits within 8 bits is sometimes called a byte. A code point that fits within 21 bits is sometimes called a Unicode code point, although Unicode itself stops at U+10FFFF. Perl’s code points are not limited to 21 bits, but to the size of your system’s largest unsigned integer, probably either 32 or 64 bits.
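The size ranges just described can be sketched as a small classifier. Note that `classify` is a hypothetical helper of mine, not a Perl built-in, and that I use Unicode's actual ceiling of 0x10FFFF (which fits in 21 bits) as the boundary.

```perl
use strict;
use warnings;

# classify() is a hypothetical helper, not part of Perl.
sub classify {
    my ($cp) = @_;
    return "byte-sized"     if $cp <= 0xFF;        # fits within 8 bits
    return "Unicode"        if $cp <= 0x10FFFF;    # Unicode's highest code point
    return "beyond Unicode";                       # still a legal Perl string element
}

# Perl will happily hold an element above the Unicode range; the
# 'utf8' warnings category is silenced in case this perl complains.
my $big = do { no warnings 'utf8'; chr(0x110000) };

printf "%-8s %s\n", sprintf("0x%X", $_), classify($_)
    for 0x41, 0x263A, ord $big;
```

The element above 0x10FFFF is still a single element of the string, which is the point: Perl's notion of a code point is wider than Unicode's.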

Unicode recognizes only two abstractions: code points and grapheme sequences. Both are determinable programmatically. A code point corresponds to what the programmer is apt to think a character to mean, being an individual scalar element in a string.

However, a grapheme sequence is more apt to correspond to what the end-user thinks of as a character, because it looks like a single glyph. For example, the letter b with an acute accent is a grapheme that the user will think of as just one solitary character, whereas the programmer is apt to think of it as a sequence of two distinct code points. A very common grapheme that requires two code points is the sequence of a carriage return immediately followed by a line feed.
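Both examples can be checked with the `\X` regex escape, which matches one grapheme. This is a sketch that assumes Perl 5.12 or later, where `\X` means an extended grapheme cluster and treats CR followed by LF as a single unit; the variable names are mine.

```perl
use 5.012;
use strict;
use warnings;

# "b" followed by U+0301 COMBINING ACUTE ACCENT.
my $b_acute = "b\x{301}";
my @b_graphemes = $b_acute =~ /(\X)/g;

# Carriage return immediately followed by line feed.
my $crlf = "\r\n";
my @crlf_graphemes = $crlf =~ /(\X)/g;

# A longer string mixing both cases.
my $text = "ab\x{301}c\r\n";
my @text_graphemes = $text =~ /(\X)/g;

printf "b-acute: %d code points, %d grapheme(s)\n",
    length $b_acute, scalar @b_graphemes;
printf "CRLF:    %d code points, %d grapheme(s)\n",
    length $crlf, scalar @crlf_graphemes;
printf "text:    %d code points, %d grapheme(s)\n",
    length $text, scalar @text_graphemes;
```

`length` gives the programmer's count and the `\X` match gives the end-user's count; for the last string they are 6 and 4 respectively.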


As you see, I’ve completely dodged the whole UTF8-flag thing. By giving definitions in terms of code points alone (and graphemes built up of code points) as the fundamental constituent string components, rather than in terms of bytes and characters, I (try to) avoid the thornier issues.

I do not believe the UTF8-flag should be part of the initial presentation, which should have at its heart the simple abstract scalar elements — here, code points, meaning ‘character’ numbers — of which all Perl strings are made. Abstract code points are the indivisible atoms from which our molecular strings are composed.

It is my feeling that you have to present a clear picture of how Perl strings work in the abstract before you can get to messy and complicated matters of serialization schemes in physical memory or on disk. For those who need to talk about serializations, which I stress is comparatively few, then and only then can you further elaborate the dirty parts for this much smaller audience.

But I fear you are going to run into serious trouble if you do so atop an existing notional framework that has pre-existing and conflicting senses for ‘byte’ and ‘character’. Those two terms have too many meanings in other programming languages, so if one stays clear of them, one avoids people thinking they are things they are not.

Best perhaps to leave it at code point, and perhaps hem a bit about grapheme clusters. That’s all that matters in an abstract string; physical memory is a different matter, of course.


In reply to Re: Jargon relating to Perl strings by tchrist
in thread Jargon relating to Perl strings by ikegami
