Re: Jargon relating to Perl strings
by tchrist (Pilgrim) on Jan 17, 2012 at 14:00 UTC
I’ve certainly struggled with all this myself, trying to come up with a clean and consistent way to talk about these things. I applaud the effort, because of how confused people get over all these things. I have a sneaking suspicion that it’s our own fault that folks get confused, although I can’t pin my nebulous feeling down any better than I have just now stated.
One difficulty you’re having is that you are sometimes talking about abstract strings but at others about the properties of how physical memory is laid out. That is probably too much to ask for all in one go. Even if that is your real goal here, I would exercise some caution in the order of presentation.
If you (initially, and perhaps always) limit the discussion to abstract strings alone, then I do believe that a consistent set of terms can be derived, mostly along the lines that you initially pursue.
A Perl string is an ordered sequence (like a list or an array) of zero or more individual scalar values. These scalar values are sometimes called code points when the number is what is emphasized, but more often called characters, albeit somewhat misleadingly. The word ‘character’ has glyphic connotations, or even typewriter-keystroke connotations. It’s certainly a massively common shorthand, and perhaps even a reasonably serviceable one, but it is not without its pitfalls.
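To make the "ordered sequence of code points" view concrete, here is a minimal sketch. The variable names are mine, purely for illustration, and the string is written with numeric escapes so that no source-file encoding enters the picture.

```perl
use strict;
use warnings;

# An abstract string of three code points: U+0061, U+00E9, U+263A.
my $str = "a\x{E9}\x{263A}";

# Take the string apart into its code points (plain integers).
my @code_points = map { ord } split //, $str;

printf "length: %d\n", length $str;
printf "U+%04X\n", $_ for @code_points;

# Put it back together: the list of numbers fully determines the string.
my $rebuilt = join "", map { chr } @code_points;
print "round trip ok\n" if $rebuilt eq $str;
```

Nothing here depends on how Perl happens to store the string internally; length, ord, and chr all work on the abstract sequence.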
As you see, I’ve completely dodged the whole UTF8-flag thing. By defining string components in terms of code points alone (and graphemes built up of code points) as the fundamental constituent string components, rather than in terms of bytes and characters, I (try to) avoid the thornier issues.
I do not believe the UTF8-flag should be part of the initial presentation, which should have at its heart the simple abstract scalar elements — here, code points, meaning ‘character’ numbers — of which all Perl strings are made. Abstract code points are the indivisible atoms from which our molecular strings are composed.
It is my feeling that you have to present a clear picture of how Perl strings work in the abstract before you can get to messy and complicated matters of serialization schemes in physical memory or on disk. For those who need to talk about serializations, which I stress is comparatively few, then and only then can you further elaborate the dirty parts for this much smaller audience.
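For the smaller audience that does need serialization, a rough sketch of the separation might look like the following. It assumes the core Encode module; the choice of UTF-8 and UTF-16BE is only for illustration, and the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One abstract string: two code points, U+00E9 and U+263A.
my $str = "\x{E9}\x{263A}";

# Two different serializations of that same abstract string.
my $utf8  = encode("UTF-8",    $str);   # 2 + 3 octets
my $utf16 = encode("UTF-16BE", $str);   # 2 + 2 octets

printf "abstract length: %d\n", length $str;
printf "UTF-8 octets:    %d\n", length $utf8;
printf "UTF-16BE octets: %d\n", length $utf16;

# Decoding either serialization recovers the same abstract string.
my $from_utf8  = decode("UTF-8",    $utf8);
my $from_utf16 = decode("UTF-16BE", $utf16);
print "same string\n" if $from_utf8 eq $str && $from_utf16 eq $str;
```

The abstract string has one length; each serialization has its own. That difference is exactly the part most readers never need to see.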
But I fear you are going to run into serious trouble if you do so atop an existing notional framework that has pre-existing and conflicting senses for ‘byte’ and ‘character’. Those two terms have too many meanings in other programming languages, so if one stays clear of them, one avoids people thinking they are things they are not.
Best perhaps to leave it at code point, and perhaps hem a bit about grapheme clusters. That’s all that matters in an abstract string; physical memory is a different matter, of course.
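As a small illustration of the grapheme-cluster point, here is a sketch that assumes a Perl recent enough (5.12 or later) for \X to match an extended grapheme cluster. The sample string is my own.

```perl
use strict;
use warnings;

# Three code points: "e", U+0301 COMBINING ACUTE ACCENT, then "a".
my $str = "e\x{301}a";

# \X matches one grapheme cluster; /g in list context returns them all.
my @graphemes = $str =~ /\X/g;

printf "code points: %d\n", length $str;
printf "graphemes:   %d\n", scalar @graphemes;

# The first grapheme is itself a two-code-point string.
printf "first grapheme has %d code points\n", length $graphemes[0];
```

A reader sees two "characters" here, while the string holds three code points, which is one reason the word ‘character’ is slippery.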