PerlMonks  

Re: Jargon relating to Perl strings

by tchrist (Pilgrim)
on Jan 17, 2012 at 14:00 UTC (#948322)


in reply to Jargon relating to Perl strings

I’ve certainly struggled with all this myself, trying to come up with a clean and consistent way to talk about these things. I applaud the effort, because of how confused people get over all these things. I have a sneaking suspicion that it’s our own fault that folks get confused, although I can’t pin my nebulous feeling down any better than I have just now stated.

One difficulty you’re having is that you are sometimes talking about abstract strings but at others about the properties of how physical memory is laid out. That is probably too much to ask for all in one go. Even if that is your real goal here, I would exercise some caution in the order of presentation.

If you (initially, and perhaps always) limit the discussion to abstract strings alone, then I do believe that a consistent set of terms can be derived, mostly along the lines that you initially pursue.


A Perl string is an ordered sequence (like a list or an array) of zero or more individual scalar values. These scalar values are sometimes called code points when the number is what is emphasized, but more often called characters, albeit somewhat misleadingly. The word ‘character’ has glyphic connotations, or even typewriter-keystroke connotations. It’s certainly a massively common shorthand, and perhaps even a reasonably serviceable one, but it is not without its pitfalls.
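To make the "sequence of numbers" view concrete, here is a minimal sketch. The sample string and the variable names are mine, chosen only for illustration:

```perl
use strict;
use warnings;

# A six-element string: c, a, f, U+00E9, space, U+263A.
my $str = "caf\x{E9} \x{263A}";

# Pull the string apart into its individual code points (plain integers).
my @code_points = map { ord } split //, $str;

# length() counts code points, not bytes and not glyphs.
my $length = length $str;

# Putting the same numbers back together yields the same string.
my $rebuilt = join '', map { chr } @code_points;

printf "%d elements: %s\n", $length, join ' ', map { sprintf 'U+%04X', $_ } @code_points;
```

Nothing here says anything about how the string is stored; ord, chr and length operate on the abstract elements only.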

A code point that fits within 8 bits is sometimes called a byte. A code point that fits within 21 bits is sometimes called a Unicode code point. Perl’s code points are not limited to 21 bits, but to the size of your system’s largest unsigned integer, probably either 32 or 64 bits.
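A rough sketch of those three ranges follows. The function name and the labels it returns are hypothetical. Note that I test against 0x10FFFF, Unicode's actual upper limit (which needs 21 bits), rather than against every value that fits in 21 bits:

```perl
use strict;
use warnings;

# Hypothetical helper: classify a code point by the ranges discussed above.
sub describe_code_point {
    my ($cp) = @_;
    return 'byte'      if $cp <= 0xFF;        # fits within 8 bits
    return 'unicode'   if $cp <= 0x10FFFF;    # within Unicode's range
    return 'perl-only';                       # legal in Perl, not in Unicode
}

# Perl will happily hold a code point above the Unicode maximum.
my $beyond = chr(0x110000);
my $beyond_length = length $beyond;   # still a single element
my $beyond_ord    = ord $beyond;      # round-trips to the same number
```

Whether such a beyond-Unicode value can be written to a file or exchanged with other software is a separate question; the point is only that the abstract string can contain it.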

Unicode recognizes only two abstractions: code points and grapheme sequences. Both are determinable programmatically. A code point corresponds to what the programmer is apt to think a character to mean, being an individual scalar element in a string.

However, a grapheme sequence is more apt to correspond to what the end-user thinks of as a character, because it looks like a single glyph. For example, the letter b with an acute accent is a grapheme that the user will think of as just one solitary character, whereas the programmer is apt to think of it as a sequence of two distinct code points. A very common grapheme that requires two code points is the sequence of a carriage return immediately followed by a line feed.
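Both counts can be taken programmatically. The sketch below assumes a Perl recent enough (5.12 or later) that \X in a regex matches an extended grapheme cluster; count_graphemes is a name I made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: count grapheme clusters by counting \X matches.
sub count_graphemes {
    my ($s) = @_;
    my $n = () = $s =~ /\X/g;   # list assignment in scalar context gives the match count
    return $n;
}

my $b_acute = "b\x{301}";   # LATIN SMALL LETTER B + COMBINING ACUTE ACCENT
my $crlf    = "\r\n";       # carriage return + line feed

printf "b-acute: %d code points, %d grapheme(s)\n",
    length $b_acute, count_graphemes($b_acute);
printf "CRLF:    %d code points, %d grapheme(s)\n",
    length $crlf, count_graphemes($crlf);
```

In both cases length reports two code points while the \X count reports one grapheme, which is the programmer's view versus the end-user's view.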


As you see, I’ve completely dodged the whole UTF8-flag thing. By giving definitions of string components for just code points alone (and graphemes built up of code points) as the fundamental constituent string components, not about bytes and characters, I (try to) avoid the thornier issues.

I do not believe the UTF8-flag should be part of the initial presentation, which should have at its heart the simple abstract scalar elements — here, code points, meaning ‘character’ numbers — of which all Perl strings are made. Abstract code points are the indivisible atoms from which our molecular strings are composed.

It is my feeling that you have to present a clear picture of how Perl strings work in the abstract before you can get to messy and complicated matters of serialization schemes in physical memory or on disk. For those who need to talk about serializations, which I stress is comparatively few, then and only then can you further elaborate the dirty parts for this much smaller audience.

But I fear you are going to run into serious trouble if you do so atop an existing notional framework that has pre-existing and conflicting senses for ‘byte’ and ‘character’. Those two terms have too many meanings in other programming languages, so if one stays clear of them, one avoids people thinking they are things they are not.

Best perhaps to leave it at code point, and perhaps hem a bit about grapheme clusters. That’s all that matters in an abstract string; physical memory is a different matter, of course.


Re^2: Jargon relating to Perl strings
by ikegami (Pope) on Jan 17, 2012 at 22:41 UTC

    If you (initially, and perhaps always) limit the discussion to abstract strings alone,

    This reference will primarily be used when reporting instances of The Unicode Bug. The need for terms for the internals is inevitable as the respondent is sure to bring them into the discussion.
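    For readers who have not met it, one commonly cited instance of The Unicode Bug can be sketched as follows. This assumes a Perl of 5.14 or later with the unicode_strings feature switched off (done explicitly here); the variable names are illustrative only:

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make the legacy semantics explicit

my $native   = "\xE9";          # one code point, stored as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);       # same code point, different internal storage

# The two strings are equal as abstract strings...
my $equal = $native eq $upgraded ? 1 : 0;

# ...yet a plain regex treats them differently under the legacy rules.
my $native_is_word   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_is_word = $upgraded =~ /\w/ ? 1 : 0;

print "equal=$equal native=$native_is_word upgraded=$upgraded_is_word\n";
```

    Two strings that compare equal yet behave differently is exactly the kind of report where the internal storage has to be named.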

    Your passage would be great in documentation, but it does not serve my needs.

    Unicode recognizes only two abstractions: code points and grapheme sequences.

    I specifically avoided touching Unicode. Unicode has well defined terms, and the conversations in which they would be used differ from the conversations in which those I listed would be used.

Re^2: Jargon relating to Perl strings
by JavaFan (Canon) on Jan 19, 2012 at 09:33 UTC
    As you see, I’ve completely dodged the whole UTF8-flag thing. By giving definitions of string components for just code points alone (and graphemes built up of code points) as the fundamental constituent string components, not about bytes and characters, I (try to) avoid the thornier issues.
    And by avoiding the UTF-8 flag/encoding, you're creating confusion. Consider $x = "\xBB"; utf8::upgrade($x); Now it's not clear to me whether you consider $x to be a byte or not. One can encode 0xBB in 8 bits (and it is encoded in 8 bits in Latin-1), but its UTF-8 encoding uses 16. So if you say "A code point that fits within 8 bits is sometimes called a byte", that's ambiguous. Whether or not the code point 0xBB fits in 8 bits depends on its encoding.
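    The two readings can be put side by side in a short sketch. It uses the core Encode module; the variable names are mine:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $x = "\xBB";
my $y = $x;
utf8::upgrade($y);              # change the internal storage only

# As abstract strings, nothing has changed.
my $same_value = $x eq $y ? 1 : 0;
my $len_x      = length $x;
my $len_y      = length $y;

# The internal flag differs between the two copies.
my $flag_x = utf8::is_utf8($x) ? 1 : 0;
my $flag_y = utf8::is_utf8($y) ? 1 : 0;

# The serialized size depends on the encoding chosen, not on the flag.
my $latin1_bytes = length encode('ISO-8859-1', $x);
my $utf8_bytes   = length encode('UTF-8', $x);

print "same=$same_value flags=$flag_x/$flag_y latin1=$latin1_bytes utf8=$utf8_bytes\n";
```

    Read as a code point, 0xBB is below 256 in both copies; read as storage, it takes one byte in Latin-1 and two in UTF-8. Which of those "fits within 8 bits" refers to is the ambiguity in question.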
