Jargon relating to Perl strings

These are terms I use to refer to certain elements or properties of Perl strings. I'm posting this as a reference, not to advocate them.

Basics

Character, or string element

An element of a string, as in what substr($s,$i,1) returns. It's a 72-bit value in theory, but it's limited to the size of a UV in practice.

"Character" is the term used to document Perl functions that work on arbitrary strings (substr, index, reverse, chr, ord, etc) and it's the term used in Wikipedia's definition of "string".

String value

The sequence of string elements in a string (irrespective of the choice of storage format used for that string).

String element semantics

Byte: A string element whose value is understood/expected to be in [0, 255] (irrespective of the choice of storage format used for that string).
Code point, or Unicode code point: A string element whose value is understood/expected to be a Unicode code point (irrespective of the choice of storage format used for that string).

String semantics

Bytes, or string of bytes: A string whose value is understood/expected to be a sequence of values in [0, 255] (irrespective of the choice of storage format used for that string).
Text, or decoded text: A string whose value is understood/expected to be a sequence of Unicode code points (irrespective of the choice of storage format used for that string).
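
The distinction between a string of bytes and decoded text can be illustrated with the core Encode module. This is a minimal sketch, assuming UTF-8 is the encoding in play; the helper names text_to_bytes and bytes_to_text are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helpers, for illustration only.

# Text (a sequence of code points) -> bytes, using UTF-8 as the encoding.
sub text_to_bytes { my ($text)  = @_; return encode('UTF-8', $text) }

# Bytes (a sequence of values in [0, 255]) -> text.
sub bytes_to_text { my ($bytes) = @_; return decode('UTF-8', $bytes) }

my $text  = "\x{E9}t\x{E9}";        # "été": three code points
my $bytes = text_to_bytes($text);   # five string elements, each in [0, 255]

printf "%d %d\n", length($text), length($bytes);                          # 3 5
printf "%s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $bytes; # C3 A9 74 C3 A9
```

Both $text and $bytes are ordinary Perl strings. What differs is how their elements are meant to be understood, not how Perl happens to store them.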

String storage formats

UTF8=0 storage format

The format of the PV in a string whose UTF8 flag is clear (0).

It's unambiguous, but it's quite a mouthful. Some use "byte string", but those who do tend to also use it for what I call "string of bytes".

UTF8=1 storage format

The format of the PV in a string whose UTF8 flag is set (1).

It's unambiguous, but it's quite a mouthful. Some use "character string", but that's incorrect because all strings are made of characters by definition.
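
The two storage formats can be inspected from Perl itself. This is a minimal sketch, assuming the core utf8 functions and bytes::length (which the bytes documentation recommends for debugging only); storage_info is a hypothetical helper name.

```perl
use strict;
use warnings;
use bytes ();   # load only: gives access to bytes::length without enabling byte semantics

# Hypothetical helper: reports the storage format, the number of string
# elements, and the number of bytes in the PV.
sub storage_info {
    my ($s) = @_;
    return sprintf 'UTF8=%d, %d string element(s), %d byte(s) in the PV',
        utf8::is_utf8($s) ? 1 : 0, length($s), bytes::length($s);
}

my $s = "\xC9ric";

utf8::downgrade($s);
print storage_info($s), "\n";   # UTF8=0, 4 string element(s), 4 byte(s) in the PV

utf8::upgrade($s);
print storage_info($s), "\n";   # UTF8=1, 4 string element(s), 5 byte(s) in the PV
```

The string value (four elements) is the same in both cases; only the PV changes.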

Other

The Unicode Bug

If changing the internal storage format of a string changes how a piece of code behaves, that code suffers from The Unicode Bug.

For example, the following code suffers from The Unicode Bug.

use feature qw( say );

use Inline C => <<'__EOI__';

   STRLEN mylength(SV* sv) {
      STRLEN len;
      (void)SvPV(sv, len);
      return len;
   }

__EOI__

my $x = "\xE9"; utf8::downgrade($x);     # UTF8=0 storage format
my $y = "\xE9"; utf8::upgrade($y);       # UTF8=1 storage format

say $x eq $y ? "equal" : "not equal";    # equal
say mylength($x);                        # 1
say mylength($y);                        # 2
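
The bug is not specific to XS code. The following pure-Perl sketch shows it in a regex match, assuming Perl 5.14 or later (for the /u modifier) and assuming that neither `use feature 'unicode_strings'` nor a `use v5.12`-or-later declaration is in scope. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers. Both return 1 if the string is a single \w character.

# Default regex rules: the result depends on the storage format.
sub is_word_buggy { my ($s) = @_; return $s =~ /^\w$/  ? 1 : 0 }

# /u forces Unicode rules regardless of the storage format.
sub is_word_fixed { my ($s) = @_; return $s =~ /^\w$/u ? 1 : 0 }

my $x = "\xE9"; utf8::downgrade($x);   # UTF8=0 storage format
my $y = "\xE9"; utf8::upgrade($y);     # UTF8=1 storage format

print $x eq $y ? "equal\n" : "not equal\n";                       # equal
print is_word_buggy($x), " ", is_word_buggy($y), "\n";            # 0 1
print is_word_fixed($x), " ", is_word_fixed($y), "\n";            # 1 1
```

The two strings have the same value, yet is_word_buggy treats them differently, which is exactly the symptom described above.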

Other related terms I've seen used

Byte string: This usually refers to the UTF8=0 storage format, but it could also refer to a string of bytes.
Character string: This usually refers to the UTF8=1 storage format. The term is incorrect since all strings are made of characters by definition.
Byte semantics: This usually refers to how code behaves when given a string in the UTF8=0 storage format, as distinct from how it behaves when given a string in the UTF8=1 storage format. Code that makes such a distinction suffers from The Unicode Bug.
Character semantics: This usually refers to how code behaves when given a string in the UTF8=1 storage format, as distinct from how it behaves when given a string in the UTF8=0 storage format. Code that makes such a distinction suffers from The Unicode Bug.

Update: Changed "regardless of the value of its UTF8 flag" to something clearer in response to JavaFan's and wrog's comments.
Update: By request, added end tags for DT, DD and P elements even though they are optional.


Re: Jargon relating to Perl strings by JavaFan (Canon) on Jan 17, 2012 at 10:53 UTC
I find your definition of byte confusing, and I think most people use it differently. According to your definition, after `$x = "\xEC"; utf8::upgrade($x);`, $x consists of a single byte, even though it requires 16 bits of encoding.

Perhaps the confusion comes from saying that for your definition of a byte, the UTF8 flag doesn't matter, yet it refers to a string element, which is defined in terms of substr, for which the UTF8 flag does matter. I'd say that in my example, $x ends up having 2 bytes, but one character. This is also the distinction `wc` makes.

Of course, you are free to use whatever definition you want -- just do mind that not all people share your definition. Some people prefer not to use the term byte at all, just character and octet.
Re^2: Jargon relating to Perl strings by ikegami (Patriarch) on Jan 17, 2012 at 22:23 UTC
"now $x consists of a single byte."

Yes, that's what I call a byte. So maybe it's my definition, not my term, that's unclear.

"which is defined in terms of substr, for which the UTF8 flag does matter"

Ah, there's the problem. "The UTF8 flag doesn't matter" means different things to us. For a given string, substr will always return the same value regardless of the UTF8 flag, so I say the UTF8 flag doesn't matter to substr.

my $flag_is_0 = "\xC9ric"; utf8::downgrade($flag_is_0);
my $flag_is_1 = "\xC9ric"; utf8::upgrade($flag_is_1);
say substr($flag_is_0, 0, 1) eq substr($flag_is_1, 0, 1) ? 1 : 0;   # 1

I shall endeavor to find something clearer.

"just do mind that not all people share your definition"

Thus this post. If I refer them to this post, they can understand what I say even if their definitions are different.
Re^3: Jargon relating to Perl strings by JavaFan (Canon) on Jan 18, 2012 at 09:23 UTC
Right, we do mean something different by "the UTF-8 flag doesn't matter". I interpret it as meaning that the only difference between the internal representations of the strings is whether the UTF-8 flag is set or not, but you use it to mean "it doesn't matter whether the internal encoding is UTF-8 or not".

use feature qw( say );
use Devel::Peek;

my $x = my $y = "\xC9ric";
utf8::upgrade($x);
utf8::upgrade($y);
utf8::encode($y);
Dump($x);
Dump($y);

# Now $x and $y differ only in the setting of the UTF-8 flag
say substr($x, 0, 1) eq substr($y, 0, 1) ? "equal" : "different";

__END__
SV = PV(0x8cd80cc) at 0x8cea9ec
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x8cef6e8 "\303\211ric"\0 [UTF8 "\x{c9}ric"]
  CUR = 5
  LEN = 9
SV = PV(0x8cd803c) at 0x8ceaa28
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x8cf38b8 "\303\211ric"\0
  CUR = 5
  LEN = 9
different
Re^4: Jargon relating to Perl strings by ikegami (Patriarch) on Jan 19, 2012 at 01:36 UTC
Re: Jargon relating to Perl strings by tchrist (Pilgrim) on Jan 17, 2012 at 14:00 UTC
I've certainly struggled with all this myself, trying to come up with a clean and consistent way to talk about these things. I applaud the effort, because of how confused people get over all these things. I have a sneaking suspicion that it's our own fault that folks get confused, although I can't pin my nebulous feeling down any better than I have just now stated.

One difficulty you're having is that you are sometimes talking about abstract strings but at others about the properties of how physical memory is laid out. That is probably too much to ask for all in one go. Even if that is your real goal here, I would exercise some caution in the order of presentation.

If you (initially, and perhaps always) limit the discussion to abstract strings alone, then I do believe that a consistent set of terms can be derived, mostly along the lines that you initially pursue.

A Perl string is an ordered sequence (like a list or an array) of zero or more individual scalar values. These scalar values are sometimes called code points when the number is what is emphasized, but more often called characters, albeit somewhat misleadingly. The word "character" has glyphic connotations, or even typewriter-keystroke connotations. It's certainly a massively common shorthand, and perhaps even a reasonably serviceable one, but it is not without its pitfalls.

A code point that fits within 8 bits is sometimes called a byte. A code point that fits within 21 bits is sometimes called a Unicode code point. Perl's code points are not limited to 21 bits, but to the size of your system's largest unsigned integer, probably either 32 or 64 bits.

Unicode recognizes only two abstractions: code points and grapheme sequences. Both are determinable programmatically. A code point corresponds to what the programmer is apt to think a character to mean, being an individual scalar element in a string. However, a grapheme sequence is more apt to correspond to what the end-user thinks of as a character, because it looks like a single glyph. For example, the letter b with an acute accent is a grapheme that the user will think of as just one solitary character, whereas the programmer is apt to think of it as a sequence of two distinct code points. A very common grapheme that requires two code points is the sequence of a carriage return immediately followed by a line feed.

As you see, I've completely dodged the whole UTF8-flag thing. By giving definitions of string components for just code points alone (and graphemes built up of code points) as the fundamental constituent string components, not about bytes and characters, I (try to) avoid the thornier issues.

I do not believe the UTF8 flag should be part of the initial presentation, which should have at its heart the simple abstract scalar elements (here, code points, meaning "character" numbers) of which all Perl strings are made. Abstract code points are the indivisible atoms from which our molecular strings are composed.

It is my feeling that you have to present a clear picture of how Perl strings work in the abstract before you can get to messy and complicated matters of serialization schemes in physical memory or on disk. For those who need to talk about serializations, which I stress is comparatively few, then and only then can you further elaborate the dirty parts for this much smaller audience.

But I fear you are going to run into serious trouble if you do so atop an existing notional framework that has pre-existing and conflicting senses for "byte" and "character". Those two terms have too many meanings in other programming languages, so if one stays clear of them, one avoids people thinking they are things they are not.

Best perhaps to leave it at code point, and perhaps hem a bit about grapheme clusters. That's all that matters in an abstract string; physical memory is a different matter, of course.
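
The code point versus grapheme distinction described in this reply can be checked programmatically. This is a minimal sketch, assuming Perl 5.12 or later, where \X matches an extended grapheme cluster; the helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.

# Number of string elements (code points).
sub count_code_points { my ($s) = @_; return length($s) }

# Number of grapheme clusters, counted as the number of \X matches.
sub count_graphemes   { my ($s) = @_; my $n = () = $s =~ /\X/g; return $n }

my $b_acute = "b\x{301}";   # LATIN SMALL LETTER B + COMBINING ACUTE ACCENT
my $crlf    = "\r\n";       # carriage return + line feed

printf "%d %d\n", count_code_points($b_acute), count_graphemes($b_acute);   # 2 1
printf "%d %d\n", count_code_points($crlf),    count_graphemes($crlf);      # 2 1
```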
Re^2: Jargon relating to Perl strings by ikegami (Patriarch) on Jan 17, 2012 at 22:41 UTC
"If you (initially, and perhaps always) limit the discussion to abstract strings alone,"

This reference will primarily be used when reporting instances of The Unicode Bug. The need for terms for the internals is inevitable, as the respondent is sure to bring them into the discussion. Your passage would be great in documentation, but it does not serve my needs.

"Unicode recognizes only two abstractions: code points and grapheme sequences."

I specifically avoided touching Unicode. Unicode has well-defined terms, and the conversations in which they would be used differ from the conversations in which those I listed would be used.
Re^2: Jargon relating to Perl strings by JavaFan (Canon) on Jan 19, 2012 at 09:33 UTC
"As you see, I've completely dodged the whole UTF8-flag thing. By giving definitions of string components for just code points alone (and graphemes built up of code points) as the fundamental constituent string components, not about bytes and characters, I (try to) avoid the thornier issues."

And by avoiding the UTF-8 flag/encoding, you're creating confusion. Take `$x = "\xBB"; utf8::upgrade($x);`. Now it's not clear to me whether you consider `$x` to be a byte or not. One can encode 0xBB in 8 bits (and it is encoded in 8 bits in LATIN-1), but its Unicode encoding uses 16. So, if you say "A code point that fits within 8 bits is sometimes called a byte", that's ambiguous. Whether or not the code point 0xBB fits in 8 bits depends on its encoding.
Re: Jargon relating to Perl strings by wrog (Friar) on Jan 17, 2012 at 21:34 UTC
Under Basics:

"The sequence of string elements in a string. This is not affected by the string's UTF8 flag."

This is not what you want to say (it immediately confused me because my first thought was, "This is wrong because if you have a string with non-ASCII characters in it and change its UTF8 flag, that will change the sequence of elements.").

What I think you meant to say:

"The sequence of string elements in a string, irrespective of any particular choice of memory representation being used for that string"

and only bring up the UTF8 flag later.

Also, IMHO there needs to be distinct terminology for

(1) a grouping of (usually but not always) 8 consecutive bits of physical storage
(2) the abstract array element in the case where all elements are expected to be in the range 0-255, regardless of the actual storage format

the problem being (as noted by others) that most people associate "byte" with (1). Using it for (2) is unlikely to reduce the confusion out there, and not having distinct terminology makes it difficult to talk about storage formats.

One could possibly commandeer "octet" for (2), but realize that "octet" originated in the RFC world, where a word was needed to refer to physical storage in the specific case where bytes are explicitly known to be 8 bits. On the other hand, most of the stuff in the RFC world is indeed trying to abstract away from specific hardware, so one could justify its usage in a more abstract sense that way. And "octet" does, at least, immediately imply 0-255, unlike "byte".

There's also the small matter that it really doesn't make a whole lot of sense to use UTF-8-flag-on format to store something that is composed of octets, even if it is indeed possible to do. Which is why people conflate octet strings with the UTF-8-flag-off format and its 1-1 correspondence between octets and bytes (...and thus why it is indeed important to point out that UTF-8-flag-on octet strings are possible, albeit silly...).

Update: yeah I edited this... sorry.
Re^2: Jargon relating to Perl strings by ikegami (Patriarch) on Jan 18, 2012 at 02:22 UTC
"This is not what you want to say"

Indeed. I discovered this reading JavaFan's post.

"What I think you meant to say"

Perfect. I shall update.

"a grouping of (usually but not always) 8 consecutive bits of physical storage"

UTF8=0 storage format.

"the problem being (as noted by others) that most people associate 'byte' with (1)"

If so, then reading 5 bytes produces a variable number of bytes. That doesn't jibe. If I read 5 bytes from a file, what I get is 5 bytes as far as I'm concerned. I'm open to a better word than "byte" for this, but I haven't come across one.

"One could possibly commandeer 'octet'"

An octet is simply an 8-bit byte. But since that's what byte means in all relevant circumstances anyway, "octet" is no better than "byte". (Perl doesn't currently support systems with byte sizes other than 8.)

"There's also the small matter that it really doesn't make a whole lot of sense to use UTF-8-flag-on format to store something that is composed of octets, even if it is indeed possible to do."

No, but it can happen. Say you have:

# É as C9 in source.
print $bin_fh "AX100ÉX";

And say one day you convert your source files to UTF-8.

# É as C3 89 in source.
use utf8;
print $bin_fh "AX100ÉX";

The code is still fine, yet the string in the latter has UTF8=1.
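
The claim that the code is still fine can be checked with an in-memory file handle standing in for $bin_fh. This is a minimal sketch, assuming a handle with no encoding layer; bytes_written is a hypothetical helper, and the two storage formats are produced with utf8::downgrade and utf8::upgrade rather than by re-encoding a source file.

```perl
use strict;
use warnings;

# Hypothetical helper: returns what print sends to a binary (:raw) handle.
sub bytes_written {
    my ($s) = @_;
    open my $fh, '>:raw', \my $buf or die "Can't open in-memory handle: $!";
    print {$fh} $s;
    close $fh;
    return $buf;
}

my $flag_is_0 = "AX100\xC9X"; utf8::downgrade($flag_is_0);   # UTF8=0 storage format
my $flag_is_1 = "AX100\xC9X"; utf8::upgrade($flag_is_1);     # UTF8=1 storage format

print bytes_written($flag_is_0) eq bytes_written($flag_is_1) ? "same\n" : "different\n";   # same
print length(bytes_written($flag_is_1)), "\n";                                             # 7
```

Since every element of the string is in [0, 255], print writes one byte per element in both cases.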
Re^3: Jargon relating to Perl strings by wrog (Friar) on Jan 20, 2012 at 02:51 UTC
"Perfect. I shall update."

Except you updated the wrong thing. It's the sentence "This is not affected by the string's UTF8 flag," under Basics->"String Value" that's tripping me (and apparently also JavaFan) up and that needs to either go away or be changed.

"a grouping of (usually but not always) 8 consecutive bits of physical storage"
"UTF8=0 storage format."

No, any storage format. In order to talk about storage formats at all, you need a word for the raw underlying bytes, whatever they are and however they're to be interpreted, and redefining "byte" to mean something else makes this really difficult.

You need a different word, and you're probably right that "octet" isn't a great choice either, so I had another thought: how about one of the following to refer to string elements that are constrained to lie in the 0-255 range?

"octetchar"
"octet-character"
"bytecharacter"
"byte-character"
"bytechar"

as opposed to "general character" or "Unicode character" when the full Unicode (or UV) range is possible.

This way you're emphasizing that they're still characters in the sense that everybody agrees on (i.e., they're elements of a string and we're abstracting away from how they're represented). If you then say that a single octetchar can actually be multiple bytes in the UTF8=1 storage format, the meaning is clear.
Re^4: Jargon relating to Perl strings by ikegami (Patriarch) on Jan 20, 2012 at 05:42 UTC
Re: Jargon relating to Perl strings by BrowserUk (Patriarch) on Jan 17, 2012 at 23:54 UTC
Any glossary that starts out by trying to redefine the (industry standard) term "byte" to mean something other than the "minimally addressable unit of memory by the processor", invalidates itself from that point on. Just pissing in the wind.

As for "the Unicode bug", Unicode is the bug. There's no point in rehashing the arguments, but one thing worth saying is that as long as people like you continue to try and rewrite history in this way, in an attempt to excuse the broken Perl implementation of Unicode, the longer it will be before we can get back to a world of sensible, sane, predictable and intuitively usable semantics.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
The start of some sanity?

